The final project for my master's degree focused on the problem of “missing” web pages, those URIs that return an error result when retrieved. When a web page is no longer available at a given URI, it may be available at a new URI, and this research proposes and demonstrates a new method for finding the new URI.
Prior research has proposed using the lexical signature of a page as a search query to find the same or similar content at a new URI. A lexical signature (LS) is a few words that are used in that page much more often than they are used in other pages on the Web, and so are thought to describe what the page is about. That LS is then used as a search query which will hopefully find the target page in its results.
Previously proposed methods for using an LS to find a new URI required either that the page be analyzed before being lost (ref: P&W) or that cached or archived versions of the page be available for analysis. If the page had not been analyzed beforehand and no cached copies existed, these methods could not recover the missing page.
We propose a new method for calculating the LS of a page using only its backlinks, that is, the pages that link to the missing page. The target page together with its backlinks makes up the link neighborhood, a trivial example of which is shown above.
Backlinks are retrieved from a search engine (we used Yahoo!). We process each backlink page to find the text it uses to link to the target page. This text is known as ‘anchor text’. We take all of the terms from the anchor texts of all of the backlinks, and calculate the Term Frequency-Inverse Document Frequency (TF-IDF) value of each. That is, we find the terms that are commonly used to link to the target page and are less common on the rest of the Web. We take the terms with the highest TF-IDF values to be the lexical signature. We then submit the LS as a query to the search engine, and if the method is successful, we find our target URI at the top of the results. An example of this process is shown below.
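The TF-IDF ranking step can be sketched in Python roughly as follows. This is a minimal illustration, not the code used in the study: the function name `lexical_signature`, the `doc_freq` dictionary of document frequencies, and the `corpus_size` value are all assumptions made for the example, and the plain `tf * log(N / df)` weighting is one common TF-IDF variant rather than necessarily the one in the report.

```python
import math
import re
from collections import Counter


def lexical_signature(anchor_texts, doc_freq, corpus_size, num_terms=4):
    """Return the num_terms anchor-text terms with the highest TF-IDF score.

    anchor_texts: list of anchor-text strings collected from the backlinks.
    doc_freq:     hypothetical dict mapping a term to the number of documents
                  in the reference corpus that contain it.
    corpus_size:  hypothetical total number of documents in that corpus.
    """
    # Term frequency: count every term across all anchor texts.
    tf = Counter()
    for text in anchor_texts:
        tf.update(re.findall(r"[a-z0-9]+", text.lower()))

    scores = {}
    for term, count in tf.items():
        # Assumption: a term missing from doc_freq is treated as very rare.
        df = doc_freq.get(term, 1)
        scores[term] = count * math.log(corpus_size / df)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:num_terms]


# Made-up example data, for illustration only.
anchors = [
    "the digital library",
    "digital library research",
    "click here for the library",
]
frequencies = {
    "the": 900, "here": 800, "click": 700, "for": 850,
    "digital": 50, "library": 40, "research": 100,
}
signature = lexical_signature(anchors, frequencies, corpus_size=1000)
query = " ".join(signature)  # string to submit to the search engine
```

Common words such as "the" occur often in the anchors but also in most of the corpus, so their scores stay low and they fall out of the signature.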
We consider several variations of this method in order to draw conclusions about the most effective parameters. First, we find that including second-level backlinks is not helpful; they add too much noise to the LS and decrease its effectiveness. This confirms the intuition that second-level backlinks, those pages that don’t link directly to a target page but instead link to the target page’s backlinks, are not as closely related to the target page, and will therefore provide less relevant terms.
Second, we show that only the anchor text provides useful terms to the LS. We also experimented with using the anchor text plus five words on either side of the link (anchor ±5 words), anchor ±10 words, and all words on the page. We found that using only the anchor text provides the best-performing LS, and every step further away from the anchor text led to worse performance.
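To make the anchor-text variations concrete, here is a rough sketch of how anchor text plus a window of surrounding words might be pulled from a backlink page using Python's standard `html.parser` module. The names `LinkContextParser` and `link_terms` are hypothetical, and the sketch assumes the link's `href` matches the target URI exactly, with no URL normalization; a real crawler would need more care.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect page words and record where links to target_url occur."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.words = []     # every word of page text, in document order
        self.spans = []     # (start, end) word indexes of each matching anchor
        self._start = None  # start index of the matching anchor now open

    def handle_starttag(self, tag, attrs):
        # Assumption: exact string match on href, no normalization.
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._start is not None:
            self.spans.append((self._start, len(self.words)))
            self._start = None

    def handle_data(self, data):
        # Simplification: script and style text is not filtered out.
        self.words.extend(data.split())


def link_terms(html, target_url, window=0):
    """Return anchor-text words plus `window` words on either side."""
    parser = LinkContextParser(target_url)
    parser.feed(html)
    parser.close()
    terms = []
    for start, end in parser.spans:
        terms.extend(parser.words[max(0, start - window):end + window])
    return terms


# Made-up backlink page, for illustration only.
page = (
    '<p>Read the full report at '
    '<a href="http://example.com/x">Old Dominion digital library</a> '
    'for more details today</p>'
)
anchor_only = link_terms(page, "http://example.com/x")
anchor_plus_two = link_terms(page, "http://example.com/x", window=2)
```

With `window=0` only the anchor text is returned; larger windows pull in neighboring words such as "report" or "for", which is the kind of extra material the experiments found to hurt the LS.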
Third, we show that a passable LS can be created using only the first ten backlinks retrieved from the search engine. Using the first hundred or thousand backlinks yields a slightly better-performing LS in some cases, but we argue that, given the cost of retrieving and processing 90 or 990 extra pages, the increased performance isn’t worth it.
Lastly, we recommend using a 4-term LS. This is noteworthy in that most other LS research has concluded that a 5- or 7-term LS is ideal, depending on the desired performance characteristics. We posit that fewer terms are preferable because we are drawing the terms from pages other than the page for which we are searching. We therefore run the risk of including terms that do not appear in the target page, which would likely exclude the target page from the results of a search for the LS. By using fewer terms, we reduce the risk of including a word from a backlink page that doesn’t appear in the target page.
Using our recommended method, the target URI appears as the first result for the LS in 56% of our test cases.
The full report can be found at:
Jeb Ware, Martin Klein, Michael L. Nelson, An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages, Technical Report arXiv:1102.0930, February 2011.
-- Jeb Ware