Lost pages, for the purposes of this project, are pages that return a 404. A 404 response code is an error indicating that the client was able to communicate with the server, but the server could not find the requested resource. There are many reasons why a page or an entire web site may disappear: the pages may survive only in the caches of search engines or in web archives, or they may simply have moved from one URI to another. In the context of this experiment, a title is the content of the TITLE element within a web page. A web page may contain only one title, and the title may not contain anchors, highlighting, or paragraph marks.
The most desirable collection for this experiment would be the set of all URIs. Regrettably, using the entire web as our test set is unrealistic, and capturing a representative sample of the entire web is no small task. We therefore selected a random collection of web pages from dmoz.org. Dmoz.org, the Open Directory Project, is the largest and most comprehensive human-edited directory of the web, constructed and maintained by a vast, global community of volunteer editors. From this sample we obtained 7,314 web sites as our initial set; after filtering out non-English pages, we were left with 7,157.
Each of these pages' titles was fed through Yahoo's search API. Any result within the first ten was considered found; any result after the first ten was considered not found. Intuitively, most results beyond the first page of a result set are neither viewed nor visited.
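The found/not-found classification can be sketched as follows. This is a minimal illustration, not the experiment's actual harness: the Yahoo API used in the report has since been retired, so a pluggable `search_fn` stands in for it, and `fake_search` is a stub returning made-up URIs.

```python
def classify_title(title, target_uri, search_fn, cutoff=10):
    """Label a page 'found' if its URI appears in the top `cutoff`
    results returned when its title is used as the query."""
    results = search_fn(title)  # list of result URIs, best first
    return target_uri in results[:cutoff]

# Stub standing in for a real search API call:
def fake_search(query):
    return (["http://example.org/a", "http://example.org/lost"] +
            ["http://example.org/%d" % i for i in range(20)])

print(classify_title("Lost Page Title", "http://example.org/lost", fake_search))
print(classify_title("Lost Page Title", "http://example.org/15", fake_search))
```

The second call returns False because the target URI is present in the result list but ranks below the ten-result cutoff.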
The data set consisted of page titles with a mean length of 44.7 characters and a standard deviation of 27.4, giving a range of 17 to 72 characters. If titles are instead measured in words, with anything separated by white space counted as a term, the mean is 6.7 terms with a standard deviation of 3.3, giving a range of 3 to 10 terms. Lastly, the data set broke down into 66% found and 34% not found.
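These per-title statistics are straightforward to compute. A sketch with the standard library, using a few invented sample titles (the real corpus had 7,157 entries):

```python
import statistics

# Hypothetical sample titles, for illustration only.
titles = [
    "Welcome to the Open Directory Project",
    "Old Dominion University Computer Science",
    "Home",
    "Annual Report 2009 of the Example Society",
]

char_counts = [len(t) for t in titles]          # title length in characters
word_counts = [len(t.split()) for t in titles]  # terms split on white space

print(statistics.mean(char_counts), statistics.stdev(char_counts))
print(statistics.mean(word_counts), statistics.stdev(word_counts))
```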
The goal of the experiment is to discern an element, or series of components, within a title that would allow us to predict the status of a web page. If we simply declared every title to be a good title, we would be correct 66% of the time. Our baseline, the point of reference for determining whether a test merits consideration, is therefore a test that can discern good titles from bad titles more than 66% of the time.
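This baseline is just the majority-class accuracy. A minimal sketch, using illustrative counts that mirror the 66/34 split:

```python
# Majority-class baseline: always predict "found".
# With 66% of titles labelled found, the trivial classifier that
# always guesses "found" is right 66% of the time; a useful test
# must beat this accuracy. Counts below are illustrative.
labels = ["found"] * 66 + ["not found"] * 34

predictions = ["found"] * len(labels)  # the trivial classifier
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.66
```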
Our tests were broken down into four types:
Grammar based: a title has a sentence-like structure, so it seemed reasonable to use the number of nouns, adverbs, adjectives, etc., as a determinant for which titles would produce good or bad results.
Search based: queries using Boolean OR, Boolean AND, quoted phrases, or some combination of these.
Stop words: stop words are words considered too ubiquitous to be useful and are filtered out prior to submission to a search engine. This test uses the percentage of a title made up of words from predefined stop word sets.
Stop titles: examination of the sample showed that certain titles always produce a not-found classification. This test measures the percentage of a title that matches these stop titles.
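As an illustration of how such a percentage test works, here is a sketch of the stop-word variant: score a title by the fraction of its terms that are stop words, then threshold. The stop list and threshold below are invented for the example, not taken from the report.

```python
# Illustrative stop list; real search engines use larger sets.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "or", "for"}

def stop_word_fraction(title):
    """Fraction of a title's terms that are stop words."""
    terms = title.lower().split()
    return sum(t in STOP_WORDS for t in terms) / len(terms)

def predict_found(title, threshold=0.5):
    # Titles dominated by stop words are predicted "not found".
    return stop_word_fraction(title) < threshold

print(stop_word_fraction("The History of the Internet"))  # 0.6
print(predict_found("The History of the Internet"))       # False
```

The stop-title test is analogous, but matches against whole titles known to fail rather than against individual ubiquitous words.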
Regrettably, the grammar based, search based, and stop word tests did not provide any better results than assuming all titles would find the resource. On the other hand, the use of stop titles increased our ability to identify titles that would produce a not-found status by 6%.
The conclusion we may draw from our tests is that the usefulness of a title's features for discerning good titles is limited. A more useful finding is that excluding stop titles increases the accuracy of discerning a good title from a bad one. A larger data set may lead to more stop titles, or may yet prove the usefulness of the currently unproductive tests.
The full report can be found at:
Jeffery L. Shipman, Martin Klein, Michael L. Nelson, Using Web Page Titles to Rediscover Lost Web Pages, Technical Report arXiv:1002.2439, February 2010.