I am proud to report that two of my papers have been accepted at upcoming conferences: ACM Hypertext (HT) and the ACM/IEEE Joint Conference on Digital Libraries (JCDL).
The paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure" will be published at JCDL. It is co-authored with my advisor, Dr. Michael L. Nelson. As part of my ongoing dissertation work, we are investigating methods to rediscover missing web pages with the help of the web infrastructure (search engines, their caches, the Internet Archive, etc.) in real time, that is, while the user is browsing the web. This paper evaluates the performance of four of these methods: the title of the web page; its lexical signature (LS), representing the most salient terms of its content; its tags obtained from delicious.com; and its neighborhood lexical signature (NHLS), an LS based on the content of pages that link to the centroid page.
We generate a corpus of web pages by randomly sampling from the Open Directory Project (DMOZ). We apply all four methods, use their output to query the major search engines (Google, Yahoo! and MSN Live; we conducted the experiments before Bing was introduced) and evaluate the result sets in terms of the ranking of the originating page.
Our results confirm the good performance of LSs (detailed in our ECDL 2008 paper) but also show, and this is the first point to take away from this paper, that titles perform equally well. We further show that there is value in combining methods. Hence the second takeaway: we recommend applying the title-based method first and the LS method second (for pages not returned by the title query). This combination improves retrieval performance for the rediscovery of missing web pages compared to any single method and is the most resource-efficient combination.
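The recommended order (title first, LS only as a fallback) can be sketched as follows. This is a minimal illustration, not the code used in the paper: the names rediscover, rank_of, search and make_ls are hypothetical, search stands in for any search engine client that returns a ranked list of URLs, and the top_n cutoff of 10 is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def rank_of(target_url: str, results: List[str]) -> Optional[int]:
    """Return the 1-based rank of target_url in results, or None if absent."""
    for rank, url in enumerate(results, start=1):
        if url == target_url:
            return rank
    return None


def rediscover(
    target_url: str,
    title: str,
    make_ls: Callable[[], str],          # hypothetical: builds the LS query on demand
    search: Callable[[str], List[str]],  # hypothetical: query -> ranked list of URLs
    top_n: int = 10,                     # assumed cutoff, not taken from the paper
) -> Optional[Tuple[str, int]]:
    """Query with the title first; fall back to the LS only if that fails."""
    rank = rank_of(target_url, search(title)[:top_n])
    if rank is not None:
        return ("title", rank)
    # The LS is expensive to generate, so it is computed only when needed.
    rank = rank_of(target_url, search(make_ls())[:top_n])
    if rank is not None:
        return ("ls", rank)
    return None
```

Passing the LS as a callable rather than a precomputed string keeps the cheap title query from paying for LS generation when it already succeeds.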
I authored the paper "Is This a Good Title" together with Jeff Shipman, a Master's student in the Department of Computer Science at ODU, and Dr. Michael L. Nelson. It will be published at HT 2010.
Here we focus on the retrieval performance of web page titles and address the following questions: How much do titles change over time (compared to the pages' content)? At what age do they become useless for our retrieval purpose? Intuitively, not all titles are "good" for search, so can we identify poorly performing titles a priori?
We again generated a large set of pages by sampling DMOZ, extracted their titles, and downloaded their content as well as all available copies of these pages provided by the Internet Archive to investigate the change over time.
The figure to the left shows that our titles are much more stable over time than the pages' content and hence are a great alternative to LSs, which are content-based and expensive to generate, for retrieving web pages. The change of titles over time is measured using the Levenshtein distance (x-axis) and the content change using shingles (y-axis). The majority of dots, each representing a URI, are plotted in the top left corner, meaning minor title changes alongside major content changes.
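For readers unfamiliar with the two measures, here is a minimal sketch of both. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the shingles are built from three consecutive words, and content change is expressed as one minus the Jaccard overlap of the two shingle sets. The post does not state the shingle size or normalization actually used.

```python
from typing import Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams (size is an assumed parameter)."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_change(old: str, new: str, size: int = 3) -> float:
    """Content change as 1 - Jaccard similarity of the two shingle sets."""
    a, b = shingles(old, size), shingles(new, size)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A page in the top left corner of the figure would correspond to a small levenshtein value between two title versions and a shingle_change close to 1 between the matching content versions.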
By analyzing our corpus of titles we identify what we call "stop titles". These are terms that would not qualify as common stop words, and hence would not be filtered by search engines, but that do not contribute to describing a page's "aboutness". Examples of stop titles are "home", "homepage" and "index". We find that if the ratio of these terms in a given title is above 75%, the likelihood of the title performing poorly is very high. The same test can be applied to the number of characters rather than the number of terms, so that shorter or longer titles are not treated unfairly. We can apply these simple tests in real time and hence flag a "bad" title a priori. In such cases we will skip querying with the title and invest in generating the page's LS right away.
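A minimal sketch of the stop title test might look like this. The three stop titles and the 75% threshold come from the text above; everything else is an assumption made for illustration: the function names, the tokenization into lowercase alphanumeric runs, and the choice to treat an empty title as bad. The full stop title list from the paper is not reproduced here.

```python
import re
from typing import List, Tuple

# Only the three examples named in the post; the paper's full list is longer.
STOP_TITLES = {"home", "homepage", "index"}


def tokenize(title: str) -> List[str]:
    """Split a title into lowercase alphanumeric terms (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", title.lower())


def stop_title_ratios(title: str) -> Tuple[float, float]:
    """Return (term ratio, character ratio) of stop titles within the title."""
    terms = tokenize(title)
    if not terms:
        return (1.0, 1.0)  # assumption: an empty title carries no information
    stop_terms = [t for t in terms if t in STOP_TITLES]
    term_ratio = len(stop_terms) / len(terms)
    char_ratio = sum(len(t) for t in stop_terms) / sum(len(t) for t in terms)
    return (term_ratio, char_ratio)


def is_bad_title(title: str, threshold: float = 0.75) -> bool:
    """Flag a title whose share of stop title terms exceeds the threshold."""
    term_ratio, _ = stop_title_ratios(title)
    return term_ratio > threshold
```

With this sketch, is_bad_title("Home") returns True, while is_bad_title("Old Dominion University Home") returns False because only one of its four terms is a stop title.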
My fellow student Chuck Cartledge also has a paper accepted at HT 2010.
I will travel to both conferences in June and publish a report once I get back.