Saturday, November 10, 2012

2012-11-10: Site Transitions, Cool URIs, URI Slugs, Topsy

Recently I was emailing a friend and wanted to update her about the recent buzz we have enjoyed with Hany SalahEldeen's TPDL 2012 paper about the loss rate of resources shared over Twitter.  I remembered that an article in the MIT Technology Review from the Physics arXiv blog started the whole wave of popular press (e.g., MIT Technology Review, BBC, The Atlantic, Spiegel).  To help convey the amount of social media sharing of these stories, I was sending links to the sites using the social media search engine Topsy.  I only recently discovered Topsy, but it has quickly become one of my favorite sites.  It does many things, but the part I enjoy most is the ability to prepend "http://topsy.com/" to a URI to discover how many times that URI has been shared and who is sharing it.  For example:

http://www.bbc.com/future/story/20120927-the-decaying-web

becomes:

http://topsy.com/http://www.bbc.com/future/story/20120927-the-decaying-web

and you can see all the tweets that have linked to the bbc.com URI. 
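
The same conversion can be scripted.  This is only a minimal sketch of the string manipulation described above; the function name topsy_uri is my own invention for illustration and is not part of any Topsy API:

```python
# Prefix taken from the example above.
TOPSY_PREFIX = "http://topsy.com/"


def topsy_uri(uri: str) -> str:
    """Return the Topsy lookup URI for `uri` by prepending the Topsy prefix.

    The original URI is appended verbatim (no escaping), exactly as in
    the hand-built example above.
    """
    return TOPSY_PREFIX + uri


print(topsy_uri("http://www.bbc.com/future/story/20120927-the-decaying-web"))
```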

While composing my email I recalled the Technology Review article was one of the first (September 19, 2012) and most popular, so I did a Google search for the article and converted the resulting URI from:

http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

to:

http://topsy.com/http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

I was surprised when I saw Topsy reported 0 posts about the MIT TR story, because I recalled it being quite large.  I thought maybe it was a transient error and didn't think too much about it until later that night when I was on my home computer where I had bookmarked the MIT TR Topsy URI and it said "900 posts".  Then I looked carefully: the URI I had bookmarked now issues a 301 redirection to another URI:

% curl -I http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352561072"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 15:24:32 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 15:24:32 GMT
X-Varnish: 1779081554
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


A little poking around revealed that technologyreview.com reorganized and rebranded their site on October 24, 2012, and Google had already replaced the prior URI for the article with the new URI.  Their site uses Drupal, and it appears their old site did as well, but the URIs have changed.  The base URIs (e.g., http://www.technologyreview.com/view/429274/) have stayed the same (and are thus almost "cool"), but the slug has lengthened from 8 terms ("history as recorded on twitter is vanishing from") to the full title ("history as recorded on twitter is vanishing from the web say computer scientists").  Slugs are a nice way to make the URI more human readable, and can be useful in determining what the URI was "about" if (or when) it becomes 404 (see also Martin Klein's dissertation on lexical signatures).  The base URI will 301 redirect to the URI with the slug:

% curl -I http://www.technologyreview.com/view/429274/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352563816"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 16:10:16 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 16:10:16 GMT
X-Varnish: 1779473907
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


But this redirection is transparent to the user, so all the tweets that Topsy analyzes link to the versions with slugs.  This results in two URIs for the article: the version from Sept 19 -- Oct 24 that has 900 tweets, and the Oct 24 -- now version that currently has 3 tweets (up from 0 when I first noticed this).  technologyreview.com is to be commended for not breaking the pre-update URIs (see the post about how ctv.ca handled a similar situation) and issuing 301 redirections to the new versions, but it would have been preferable to have maintained the old URIs completely (perhaps the new software installation has a different default slug length; I'm not familiar with Drupal, and in the code examples I can find a limit is not defined).
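
To make the relationship between the two URIs concrete, here is a rough sketch that separates the base URI from the slug and counts the slug terms.  The helper name split_slug and the assumption that the first two path segments ("view" and the numeric ID) identify the article are mine, inferred from the two example URIs rather than from any Drupal documentation:

```python
from urllib.parse import urlsplit


def split_slug(uri):
    """Split a /section/id/slug/ style URI into (base URI, list of slug terms).

    Assumption: the first two path segments identify the article and the
    third segment, if present, is a hyphen-separated slug.
    """
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    base_path = "/" + "/".join(segments[:2]) + "/"
    slug = segments[2] if len(segments) > 2 else ""
    terms = slug.split("-") if slug else []
    return f"{parts.scheme}://{parts.netloc}{base_path}", terms


OLD = ("http://www.technologyreview.com/view/429274/"
       "history-as-recorded-on-twitter-is-vanishing-from/")
NEW = ("http://www.technologyreview.com/view/429274/"
       "history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/")

old_base, old_terms = split_slug(OLD)
new_base, new_terms = split_slug(NEW)

print(old_base == new_base)              # same base URI for both versions
print(len(old_terms), len(new_terms))    # slug lengths before and after
```

For the two URIs above, the base URIs are identical and only the slug differs, which is why the old and new versions are really aliases of the same article.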

Splitting PageRank with URI aliases is a well-known problem that can be addressed with 301 redirects; this is why most URI shorteners like bitly issue 301 redirects (instead of 302s), so the PageRank will accumulate at the target and not the short URI.  It would be nice if Topsy also merged redirects when computing its pages (or at least offered that as an option).  In the example above, that would result in either of the Topsy URIs (pre- and post-October 24) reporting 900+3 = 903 posts.
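
The merging I have in mind could look something like the following sketch.  It is not how Topsy works (I have no knowledge of their internals); the dictionaries of counts and redirects are hypothetical stand-ins, filled in with the numbers from this post:

```python
def merged_counts(counts, redirects):
    """Sum per-URI post counts onto the final target of each redirect chain.

    counts:    dict mapping URI -> number of posts linking to that URI
    redirects: dict mapping URI -> the URI it 301-redirects to
    """
    def resolve(uri):
        seen = set()
        # Follow the chain until a URI with no redirect; stop on loops.
        while uri in redirects and uri not in seen:
            seen.add(uri)
            uri = redirects[uri]
        return uri

    totals = {}
    for uri, n in counts.items():
        target = resolve(uri)
        totals[target] = totals.get(target, 0) + n
    return totals


OLD = ("http://www.technologyreview.com/view/429274/"
       "history-as-recorded-on-twitter-is-vanishing-from/")
NEW = ("http://www.technologyreview.com/view/429274/"
       "history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/")

counts = {OLD: 900, NEW: 3}    # post counts reported above
redirects = {OLD: NEW}         # the 301 shown in the curl output

print(merged_counts(counts, redirects)[NEW])  # 903
```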

--Michael

Edit: I did some more investigating and found that the slug doesn't matter, only the Drupal node ID of "429274" (those familiar with Drupal probably already knew that).  Here's a URI that should obviously return 404 but instead redirects to the URI with the full title as the slug:

% curl -I http://www.technologyreview.com/view/429274/lasdkfjlajfdsljkaldsf/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352581871"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 21:11:11 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 21:11:11 GMT
X-Varnish: 1782237238
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


This makes the Drupal slug very close to the original Phelps & Wilensky concept from "Robust Hyperlinks Cost Just Five Words Each", which formed the basis for Martin's dissertation mentioned above.  While this is convenient in that it reduces the number of 404s in the world, it is also a bit of a white lie; user agents need to be careful not to assume that the original URI ever existed just because it issues a redirect to a target URI.

Wednesday, November 7, 2012

2012-11-06: TPDL 2012 Conference


It all started last April, on the 9th to be exact, when I received an email from Dr. George Buchanan delivering the good news: my paper had been accepted at the annual international conference on Theory and Practice of Digital Libraries, TPDL 2012. As Program Chair, Dr. Buchanan sent me the reviews and feedback on my paper, entitled “Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?”, which paved the way for the following months of preparation to present it.


Along with submitting the paper, Dr. Nelson gave me permission to submit my PhD proposal to be considered for the Doctoral Consortium at the conference. Scoring my second goal, Dr. Birger Larsen and Dr. Stefan Gradmann sent me a delightful email announcing the committee's acceptance of my proposal, and I was invited to present my work at the consortium the day before the conference.

The hat-trick came a few weeks before the conference in the form of an email from Dr. Larsen proposing that I present my work from the doctoral consortium at the poster session on the first day of the conference. Overwhelmed with joy, I gladly accepted this gracious invitation and started working on the poster.

After an 8-hour drive to New York and a couple of flights, I arrived at Larnaca airport in Cyprus. I can't complain, because two of the activities closest to my heart are driving and travelling. Anyway, I took the bus to Limassol from the airport and was supposed to take another bus to Paphos, where the conference was held, but unfortunately it didn't come. After a quick chat with two French ladies who happened to be heading to Paphos too, we shared a taxi there and I finally arrived at the hotel where I would be spending the following nights, the Cynthiana Beach Hotel. With a captivating view and spacious suites, the Cynthiana was located 10 minutes by bus from the conference venue and halfway to the center of the city.

After being confused by the British system of driving on the opposite side of the road and being directed by the kind locals to the bus station, I took the bus the following morning to the conference venue at the Coral Bay Hotel. Dr. Larsen and Dr. Gradmann were waiting for me and the others to guide us to the presentation room. I was assigned Ms. Justyna Walkowska as my mentor to guide the discussion and give me constructive criticism. I personally loved this consortium model, as it gave the mentors the opportunity to read the proposals in detail before the presentations, giving them an in-depth view of the work and making the feedback more constructive and beneficial.



After giving the presentation and receiving the questions and feedback, I sat down and listened to the work of fellow PhD students: Tuan VuTran, Armand Brahaj, and Nut Limsopatham. Shortly after wrapping up the consortium, Dr. Larsen and Dr. Gradmann took us to the city pier to have an authentic Cypriot dinner. The food, the atmosphere, and the company were marvelous. Later that night I arrived back at the hotel exhausted.

The next morning the conference commenced. Following the welcome notes by Dr. Buchanan, Dr. Mounia Lalmas gave a marvelous keynote speech entitled “User Engagement in the Digital World”. Dr. Lalmas is a visiting principal scientist at Yahoo! Labs Barcelona. She talked about user engagement and the emotional, cognitive, and behavioral connection between the user and the technological resource. She discussed ways to measure and model this engagement, along with selected experiments illustrating those aspects.

After the keynote speech we had a short coffee break where I met some people I hadn't seen since JCDL earlier in June. Then I headed to the 2nd track session, entitled “Analyzing and Enriching Documents”, which included several interesting papers by Róisín Rowley-Brooke, my friend Luis Meneses, Daan Odijk, and Annika Hinze, who had 4 papers published at this conference, which I found fascinating. The lunch break followed, and I had to do a phone interview with Ms. Lesley Taylor from the Toronto Star, who wrote an article about the paper I was presenting at the conference.

Following lunch I attended the session entitled “Extracting and Indexing”, where Guido Sautter, Benjamin Köhncke, and Georgina Tryfou presented their work. The minute madness started shortly after and was followed by the poster session.

Standing by my poster in the middle of the room, I started explaining my work to interested researchers in the field. After a while I started checking out the neighboring posters, and I bet my friend Clare Llewellyn drinks that she would win the best poster award with her brilliant linen cloth poster (spoiler alert, she owes me drinks now!). Later that evening, after the welcome reception, we went out for dinner and drinks at another authentic Cypriot restaurant and had a lovely time.

The following day started with the second keynote speech by Dr. Andreas Lanitis from the Department of Multimedia and Graphic Arts, Cyprus University of Technology, entitled “On the Preservation of Cultural Heritage Through Digitization, Restoration, Reproduction and Usage”. In this captivating talk, Dr. Lanitis discussed the digital preservation of Cypriot cultural heritage artifacts, along with their restoration and reproduction.

After the coffee break I again attended the second track, entitled “Content and Metadata Quality”, where two fascinating papers were presented, one regarding SKOS vocabularies and the other about meta learning from wiki articles. I was fairly nervous because I was due to present my long paper in the following session, just after lunch.

During lunch I had my second phone interview, with Ms. Claire Connelly, a journalist from News Ltd in Australia who was also writing an article about our work. Following lunch, this time I joined the 1st track session, in which I would present my work. It started with Anqi Cui presenting his interesting work on PrEV (Preserving and Providing Web Pages and User-generated Contents). To my surprise he cited my work in his presentation, and a sense of accomplishment flooded me. The next paper, entitled “Preserving Scientific Processes from Design to Publications”, analyzed scientific processes. After that I took the stage, and I was surprised by the large number of attendees. The questions were marvelous, and Cathy Marshall, among others, gave me very precious feedback. Following my presentation, Ray Larson and Maria Sumbana presented the remaining two papers.


After the coffee break we returned for the last round of sessions, for which I again chose track 2, “Information Retrieval”, which featured four more papers. At 7 o'clock we gathered in the lobby to board the buses taking us to an authentic Cypriot restaurant on the outskirts of the town. This one was different, as it had a band and a folk dancing group who taught us how to do the Cypriot round and line group dances.

The following morning I packed my bags and checked out before attending the last day of the conference, which started with a talk by Cathy Marshall from Microsoft Research San Francisco that was, as usual, enticing and captivating. The talk was entitled “Whose content is it anyway? Social media, personal data, and the fate of our digital legacy”, similar to the equally wonderful speech she gave at JCDL. Finally I attended the set of sessions that I had been looking forward to the most, track 2, “User Behavior”, presented by Michael Khoo and Catherine Hall, Sally Jo Cunningham, Fernando Loizides, and my friend Gerhard Gossen.

The closing session followed, concluding the conference, and the best paper/demo/poster awards were handed to the authors, among them our friend Clare Llewellyn.

In conclusion, it was a really well-organized and successful conference. Our work was represented three times (the paper, the doctoral consortium, and the poster), and I attended several interesting sessions, met old colleagues, made a lot of new contacts, and got really great feedback.

-- Hany SalahEldeen