Thursday, May 30, 2013

2013-05-30: World Wide Web Conference WWW2013 in Rio de Janeiro, Brazil, Trip Report

After a long overnight flight, I landed in sunny, beautiful Rio de Janeiro. A couple of months earlier, my paper, “Carbon Dating the Web: Estimating the Age of Web Resources”, had been accepted at the third annual Temporal Web Analytics Workshop (TempWeb03), held in conjunction with the 22nd World Wide Web conference, WWW2013. My colleague Ahmed AlSum's paper, “Archival HTTP Redirection Retrieval Policies”, was accepted as well. Ahmed wrote a beautifully detailed post about the workshop which I encourage everyone to read.

I arrived on Monday the 13th at 6 AM and immediately took a taxi to the Windsor Barra hotel, where the conference was held and where I would be staying for the next five days. My colleague Ahmed had arrived a day earlier, so he bragged that he had already gotten the chance to relax and watch the sunset on the beautiful beach. After a quick shower I went downstairs to the registration area to pick up my ID tag and the conference kit. Everything went completely smoothly and the volunteers were extremely helpful. To put things in perspective: the main language spoken in Brazil is Portuguese, and only a small percentage of the population knows English. Prior to my trip I taught myself the basic greetings in Portuguese and thought that with my knowledge of Spanish, English, French, and Arabic I might get by, but unfortunately I was wrong; I had to sign-language my way out of the airport! Brazilians are very nice and hospitable, but the language stood as a communication barrier. After talking with the young volunteers a little, I salute the organizing committee for their idea: the volunteers were all fluently bilingual, knowledgeable about the city, and mostly college students in computer science or engineering.

Despite my exhaustion, I rushed to attend the tutorial “Measuring User Engagement” by Mounia Lalmas, Heather O’Brien, and Elad Yom-Tov (who couldn’t make it). I had met Dr. Lalmas last October at TPDL 2012 in Cyprus, and since my work focuses on several aspects of user analysis on the web, I was personally interested in her work; the tutorial was definitely worth skipping nap time. They started by explaining what user engagement is, why it matters, and how to measure it. After introducing the concept to the audience, the first part of the talk covered the basic foundations of user engagement: Forrester’s 4 Is, the theory of flow, and how it relates to user engagement. They then covered how to measure user engagement: by self-reporting, by cognitive engagement via physiological measures, or by interaction engagement, i.e., web analytics, explaining each branch in detail along with the approaches and experiments behind it. The second part of the tutorial covered the advanced aspects, mobile user engagement and information seeking, and the session ended with a Q&A period with the attendees. The beauty of this tutorial lies in how Mounia and Heather presented everything: in a natural sequence they took an attendee who knows nothing about user engagement or user studies per se through all the details, ending with the state-of-the-art experiments in the field. In conclusion, the talk was informative and engaging, definitely worth attending.

After lunch, Ahmed and I headed to the Temporal Web 2013 workshop, one of the ten concurrent workshops at WWW. Dr. Marc Spaniol chaired the workshop and introduced the keynote speaker, Dr. Omar Alonso. I was fortunate enough to be invited by Dr. Spaniol to chair one of the three sessions of the workshop, which was an honor and a delight. After the workshop ended we all headed out for a Brazilian dinner at a nearby restaurant and bar.

The next morning we started early, and after breakfast we headed to the second day of tutorials. Since there were about 7-10 tracks running simultaneously throughout the conference, it was very hard to pick which sessions to attend, as several of them were really interesting and related to my work. I attended the “(Big) Usage Data in Web Search” tutorial by Ricardo Baeza-Yates and Yoelle Maarek, presented by the latter. Dr. Maarek described how much queries differ from documents and how small the overlap between queries and documents is. She walked through the stages of the query flow graph via an example about Barcelona: correcting (barcelona, not barelona), specializing (F.C. Barcelona, not just barcelona), generalizing (barcelona cheap hotels), and parallel moves (F.C. Barcelona and Real Madrid). She also explained the reasons behind separating query sessions by task rather than time (going to Rio: where to stay, what to eat, what to visit?), calling these research sessions.

After lunch we headed to the second set of workshops. I picked the 2nd International Workshop on Real-Time Analysis and Mining of Social Streams (RAMSS). The keynote, “Real-time User Modeling and Prediction: Examples from YouTube”, was presented by Ramesh Sarukkai from Google. After a group of fascinating presentations, Dr. Gianmarco De Francisci Morales from Yahoo! Research Barcelona gave an excellent closing keynote presenting SAMOA, a platform that mines big streaming data using MapReduce on Hadoop.

The next morning was the first day of the conference proper, and since it didn’t start until 1 PM, Ahmed and I decided to go to the beach; I picked Copacabana, so we spent the morning there. Back at the hotel we had lunch and headed to the opening session of the conference. After a few words from a distinguished panel of Brazilian technology leaders and the head of the W3C, and a few minor hiccups with the translation headsets, Dr. Luis von Ahn was introduced for his keynote speech. It is safe to say that, personally, after all these years of attending talks, his was the best I have ever attended. Looking around me and checking the Twitter #www2013 feed, I could see the audience sharing my enthusiasm and focusing on every word he said. Dr. von Ahn talked about reCAPTCHA and utilizing human computation in everyday authentication to digitize and transcribe books: 1.1 billion users have helped digitize books using reCAPTCHA to date, resulting in 2 million digitized books annually, which I found fascinating.

Demonstrating the power of human computation, and to help individuals expand their horizons by learning a new language for free in an effective way, he introduced a free language education tool for the world: Duolingo. Learning a new language can be costly and out of reach for people in moderate- to low-income areas, so the motive was to provide a tool that helps an individual learn a new language via computer or phone. Rosetta Stone has been doing that for years, Dr. von Ahn argued, but it costs about 1,000 USD; the motive was to provide a similar tool, if not a more exciting and easier one, free of charge. The game changer was finding a way to fund this project without burdening users with subscriptions. Following the same paradigm as reCAPTCHA, utilizing the computational power of the crowd, Dr. von Ahn analyzed the possibility of using the collective users’ learning/testing phases to translate content from one language to another, and selling these translations to fund Duolingo. He argued that if 1 million native Spanish speakers started learning English, they could translate the entire English content of Wikipedia in less than 80 hours. After several months of studying people’s learning curves and skill improvements per language, the Duolingo team was able to refine the learning steps for each language. They also reached several interesting results: for example, Italian women learn English 10% faster than Italian men, and 34 hours on Duolingo is equivalent to a semester of language learning. After the fascinating keynote, Dr. von Ahn met entrepreneurs in a meet-and-greet session which he started by saying: “I am not gonna charge the users”.

Shortly after, the attendees dispersed to the sessions that best suited their interests. I have never wanted to be in two places at once as much as I did over the following three days of the conference. Interesting work, fascinating findings, and exciting topics opened my research eyes to new horizons. Since my interests will not necessarily match every reader’s, I encourage you to explore the proceedings. In the next few paragraphs I will highlight the sessions I attended over those three days.

From the Social Web Engineering research track I attended “Pick-A-Crowd: Tell Me What You Like, and I’ll Tell You What To Do”. The authors argued that the pull methodology for workers performing Human Intelligence Tasks (HITs) on Amazon’s Mechanical Turk is suboptimal, and that recommending workers based on their social profiles is a better way to perform task-to-worker matching. By building an inverted index of workers through a Facebook app called Open Turk they reached a set of very interesting results. The next paper was “Groundhog Day: Near-Duplicate Detection on Twitter”. As the title suggests, the aim of this study was to extract near-duplicate tweets and classify them as exact copies, nearly exact copies, strong near-duplicates, weak near-duplicates, and finally low overlap. They manually labeled nearly 55,000 tweets using DBpedia and WordNet. For the next presentation I had to migrate quickly to another research track, “Trust and Enterprise Social Networks”, to attend the presentation of “Mining Expertise and Interests from Social Media” by researchers from IBM Research. After that I attended the last presentation in the “Privacy and Personalization” track, “I Know the Shortened URLs You Clicked on Twitter: Inference Attack using Public Click Analytics and Twitter Metadata”, whose authors claim to be the first to perform a click-history study on Twitter.

Next we went to the posters room for a half-hour coffee break. It was definitely educational, given the amount of discussions and ideas I was exposed to while talking with the researchers and poster authors. The following session started at 5 PM, where I attended the “Negative Links and Anomalies in OSN” track. The first two papers, “What Is the Added Value of Negative Links in Online Social Networks?” and “Predicting Positive and Negative Links in Signed Social Networks by Transfer Learning”, were very interesting and informative. The third paper, “CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks”, I personally found fascinating: Alex Beutel from CMU presented work from his internship at Facebook on analyzing page likes to spot spammers and fake accounts. For the last paper in the session I hopped to the neighboring room to attend the end of the “Transforming UIs/Personal & Mature Data” track, where “Rethinking the Web as a Personal Archive” was presented by Siân Lindley as a joint collaboration with Cathy Marshall of Microsoft Research.

Ahmed and I did not attend the last session, as we, along with other students, were invited to a meet-and-greet session with the one and only Sir Tim Berners-Lee, where we got the opportunity to ask him several questions about science, research, and industry. Answering Ahmed, he explained the Memento project and discussed web preservation with the audience. Afterwards, Ahmed and I took a walk in a neighboring area and had dinner in a small hole-in-the-wall place, which was delightful.

I decided the next morning would be pure relaxation, as I was already exhausted, so I spent it on the beach opposite the hotel. After lunch we attended the second keynote by Dr. Miguel Nicolelis, Professor of Neuroscience at the Duke University School of Medicine. The speech was about brain-computer interfaces and the experiments his group has performed in this field, which was both refreshing and fascinating. He ended his speech with an initiative his lab is working on, the Walk Again Project, which aims to have a paraplegic person walk and deliver the opening kick at the 2014 World Cup in Brazil; I found it mind-blowing. Following the keynote, a panel was held entitled “Net Neutrality and Internet Freedom”.

At 5 PM the first sessions started and I attended multiple presentations across different tracks. I started with “Wisdom in the Social Crowd: an Analysis of Quora” from the “OSN Analysis and Characterization” track, then attended the “User Behavior” track, and finally the user and behavior modeling track for the presentation “Towards a Robust Modeling of Temporal Interest Change Patterns for Behavioral Targeting”. After a coffee break and more of the poster session, the second session started, where I attended the “Mining Collective Intelligence in Groups” presentation in the “Web Mining” track, then the last three presentations of the “Recommender Systems” track.

After the sessions, the organizers led us to the buses taking us to the typical Brazilian gala dinner. After approximately an hour’s drive through beautiful Rio we arrived at the restaurant Porcão. To the melodies of sweet samba we were greeted in the restaurant’s open area, where a traditional band was playing. Once all the buses had arrived we were led to the dining room, where I tasted some of the most amazing beef I have ever had (Brazilian picanha). The dinner ended with a smaller-scale reenactment of the famous Carnival dances. Finally, an amazing band of 18 young performers kept the audience dancing all night to the songs of The Beatles in a captivating samba infusion.

The next day was the last day of the conference. After a few announcements and the closing ceremony, Dr. Jon Kleinberg gave the closing keynote. His work on memorable quotes, the experiments he conducted to identify them in movies, and how the probability of adopting a behavior depends on the number of network neighborhoods adopting it, were quite fascinating. After the keynote, Ahmed and I attended the developers track for the session on “ResourceSync: Leveraging Sitemaps for Resource Synchronization”, a joint work between Cornell, Michigan, Los Alamos National Lab, and our one and only Old Dominion University. After a short coffee break I attended Cathy Marshall’s presentation on “Saving, Reusing, and Remixing Web Video: Using Attitudes and Practices to Reveal Social Norms”. After this presentation, with our bags packed, Ahmed and I left the hotel to race through the traffic and catch our flights back home.

A rather funny but unfortunate thing happened at the Miami airport border control: they kept me waiting for 4.5 hours while they processed my papers. Well, I was watching a special about Michael Jordan on ESPN, so I can’t complain much.

Overall, it was a very successful conference and trip. We got to present our work, represent the research group and Old Dominion University, attend several enlightening sessions, make great contacts, and exchange a lot of ideas.

For more coverage please check out:
-- Hany M. SalahEldeen

Wednesday, May 29, 2013

2013-05-29 mcurl - Command Line Memento Client

The Memento protocol works in two directions:
  • Server implementation: the server complies with the Memento protocol, so it can read the "Accept-Datetime" header, perform content negotiation in the datetime dimension, and return the memento nearest the requested datetime to the user. Successful examples include the Internet Archive Wayback Machine, the British Library Wayback Machine, and DBpedia.
  • Client implementation: the user needs a tool to set the requested URI along with the preferred datetime in the past. Current tools include the Firefox add-on MementoFox, the British Library Memento Service, and Memento Browser for Android and iPhone.
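The datetime negotiation both directions implement can be sketched in a few lines. This is a minimal illustration, not part of any of the tools above: `accept_datetime_header` builds the header a client would send, and `pick_closest` mimics what a TimeGate does when choosing among the mementos it knows about (both function names are mine).

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Format a datetime as the RFC 1123 value the Accept-Datetime header requires."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def pick_closest(mementos, target):
    """Given (uri, datetime) pairs from a TimeMap, pick the memento nearest
    the requested datetime -- the server side of the negotiation."""
    return min(mementos, key=lambda m: abs(m[1] - target))

target = datetime(2010, 2, 5, 14, 28, tzinfo=timezone.utc)
print(accept_datetime_header(target))  # Fri, 05 Feb 2010 14:28:00 GMT
```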
Today, we are pleased to announce mcurl, a command-line Memento client. mcurl is a wrapper for the Unix curl command that is capable of doing content negotiation in the datetime dimension with Memento TimeGates. mcurl supports all curl parameters in addition to new Memento-related parameters.

Users may use the curl command to do content negotiation in the datetime dimension by passing the "Accept-Datetime" header with the -H argument and connecting directly to the TimeGate; however, mcurl offers more features than that.
  • TimeGate identification: with mcurl, the user needs to specify only the datetime and the URI. mcurl has its own default TimeGate, which can be overridden by the user. mcurl can also read the TimeGate from the Link header of the response returned for the URI.
  • Handling redirection: mcurl implements the HTTP redirection retrieval policy as it appears in section 4 of the Memento Internet Draft v7.
  • Embedded resource rewriting: mcurl provides two modes for embedded resources: strict mode, where mcurl accepts the embedded resource URIs from the web archive, and thorough mode, where mcurl repeats the content negotiation for each embedded resource URI to get the best/nearest resource. Thus, using the Memento Aggregator, mcurl can construct the page from multiple archives.
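The difference between the two modes can be sketched as follows. This is only an illustration of the idea, not mcurl's Perl code: `resolve` stands in for a real datetime negotiation against a TimeGate, and the archive URI pattern is hypothetical.

```python
def rewrite_embedded(uris, mode, resolve):
    """Strict mode keeps the URIs the archive returned; thorough mode
    renegotiates each embedded resource individually."""
    if mode == "strict":
        return list(uris)
    return [resolve(u) for u in uris]  # thorough

# Toy resolver: pretend every live URI has a memento in one archive.
resolve = lambda u: "" + u
live = ["", ""]
print(rewrite_embedded(live, "thorough", resolve))
```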
For example, a curl command to get a memento near Fri, 05 Feb 2010 could be formed as follows:

curl -H "Accept-Datetime: Fri, 05 Feb 2010 14:28:00 GMT"

If you look closely at the returned page, you will find that the embedded resources came from the live web instead of the web archive. This happens because the current Wayback Machine Memento implementation doesn't rewrite the embedded resources. This problem is easily solved by mcurl.

perl ./ -L  --mode thorough 
--datetime 'Fri, 05 Feb 2010 14:28:00 GMT' 
--replacedump dump.txt

Environment setup
mcurl is written in Perl; version 5 or later is required. curl version 7.15.5 and the HTML::Parser package are also required.

Memento related Parameters
mcurl supports a set of Memento-related parameters that let the user set the preferred datetime, TimeGate, and embedded resource mode.
  • -tm, --timemap <link|rdf>: select the type of the TimeMap; it may be link or rdf.
  • -tg, --timegate <uri[,uri]>: select the preferred TimeGates.
  • -dt, --datetime <date in RFC 822 format>: select the date in the past (for example, Thu, 31 May 2007 20:35:00 GMT).
  • -mode <thorough|fast>: specify mcurl's embedded resource policy; the default value is thorough.
  • --debug: enable debug mode to display more output.
mcurl is available in a GitHub repository; three files are required.
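The --datetime value is an RFC 822 date, and the invalid-header example below returns a 400. A small sketch (not mcurl's actual Perl code) of how such validation can be done with a standard-library parser:

```python
from email.utils import parsedate_to_datetime

def parse_datetime_arg(value):
    """Return a datetime for a valid RFC 822 date string, else None
    (a caller would then report an error, or a server a 400)."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):  # unparseable date
        return None

print(parse_datetime_arg("Thu, 31 May 2007 20:35:00 GMT"))
print(parse_datetime_arg("Sun, 23 July xxxxxxxxxxxxxxxx"))  # None
```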

Usage Examples
In this section, we list some usage examples that illustrate the behavior of mcurl.
  1. Calling an original resource with the default timegate
      -I -L --debug --datetime 'Sun, 23 July 2006 12:00:00 GMT'
    Expected results: it performs content negotiation in the datetime dimension, using the default timegate when required.

  2. Calling a timemap in link format with the default timegate
      -I -L --debug --timemap link
    Expected results: it downloads the timemap in application/link format, using the default timegate.

  3. Calling an original resource with a specific timegate
      -I -L --debug --timegate ''
    Expected results: it performs content negotiation in the datetime dimension and gets the last memento, using the specified timegate when required.

  4. Calling an original resource with a specific timegate and datetime
      -I -L --debug --datetime 'Sun, 23 July 2006 12:00:00 GMT' --timegate ''
    Expected results: it performs content negotiation in the datetime dimension, using the specified timegate when required.

  5. Calling a timemap in link format with a specific timegate
      -I -L --debug --timemap link --timegate ''
    Expected results: it downloads the timemap in application/link format, using the specified timegate when required.

  6. Calling an original resource that returns a timegate in its response headers
      -I -L --debug --datetime "Thu, 23 July 2009 12:00:00 GMT"
    Expected results: it performs content negotiation in the datetime dimension; the site provides a timegate which overrides the default timegate.

  7. Calling an original resource (R1) that has a redirection (R2), where (R1) has valid mementos
      -I -L --debug --datetime "Thu, 23 July 2009 12:00:00 GMT"
    Expected results: it performs content negotiation in the datetime dimension for R2.

  8. Calling an original resource (R1) that has a redirection (R2), where (R1) does NOT have valid mementos
      -I -L --debug --datetime "Thu, 23 July 2009 12:00:00 GMT"
    Expected results: it performs content negotiation in the datetime dimension using R2.

  9. Calling an original resource that has a timegate redirection
      -I -L --debug --datetime "Mon, 23 July 2007 12:00:00 GMT"
    Expected results: it performs content negotiation in the datetime dimension; the site provides a timegate which overrides the default timegate. The timegate /tg/ has a redirection to /ta/.

  10. Calling an original resource that has a timegate redirection
      -I -L --debug --datetime "Sat, 23 July 2011 12:00:00 GMT"
    Expected results: it performs content negotiation in the datetime dimension; the site provides a timegate which overrides the default timegate. The timegate /tg/ has a redirection to /ts/.

  11. Calling an original resource with an acceptable time period
      -I -L --debug --datetime 'Thu, 23 July 2009 12:00:00 GMT; -P5MT5H;+P5MT6H'
    Expected results: it performs content negotiation in the datetime dimension within the specified time period, which has valid mementos, using the default timegate when required.

  12. Calling an original resource with a NOT acceptable time period
      -I -L --debug --datetime 'Thu, 23 July 2009 12:00:00 GMT; -P5MT5H;+P5MT6H'
    Expected results: it performs content negotiation in the datetime dimension within the specified time period, which does not have any valid mementos, using the default timegate when required.

  13. Calling an original resource with an invalid Accept-Datetime header
      -I --debug --datetime 'Sun, 23 July xxxxxxxxxxxxxxxx'
    Response code: 400

  14. Overriding the discovered timegate with a specific one
      -I -L --debug --datetime "Sat, 23 July 2011 12:00:00 GMT" --timegate '' --override

  15. Using the --replacedump switch to dump the replacements for the embedded resources to an external file for further analysis
      -L --mode thorough --datetime "Sat, 03 Dec 2010 12:00:00 GMT" --replacedump cnnreplace.txt

  16. Accessing the DBpedia archive
      -L --mode thorough --datetime "Sat, 03 Dec 2010 12:00:00 GMT"
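The acceptable-time-period examples above pass ISO 8601 durations alongside the datetime (P5MT5H is five months and five hours). A hypothetical parser for that "datetime; -duration;+duration" form, handling only the month/day/hour fields that appear here:

```python
import re
from email.utils import parsedate_to_datetime

def parse_window(value):
    """Split 'RFC 822 date; -duration;+duration' into a target datetime and
    a {'-': (months, days, hours), '+': (...)} acceptance window."""
    dt_part, *durs = [p.strip() for p in value.split(";")]
    target = parsedate_to_datetime(dt_part)
    window = {}
    for d in durs:
        sign = d[0]  # '-' = before the target, '+' = after it
        m = re.fullmatch(r"[+-]P(?:(\d+)M)?(?:(\d+)D)?(?:T(?:(\d+)H)?)?", d)
        months, days, hours = (int(g or 0) for g in m.groups())
        window[sign] = (months, days, hours)
    return target, window

target, window = parse_window("Thu, 23 July 2009 12:00:00 GMT; -P5MT5H;+P5MT6H")
print(window)  # {'-': (5, 0, 5), '+': (5, 0, 6)}
```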
Ahmed AlSum

Saturday, May 25, 2013

2013-05-25: Game Walkthroughs As A Metaphor for Web Preservation

Do you remember playing the Atari 400/800 game "Star Raiders"?  Probably not, but for me it pretty much defined my existence in middle school: the obvious Star Wars inspiration, the stereo sound, the (for the time) complex game play, the 3D(-ish) first-person orientation -- this was all ground-breaking stuff for 1979.  It, along with games like "Eastern Front (1941)", inspired me at a young age to become a video game developer; an inspiration which did not survive my undergraduate graphics course.

I could encourage you to (re)experience the game by pointing you to the ROM image for the game, as well as an appropriate emulator (I used "Atari800MacX"), but without the venerable Atari joystick (the same one used in the more famous 2600 system), it just doesn't feel the same to me.  And although the original instructions have been scanned, the game play is complex enough that, unlike most games of the era, you can't immediately understand what to do.  So although emulation is possible, probably the best way to "share" my middle school experience with you is through one of the many game walkthroughs that exist on YouTube.

Game walkthroughs are quite popular for a variety of purposes: advertising the game, demonstrating a gamer's proficiency (e.g., speedruns), illustrating short cuts and cheats, even as a new form of cinema (e.g., "Diary of A Camper").  Walkthroughs are fascinating to me because they capture the essence of the game (from the point of view of a particular player) in what can be thought of as migration: recording and uploading what was originally an ephemeral experience.  Obviously the game play is canned and not interactive, but in some sense the expertly played Star Raiders session linked above does a better job of conveying the essence of 1981 than emulation, at least with respect to the 10 minute investment that the YouTube video represents.  (And yes, I realize the video was probably generated from an emulator.)  But let's put aside the emulation vs. migration debate for the moment (see David Rosenthal's "Rothenberg Still Wrong" if you'd like to read more about it).

I think game walkthroughs can provide us with an interesting metaphor for web archiving: not simply walkthroughs of web sessions instead of game sessions (though that is possible), but in the sense of capturing a series of snapshots of dynamic services and archiving them.  Given "enough" snapshots, we might be able to reconstruct the output of a black box.

Consider Google Maps: a useful service so completely at odds with our current web archiving capabilities that "archiving Google Maps" isn't even a defined concept (see David's IIPC 2012 and 2013 blog posts for background on "archiving the future web").  The Internet Archive's Wayback Machine claims to have 11,000+ mementos (archived web pages) for*/

But only the first page is archived, clearly not the entire service.  If you start interacting with the mementos in the Wayback Machine, you'll find they're actually reaching out to the live web (see Justin's "Zombies in the Archives" for more discussion on this topic).  But Google Maps is sharable at each state of a user's interaction.  For example, here is the URL of the ODU Computer Science building in Google Maps:,+Norfolk,+VA&hl=en&sll=36.885425,-76.306227&sspn=0.021179,0.035148&oq=4700+e,+Norfolk,+VA&t=h&hnear=4700+Elkhorn+Ave,+Norfolk,+Virginia+23508&z=16

It shortens to:

for easier sharing.  The shortened URIs of two zoom operations are:

I sent all three URIs to WebCite for archiving and they are accessible at, respectively:

Looking at a screenshot in WebCite, it appears that at least that view is archived:

But the activity log shows that nearly all the connections actually go to various machines instead of archived versions at

Assuming we solve the problem of archiving all the requests and not reaching out to the live web (e.g., client-side transactional archiving), the next problem would be determining that these two Google Map URIs:,+Norfolk,+VA&hl=en&ll=36.886692,-76.308128&spn=0.003744,0.004393&sll=36.885425,-76.306227&sspn=0.021179,0.035148&oq=4700+e,+Norfolk,+VA&t=h&hnear=4700+Elkhorn+Ave,+Norfolk,+Virginia+23508&z=18,+Norfolk,+VA&hl=en&ll=36.887636,-76.306465&spn=0.003556,0.00537&sll=36.885425,-76.306227&sspn=0.021179,0.035148&oq=4700+e,+Norfolk,+VA&t=h&hnear=4700+Elkhorn+Ave,+Norfolk,+Virginia+23508&z=18

are "similar enough" that they can be substituted for each other in the playback of an archived session.  For example, if an archive has a memento of the first URI but the client is requesting a memento of the second URI, rather than return a 404 the first URI can probably be substituted in most cases.  Of course, this notion of similarity will be both a function of the URIs being archived (e.g., exploiting the fact that the above URIs are about geospatial data) and of the client accessing the mementos (different sessions may have different thresholds for similarity).

For example, suppose you wanted to see the state of the ODU campus ca. 2013: from the ODU CS building, to the parking garage on 43rd & Elkhorn, to the football stadium off Hampton Boulevard.  Taking the three URIs above plus seven more, I uploaded an image slideshow to Youtube:

This certainly preserves the experience to an extent (university campuses are constantly changing and growing, and these aerial views will soon be "archival" instead of "current").  But imagine that instead of a series of PNGs strung together into a video, there were 10 different HTML pages in an archive (along with the associated images).  You could still "scroll", but the transition from one memento to the next would be discrete (i.e., jerky instead of smooth like the live version of Google Maps).  To stretch the walkthrough metaphor further, it would be helpful if a memento were navigable not only by the links that appear in the page, but in the context of the archived URIs that precede and follow it; not unlike TAMU's Walden's Paths, but with archived content and the path information as a global property of the memento itself.

Over time there might be enough paths through Google Maps that we might be able to say we've preserved some usable percentage of it -- or at least the ODU CS building, parking garage, and football stadium ca. 2013.  There are a number of issues to be researched to make this easy enough for people to do (many of which our group is investigating), but the popularity of game walkthroughs and their preservation side-effects suggests to me that the web archiving community should be informed by them.  And if so, perhaps that will assuage my middle school dream of being a video game developer. 


Tuesday, May 21, 2013

2013-05-21: An Update About Archiving Tweets

Today I encountered this article about a UK driver bragging on Twitter about hitting a cyclist.  Rather than extend an already lengthy post about archiving tweets from two weeks ago, this example gets its own post.

Summary: a woman hit a bicyclist participating in a race (the cyclist was apparently not seriously injured) and then bragged about it on Twitter. The cyclist was not going to report the event, but her bragging changed his mind and he contacted the police:

The driver deleted her Twitter account, but the offending evidence had already been archived -- not just by concerned citizens making copies (check the thread in the tweet above); Topsy has archived the evidence as well.

Interestingly, unlike the Twitpic examples in the previous post, the Instagram images do not have a thumbnail on Topsy (the thumbnails are pulled directly from  I won't be posting about every ill-advised tweet that will surely occur in the future, but this example was too good to pass up.  I predict the driver has earned a mention on an upcoming "Wait Wait... Don't Tell Me!"


Monday, May 20, 2013

2013-05-13: Temporal Web Workshop 2013 Trip Report

On May 13, Hany SalahEldeen and I attended the third Temporal Web Analytics Workshop, collocated with WWW 2013 in Rio de Janeiro, Brazil.

Marc Spaniol, from the Max Planck Institute for Informatics, Germany, welcomed the audience in the opening note of the workshop. He emphasized the workshop's goal of building a community of interest in the temporal web.

Omar Alonso, from Microsoft Silicon Valley, was the keynote speaker, with a presentation entitled “Stuff happens continuously: exploring Web contents with temporal information”. Omar divided his presentation into three parts: time in document collections, social data, and exploring the web using time.

In the first part, on time in document collections, Omar gave an introduction to the temporal dimension of documents. He defined the characteristics of temporality by first asking “What is time?”. Time may be used in a normalized or a hierarchical format, and temporal expressions come in four types: times; durations; sets, which may be explicit (e.g., May 2, 2012) or implicit (e.g., Labor Day); and relative expressions. There are different approaches to extracting temporal expressions, such as temporal taggers and Named Entity Recognition (NER) for time, and time can be expressed in the TimeML format. Omar explained that people care about temporality because it describes their landmarks and evolution, such as winning a game for a soccer player or financial quarters for an accountant. By including the crowd, we can achieve a complete annotated calendar for free by combining all the hot topics of the year.
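As a toy illustration of the extraction problem, a regular expression can catch explicit expressions like "May 2, 2012", while implicit ones like "Labor Day" need the knowledge a real temporal tagger or NER system brings:

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches only fully explicit dates of the form "Month D, YYYY".
EXPLICIT_DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

text = "The workshop call went out on May 2, 2012, just before Labor Day."
print(EXPLICIT_DATE.findall(text))  # ['May 2, 2012'] -- 'Labor Day' is implicit
```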

Then, Omar explained the effect of social media on the concepts of the temporal dimension and the document.
  • Twitter limits the document to 140 characters. Time in Twitter is supported by: trending topics (e.g., Mother's Day); hashtags (e.g., #tempweb2013); cashtags, hashtags starting with $ for financial information (e.g., $apple); and group chats, where people tweet at a specific time to discuss a specific topic. 
  • Time in Facebook is captured by the Timeline, photos over time, and generic events. 
  • Temporally-aware signals: user interests may be time-sensitive, for example tweeting about recent, seasonal, or ongoing activities.  
  • Community Question Answering (CQA) also has a temporal dimension. CQA helps users answer questions that they can't answer using web search engines. Some answers don't change through time (e.g., what is the distance to the moon?), while others are time-sensitive. 
  • Reddit, a sharing platform popular in the US, also has a time dimension. Reddit is popular enough to attract famous people to communicate with the crowd. 
  • Review systems such as Amazon, Yelp, and Foursquare also hold temporal characteristics, as a review may change through time. 
  • Time in Wikipedia is tracked by the evolution of edits by users.
After that, Omar moved to the last part of his presentation, exploring the web using time. Correlating different sources on the web by time can give us a better understanding of an event, for example the relation between a hashtag and a Wikipedia article. CrowdTiles is an example of combining the Web, Twitter, and Wikipedia as part of Bing search results.
While this approach works well for popular events, it needs modification to look back at less popular events. Combining different data sources introduces new research questions: how to manage duplicates and near-duplicates, what the temporal precedence between them is, how to rank the results by temporal value, and how to evaluate the success of these techniques.

Session 1: Web Archiving

Daniel Gomes, from the Portuguese Web Archive, gave a presentation entitled “A Survey of Web Archive Search Architectures”. Daniel gave an overview of the current search paradigms in web archives. The use cases showed that users demand Google-like search from web archives. Of the web archives surveyed, 89% have URL search, 79% have metadata search, and 67% have full-text search. These numbers were computed based on publications about the web archives and the authors' experience.

Then, Daniel gave three examples of search systems for web archives. The Portuguese Web Archive provides full-text search over 1.2B documents using NutchWAX, with partitioning based on time and document. EverLast is a P2P architecture in which the tasks (crawling, versioning, and indexing) are distributed among different nodes. The Wayback Machine is a URL-search architecture; it uses flat sorted files (called CDX) to index the webpages. Daniel proposed a single portal for search across web archives. The new system faces the challenge that web archive data is spread across different systems and technologies. A prototype of the system was tested on the Portuguese Web Archive and showed good results. The new system requires a new mechanism to rank search results from different sources, in addition to a user interface design that combines these sources.
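The CDX approach mentioned above is worth a small illustration: because the index is a flat file sorted by URL key, a URL lookup reduces to a binary search over sorted lines. The entries below are hypothetical and the line format is simplified to "urlkey timestamp"; real CDX lines carry more fields.

```python
import bisect

# A tiny, sorted, in-memory stand-in for a CDX index file.
cdx_lines = sorted([
    "example.org/ 20050101000000",
    "example.org/ 20080615120000",
    "example.org/about 20070301000000",
])

def lookup(urlkey):
    """Return all captures for a URL key via binary search on the sorted index."""
    keys = [line.split(" ", 1)[0] for line in cdx_lines]
    lo = bisect.bisect_left(keys, urlkey)
    hi = bisect.bisect_right(keys, urlkey)
    return cdx_lines[lo:hi]

print(lookup("example.org/"))  # both captures of example.org/
```

On disk, the same effect is achieved by seeking within the sorted file rather than loading it into memory, which is what lets the Wayback Machine answer URL queries over billions of captures.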

The next presentation in the session was mine, entitled “Archival HTTP Redirection Retrieval Policies”. In this presentation, we studied URI lookup in the web archive, taking into consideration the HTTP redirection status of the live or archived URI. We proposed two new measurements: Stability, which computes the change of the URI's status and location through time; and Reliability, which computes the percentage of mementos that end with a 200 HTTP status out of the total number of mementos per TimeMap. Finally, we proposed two new retrieval policies.
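As a toy illustration of the Reliability definition above (not the paper's implementation), the measure can be computed once the final HTTP status of each memento in a TimeMap is known. The status list below is hypothetical; in practice each memento would be dereferenced, following any archival redirects, to find its final status.

```python
def reliability(final_statuses):
    """Percentage of mementos in a TimeMap that end with a 200 HTTP status."""
    if not final_statuses:
        return 0.0
    ok = sum(1 for status in final_statuses if status == 200)
    return 100.0 * ok / len(final_statuses)

# Hypothetical final statuses of five mementos after following redirects:
statuses = [200, 200, 404, 200, 503]
print(reliability(statuses))  # 3 of 5 mementos end with 200 -> 60.0
```

Stability would additionally track how the URI's status and location change across mementos over time, rather than just the final outcome.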

Daniel Gomes gave another presentation entitled “Creating a Billion-Scale Searchable Web Archive”. He described their experience building the Portuguese Web Archive. First, they integrated data from three collections, some of which were on CDs. They built tools to convert the saved web files to the ARC format. Then, in 2007 the Portuguese Web Archive started its own live web crawl, focusing on Portuguese-speaking domains except .br. They built a Heritrix add-on, called DeDuplicator, that saves 41% of disk space on weekly crawls and 7% on daily ones, for a total of 26.5 TB/year. The Portuguese Web Archive has enabled full-text searching, internationalization support, and a new graphical design.

Session 2: Identifying and leveraging time information

Julia Kiseleva, from Emory University, presented “Predicting temporal hidden contexts in web sessions”. In her presentation, Julia analyzed a web log as a set of user actions, aiming to find contexts that help build more accurate local models. Julia built a user navigation graph and used two partitioning mechanisms: horizontal partitioning based on context (e.g., geographical position) and vertical partitioning based on the action alphabet (e.g., "ready to buy" or "just browsing"). Julia used in her experiment. She also suggested using a sitemap to define the set of applicable steps.

Hany SalahEldeen, from Old Dominion University, presented “Carbon Dating The Web: Estimating the Age of Web Resources”. Hany estimated the creation dates of URIs based on different sources, such as the crawling date from Google, the first observation in web archives, or the first mention in social media.
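The core idea can be sketched simply: take the earliest timestamp observed across the available sources as the estimated creation date. The source names and dates below are hypothetical; the actual Carbon Date tool queries live services and archives for each URI.

```python
from datetime import date

def estimate_creation_date(observations):
    """Return the earliest observed date across sources, or None if nothing was found."""
    dates = [d for d in observations.values() if d is not None]
    return min(dates) if dates else None

# Hypothetical per-source observations for a single URI:
observations = {
    "google_crawl": date(2010, 6, 1),
    "first_memento": date(2009, 3, 15),   # earliest archived copy
    "first_tweet": None,                  # no mention found
}
print(estimate_creation_date(observations))  # -> 2009-03-15
```

The estimate is an upper bound on the true creation date: a resource can only be older than its earliest observation, never younger.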

Omar Alonso presented “Timelines as Summaries of Popular Scheduled Events”. Omar built a framework with minute-level granularity to compare the events during a game with the social media reactions. Omar gave some examples from the World Cup and the tweets about the games. The results showed a strong relationship between the game events and user activity on Twitter.

Session 3: Studies & Experience Sharing

Lucas Miranda presented “Characterizing Video Access Patterns in Mainstream Media Portals”. Lucas studied the video access patterns on major Brazilian media providers and showed figures summarizing their results.

Hideo Joho, from the University of Tsukuba, presented “A Survey of Temporal Web Search Experience”. Hideo studied temporal aspects of web search by surveying 110 people on 18 questions related to their recent search experience. Hideo showed quantitative and qualitative analyses of his results.

The LAWA project wrote a TempWeb 2013 roundup report.
Ahmed AlSum

Monday, May 13, 2013

2013-05-09: HTTP Mailbox - Asynchronous RESTful Communication

We often encounter web services that take a very long time to respond to our HTTP requests. In the case of an eventual network failure, we are forced to issue the same HTTP request again. We frequently consume web services that do not support REST. If they did, we could utilize the full range of HTTP methods while retaining the functionality of our application, even when the external API we utilize in our application changes. We sometimes wish to set up a web service that takes job requests, processes long-running job queues, and notifies the clients individually or in groups. HTTP does not allow multicast or broadcast messaging. HTTP also requires the client to stay connected to the server while the request is being processed.

Introducing the HTTP Mailbox - an asynchronous RESTful HTTP communication system. In a nutshell, the HTTP Mailbox is a mailbox for HTTP messages. Using its RESTful API, anyone can send an HTTP message (request or response) to anyone else, independent of the availability, or even the identity, of the recipient(s). The HTTP Mailbox stores these messages and delivers them on demand. Each HTTP message is encapsulated in the body of another HTTP message and sent to the HTTP Mailbox using the POST method. Similarly, the HTTP Mailbox encapsulates the HTTP message in the body of its response when a GET request is made to retrieve the messages.

Tunneling HTTP traffic over HTTP was also explored in Relay HTTP. But Relay HTTP relays live HTTP traffic back and forth and does not store HTTP messages. It works like a proxy server, only overcoming JavaScript's cross-origin restriction on Ajax requests. Relay HTTP still requires the client and server, along with the relay server, to meet in time.

The store-and-forward nature of the HTTP Mailbox is inspired by Linda. We have taken the simplicity of the Linda model and implemented it using HTTP at the scale of the Web. This approach enables asynchronous, indirect, time-uncoupled, space-uncoupled, individual, and group communication over HTTP. Time-uncoupling means the sender and recipient(s) do not need to meet in time to communicate, while space-uncoupling means the sender and recipient(s) do not need to know each other's identity to communicate. The HTTP Mailbox also enables utilization of the full range of HTTP methods otherwise unavailable to standard clients and servers.

The above figure shows the lifecycle of a typical HTTP message using the HTTP Mailbox in four steps. We will walk through this example to explain how it works. Assume that the client wants to send the following HTTP message to the server at

> PATCH /tasks/1 HTTP/1.1
> Host:
> Content-Type: text/task-patch
> Content-Length: 11
>
> Status=Done

In step 1, assuming that the HTTP Mailbox server is running on, the message will be encapsulated in a POST request like this:

> POST /hm/ HTTP/1.1
> Host:
> HM-Sender:
> Content-Type: message/http; msgtype: request
> Content-Length: 108
>
> PATCH /tasks/1 HTTP/1.1
> Host:
> Content-Type: text/task-patch
> Content-Length: 11
>
> Status=Done

In step 2, the HTTP Mailbox will store the message and return the URI of the newly created message in the response, as follows:

< HTTP/1.1 201 Created
< Location:
< Date: Thu, 20 Dec 2012 02:22:56 GMT

In step 3, the recipient makes an HTTP GET request to the HTTP Mailbox server to retrieve its messages. To retrieve the most recent message sent to "", a request will look like this:

> GET /hm/ HTTP/1.1
> Host:

In step 4, the response from the HTTP Mailbox will contain the most recent message sent to "".  The response will also include a "Link" header giving the URLs to navigate through the chain of messages for that recipient.

< HTTP/1.1 200 OK
< Date: Thu, 20 Dec 2012 02:10:22 GMT
< Link: <>; rel="first",
<  <>; rel="last self",
<  <>; rel="previous",
<  <>; rel="current"
< Via: Sent by
<  on behalf of
<  delivered by
< Content-Type: message/http; msgtype: request
< Content-Length: 108
<
< PATCH /tasks/1 HTTP/1.1
< Host:
< Content-Type: text/task-patch
< Content-Length: 11
<
< Status=Done
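The encapsulation scheme walked through above can be sketched in a few lines: the original HTTP message is carried verbatim as the body of a POST to the mailbox, and retrieval is the reverse operation. The host names and helper functions below are hypothetical illustrations, not part of the reference implementation on GitHub.

```python
def encapsulate(message, mailbox_path="/hm/", sender="client.example.org"):
    """Wrap a raw HTTP message in the body of a POST to the HTTP Mailbox."""
    headers = [
        "POST %s HTTP/1.1" % mailbox_path,
        "Host: mailbox.example.org",          # hypothetical mailbox host
        "HM-Sender: %s" % sender,
        "Content-Type: message/http; msgtype: request",
        "Content-Length: %d" % len(message),
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + message

def decapsulate(response):
    """Extract the stored HTTP message from a mailbox message body."""
    head, _, body = response.partition("\r\n\r\n")
    return body

# The PATCH request from the example above, with a hypothetical target host:
original = ("PATCH /tasks/1 HTTP/1.1\r\n"
            "Host: server.example.org\r\n"
            "Content-Type: text/task-patch\r\n"
            "Content-Length: 11\r\n\r\n"
            "Status=Done")

wrapped = encapsulate(original)
assert decapsulate(wrapped) == original  # the round trip preserves the message
```

Because `partition` splits on the first blank line only, the inner message's own header/body separator survives the round trip untouched, which is exactly why the mailbox can store and later deliver arbitrary HTTP messages without parsing them.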

A tech report describing the HTTP Mailbox in detail has been published on arXiv. A reference implementation of the HTTP Mailbox can be found on GitHub.

We have already used the HTTP Mailbox successfully in the following applications.
  • Preserve Me! - a distributed web object preservation system that establishes a social network among web objects and uses the HTTP Mailbox for its communication needs.
  • Preserve Me! Viz - a dynamic and interactive network graph visualization tool that gives insight into the Preserve Me! graph and communication.

Where else can we use the HTTP Mailbox?
  • Warrick - a tool to restore lost websites. It can use the HTTP Mailbox to accept service requests and send status notifications.
  • Carbon Dating the Web - a tool to find the age of a resource at a given URL. This process usually takes a few minutes per request in the queue. It can utilize the HTTP Mailbox to accept service requests and send the response when ready.
  • Device notifications - related to software updates, general application messaging.
  • Any application that needs asynchronous RESTful HTTP messaging.


Sawood Alam

Tuesday, May 7, 2013

2013-05-07: Who Is Archiving Your Tweets?

Who is archiving your tweets?

You're probably thinking "the Library of Congress".  And you're right, since 2010 they have been (see the announcements from Twitter and LC).  But LC is currently providing access only to researchers, and the scale of the archive makes access challenging (see LC's January 2013 white paper that provides a status update on the project).

To say I think this joint project between LC and Twitter is exciting and important is an understatement; I could go on about the scholarly importance, the cultural and technological record, the phenomena of social media, etc.  So I was surprised (but in retrospect, should not have been) when almost immediately afterwards projects like surfaced so you could opt out of the archiving of your public tweets.

However, while you might be able to prevent LC from archiving your tweets, companies like Topsy are archiving them, or at least some of them.  Topsy is one of my new favorite sites, in part because they archive tweets; not necessarily because archiving them is the right thing to do (tm), but presumably because: 1) it allows them to build interesting services on top of the tweets, and 2) deleting them is probably more work than not deleting them*.  Hany SalahEldeen and I began exploring Topsy in the context of his research on temporal intent in social media link sharing.

Although I think they think their primary business model is searching the social web, to me the most interesting services are generating the retweet and link neighborhoods for tweets.  For example:

is Topsy's page about me, and:

provides all the various tweets that linked to:

Topsy will promote your status to "influential" or "highly influential", presumably based on a mix of followers and/or retweets (e.g., Clay Shirky is "highly influential" with 274k followers, but Farrah is "influential" with fewer than 500 followers but presumably many retweets).  Tweets about links can be "interesting", and that appears to be based on a tweet having different text from the HTML title of the target link.

Let's look at some examples of how Topsy is archiving at least some tweets. This September 28, 2012 story on cited our TPDL 2012 paper, but also started with a nice quote for motivation and context:
On January 28 2011, three days into the fierce protests that would eventually oust the Egyptian president Hosni Mubarak, a Twitter user called Farrah posted a link to a picture that supposedly showed an armed man as he ran on a "rooftop during clashes between police and protesters in Suez". I say supposedly, because both the tweet and the picture it linked to no longer exist.  Instead they have been replaced with error messages that claim the message – and its contents – "doesn’t exist".
It's true: although the user "Farrah" still exists, she has deleted many of her tweets from the Egyptian Revolution.  For example, the tweet and the picture linked to in the tweet are 404:

But if we prepend the twitpic URI with "" to get:

we see the original tweet, and a small but not full-size version of the image:

And it is not just that tweet & image, there are many others as well (twitter URI, twitpic URI):

Topsy will also archive the tweets marked for deletion from the LC archive.  For example, tweets like this from the author of no longer exist and presumably were deleted before inclusion in the LC archive:

But we can see these tweets continue to exist in Topsy:

Note that the above link is a relative offset, so the actual tweets might scroll off that page.  This reflects a limitation of the service at least with respect to being an actual archive: it offers only a limited window (at least for the free service) of 100 pages of 10 tweets each.  For active accounts this 1000 tweet window will scroll by quickly.  For example, the right-wing politics site was giddy when a White House staffer mistook/misspelled "congenital" as "congenial" in this now deleted June 29, 2012 tweet:

But at the time of this writing, 100 pages back only takes you to January 29, 2013 so we can't see if Topsy has archived this tweet.

In my recent presentation at the 2013 IIPC meeting, I mentioned the zombie movie trope of not using the word "zombie" to describe zombies (i.e., no one in a zombie movie has ever heard of zombies).  I drew the parallel of not "using the a-word" -- perhaps the best, commercially viable archives don't use the word "archive".  I don't believe Topsy markets its services as an "archive", but that is what it is providing (modulo the 100 page limitation as well as not supporting archival protocols like Memento).  On the other hand, the word "archive" denotes a certain level of permanency, and who knows if Topsy will survive in the marketplace?  This list from has a number of social media companies, many of which are now defunct.  If Topsy goes under, most likely its extensive archives will disappear as well.  True, most of the material won't be missed, but historically important material, such as Farrah's live tweeting of the Egyptian Revolution, will disappear with it.  And since it is not clear how to monetize archives, and with actual archives such as WebCite running a donation campaign, we should be reminded that what we perceive as "archives" are really just web sites.  So who will archive the archives?


Edit: See also

* = I don't have any details about how Topsy is designed, their business relationship with Twitter, or anything of the like.  Nor have I paid for a "pro" account or anything like that.  All observations are from my position of being outside and looking in.