Monday, September 9, 2013

2013-09-09: MS Thesis: HTTP Mailbox - Asynchronous RESTful Communication

It is my pleasure to report the successful completion of my Master's degree thesis entitled "HTTP Mailbox - Asynchronous RESTful Communication". I defended my thesis on July 11th, and my written thesis was accepted on August 23rd, 2013. In this blog post I will briefly describe the problem that the thesis targets, followed by the proposed and implemented solution. I will then walk through an example that illustrates the usage of the HTTP Mailbox, and finally provide various links and resources to explore the HTTP Mailbox further.

Traditionally, general web services used only the GET and POST methods of HTTP, while several other HTTP methods like PUT, PATCH, and DELETE were rarely utilized. Additionally, the Web was mainly navigated by humans using web browsers, clicking on hyperlinks or submitting HTML forms. Clicking on a link always issues a GET request, while HTML forms only allow the GET and POST methods. Recently, several web frameworks/libraries have started supporting RESTful web services through APIs. To support HTTP methods other than GET and POST in browsers, these frameworks use hidden HTML form fields as a workaround to convey the desired HTTP method to the server application. In such cases, the web server itself is unaware of the intended HTTP method because it receives the request as a POST; middleware between the web server and the application may then override the HTTP method based on the special hidden form field value. Server unavailability is another factor that affects communication: because of the stateless and synchronous nature of HTTP, a client must wait for the server to be available to perform the task and respond to the request. Browser-based communication also suffers from cross-origin restrictions imposed for security reasons.
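As a concrete illustration of this workaround, the message below sketches what a browser might actually put on the wire when a framework emulates DELETE through an HTML form. The "_method" field name follows the convention used by frameworks such as Rails; the URL and body here are hypothetical:

POST /tasks/42 HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 14

_method=DELETE

The request line still says POST; only middleware that inspects the "_method" field knows that the application should treat the request as a DELETE.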

We describe HTTP Mailbox, a mechanism to enable RESTful HTTP communication in an asynchronous mode with a full range of HTTP methods otherwise unavailable to standard clients and servers. HTTP Mailbox also allows for multicast semantics via HTTP. We evaluate a reference implementation using ApacheBench (a server stress testing tool) demonstrating high throughput (on 1,000 concurrent requests) and a systemic error rate of 0.01%. Finally, we demonstrate our HTTP Mailbox implementation in a human-assisted Web preservation application called "Preserve Me!" and a visualization application called "Preserve Me! Viz".
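For reference, an ApacheBench run of this kind is driven by a command along the following lines. The concurrency level matches the 1,000 concurrent requests mentioned above, but the total request count and the payload file (a file containing a message/http payload) are illustrative; the exact parameters used in the thesis may differ:

$ ab -n 10000 -c 1000 -p message.txt -T "message/http" \
> -H "Sender: hm-deployer" \
> http://httpmailbox.herokuapp.com/hm/http://example.com/all

The -n flag sets the total number of requests, -c sets the concurrency level, and -p/-T supply the POST body and its Content-Type.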

The HTTP Mailbox is inspired by the pre-Web distributed computing model Linda and the modern Web-scale distributed computing architecture REST. It tunnels HTTP traffic over HTTP using the message/http (or application/http) MIME type and stores the HTTP messages (requests/responses) along with some extra metadata for later retrieval. The HTTP Mailbox provides a RESTful API to send and retrieve asynchronous HTTP messages. For a quick walk-through of the thesis, please refer to the oral presentation slides (HTML) or access them on SlideShare. A complete copy of the thesis (PDF) is also publicly available at:
Sawood Alam, HTTP Mailbox - Asynchronous RESTful Communication, MS Thesis, Computer Science Department, Old Dominion University, August 2013.


Our preliminary implementation code can be found on GitHub. We have also deployed an instance of our implementation on Heroku for public use. This instance internally uses the Fluidinfo service for message storage. Let us have a look at the deployed service to illustrate its usage.

Let us assume that we want to check the HTTP Mailbox to see if there are any messages for http://example.com/all. Our HTTP Mailbox API endpoint is located at http://httpmailbox.herokuapp.com/hm/. Hence, we make a GET request as illustrated below.

$ curl -i http://httpmailbox.herokuapp.com/hm/http://example.com/all
HTTP/1.1 404 Not Found
Content-Type: message/http
Date: Mon, 09 Sep 2013 16:59:13 GMT
Server: HTTP Mailbox
Content-Length: 0
Connection: keep-alive

This indicates that there are no messages for the given URI. Now let us POST something to that URI. We have an example file named "welcome.txt" containing a valid HTTP message that we want to send to http://example.com/all.

$ cat welcome.txt
POST /all HTTP/1.1
Host: example.com
Content-Type: text/plain
Content-Length: 32

Welcome to the HTTP Mailbox! :-)

Now let us POST this message to the given URI.

$ curl -i -X POST --data-binary @welcome.txt \
> -H "Sender: hm-deployer" \
> -H "Content-Type: message/http" \
> http://httpmailbox.herokuapp.com/hm/http://example.com/all
HTTP/1.1 201 Created
Content-Type: message/http
Date: Mon, 09 Sep 2013 17:13:02 GMT
Location: http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6
Server: HTTP Mailbox
Content-Length: 0
Connection: keep-alive

Now that we have POSTed the message, we can retrieve it anytime later.

$ curl -i http://httpmailbox.herokuapp.com/hm/http://example.com/all
HTTP/1.1 200 OK
Content-Type: message/http
Date: Mon, 09 Sep 2013 17:15:33 GMT
Link: <http://httpmailbox.herokuapp.com/hm/http://example.com/all>; rel="current",
 <http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6>; rel="self",
 <http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6>; rel="first",
 <http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6>; rel="last"
Memento-Datetime: Mon, 09 Sep 2013 17:13:01 GMT
Server: HTTP Mailbox
Via: sent by 128.82.4.75 on behalf of hm-deployer, delivered by http://httpmailbox.herokuapp.com/hm/
Content-Length: 114
Connection: keep-alive

POST /all HTTP/1.1
Host: example.com
Content-Type: text/plain
Content-Length: 32

Welcome to the HTTP Mailbox! :-)

So far, there is only one message for the given URI. If more messages are posted to the same URI, the above retrieval request will only return the last message of the chain. From there, the "Link" header can be used to navigate through the message chain.
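Individual messages in the chain can also be dereferenced directly by their id URIs, such as those in the rel="first" and rel="last" links above (identical here because the chain holds a single message). For example, the following request should return the same message shown above (response omitted):

$ curl -i http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6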

We have been using the HTTP Mailbox service in various applications, including "Preserve Me!" and "Preserve Me! Viz". The following screenshot illustrates its usage in "Preserve Me!".


We would like to thank GitHub for hosting our code, Heroku for running our HTTP Mailbox instance on their cloud infrastructure, and Fluidinfo for storing messages in their "tag and value" style RESTful storage system.

I am grateful to my advisor Dr. Michael L. Nelson, committee members Dr. Michele C. Weigle and Dr. Ravi Mukkamala, colleagues, and everyone else who helped me in the process of earning my Master's degree. Now, I am continuing my research under the guidance of Dr. Michael L. Nelson at Old Dominion University.

--
Sawood Alam

Sunday, September 8, 2013

2013-09-06: Wolfram Data Summit 2013 Trip Report

I was fortunate enough to be invited to present at the 2013 Wolfram Data Summit in Washington DC, September 5-6, 2013.  My talk was about the future of web archiving, but the focus of the data summit was "big data".  As such, there was a variety of disciplines represented at the summit since the unifying factor was the scale of the data.  Logistics dictated that I missed several of the presentations, but many of the ones I did attend were very engaging.  The slides will be posted at the Wolfram site later, but I'll provide some short summaries below (2013-11-26 edit: the presentations are now available).

First was Greg Newby presenting about Project Gutenberg, the long-running collection of free ebooks.  His focus was on PG as a portable collection, which is subtly different from universal access through different interfaces (even if the interface is just Google).  The focus was more on PG as a collection to be explored and on personalized services to be built on it.  During the question and answer period someone asked "what's next for Project Gutenberg?", and during lunch the next day Greg, others, and I talked about PG and Open Annotation, and maybe uploading some content to Rap Genius (I got the idea from Rob Sanderson).

Andrew Ng gave a Skype presentation (which, unlike most video presentations, worked rather well) about Coursera.  I'm rather skeptical about most universities' stampede toward MOOCs, but I should probably start looking for quality segments in Coursera to augment my own classes.

Another really engaging discussion was Paul Lamere of The Echo Nest.  With lots of illustrative examples using pop music, Paul gave one of the most well-received presentations of the summit.  We learned that band names are not getting longer (I was surprised, I thought they were, but older conventions like "Herb Alpert and the Tijuana Brass" make for long names), that metal fans are more "passionate" (defined as replays/favorites) than dubstep fans (that one was easy), and that we can easily tell human drummers from machines by analyzing variances in the signal (his example was the variations in "So Lonely").  Paul's blog, Music Machinery, is worth checking out.

Eric Newburger of the US Census Bureau gave an excellent presentation about how Census data is the original "big data".  Tufte fans will enjoy checking out his presentation (prior presentations are available now).  He made a good pitch for using Census data as ground truth for a variety of business purposes, but you really should check out some of the early visualizations.

Ryan Cordell and David Smith of Northeastern gave a great presentation about "infectious texts", a project to mine early US newspapers for early "viral" memes.  Apparently early newspapers were equal parts news, fiction, and apocryphal stories halfway between truth and fiction, and editors would fill their local papers with large-scale copying from other newspapers, with and without attribution.  The project analyzes the types of stories chosen for 19th century retweeting, the networks of reuse (which don't always match geography and population networks), their temporal patterns, etc.  During the Q&A period and later during lunch we speculated about identifying timeless stories (e.g., the soldier returning from war), reintroducing them to Facebook & Twitter, and seeing if they reignite.  The project uses LC data from the Chronicling America project, and the OCR data is especially noisy and requires a host of tricks to align and find the reused portions.

Roger Macdonald of the Internet Archive discussed the Television Archive, which features 2M+ hours of TV news.  I'm guilty of thinking the Internet Archive is just web pages (of which they have some 338B), but they have a great deal more: 30k software titles, 600k books, 900k films/movies, 1M audio recordings (many concerts), and 2M ebooks.  The TV news archive features a very attractive and useful interface for browsing, searching, and sharing its content.

Leslie Johnston from the Library of Congress gave an overview of LC's collections and services.  Most of these I was already familiar with, but I'll mention two sites that I was not aware of.  First, the venerable THOMAS will be replaced with the new congress.gov (see the beta version now), which will soon feature APIs for accessing the data behind the site.  See these reviews: O'Reilly, TechPresident.  I was also unaware of id.loc.gov, a site that gathers the various naming, standards, and vocabulary functions into one place.  I knew LC performed this function, but I didn't know of this particular site.

Eric Rabkin gave a fascinating talk about the analysis of titles of works of science fiction and what that revealed about the society that they reflect.  Quoting from his "Genre Evolution Project" page:
We study literature as a living thing, able to adapt to society’s desires and able to influence those desires. Currently, we are tracking the evolution of pulp science fiction short stories published between 1926 and 1999. Just as a biologist might ask the question, “How does a preference for mating with red-eyed males effect eye color distribution in seven generations of fruit flies?” the GEP might ask, “How does the increasing representation of women as authors of science fiction affect the treatment of medicine in the 1960s and beyond?”
In addition to the slides (when they're available), you might be interested in his SF course on Coursera.

I gave the last presentation of the day, talking about trends in web archiving.  I gave a high-level overview of some of our recent JCDL and TPDL papers, as well as mentioning long-running projects like Memento and how they integrate the various public web archives, most of which most people have never heard of. 




Since mine was the last presentation of the summit, we had an extended question and answer period with a handful of people who were not in a hurry to leave and jump into DC traffic.  I ended up meeting my friend Terry for dinner and then headed back to Norfolk at about 7:45 that evening.

Overall this was a really interesting summit and I enjoyed the multidisciplinary nature of the presentations.  I regret that I ended up missing as many as I did, but that's how things worked out.  I would definitely recommend the 2014 summit.  While waiting for the 2013 presentations to be posted, you might want to check out the presentations from 2012, 2011, and 2010.

--Michael