Thursday, December 19, 2013

2013-12-19: 404 - Your interview has been depublished

Early November 2013 I gave an invited presentation at the EcoCom conference (picture left) and at the Spreeforum, an informal gathering of researchers to facilitate knowledge exchange and foster collaborations. EcoCom was organized by Prof. Dr. Michael Herzog and his SPiRIT team, and the Spreeforum was hosted by Prof. Dr. Jürgen Sieck, who leads the INKA research group. Both events were supported by the Alcatel-Lucent Stiftung for Communications Research. In my talks I gave a high-level overview of the state of the art in web archiving, outlined the benefits of the Memento protocol, pointed at issues and challenges web archives face today, and gave a demonstration of the Memento for Chrome extension.

Following the talk at the Spreeforum I was asked to give an interview for the German radio station Inforadio (you may think of it as Germany's NPR). The piece was aired on Monday, November 18th at 7:30am CET. As I had already left Germany I was not able to listen to it live, but I was happy to find the corresponding article online, which contained the transcript of the aired report along with an embedded audio file. I immediately bookmarked the page.

A couple of weeks later I revisited the article at its original URI only to find it was no longer available (screenshot left). Now, we all know that the web is dynamic and hence links break, and we have even seen odd dynamics at other media companies before, but in this case, as I was about to find out, it was higher powers that caused the detrimental effect. Inforadio is a public radio station and therefore, like many others in Germany and throughout Europe, to a large extent financed by the public (as of 2013 the broadcast receiving license is 17.98 Euros (almost USD 25) per month per household). As such they are subject to the "Rundfunkstaatsvertrag", which is a contract between the German states to regulate broadcasting rights. The 12th amendment to this contract from 2009 mandates that most online content must be removed after 7 days of publication. Huh? Yeah, I know, it sounds like a very bad joke but it is not. It even led to the coining of the term "depublish" - a paradox in itself. I had considered public radio stations to be "memory organizations", in league with libraries, museums, etc. How wrong was I and how ironic is this, given my talk's topic!? For what it's worth though, the content does not have to be deleted from the repository but it has to be taken offline.

I can only speculate about the reasons for this mandate, but plausible opinions circulate indicating that private broadcasters and news publishers complained about unfair competition. In this sense, the claim was made that "eternal" availability of broadcast content on the web is unfair competition, as the private sector is not given the appropriate funds to match that competitive advantage. Another point that supposedly was made is that this online service goes beyond the mandate of public radio stations and hence would constitute a misguided use of public money. To me personally, none of this makes any sense. Broadcasters of all sorts have realized that content (text, audio, and video) is increasingly consumed online and hence are adjusting their offerings. How this can be seen as unfair competition is unclear to me.

But back to my interview. Clearly, one can argue (or not) whether the document is worth preserving, but my point here is a different one:
Not only did I bookmark the page when I saw it, I also immediately tried to push it into as many web archives as I could. I tried the Internet Archive's new "save page now" service but, to add insult to injury, Inforadio also has a robots.txt file in place that prohibits the IA from crawling the page. To the best of my knowledge this is not part of the 12th amendment to the "Rundfunkstaatsvertrag" so the broadcaster could actually take action to preserve their online content. Other web sites of public radio and TV stations such as Deutschlandfunk or ZDF do not prohibit archives from crawling their pages.
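
For illustration, keeping the Internet Archive's crawler out takes only a couple of lines in robots.txt. The rule below is a hypothetical sketch (I am not reproducing Inforadio's actual file); ia_archiver is the user-agent token the Internet Archive's crawler honors:

User-agent: ia_archiver
Disallow: /

Removing such a rule would be enough to let the Internet Archive capture, and later replay, the pages.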



Fortunately, the archiving service Archive.is was able to grab the page (screenshot left), but the audio file is lost.



Just one more thing (Peter Falk style):
Note that the original URI of the page:

http://www.inforadio.de/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

when requested in a web browser redirects (302-style) to:

http://www.inforadio.de/error/404.html?/rbb/inf/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

The good news here: it is not a soft 404, so the error is somewhat robot friendly. The bad news is that the original URI is thrown away. As the original URI is the only key for a search in web archives, we cannot retrieve any archived copies (such as the one I created in Archive.is) without it. Unfortunately, this is not only true for manual searches; it also undermines automatic retrieval of archived copies by clients such as the browser extension Memento for Chrome. As stressed in our recent talk at CNI, this is very bad practice and unnecessarily makes life harder for those interested in obtaining archived copies of web pages at large, not only my radio interview.
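
To make the dependency on the original URI concrete, here is a rough sketch of the Memento datetime negotiation a client performs; the TimeGate host is a placeholder and the headers are abbreviated:

GET /timegate/http://www.inforadio.de/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html HTTP/1.1
Host: timegate.example.org
Accept-Datetime: Mon, 18 Nov 2013 07:30:00 GMT

HTTP/1.1 302 Found
Vary: accept-datetime
Location: <URI of the best matching memento, e.g. the Archive.is copy>

If all a client has is the error URI the browser ends up on, the lookup is keyed on the wrong URI and comes back empty.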

--
Martin

Wednesday, December 18, 2013

2013-12-18: Avoiding Spoilers with the Memento Mediawiki Extension

From Modern Family to The Girl with the Dragon Tattoo, fans have created a flood of fan-based wikis based on their favorite television, book, and movie series. This dedication to fiction has allowed fans to settle disputes and encourage discussion using these resources.
These resources, coupled with the rise in experiencing fiction long after it is initially released, have given rise to another cultural phenomenon: spoilers. Using a fan-based resource is wonderful for those who are current with their reading/watching, but is fraught with disaster for those who want to experience the great reveals and have not caught up yet.
Memento can help here.
Above is a video showing how the Memento Chrome Extension from Los Alamos National Laboratory (LANL) can be used to avoid spoilers while browsing for information on Downton Abbey. This wiki is of particular interest because the TV show is released in the United Kingdom long before it is released in other countries. The wiki has a nice sign warning all visitors about impending spoilers should they read the pages within, but the warning is redundant, seeing as fans who have not caught up already know that spoilers are implied.
A screenshot of the page with this notice is shown below.
We can use Memento to view pre-spoiler versions.
To avoid spoilers for Downton Abbey Series 4, we choose a date prior to its release: August 30, 2012. Then we use LANL's Memento Chrome Extension to browse to that date. The HTTP conversation for this exchange is captured using Google Chrome's Live HTTP Headers Extension and detailed in the steps below.
1. The Chrome Memento Extension sends a HEAD request to the site using Memento's Accept-Datetime header*.
2. Because there are no Memento headers in the response, it connects to LANL's Memento aggregator using a GET request with the same Accept-Datetime header and gets back a 302 redirection response.
3. Then it follows the URI from the Location response header to a TimeGate specifically set up for Wikia, making another GET request using the Accept-Datetime request header on that URI. The TimeGate uses the date given by Accept-Datetime to determine which revision of a page to retrieve. The URI for this revision is sent back in the Location response header as part of the 302 redirection response.
4. From here it performs a final GET request on the URI specified in the Location response header, which is the revision of the article closest to the date requested. A screenshot of that revision is shown below, without the spoiler warning.
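A condensed sketch of that conversation follows; the hostnames, paths, and responses are illustrative placeholders rather than a verbatim capture:

HEAD http://downtonabbey.wikia.com/wiki/Series_4 HTTP/1.1
Accept-Datetime: Thu, 30 Aug 2012 00:00:00 GMT
  --> 200 OK, no Memento response headers

GET http://aggregator.example.org/timegate/http://downtonabbey.wikia.com/wiki/Series_4 HTTP/1.1
Accept-Datetime: Thu, 30 Aug 2012 00:00:00 GMT
  --> 302 Found, Location: the Wikia-specific TimeGate

GET <Wikia TimeGate URI> HTTP/1.1
Accept-Datetime: Thu, 30 Aug 2012 00:00:00 GMT
  --> 302 Found, Location: the revision closest to the requested date

GET <revision URI> HTTP/1.1
  --> 200 OK, the pre-spoiler revision of the article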
Even though this method works, it is not optimal.
The external Memento aggregator must know about the site and provide a site-specific TimeGate.  In this case, the aggregator is merely looking for the presence of "wikia.com" in the URI and redirecting to the appropriate TimeGate in step 3. Behind the scenes, the Mediawiki API is used to acquire the list of past revisions, and the TimeGate selects the best one in step 4. This requires LANL, or another Memento participant like the UK National Archives, to provide a TimeGate for every wiki site on the Internet, which is not feasible.
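
Behind such a TimeGate, the revision lookup can be done with the standard Mediawiki API; a request along the following lines (the wiki hostname is assumed for illustration) returns the newest revision at or before the requested datetime, because rvstart combined with rvdir=older enumerates revisions backwards in time from that timestamp:

http://downtonabbey.wikia.com/api.php?action=query&prop=revisions&titles=Series_4&rvlimit=1&rvstart=2012-08-30T00:00:00Z&rvdir=older&format=json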
To see where this is relevant, let's look at the fan site A Wiki of Ice and Fire, detailing information on the series A Song of Ice and Fire (aka Game of Thrones). LANL has no Memento TimeGate specifically for this real fan wiki, unlike what we saw with the Downton Abbey site.
Here's a screenshot of the starting page for this demonstration. Let's assume we want to avoid spoilers from the book A Dance With Dragons, which was released in July 2011, so we choose the date of June 30, 2011.
1. The Chrome Memento Extension connects with an Accept-Datetime request header, hoping for a response with Memento headers.
2. Because there were no Memento headers in the response, it turns to the Memento Aggregator at LANL, which serves as the TimeGate, using the datetime given by the Accept-Datetime request header to find the closest version of the page to the requested date. The TimeGate then provides a Location response header containing the archived version of the page at the Internet Archive.
3. Using the URI from that Location response header, the page is then retrieved directly from the Internet Archive.

But this page has a date of 27 Apr 2011 and is missing information we want, such as who played this character in the TV series, which was added in the 7 June 2011 revision of the page. This is because the Internet Archive only contains two revisions around our requested datetime: 27 Apr 2011 and 1 Aug 2011.  Even though the fan wiki contains the 7 June 2011 revision, the Internet Archive does not.

Fortunately, there is the native Memento Mediawiki Extension, supported by the Andrew W. Mellon Foundation, which addresses these issues. It has been developed jointly by Old Dominion University and LANL. Mediawiki was chosen because it is the most widely used wiki software, used in sites such as Wikipedia and Wikia.

This native extension allows direct access to all revisions of a given page, avoiding spoilers. It can also return the data directly, requiring no Memento aggregators or other additional external infrastructure.
We set up a demonstration wiki using data from the same Game of Thrones fan wiki above. The video above shows this extension in action. Because our demonstration wiki has the native extension installed, it allows for access to all revisions of each article.
We will try the same scenario using this Memento-enabled wiki.
Here is a screenshot of the starting page for this demonstration.
In this case, because the Memento Mediawiki Extension has full Memento support, the HTTP messages sent are different. We again use the date June 30, 2011 to show that we can acquire information about a given article without revealing any spoilers from the book A Dance With Dragons, which was released in July 2011.
1. The Memento Chrome Extension sends an Accept-Datetime request header, but this time Mediawiki itself is serving as the TimeGate, deciding on the page closest to, but not over, the date requested. Mediawiki then issues its own 302 redirection response.
2. That response gives a Location response header pointing to the correct revision of the page, which was published on June 7, 2011, prior to the release of A Dance With Dragons. From here the Memento Chrome Extension can issue a GET request on that URI to retrieve the correct representation of the page.
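With native support the whole negotiation stays on the wiki itself; a rough sketch of the exchange (the wiki hostname, article title, and oldid value are placeholders) looks like this:

GET /demo/index.php/Example_Character HTTP/1.1
Host: wiki.example.org
Accept-Datetime: Thu, 30 Jun 2011 00:00:00 GMT

HTTP/1.1 302 Found
Vary: accept-datetime
Location: http://wiki.example.org/demo/index.php?title=Example_Character&oldid=12345

A follow-up GET on the Location URI returns the 7 June 2011 revision directly from the wiki's own database, with no aggregator or external archive involved.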
As this demonstrates, running the Memento Mediawiki Extension on a fan wiki ensures that site visitors can not only browse the site spoiler free, but also get the revision closest to, but not over, their requested date. This way they avoid spoilers and don't miss any information.

To recap, the native extension addresses the following issues:
  1. The Memento infrastructure cannot know about all possible wikis and provide TimeGates for each one, so the chances of a given wiki having one are low.
  2. The Internet Archive does not hold every revision of each fan wiki page, so visitors relying on it may miss out on information.
  3. External wiki TimeGates depend on each wiki's API, and APIs change frequently, which can break the whole process; the native extension relies only on Memento itself, which is specified in a more stable RFC.
If you are running a fan wiki and want to help your visitors avoid spoilers, the Memento Mediawiki Extension is what you need. Please contact us and we'll help you customize it to your needs, if necessary.

--Shawn

* = Memento for Chrome version 0.1.11 actually performs two HEAD requests on the resource, but this will be fixed in the next release.

Friday, December 13, 2013

2013-12-13: Hiberlink Presentation at CNI Fall 2013

Herbert and Martin attended the recent Fall 2013 CNI meeting in Washington DC, where they gave an update about the Hiberlink Project (joint with the University of Edinburgh), which is about preserving the referential integrity of the scholarly record. In other words, we link to the general web in our technical publications (and not just to other scholarly material) and of course the links rot over time.  But the scholarly publication environment does give us several hooks to help us access web archives to uncover the correct material.

As always, there are many slides, but they are worth the time to study.  Of particular importance are slides 8-18, which help differentiate Hiberlink from other projects, and slides 66-99, which walk through a demonstration of how the "Missing Link" concepts (along with the Memento for Chrome extension) can be used to address the problem of link rot.  In particular, absent a specific versiondate attribute on a link, such as:

<a versiondate="some-date-value" href="...">

A temporal context can be inferred from the "datePublished" META value defined by schema.org:

<META itemprop="datePublished" content="some-ISO-8601-date-value">
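
Put together, a page that carries its own temporal context might look like the following sketch (the attribute usage follows the examples above; the URIs and dates are placeholders). A Memento client can use the link-level versiondate when present and fall back to the page-level datePublished value as its Accept-Datetime:

<html>
  <head>
    <!-- page-level publication date, usable as a fallback temporal context -->
    <META itemprop="datePublished" content="2013-12-13">
  </head>
  <body>
    <!-- link-level version date takes precedence when present -->
    <a versiondate="2013-12-01" href="http://example.org/cited-resource">a cited web resource</a>
  </body>
</html>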



Again, the slides are well worth your time.

--Michael


Thursday, November 28, 2013

2013-11-28: Replaying the SOPA Protest

In an attempt to limit online piracy and theft of intellectual property, the U.S. Government proposed the Stop Online Piracy Act (SOPA). This act was widely unpopular. On January 18th, 2012, many prominent websites (e.g., XKCD) organized a world-wide blackout of their websites in protest of SOPA.

While the attempted passing of SOPA may end up being a mere footnote in history, the overwhelming protest in response is significant. This event is an important observance and should be archived in our Web archives. However, some methods of implementing the protest (such as JavaScript and Ajax) made the resulting representations unarchivable by archival services at the time. As a case study, we will examine the Washington, D.C. Craigslist site and the English Wikipedia page. All screenshots of the live protests were taken during the protest on January 18th, 2012. The screenshots of the mementos were taken on November 27th, 2013.

Screenshot of the live Craigslist SOPA Protest from January 18th, 2012.

Craigslist put up a blackout page that would only provide access to the site through a link that appears after a timeout. In order to preserve the SOPA splash page on the Craigslist site, we submitted the URI-R for the Washington D.C. Craigslist page to WebCite for archiving, producing a memento of the SOPA screen:

http://webcitation.org/query?id=1326900022520273

At the bottom of the SOPA splash page, JavaScript counts down from 10 to 1 and then provides a link to enter the site. The countdown operates properly in the memento, providing an accurate capture of the resource as it existed on January 18th, 2012.

Screenshot of the Craigslist protest memento in WebCite.


The countdown on the page is created with JavaScript that is included in the HTML:
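The original snippet is not reproduced here; a minimal sketch of how such a countdown might be wired up (the element ids are assumptions; the startCountDown function name comes from the description below) is:

<script type="text/javascript">
// decrement the displayed counter once per second, then reveal the link into the site
function startCountDown(seconds) {
    var counter = document.getElementById("countdown");
    if (seconds > 0) {
        counter.innerHTML = seconds;
        setTimeout(function() { startCountDown(seconds - 1); }, 1000);
    } else {
        // countdown finished: show the link that leads into Craigslist proper
        document.getElementById("enter-link").style.display = "block";
    }
}
// the behavior survives archiving because this script ships inside the captured HTML
window.onload = function() { startCountDown(10); };
</script>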


The countdown behavior is archived along with the page content because the JavaScript creating the countdown is captured with the content and is available when the onload event fires on the client and the subsequent startCountDown code is executed. However, the link that appears at the bottom of the screen dereferences to the live version of Craigslist. Notice that the live Craigslist page has no reference to the SOPA protest. Since WebCite is a page-at-a-time archival service, it only archives the initial representation and all embedded resources, meaning that the linked Craigslist page is missed during archiving.

Screenshot of the Craigslist homepage linked from the
protest splash page. This is also the live version of the
homepage.

The Heritrix crawler archived the Craigslist page on January 18th, 2012. The Internet Archive contains a memento for the protest:

Screenshot of the Craigslist protest splash page in the Wayback Machine.

as does Archive-It:


The Internet Archive memento, captured with the Heritrix crawler, has the same splash page and countdown as the WebCite memento. The link on the Internet Archive memento leads to a memento of the Craigslist page rather than the live version, albeit with archival timestamps one day, 13 hours, 30 minutes, and 44 seconds apart (2012-01-19 18:34:32 vs 2012-01-18 05:03:48):

Screenshot of the Craigslist homepage memento, linked from the
protest splash screen.

The Internet Archive converts embedded links to be relative to the archive rather than targeting the live web. Since Heritrix also crawled the linked page, the embedded link dereferences to the proper memento, with a note embedded in the HTML protesting SOPA.
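
For example, the Wayback Machine rewrites an anchor so that it resolves within the archive; roughly (the link target and timestamp below are illustrative, not copied from the memento):

<a href="http://washingtondc.craigslist.org/about/">...</a>

becomes

<a href="/web/20120118050348/http://washingtondc.craigslist.org/about/">...</a>

so that following the link keeps the user at the archival datetime instead of jumping to the live site.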

The Craigslist protest was readily archived by WebCite, Archive-It, and the Internet Archive. Policies within each archival institution impacted how the Craigslist homepage (past the protest splash screen) is referenced and accessed by archive users. This differs from the Wikipedia protest, which was not readily archived.

Screenshot of the live Wikipedia SOPA Protest.

Wikipedia displayed a splash screen protesting SOPA that blocked access to all content on the site. A version of it is still available live on Wikipedia as of November 27th, 2013:

The live version of the Wikipedia SOPA Protest.

On January 18th, 2012, we submitted the page to WebCite, but the resulting memento did not capture the splash page. Instead, the memento has only the representation hidden behind the splash page.

A screenshot of the WebCite memento of the Wikipedia SOPA Protest.

The mementos captured by Heritrix and presented through the Internet Archive's Wayback Machine and Archive-It are also missing the SOPA splash page.

A screenshot of the Internet Archive memento of the
Wikipedia SOPA Protest.

A screenshot of the Archive-It memento of the
Wikipedia SOPA Protest.

To investigate the cause of the missing splash page further, we requested that WebCite archive the current version of the Wikipedia blackout page on November 27th, 2013. The new memento does not capture the splash page, either:

A screenshot of the WebCite memento of the current
Wikipedia blackout page.

Heritrix also created a memento of the current blackout page on August 24th, 2013. This memento suffers from the same problem as the aforementioned mementos and does not capture the splash page:

A screenshot of the Internet Archive memento of the current
Wikipedia blackout page.

Looking through the client-side DOM of the Wikipedia mementos referenced above, we find no mention of the splash page protesting SOPA. This means the splash page was loaded by either Cascading Style Sheets (CSS) or JavaScript. Since clicking the browser's "Stop" button prevents the splash page from appearing, we hypothesize (and show) that JavaScript is responsible for loading it. JavaScript loads the image needed for the splash page as a result of a client-side event; since the archival tools cannot execute that event, they have no way of knowing to archive the image.

When we load the blackout resource, we see that there are several files loaded from Wikimedia. Some of the JavaScript files return a 403 Forbidden response since they are blocked by a Wikimedia robots.txt file:

Google Chrome's developer console showing the resources requested
by http://web.archive.org/web/20130824022954/http://en.wikipedia.org/?banner=blackout
and their associated response codes.

Specifically, the robots.txt file preventing these resources from being archived is:

http://bits.wikimedia.org/robots.txt

The robots.txt file is archived as well:

http://web.archive.org/web/*/http://bits.wikimedia.org/robots.txt

We will look at one specific HTTP request for a JavaScript file:



This JavaScript file contains code defining a function that adds CSS to the page, overlaying an image as a splash page and placing the associated text on top of the image (I have added the line breaks for readability):
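The original code is not reproduced here; a simplified sketch of what such an insertBanner-style function does (the insertBanner name and the image file come from the discussion below, everything else is an assumption) is:

// inject CSS and markup that overlay the splash image and its text on the page
function insertBanner(banner) {
    // the URI of the splash image is constructed on the client
    var imgUrl = "//upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg";
    var overlay = document.createElement("div");
    overlay.style.cssText = "position:fixed;top:0;left:0;width:100%;height:100%;" +
                            "background:#000 url(" + imgUrl + ") no-repeat center center;z-index:1000;";
    overlay.innerHTML = banner.text;  // overlay the associated text on the image
    document.body.appendChild(overlay);
}

Because the image URI only ever exists inside this client-side code path, a crawler that does not execute the script never sees it.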



Without execution of the insertBanner function, the archival tools will not know to archive the image of the splash page (WP_SOPA_Splash_Full.jpg) or the overlaid text. In this example, Wikimedia is constructing the URI of the image and using Ajax to request the resource:


The blackout image is available in the Internet Archive, but the mementos in the Wayback Machine do not attempt to load it:

http://web.archive.org/web/20120118165255/http://upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg

Without execution of the client-side JavaScript and subsequent capture of the splash screen, the SOPA blackout protest is not seen by the archival service.

We have presented two different uses of JavaScript by two different web sites and their impact on the archivability of their SOPA protests. The Craigslist mementos provide representations of the SOPA protest, although the archives may be missing associated content due to policy differences and intended use. The Wikipedia mementos do not provide a representation of the protest. While the constituent parts of the Wikipedia protest are not entirely lost, they are not properly reconstituted, making the representation unarchivable with the tools available on January 18th, 2012 and November 27th, 2013.

We have previously demonstrated that JavaScript in mementos can cause strange things to happen. This is another example of how technologies that normally improve a user's browsing experience can actually make content more difficult, if not impossible, to archive.


--Justin F. Brunelle





Thursday, November 21, 2013

2013-11-21: 2013 Southeast Women in Computing Conference (SEWIC)


Last weekend (Nov 14-17), I was honored to give a keynote at the Southeast Women in Computing Conference (SEWICC), located at the beautiful Lake Guntersville State Park in north Alabama.  The conference was organized by Martha Kosa and Ambareen Siraj (Tennessee Tech University), and Jennifer Whitlow (Georgia Tech).


Videos from the keynotes and pictures from the weekend will soon be posted on the conference website.  (UPDATE 1/24/14: Flickr photostream and links to keynote videos added.)


The 220+ attendees included faculty, graduate students, undergraduates, and some high school students (and even some men!).

On Friday night, Tracy Camp from the Colorado School of Mines presented the first keynote, "What I Know Now... That I Wish I Knew Then".  It was a great kickoff to the conference and provided a wealth of information on (1) the importance of mentoring, networking, and persevering, (2) tips on negotiating and time management, and (3) advice on dealing with the Impostor Syndrome.  (Watch the video)

During her talk, she pointed out that women's participation != women's interest.  She had some statistics showing that in 1970 the percentages of women in law school (5%), business school (4%), medical school (8%), and high school sports (4%) were very low.  Then she contrasted that with data from 2005: law school (48%), business school (45%), medical school (50%), and high school sports (42%).  The goal was to counter the frequent comment that "Oh, women aren't in computing and technology because they're just not interested."

She also listed qualities that might indicate that you have the impostor syndrome. From my discussions with friends and colleagues, it's very common among women in computing and technology.  (I've heard that there are a few men who suffer from this, too!)  Here's the list:
  • Do you secretly worry that others will find out that you're not as bright/capable as they think you are?
  • Do you attribute your success to being a fluke or "no big deal"?
  • Do you hate making a mistake, being less than fully prepared, or not doing things perfectly?
  • Do you believe that others are smarter and more capable than you are?
Saturday morning began with my keynote, "Telling Stories with Web Archives".


I talked a bit about web archiving in general and then described Yasmin AlNoamany's PhD work on using the archives for storytelling.  The great part for me was to be able to introduce the Internet Archive and the Wayback Machine to lots of people.  I got several comments from both students and faculty afterwards with ideas of how they would incorporate the Wayback Machine in their work and studies. (Watch the video)


After my talk, I attended a session on Education.  J.K. Sherrod and Zach Guzman from Pellissippi State Community College in Knoxville, TN presented "Using the Raspberry Pi in Education".  They had been teaching cluster computing using Mac Minis, but it was getting expensive, so they started purchasing Raspberry Pi devices (~$35) for their classes.  The results were impressive.  Since the devices run a full version of Linux, they were even able to implement a Beowulf cluster.

I followed this up by attending a panel "Being a Woman in Technology: What Does it Mean to Us?"  The panelists and audience discussed both positive connotations and challenges of being a woman in technology.  This produced some amazing stories, including one from a student who related being told by a professor that she was no good at math and was "a rotten mango".
After lunch, several students presented 5 minute lightning talks on strategies for success in school and life.  It was great to see so many students excited to share their experiences and lessons learned.

The final keynote was given on Saturday night by Valentina Salapura from IBM TJ Watson on "Cloud Computing 24/7".  After telling her story and things she learned along the way (and including a snapshot from the Wayback Machine of her former academic webpage), she described the motivation and promise of cloud computing.


Sunday was the last day, and I attended a talk by Ruthe Farmer, Director of Strategic Initiatives at NCWIT, on "Research and Opportunities for Women in Technology".  The National Center for Women & Information Technology was started in 2004 and is a non-profit that partners with corporations, academic institutions, and other agencies with the goal of increasing women's participation in technology and computing.  One of their slogans is "50:50 by 2020".  There's a wealth of information and resources available on the NCWIT website (including the NCWIT academic alliance and Aspirations in Computing program).

Ruthe described the stereotype threat that affects both women and men.  This is the phenomenon whereby awareness of negative stereotypes associated with a peer group can inhibit performance.  She described a study in which a group of white men from Stanford were given a math test.  Before the test, one set of students was reminded of the stereotype that Asian students outperform Caucasian students in math, and the other set was not reminded of this stereotype.  The stereotype-threatened test takers performed worse than the control set.

Before we left on Sunday, I had the opportunity to sit in the red chair. Sit With Me is a promotion by NCWIT to recognize the role of women in computing.  "We sit to recognize the value of women's technical contributions.  We sit to embrace women's important perspectives and increase their participation."

All in all, it was a great weekend.  I drank lots of sweet tea, heard great southern accents that reminded me of home (Louisiana), and met amazing women from around the southeast, including students and faculty from Trinity University (San Antonio, TX), Austin Peay University, Georgia Tech, James Madison University (Virginia), Tennessee Tech, Pellissippi State Community College (Knoxville, TN), Murray State University (Kentucky), NC State, Rhodes College (Memphis, TN), Univ of Georgia, Univ of Tennessee, and the Girls Preparatory School (Chattanooga, TN).
There are plans for another SEWIC Conference in 2015.
-Michele

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

Circulating around the web last week was the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below).  Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving their archived versions of the pages (see Note 2 below for a discussion of robots.txt, as well as for an update about availability in the Internet Archive).  But even though the Internet Archive allows site owners to redact pages from their archive, mementos of the pages likely exist in other archives.  Yes, the Internet Archive was the first web archive and is still by far the largest with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details).

Consider this randomly chosen 2009 speech:

http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx

Right now it produces a custom 404 page (see Note 3 below):


Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):




So it seems clear that this speech will not disappear down a memory hole.  But how do you discover these copies in these archives?  Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework.  If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.

If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):
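A TimeMap is a machine-readable list in the link format defined by the Memento protocol; a trimmed sketch of what the TimeMap for this URI looks like (the archive hostnames and datetimes below are illustrative placeholders, not the actual entries) is:

<http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="original",
<http://www.webarchive.org.uk/wayback/archive/20100101000000/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento";datetime="Fri, 01 Jan 2010 00:00:00 GMT",
<http://wayback.archive-it.org/.../http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento";datetime="...",
<http://archive.is/...>;rel="memento";datetime="..."

Each entry pairs a memento URI with its archival datetime, so a client can pick the copy closest to the date it wants.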



The nice thing about the multi-archive access of Memento is that as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one-at-a-time for a particular URI.

We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:
  • Internet Archive: crawl everything
  • Archive-It: collections defined by subscribers
  • UK Web Archive: archive all UK websites (conservatives.com is a UK web site even though it is not in the .uk domain)
  • Archive.is: archives individual pages on user request
Download and install the Chrome extension and all of these archives and more will be easily available to you.

-- Michael and Herbert

Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt.  For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013).  "We've always been at war with Eastasia."

Note 2: In the process of writing this blog, the Internet Archive is no longer blocking access to this speech (and presumably the others).  Here is the raw HTTP of the speech being blocked (the key is the "X-Archive-Wayback-Runtime-Error" line):
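The captured transcript is not reproduced here; from memory, the blocked response looked roughly like the following (the status line and error value are approximations of the Wayback Machine's robots-exclusion behavior at the time, not a verbatim capture):

HTTP/1.1 403 Forbidden
X-Archive-Wayback-Runtime-Error: RobotAccessControlException: Blocked By Robots
Content-Type: text/html;charset=utf-8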



But access was restored sometime in the space of three hours before I could generate a screenshot:



Why was it restored?  Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?).  The 08:36:36 version of robots.txt has:

...
Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/
... 

But the 18:10:19 version has:
...  
Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/
...  

These "Disallow" rules no longer match the URI of the original speech.  I guess the Internet Archive cached the disallow rule and it just now expired one week later.  See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.

The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:



Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate.  Instead of returning a 404 page at the original URI (like Apache), it 302 redirects to another URI that returns the 404:
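In HTTP terms the exchange looks roughly like this (the error URI below is a guess at the IIS-style pattern, not a captured response):

GET /News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx HTTP/1.1
Host: www.conservatives.com

HTTP/1.1 302 Found
Location: http://www.conservatives.com/Error404.aspx

GET /Error404.aspx HTTP/1.1
Host: www.conservatives.com

HTTP/1.1 404 Not Found

The 404 is real, but it is attached to the error page's URI instead of the URI of the missing speech.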



See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.

--Michael

Tuesday, November 19, 2013

2013-11-19: REST, HATEOAS, and Follow Your Nose

This post is hardly timely, but I wanted to gather together some resources that I have been using for REST (Representational State Transfer) and HATEOAS (Hypermedia as the Engine of Application State).  It seems like everyone claims to be RESTful, but mentioning HATEOAS is frequently met with silence.  Of course, these terms come from Roy Fielding's PhD dissertation, but I won't claim that it is very readable (it is not the nature of dissertations to be readable...).  Fortunately he's provided more readable blog posts about REST and HATEOAS. At the risk of aggressively over-simplifying things, REST = "URIs are nouns, not verbs" and HATEOAS = "follow your nose".

"Follow your nose" simply means that when a client dereferences a URI, the entity that is returned is responsible for providing a set of links that allows the user agent to transition to the next state.  This standard procedure in HTML: you follow links to guide you through an online transaction (e.g., ordering a book from Amazon) all the time -- it is so obvious you don't even think about it.  Unfortunately, we don't hold our APIs to the same level; non-human user agents are expected to make state transitions based on all kinds of out-of-band information encoded in the applications.  When there is a change in the states and the transitions between them, there is no direct communication with the application and it simply breaks.

I won't dwell on REST because most APIs get the noun/verb thing right, but we seem to be losing ground on HATEOAS (and technically, HATEOAS is a constraint on REST and not something "in addition to REST", but I'm not here to be the REST purity police).

There are probably many good descriptions of HATEOAS and I apologize if I've left your favorite out, but these are the two that I use in my Web Server Design course (RESTful design isn't the goal of the course, but more like a side benefit).  Yes, you could read a book about REST, but these two slide decks will get you there in minutes.

The first is from Jim Webber entitled "HATEOAS: The Confusing Bit from REST".  There is a video of Jim presenting these slides as well as a white paper about it (note: the white paper is less correct than the slides when it comes to things like specific MIME types).  He walks you through a simple but illustrative (HTTP) RESTful implementation of ordering coffee.  If the user agent knows the MIME type "application/vnd.restbucks+xml" (and the associated Link rel types), then it can follow the embedded links to transition from state to state.  And if you don't know how to do the right thing (tm) with this MIME type, you should stop what you're doing.




It seems like the Twitter API is moving away from HATEOAS.  Brian Mulloy has a nice blog post about this (from which I took the image at the top of this post).  The picture nicely summarizes that from an HTML representation of a Tweet there are all manner of state transitions available, but the equivalent JSON representation is effectively a dead end; the possible state transitions have to be in the mind of the programmer and encoded in the application.  Their API returns a MIME type of "application/json" just like 1000 other APIs, and it is up to your program to sort out the details.  Twitter's 1.1 API, with things like removing support for RSS, is designed for lock-in and not abstract ideals like HATEOAS.  Arguably all the JSON APIs, with their ad-hoc methods for including links and uniform MIME type, are a step away from HATEOAS (see the stackoverflow.com discussion).

The second presentation also addresses a pet peeve of mine: API deprecation (e.g., my infatuation with Topsy has tempered after they crashed all the links I had created -- grrr.).  The presentation "WHY HATEOAS: A simple case study on the often ignored REST constraint" from Wayne Lee walks through a proper way to define your API with an eye to new versions, feature evolution, etc.




Again, I'm sure there are many other quality REST and HATEOAS resources but I wanted to gather the couple that I had found useful into one place and not just have them buried in class notes.  Apologies for being about five years late to the party.

--Michael

Wednesday, November 13, 2013

2013-11-13: 2013 Archive-It Partner Meeting Trip Report


On November 12, I attended the 2013 Archive-It Partner Meeting in Salt Lake City, Utah, our research group's second year of attendance (see 2012 Trip Report). The meeting started off casually at 9am with breakfast and registration. Once everyone was settled, Kristine Hanna, the Director of Archiving Services at Internet Archive, introduced her team that was present at the meeting. Kristine acknowledged the fire at Internet Archive last week and the extent of the damage. "It did burn to the ground but thankfully, nobody was injured." She reminded the crowd of partners to review Archive-It's storage and preservation policy and mentioned the redundancies in place, including a soon-to-be mirror at our very own ODU. Kristine then mentioned news of a new partnership with Reed Technologies to jointly market and sell Archive-It (@archiveitorg). She reassured the audience that nothing would change beyond having more resources for them to accomplish their goals.

Kristine then briefly mentioned the upcoming release of Archive-It 5.0, which would be spoken about in depth in a later presentation. She asked everyone in the room (of probably 50 or so attendees) to introduce themselves and to state their affiliation. With the intros out of the way, the presentations began.

Kate Legg of National Center for Atmospheric Research (NCAR) presented "First steps toward digital preservation at NCAR". She started by saying that NCAR is a federally funded research and development center (FFRDC) whose mission is to "preserve, maintain and make accessible records and materials that document the history and scientific contributions of NCAR". With over 70 collections and 1500 employees, digital preservation is on the organization's radar. Their plan, while they have a small library and staff, is to accomplish this along with other competing priorities.

"Few people were thinking about the archives for collecting current information", Kate described of some of the organization not understanding that preserving now will create archives for later. "The archive is not just where old where old stuff goes, but new stuff as well." One of the big obstacles for the archiving initiatives of the organizations has been funding. Even with this limitation, however, NCAR was still able to subscribe to Archive-It through a low level subscription. With this subscription, they started to preserve their Facebook group but increasingly found huge amounts of data, including videos, that they felt was too resource heavy to archive. The next step for the initiative is to add a place on the organization's webpage where archive content will be accessible to the public.

Jaime McCurry (@jaime_ann) of Folger Shakespeare Library followed Kate with "The Short and the Long of It: Web Archiving at the Folger Shakespeare Library". Jaime is currently participating in the National Digital Stewardship Residency where her goal is to establish local routines and best practices for archiving and preserving the library's born-digital content. They currently have two collections with over 6 million documents (over 400 gigabytes of data), collecting web content relating to the works of Shakespeare (particularly in social media and from festivals). In trying to describe the large extent of the available content, Jaime said, "In trying to archive Shakespeare's presence on the web, you really have to define what you're looking for. Shakespeare is everywhere!". She noted that one of the first things she realized when she started on the project at Folger was that nobody knew the organization was performing web archiving, so she wished to establish an organization-wide web archiving policy. One of the recent potential targets of her archiving project was the NYTimes' Hamlet contest, wherein the newspaper suggested Instagram users create 15-second clips of their interpretation of a passage from the play. Because this relates to Shakespeare, it would be an appropriate target for the Folger Shakespeare Library.

EDIT: Jaime also created a trip report of the meeting on her blog.

After Jaime finished, Sharon Farnel of University of Alberta began her presentation "Metadata workflows for web archiving – thinking and experimenting with ‘metadata in bulk’". In her presentation she referenced a project called Blacklight, an open source project that provides a discovery interface for any Solr index via a customizable, template-based user interface. For her collection, from the context of metadata, she wished to think about where and why discovery of content takes place in web archiving. She utilized a mixed model wherein entries might have MARC records, Dublin Core data, or both. Sharon emphasized that metadata is an important feature of Archive-It. To better parse the data, her group created XSLT stylesheets to export the data into a more interoperable format like Excel, from which it could then be imported back into Blacklight after manipulation. She referenced some of the difficulties in working with the different technologies but said, "None of these tools were a perfect solution on their own but by combining the tools in-house, we can get good results with the metadata."

After a short break (to caffeinate), Abbie Grotke (@agrtoke) of the Library of Congress remotely presented "NDSA Web Archiving Survey Update". In her voice-over presentation from DC, she gave preliminary results of the NDSA Web Archiving Survey, stating that the initiative of the NDIIPP program had yielded about 50 respondents so far. For the most part, the biggest concern about web archiving reported by the survey participants was database preservation, followed by social media and video archiving. She stated that the survey is still open and encouraged attendees to take it (Take it here).

Trevor Alvord of Brigham Young University was next with "A Muddy Leak: Using the ARL Code of Best Practices in Fair Use with Web Archiving". His effort with the L. Tom Perry Special Collections at BYU is to build a thematic collection based on Mormonism. He illustrated that many battles over digital preservation content rights have been fought and won (e.g., Perfect 10 vs. Google and Students vs. iParadigms), so his collection should be justified based on the premises in those cases. "Web archiving is routinely done by the two wealthiest corporations (Google and Microsoft)", he said, quoting Jonathan Band, a recognized figure in the lawsuits versus Google. "In the last few months, libraries have prevailed," Trevor said. "Even with our efforts, we have not received any complaints about websites being archived by libraries."

Trevor then went on to describe the problem with his data set, alluding to the Teton Dam flooding: millions of documents are being produced about Mormonism and now he has to capture whatever he can. This is partially due to the lowering of the age allowed for missionaries and the Mormon church's encouragement for young Mormons to post online. He showed two examples of Mormon "mommy" bloggers: Peace Love Lauren, a blog with a very small following, and NieNie Dialogs, a very popular blog. He asked the audience, "How do you prioritize what content to archive, given popular content is more important but also more likely to be preserved?"

Following Trevor, Vinay Goel of Internet Archive presented "Web Archive Analysis". He started by saying that "Most Partners access Archive-It via the Wayback Machine," with other access methods being the Archive-It search service or downloading the archival content. He spoke of de-duplication and how it is represented in WARCs via a revisit record. The core of his presentation covered the various WARC meta formats: Web Archive Transformation (WAT) files and CDX files, the format used for WARC indexing. "WAT files are WARC metadata records," he said. "CDX files are space delimited text files that record where a file resides in a WARC and its offset." Vinay has come up with an analysis toolkit that allows researchers to express the questions they want to ask about the archives in a high-level language, which is then translated to a low-level language understandable by an analysis system. "We can capture YouTube content," he said, giving an example use case, "but the content is difficult to replay." Some of the analysis he displayed involved identifying this non-replayable content in the archives and showing the in-degree and out-degree information of each resource. Further, his toolkit is useful in studying how this linking behavior changes over time.
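
For reference, a single CDX index line ties a captured URI to the WARC file and byte offset where its record lives. The line below is illustrative only (field order and values vary between CDX dialects):

org,example)/page.html 20131112093000 http://example.org/page.html text/html 200 QX5BGHKZZ7EXAMPLEDIGEST - - 2153 84756321 ARCHIVEIT-EXAMPLE-20131112.warc.gz

Here the last two numeric fields are the compressed record length and its offset, and the final field names the WARC file that holds the capture.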

The crowd then broke for lunch, only to return to Scott Reed (@vector_ctrl) of Internet Archive presenting the new features that will be present in the next iteration of Archive-It, 5.0. The new system, among other things, allows users to create test crawls and is better at social media archiving. Some of the goals to be implemented in the system before the end of the year are to get the best capture possible and to display the capture in currently existing tools. Scott mentioned an effort by Archive-It to utilize phantomjs (with which we're familiar at WS-DL through our experiments) through a feature they're calling "Ghost". Further, the new version promises to have an API. Along with Scott, Maria LaCalle spoke of a survey completed about the current version of Archive-It and Solomon Kidd spoke of work done on user interface refinements for the upcoming system.

Following Scott, the presentations continued with your author, Mat Kelly (@machawk1) presenting "Archive What I See Now".

After I finished my presentation, the final presentation of the day was by Debbie Kempe of The Frick Collection and Frick Art Reference Library with "Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art Resources". In her presentation, she stated that there is a broad overlap of art between the Brooklyn Museum, the Museum of Modern Art, and the Frick Art Reference Library. Citing Abbie Grotke's survey from earlier, she reminded the audience that no museums responded to the survey, which is problematic for evaluating their archiving needs. "Not all information is digital in the art community", Debbie said. In initiating the archiving effort, the questions for the museums' organizers were not so much why or how web archiving of their content should be done, but rather "Who will do it?" and "How will we pay for it?" She ran a small experiment in accomplishing the preservation tasks of the museum and is now running a longer "experiment", given that more of the content entering their collections is digital and less is in print. In the longer trial, she hopes to test and formulate a sustainable workflow, including re-skilling and organizational changes.

After Debbie, the crowd broke into Birds of a Feather sessions to discuss the web archiving issues that interested each individual; I joined a group about "Capture", given my various software projects relating to the topic. After the BoF session, Lori Donovan and Kristine Hanna adjourned the meeting, which was followed by a reception.

Overall, I felt the trip to Utah to meet with a group with a common interest was a unique experience that I don't get at other conferences, where some of the audiences' focuses are disjoint from one another. The feedback I received on my research and the discussions I had with various attendees were extremely valuable in learning how the Archive-It community works, and I hope to attend again next year.

EDIT: Since publishing this post, the Archive-It team have posted the slides from the Partner Meeting.

— Mat (@machawk1)