Thursday, November 28, 2013

2013-11-28: Replaying the SOPA Protest

In an attempt to limit online piracy and theft of intellectual property, the U.S. Government proposed the Stop Online Piracy Act (SOPA). This act was widely unpopular. On January 18th, 2012, many prominent websites (e.g., XKCD) organized a world-wide blackout of their sites in protest of SOPA.

While the attempted passage of SOPA may end up being a mere footnote in history, the overwhelming protest it provoked is significant. The event is an important observance and should be preserved in our Web archives. However, some methods of implementing the protest (such as JavaScript and Ajax) made the resulting representations unarchivable by the archival services of the time. As a case study, we will examine the Washington, D.C. Craigslist site and the English Wikipedia page. All screenshots of the live protests were taken during the protest on January 18th, 2012; the screenshots of the mementos were taken on November 27th, 2013.

Screenshot of the live Craigslist SOPA Protest from January 18th, 2012.

Craigslist put up a blackout page that provided access to the site only through a link that appears after a timeout. To preserve the SOPA splash page on the Craigslist site, we submitted the URI-R for the Washington, D.C. Craigslist page to WebCite for archiving, producing a memento of the SOPA screen:

http://webcitation.org/query?id=1326900022520273

At the bottom of the SOPA splash page, JavaScript counts down from 10 to 1 and then provides a link to enter the site. The countdown operates properly in the memento, providing an accurate capture of the resource as it existed on January 18th, 2012.

Screenshot of the Craigslist protest memento in WebCite.


The countdown on the page is created with JavaScript that is included in the HTML:
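As a rough sketch of what such an inline countdown might look like (only the startCountDown name and the onload trigger are mentioned in this post; the element names and other details below are illustrative assumptions, not Craigslist's actual code):

// Illustrative sketch only -- not Craigslist's actual code.
// A counter is written into the page once per second; when it reaches zero,
// the link into the site is revealed.
var count = 10;

function startCountDown() {
    var timer = document.getElementById("timer");      // hypothetical element id
    var link  = document.getElementById("enterLink");  // hypothetical element id

    (function tick() {
        timer.innerHTML = count;
        if (count-- > 0) {
            setTimeout(tick, 1000);            // schedule the next decrement
        } else {
            link.style.display = "inline";     // reveal the link to the site
        }
    })();
}

// The countdown begins when the page finishes loading.
window.onload = startCountDown;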


The countdown behavior is archived along with the page content because the JavaScript creating the countdown is captured with the content and is available when the onload event fires on the client, at which point the startCountDown code executes. However, the link that appears at the bottom of the screen dereferences to the live version of Craigslist. Notice that the live Craigslist page has no reference to the SOPA protest. Since WebCite is a page-at-a-time archival service, it only archives the initial representation and its embedded resources, so the linked Craigslist page is missed during archiving.

Screenshot of the Craigslist homepage linked from the
protest splash page. This is also the live version of the
homepage.

The Heritrix crawler archived the Craigslist page on January 18th, 2012. The Internet Archive contains a memento for the protest:

Screenshot of the Craigslist protest splash page in the Wayback Machine.

as does Archive-It:


The Internet Archive memento, captured with the Heritrix crawler, has the same splash page and countdown as the WebCite memento. The link on the Internet Archive memento leads to a memento of the Craigslist homepage rather than to the live version, albeit with archival timestamps one day, 13 hours, 30 minutes, and 44 seconds apart (2012-01-19 18:34:32 vs. 2012-01-18 05:03:48):

Screenshot of the Craigslist homepage memento, linked from the
protest splash screen.

The Internet Archive converts embedded links to be relative to the archive rather than targeting the live web. Since Heritrix also crawled the linked page, the embedded link dereferences to the proper memento, with a note embedded in the HTML protesting SOPA.
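As a minimal sketch of that rewriting convention (the rewritten URI below is illustrative; only the web.archive.org/web/{timestamp}/{URI} pattern and the capture datetimes come from this post):

// Minimal sketch of the Wayback Machine's link-rewriting convention:
// http://web.archive.org/web/{YYYYMMDDhhmmss}/{original URI}
function rewriteToArchive(originalUri, timestamp) {
    return "http://web.archive.org/web/" + timestamp + "/" + originalUri;
}

// The link on the archived splash page resolves to a capture of the homepage
// rather than to the live site:
rewriteToArchive("http://washingtondc.craigslist.org/", "20120119183432");
// -> "http://web.archive.org/web/20120119183432/http://washingtondc.craigslist.org/"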

The Craigslist protest was readily archived by WebCite, Archive-It, and the Internet Archive. Policies within each archival institution impacted how the Craigslist homepage (past the protest splash screen) is referenced and accessed by archive users. This differs from the Wikipedia protest, which was not readily archived.

Screenshot of the live Wikipedia SOPA Protest.

Wikipedia displayed a splash screen protesting SOPA that blocked access to all content on the site. A version of it is still available live on Wikipedia as of November 27th, 2013:

The live version of the Wikipedia SOPA Protest.

On January 18th, 2012, we submitted the page to WebCite, which produced a memento that did not capture the splash page. Instead, the memento has only the representation hidden behind the splash page.

A screenshot of the WebCite memento of the Wikipedia SOPA Protest.

The mementos captured by Heritrix and presented through the Internet Archive's Wayback Machine and Archive-It are also missing the SOPA splash page.

A screenshot of the Internet Archive memento of the
Wikipedia SOPA Protest.

A screenshot of the Archive-It memento of the
Wikipedia SOPA Protest.

To investigate the cause of the missing splash page further, we requested that WebCite archive the current version of the Wikipedia blackout page on November 27th, 2013. The new memento does not capture the splash page, either:

A screenshot of the WebCite memento of the current
Wikipedia blackout page.

Heritrix also created a memento of the current blackout page on August 24th, 2013. This memento suffers from the same problem as the aforementioned mementos and does not capture the splash page:

A screenshot of the Internet Archive memento of the current
Wikipedia blackout page.

Looking through the client-side DOM of the Wikipedia mementos referenced here, there is no mention of the splash page protesting SOPA. This means the splash page was loaded by either Cascading Style Sheets (CSS) or JavaScript. Since clicking the browser's "Stop" button prevents the splash page from appearing, we hypothesize (and show below) that JavaScript is responsible for loading it. JavaScript loads the image needed for the splash page as the result of a client-side event, and since the archival tools have no way of executing that event, they have no way of knowing to archive the image.

When we load the memento of the blackout resource, we see that several files are loaded from Wikimedia. Some of the JavaScript files return a 403 Forbidden response since they are blocked by the Wikimedia robots.txt file:

Google Chrome's developer console showing the resources requested
by http://web.archive.org/web/20130824022954/http://en.wikipedia.org/?banner=blackout
and their associated response codes.

Specifically, the robots.txt file preventing these resources from being archived is:

http://bits.wikimedia.org/robots.txt

The robots.txt file is archived as well:

http://web.archive.org/web/*/http://bits.wikimedia.org/robots.txt

We will look at one specific HTTP request for a JavaScript file:



This JavaScript file contains code defining a function that adds CSS to the page, overlaying an image as a splash page and placing the associated text on top of the image (I have added the line breaks for readability):



Without execution of the insertBanner function, the archival tools will not know to archive the image of the splash page (WP_SOPA_Splash_Full.jpg) or the overlaid text. In this example, Wikimedia constructs the URI of the image and uses Ajax to request the resource:
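The snippet below is an illustrative reconstruction of the general pattern, not Wikimedia's actual banner code; only the insertBanner name and the WP_SOPA_Splash_Full.jpg image come from this post, and the banner URI in the Ajax call is hypothetical:

// Illustrative sketch only -- not Wikimedia's actual code.
function insertBanner(bannerHtml) {
    // Overlay a splash page on top of the article via injected CSS and HTML.
    var overlay = document.createElement("div");
    overlay.id = "blackout-banner";
    overlay.style.cssText =
        "position:fixed; top:0; left:0; width:100%; height:100%; z-index:10000;" +
        "background:#000 url('http://upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg') no-repeat center;";
    overlay.innerHTML = bannerHtml;
    document.body.appendChild(overlay);
}

// The banner is only requested via Ajax after the page loads on the client,
// so a crawler that does not execute this code never learns about the splash
// image or the overlaid text.
var xhr = new XMLHttpRequest();
xhr.open("GET", "http://bits.wikimedia.org/banner?campaign=blackout", true); // hypothetical URI
xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
        insertBanner(xhr.responseText);
    }
};
xhr.send();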


The blackout image is available in the Internet Archive, but the mementos in the Wayback Machine do not attempt to load it:

http://web.archive.org/web/20120118165255/http://upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg

Without execution of the client-side JavaScript and subsequent capture of the splash screen, the SOPA blackout protest is not seen by the archival service.

We have presented two different uses of JavaScript by two different web sites and the impact on the archivability of their SOPA protests. The Craigslist mementos provide representations of the SOPA protest, although the archives may be missing associated content due to differences in policy and intended use. The Wikipedia mementos do not provide a representation of the protest. While the constituent parts of the Wikipedia protest are not entirely lost, they are not properly reconstituted, making the representation unarchivable with the tools available on January 18th, 2012 and November 27th, 2013.

We have previously demonstrated that JavaScript in mementos can cause strange things to happen. This is another example of how technologies that normally improve a user's browsing experience can make content more difficult, if not impossible, to archive.


--Justin F. Brunelle





Thursday, November 21, 2013

2013-11-21: 2013 Southeast Women in Computing Conference (SEWIC)


Last weekend (Nov 14-17), I was honored to give a keynote at the Southeast Women in Computing Conference (SEWICC), located at the beautiful Lake Guntersville State Park in north Alabama.  The conference was organized by Martha Kosa and Ambareen Siraj (Tennessee Tech University), and Jennifer Whitlow (Georgia Tech).


Videos from the keynotes and pictures from the weekend will soon be posted on the conference website.  (UPDATE 1/24/14: Flickr photostream and links to keynote videos added.)


The 220+ attendees included faculty, graduate students, undergraduates, and even some high school students (and some men, too!).

On Friday night, Tracy Camp from the Colorado School of Mines presented the first keynote, "What I Know Now... That I Wish I Knew Then".  It was a great kickoff to the conference and provided a wealth of information on (1) the importance of mentoring, networking, and persevering, (2) tips on negotiating and time management, and (3) advice on dealing with the Impostor Syndrome.  (Watch the video)

During her talk, she pointed out that women's participation != women's interest.  She had some statistics showing that in 1970 the percentages of women in law school (5%), business school (4%), medical school (8%), and high school sports (4%) were very low.  Then she contrasted that with data from 2005: law school (48%), business school (45%), medical school (50%), and high school sports (42%).  The goal was to counter the frequent comment that "Oh, women aren't in computing and technology because they're just not interested."

She also listed qualities that might indicate that you have the impostor syndrome. From my discussions with friends and colleagues, it's very common among women in computing and technology.  (I've heard that there are a few men who suffer from this, too!)  Here's the list:
  • Do you secretly worry that others will find out that you're not as bright/capable as they think you are?
  • Do you attribute your success to being a fluke or "no big deal"?
  • Do you hate making a mistake, being less than fully prepared, or not doing things perfectly?
  • Do you believe that others are smarter and more capable than you are?
Saturday morning began with my keynote, "Telling Stories with Web Archives".


I talked a bit about web archiving in general and then described Yasmin AlNoamany's PhD work on using the archives for storytelling.  The great part for me was to be able to introduce the Internet Archive and the Wayback Machine to lots of people.  I got several comments from both students and faculty afterwards with ideas of how they would incorporate the Wayback Machine in their work and studies. (Watch the video)


After my talk, I attended a session on Education.  J.K. Sherrod and Zach Guzman from Pellissippi State Community College in Knoxville, TN presented "Using the Raspberry Pi in Education".  They had been teaching cluster computing using Mac Minis, but it was getting expensive, so they started purchasing Raspberry Pi devices (~$35) for their classes.  The results were impressive.  Since the devices run a full version of Linux, they were even able to implement a Beowulf Cluster.

I followed this up by attending a panel "Being a Woman in Technology: What Does it Mean to Us?"  The panelists and audience discussed both positive connotations and challenges to being a woman in technology.  This produced some amazing stories, including one student who related being told by a professor that she was no good at math and was "a rotten mango".
After lunch, several students presented 5 minute lightning talks on strategies for success in school and life.  It was great to see so many students excited to share their experiences and lessons learned.

The final keynote was given on Saturday night by Valentina Salapura from IBM TJ Watson on "Cloud Computing 24/7".  After telling her story and things she learned along the way (including a snapshot from the Wayback Machine of her former academic webpage), she described the motivation and promise of cloud computing.


Sunday was the last day, and I attended a talk by Ruthe Farmer, Director of Strategic Initiatives, NCWIT, on "Research and Opportunities for Women in Technology".  The National Center for Women & Information Technology was started in 2004 and is a non-profit that partners with corporations, academic institutions, and other agencies with the goal of increasing women's participation in technology and computing.  One of their slogans is "50:50 by 2020".  There's a wealth of information and resources available on the NCWIT website (including the NCWIT academic alliance and Aspirations in Computing program).

Ruthe described the stereotype threat that affects both women and men.  This is the phenomenon whereby awareness of negative stereotypes associated with a peer group can inhibit performance.  She described a study in which a group of white men from Stanford were given a math test.  Before the test, one set of students was reminded of the stereotype that Asian students outperform Caucasian students in math, and the other set was not reminded of this stereotype.  The stereotype-threatened test takers performed worse than the control set.

Before we left on Sunday, I had the opportunity to sit in the red chair. Sit With Me is a promotion by NCWIT to recognize the role of women in computing.  "We sit to recognize the value of women's technical contributions.  We sit to embrace women's important perspectives and increase their participation."

All in all, it was a great weekend.  I drank lots of sweet tea, heard great southern accents that reminded me of home (Louisiana), and met amazing women from around the southeast, including students and faculty from Trinity University (San Antonio, TX), Austin Peay University, Georgia Tech, James Madison University (Virginia), Tennessee Tech, Pellissippi State Community College (Knoxville, TN), Murray State University (Kentucky), NC State, Rhodes College (Memphis, TN), Univ of Georgia, Univ of Tennessee, and the Girls Preparatory School (Chattanooga, TN).
There are plans for another SEWIC Conference in 2015.
-Michele

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

Circulating the web last week was the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below).  Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving its archived versions of the pages (see Note 2 below for a discussion of robots.txt, as well as for an update about availability in the Internet Archive).  But even though the Internet Archive allows site owners to redact pages from its archive, mementos of the pages likely exist in other archives.  Yes, the Internet Archive was the first web archive and is still by far the largest with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details). 

Consider this randomly chosen 2009 speech:

http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx

Right now it produces a custom 404 page (see Note 3 below):


Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):




So it seems clear that this speech will not disappear down a memory hole.  But how do you discover these copies in these archives?  Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework.  If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.

If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):



The nice thing about the multi-archive access of Memento is as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one-at-a-time for a particular URI. 
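As a minimal sketch of how a client can pull a TimeMap from an aggregator programmatically (the aggregator endpoint below is the Memento Time Travel service and is an assumption about what would be used here; the parsing is a simplification of the application/link-format response):

// Minimal sketch: fetch a TimeMap for a URI-R from a Memento aggregator and
// list each memento's datetime and URI-M.  Run under Node 18+ (global fetch).
const uriR = "http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx";

fetch("http://timetravel.mementoweb.org/timemap/link/" + uriR)   // assumed endpoint
  .then(res => res.text())
  .then(body => {
    // Entries in a link-format TimeMap are separated by ",\n"; memento entries
    // carry rel values containing "memento" and a datetime attribute.
    for (const entry of body.split(",\n")) {
      if (/rel="[^"]*memento/.test(entry)) {
        const uriM = entry.match(/<([^>]+)>/)[1];
        const datetime = (entry.match(/datetime="([^"]+)"/) || [, "unknown"])[1];
        console.log(datetime, uriM);
      }
    }
  });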

We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:
  • Internet Archive: crawl everything
  • Archive-It: collections defined by subscribers
  • UK Web Archive: archive all UK websites (conservatives.com is a UK web site even though it is not in the .uk domain)
  • Archive.is: archives individual pages on user request
Download and install the Chrome extension, and all of these archives (and more) will be easily available to you.

-- Michael and Herbert

Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt.  For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013).  "We've always been at war with Eastasia."

Note 2: While this post was being written, the Internet Archive stopped blocking access to this speech (and presumably the others).  Here is the raw HTTP of the speech being blocked (the key is the "X-Archive-Wayback-Runtime-Error:" line):



But access was restored within the space of three hours, before I could generate a screenshot:



Why was it restored?  Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?).  The 08:36:36 version of robots.txt has:

...
Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/
... 

But the 18:10:19 version has:
...  
Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/
...  

These "Disallow" rules no longer match the URI of the original speech.  I guess the Internet Archive cached the disallow rule and it just now expired one week later.  See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.

The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:



Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate.  Instead of returning a 404 page at the original URI (like Apache), it 302 redirects to another URI that returns the 404:



See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.
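As an illustration of how a client could observe this redirect-then-404 pattern for itself (a sketch only, written for Node 18+ where a manually handled redirect exposes its status and Location header):

// Sketch: detect a "soft 404" where the original URI answers with a redirect
// and only the redirect target returns the 404 status.
async function checkSoft404(uri) {
    const first = await fetch(uri, { redirect: "manual" });
    if (first.status === 301 || first.status === 302) {
        const location = new URL(first.headers.get("location"), uri).href;
        const second = await fetch(location);
        console.log(`${uri} -> ${first.status} -> ${location} -> ${second.status}`);
    } else {
        console.log(`${uri} -> ${first.status}`);
    }
}

checkSoft404("http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx");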

--Michael

Tuesday, November 19, 2013

2013-11-19: REST, HATEOAS, and Follow Your Nose

This post is hardly timely, but I wanted to gather together some resources that I have been using for REST (Representational State Transfer) and HATEOAS (Hypermedia as the Engine of Application State).  It seems like everyone claims to be RESTful, but mentioning HATEOAS is frequently met with silence.  Of course, these terms come from Roy Fielding's PhD dissertation, but I won't claim that it is very readable (it is not the nature of dissertations to be readable...).  Fortunately he's provided more readable blog posts about REST and HATEOAS. At the risk of aggressively over-simplifying things, REST = "URIs are nouns, not verbs" and HATEOAS = "follow your nose".

"Follow your nose" simply means that when a client dereferences a URI, the entity that is returned is responsible for providing a set of links that allows the user agent to transition to the next state.  This standard procedure in HTML: you follow links to guide you through an online transaction (e.g., ordering a book from Amazon) all the time -- it is so obvious you don't even think about it.  Unfortunately, we don't hold our APIs to the same level; non-human user agents are expected to make state transitions based on all kinds of out-of-band information encoded in the applications.  When there is a change in the states and the transitions between them, there is no direct communication with the application and it simply breaks.

I won't dwell on REST because most APIs get the noun/verb thing right, but we seem to be losing ground on HATEOAS (and technically, HATEOAS is a constraint on REST and not something "in addition to REST", but I'm not here to be the REST purity police).

There are probably many good descriptions of HATEOAS and I apologize if I've left your favorite out, but these are the two that I use in my Web Server Design course (RESTful design isn't the goal of the course, but more like a side benefit).  Yes, you could read a book about REST, but these two slide decks will get you there in minutes.  

The first is from Jim Webber entitled "HATEOAS: The Confusing Bit from REST".  There is a video of Jim presenting these slides as well as a white paper about it (note: the white paper is less correct than the slides when it comes to things like specific MIME types).  He walks you through a simple but illustrative (HTTP) RESTful implementation of ordering coffee.  If the user agent knows the MIME type "application/vnd.restbucks+xml" (and the associated Link rel types), then it can follow the embedded links to transition from state to state.  And if you don't know how to do the right thing (tm) with this MIME type, you should stop what you're doing. 




It seems like the Twitter API is moving away from HATEOAS.  Brian Mulloy has a nice blog post about this (from which I took the image at the top of this post).  The picture nicely summarizes that from an HTML representation of a Tweet there are all manner of state transitions available, but the equivalent json representation is effectively a dead-end; the possible state transitions have to be in the mind of the programmer and encoded in the application.  Their API returns MIME types of "application/json" just like 1000 other APIs and it is up to your program to sort out the details.  Twitter's 1.1 API, with things like removing support for RSS, is designed for lock-in and not abstract ideals like HATEOAS.  Arguably all the json APIs, with their ad-hoc methods for including links and uniform MIME type, are a step away from HATEOAS (see the stackoverflow.com discussion). 

The second presentation also addresses a pet peeve of mine: API deprecation (e.g., my infatuation with Topsy has tempered after they crashed all the links I had created -- grrr.).  The presentation "WHY HATEOAS: A simple case study on the often ignored REST constraint" from Wayne Lee walks through a proper way to define your API with an eye to new versions, feature evolution, etc. 




Again, I'm sure there are many other quality REST and HATEOAS resources but I wanted to gather the couple that I had found useful into one place and not just have them buried in class notes.  Apologies for being about five years late to the party.

--Michael

Wednesday, November 13, 2013

2013-11-13: 2013 Archive-It Partner Meeting Trip Report


On November 12, I attended the 2013 Archive-It Partner Meeting in Salt Lake City, Utah, our research group's second year of attendance (see the 2012 Trip Report). The meeting started off casually at 9am with breakfast and registration. Once everyone was settled, Kristine Hanna, the Director of Archiving Services at Internet Archive, introduced the members of her team present at the meeting. Kristine acknowledged the fire at Internet Archive last week and the extent of the damage: "It did burn to the ground but thankfully, nobody was injured." She reminded the crowd of partners to review Archive-It's storage and preservation policy and mentioned the redundancies in place, including a soon-to-be mirror at our very own ODU. Kristine then mentioned news of a new partnership with Reed Technologies to jointly market and sell Archive-It (@archiveitorg). She reassured the audience that nothing would change beyond having more resources to accomplish their goals.

Kristine then briefly mentioned the upcoming release of Archive-It 5.0, which would be covered in depth in a later presentation. She asked everyone in the room (of probably 50 or so attendees) to introduce themselves and state their affiliation. With the intros out of the way, the presentations began.

Kate Legg of the National Center for Atmospheric Research (NCAR) presented "First steps toward digital preservation at NCAR". She started by saying that NCAR is a federally funded research and development center (FFRDC) whose mission is to "preserve, maintain and make accessible records and materials that document the history and scientific contributions of NCAR". With over 70 collections and 1500 employees, digital preservation is on the organization's radar. Their plan, given their small library and staff, is to accomplish this alongside other competing priorities.

"Few people were thinking about the archives for collecting current information", Kate described of some of the organization not understanding that preserving now will create archives for later. "The archive is not just where old where old stuff goes, but new stuff as well." One of the big obstacles for the archiving initiatives of the organizations has been funding. Even with this limitation, however, NCAR was still able to subscribe to Archive-It through a low level subscription. With this subscription, they started to preserve their Facebook group but increasingly found huge amounts of data, including videos, that they felt was too resource heavy to archive. The next step for the initiative is to add a place on the organization's webpage where archive content will be accessible to the public.

Jaime McCurry (@jaime_ann) of the Folger Shakespeare Library followed Kate with "The Short and the Long of It: Web Archiving at the Folger Shakespeare Library". Jaime is currently participating in the National Digital Stewardship Residency, where her goal is to establish local routines and best practices for archiving and preserving the library's born-digital content. They currently have two collections with over 6 million documents (over 400 gigabytes of data), with the aim of preserving content on the web relating to the works of Shakespeare (particularly in social media and from festivals). In trying to describe the large extent of the available content, Jaime said, "In trying to archive Shakespeare's presence on the web, you really have to define what you're looking for. Shakespeare is everywhere!" She noted that one of the first things she realized when she started on the project at Folger was that nobody knew the organization was performing web archiving, so she wished to establish an organization-wide web archiving policy. One of the recent potential targets of her archiving project was the NYTimes' Hamlet contest, wherein the newspaper suggested Instagram users create 15-second clips of their interpretation of a passage from the play. Because this relates to Shakespeare, it would be an appropriate target for the Folger Shakespeare Library.

EDIT: Jaime also created a trip report of the meeting on her blog.

After Jaime finished, Sharon Farnel of the University of Alberta began her presentation "Metadata workflows for web archiving – thinking and experimenting with ‘metadata in bulk’". In her presentation she referenced Blacklight, an open source project that provides a discovery interface for any Solr index via a customizable, template-based user interface. From the context of metadata, she wished to think about where and why discovery of content takes place in web archiving. She utilized a mixed model wherein entries might have MARC records, Dublin Core data, or both. Sharon emphasized that metadata is an important part of Archive-It's functionality. To better parse the data, her group created XSLT stylesheets to export the data into a more interoperable format like Excel, from which it could be imported back into Blacklight after manipulation. She referenced some of the difficulties in working with the different technologies but said, "None of these tools were a perfect solution on their own but by combining the tools in-house, we can get good results with the metadata."

After a short break (to caffeinate), Abbie Grotke (@agrtoke) of the Library of Congress remotely presented "NDSA Web Archiving Survey Update". In her voice-over presentation from DC, she gave preliminary results of the NDSA Web Archiving Survey, stating that the survey, an initiative of the NDIIPP program, had yielded about 50 respondents so far. For the most part, the biggest concern about web archiving reported by the survey participants was database preservation, followed by social media and video archiving. She stated that the survey is still open and encouraged attendees to take it (Take it here).

Trevor Alvord of Brigham Young University was next with "A Muddy Leak: Using the ARL Code of Best Practices in Fair Use with Web Archiving". His effort with the L. Tom Perry Special Collections at BYU is to build a thematic collection based on Mormonism. He illustrated that many battles had been fought and won over the rights to digitally preserve content (e.g., Perfect 10 vs. Google and Students vs. iParadigms), so his collection should be justified based on the premises in those cases. "Web archiving is routinely done by the two wealthiest corporations (Google and Microsoft)," he said, quoting Jonathan Band, a recognized figure in the lawsuits versus Google. "In the last few months, libraries have prevailed," Trevor said. "Even with our efforts, we have not received any complaints about a website being archived by libraries."

Trevor then went on to describe the problem with his data set, alluding to the Teton Dam flood: millions of documents are being produced about Mormonism, and he now has to capture whatever he can. This is partially due to the lowering of the age requirement for missionaries and the Mormon church's encouragement for young Mormons to post online. He showed two examples of Mormon "mommy" bloggers: Peace Love Lauren, a blog with very little reach, and NieNie Dialogues, a very popular blog. He asked the audience, "How do you prioritize what content to archive given popular content is more important but also more likely to be preserved?"

Following Trevor, Vinay Goel of Internet Archive presented "Web Archive Analysis". He started by saying that "Most partners access Archive-It via the Wayback Machine," with the other methods being the Archive-It search service or downloading the archival contents. He spoke of de-duplication and how it is represented in WARCs via a revisit record. The core of his presentation covered the various WARC meta formats: Web Archive Transformation (WAT) files and CDX files, the format used for WARC indexing. "WAT files are WARC metadata records," he said. "CDX files are space-delimited text files that record where a file resides in a WARC and its offset." Vinay has developed an analysis toolkit that allows researchers to express questions they want to ask about the archives in a high-level language, which is then translated to a low-level language understandable by an analysis system. "We can capture YouTube content," he said, giving an example use case, "but the content is difficult to replay." Some of the analysis he displayed involved identifying this non-replayable content in the archives and showing the in-degree and out-degree of each resource. Further, his toolkit is useful in studying how this linking behavior changes over time.
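As a rough sketch of what those CDX records look like in practice (the 11-field layout assumed below is one common arrangement and the sample line is fabricated for illustration; real CDX files declare their field order in a header line):

// Sketch: parse a space-delimited CDX line.  Field order comes from the CDX
// header in real files; this assumes a common 11-field layout.
const fieldNames = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
                    "digest", "redirect", "metatags", "length", "offset", "filename"];

function parseCdxLine(line) {
    const values = line.trim().split(/\s+/);
    return Object.fromEntries(fieldNames.map((name, i) => [name, values[i]]));
}

// The offset and filename fields say where in which WARC the capture lives:
const rec = parseCdxLine(
    "org,example)/ 20120118050348 http://example.org/ text/html 200 SHA1PLACEHOLDER - - 2152 843 EXAMPLE-20120118-00000.warc.gz");
console.log(rec.offset, rec.filename);  // "843" "EXAMPLE-20120118-00000.warc.gz"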

The crowd then broke for lunch, returning to Scott Reed (@vector_ctrl) of Internet Archive presenting the new features in the next iteration of Archive-It, 5.0. The new system, among other things, allows users to create test crawls and is better at social media archiving. Some of the goals to be implemented in the system before the end of the year are to get the best capture possible and to display the capture in currently existing tools. Scott mentioned an effort by Archive-It to utilize PhantomJS (with which we're familiar at WS-DL through our experiments) through a feature they're calling "Ghost". Further, the new version promises to have an API. Along with Scott, Maria LaCalle spoke of a survey completed about the current version of Archive-It, and Solomon Kidd spoke of work done on user interface refinements for the upcoming system.

Following Scott, the presentations continued with your author, Mat Kelly (@machawk1) presenting "Archive What I See Now".

After I finished my presentation, the final presentation of the day was by Debbie Kempe of The Frick Collection and Frick Art Reference Library with "Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art Resources". In her presentation, she stated that there is a broad overlap of art among the Brooklyn Museum, the Museum of Modern Art, and the Frick Art Reference Library. Citing Abbie Grotke's survey from earlier, she reminded the audience that no museums responded to the survey, which is problematic for evaluating their archiving needs. "Not all information is digital in the art community," Debbie said. In initiating the archiving effort, the question for the museums' organizers was not so much why or how web archiving of their content should be done but rather, "Who will do it?" and "How will we pay for it?" She ran a small experiment in accomplishing the preservation tasks of the museum and is now running a longer "experiment", given that more of the content entering their collections is digital and less is in print. In the longer trial, she hopes to test and formulate a sustainable workflow, including re-skilling and organizational changes.

After Debbie, the crowd broke into a Birds of a Feather session to discuss the web archiving issues that interested each individual; I joined a group discussing "Capture", given my various software projects relating to the topic. After the BoF session, Lori Donovan and Kristine Hanna adjourned the meeting to a reception.

Overall, I felt the trip to Utah to meet with a group with a common interest was a unique experience that I don't get at other conferences, where the audience's focuses are often disjoint from one another. The feedback I received on my research and the discussions I had with various attendees were extremely valuable in learning how the Archive-It community works, and I hope to attend again next year.

EDIT: Since publishing this post, the Archive-It team have posted the slides from the Partner Meeting.

— Mat (@machawk1)

Friday, November 8, 2013

2013-11-08: Proposals for Tighter Integration of the Past and Current Web

The Memento Team is soliciting feedback on two white papers that address related proposals for more tightly integrating the past and current web.

The first is "Thoughts on Referencing, Linking, Reference Rot", which is inspired by the hiberlink project.  This paper proposes making temporal semantics part of the HTML <a> element, via "versiondate" and "versionurl" attributes that respectively include the datetime the link was created and optionally a link to an archived version of the page (in case the live web version becomes 404, goes off topic, etc.).  The idea is that "versiondate" can be used as a Memento-Datetime value by a client, and "versionurl" can be used to record a URI-M value.  This approach is inspired by the Wikipedia Citation Template, which has many metadata fields, including "accessdate" and "archiveurl".  For example, in the article about the band "Coil", one of the links to the source material is broken, but the Citation Template has values for both "accessdate" and "archiveurl":



Unfortunately, when this is transformed into HTML the semantics are lost or relegated to microformats:



A (simple) version with machine-actionable links suitable for the Memento Chrome extension or Zotero could have looked like this in the past, ready to activate when the link eventually went 404:
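As an illustration only (the markup in the comment is hypothetical rather than the actual Coil citation, and the fallback behavior shown is one possible client policy, not the paper's prescribed implementation), an anchor carrying the proposed attributes and a small client-side script acting on them might look like:

// Sketch: act on the proposed attributes with a client-side fallback.
// Hypothetical markup:
//   <a href="http://www.example.org/coil-interview"
//      versiondate="2005-08-12"
//      versionurl="http://web.archive.org/web/20050812000000/http://www.example.org/coil-interview">interview</a>
document.querySelectorAll("a[versionurl]").forEach(anchor => {
    anchor.addEventListener("click", async event => {
        event.preventDefault();
        try {
            // Try the live link first (cross-origin restrictions ignored for the sketch);
            // fall back to the archived copy if the live URI is gone.
            const res = await fetch(anchor.href, { method: "HEAD" });
            window.location = res.ok ? anchor.href : anchor.getAttribute("versionurl");
        } catch (err) {
            window.location = anchor.getAttribute("versionurl");
        }
    });
});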



The second paper, "Memento Capabilities for Wikipedia", "describes added value that Memento can bring to Wikipedia and other MediaWiki platforms.  One is enriching their external links with the recommendations from our first paper (described above), and the second is about native Memento support for wikis.

Native Memento support is possible via a new Memento Extension for MediaWiki servers that we announced for testing and feedback on the wikitech-l list. This new extension is the result of a significant re-engineering effort guided by feedback from Wikipedia experts on a previous version.  When installed, this extension allows clients to access the "history" portion of wikis in the same manner as they access web archives.  For example, if you wanted to visit the Coil article as it existed on February 2, 2007, instead of wading through the many pages of the article's history, your client would use the Memento protocol to access a prior version with the "Accept-Datetime" request header:



and the server would eventually redirect you to:



In a future blog post we will describe how a Memento-enabled wiki can be used to avoid spoilers on fan wikis (e.g., the A Song of Ice and Fire wiki) by setting the Accept-Datetime to be right before an episode or book is released.
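A minimal sketch of that style of request (the wiki host below is hypothetical and the article acting as its own TimeGate is an assumption; only the Accept-Datetime mechanics come from the Memento protocol), run under Node 18+ so the redirect's Location header is visible:

// Sketch: datetime negotiation against a Memento-enabled wiki.  The client
// asks for the version of the article closest to the requested datetime and
// the server answers with a redirect to the corresponding old revision.
fetch("http://wiki.example.org/wiki/Coil", {                 // hypothetical Memento-enabled wiki
    headers: { "Accept-Datetime": "Fri, 02 Feb 2007 00:00:00 GMT" },
    redirect: "manual"                                       // keep the 302 visible
}).then(res => {
    console.log(res.status, res.headers.get("location"));    // e.g. a .../index.php?oldid=... URI
});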

We've only provided a summary of the content of the two papers and we invite everyone to review them and provide us with feedback (here, twitter, email, etc.). 

--Michael & Herbert

Sunday, November 3, 2013

2013-11-2: WSDL NFL Power Rankings Week 9

We are halfway through the 2013 NFL season and it is time for our WSDL mid-season rankings. Both conferences have one winless team, Jacksonville in the AFC and Tampa Bay in the NFC.  The NFC is looking rather lackluster this year with no standout teams so far. The NFC East teams in particular need to get their acts together. The AFC appears to be dominating the League with  a number of teams that are performing quite well. Two teams that show up on the top of every power ranking list are the Denver Broncos and the Kansas City Chiefs.

Kansas City has a great defense; using our efficiency ratings they are rated as the fifth best defense in the league. However, a good defense will only get you so far when your offense is ranked 27th out of 32. Denver, on the other hand, has the highest ranked offense in our system, with a lot of that on Peyton Manning's shoulders. A good passing offense correlates quite well with a team that wins games.

Here is where our ranking system rates each of the teams. The size of each circle reflects the rank of the team: the larger the circle, the higher the rank. The arrows are wins and point from the loser to the winner.





Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser pointing to the winner, weighted by the margin of victory. 

In the PageRank model, each link from a webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j.  This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week Denver beat the Redskins 45 to 21, so a directed edge from the Redskins to Denver with a weight of 24 was created in the graph.
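A minimal sketch of that computation (the damping factor, iteration count, and the handling of teams with no losses are generic PageRank choices here, not necessarily the exact parameters our system uses):

// Sketch: weighted PageRank over a loser -> winner game graph.
// Each game is [loser, winner, marginOfVictory].
const games = [
    ["Redskins", "Broncos", 24],   // Denver 45, Washington 21 (example from above)
    // ... one entry per game played this season
];

function rankTeams(games, damping = 0.85, iterations = 50) {
    const teams = [...new Set(games.flatMap(([loser, winner]) => [loser, winner]))];
    let rank = Object.fromEntries(teams.map(t => [t, 1 / teams.length]));

    for (let i = 0; i < iterations; i++) {
        const next = Object.fromEntries(teams.map(t => [t, (1 - damping) / teams.length]));
        for (const team of teams) {
            // Each loss "votes" for the winner in proportion to the margin of victory.
            const losses = games.filter(([loser]) => loser === team);
            const totalMargin = losses.reduce((sum, [, , margin]) => sum + margin, 0);
            for (const [, winner, margin] of losses) {
                next[winner] += damping * rank[team] * (margin / totalMargin);
            }
        }
        rank = next;   // (undefeated teams are dangling nodes; their outgoing mass is simply dropped here)
    }
    return rank;
}

console.log(rankTeams(games));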

Our rankings differ from most of the others you will find. One of the reasons is that our algorithm takes strength of schedule into account. Denver has the EASIEST strength of schedule in the entire league, and Kansas City is only a few teams above them. Kansas City is the most over-rated team in the league according to our ratings, with the largest difference between our rating and the average of the power rankings found on the Internet.

We are not saying that either one isn't going to the playoffs; because neither one has more than one or two good opponents on its schedule, they will continue to win against mediocre opponents. Look for them to stumble against good teams, and note that they are scheduled to play each other twice in the second half of the season.

Our rankings:

Indianapolis Colts         0.103362928
San Diego Chargers      0.076939541
Cincinnati Bengals       0.070915973
New England Patriots   0.059724726
Cleveland Browns        0.055835742      
San Francisco 49ers      0.050768683       
New Orleans Saints      0.047412718       
Oakland Raiders           0.047354270      
Denver Broncos            0.044543605      
Seattle Seahawks          0.043283728      
Miami Dolphins           0.041914789      
Green Bay Packers       0.041591108       
Kansas City Chiefs       0.034758260       
Detroit Lions                0.027743700      
Tennessee Titans          0.025216934      
New York Jets               0.022644022     
Chicago Bears               0.022401144      
Arizona Cardinals         0.021622872      
Houston Texans             0.021445405      
Baltimore Ravens          0.019740968      
Washington Redskins    0.017983031      
Dallas Cowboys            0.014199985      
Carolina Panthers          0.013691420      
St. Louis Rams              0.012693121      
Pittsburgh Steelers         0.010377841      
Buffalo Bills                  0.009797045      
Philadelphia Eagles       0.008701059     
New York Giants           0.007876355      
Atlanta Falcons              0.007223360      
Minnesota Vikings         0.007014133      
Jacksonville Jaguars      0.005610766      
Tampa Bay Buccaneers  0.005610766       

--Greg Szalkowski