Thursday, December 19, 2013
Following the talk at the Spreeforum I was asked to give an interview for the German radio station Inforadio (think of it as Germany's NPR). The piece aired on Monday, November 18th at 7:30am CET. As I had already left Germany I was not able to listen to it live, but I was happy to find the corresponding article online: it basically contained the transcript of the aired report and had an audio file embedded in the document. I immediately bookmarked the page.
Some days later I revisited the original URI only to find it was no longer available (screenshot left). Now, we all know that the web is dynamic and hence links break, and we have seen odd dynamics at other media companies before, but in this case, as I was about to find out, it was higher powers that caused the detrimental effect. Inforadio is a public radio station and therefore, like many others in Germany and throughout Europe, to a large extent financed by the public (as of 2013 the broadcast receiving license is 17.98 Euros (almost USD 25) per month per household). As such it is subject to the "Rundfunkstaatsvertrag", a treaty between the German states that regulates broadcasting rights. The 12th amendment to this treaty, from 2009, mandates that most online content must be removed after 7 days of publication. Huh? Yeah, I know, it sounds like a very bad joke but it is not. It even led to the coining of the term "depublish" - a paradox in itself. I had considered public radio stations "memory organizations", in league with libraries, museums, etc. How wrong was I, and how ironic is this, given my talk's topic!? For what it's worth, the content does not have to be deleted from the repository, but it does have to be taken offline.
I can only speculate about the reasons for this mandate, but believable opinions circulate indicating that private broadcasters and news publishers complained about unfair competition. The claim was made that "eternal" availability of broadcast content on the web is unfair competition, as the private sector is not given the appropriate funds to match that competitive advantage. Another point that supposedly was made is that this online service goes beyond the mandate of public radio stations and hence would constitute a misuse of public money. To me personally, none of this makes any sense. Broadcasters of all sorts have realized that content (text, audio, and video) is increasingly consumed online and are adjusting their offerings accordingly. How this can be seen as unfair competition is unclear to me.
But back to my interview. Clearly, one can argue (or not) whether the document is worth preserving but my point here is a different one:
Not only did I bookmark the page when I saw it, I also immediately tried to push it into as many web archives as I could. I tried the Internet Archive's new "save page now" service but, to add insult to injury, Inforadio also has a robots.txt file in place that prohibits the IA from crawling the page. To the best of my knowledge this is not part of the 12th amendment to the "Rundfunkstaatsvertrag" so the broadcaster could actually take action to preserve their online content. Other web sites of public radio and TV stations such as Deutschlandfunk or ZDF do not prohibit archives from crawling their pages.
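For readers curious what such a block looks like in practice, Python's standard library can evaluate robots.txt rules directly. The sketch below uses hypothetical rules and a made-up article URI; it simply shows the kind of rule, keyed on the Internet Archive's "ia_archiver" user agent, that keeps the Wayback Machine's crawler out while leaving everyone else unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules of the kind that keep the Internet
# Archive's crawler (user agent "ia_archiver") away from a site:
rules = """\
User-agent: ia_archiver
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The IA crawler is barred from every page on the site...
print(rp.can_fetch("ia_archiver", "http://www.inforadio.de/some/article.html"))  # False
# ...while other user agents are unaffected.
print(rp.can_fetch("SomeOtherBot", "http://www.inforadio.de/some/article.html"))  # True
```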
Fortunately, the archiving service Archive.is was able to grab the page (screenshot left) but the audio feed is lost.
Just one more thing (Peter Falk style):
Note that the original URI of the page:
when requested in a web browser redirects (302-style) to:
The good news here: it is not a soft 404, so the error is somewhat robot friendly. The bad news is that the original URI is thrown away. As the original URI is the only key for a search in web archives, without it we cannot retrieve any archived copies (such as the one I created in Archive.is). Unfortunately, this is not only true for manual searches; it also undermines automatic retrieval of archived copies by clients such as the Memento for Chrome browser extension. As stressed in our recent talk at CNI, this is very bad practice and unnecessarily makes life harder for those interested in obtaining archived copies of web pages at large, not only of my radio interview.
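To illustrate why the original URI matters so much: a Memento client asks a TimeGate or TimeMap for captures keyed on exactly that URI. Below is a sketch that parses a made-up TimeMap fragment in the Memento link format (the archive URI and datetime are invented for illustration); without the original URI to key the lookup, there is nothing to parse in the first place.

```python
import re

# A made-up TimeMap fragment in the link format used by Memento
# TimeMap endpoints (the archive URI and datetime are invented):
timemap = (
    '<http://www.inforadio.de/some/article.html>; rel="original",\n'
    '<http://archive.example/20131118120000/article>; rel="memento"; '
    'datetime="Mon, 18 Nov 2013 12:00:00 GMT",\n'
)

# Pull out each memento URI together with its archival datetime.
pattern = re.compile(r'<([^>]+)>; rel="memento"; datetime="([^"]+)"')
for uri, dt in pattern.findall(timemap):
    print(dt, uri)
```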
Wednesday, December 18, 2013
Fortunately, there is the native Memento MediaWiki Extension, supported by the Andrew W. Mellon Foundation, which addresses these issues. It has been developed jointly by Old Dominion University and LANL. MediaWiki was chosen because it is the most widely used wiki software, powering sites such as Wikipedia and Wikia.
This native extension allows direct access to all revisions of a given page, avoiding spoilers. It can also return the data directly, requiring no Memento aggregators or other additional external infrastructure.
To recap, the native extension addresses the following problems with relying on external infrastructure:
- The Memento infrastructure cannot know about all possible wikis and provide TimeGates for each one, so the chances of any given wiki having one are low.
- The Internet Archive does not have all revisions of each fan wiki page, meaning that visitors to a fan wiki may miss out on information.
- Changes to a wiki's API can break an external TimeGate at any time, and APIs change frequently, while Memento is specified by a more stable RFC; visitors trying to avoid spoilers therefore don't need to worry about the Memento wiki TimeGate infrastructure.
* = Memento for Chrome version 0.1.11 actually performs two HEAD requests on the resource, but this will be fixed in the next release.
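For the curious, the datetime negotiation that a TimeGate (native or not) performs is driven by the client's Accept-Datetime request header. Here is a minimal sketch of constructing such a header with Python's standard library; the target datetime is purely illustrative and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# The moment we want to see the wiki "as of" -- say, just before a
# spoiler-laden episode aired (the value is purely illustrative).
target = datetime(2013, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

# Memento uses the same RFC 1123 date format as the rest of HTTP.
headers = {"Accept-Datetime": format_datetime(target, usegmt=True)}
print(headers["Accept-Datetime"])  # Sat, 01 Jun 2013 12:00:00 GMT
```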
Friday, December 13, 2013
As always, there are many slides, but they are worth the time to study. Of particular importance are slides 8-18, which help differentiate Hiberlink from other projects, and slides 66-99, which walk through a demonstration of how the "Missing Link" concepts (along with the Memento for Chrome extension) can be used to address the problem of link rot. In particular, absent a specific versiondate attribute on a link, such as:
<a versiondate="some-date-value" href="...">
A temporal context can be inferred from the "datePublished" META value defined by schema.org:
<META itemprop="datePublished" content="some-ISO-8601-date-value">
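As a sketch of how a client might recover that temporal context, the Python standard library's html.parser is enough to pull the "datePublished" value out of a page; the page fragment below is made up for illustration.

```python
from html.parser import HTMLParser

class DatePublishedFinder(HTMLParser):
    """Collect the content of <meta itemprop="datePublished"> tags."""
    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("itemprop") == "datePublished":
            self.dates.append(a.get("content"))

# A made-up page fragment carrying the schema.org annotation:
page = '<html><head><meta itemprop="datePublished" content="2013-12-13"></head></html>'
finder = DatePublishedFinder()
finder.feed(page)
print(finder.dates)  # ['2013-12-13']
```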
Again, the slides are well-worth your time.
Thursday, November 28, 2013
|Screenshot of the live Craigslist SOPA Protest from January 18th, 2012.|
Craigslist put up a blackout page that would only provide access to the site through a link that appears after a timeout. In order to preserve the SOPA splash page on the Craigslist site, we submitted the URI-R for the Washington D.C. Craigslist page to WebCite for archiving producing a memento for the SOPA screen:
|Screenshot of the Craigslist protest memento in WebCite.|
|Screenshot of the Craigslist homepage linked from the protest splash page. This is also the live version of the page.|
The Heritrix crawler archived the Craigslist page on January 18th, 2012. The Internet Archive contains a memento for the protest:
|Screenshot of the Craigslist protest splash page in the Wayback Machine.|
as does Archive-It:
The Internet Archive memento has the same splash page and countdown as the WebCite memento captured with the Heritrix crawler. The link on the Internet Archive memento leads to a memento of the Craigslist homepage rather than the live version, albeit one with an archival timestamp one day, 13 hours, 30 minutes, and 44 seconds later (2012-01-19 18:34:32 vs. 2012-01-18 05:03:48):
|Screenshot of the Craigslist homepage memento, linked from the protest splash screen.|
The Internet Archive converts embedded links to be relative to the archive rather than targeting the live web. Since Heritrix also crawled the linked page, the embedded link dereferences to the proper memento, with a note embedded in the HTML protesting SOPA.
The Craigslist protest was readily archived by WebCite, Archive-It, and the Internet Archive. Policies within each archival institution impacted how the Craigslist homepage (past the protest splash screen) is referenced and accessed by archive users. This differs from the Wikipedia protest, which was not readily archived.
|Screenshot of the live Wikipedia SOPA Protest.|
Wikipedia displayed a splash screen protesting SOPA that blocked access to all content on the site. A version of it is still available live on Wikipedia as of November 27th, 2013:
|The live version of the Wikipedia SOPA Protest.|
On January 18th, 2012, we submitted the page to WebCite, producing a memento that did not capture the splash page. Instead, the memento has only the representation hidden behind the splash page.
|A screenshot of the WebCite memento of the Wikipedia SOPA Protest.|
The mementos captured by Heritrix and presented through the Internet Archive's Wayback Machine and Archive-It are also missing the SOPA splash page.
|A screenshot of the Internet Archive memento of the Wikipedia SOPA Protest.|
|A screenshot of the Archive-It memento of the Wikipedia SOPA Protest.|
To investigate the cause of the missing splash page further, we requested that WebCite archive the current version of the Wikipedia blackout page on November 27th, 2013. The new memento does not capture the splash page, either:
|A screenshot of the WebCite memento of the current Wikipedia blackout page.|
Heritrix also created a memento of the current blackout page, on August 24th, 2013. This memento suffers from the same problem as the aforementioned mementos and does not capture the splash page:
|A screenshot of the Internet Archive memento of the current Wikipedia blackout page.|
|Google Chrome's developer console showing the resources requested and their associated response codes.|
Specifically, the robots.txt file preventing these resources from being archived is:
The robots.txt file is archived, as well:
The blackout image is available in the Internet Archive, but the mementos in the Wayback Machine do not attempt to load it:
--Justin F. Brunelle
Thursday, November 21, 2013
Last weekend (Nov 14-17), I was honored to give a keynote at the Southeast Women in Computing Conference (SEWICC), located at the beautiful Lake Guntersville State Park in north Alabama. The conference was organized by Martha Kosa and Ambareen Siraj (Tennessee Tech University), and Jennifer Whitlow (Georgia Tech).
Videos from the keynotes and pictures from the weekend will soon be posted on the conference website. (UPDATE 1/24/14: Flickr photostream and links to keynote videos added.)
The 220+ attendees included faculty, graduate students, undergraduates, and even some high school students (and even some men!).
During her talk, she pointed out that women's participation != women's interest. She had some statistics showing that in 1970 the percentages of women in law school (5%), business school (4%), medical school (8%), and high school sports (4%) were very low. Then she contrasted that with data from 2005: law school (48%), business school (45%), medical school (50%), and high school sports (42%). The goal was to counter the frequent comment that "Oh, women aren't in computing and technology because they're just not interested."
She also listed qualities that might indicate that you have impostor syndrome. From my discussions with friends and colleagues, it's very common among women in computing and technology. (I've heard that there are a few men who suffer from this, too!) Here's the list:
- Do you secretly worry that others will find out that you're not as bright/capable as they think you are?
- Do you attribute your success to being a fluke or "no big deal"?
- Do you hate making a mistake, being less than fully prepared, or not doing things perfectly?
- Do you believe that others are smarter and more capable than you are?
I talked a bit about web archiving in general and then described Yasmin AlNoamany's PhD work on using the archives for storytelling. The great part for me was to be able to introduce the Internet Archive and the Wayback Machine to lots of people. I got several comments from both students and faculty afterwards with ideas of how they would incorporate the Wayback Machine in their work and studies. (Watch the video)
After my talk, I attended a session on Education. J.K. Sherrod and Zach Guzman from Pellissippi State Community College in Knoxville, TN presented "Using the Raspberry Pi in Education". They had been teaching cluster computing using Mac Minis, but it was getting expensive, so they started purchasing Raspberry Pi devices (~$35) for their classes. The results were impressive. Since the devices run a full version of Linux, they were even able to implement a Beowulf cluster.
I followed this up by attending a panel "Being a Woman in Technology: What Does it Mean to Us?" The panelists and audience discussed both positive connotations and challenges to being a woman in technology. This produced some amazing stories, including one student who related being told by a professor that she was no good at math and was "a rotten mango".
You're not a rotten mango #sewic2013 pic.twitter.com/8i7OTBttDY
— Shannon Wood (@Shannonanagains) November 16, 2013
After lunch, several students presented 5 minute lightning talks on strategies for success in school and life. It was great to see so many students excited to share their experiences and lessons learned.
Valentina Salapura, from IBM TJ Watson, spoke on "Cloud Computing 24/7". After telling her story and things she learned along the way (including a snapshot from the Wayback Machine of her former academic webpage), she described the motivation and promise of cloud computing.
Sunday was the last day, and I attended a talk by Ruthe Farmer, Director of Strategic Initiatives at NCWIT, on "Research and Opportunities for Women in Technology". The National Center for Women & Information Technology was started in 2004 and is a non-profit that partners with corporations, academic institutions, and other agencies with the goal of increasing women's participation in technology and computing. One of their slogans is "50:50 by 2020". There's a wealth of information and resources available on the NCWIT website (including the NCWIT academic alliance and Aspirations in Computing program).
Ruthe described the stereotype threat that affects both women and men. This is the phenomenon where awareness of negative stereotypes associated with a peer group can inhibit performance. She described a study in which a group of white men from Stanford were given a math test. Before the test, one set of students was reminded of the stereotype that Asian students outperform Caucasian students in math, and the other set was not. The stereotype-threatened test takers performed worse than the control set.
Sit With Me is a promotion by NCWIT to recognize the role of women in computing. "We sit to recognize the value of women's technical contributions. We sit to embrace women's important perspectives and increase their participation."
All in all, it was a great weekend. I drank lots of sweet tea, heard great southern accents that reminded me of home (Louisiana), and met amazing women from around the southeast, including students and faculty from Trinity University (San Antonio, TX), Austin Peay University, Georgia Tech, James Madison University (Virginia), Tennessee Tech, Pellissippi State Community College (Knoxville, TN), Murray State University (Kentucky), NC State, Rhodes College (Memphis, TN), Univ of Georgia, Univ of Tennessee, and the Girls Preparatory School (Chattanooga, TN).
.@Conservatives put speeches in Streisand's house: http://t.co/6aRiOsHwxO @UKWebArchive: http://t.co/BGD3tYavEx via @lljohnston @hhockx
— Michael L. Nelson (@phonedude_mln) November 13, 2013
Circulating the web last week was the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below). Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving their archived versions of the pages (see Note 2 below for a discussion of robots.txt, as well as for an update about availability in the Internet Archive). But even though the Internet Archive allows site owners to redact pages from its archive, mementos of the pages likely exist in other archives. Yes, the Internet Archive was the first web archive and is still by far the largest, with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details).
Consider this randomly chosen 2009 speech:
Right now it produces a custom 404 page (see Note 3 below):
Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):
So it seems clear that this speech will not disappear down a memory hole. But how do you discover these copies in these archives? Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework. If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.
If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):
The nice thing about the multi-archive access of Memento is as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one-at-a-time for a particular URI.
We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:
- Internet Archive: crawl everything
- Archive-It: collections defined by subscribers
- UK Web Archive: archive all UK websites (conservatives.com is a UK web site even though it is not in the .uk domain)
- Archive.is: archives individual pages on user request
-- Michael and Herbert
Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt. For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013). "We've always been at war with Eastasia."
Note 2: While this blog post was being written, the Internet Archive stopped blocking access to this speech (and presumably the others). Here is the raw HTTP of the speech being blocked (the key is the "X-Archive-Wayback-Runtime-Error:" line):
But access was restored within the three hours before I could generate a screenshot:
Why was it restored? Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?). The 08:36:36 version of robots.txt has:
Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/
But the 18:10:19 version has:
Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/
These "Disallow" rules no longer match the URI of the original speech. I guess the Internet Archive cached the old disallow rules, and they expired just now, one week later. See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.
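To see concretely why the new rules no longer match, one can feed both versions of the rules to Python's standard robots.txt parser. The speech URI below is hypothetical (it merely follows the /News/News_stories/2009/ pattern of the redacted speeches), and a "User-agent: *" line is assumed, since the parser needs an agent line to group the rules under:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical URI following the pattern of the redacted speeches:
speech = "http://www.conservatives.com/News/News_stories/2009/01/some_speech.aspx"

# The 08:36:36 rules (the "User-agent: *" line is an assumption):
old_rules = """\
User-agent: *
Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/
"""

# The 18:10:19 rules:
new_rules = """\
User-agent: *
Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/
"""

def allowed(rules, url):
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("ia_archiver", url)

print(allowed(old_rules, speech))  # False: the old rules block the speech
print(allowed(new_rules, speech))  # True: the new rules no longer match it
```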
The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:
Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate. Instead of returning a 404 at the original URI (as Apache does), it 302-redirects to another URI that returns the 404:
See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.
Tuesday, November 19, 2013
"Follow your nose" simply means that when a client dereferences a URI, the entity that is returned is responsible for providing a set of links that allows the user agent to transition to the next state. This is standard procedure in HTML: you follow links that guide you through an online transaction (e.g., ordering a book from Amazon) all the time -- it is so obvious you don't even think about it. Unfortunately, we don't hold our APIs to the same standard; non-human user agents are expected to make state transitions based on all kinds of out-of-band information encoded in the applications. When there is a change in the states and the transitions between them, there is no direct communication of that change to the application, and it simply breaks.
I won't dwell on REST because most APIs get the noun/verb thing right, but we seem to be losing ground on HATEOAS (and technically, HATEOAS is a constraint on REST and not something "in addition to REST", but I'm not here to be the REST purity police).
There are probably many good descriptions of HATEOAS, and I apologize if I've left your favorite out, but these are the two that I use in my Web Server Design course (RESTful design isn't the goal of the course, but more like a side benefit). Yes, you could read a book about REST, but these two slide decks will get you there in minutes.
The first is from Jim Webber, entitled "HATEOAS: The Confusing Bit from REST". There is a video of Jim presenting these slides as well as a white paper about it (note: the white paper is less correct than the slides when it comes to things like specific MIME types). He walks you through a simple but illustrative (HTTP) RESTful implementation of ordering coffee. If the user agent knows the MIME type "applications/vnd.restbucks+xml" (and the associated Link rel types), then it can follow the embedded links to transition from state to state. And if you don't know how to do the right thing (tm) with this MIME type, you should stop what you're doing.
It seems like the Twitter API is moving away from HATEOAS. Brian Mulloy has a nice blog post about this (from which I took the image at the top of this post). The picture nicely summarizes that from an HTML representation of a Tweet there are all manner of state transitions available, but the equivalent json representation is effectively a dead-end; the possible state transitions have to be in the mind of the programmer and encoded in the application. Their API returns MIME types of "application/json" just like 1000 other APIs and it is up to your program to sort out the details. Twitter's 1.1 API, with things like removing support for RSS, is designed for lock-in and not abstract ideals like HATEOAS. Arguably all the json APIs, with their ad-hoc methods for including links and uniform MIME type, are a step away from HATEOAS (see the stackoverflow.com discussion).
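To make the contrast concrete, here is a sketch of what a hypermedia-friendly JSON representation might look like, loosely in the HAL style that Mulloy and others advocate (the Restbucks-flavored payload and URIs are entirely made up): the client discovers its next state transition from the representation itself rather than from knowledge hard-coded in the application.

```python
import json

# A made-up, HAL-style payload: the representation itself advertises
# the legal state transitions (all URIs are invented).
response = json.loads("""
{
  "order": {"drink": "latte", "status": "payment-expected"},
  "_links": {
    "self":    {"href": "http://restbucks.example/order/1234"},
    "payment": {"href": "http://restbucks.example/payment/1234"}
  }
}
""")

# The client follows the advertised link relation to reach the next
# state; it never builds the payment URI from out-of-band knowledge.
next_uri = response["_links"]["payment"]["href"]
print(next_uri)  # http://restbucks.example/payment/1234
```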
The second presentation also addresses a pet peeve of mine: API deprecation (e.g., my infatuation with Topsy has tempered since they broke all the links I had created -- grrr.). The presentation "WHY HATEOAS: A simple case study on the often ignored REST constraint" by Wayne Lee walks through a proper way to define your API with an eye toward new versions, feature evolution, etc.
Again, I'm sure there are many other quality REST and HATEOAS resources but I wanted to gather the couple that I had found useful into one place and not just have them buried in class notes. Apologies for being about five years late to the party.
Wednesday, November 13, 2013
On November 12, I attended the 2013 Archive-It Partner Meeting in Salt Lake City, Utah, our research group's second year of attendance (see 2012 Trip Report). The meeting started off casually at 9am with breakfast and registration. Once everyone was settled, Kristine Hanna, the Director of Archiving Services at the Internet Archive, introduced her team members present at the meeting. Kristine acknowledged the fire at the Internet Archive last week and the extent of the damage. "It did burn to the ground but thankfully, nobody was injured." She reminded the crowd of partners to review Archive-It's storage and preservation policy and mentioned the redundancies in place, including a soon-to-be mirror at our very own ODU. Kristine then mentioned news of a new partnership with Reed Technologies to jointly market and sell Archive-It (@archiveitorg). She reassured the audience that nothing would change beyond having more resources for them to accomplish their goals.
Kristine then briefly mentioned the upcoming release of Archive-It 5.0, which would be discussed in depth in a later presentation. She asked everyone in the room (of probably 50 or so attendees) to introduce themselves and state their affiliation. With the intros out of the way, the presentations began.
Kate Legg of the National Center for Atmospheric Research (NCAR) presented "First steps toward digital preservation at NCAR". She started by saying that NCAR is a federally funded research and development center (FFRDC) whose mission is to "preserve, maintain and make accessible records and materials that document the history and scientific contributions of NCAR". With over 70 collections and 1500 employees, digital preservation is on the organization's radar. Their plan, despite a small library and staff, is to accomplish this alongside other competing priorities.
"Few people were thinking about the archives for collecting current information," Kate said, describing how some in the organization did not understand that preserving now creates archives for later. "The archive is not just where old stuff goes, but new stuff as well." One of the big obstacles for the organization's archiving initiatives has been funding. Even with this limitation, however, NCAR was still able to subscribe to Archive-It through a low-level subscription. With this subscription, they started to preserve their Facebook group but increasingly found huge amounts of data, including videos, that they felt were too resource-heavy to archive. The next step for the initiative is to add a place on the organization's webpage where archived content will be accessible to the public.
Jaime McCurry (@jaime_ann) of the Folger Shakespeare Library followed Kate with "The Short and the Long of It: Web Archiving at the Folger Shakespeare Library". Jaime is currently participating in the National Digital Stewardship Residency, where her goal is to establish local routines and best practices for archiving and preserving the library's born-digital content. They currently have two collections with over 6 million documents (over 400 gigabytes of data), collected to preserve content on the web relating to the works of Shakespeare (particularly in social media and from festivals). In trying to describe the large extent of the available content, Jaime said, "In trying to archive Shakespeare's presence on the web, you really have to define what you're looking for. Shakespeare is everywhere!" She noted that one of the first things she realized when she started on the project at Folger was that nobody knew the organization was performing web archiving, so she wished to establish an organization-wide web archiving policy. One of the recent potential targets of her archiving project was the NYTimes' Hamlet contest, wherein the newspaper suggested Instagram users create 15-second clips of their interpretation of a passage from the play. Because this relates to Shakespeare, it would be an appropriate target for the Folger Shakespeare Library.
After Jaime finished, Sharon Farnel of the University of Alberta began her presentation "Metadata workflows for web archiving – thinking and experimenting with 'metadata in bulk'". In her presentation she referenced Blacklight, an open source project that provides a discovery interface for any Solr index via a customizable, template-based user interface. For her collection, from the context of metadata, she wished to think about where and why discovery of content takes place in web archiving. She utilized a mixed model wherein entries might have MARC records, Dublin Core data, or both. Sharon emphasized that metadata is an important part of Archive-It's functionality. To better parse the data, her group created XSLT stylesheets to export the data into a more interoperable format like Excel, from which it could be imported back into Blacklight after manipulation. She referenced some of the difficulties in working with the different technologies but said, "None of these tools were a perfect solution on their own but by combining the tools in-house, we can get good results with the metadata."
After a short break (to caffeinate), Abbie Grotke (@agrtoke) of the Library of Congress remotely presented "NDSA Web Archiving Survey Update". In her voice-over presentation from DC, she gave preliminary results of the NDSA Web Archiving Survey, stating that this NDIIPP program initiative had yielded about 50 respondents so far. For the most part, the biggest concern about web archiving reported by the survey participants was database preservation, followed by social media and video archiving. She stated that the survey is still open and encouraged attendees to take it (Take it here).
Trevor Alvord of Brigham Young University was next with "A Muddy Leak: Using the ARL Code of Best Practices in Fair Use with Web Archiving". His effort with the L. Tom Perry Special Collections at BYU is to build a thematic collection based on Mormonism. He illustrated that many battles over digital preservation and content rights have been fought and won (e.g., Perfect 10 v. Google and Students v. iParadigms), so his collection should be justified based on the premises in those cases. "Web archiving is routinely done by the two wealthiest corporations (Google and Microsoft)", he quoted Jonathan Band, a recognized figure in the lawsuits versus Google. "In the last few months, libraries have prevailed," Trevor said. "Even with our efforts, we have not received any complaints about their website being archived by libraries."
Trevor then went on to describe the problem with his data set, alluding to the Teton Dam flooding: millions of documents are being produced about Mormonism, and now he has to capture whatever he can. This is partially due to the lowering of the minimum age for missionaries and the Mormon church's encouragement of young Mormons to post online. He showed two examples of Mormon "mommy" bloggers: Peace Love Lauren, a very small-impact blog, and NieNie Dialogs, a very popular one. He asked the audience, "How do you prioritize what content to archive, given that popular content is more important but also more likely to be preserved?"
Following Trevor, Vinay Goel of the Internet Archive presented "Web Archive Analysis". He started by saying that "Most partners access Archive-It via the Wayback Machine", with other methods being the Archive-It search service or downloading the archival contents. He spoke of de-duplication and how it is represented in WARCs via a revisit record. The core of his presentation covered the various WARC meta formats: Web Archive Transformation (WAT) files and CDX files, the format used for WARC indexing. "WAT files are WARC metadata records," he said. "CDX files are space-delimited text files that record where a file resides in a WARC and its offset." Vinay has come up with an analysis toolkit that allows researchers to express questions they want to ask about the archives in a high-level language that is then translated to a low-level language understandable by an analysis system. "We can capture YouTube content," he said, giving an example use case, "but the content is difficult to replay." Some of the analysis information he displayed identified this non-replayable content in the archives and showed the in-degree and out-degree of each resource. Further, his toolkit is useful in studying how this linking behavior changes over time.
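As a sketch of the CDX format Vinay described, here is a made-up line in the common 11-field layout and how one might pull out the fields needed to locate a capture inside a WARC file (the field order follows the usual CDX header; the digest and filename are invented):

```python
# A made-up CDX line in the common 11-field layout: urlkey, timestamp,
# original URI, MIME type, status code, digest, redirect, meta tags,
# compressed length, WARC offset, and WARC filename.
line = ("org,example)/page 20120118050348 http://example.org/page "
        "text/html 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 2153 843 "
        "EXAMPLE-20120118.warc.gz")

fields = line.split()
record = {
    "urlkey":    fields[0],
    "timestamp": fields[1],
    "original":  fields[2],
    "mimetype":  fields[3],
    "status":    fields[4],
    "offset":    int(fields[9]),   # byte offset of the record in the WARC
    "filename":  fields[10],       # which WARC file holds the record
}
print(record["original"], record["offset"], record["filename"])
```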
The crowd then broke for lunch, only to return to Scott Reed (@vector_ctrl) of the Internet Archive presenting the new features in the next iteration of Archive-It, 5.0. The new system, among other things, allows users to create test crawls and is better at social media archiving. Among the goals to be implemented before the end of the year are getting the best capture and displaying the capture in currently existing tools. Scott mentioned an effort by Archive-It to utilize phantomjs (with which we're familiar at WS-DL through our experiments) in a feature they're calling "Ghost". Further, the new version promises to have an API. Along with Scott, Maria LaCalle spoke of a survey about the current version of Archive-It, and Solomon Kidd spoke of work on user interface refinements for the upcoming system.
After I finished my presentation, the final presentation of the day was by Debbie Kempe of The Frick Collection and Frick Art Reference Library, with "Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art Resources". In her presentation, she stated that there is a broad overlap of art among the Brooklyn Museum, the Museum of Modern Art, and the Frick Art Reference Library. Citing Abbie Grotke's survey from earlier, she reminded the audience that no museums responded to the survey, which is problematic for evaluating their archiving needs. "Not all information is digital in the art community," Debbie said. In initiating the archiving effort, the question for the museums' organizers was not so much why or how web archiving of their content should be done but rather, "Who will do it?" and "How will we pay for it?" She ran a small experiment in accomplishing the preservation tasks of the museum and is now running a longer "experiment", given that more of the content in their collections is being created digitally and less in print. In the longer trial, she hopes to test and formulate a sustainable workflow, including re-skilling and organizational changes.
After Debbie, the crowd broke into Birds of a Feather sessions to discuss web archiving issues of interest to each individual; I joined a group on "Capture", given my various software projects relating to the topic. After the BoF session, Lori Donovan and Kristine Hanna adjourned the meeting to a reception.
Overall, I felt the trip to Utah to meet with a group with a common interest was a unique experience that I don't get at other conferences, where some of the audience's focuses are disjoint from one another. The feedback I received on my research and the discussions I had with various attendees were extremely valuable in learning how the Archive-It community works, and I hope to attend again next year.
— Mat (@machawk1)