Thursday, December 14, 2017

2017-12-14: Storify Will Be Gone Soon, So How Do We Preserve The Stories?

The popular storytelling service Storify will be shut down on May 16, 2018. Storify has been used by journalists and researchers to create stories about events and topics of interest. It has a wonderful interface, shown below, that allows one to insert text as well as social cards and other content from a variety of services, including Twitter, Instagram, Facebook, YouTube, Getty Images, and of course regular HTTP URIs.
This screenshot displays the Storify editing Interface.
Storify is also used by news sources to build and publish stories about unfolding events, as seen below for the Boston NPR station WBUR.
Storify is used by WBUR in Boston to convey news stories.
It is also the visualization platform used for summarizing Archive-It collections in the Dark and Stormy Archives (DSA) Framework, developed by WS-DL members Yasmin AlNoamany, Michele Weigle, and Michael Nelson. In a previous blog post, I covered why this visualization technique works and why many other tools fail to deliver it effectively. An example story produced by the DSA is shown below.
This Storify story summarizes Archive-It Collection 2823 about a Russian plane crash on September 7, 2011.

Ian Milligan provides an excellent overview of the importance of Storify and the issues surrounding its use. Storify stories have been painstakingly curated and the aggregation of content is valuable in and of itself, so before Storify disappears, how do we save these stories?

Saving the Content from Storify



Manually


Storify does allow a user to save their own content, one story at a time. Once you've logged in, you can perform the following steps:
1. Click on My Stories
2. Select the story you wish to save
3. Choose the ellipsis menu from the upper right corner
4. Select Export
5. Choose the output format: HTML, XML, or JSON

Depending on your browser and its settings, the resulting content may display in the browser or a download dialog may appear. The URIs for each file format follow a pattern. In our example story above, the slug is 2823spst0s and the account name is ait_stories. The different formats for our example story reside at the following URIs.
  • JSON file format: https://api.storify.com/v1/stories/ait_stories/2823spst0s
  • XML file format: https://storify.com/ait_stories/2823spst0s.xml
  • Static HTML file format: https://storify.com/ait_stories/2823spst0s.html
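Given the account name and slug, fetching all three formats can be scripted. Below is a minimal sketch using Python and requests, assuming the URI patterns above hold for other stories; the output file names are my own choice.

import requests

account = "ait_stories"   # replace with the account name
slug = "2823spst0s"       # replace with the story slug

formats = {
    "json": "https://api.storify.com/v1/stories/{account}/{slug}",
    "xml": "https://storify.com/{account}/{slug}.xml",
    "html": "https://storify.com/{account}/{slug}.html",
}

for ext, template in formats.items():
    uri = template.format(account=account, slug=slug)
    response = requests.get(uri)
    response.raise_for_status()
    # save each export as <slug>.<ext>, e.g., 2823spst0s.json
    with open("{0}.{1}".format(slug, ext), "wb") as f:
        f.write(response.content)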
With the slug and the account name, anyone can save any public story. Private stories, however, can only be saved by their owner. What if we do not know the slugs of all of our stories? What if we want to save someone else's stories?

Using Storified From DocNow


For saving the HTML, XML, and JSON formats of Storify stories, Ed Summers, creator of twarc, has created the storified utility as part of the DocNow project. Using this utility, one can save the public stories from any Storify account in all three formats. I used the utility to save the stories from the DSA's own ait_stories account. After ensuring I had Python and pip installed, I was able to install and use the utility as follows:
  1. git clone https://github.com/DocNow/storified.git
  2. pip install requests
  3. cd storified
  4. python ./storified.py ait_stories # replace ait_stories with the name of the account you wish to save
Update: Ed Summers mentions that one can now run pip install storified, replacing these installation steps. One then only needs to run storified.py ait_stories, again replacing ait_stories with the name of the account you wish to save.

Storified creates a directory with the given account name containing sub-directories named after each story's slug. For our Russia Plane crash example, I have the following:
~/storified/ait_stories/2823spst0s % ls -al
total 416
drwxr-xr-x   5 smj  staff    160 Dec 13 16:46 .
drwxr-xr-x  48 smj  staff   1536 Dec 13 16:47 ..
-rw-r--r--   1 smj  staff  58107 Dec 13 16:46 index.html
-rw-r--r--   1 smj  staff  48440 Dec 13 16:46 index.json
-rw-r--r--   1 smj  staff  98756 Dec 13 16:46 index.xml
I compared the content produced by the manual process above with the output from storified; there are slight differences in metadata between the authenticated manual export and the anonymous export generated by storified. In the JSON export, only the last-seen dates and view counts differ. The XML and HTML exports also have small differences, such as <canEdit>false</canEdit> in the storified version versus <canEdit>true</canEdit> in the manual export. These small differences are likely due to the fact that I had to authenticate to export the story content manually, whereas storified works anonymously. The content of the actual stories, however, is the same. I have created a GitHub gist showing the different exported content.

Update: Nick B pointed out that the JSON files — and only the JSON files — generated either by manual export or via the storified tool are incomplete. I have tested his assertion with our example story (2823spst0s) and can confirm that the JSON files only contain the first 19 social cards. To acquire the rest of the metadata about a story collection in JSON format, one must use the Storify API. The XML and static HTML outputs do contain data for all social cards and it is just the JSON export that appears to lack completeness. Good catch!

Using storified, I was able to extract and save our DSA content to Figshare for posterity. Figshare provides persistence as part of its work with the Digital Preservation Network, and used CLOCKSS prior to March 2015.

That covers extracting the base story text and structured data, but what about the images and the rest of the experience? Can we use web archives instead?

Using Web Archiving on Storify Stories



Storify stories are web resources, so how well can they be archived by web archives? Using our example Russia plane crash story, with a screenshot shown below, I submitted its URI to several web archiving services and then used the WS-DL memento damage application to compute the memento damage of each resulting memento.
A screenshot of our example Storify story, served from storify.com.

A screenshot of our Storify story served from the Internet Archive, after submission via the Save Page Now Utility.
A screenshot of our Storify story served from archive.is.

A screenshot of our Storify story served from webrecorder.io.


A screenshot of our Storify story served via WAIL version 1.2.0-beta3.
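The scores in the table below were obtained from that service. For reference, a score can also be requested from its REST API (documented in the Memento-Damage post later on this page); a minimal sketch, with a placeholder memento URI:

import requests

urim = "https://web.archive.org/web/..."  # replace with the memento (URI-M) to evaluate
api = "http://memento-damage.cs.odu.edu/api/damage/"

response = requests.get(api + urim)
result = response.json()
# The JSON response should include the computed damage value, among other details.
print(result)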
Platform (Memento Damage Score) and Visual Inspection Comments:

Original Page at Storify (0.002)
  • All social cards complete
  • Views Widget works
  • Embed Widget works
  • Livefyre Comments widget is present
  • Interactive Share Widget contains all images
  • No visible pagination animation

Internet Archive with Save Page Now (0.053)
  • Missing the last 5 social cards
  • Views Widget does not work
  • Embed Widget works
  • Livefyre Comments widget is missing
  • Interactive Share Widget contains all images
  • Pagination animation runs on click and terminates with errors

Archive.is (0.000)
  • Missing the last 5 social cards
  • Views Widget does not work
  • Embed Widget does not work
  • Livefyre Comments widget is missing
  • Interactive Share Widget is missing
  • Pagination animation is replaced by "Next Page" which goes nowhere

Webrecorder.io (0.051*)
  • Missing the last 5 social cards, but can capture all with user interaction while recording
  • Views Widget works
  • Embed Widget works
  • Livefyre Comments widget is missing
  • Interactive Share Widget contains all images
  • No visible pagination animation

WAIL (0.025)
  • All social cards complete
  • Views Widget works, but is missing downward arrow
  • Embed Widget is missing images, but otherwise works
  • Livefyre Comments widget is missing
  • Interactive Share Widget is missing images
  • Pagination animation runs and does not terminate


Of these platforms, Archive.is has the lowest memento damage score, but in this case the memento damage tool has been misled by how Archive.is produces its content. Because Archive.is takes a snapshot of the DOM at the time of capture and does not preserve the JavaScript on the page, it scores low on memento damage, yet it has no functional interactive widgets and is missing the last 5 social cards on the page. The memento damage tool crashed while trying to provide a damage score for Webrecorder.io; its score has been extracted from logging information.

I visually evaluated each platform for the authenticity of its reproduction of the interactivity of the original page. I did not expect functions that relied on external resources to work, but I did expect menus to appear and images to be present when interacting with widgets. In this case, Webrecorder.io produces the most authentic reproduction, missing only the Livefyre comments widget. Storify stories, however, do not display the entire story at load time; once a user scrolls down, JavaScript retrieves the additional content. Webrecorder.io will not acquire this additional paged content unless the user scrolls the page manually while recording.

WAIL, on the other hand, retrieved all of the social cards. Even though it failed to capture some of the interactive widgets, it does not, unlike Webrecorder.io, require any user interaction once seeds are inserted. On playback, however, it still displays the animated pagination widget, as seen below, misleading the user into believing that more content is loading.
A zoomed in screenshot from WAIL's playback engine with the pagination animation outlined in a red box.


WAIL also has the capability of crawling the web resources linked from the social cards themselves, making it a suitable choice if linked content is more important than complete authentic reproduction.

The most value comes from the social cards and the text of the story, not the interactive widgets. Rather than using the story URIs themselves, one can avoid the pagination problems at page load by archiving the static HTML version of the story mentioned above: use https://storify.com/ait_stories/2823spst0s.html rather than https://storify.com/ait_stories/2823spst0s. I have tested the static HTML URIs in all tools and discovered that all social cards were preserved.
The static HTML page version of the same story, missing interactive widgets, but containing all story content.
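One way to do this at scale is to submit the static HTML URIs to the Internet Archive's Save Page Now endpoint. A minimal sketch, where the list of URIs is assumed to come from a prior storified run:

import requests

static_uris = [
    "https://storify.com/ait_stories/2823spst0s.html",
    # add the static HTML URI for each account/slug pair to be archived
]

for uri in static_uris:
    # Save Page Now accepts the target URI appended to /save/
    response = requests.get("https://web.archive.org/save/" + uri)
    print("{0}: {1}".format(uri, response.status_code))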

Unfortunately, other archived content probably did not link to the static HTML version. Because of this, someone browsing a web archive's collection who followed a link intended to reach a Storify story would not see it, even though the static HTML version may have been archived. In other words, web archives would not know to canonicalize https://storify.com/ait_stories/2823spst0s.html and https://storify.com/ait_stories/2823spst0s.

Summary



As with most preservation, the goal of the archivist needs to be clear before attempting to preserve Storify stories. Using the manual method or DocNow's storified, we can save the information needed to reconstruct the text of the social cards and the other text of the story, but the images and interactive content will be missing. Aiming web archiving platforms at the Storify URIs, we can archive some of the interactive functionality of Storify with some degree of success, but with loss of story content due to automated pagination.

For the purposes of preserving the visualization that is the story, I recommend using a web archiving tool to archive the static HTML version, which will preserve the images and text as well as the visual flow of the story so necessary for successful storytelling. I also recommend performing a crawl to preserve not only the story, but the items linked from the social cards. Keep in mind that web pages likely link to the Storify story URI and not its static HTML URI, hampering discovery within large web archives.

Even though we can't save Storify the organization, we can save the content of Storify the web site.

-- Shawn M. Jones


Updated on 2017/12/14 at 3:30 PM EST with note about pip install storified thanks to Ed Summers' feedback.

Updated on 2017/12/15 at 11:20 PM EST with note about the JSON formatted export missing stories thanks to Nick B's feedback.

Monday, December 11, 2017

2017-12-11: Difficulties in timestamping archived web pages

Figure 1: A web page from nasa.gov is archived
 by Michael's Evil Wayback in July 2017.
Figure 2: When visiting the same archived page in October 2017,
we found that the content of the page has been tampered with. 
The 2016 Survey of Web Archiving in the United States shows an increasing trend of using public and private web archives in addition to the Internet Archive (IA). Because of this trend, we should consider the question of the validity of archived web pages delivered by these archives.
Let us look at an example where the important web page https://climate.nasa.gov/vital-signs/carbon-dioxide/, which keeps a record of the carbon dioxide (CO2) level in the Earth's atmosphere, is captured by a private web archive, "Michael's Evil Wayback", on July 17, 2017 at 18:51 GMT. At this time, as Figure 1 shows, the CO2 level was 406.31 ppm.
When revisiting the same archived page in October 2017, we should be presented with the same content. Surprisingly, the CO2 level had changed to 270.31 ppm, as Figure 2 shows. So which one is the "real" archived page?
We can simply detect that the content of an archived web page has been modified by generating a cryptographic hash value over the returned HTML code. For example, the following command will download the web page https://climate.nasa.gov/vital-signs/carbon-dioxide/ and generate a SHA-256 hash value of its HTML content:
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
b87320c612905c17d1f05ffb2f9401ef45a6727ed6c80703b00240a209c3e828  -
The next figure illustrates how the simple approach of generating hashes can detect any tampering with the content of archived pages. In this example, the "black hat" in the figure (i.e., Michael's Evil Wayback) has changed the CO2 value to a lower one (i.e., in favor of individuals or organizations who deny that CO2 is one of the main causes of global warming).
Another possible solution for validating archived web pages is timestamping. If a trusted timestamp is issued on an archived web page, anyone should be able to verify that a particular representation of the web page existed at a specific time in the past.
As of today, many systems, such as OriginStamp and OpenTimestamps, offer a free-of-charge service to generate trusted timestamps of digital documents in blockchain networks, such as Bitcoin. These tools perform multiple steps to create a timestamp. One of these steps requires computing a hash value that represents the content of the resource (e.g., with the cURL command above). Next, this hash value is converted to a Bitcoin address, and a Bitcoin transaction is made where one of the two sides of the transaction (i.e., the source or the destination) is the newly generated address. Once the transaction is approved by the blockchain, its creation datetime is considered to be a trusted timestamp. Shawn Jones describes in "Trusted Timestamping of Mementos" how to create trusted timestamps of archived web pages using blockchain networks.
In our technical report, "Difficulties of Timestamping Archived Web Pages", we show that generating trusted timestamps of archived web pages is not an easy task, for several reasons. The main reason is that a hash value calculated on the content of an archived web page (i.e., a memento) should be repeatable. That is, we should always obtain the same hash value each time we retrieve the memento. In addition to describing those difficulties, we introduce some requirements that must be fulfilled in order to generate repeatable hash values of mementos.
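The repeatability problem is easy to observe: request the same memento twice, hash the returned HTML, and compare. A minimal sketch, with a placeholder URI-M:

import hashlib
import requests

urim = "https://web.archive.org/web/..."  # replace with the memento (URI-M) to be timestamped

digests = set()
for _ in range(2):
    response = requests.get(urim)
    digests.add(hashlib.sha256(response.content).hexdigest())

# If the memento's hash were repeatable, this set would contain exactly one digest;
# in practice, archive-added banners and other transformations may make the digests differ.
print(digests)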

--Mohamed Aturban


Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, "Difficulties of Timestamping Archived Web Pages." 2017. Technical Report. arXiv:1712.03140.

Monday, December 4, 2017

2017-12-03: Introducing Docker - Application Containerization & Service Orchestration


For the last few years, Docker, the application containerization technology, has been gaining a lot of traction in the DevOps community, and lately it has made its way into academia and the research community as well. I have been following it since its inception in 2013, and for the last couple of years it has been a daily driver for me. At the same time, I have been encouraging my colleagues to use Docker in their research projects. As a result, we are gradually moving away from one virtual machine (VM) per project to a swarm of nodes running containers for various projects and services. If you have accessed MemGator, CarbonDate, Memento Damage, Story Graph, or some other WS-DL services lately, you have been served from our Docker deployment. We even have an on-demand PHP/MySQL application deployment system using Docker for the CS418 - Web Programming course.



Last summer, Docker Inc. selected me as the Docker Campus Ambassador for Old Dominion University. While I had already given some Docker talks to more focused groups, with the campus ambassador hat on I decided to organize an event from which grads and undergrads of the Computer Science department at large could benefit.


The CS department accepted it as a colloquium, scheduled for Nov 29, 2017. We were anticipating about 50 participants, but many more showed up. The increasing interest of students in containerization technology can be taken as an indicator of its usefulness, and perhaps it should be included in some courses offered in the future.


The session lasted for a little over an hour. It started with some slides motivating the topic with a Dockerization story and a set of problems that Docker can potentially solve. The slides then introduced some basics of Docker and further illustrated how a simple script can be packaged into an image and distributed using DockerHub. The presentation was followed by a live demo of the step-by-step evolution of a simple script into a multi-container application using a micro-service architecture, demonstrating various aspects of Docker in each step. Finally, the session was opened for questions and answers.


For the purpose of illustration, I prepared an application that scrapes a given web page and extracts the links from it. The demo code has folders for the various steps as it progresses from a simple script to a multi-service application stack. Each folder has a README file to explain the changes from the previous step and instructions to run the application. The code is available on GitHub. Following is a brief summary of the demo.

Step 0

Step 0 has a simple linkextractor.py Python script (a sketch of which is shown below) that accepts a URL as an argument and prints out all the hyperlinks on the page.
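The original script is in the demo repository; a minimal sketch of such a link extractor, assuming requests and BeautifulSoup, looks roughly like this:

#!/usr/bin/env python

import sys

import requests
from bs4 import BeautifulSoup

if len(sys.argv) != 2:
    print("Usage: ./linkextractor.py <URL>")
    sys.exit(1)

# Fetch the page and print the href value of every anchor element
response = requests.get(sys.argv[1])
soup = BeautifulSoup(response.content, "html.parser")
for anchor in soup.find_all("a", href=True):
    print(anchor["href"])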


However, running this rather simple script might raise some of the following issues:

  • Is the script executable? (chmod a+x linkextractor.py)
  • Is Python installed on the machine?
  • Can you install software on the machine?
  • Is "pip" installed?
  • Are "requests" and "beautifulsoup4" Python libraries installed?

Step 1

Step 1 adds a simple Dockerfile to automate the installation of all the requirements and to build an isolated, self-contained image.
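The exact Dockerfile is in the demo repository; a sketch of what it might contain (the base image tag is an assumption):

# Start from an official Python base image so the interpreter and pip are present
FROM python:3

# Install the third-party libraries the script depends on
RUN pip install requests beautifulsoup4

# Copy the script into the image and make it the default command
COPY linkextractor.py /app/linkextractor.py
RUN chmod a+x /app/linkextractor.py

ENTRYPOINT ["/app/linkextractor.py"]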


Inclusion of this Dockerfile ensures that the script will run without any hiccups in a Docker container as a one-off command.

Step 2

Step 2 makes some changes to the Python script: 1) extracted paths are converted to full URLs, 2) both links and anchor texts are extracted, and 3) the main logic is moved into a function that returns an object, so that the script can be used as a module in other scripts.

This step illustrates that new changes in the code will not affect any running containers and will not impact an image that has already been built (unless it is overwritten). Building a new image with a different tag allows both versions to co-exist and be run as desired.

Step 3

Step 3 adds another Python file, main.py, that utilizes the module written in the previous step to expose link extraction as a web service API that returns a JSON response. The required libraries are extracted into a requirements.txt file. The Dockerfile is updated to accommodate these changes and to run the server by default rather than the one-off script.
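The actual main.py is in the demo repository; a minimal sketch, assuming Flask and the module factored out in Step 2 (the module and function names here are illustrative):

from flask import Flask, jsonify

from linkextractor import extract_links  # module from Step 2; names assumed

app = Flask(__name__)

@app.route("/api/<path:url>")
def api(url):
    # Return the extracted links (and anchor texts) as a JSON response
    return jsonify(extract_links(url))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)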

This step demonstrates how host and container ports are mapped to expose the service running inside a container.

Step 4

Step 4 moves all the code written so far for the JSON API into a separate folder to build an independent image. In addition, it adds a PHP file, index.php, in a separate folder that serves as a front-end application, which internally communicates with the Python API for link extraction. To glue these services together, a docker-compose.yml file is added (a sketch is shown below).
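The actual docker-compose.yml is in the demo repository; a sketch under the assumptions that the API code lives in ./api and the PHP front-end in ./www:

version: "3"

services:
  api:
    build: ./api            # the Python JSON API from Step 3
    ports:
      - "5000:5000"
  web:
    image: php:7-apache     # official image; the front-end code is mounted, not baked in
    ports:
      - "80:80"
    volumes:
      - ./www:/var/www/html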


This step demonstrates how multiple services can be orchestrated using Docker Compose. We did not create a custom image for the PHP application; instead, we demonstrated how the code can be mounted inside a container (in this case, a container based on the official php:7-apache image). This allows any modifications of the code to be reflected immediately inside the running container, which can be very handy in development mode.

Step 5

Step 5 adds another Dockerfile to build a custom image of the front-end PHP application. The Python API server is updated to utilize Redis for caching. Additionally, the docker-compose.yml file is updated to reflect changes in the front-end application (the "web" service block) and to include a Redis service based on its official Docker image.

This step illustrates how easy it is to progressively add components to compose a multi-container service stack. At this stage, the demo application architecture reflects what is illustrated in the title image of this post (the first figure).

Step 6

Step 6 completely replaces the Python API service component with an equivalent Ruby implementation. Some slight modifications are made in the docker-compose.yml file to reflect these changes. Additionally, a "logs" directory is mounted in the Ruby API service as a volume for persistent storage.

This step illustrates how easily any component of a micro-service architecture application stack can be swapped out with an equivalent service. Additionally, it demonstrates volumes for persistent storage so that containers can remain stateless.


The video recording of the session is made available on YouTube as well as on the colloquium recordings page of the department (the latter has more background noise). Slides and demo codes are made available under appropriate permissive licenses to allow modification and reuse.

Resources



--
Sawood Alam

Wednesday, November 22, 2017

2017-11-22: Deploying the Memento-Damage Service





Many web services, such as archive.is, Archive-It, the Internet Archive, and the UK Web Archive, have provided archived web pages, or mementos, for us to use. Nowadays, web archivists have shifted their focus from how to make a good archive to measuring how well the archive preserved the page. This raises the question of how to objectively measure the damage of a memento in a way that correctly emulates user (human) perception.

Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. Brunelle, in his IJDL paper (and the earlier JCDL version), describes how the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, and a second page could be missing a single embedded video (e.g., a YouTube page). Even though the first page is missing more resources, intuitively the second page is more damaged and less useful for users. The damage value ranges from 0 to 1, where a damage of 1 means the web page lost all of its embedded resources. Figure 1 gives an illustration of how this prototype works.

Figure 1. The overview flowchart of Memento Damage
Although this prototype has been proven capable of measuring the damage, it is not user-ready. Thus, we implemented a web service, called Memento-Damage, based on the prototype.

Analyzing the Challenges

Reducing the Calculation Time

As previously explained, the basic notion of damage calculation is mirroring human perception of a memento. Thus, we analyze the screenshot of the web page as a representation of how the page looks in the user's eyes. This screenshot analysis takes the most time of the entire damage calculation process.

The initial prototype was built on top of the Perl programming language and used PerlMagick to analyze the screenshot. This library dumps the color scheme (RGB) of each pixel in the screenshot into a file, which is then loaded by the prototype for further analysis. Dumping and reading the pixel colors of the screenshot take a significant amount of time, and this is repeated for each stylesheet the web page has. Therefore, if a web page has 5 stylesheets, the analysis is repeated 5 times even though it uses the same screenshot as its basis.

Simplifying the Installation and Making It Distributable
Before running the prototype, users are required to install all dependencies manually. The list of dependencies is not provided; users must discover it themselves by identifying the errors that appear during execution. Furthermore, we needed to 'package' and deploy this prototype into a ready-to-use and distributable tool that can be used widely in various communities. How? By providing 4 different ways of using the service: the website, the REST API, the Docker image, and the Python library, each described below.
Solving Other Technical Issues
Several technical issues that needed to be solved included:
  1. Handling redirection (status_code  = 301, 302, or 307).
  2. Providing some insights and information.
    The user will not only get the final damage value but will also be informed about the details of the crawling and calculation process, as well as the components that make up the final damage value. If an error happened, the error info will also be provided.
  3. Dealing with overlapping resources and iframes. 

Measuring the Damage

Crawling the Resources
When a user submits a URI-M to the Memento-Damage service, the tool checks the content-type of the URI-M and crawls all of its resources. The properties of the resources, such as the size and position of an image, are written into a log file. Figure 2 summarizes the crawling process conducted by Memento-Damage. Along with this process, a screenshot of the website is also created.

Figure 2. The crawling process in Memento-Damage

Calculating the Damage
After crawling the resources, Memento-Damage starts calculating the damage by reading the log files that were previously generated (Figure 3). Memento-Damage first reads the network log and examines the status_code of each resource. If a URI is redirected (status_code = 301 or 302), it chases down the final URI by following the URI in the Location header, as depicted in Figure 4. Each resource is then processed in accordance with its type (image, css, javascript, text, iframe) to obtain its actual and potential damage value. Then, the total damage is computed using the formula:
\begin{equation}\label{eq:tot_dmg}T_D = \frac{A_D}{P_D}\end{equation}
where:
     $T_D$ = Total Damage
     $A_D$ = Actual Damage
     $P_D$ = Potential Damage

The formula above can be further elaborated into:
$$ T_D = \frac{A_{D_i} + A_{D_c} + A_{D_j} + A_{D_m} + A_{D_t} + A_{D_f}}{P_{D_i} + P_{D_c} + P_{D_j} + P_{D_m} + P_{D_t} + P_{D_f}} $$
where each subscript denotes image (i), css (c), javascript (j), multimedia (m), text (t), and iframe (f), respectively.
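As a small illustration of this formula (not the actual Memento-Damage code), the aggregation could be expressed as:

# A sketch of the total damage computation; the per-type values below are placeholders.
RESOURCE_TYPES = ("image", "css", "javascript", "multimedia", "text", "iframe")

def total_damage(actual, potential):
    """actual and potential map each resource type to its damage value."""
    A_D = sum(actual.get(t, 0.0) for t in RESOURCE_TYPES)
    P_D = sum(potential.get(t, 0.0) for t in RESOURCE_TYPES)
    return A_D / P_D if P_D > 0 else 0.0

# Example with made-up numbers: an important image and a stylesheet are missing.
actual = {"image": 0.4, "css": 0.1}
potential = {"image": 0.5, "css": 0.2, "javascript": 0.2, "text": 0.1}
print(total_damage(actual, potential))  # 0.5 / 1.0 = 0.5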
For the image analysis, we use Pillow, a Python imaging library that has better and faster performance than PerlMagick. Pillow can read the pixels of an image without dumping them to a file, which speeds up the analysis. Furthermore, we modified the algorithm so that the analysis script only needs to run once for all stylesheets.
Figure 3. The calculation process in Memento-Damage

Figure 4. Chasing down a redirected URI

Dealing with Overlapping Resources

Figure 5. Example of a memento that contains overlapping resources (accessed on March 30th, 2017)
URIs with overlapping resources, such as the one illustrated in Figure 5, need to be treated differently to prevent the damage value from being counted twice. To solve this problem, we created a concept of a rectangle (Rectangle = xmin, ymin, xmax, ymax). We treat the overlapping resources as rectangles and calculate the size of the intersection area. The size of one of the overlapping resources is reduced by the intersection size, while the other is counted at its full size. Figure 6 and Listing 1 illustrate the rectangle concept.
Figure 6. Intersection concept for overlapping resources in an URI
def rectangle_intersection_area(a, b):
    # Width and height of the overlapping region (negative when there is no overlap)
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if (dx >= 0) and (dy >= 0):
        return dx * dy
    return 0  # the rectangles do not overlap
Listing 1. Measuring image rectangle in Memento-Damage
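For illustration, with hypothetical pixel coordinates for two overlapping card images, the function above reports the doubly counted area:

from collections import namedtuple

Rectangle = namedtuple("Rectangle", ["xmin", "ymin", "xmax", "ymax"])

a = Rectangle(xmin=0, ymin=0, xmax=200, ymax=150)
b = Rectangle(xmin=100, ymin=100, xmax=300, ymax=250)

# The 100 x 50 pixel overlap is subtracted from one of the two resources
print(rectangle_intersection_area(a, b))  # 5000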

Dealing with Iframes

Dealing with iframes is quite tricky and requires some customization. First, by default, the crawling process cannot access the content inside an iframe using native JavaScript or jQuery selectors due to cross-domain restrictions. This problem becomes more complicated when the iframe is nested in other iframe(s). Therefore, we need a way to switch from the main frame to the iframe. To handle this problem, we utilize the API provided by PhantomJS that facilitates switching from one frame to another. Second, the location properties of the resources inside an iframe are calculated relative to that particular iframe's position, not to the main frame's position, which could lead to a wrong damage calculation. Thus, for a resource located inside an iframe, its position must be computed in a nested calculation that takes into account the position of its parent frame(s).

Using Memento-Damage

The Web Service

a. Website
The Memento-Damage website provides the easiest way to use the Memento-Damage tool. However, since it runs on a resource-limited server provided by ODU, it is not recommended for calculating the damage of a large number of URI-Ms. Figure 7 shows a brief preview of the website.
Figure 7. The calculation result from Memento-Damage

b. REST API
The REST API service is the part of the web service that facilitates damage calculation from any HTTP client (e.g., web browser, cURL, etc.) and gives output in JSON format. This makes it possible for the user to do further analysis with the resulting output. Using the REST API, a user can create a script and calculate the damage of a small number of URIs (e.g., 5).
The default REST API usage for memento damage is:
http://memento-damage.cs.odu.edu/api/damage/[the input URI-M]
Listing 2 and Listing 3 show examples of using Memento-Damage REST API with CURL and Python.
curl http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci
Listing 2. Using Memento-Damage REST API with Curl
import requests
resp = requests.get('http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci')
print resp.json()
Listing 3. Using Memento-Damage REST API as embedded code in Python

Local Service

a. Docker Version
The Memento-Damage Docker image uses Ubuntu-LXDE for the desktop environment. A fixed desktop environment is used to avoid inconsistency in the damage value of the same URI run on different machines with different operating systems. We found that PhantomJS, the headless browser used for generating the screenshot, renders the web page in accordance with the machine's desktop environment. Hence, the same URI could have a slightly different screenshot, and thus a different damage value, when run on different machines (Figure 8).


Figure 8. Screenshot of https://web.archive.org/web/19990125094845/http://www.dot.state.al.us taken by PhantomJS run on 2 machines with different OS.
To start using the Docker version of Memento-Damage, the user can follow these  steps: 
  1. Pull the docker image:
    docker pull erikaris/memento-damage
    
  2. Run the docker image:
    docker run -it -p <host port>:80 --name <container name> <image name>
    
    Example:
    docker run -i -t -p 8080:80 --name memdamage erikaris/memento-damage:latest
    
    After this step is completed, we now have the Memento-Damage web service running on
    http://localhost:8080/
    
  3. Run memento-damage as a CLI using the docker exec command:
    docker exec -it <container name> memento-damage <URI>
    Example:
    docker exec -it memdamage memento-damage http://odu.edu/compsci
    
    If the user wants to work from the inside of the Docker's terminal, use this following command:
    docker exec -it <container name> bash
    Example:
    docker exec -it memdamage bash 
  4. Start exploring the Memento-Damage using various options (Figure 9) that can be obtained by typing
    docker exec -it memdamage memento-damage --help
    or if the user is already inside the Docker's container, just simply type:
    memento-damage --help
~$ docker exec -it memdamage memento-damage --help
Usage: memento-damage [options] <URI>

Options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        output directory (optional)
  -O, --overwrite       overwrite existing output directory
  -m MODE, --mode=MODE  output mode: "simple" or "json" [default: simple]
  -d DEBUG, --debug=DEBUG
                        debug mode: "simple" or "complete" [default: none]
  -L, --redirect        follow url redirection

Figure 9. CLI options provided by Memento-Damage

Figure 10 depicts an output generated by CLI-version Memento-Damage using complete debug mode (option -d complete).
Figure 10. CLI-version Memento-Damage output using option -d complete
Further details about using Docker to run Memento-Damage are available at http://memento-damage.cs.odu.edu/help/.

b. Library
The library version offers functionality (web service and CLI) that is relatively similar to that of the Docker version. It is aimed at people who already have all the dependencies (PhantomJS 2.x and Python 2.7) installed on their machine and do not want to bother installing Docker. The latest version of the library can be downloaded from GitHub.
Start using the library by following these steps:
  1. Install the library using the command:
    sudo pip install web-memento-damage-X.x.tar.gz
  2. Run the Memento-Damage as a web service:     memento-damage-server
  3. Run the Memento-Damage via CLI:   memento-damage <URI>
  4. Explore available options by using option --help:
    memento-damage-server --help                     (for the web service)
    or 
    memento-damage --help                                  (for CLI)

Testing on a Large Number of URIs

To prove our claim that Memento-Damage can handle a large number of URIs, we conducted a test on 108,511 URI-Ms using a testing script written in Python. The test used the Docker version of Memento-Damage run on a machine with the following specification: Intel(R) Xeon(R) CPU E5-2660 v2 @2.20GHz, 4 GiB of memory. The testing result is shown below.

Summary of Data
=================================================
Total URI-M: 108511
Number of URI-M are successfully processed: 80580
Number of URI-M are failed to process: 27931
Number of URI-M has not processed yet: 0

Of the 108,511 input URI-Ms tested, 80,580 URI-Ms were successfully processed while the remaining 27,931 URI-Ms failed. The processing failures happened because of concurrent-access limits at the Internet Archive. On average, one URI-M needs 32.5 seconds of processing time. This is 110 times faster than the prototype version, which takes an average of 1 hour to process one URI-M; in some cases, the prototype version even took almost 2 hours for a single URI-M.

From the successfully processed URI-Ms, we created some visualizations to help us better understand the result, as can be seen below. The first graph (Figure 11) shows the average number of missing embedded resources per memento per year according to the damage value (Dm) and the proportion of missing resources (Mm). The most interesting case appeared in 2011, where the Dm value was significantly higher than Mm. This means that although, on average, the URI-Ms from 2011 lost only 4% of their resources, these losses caused damage four times higher than the number suggested by Mm. On the other hand, in 2008, 2010, 2013, and 2017, the Dm value is lower than Mm, which implies those missing resources are less important.
Figure 11. The average embedded resources missed per memento per year
Figure 12. Comparison of All Resources vs Missing Resources

The second graph (Figure 12) shows the total number of resources in each URI-M and its missing resources. The x-axis represents each URI-M, sorted in descending order by the number of resources, while the y-axis represents the number of resources owned by each URI-M. This graph gives us the insight that almost every URI-M lost at least one of its embedded resources.

Summary

In this research, we improved the method for measuring the damage to a memento (URI-M) based on the earlier prototype. The improvements include reducing the calculation time, fixing various bugs, and handling redirection and new types of resources. We developed Memento-Damage into a comprehensive tool that can show the detail of every resource contributing to the damage. Furthermore, it provides several ways of using the tool, such as a Python library and a Docker image. The testing results show that Memento-Damage works faster than the prototype version and can handle a larger number of mementos. Table 1 summarizes the improvements in Memento-Damage compared to the initial prototype.

No. | Subject | Prototype | Memento-Damage
1 | Programming language | JavaScript + Perl | JavaScript + Python
2 | Interface | CLI | CLI, website, REST API
3 | Distribution | Source code | Source code, Python library, Docker
4 | Output | Plain text | Plain text, JSON
5 | Processing time | Very slow | Fast
6 | Includes iframes | N/A | Available
7 | Redirection handling | N/A | Available
8 | Resolve overlap | N/A | Available
9 | Blacklisted URIs | Only has 1 blacklisted URI, which is added manually | Adds several new blacklisted URIs, identified based on a certain pattern
10 | Batch execution | Not supported | Supported
11 | DOM selector capability | Only supports simple selection queries | Supports complex selection queries
12 | Input filtering | N/A | Only processes input in HTML format
Table 1. Improvements in Memento-Damage compared to the initial prototype

Help Us to Improve

This tool still needs a lot of improvements to increase its functionality and provide a better user experience. We really hope, and strongly encourage, everyone, especially people who work in the web archiving field, to try this tool and give us feedback. Please read the FAQ and HELP before starting to use Memento-Damage. Help us improve and tell us what we can do better by posting any bugs, errors, issues, or difficulties that you find in this tool on our GitHub.

- Erika Siregar -

Monday, November 20, 2017

2017-11-20: Dodging the Memory Hole 2017 Trip Report

It was rainy in San Francisco, but that did not deter those of us attending Dodging the Memory Hole 2017 at the Internet Archive. We engaged in discussions about a very important topic: the preservation of online news content.


Keynote: Brewster Kahle, founder and digital librarian for the Internet Archive

Brewster Kahle is well known in digital preservation and especially web archiving circles. He founded the Internet Archive in May 1996. The WS-DL and LANL's Prototyping Team collaborate heavily with those from the Internet Archive, so hearing his talk was quite inspirational.




We are familiar with the Internet Archive's efforts to archive the Web, visible mostly through the Wayback Machine, but the goal of the Internet Archive is "Universal Access to All Knowledge", something that Kahle equates to the original Library of Alexandria or putting humans on the moon. To that end, he highlighted many initiatives by the Internet Archive to meet this goal. He mentioned that the contents of a book take up roughly a megabyte, so with 28 terabytes the works of the Library of Congress can be stored digitally. Digitizing it is another matter, but it is completely doable, and by digitizing it we remove restrictions on access due to distance and other factors. Why stop with documents? There are many other types of content. Kahle highlighted the efforts by the Internet Archive to make television content, video games, audio, and more available. They also have a lending program whereby they allow users to borrow books, which are digitized using book scanners. He stressed that, because of its mission to provide content to all, the Internet Archive is indeed a library.



As a library, the Internet Archive also becomes a target for governments seeking information on the activities of their citizens. Kahle highlighted one incident in which the FBI sent a letter demanding information from the Internet Archive. Thanks to help from the Electronic Frontier Foundation, the Internet Archive sued the United States government and won, defending the rights of those using their services.



Kahle emphasized that we can all help with preserving the web by helping the Internet Archive build its holdings of web content. The Internet Archive contains a form with a simple "save page now" button, but they also support other methods of submitting content.



Contributions from Los Alamos National Laboratory (LANL) and Old Dominion University (ODU)


Martin Klein from LANL and Mark Graham from the Internet Archive




Martin Klein presented work on Robust Links. Martin briefly reviewed motivating work he had done with Herbert Van de Sompel at Los Alamos National Laboratory, mentioning the problems of link rot and content drift, the latter of which I have also worked on.
He covered how one can create links that are robust by:
  1. submitting a URI to a web archive
  2. decorating the link HTML so that future users can reach archived versions of the linked content
For the first item, he talked about how one can use tools like the Internet Archive's "Save Page Now" button as well as WS-DL's own ArchiveNow. The second item is covered by the Robust Links specification. Mark Graham, Director of the Wayback Machine at the Internet Archive, further expanded upon Martin's talk by describing how the Wayback Extension also provides the capability to save pages, navigate the archive, and more. It is available for Chrome, Safari, and Firefox and is shown in the screenshots below.
A screenshot of the Wayback Extension in Chrome.
A screenshot of the Wayback Extension in Safari. Note the availability of the option "Site Map", which is not present in the Chrome version
A screenshot of the Wayback Extension in Firefox. Note how there is less functionality.


Of course, the WS-DL efforts of ArchiveNow and Mink augment these preservation efforts by submitting content to multiple web archives, including the Internet Archive.



I enjoyed one of the most profound revelations from Martin and Mark's talk: URIs are addresses, not the content that was on the page at the moment you read it. I realize that efforts like IPFS are trying to use hashes to address this dichotomy, but the web has not yet migrated to them.

Shawn M. Jones from ODU




I presented a lightning talk highlighting a blog post from earlier this year where I try to answer the question: where can we post stories summarizing web archive collections? I talked about why storytelling works as a visualization method for summarizing collections and then evaluated a number of storytelling and curation tools with the goal of finding those that best support this visualization method.


Selected Presentations


I tried to cover elements of all presentations while live tweeting during the event, and wish I could go into more detail here, but, as usual I will only cover a subset.

Mark Graham highlighted the Internet Archive's relationships with online news content. He highlighted a report by Rachel Maddow where she used the Internet Archive to recover tweets posted by former US National Security Advisor Michael Flynn, thus incriminating him. He talked about other efforts, such as NewsGrabber, Archive-It, and the GDELT project, which all further archive online news or provide analysis of archived content. Most importantly, he covered "News At Risk"—content that has been removed from the web by repressive regimes, further emphasizing the importance of archiving it for future generations. In that vein, he discussed the Environmental Data & Governance Initiative, set up to archive environmental data from government agencies after Donald Trump's election.

Ilya Kreymer and Anna Perricci presented their work on Webrecorder, web preservation software hosted at webrecorder.io. An impressive tool for "high fidelity" web archiving, Webrecorder allows one to record a web browsing session and save it to a WARC. Kreymer demonstrated its use on a CNN news web site with an embedded video, showing how the video was captured as well as the rest of the content on the page. The Webrecorder.io web platform allows one to record using their native browser, or they can choose from a few other browsers and configurations in case the user agent plays a role in the quality of recording or playback. For offline use, they have also developed Webrecorder Player, in which one can playback their WARCs without requiring an Internet connection. Anna Perricci said that it is perfect for browsing a recorded web session on an airplane. Contributors to this blog have written about Webrecorder before.

Katherine Boss, Meredith Broussard, Fernando Chirigati, and Rémi Rampin discussed the problems surrounding the preservation of news apps: interactive content on news sites that allow readers to explore data collected by journalists on a particular issue. Because of their dynamic nature, news apps are difficult to archive. Unlike static documents, they can not be printed or merely copied. They often consist of client and server side code developed without a focus on reproducibility. Preserving news apps often requires the assistance of the organization that created the news app, which is not always available. Rémi Rampin noted that, for those organizations that were willing to help them, their group had had success using the research reproducibility tool reprozip to preserve and play back news apps.

Roger Macdonald and Will Crichton provided an overview of the Internet Archive's efforts to provide information from TV news. They have employed the Esper video search tool as a way to explore their collection. Because it is difficult for machines to derive meaning from the pixels within videos, they used captioning as a way to provide for effective searching and analysis of the TV news content at the Internet Archive. Their goal is to allow search engines to connect fact checking to the TV media. To this end, they employed facial recognition on hours of video to find content where certain US politicians were present. From there one can search for a politician and see where they have given interviews on such news channels as CNN, BBC, and Fox News. Alternatively, they are exploring identifying the body position of each person in a frame. Using this, it might be possible to answer questions such as finding "every video where a man is standing over a woman". The goal is to make video as easy as text to search for meaning.

Maria Praetzellis highlighted a project named Community Webs that uses Archive-It. Community Webs provides libraries the tools necessary to preserve news and other content relevant to their local community. Through Community Webs, local public libraries receive education and training, help with collection development, and archiving services and infrastructure.

Kathryn Stine and Stephen Abrams presented the work done on the Cobweb Project. Cobweb provides an environment where many users can collaborate to produce seeds that can then be captured by web archiving initiatives. If an event is unfolding and news stories are being written, the documents containing these stories may change quickly, thus it is imperative for our cultural memory that these stories be captured as close to publication as possible. Cobweb provides an environment for the community to create a collection of seeds and metadata related to one of these events.
Matthew Weber shared some results from the News Measures Research Project. This project started as an attempt to "create an archive of local news content in order to assess the breadth and depth of local news coverage in the United States". The researchers were surprised to discover that local news in the United States covers a much larger area than expected: 546 miles on average. Most areas are "woefully underserved". Consolidation of corporate news ownership has led to fewer news outlets in many areas and the focus of these outlets is becoming less local and more regional. These changes are of concern because the press is important to the democratic processes within the United States.

Social



As usual, I met quite a few people during our meals and breaks. I appreciate talks over lunch with Sativa Peterson of Arizona State Library and Carolina Hernandez of the University of Oregon. It was nice to discuss the talks and their implications for journalism with Eva Tucker of Centered Media and Barrett Golding of Hearing Voices. I also appreciated feedback and ideas from Ana Krahmer of the University of North Texas, Kenneth Haggerty of the University of Missouri, Matthew Collins of the University of San Francisco Gleeson Library, Kathleen A. Hansen of University of Minnesota, and Nora Paul, retired director of Minnesota Journalism Center. I was especially intrigued by discussions with Mark Graham on using storytelling with web archives, Rob Brackett of Brackett Development, who is interested in content drift, and James Heilman, who works on WikiProject Medicine with Wikipedia.


Summary


Like last year, Dodging the Memory Hole was an inspirational conference highlighting current efforts to save online news. Having it at the Internet Archive further provided expertise and stimulated additional discussion on the techniques and capabilities afforded by web archives. Pictures of the event are available on Facebook. Video coverage is broken up into several YouTube videos: Day 1 before lunch, Day 1 after lunch, Day 2 before lunch, Day 2 after lunch, and lightning talks. DTMH highlights the importance of news in an era of a changing media presence in the United States, further emphasizing that web archiving can help us fact-check statements so we can hold onto a record of not only how we got here, but also guide where we might go next.

-- Shawn M. Jones