Monday, October 16, 2017

2017-10-16: Visualizing Webpage Changes Over Time - new NEH Digital Humanities Advancement Grant

In August, we were excited to be awarded an 18-month Digital Humanities Advancement Grant from the National Endowment for the Humanities (NEH) and the Institute of Museum and Library Services (IMLS).  Our project, "Visualizing Webpage Changes Over Time", was one of 31 awards made through this joint NEH/IMLS program (award announcement).
Michele C. Weigle and Michael L. Nelson - ODU
Deborah Kempe - Frick Art Reference Library and New York Art Resources Consortium
Pamela Graham and Alex Thurman - Columbia University Libraries
Oct 2017 – Mar 2019, $75,000
As web archives grow in importance and size, techniques for understanding how a web page changes through time need to adapt from an assumption of scarcity (just a few copies of a page, no more than a few weeks or months apart) to one of abundance (tens of thousands of copies of a page, spanning as much as 20 years). This project, a joint effort among ODU, the New York Art Resources Consortium (NYARC), and Columbia University Libraries (CUL), will research and develop tools for efficient visualization of and interaction with archived web pages. This work will be informed by and in support of CUL’s and NYARC’s existing web archiving activities.

This project is an extension of AlSum and Nelson's Thumbnail Summarization Techniques for Web Archives, published in ECIR 2014 (presentation slides), and our previous work, funded by an incentive grant from Columbia University Libraries and the Andrew W. Mellon Foundation.

For this project, we will develop:
  1. a tool for visualizing web page changes in arbitrary web archives
  2. a plug-in for the popular Wayback Machine web archiving system
  3. scripts for easy embedding of the visualizations in live web pages, providing tighter integration of the archived web and live web. 
The visualizations we will develop fall into three main categories:
  • grid view - This view would show the entire thumbnail summary in a grid.
  • interactive timeline view - This view would place the thumbnail summary on an interactive timeline. Depending upon the size of the TimeMap, other mementos (those not selected as part of the summary) may be indicated on the timeline as well.
  • single thumbnail view - The area of this view would be a single thumbnail. In addition to the standard screenshot, we will also develop visualizations that employ a Twitter-style card. We propose to develop several instances of this view:
    • image slider - This view would be similar to iPhoto image previews, where the thumbnail image changes as the user moves their mouse across the image
    • animated GIF - This view would automatically cycle through the selected thumbnails, similar to our “What Did It Look Like?” service.
    • video - This view would be a video of the thumbnails. We would use existing services, such as YouTube or Instagram, which allow annotation with links to mementos and direct access to particular thumbnails. For instance, YouTube provides access to particular points in a video through the #t={mm}m{ss}s URL parameter (e.g.,
We are grateful for the continued support of NEH and IMLS for our web archiving research and look forward to producing exciting tools and services for the community.

-Michele and Michael

Tuesday, September 19, 2017

2017-09-19: Carbon Dating the Web, version 4.0

This release of Carbon Date, dubbed Carbon Date v4.0, introduces new features to track testing and enforce standard Python formatting conventions.

We've also decided to switch from MementoProxy and take advantage of the Memgator Aggregator tool built by Sawood Alam.

Of course, with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow our team to catch and address these issues more quickly than before, as explained below.

The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has changed its API to allow only 30-day trials with 1,000 requests per month unless someone wants to pay. We also discovered a few more use cases for Pubdate extraction by applying it to the mementos retrieved from Memgator. By default, Memgator provides the Memento-Datetime retrieved from an archive's HTTP headers. However, news articles can contain metadata indicating the actual publication date or time, which gives our tool a more accurate time of an article's publication.
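To make the idea concrete, here is a minimal sketch of preferring a page's own publication metadata over the capture time. It is not Carbon Date's actual code: the class, the function, the sample page, and the set of metadata keys are all assumptions chosen for illustration, and it assumes the publication date is an ISO 8601 string with a UTC offset.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Hypothetical set of metadata keys to look for; real pages vary widely.
PUBDATE_KEYS = {"article:published_time", "datePublished", "pubdate"}


class PubdateParser(HTMLParser):
    """Record the content of the first <meta> tag that looks like a publication date."""

    def __init__(self):
        super().__init__()
        self.pubdate = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.pubdate is not None:
            return
        attributes = dict(attrs)
        key = (attributes.get("property")
               or attributes.get("name")
               or attributes.get("itemprop"))
        if key in PUBDATE_KEYS:
            self.pubdate = attributes.get("content")


def earliest_date(memento_datetime: str, html: str) -> datetime:
    """Return the earlier of the capture time and the page's stated publication date."""
    # Memento-Datetime is an HTTP date, e.g. "Tue, 19 Sep 2017 20:00:00 GMT".
    captured = parsedate_to_datetime(memento_datetime)
    parser = PubdateParser()
    parser.feed(html)
    if parser.pubdate is None:
        return captured
    # Assumption: the metadata value is ISO 8601 with an explicit UTC offset.
    published = datetime.fromisoformat(parser.pubdate)
    return min(captured, published)


# Made-up sample page for illustration only.
page = ('<html><head><meta property="article:published_time" '
        'content="2017-09-18T08:00:00+00:00"></head></html>')
print(earliest_date("Tue, 19 Sep 2017 20:00:00 GMT", page).isoformat())
```

When no recognizable metadata is present, the sketch simply falls back to the Memento-Datetime.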

What's New

With APIs changing over time, we decided we needed a proper way to test Carbon Date. To address this issue, we decided to use the popular Travis CI. Travis CI enables us to test our application every day using a cron job. Whenever an API changes, a piece of code breaks, or code is styled in an unconventional way, we'll get a nice notification saying something has broken.

Carbon Date contains modules for getting dates for URIs from Google, Bing, Bitly and Memgator. Over time the code has accumulated various styles with no consistent convention. To address this issue, we decided to conform all of our Python code to PEP 8 formatting conventions.

We found that when using Google query strings to collect dates we would always get a date at midnight. This is simply because there is no timestamp, just a year, month and day. This caused Carbon Date to always choose this as the lowest date. Therefore, we've changed this to be the last second of the day instead of the first. For example, the date '2017-07-04T00:00:00' becomes '2017-07-04T23:59:59', so a date-only value no longer automatically sorts ahead of more precise timestamps from the same day.
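The adjustment can be sketched in a few lines of Python. This is an illustration rather than the project's own code: the function name `end_of_day` and the second candidate timestamp are made up for the example.

```python
from datetime import datetime, time


def end_of_day(date_only: str) -> str:
    """Turn a date-only value (YYYY-MM-DD) into the last second of that day."""
    day = datetime.strptime(date_only, "%Y-%m-%d").date()
    return datetime.combine(day, time(23, 59, 59)).isoformat()


# A date-only estimate no longer beats a more precise timestamp from the same day.
# The second value is a made-up timestamp standing in for another source.
candidates = [end_of_day("2017-07-04"), "2017-07-04T15:30:00"]
earliest = min(candidates)  # ISO 8601 strings in the same format sort chronologically

print(end_of_day("2017-07-04"))
print(earliest)
```

Had the date-only value stayed at midnight, it would have been selected as the earliest candidate regardless of what the other sources reported for that day.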

We've also decided to change the JSON format to something more conventional, as shown below:

Other sources explored

It has been a long-term goal to continuously find and add available sources to the Carbon Date application that offer a creation date. However, not all the sources we explore deliver what we expect. Below is a list of APIs and other sources that were tested but were unsuccessful in returning a URI creation date. We explored URL shortener APIs such as:
The bitly URL shortener still remains the best as the Bitly API allows a lookup of full URLs not just shortened ones.

How to use

Carbon Date is built on top of Python 3 (most machines have Python 2 by default). Therefore we recommend installing Carbon Date with Docker.

We also host the server version. However, carbon dating is computationally intensive and the site can only hold 50 concurrent requests, so the web service should be used just for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally via Docker.


After installing Docker, you can do the following:

2013 Dataset explored

The Carbon Date application was originally built by Hany SalahEldeen, mentioned in his paper in 2013. In 2013 they created a dataset of 1200 URIs to test this application and it was considered the "gold standard dataset." It's now four years later and we decided to test that dataset again.

We found that the 2013 dataset had to be updated. The dataset originally contained URIs and actual creation dates collected from the WHOIS domain lookup, sitemaps, atom feeds and page scraping. When we ran the dataset through the Carbon Date application, we found Carbon Date successfully estimated 890 creation dates but 109 URIs had estimated dates older than their actual creation dates. This happened either because various web archives held mementos with creation dates older than what the original sources provided, or because sitemaps might have recorded updated page dates as original creation dates. Therefore, we've taken the oldest version of the archived URI as the actual creation date to test against.

We found that 628 of the 890 estimated creation dates matched the actual creation date, achieving a 70.56% accuracy - originally 32.78% when conducted by Hany SalahEldeen. Below you can see a polynomial curve to the second degree used to fit the real creation dates.
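As a quick check of the accuracy figure quoted above, the percentage follows directly from the two counts reported in this post (the variable names are ours):

```python
estimated = 890  # URIs for which Carbon Date produced a creation date estimate
matched = 628    # estimates that matched the actual creation date

accuracy = matched / estimated * 100
print(f"{accuracy:.2f}%")
```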


Q: I can't install Docker on my computer for various reasons. How can I use this tool?
A: If you can't use Docker, then my recommendation is to download the source code from GitHub, create a Python virtual environment, and install the dependencies with pip from there.

Q: After x amount of requests Google doesn't give a date when I think it should. What's happening?
A: Google is very good at catching programs (robots) that aren't using their APIs. Carbon Date is not using an API but rather doing a string query, as a browser would, and then looking at the results. You might have hit a CAPTCHA, so Google might lock Carbon Date out for a while.

Q: I sent a simple website to Carbon Date to check the date of creation, but it says it was not found in any archive. Why is that?
A: Some websites have an exceedingly large number of mementos. For these websites, the Memgator tool is searching for tens of thousands of mementos across multiple archiving websites. This request can take minutes, which eventually leads to a timeout, which in turn means Carbon Date will return zero archives.

Q: I have another issue not listed here, where can I ask questions?
A: This project is open source on GitHub. Just navigate to the issues tab on GitHub, start a new issue and ask away!

Carbon Date 4.0? What about 3.0?

With this being Carbon Date 4.0, there have been three previous blog posts for this project! You can find them here:
-Grant Atkins

Wednesday, September 13, 2017

2017-09-13: Pagination Considered Harmful to Archiving

Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015

Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages which comprised the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (or 28%) had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursardar and Shipman (2017).  Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews, single page writings) and surveyed regarding their expectation for later access to additional content that was linked from or appeared on the main page.  Specifically, they investigated (1) how relationships between page content affect expectations and (2) how perceptions of content value relate to internet resources. In other words, if I save the main page as a resource, what else should I expect to be saved along with it?

I experienced this paradox firsthand when I attempted to locate a historical entry from the 2016 edition of the U.S. News Best Global University Rankings.  As shown in Figure 1, October 30, 2015 is a particular date of interest because on the day prior, a revision of the original ranking for the University at Buffalo-SUNY was reported. The university's ranking was revised due to incorrect data related to the number of PhD awards. A re-calculation of the ranking metrics resulted in a change of the university's ranking position from a tie at No. 344 to a tie at No. 181.

Figure 3 - Summary of U.S. News Captures in the Internet Archive


Figure 4 - Capture of U.S. News Revision for Buffalo-SUNY

A search of the Internet Archive, Figure 3, shows the U.S. News web site was saved 669 times between October 28, 2014 and September 3, 2017. We should first note that regardless of the ranking year you choose to locate via a web search, U.S. News reuses the same URL from year to year. Therefore, an inquiry against the live web will always direct you to their most recent publication. As of September 3, 2017, the redirect would be to the 2017 edition of their ranking list. Next, as shown in Figure 2, the 2016 U.S. News ranking list consisted of 750 universities presented in groups of 10 spread across 75 web pages. Therefore, the revised entry for the University at Buffalo-SUNY at rank No. 181 should appear on page 19, Figure 4.
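The page arithmetic above is easy to verify. The sketch below assumes, as the text does, 10 universities per page and that a rank number corresponds directly to a position in the list (ties are not accounted for); the function name is ours.

```python
PAGE_SIZE = 10     # universities per page in the 2016 list
TOTAL_RANKED = 750  # universities in the 2016 list


def page_for_rank(rank: int, page_size: int = PAGE_SIZE) -> int:
    """Return the 1-based page on which a given list position appears."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    # Integer ceiling division: positions 1-10 -> page 1, 11-20 -> page 2, ...
    return (rank + page_size - 1) // page_size


total_pages = page_for_rank(TOTAL_RANKED)
print(total_pages)
print(page_for_rank(181))  # revised position for Buffalo-SUNY
print(page_for_rank(344))  # original position before the revision
```

Under these assumptions the revision moved the entry from page 35 to page 19, so both pages would need to be captured to document the change.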

Page No. | Start Date | End Date
Table 1 - Page Captures of U.S. News (Abbreviated Listing)

While I could readily locate the main page of the 2016 list as it appeared on October 30, 2015, I noted that subsequent pages were archived with diminishing frequency and over a much shorter period of time. We see in Table 1, after the first three pages, there can be a significant variance in the frequency with which the remaining site pages are crawled. And, as was noted earlier, more than a quarter (28%) of the ranking list cannot be reconstructed at all. Ainsworth and Nelson examined the degree of temporal drift that can occur during the display of sparsely archived pages using the Sliding Target policy allowed by the web archive user interface (UI); namely many years in just a few clicks. Since a substantial portion of the U.S. News ranking list is missing, it is very likely the web browsing experience will result in a hybrid list of universities that encompasses different ranking years as the user follows the page links.

Figure 5 - Frequency of Page Captures

Ultimately, we found page 19 had been captured three times during the specified time frame. However, the page containing the revised ranking that was of interest, Figure 4, was not available in the archive until March 14, 2016, almost five months after the ranking list had been updated. Further, in Figure 5, we note heavy activity for the first few and last few pages of the ranking list, which may occur because, as shown in Figure 2, these links are presented prominently on page 1. The remaining pages 3 through 5 must be discovered manually by clicking on the next page. We also note in Figure 5 the sporadic capture pattern for these intermediate pages.

Current web designs which feature pagination create a frustrating experience for the user when subsequent pages are omitted in the archive. It was my expectation that all pages associated with the ranking list would be saved in order to maintain the integrity of the complete listing of universities as they appeared on the publication date. My intuition is consistent with Poursardar and Shipman, who, among their other conclusions, noted that navigational distance from the primary page can affect perceptions regarding what is considered to be viable content that should be preserved.  However, for multi-page articles, nearly 80% of the participants in their study considered linked information in the later pages as part of the resource. This perception was especially profound "when the content of the main and connected pages are part of a larger composition or set of information" as in perhaps a ranking list.

Overall, the findings of Poursardar and Shipman along with our personal observations indicate that archiving systems require an alternative methodology or domain rules that recognize when content spread across multiple pages represents a single collection or a composite resource that should be preserved in its entirety. From a design perspective, we can only wonder why there isn't a "view all" link on multi-page content such as the U.S. News ranking list. This feature might present a way to circumvent paginated design schemes so the Internet Archive can obtain a complete view of a particular web site, especially if the "view all" link is located on the first few pages, which appear to be crawled most often. On the other hand, the use of pagination might also represent a conscious choice by the web designer or site owner as a way to limit page scraping, even though people can still find a way to do so. Ultimately, the collateral damage associated with this type of design scheme is an uneven distribution in the archive, resulting in an incomplete archival record.


Scott G. Ainsworth and Michael L. Nelson. "Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive." International Journal on Digital Libraries 16: 129-144. DOI: 10.1007/s00799-014-0120-4

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Faryaneh Poursardar and Frank Shipman, "What Is Part of That Resource? User Expectations for Personal Archiving." Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries, 2017.
-- Corren (@correnmccoy)

Sunday, August 27, 2017

2017-08-27: Four WS-DL Classes Offered for Fall 2017

An unprecedented four Web Science & Digital Library (WS-DL) courses will be offered in Fall 2017:
Finally, although they are not WS-DL courses per se, WS-DL member Corren McCoy is also teaching CS 462 Cybersecurity Fundamentals again this semester, and WS-DL alumnus Dr. Charles Cartledge is teaching CS 395 "Data Analysis Bootcamp".

I'm especially proud of this semester's breadth of course offerings and the participation by two alumni and one current WS-DL member.


2017-08-27: Media Manipulation research at the Berkman Klein Center at Harvard University Trip Report

A photo of me inside "The Yellow House" -
The Berkman Klein Center for Internet & Society
On June 5, 2017, I started work as an Intern at the Berkman Klein Center for Internet & Society at Harvard University under the supervision of Dr. Rob Faris, the Research Director for the Berkman Klein Center. This was a wonderful opportunity to conduct news media related research, and my second consecutive Summer research at Harvard. The Berkman Klein Center is an interdisciplinary research center that researches the means to tackle some of the biggest challenges on the Internet. Located in a yellow house at the Harvard Law School, the Center is committed to studying the development, dynamics, norms and standards of cyberspace. The center has produced many significant contributions such as the review of ICANN (Internet Corporation for Assigned Names and Numbers) and the founding of the DPLA (Digital Public Library of America).
During the first week of my internship, I met with Dr. Faris to identify the research I would conduct in collaboration with Media Cloud at Berkman. Media Cloud is an open-source platform for studying media ecosystems. The Media Cloud platform provides various tools for studying media such as Dashboard, Topic Mapper, and Source Manager.
Media Cloud tools for visualizing and analyzing online news
Dashboard helps you see how a specific topic is spoken about in digital media. Topic Mapper helps you conduct in-depth topic analysis by identifying the most influential sources and stories. Source Manager helps you explore Media Cloud's vast collection of digital media. The Media Cloud collection consists of about 547 million stories from over 200 countries. Some of the most recent Media Cloud research publications include: "Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election" and "Partisan Right-Wing Websites Shaped Mainstream Press Coverage Before 2016 Election."

Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election | Berkman Klein Center

In this study, we analyze both mainstream and social media coverage of the 2016 United States presidential election. We document that the majority of mainstream media coverage was negative for both candidates, but largely followed Donald Trump's agenda: when reporting on Hillary Clinton, coverage primarily focused on the various scandals related to the Clinton Foundation and emails.

Partisan Right-Wing Websites Shaped Mainstream Press Coverage Before 2016 Election, Berkman Klein Study Finds | Berkman Klein Center

The study found that on the conservative side, more attention was paid to pro-Trump, highly partisan media outlets. On the liberal side, by contrast, the center of gravity was made up largely of long-standing media organizations.

Rob and I narrowed my research area to media manipulation. Given the widespread concern about the spread of fake news, especially during the 2016 US General Elections, we sought to study the various forms of media manipulation and possible measures to mitigate this problem. I worked closely with Jeff Fossett, a co-intern on this project. My research about media manipulation began with a literature review of the state of the art. Jeff Fossett and I explored various research and news publications about media manipulation.


How the Trump-Russia Data Machine Games Google to Fool Americans

A year ago I was part of a digital marketing team at a tech company. We were maybe the fifth largest company in our particular industry, which was drones. But we knew how to game Google, and our site was maxed out.

Roger Sollenberger revealed a different kind of organized misinformation/disinformation campaign from the conventional publication of fake news (pure fiction - fabricated news). This campaign is based on Search Engine Optimization (SEO) of websites. Highly similar, less popular (by Alexa rank), and often fringe news websites beat more popular traditional news sites in the Google rankings. They do this by optimizing their content with important trigger keywords, generating a massive volume of similar fresh (constantly updated) content, and linking among themselves.

Case 2: Data & Society - Manipulation and Disinformation Online

Media Manipulation and Disinformation Online

Data & Society  introduced the subject of media manipulation in this report: media manipulators such as some far-right groups exploit the media's proclivity for sensationalism and novelty over newsworthiness. They achieve this through the strategic use of social media, memes, and bots to increase the visibility of their ideas through a process known as "attention hacking."

Case 3: Comprop - Computational Propaganda in the United States of America: Manufacturing Consensus Online 

Computational Propaganda in the United States of America: Manufacturing Consensus Online

This report by the Computational Propaganda Project at Oxford University illustrated the influence of bots during the 2016 US General Elections and the dynamics between bots and human users. They illustrated how armies of bots allowed campaigns, candidates, and supporters to achieve two key things during the 2016 election: first, to manufacture consensus, and second, to democratize online propaganda.
COMPROP: 2016 US General Election sample graph showing network of humans (black nodes) retweeting bots (green nodes)
Their findings showed that armies of bots were built to follow, retweet, or like a candidate's content making that candidate seem more legitimate, more widely supported, than they actually are. In addition, they showed that the largest botnet in the pro-Trump network was almost 4 times larger than the largest botnet in the pro-Clinton network:
COMPROP: 2016 US General Election sample botnet graph showing the pro-Clinton and more sophisticated pro-Trump botnets

Based on my study, I consider media manipulation as:

calculated efforts taken to circumvent or harness the mainstream media in order to set agendas and propagate ideas, often with the utilization of social media as a vehicle.
On close consideration of the different mechanisms of media manipulation, I saw a common theme, which the Computational Propaganda Project described as "Manufactured Consensus." The idea of manufactured consensus is this: we naturally lend some credibility to stories we see at multiple different sources, and media manipulators know this. Consequently, some media manipulators manufacture consensus around a story. Sometimes the story is pure fabrication (fake news); at other times, they take some truth from a story, distort it, and replicate this distorted version of the truth across multiple sources to manufacture consensus. An example of this manufacture of consensus is Case 1. Manufactured consensus ranges from fringe disinformation websites to bots on Twitter which artificially boost the popularity of a false narrative. The idea of manufactured consensus motivated my attempt to, first, provide a means of identifying consensus and, second, learn to distinguish organic consensus from manufactured consensus both within news sources and on Twitter. I will briefly outline the beginning part of my study to identify consensus specifically across news media sources.
I acknowledge there are multiple ways to define "consensus in news media," so I define consensus within news media:
 as a state in which multiple news sources report on the same or highly similar stories.
Let us use a graph (nodes and edges) to identify consensus in news media networks. In this graph representation, nodes represent news stories from media sources and consensus is captured by edges between the highly similar or the same news stories (nodes). For example, Graph 1 below shows consensus between NPR and BBC for a story about a shooting in a Moscow court.
Graph 1: Consensus within NPR and BBC for "Shooting at a Moscow Court" story
Consensus may occur within the mainstream left news media (Graph 2) or the mainstream right media (Graph 3). 
Graph 2: Consensus on the left (CNN, NYTimes & WAPO) media for the "Transgender ban" story
Graph 3: Consensus on the right (Breitbart, The blaze, & Fox) for the "Republican senators who killed the Skinny repeal bill" story
Consensus may or may not be exclusive to left, center or right media, but if there is consensus across different media networks (e.g., mainstream left, center and right), we would like to capture or approximate the level of bias or "spin" expressed by the various media networks (left, center and right). For example, on August 8, 2017, during a roundtable at his golf club in Bedminster, New Jersey, President Trump said if North Korea continues to threaten the US, "they will be met with fire and fury." Not surprisingly, various news media reported this story; in other words, there was consensus within left, center and right news media organizations (Graph 4) for the "Trump North Korea fire and fury" story.
Graph 4: Consensus across left, center and right media networks for the "Trump North Korea fire and fury story"
Let us inspect consensus Graph 4 closely, beginning with the left, then the center and the right. Consider the following titles:
  1. Calm down: we’re (probably) not about to go to war with North Korea, I’m at least, like, 75 percent sure.
  2. Trump now sounds more North Korea-y than North Korea
Politicus USA:
  1. Trump Threatens War With North Korea While Hanging Out In The Clubhouse Of His Golf Club
Huffington Post, Washington Post, and NYTimes, respectively:
The Hill:
Gateway Pundit, The Daily Caller, Fox, Breitbart, Conservative Tribune, Washington Examiner:
  1. WOW! North Korea Says It Is ‘Seriously Considering’ Military Strike on Guam (Gateway Pundit)
  2. North Korea: We May Attack Guam (The Daily Caller)
  3. Trump: North Korea 'will be met with fire and fury like the world has never seen' if more threats emerge (Fox)
  4. Donald Trump Warns North Korea: Threats to United States Will Be Met with ‘Fire and Fury’ (Breitbart)
  5. Breaking: Trump Promises Fire And Fury.. Attacks North Korea In Unprecedented Move (Conservative Tribune)
  6. North Korea threatens missile strike on Guam (Washington Examiner)
In my opinion, on the left, consider the critical outlook offered by Vox, claiming the President sounded like the North Korean Dictator Kim Jong Un. At the center, The Hill emphasized the unfavorable response of some senators due to the President's statement. I think one might say the left and some parts of the center painted the President as reckless due to his threats. On the right, consider the focus on the North Korean threat to strike Guam. The choice of words and perspectives reported on this common story exemplifies the "spin" due to the political bias of the various polarized media. We would like to capture this kind of spin. But it is important to note that our goal is NOT to determine what is the truth. Instead, if we can identify consensus and go beyond consensus to capture or approximate spin, this solution could be useful in studying media manipulation. It is also relevant to investigate if spin is related to misinformation or disinformation.
I believe the prerequisite for quantifying spin is identifying consensus. The primitive operation of identifying consensus is the binary operation of measuring the similarity (or distance) between two stories. I have begun this analysis with an algorithm in development. This algorithm was applied to generate Graphs 1-4. An explanation of this algorithm is beyond the scope of this post, but you may see the algorithm in action through this polar media consensus graph, which periodically computes a consensus graph for left, center and right media.
Consensus graph generated by an algorithm in development
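To illustrate the primitive operation only, here is a toy sketch that draws a consensus edge between two stories when their titles are similar enough. It is not the algorithm in development: Jaccard similarity over title words is a deliberately simple stand-in, and the function names and the 0.25 threshold are assumptions made for this example. The titles are taken from the lists above.

```python
import re
from itertools import combinations


def tokens(title: str) -> set:
    """Lowercased word set for a story title (a crude stand-in for full story text)."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consensus_edges(stories: dict, threshold: float = 0.25) -> list:
    """Return (source, source, similarity) for each pair at or above the threshold."""
    edges = []
    for (source_a, title_a), (source_b, title_b) in combinations(stories.items(), 2):
        similarity = jaccard(tokens(title_a), tokens(title_b))
        if similarity >= threshold:
            edges.append((source_a, source_b, round(similarity, 2)))
    return edges


stories = {
    "Daily Caller": "North Korea: We May Attack Guam",
    "Washington Examiner": "North Korea threatens missile strike on Guam",
    "Politicus USA": "Trump Threatens War With North Korea While Hanging Out "
                     "In The Clubhouse Of His Golf Club",
}

for edge in consensus_edges(stories):
    print(edge)
```

With these three titles, only the two Guam-focused headlines are connected; the Politicus USA title shares too few words with either to form an edge. Title overlap this shallow would miss stories that report the same event in different words, which is one reason a real similarity measure needs more than word sets.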
The second part of our study was to identify consensus on Twitter. I will strive to report the developments of this research as well as a formal introduction of the consensus identifying algorithm when our findings are concrete. 
In addition to researching media manipulation, I had the pleasure to see the 4th of July fireworks across the Charles River from Matt's rooftop, and attend Law and Cyberspace lectures hosted by Harvard Law School Professors - Jonathan Zittrain and Urs Gasser. I had the wonderful opportunity to teach Python and learn from my fellow interns, as well as present my media manipulation research to Harvard LIL.

Media manipulation is only going to evolve, making its study crucial. I am grateful for the constant guidance of my Ph.D. supervisors, Dr. Michael Nelson and Dr. Michele Weigle, and am also very grateful to Dr. Rob Faris at Media Cloud and the rest of the Berkman Klein community for providing me with the opportunity to research this pertinent subject.


Saturday, August 26, 2017

2017-08-26: rel="bookmark" also does not mean what you think it means

Extending our previous discussion about how the proposed rel="identifier" is different from rel="canonical" (spoiler alert: "canonical" is only for pages with duplicative text), here I summarize various discussions about why we can't use rel="bookmark" for the proposed scenarios.  We've already given a brief review of why rel="bookmark" won't work (spoiler alert: it is explicitly prohibited for HTML <link> elements or HTTP Link: headers) but here we more deeply explore the likely original semantics. 

I say "likely original semantics" because:
  1. the short phrases in the IANA link relations registry ("Gives a permanent link to use for bookmarking purposes") and the HTML5 specification ("Gives the permalink for the nearest ancestor section") are not especially clear, nor is the example in the HTML5 specification.
  2.  rel="bookmark" exists to address a problem, anonymous content, that has been so thoroughly solved that the original motivation is hard to appreciate. 
In our Signposting work, we had originally hoped we could use rel="bookmark" to mean "please use this other URI when you press control-D".  For example, we hoped the HTML at could have:

<link rel="bookmark"></link>

And when the user hit "control-D" (the typical keyboard sequence for bookmarking), the user agent would use the URI instead of the current URI at  But alas, that's not why rel="bookmark" was created, and the original intention is likely why rel="bookmark" is prohibited from <link> elements.  I say likely because the motivation is not well documented and I'm inferring it from the historical evidence and context.

In the bad old days of the early web, newsfeeds, blogs, forums, and the like did not universally support deep links, or permalinks, to their content.  A blog would consist of multiple posts displayed within a single page.  For example, page 1 of a blog would have the seven most recent posts, page 2 would have the previous seven posts, etc.  The individual posts were effectively anonymous: you could link to the "top" of a blog (e.g.,, but links to individual posts were not supported; for example this individual post from 2015 is no longer on page 1 of the blog and without the ability to link directly to its permalink, one would have to click backwards through many pages to discover it.

Of course, now we take such functionality for granted -- we fully expect to have direct links to individual posts, comments, etc.  The earliest demonstration I can find is from this blog post from 2000 (the earliest archived version is from 2003, here's the 2003 archived version of top-level link to the blog where you can see the icon the post mentions).   This early mention of a permalink does not use the term "permalink" or relation rel="bookmark"; those would follow later. 

The implicit model with permalinks appears to be that there would be > 1 rel="bookmark" assertions within a single page, thus the relation is restricted to <a> and <area> elements.  This is because <link> elements apply to the entire context URI (i.e., "the page") and not to specific links, so having > 1 <link> elements with rel="bookmark" would not allow agents to understand the proper scoping of which element "contains" the content that has the stated permalink (e.g., this bit of javascript promotes rel="bookmark" values into <link> elements, but scoping is lost).  An ASCII art figure is in order here:

|                            |
|  <A href="blog.html"       |
|     rel=bookmark>          |
|  Super awesome alphabet    |
|  blog! </a>                |
|  Each day is a diff letter!|
|                            |
|  +---------------------+   |
|  | A is awesome!!!!    |   |
|  | <a href="a.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for A </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | B is better than A! |   |
|  | <a href="b.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for B </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | C is not so great.  |   |
|  | <a href="c.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for C </a>|   |
|  +---------------------+   |
|                            |

$ curl blog.html
Super awesome alphabet blog!
Each day is a diff letter!
A is awesome!!!!
permalink for A
B is better than A!
permalink for B 
C is not so great.
permalink for C
$ curl a.html
A is awesome!!!!
permalink for A
$ curl b.html
B is better than A!
permalink for B 
$ curl c.html
C is not so great.
permalink for C

In the example above, the blog has a rel="bookmark" to itself ("blog.html") and since the <a> element appears at the "top level" of the HTML, it is understood that the scope of the element applies to the entire page.  In the subsequent posts, the scope of the link is bound to some ancestor element (perhaps a <div> element) and thus it does not apply to the entire page.  The rel="bookmark" to "blog.html" is perhaps unnecessary, since a user agent typically knows the URL of the page it is currently displaying (though it might not in some conditions, such as when the page is the response to a POST request), but surfacing the link with an <a> element makes it easy for the user to right-click, copy-n-paste, etc.  If "blog.html" had four <link rel="bookmark" > elements, the links would not be easily available for user interaction and scoping information would be lost.
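To make the scoping argument concrete, here is a minimal sketch in Python that walks HTML like the figure above and reports each rel="bookmark" link together with the element that scopes it. It assumes, as a simplification, that each post is wrapped in a <div>; the id values ("post-a", etc.) are hypothetical and are not part of the original figure.

```python
from html.parser import HTMLParser


class BookmarkScopes(HTMLParser):
    """Collect rel="bookmark" links with their nearest enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # ids of the <div> elements currently open
        self.bookmarks = []   # (href, scope) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_stack.append(attrs.get("id") or "(anonymous div)")
        elif tag in ("a", "area"):
            # rel is a space-separated list of link relation types
            rels = (attrs.get("rel") or "").lower().split()
            if "bookmark" in rels and attrs.get("href"):
                scope = self.div_stack[-1] if self.div_stack else "(whole page)"
                self.bookmarks.append((attrs["href"], scope))

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


def bookmark_scopes(html):
    """Return (href, scope) for every <a>/<area> carrying rel="bookmark"."""
    parser = BookmarkScopes()
    parser.feed(html)
    parser.close()
    return parser.bookmarks


# Hypothetical markup mirroring the alphabet blog in the figure.
PAGE = """
<a href="blog.html" rel="bookmark">Super awesome alphabet blog!</a>
<div id="post-a">A is awesome!!!!
  <a href="a.html" rel="bookmark">permalink for A</a></div>
<div id="post-b">B is better than A!
  <a href="b.html" rel="bookmark">permalink for B</a></div>
<div id="post-c">C is not so great.
  <a href="c.html" rel="bookmark">permalink for C</a></div>
"""

for href, scope in bookmark_scopes(PAGE):
    print(href, "->", scope)
```

Because each <a> element sits inside the content it identifies, the parser can recover the scope from the element nesting alone. Four <link rel="bookmark"> elements in the <head> would give it nothing to nest under, which is the information loss described above.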

And it's not just for external content ("a.html", "b.html", "c.html") like the example above.  In the example below, rel="bookmark" is used to provide permalinks for individual comments contained within a single post.

|                            |
|  <A href="a.html"          |
|     rel=bookmark>          |
|  A is awesome!!!!</a>      |
|                            |
|  +---------------------+   |
|  | <a name="1"></a>    |   |
|  | Boo -- I hate A.    |   |
|  | <a href="a.html#1"  |   |
|  |    rel=bookmark>    |   |
|  | 2017-08-01 </a>     |   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | <a name="2"></a>    |   |
|  | a series of tubes!  |   |
|  | <a href="a.html#2"  |   |
|  |    rel=bookmark>    |   |
|  | 2017-08-03 </a>     |   |
|  +---------------------+   |
|                            |

This style exposes the direct links of the individual comments, and in this case the anchor text for the permalink is the datestamp of when the comment was made (by convention, permalinks often have anchor text or title attributes of "permalink", "permanent link", datestamps, the title of the target page, or variations of these approaches).  Again, it would not make sense to have three separate <link rel="bookmark" > elements here, obscuring scoping information and inhibiting user interaction. 

So why prohibit <link rel="bookmark" > elements?  Why not allow just a single <link rel="bookmark" > element in the <head> of the page, which would by definition enforce the scope to apply to the entire document?  I'm not sure, but I guess it stems from 1) the intention of surfacing the links to the user, 2) the assumption that a user-agent already knows the URI of the current page, and 3) the assumption that there would be > 1 bookmarks per page.  I suppose uniformity was valued over expressiveness. A 1999 HTML specification does not explicitly mention the <link> prohibition, but it does mention having several bookmarks per page.

An interesting side note is that while typing self-referential, scoped links with rel="bookmark" to differentiate them from just regular links to other pages seemed like a good idea ca. 1999, such links are now so common that links with the anchor text "permalink" or "permanent link" often do not bother to use rel="bookmark" (e.g., Wikipedia pages all have "permanent link" in the left-hand column but do not use rel="bookmark" in the HTML source, whereas the blogger example captured in the image above does use bookmark).  The extra semantics are no longer novel and are contextually obvious. 

In summary, in much the same way there is confusion about rel="canonical",  which is better understood as rel="hey-google-index-this-url-instead", perhaps a better name for rel="bookmark" would have been rel="right-click".  If you s/bookmark/right-click/g, the specifications and examples make a lot more sense. 

--Michael & Herbert

N.B. This post is a summary of discussions in a variety of sources, including this WHATWG issue, this tweet storm, and this IETF email thread.

Friday, August 25, 2017

2017-08-25: University Twitter Engagement: Using Twitter Followers to Rank Universities

Figure 1: Summing primary and secondary followers for @ODUNow
Our University Twitter Engagement (UTE) rank is based on the friend and extended follower network of primary and affiliated secondary Twitter accounts referenced on a university's home page. We show that UTE has a significant, positive correlation with expert university reputation rankings (e.g., USN&WR, THE, ARWU) as well as rankings by endowment, enrollment, and athletic expenditures (EEE). As illustrated in Figure 1, we bootstrapped the process by starting with the URI for the university's homepage obtained from the detailed institutional profile information in the ranking lists. For each URI, we navigated to the associated webpage and searched the HTML source for links to valid Twitter handles. Once a Twitter screen name was identified, the Twitter GET users/show API was used to retrieve the URI from the profile of that user name. If the domain of the URI matched exactly or resolved to the known domain of the institution, we considered the account to be one of the university's official, primary Twitter handles since the user had self-associated with the university via the URI reference.

As an example, the user names @NBA, @DukeAnnualFund, @DukeMBB, and @DukeU were extracted from the page source of the Duke University homepage ( However, only @DukeAnnualFund and @DukeU are considered official primary accounts because their respective URIs, and, are in the same domain as the university.  On the other hand, @DukeMBB maps to, which is not in the same domain as, so we don't include it among the official accounts. Ultimately, we delve deeper into the first and second degree relationships between Twitter followers to identify the pervasiveness of  the university community which includes not only academics, but sports teams, high profile faculty members, and other sponsored organizations.
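The domain test described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study: it only compares host names as strings (it does not follow redirects to see where a profile URI "resolves"), it treats subdomains of the university's host as being in the same domain, and the example.edu and example-sports.com URIs are hypothetical.

```python
from urllib.parse import urlparse


def host_of(uri):
    """Lower-cased host of a URI, with any leading 'www.' removed.

    Returns an empty string when the URI has no scheme/host part.
    """
    host = (urlparse(uri).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_primary_account(profile_uri, university_uri):
    """True if the profile URI's host is the university's host or a subdomain of it.

    Assumption: a string comparison stands in for the 'matched exactly or
    resolved to the known domain' test; no HTTP requests are made.
    """
    profile_host = host_of(profile_uri)
    university_host = host_of(university_uri)
    if not profile_host or not university_host:
        return False
    return (profile_host == university_host
            or profile_host.endswith("." + university_host))


# Hypothetical profile URIs for a university whose homepage is www.example.edu.
homepage = "https://www.example.edu/"
print(is_primary_account("https://giving.example.edu/", homepage))    # same domain
print(is_primary_account("http://www.example-sports.com/", homepage)) # different domain
```

A stricter variant would resolve each profile URI first and compare the final host, at the cost of one request per account.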

We aggregated the rankings from multiple expert sources to calculate an adjusted reputation rank (ARR) for each university which allows direct comparison based on position in the list and provides a collective perspective of the individual rankings. In rank-to-rank comparisons using Kendall's Tau, we observed a significant, positive rank correlation (τ=0.6018) between UTE and ARR which indicates that UTE could be a viable proxy for ranking atypical institutions normally excluded from traditional lists.  We also observed a strong correlation (τ=0.6461) between UTE and EEE suggesting that universities with high enrollments, endowments, and/or athletic budgets also have high academic rank. The top 20 universities as ranked by UTE are shown in Table 1. We've highlighted a few universities where there is a significant disparity between the ARR and the UTE ranking which indicates larger Twitter followings than can be explained just by academic rank.

University                                   UTE Rank  ARR Rank
Harvard University                           1         1
Stanford University                          2         2
Cornell University                           3         10
Yale University                              4         7
University of Pennsylvania                   5         8
Arizona State University--Tempe              6         59
Columbia University in the City of New York  7         4
Texas A&M University--College Station        8         39
Wake Forest University                       9         74
University of Texas--Austin                  10        16
Pennsylvania State University--Main Campus   11        24
University of Michigan--Ann Arbor            12        10
University of Minnesota--Twin Cities         13        16
Ohio State University--Main Campus           14        22
Princeton University                         15        4
University of Wisconsin--Madison             16        14
University of Notre Dame                     17        46
Boston University                            18        21
University of California--Berkeley           19        3
Oklahoma State University--Main Campus       20        100

Table 1: Top 20 Universities Ranked by UTE for Comparison With ARR
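For readers unfamiliar with Kendall's Tau, the sketch below shows how a rank-to-rank comparison of this kind can be computed in plain Python. It implements the tau-b variant, which tolerates ties such as the repeated ARR ranks in Table 1; which variant was used for the reported values is an assumption on our part. The sample input is just the first five rows of Table 1, so its result is not comparable to the τ values reported for the full data set.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equally long lists of ranks."""
    if len(xs) != len(ys):
        raise ValueError("both rankings must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied in both lists: ignored by tau-b
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1          # the pair is ordered the same way in both lists
        else:
            discordant += 1          # the pair is ordered in opposite ways
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denominator if denominator else 0.0


# First five rows of Table 1: UTE rank and ARR rank.
ute_top5 = [1, 2, 3, 4, 5]
arr_top5 = [1, 2, 10, 7, 8]
print(kendall_tau_b(ute_top5, arr_top5))
```

A value near 1 means the two lists order the universities almost identically, a value near -1 means the orders are reversed, and a value near 0 means the rankings are unrelated.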

We have prepared an extensive report of our findings as a technical report available on arXiv (linked below). We have also posted all of the ranking and supporting data used in this study, which includes a social-media-rich dataset containing over 1 million Twitter profiles, ranking data, and other institutional demographics, in the oduwsdl GitHub repository.

- Corren (@correnmccoy)

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Monday, August 14, 2017

2017-08-14: Introducing Web Archiving and Docker to Summer Workshop Interns

Last Wednesday, August 9, 2017, I was invited to give a talk to some summer interns of the Computer Science Department at Old Dominion University. Every summer our department invites some undergrad students from India and hosts them for about a month to work on projects in a research lab here as summer interns. During this period, various research groups introduce their work to those interns to encourage them to become potential graduate applicants. Those interns also act as academic ambassadors who motivate their colleagues back in India to pursue higher studies.

This year, Mr. Ajay Gupta invited a group of 20 students from Acharya Institute of Technology and B.N.M. Institute of Technology and supervised them during their stay at Old Dominion University. Like last year, I was selected from the Web Science and Digital Libraries Research Group to introduce them to the concept of web archiving and the various research projects of our lab. An overview of the talk can be found in my post from last year.

Recently, I was selected as the Docker Campus Ambassador for ODU. I thought it would be a great opportunity to introduce those interns to the concept of software containerization. Among numerous other benefits, it would help them deal with the "works on my machine" problem (also known as the "magic laptop" problem), common in students' lives.

After finishing the web archiving talk, I briefly introduced them to the basic concepts and building blocks of Docker. Then I illustrated the process of containerization with the help of a very simple example.

I encouraged them to interrupt me during the talk to ask any relevant questions as both topics were fairly new to them. Additionally, I tried to bring in references from Indian culture, politics, and cinema to make it more engaging for them. Overall, I was very happy with the kind of questions they were asking, which gave me the confidence that they were actually absorbing these new concepts and not asking questions just for the sake of grabbing some swag, which included T-shirts and stickers from Docker and Memento.

Sawood Alam

Friday, August 11, 2017

2017-08-11: Where Can We Post Stories Summarizing Web Archive Collections?

A social card generated by Facebook for my previous blog post.
Rich links, snippet, social snippet, social media card, Twitter card, embedded representation, rich object, social card. These visualizations of web objects now pervade our existence on and off of the Web. The concept has been used to render web documents as results in academic research projects, like in Omar Alonso's "What's Happening and What Happened: Searching the Social Web". oEmbed is a standard for producing rich embedded representations of web objects for a variety of consuming services. Google experiments with using richer objects in their search results, even including images and other content from pages. Facebook, Twitter, Tumblr, Storify, and other tools use these cards. They have become so ubiquitous that services that do not produce these cards, like Google Hangouts, seem antiquated. These cards also no longer just sit within the confines of the web browser, being used in Apple's iMessage application since the release of iOS 10, as shown below. For simplicity, I will use the term social card for the rest of this post.

Apple's iOS iMessage app also generates social cards. This example also shows a card linking to my previous blog post.
Why use these cards? Why not just allow applications to copy and paste links as plaintext URIs? For many end users, URIs are unwieldy. Consider the URI below. Even though copying and pasting mitigates many of the issues with having to type this URI, it is still quite long. There is also very little information in the URI indicating to what document it will lead the end user.,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582

Now consider the following social card from Facebook for this same URI. The card tells the user that it is from Google Maps and contains directions from Old Dominion University to Los Alamos National Laboratory. Most importantly, it does not require that the user know any details about how Google Maps constructs its URIs.

A social card on Facebook generated from a Google Maps URI that represents a document providing directions from Old Dominion University to Los Alamos National Laboratory.

In effect, social cards are visualizations of web objects, piercing the veil created by the opaqueness of a URI. Thanks to social cards, the end user gets some information about the content of the URI before clicking on it, preventing them from visiting a site they may not have time or bandwidth for. In Yasmin AlNoamany's Dark and Stormy Archives (DSA), she uses social cards in Storify stories to summarize mementos from Archive-It collections. These stories take the form of 28 high quality mementos represented by social cards ordered by publication date. The screenshot below shows the Storify story containing links generated by the DSA for Archive-It collection 3649 about the Boston Marathon Bombing in 2013.

The Dark and Stormy Archives (DSA) application summarizes Archive-It collections as a collection of 28 well-chosen, high-quality mementos that are ordered by publication date and then visualized as social cards in Storify. This screenshot shows the Storify output of the DSA for Archive-It collection 3649 about the Boston Marathon Bombing in 2013.
A visualization requiring increased cognitive load requires more effort from the end user, and, in some cases, hinders performance. Earlier attempts at visualizing Archive-It collections by Padia and others required training in how to use each visualization and their complexity may have produced increased cognitive load in the end user. A well-chosen, reduced set of links visualized as social cards works better than other visualizations that attempt to summarize web archives due to the low cognitive load required on behalf of the consumer. Each social card is a visualization unto itself, hence a collection of social cards becomes an instance of the visualization technique of small multiples.
Small multiples were initially categorized in 1983 by Tufte in his Visual Display of Quantitative Information, but are present as a technique as far back as Eadweard Muybridge's Horse in Motion from 1886. Small multiples allow the user to compare the same attributes of different sets of data. Consider the line graphs in the example below. Each details the expenses for different departments in an organization during the time period ranging from July to December. Note how the same x-axis on each graph allows the viewer to compare the expenses over time between each department. The key is that each visualization places the same data in the same spatial region, allowing for easy comparison.
An example of small multiples. Source: Wikipedia.

Each social card is a data item consisting of multiple attributes. The same attribute for each item is presented in the same spatial region of a given card. This allows the user to scan the list of cards for a given attribute, such as title, without being overwhelmed by the values of the rest of the attributes present. This consistency makes it easy to compare each card in the set. Below is a diagram of a given Storify card with annotations detailing its attributes. This becomes an effective storytelling method for events because users can see the cards in the order that their respective content was written.
Storify cards consist of multiple attributes that are visualized in the same spatial region on each card. This card exists for the live link
AlNoamany uses Storify in this way, but how well might other tools work for visualizing the output of the DSA? Can they serve as a replacement for Storify?

This post is a re-examination of the landscape since AlNoamany's dissertation to see if there are tools other than Storify that the DSA can use. It covers the tools living in the spaces of content curation, storytelling, and social media. AlNoamany's dissertation lists several that fit into different categories, and understanding these categories led to the discovery of more tools. The tools discussed in this post come from three sources: AlNoamany's dissertation, "Content Curation Tools: The Ultimate List" by Curata, and "40 Social Media Curation Sites and Tools" by Shirley Williams. Curation takes many forms for many different reasons, but not all of them are suitable for the DSA framework. After this journey, I settle on four tools -- Facebook, Storify, Tumblr, and Twitter Moments -- that might be useful contenders.

For some tools, in order to test how well they generated social cards and collections for mementos, I used the URIs from the Boston Marathon Bombing 2013 Stories 3649spst0s and 3649spst1s generated as part of AlNoamany's DSA work. If I needed to contrast them with live web examples, I used the URI

Engaging With Customers

A number of tools exist for the purpose of customer engagement. They provide the ability to curate content from the web with the goal of increasing confidence in a brand.

With collections can be shared internally so that they can be reviewed by teams in order to craft a message. It allows an organization to curate their own content and coordinate a single message across multiple social channels. They provide social monitoring and analysis of the impact of their message. They use their curated content to develop plans for addressing trends, dealing with crises (e.g., the recent Pepsi commercial fiasco), and ensuring that customers know the company is a key player in the market (e.g., IBM's Big Data & Analytics Hub). Cision, FlashIssue, Folloze, Spredfast, Sharpr, Sprinklr, and Trap!t are tools with a similar focus. I requested demos and discussions about these tools with their corresponding companies, but only received feedback from and Spredfast, who were instrumental in helping me understand this space.

Roojoom, Curata, SimilarWeb, and Waywire Enterprises focus more on helping influence the development of the corporate web site with curated content. RockTheDeadLine offers to curate content on the organization's behalf. CurationSuite (formerly CurationTraffic) focuses on providing a curated collection as a component of a WordPress blog. These services go one step further and provide site integration components in addition to mere content curation. Curata has a lot of documentation and several whitepapers that helped me understand the reasons for these tools.

Hootsuite, Pluggio, PostPlanner, and SproutSocial focus on collecting and organizing content and responses from social media. They do not provide collections for public consumption in the same way Storify or a Facebook album would. Hootsuite in particular provides a way to gather content from many different social networking accounts at once while synchronizing outgoing posts across all of them.

All of these tools offer analytics packages that permit the organization to see how the produced newsletter or web content is performing. Though these tools do focus on curating content, their focus is customer engagement and marketing. Most of these tools focus on trends and web content in aggregate rather than showcasing individual web resources.

Our focus in this project is to find new ways of visualizing and summarizing Archive-It collections. Though some of these tools might be capable of doing this, their cost and unused functionality make them a poor fit for our purpose.

Focusing on the Present

Some tools allow the user to supply a topic as a seed for curated content. The tool will then use that topic and its own internal curation service to locate content that may be useful to the user. A good example is a local newspaper. A resident of Santa Fe, for example, will likely want to know what content is relevant to their city, and hence would be better served by the curation services of the Santa Fe New Mexican than they would by the Seattle Times. The newspaper changes every day, but the content reflects the local area. presents a different collection each day based on the user's keywords. I created "The Science Daily", which changes every day. The content for June 4, 2017 (left) is different from the content for June 5, 2017 (right).
This category of curation tools is not limited by geographic location. The input in the system is a set of search terms representing a topic. and UpContent allow one to create a personalized newspaper about a given topic that changes each day, providing fresh content to the user. ContentGems is much the same, but supports a complex workflow system that can be edited to supply content from multiple sources. ContentGems also allows one to share their generated paper via email, Twitter, RSS feed, website widgets, IFTTT, Zapier, and a whole host of other services. DrumUp uses a variety of sources from the general web and social media to generate topic-specific collections. They also allow the user to schedule social media posts to Facebook, Twitter, and LinkedIn. Where appears to be focused on a single user, ContentGems and DrumUp easily stretch into customer engagement, and UpContent offers different capabilities depending on the tier to which the user has subscribed.
(left) The Tweeted Times shows some of the tweets from the author's Twitter feed.
(right) Tagboard requires that a user supply a hashtag as input before creating a collection. 
The Tweeted Times and Tagboard both focus on content from social media. The Tweeted Times attempts to summarize a user's Twitter feed and later publishes that summary at a URI for the end user to consume. Tagboard uses hashtags from Facebook or Twitter as seeds to their content curation system.

The tools in this section focus on content from the present. They do not allow a user to supply a list of URIs which are then stored in a collection, hence are not suitable for inclusion into Dark and Stormy Archives.

Sharing and the Lack of Social Cards

There is a spectrum of sharing. Storify allows one to share their collection publicly for all to see. Other tools expect only subscribed accounts to view their collections. In these cases, subscribed accounts may be acquired for free or at cost. Feedly supports sharing of collections only for other users in one's team, a grouping of users that can view each other's content. Pinboard and Pocket are slightly less restrictive, permitting other portal users to view their content. In addition, both Pinboard and Pocket promise paying customers the ability to archive their saved web resources for later viewing. Shareist only shares content via email and on social media, not producing a web-based visualization of a collection. We are interested in tools that allow us to not only share collections of mementos on the web, but also share them with as few barriers to viewing as possible.

Huzzaz and Vidinterest only support URIs to web resources that contain video. Both support YouTube and Vimeo URIs, but only Vidinterest supports Dailymotion. Neither supports general URIs, let alone URI-Ms from Archive-It. Instagram and Flickr work specifically with images, and they do not create social cards for URIs. Sutori allows one to curate URIs, but does not create social cards. Even though Twitter may render a social card in the Tweets, the card is not present when the tweets are visualized in a collection using Togetter.

A screenshot of a Tweet containing a social card for
A screenshot of a Togetter collection of live links containing the Tweet from above as the fourth in the collection. Note that none of these URIs show a social card, even
This screenshot shows a live link inserted into a Sutori story, with no social card.

A test post in Instagram where I attempted to add several URIs as comments, including the URI used in the Twitter example above. Instagram produced no social cards for these URIs and did not make them links either.

Card Size Matters

Some tools change the size of the card for effect, or to allow extra data in one card rather than another. These size changes interrupt the visual flow of the small multiples paradigm I mentioned in the introduction. While good for presenting in newspapers or other tools that collect articles, such size changes make it difficult to follow the flow of events in a story. They create additional cognitive load on the user, forcing her to constantly ask "does this different sized card come before or after the other cards in my view?" and "how does this card fit into the story timeline?"


Flipboard orders the social cards from left to right then up and down, but changes the size of some of the cards.

Flipboard often makes the first social card the largest, dominating the collection as seen in the screenshot above. Sometimes it will choose another card in the collection and increase its size as well. Flipboard also has other issues. In the screenshot below, we see a social card rendered for a live link, but in the screenshot below that we see that Flipboard does not do so well with mementos.
A social card generated in Flipboard for the live URI
A screenshot of a collection of mementos about the Boston Marathon Bombing stored in Flipboard.

In this collection, changes the size of some social cards based on the amount of data present in the card. changes the size of some social cards due to the presence of large images or extra text in the snippet. These changes distort the visual flow of the collection. There are also restrictions, even for paying users, on the amount of content that can be stored, with even a top subscription of $33 per month being limited to only 15 collections.


Flockler alters the sizes of some cards based on the information present. Note: because I only had a trial account, this Flockler collection may no longer be accessible.
Flockler alters the size of its cards based on the information present. Cards with images, titles, and snippets are larger than those with just text. As shown below, sometimes Flockler cannot extract any information and generates empty cards or cards whose title is the URI.

A screenshot of social cards generated from Archive-It mementos in a Flockler collection about the Boston Marathon Bombing. The one on top just displays the link while the one in the middle is empty. Links to mementos: top, middle, bottom.


The same mementos visualized in social cards in this Pinterest collection. Pinterest supports collections, but does not typically generate social cards, favoring images.

Pinterest has a distinct focus on images, but does create social cards  (i.e., "pins" in the Pinterest nomenclature) for web resources. The system requires a user to select an image for each pin. Interestingly, the first image presented when a user is generating a pin is often the same that is selected by Storify when it generates social cards. Unfortunately, the images are all different sizes, making it difficult to follow the sequence of events in the story.

In addition to the size issue, if Pinterest cannot find an image in a page or if the image is too small, it will not create a social card. It could not find any images for URI-M and all images for were too small.

If an image is too small, Pinterest will issue an error and refuse to post the link.
Pinterest also presents another problem. During the processing of some social cards, Pinterest converts the URI-M into a URI-R. For example, in the screenshot above we see that one of the social cards bears the domain name "", but clicking on the card leads one to a card for "".
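One way to notice this kind of rewriting is to check whether the link behind a card is still a URI-M or has been reduced to its URI-R. The sketch below assumes Wayback-style URI-Ms of the form archive host, optional collection segment, 14-digit datetime, then the original URI; the timestamp and example.com URI are hypothetical, and other archives may structure their URI-Ms differently.

```python
import re

# Assumed Wayback-style layout:
#   http://<archive host>/<optional collection>/<14-digit datetime>/<URI-R>
URI_M = re.compile(
    r"^https?://[^/]+/(?:[^/]+/)?(?P<datetime>\d{14})(?:[a-z]{2}_)?/(?P<uri_r>.+)$"
)


def split_uri_m(uri):
    """Return (datetime, URI-R) for a Wayback-style URI-M, or None otherwise."""
    match = URI_M.match(uri)
    if match is None:
        return None
    return match.group("datetime"), match.group("uri_r")


def was_rewritten(submitted_uri, card_uri):
    """True if a URI-M was submitted but the card now points at its URI-R."""
    submitted = split_uri_m(submitted_uri)
    return (submitted is not None
            and split_uri_m(card_uri) is None
            and card_uri == submitted[1])


# Hypothetical memento in Archive-It collection 3649.
memento = "http://wayback.archive-it.org/3649/20130415120000/http://www.example.com/news"
print(split_uri_m(memento))
print(was_rewritten(memento, "http://www.example.com/news"))
```

A check like this could be run over every card in a generated story to confirm that readers are still being sent to the archived copy rather than the live page.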


As seen in this collection, Juxtapost changes the size of social cards and even moves them out of the way for advertisements (top right text says "--Advertisement--"). Which direction does the story flow?

Juxtapost is the other tool which changes the size of the social cards. In addition, it requires that the end user select an image and insert a description for every card. Even if it weren't for the changing sizes of the cards, the manual labor would make this unsuitable for use in the DSA.

Juxtapost also refuses to add a resource (e.g., for which it can find no images.


Google+ collection for the Boston Marathon Bombing viewed with a window size of 2033 x 1254.
The same Google+ collection viewed with a window size of 1080 x 1263.

The same Google+ collection viewed in a window resized to 945 x 1265.
As shown in the screenshots above, the direction and size of the cards in a Google+ collection change depending on the size of the window used to view the collection. This is likely a result of adjusting the page for mobile screen sizes. Although Google+ had no problems generating cards for all of our test mementos, the first figure above does not clearly indicate the direction in which the events in the story unfolded, and thus this information is lost in Google+.

Problems That APIs Might Solve

Of course, the Dark and Stormy Archives software generates its visualization automatically. This makes the use of a web API quite important for the tool. The DSA generates 28 links per Archive-It collection. Would it be acceptable for a human to submit these links to one of these tools much like I have done? What if the collection changes frequently and the DSA must be rerun to account for these changes?

In addition to freeing humans from creating stories, AlNoamany was able to use the Storify API to assist Storify in developing richer social cards, adding dates and favicons to override and improve upon the information that Storify extracted from mementos. The human interface for Storify also had some problems creating cards for mementos, and these problems could be overcome by using the Storify API.

Pearltrees has no API. I could not find APIs for Symbaloo, eLink, ChannelKit, or BagTheWeb. Listly has an API, but it is not public.

BagTheWeb requires additional information supplied by the user in order to create a social card. As seen below, BagTheWeb does not generate any social card data based solely on the URI. If there were an API, the DSA platform might be able to address some of these shortcomings. Symbaloo is much the same. It chooses an image, but often favors the favicon over an image selected from the article.

This is a screenshot of a social card created by BagTheWeb for
A screenshot of a card created by Symbaloo for the same URI.
Pearltrees has problems that may be addressed by an API that allows the user to specify information. The example screenshot below displays a Firefox error instead of a selected image in the social card. This is especially surprising because the system was able to extract the title from the destination URI. Pearltrees also tends to convert URI-Ms to URI-Rs, linking to the live page instead of the archived Archive-It page.

A screenshot of two social cards created from Archive-It mementos by Pearltrees in a collection about the Boston Marathon. The one on the left displays a Firefox error instead of a selected image for the memento. Links to mementos: left, right.
The social cards generated by eLink have a selected image, a title, and a text snippet. Sometimes, however, they do not seem to find the image, as seen in the screenshot below. also has similar problems for some URIs, also shown below. An API call that allows one to select an image for the card would help improve this tool's shortcomings.

A screenshot of two social cards generated from Archive-It mementos from an eLink collection about the Boston Marathon Bombing. The one on the left shows a missing selected image while the one on the right displays fine. Links to mementos: left, right.
ChannelKit usually generates nice social cards, complete with a title, text snippet, and a selected image or web page thumbnail. Sometimes, as shown below, the resulting card contains no information and a human must intervene. Listly also has issues with some of the links submitted to it. It usually generates a title, text snippet, and selected image, but in some cases, as shown below, just lists the URI. Flockler also has similar problems, shown below. An API call that allows one to supply the missing information would be helpful in addressing these issues.

A screenshot of the social cards generated from Archive-It mementos in a ChannelKit collection about the Boston Marathon Bombing. The one on the right shows no useful information. Links to mementos: left, right.
A screenshot of social cards generated from Archive-It mementos in a Listly collection about the Boston Marathon Bombing. The one on the top has no information but the URI. The one on the bottom contains a title, selected image, and snippet. Links to mementos: top, bottom.

Curation Tools Useful for Visualization of Archive-It Collections

The final four tools have APIs, produce social cards, and allow for collections. I decided to review them in more detail using the mementos generated by the DSA tool against the Archive-It collection 3649 about the Boston Marathon Bombing in 2013, corresponding to this Storify story. I created these collections by hand and did not use their associated APIs. Storify is already in use in the DSA, and hence I did not bother to review it again here.

In this section I discuss these tools and their shortcomings. I also discuss how DSA might be able to overcome some of those shortcomings with the tool's API.


Selected mementos from a run of the Dark and Stormy Archives tool on Archive-It collection 3649 about the Boston Marathon Bombing as visualized in social cards in Facebook comments where the collection is stored as a Facebook post.

With 1.871 billion users, Facebook is the most popular social media tool. Facebook supports social cards in posts and comments. Facebook also supports creating albums of photos, but not of posts. Posts contain comments, however. In order to generate a series of social cards in a collection, I gave the post the title of the collection and supplied each URI-M in the story to a separate comment. In this way, I generated a story much like AlNoamany had done with Storify.

A screenshot of two Facebook comments. The URI-M below generated a social card, but the URI-M above did not.
As seen above, Facebook does occasionally fail to generate social cards for links. The Facebook API could be used to update such comments with a photo and a snippet, if necessary. Providing additional images is not possible, as Facebook posts and comments will not generate a social card if the post/comment already has an image.


The same mementos visualized in social cards in Tumblr where the collection is denoted by a hashtag.
Weighing in with 550 million users is Tumblr. Tumblr is a complex social media tool supporting many different types of posts. A user selects which type of post they desire and then supplies the necessary data or metadata. For example, if a user wanted to generate something like a Facebook post or a Twitter tweet, they would choose "Text". The interface for selecting a type of post is shown below.

This screenshot shows the interface used by a user when they wish to post to Tumblr. It shows the different types of posts possible with the tool.
The post type "Link" produces a social card for the supplied link. In addition to the social card generated by Tumblr, the "Link" post can also be adorned with an additional photo, video, or text.

All of these post types are available as part of the Tumblr API. If a social card lacks an image, or if the DSA wants to supply additional text, the post can be updated appropriately.

I use hashtags to create collections on Tumblr. The hashtags are confined to a specific blog controlled by that blog's user, hence posts outside of the blog do not intrude into the collection, as would happen with hashtags on Twitter or Facebook.
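
To illustrate how the DSA might prepare such posts, here is a minimal sketch that builds the form fields for one "Link" post per URI-M, all sharing a single collection hashtag. It assumes the field names of Tumblr's legacy post-creation endpoint (`type`, `url`, `title`, `description`, `tags`), which should be checked against the current API documentation. The function names, tag, and example URI-M are hypothetical, and nothing is sent over the network.

```python
def build_link_post(uri_m, collection_tag, title=None, description=None):
    """Build the form fields for one Tumblr "Link" post; no request is made.

    Field names are an assumption based on Tumblr's legacy post endpoint.
    """
    fields = {"type": "link", "url": uri_m, "tags": collection_tag}
    # Optional fields let the DSA supply what the generated card may lack.
    if title:
        fields["title"] = title
    if description:
        fields["description"] = description
    return fields


def build_story(uri_ms, collection_tag):
    """One "Link" post per URI-M, in story order, all tagged with the collection."""
    return [build_link_post(uri_m, collection_tag) for uri_m in uri_ms]


# Hypothetical URI-M and hashtag used only for illustration.
story = build_story(
    ["http://wayback.archive-it.org/3649/20130415200000/http://www.example.com/story"],
    "dsa-boston-marathon",
)
print(story[0]["type"], story[0]["tags"])
```

Each dictionary would then be submitted, with authentication, to the blog that owns the collection, so that the hashtag stays confined to that blog.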

Twitter Moments

This Twitter Moment contains tweets that contain the URI-Ms from our Dark and Stormy Archives summary.

Twitter has 317 million users worldwide. While all tools require that the user title the collection in some way, Twitter Moments requires that the user upload an image separately in order to create a collection. This image serves as the striking image for the collection. The user is also compelled to supply a description.

Sadly, much like Flipboard, Twitter does not appear to generate social cards for URI-Ms from Archive-It. Shown below in a Twitter Moment, the individual URI-Ms are displayed in their tweets with no additional visualization.
Unfortunately, as we see in the same Twitter Moment, tweets do not render social cards for our Archive-It URI-Ms.
DSA could use the Twitter API to add images and additional text (up to 140 characters of course) to supplement these tweets. At that point, the DSA is building its own social cards out of tweets.
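
As a rough illustration of that character budget, the sketch below trims supplementary text so that it fits in a tweet alongside a URI-M. It assumes the 140-character limit mentioned above and that every link counts as a fixed-length 23-character shortened URL regardless of its real length. Both constants, the function names, and the example URI-M are assumptions and should be checked against Twitter's current rules.

```python
TWEET_LIMIT = 140       # limit cited in this post
SHORT_URL_LENGTH = 23   # assumption: each link is counted as a fixed-length short URL


def compose_tweet(uri_m, text):
    """Return "<text> <uri_m>", shortening the text so the counted length fits."""
    budget = TWEET_LIMIT - SHORT_URL_LENGTH - 1  # one space separates text and link
    if len(text) > budget:
        # Leave room for a one-character ellipsis marking the cut.
        text = text[: budget - 1].rstrip() + "…"
    return f"{text} {uri_m}"


def counted_length(tweet, uri_m):
    """Length of the tweet with the link counted as a short URL."""
    return len(tweet) - len(uri_m) + SHORT_URL_LENGTH


# Hypothetical URI-M used only for illustration.
uri_m = "http://wayback.archive-it.org/3649/20130415200000/http://www.example.com/story"
tweet = compose_tweet(uri_m, "word " * 60)
print(counted_length(tweet, uri_m) <= TWEET_LIMIT)
```

With only about 116 characters left for a title, date, and snippet, a tweet can carry far less than the social cards produced by the other tools.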

Other Paradigms

In this post, I tried to find the tools that would replace Storify as it currently exists, but what about different paradigms of storytelling? The point of the DSA framework is to visualize an Archive-It collection. Other visualization techniques could use the tools I have discarded on this list. For example, Instagram has been used successfully by activist organizations and government entities as a storytelling tool. It is also being actively used by journalists. Even though Instagram works primarily through photos, is there some way we can use it for storytelling like these people have been doing? What other paradigms can we explore for storytelling?


Considering how Storify is used in the Dark and Stormy Archives framework took me on a long ride through the world of online curation. I read about tools that are used purely for customer engagement, those that live in the perpetual now, those that do not provide public sharing, those that do not provide social cards, and those that do not support our use of small multiples. I reviewed tools that do seem to have some problems generating social cards from Archive-It mementos, and provide no API with which to address the issues.
I finally came down to three tools that may serve as replacements for Storify, with varying degrees of capability. The collections housing the same story derived from Archive-It collection 3649 are here:

Twitter does not appear to make social cards for Archive-It mementos, and hence passes this issue on to Twitter Moments. In this case, Twitter requires that the DSA supply more information than just a URI to create social cards and hence is a poor choice to replace Storify. Facebook and Tumblr do create social cards for most URIs and provide an API that can be used to augment these cards. These tools have 1.871 billion and 550 million users, respectively. Because of this familiarity, they also satisfy one of the other core requirements of the DSA: an interface that people already know how to use.

-- Shawn M. Jones

Acknowledgements: A special thanks to the folks at Flockler for extending my trial, Curata for producing so much trade literature on curation, and Sarah Zickelbrach at Cision, Jeffery at, and Chase Schlachter from Spredfast for answering my questions and helping me to understand the space where some of these tools live.