Monday, August 14, 2017

2017-08-14: Introducing Web Archiving and Docker to Summer Workshop Interns


Last Wednesday, August 9, 2017, I was invited to give a talk to some summer interns of the Computer Science Department at Old Dominion University. Every summer our department invites undergraduate students from India and hosts them for about a month as summer interns, working on projects in one of our research labs. During this period, various research groups introduce their work to the interns to encourage them to become potential graduate applicants. The interns also act as academic ambassadors who motivate their colleagues back in India to pursue higher studies.

This year, Mr. Ajay Gupta invited a group of 20 students from Acharya Institute of Technology and B.N.M. Institute of Technology and supervised them during their stay at Old Dominion University. Like last year, I was again selected from the Web Science and Digital Libraries Research Group to introduce them to the concept of web archiving and the various research projects of our lab. An overview of the talk can be found in my last year's post.



Recently, I was selected as the Docker Campus Ambassador for ODU. I thought it would be a great opportunity to introduce the interns to the concept of software containerization. Among numerous other benefits, it would help them deal with the "works on my machine" problem (also known as the "magic laptop" problem), common in student life.


After finishing the web archiving talk, I briefly introduced them to the basic concepts and building blocks of Docker. Then I illustrated the process of containerization with a very simple example.
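The example itself is not preserved in this post, but a containerization walkthrough of this kind often starts from a minimal Dockerfile; the base image tag and script name below are hypothetical stand-ins:

```dockerfile
# Start from the official lightweight Python base image
FROM python:3.6-alpine

# Copy a simple script into the image
COPY hello.py /app/hello.py

# Run the script whenever a container is started from this image
CMD ["python", "/app/hello.py"]
```

Building the image with `docker build -t hello-demo .` and running it with `docker run hello-demo` then behaves the same on any machine with Docker installed, which is precisely the cure for the "works on my machine" problem mentioned above.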



I encouraged them to interrupt me during the talk to ask any relevant questions, as both topics were fairly new to them. Additionally, I tried to bring in references from Indian culture, politics, and cinema to make it more engaging for them. Overall, I was very happy with the kind of questions they were asking, which gave me the confidence that they were actually absorbing these new concepts and not asking questions just for the sake of grabbing some swag, which included T-shirts and stickers from Docker and Memento.


--
Sawood Alam

Friday, August 11, 2017

2017-08-11: Where Can We Post Stories Summarizing Web Archive Collections?



A social card generated by Facebook for my previous blog post.
Rich links, snippet, social snippet, social media card, Twitter card, embedded representation, rich object, social card. These visualizations of web objects now pervade our existence on and off of the Web. The concept has been used to render web documents as results in academic research projects, like in Omar Alonso's "What's Happening and What Happened: Searching the Social Web". oEmbed is a standard for producing rich embedded representations of web objects for a variety of consuming services. Google experiments with using richer objects in their search results, even including images and other content from pages. Facebook, Twitter, Tumblr, Storify, and other tools use these cards. They have become so ubiquitous that services that do not produce these cards, like Google Hangouts, seem antiquated. These cards also no longer just sit within the confines of the web browser, being used in Apple's iMessage application since the release of iOS 10, as shown below. For simplicity, I will use the term social card for the rest of this post.
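To make the oEmbed flow concrete, here is a sketch in Python of the consumer side: the consumer asks a provider's endpoint for a JSON description of a web object and renders the returned `html` field. The endpoint below is Twitter's published oEmbed endpoint; the tweet URL is a placeholder.

```python
from urllib.parse import urlencode

def oembed_request_url(endpoint, target_uri, maxwidth=550):
    """Build an oEmbed consumer request URL for a provider endpoint.

    The provider answers with a JSON document (type, title, html, ...)
    that the consumer renders as an embedded representation of the object.
    """
    query = urlencode({
        "url": target_uri,     # the web object to be embedded
        "format": "json",      # oEmbed also permits XML responses
        "maxwidth": maxwidth,  # optional hint for the rendered width
    })
    return f"{endpoint}?{query}"

# A hypothetical tweet URL, embedded via Twitter's oEmbed endpoint
url = oembed_request_url("https://publish.twitter.com/oembed",
                         "https://twitter.com/someuser/status/1234567890")
```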

Apple's iOS iMessage app also generates social cards. This example also shows a card linking to my previous blog post.
Why use these cards? Why not just allow applications to copy and paste links as plaintext URIs? For many end users, URIs are unwieldy. Consider the URI below. Even though copying and pasting mitigates many of the issues with having to type this URI, it is still quite long. There is also very little information in the URI indicating to what document it will lead the end user.

https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582

Now consider the following social card from Facebook for this same URI. The card tells the user that it is from Google Maps and contains directions from Old Dominion University to Los Alamos National Laboratory. Most importantly, it does not require that the user know any details about how Google Maps constructs its URIs.

A social card on Facebook generated from a Google Maps URI that represents a document providing directions from Old Dominion University to Los Alamos National Laboratory.
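Services like Facebook build these cards from metadata embedded in the page itself, most commonly Open Graph `<meta>` tags. The sketch below shows how a consumer might extract them with Python's standard-library HTML parser; the markup fragment is invented for illustration, and the real Google Maps page's metadata will differ.

```python
from html.parser import HTMLParser

class CardMetaParser(HTMLParser):
    """Collect Open Graph <meta property="og:..." content="..."> tags,
    the metadata card generators typically use to build social cards."""
    def __init__(self):
        super().__init__()
        self.card = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property") or ""
        if prop.startswith("og:") and "content" in a:
            self.card[prop] = a["content"]

# An invented page fragment carrying Open Graph metadata
page = """<html><head>
<meta property="og:title" content="Directions from ODU to LANL">
<meta property="og:site_name" content="Google Maps">
<meta property="og:image" content="https://maps.example/thumb.png">
</head><body></body></html>"""

parser = CardMetaParser()
parser.feed(page)
# parser.card now maps og:title, og:site_name, og:image to their values
```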

In effect, social cards are visualizations of web objects, piercing the veil created by the opaqueness of a URI. Thanks to social cards, the end user gets some information about the content behind a URI before clicking on it, preventing them from visiting a site they may not have the time or bandwidth for. In her Dark and Stormy Archives (DSA) work, Yasmin AlNoamany uses social cards in Storify stories to summarize mementos from Archive-It collections. These stories take the form of 28 high-quality mementos represented by social cards ordered by publication date. The screenshot below shows the Storify story containing links generated by the DSA for Archive-It collection 3649 about the Boston Marathon Bombing in 2013.

The Dark and Stormy Archives (DSA) application summarizes Archive-It collections as a collection of 28 well-chosen, high-quality mementos that are ordered by publication date and then visualized as social cards in Storify. This screenshot shows the Storify output of the DSA for Archive-It collection 3649 about the Boston Marathon Bombing in 2013.
A visualization requiring increased cognitive load demands more effort from the end user and, in some cases, hinders performance. Earlier attempts at visualizing Archive-It collections by Padia and others required training in how to use each visualization, and their complexity may have increased the cognitive load on the end user. A well-chosen, reduced set of links visualized as social cards works better than other visualizations that attempt to summarize web archives because of the low cognitive load required of the consumer. Each social card is a visualization in itself; hence a collection of social cards becomes an instance of the visualization technique of small multiples.
Small multiples were initially categorized in 1983 by Tufte in his Visual Display of Quantitative Information, but the technique is present as far back as Eadweard Muybridge's Horse in Motion from 1886. Small multiples allow the user to compare the same attributes across different sets of data. Consider the line graphs in the example below. Each details the expenses of a different department in an organization during the period from July to December. Note how the same x-axis on each graph allows the viewer to compare expenses over time between departments. The key is that each visualization places the same data in the same spatial region, allowing for easy comparison.
An example of small multiples. Source: Wikipedia.


Each social card is a data item consisting of multiple attributes. The same attribute for each item is presented in the same spatial region of a given card. This allows the user to scan the list of cards for a given attribute, such as title, without being overwhelmed by the values of the rest of the attributes present. This consistency makes it easy to compare each card in the set. Below is a diagram of a given Storify card with annotations detailing its attributes. This becomes an effective storytelling method for events because users can see the cards in the order that their respective content was written.
Storify cards consist of multiple attributes that are visualized in the same spatial region on each card. This card exists for the live link http://www.history.com/topics/boston-marathon-bombings.
AlNoamany uses Storify in this way, but how well might other tools work for visualizing the output of the DSA? Can they serve as a replacement for Storify?

This post is a re-examination of the landscape since AlNoamany's dissertation to see if there are tools other than Storify that the DSA can use. It covers tools living in the spaces of content curation, storytelling, and social media. AlNoamany's dissertation lists several that fit into different categories, and understanding these categories led to the discovery of more tools. The tools discussed in this post come from three sources: AlNoamany's dissertation, "Content Curation Tools: The Ultimate List" by Curata, and "40 Social Media Curation Sites and Tools" by Shirley Williams. Curation takes many forms for many different reasons, but not all of them are suitable for the DSA framework. After this journey, I settle on four tools -- Facebook, Storify, Tumblr, and Twitter Moments -- that might be useful contenders.

For some tools, in order to test how well they generated social cards and collections for mementos, I used the URIs from the Boston Marathon Bombing 2013 stories 3649spst0s and 3649spst1s generated as part of AlNoamany's DSA work. If I needed to contrast them with live web examples, I used the URI http://www.history.com/topics/boston-marathon-bombings.

Engaging With Customers


A number of tools exist for the purpose of customer engagement. They provide the ability to curate content from the web with the goal of increasing confidence in a brand.


With Falcon.io, collections can be shared internally so that teams can review them in order to craft a message. It allows an organization to curate its own content and coordinate a single message across multiple social channels. It provides social monitoring and analysis of the impact of that message. Organizations use their curated content to develop plans for addressing trends, dealing with crises (e.g., the recent Pepsi commercial fiasco), and ensuring that customers know the company is a key player in the market (e.g., IBM's Big Data & Analytics Hub). Cision, FlashIssue, Folloze, Spredfast, Sharpr, Sprinklr, and Trap!t are tools with a similar focus. I requested demos and discussions about these tools with their corresponding companies, but only received feedback from Falcon.io and Spredfast, who were instrumental in helping me understand this space.

Roojoom, Curata, SimilarWeb, and Waywire Enterprises focus more on helping influence the development of the corporate web site with curated content. RockTheDeadLine offers to curate content on the organization's behalf. CurationSuite (formerly CurationTraffic) focuses on providing a curated collection as a component of a WordPress blog. These services go one step further and provide site integration components in addition to mere content curation. Curata has a lot of documentation and several whitepapers that helped me understand the reasons for these tools.

Hootsuite, Pluggio, PostPlanner, and SproutSocial focus on collecting and organizing content and responses from social media. They do not provide collections for public consumption in the same way Storify or a Facebook album would. Hootsuite in particular provides a way to gather content from many different social networking accounts at once while synchronizing outgoing posts across all of them.

All of these tools offer analytics packages that permit the organization to see how the produced newsletter or web content is performing. Though these tools do focus on curating content, their focus is customer engagement and marketing. Most of these tools focus on trends and web content in aggregate rather than showcasing individual web resources.

Our focus in this project is to find new ways of visualizing and summarizing Archive-It collections. Though some of these tools might be capable of doing this, their cost and unused functionality make them a poor fit for our purpose.

Focusing on the Present

Some tools allow the user to supply a topic as a seed for curated content. The tool will then use that topic and its own internal curation service to locate content that may be useful to the user. A good example is a local newspaper. A resident of Santa Fe, for example, will likely want to know what content is relevant to their city, and hence would be better served by the curation services of the Santa Fe New Mexican than they would by the Seattle Times. The newspaper changes every day, but the content reflects the local area. 
Paper.li presents a different collection each day based on the user's keywords. I created "The Science Daily", which changes every day. The content for June 4, 2017 (left) is different from the content for June 5, 2017 (right).
This category of curation tools is not limited by geographic location. The input to the system is a set of search terms representing a topic. Paper.li and UpContent allow one to create a personalized newspaper about a given topic that changes each day, providing fresh content to the user. ContentGems is much the same, but supports a complex workflow system that can be edited to supply content from multiple sources. ContentGems also allows one to share their generated paper via email, Twitter, RSS feeds, website widgets, IFTTT, Zapier, and a whole host of other services. DrumUp uses a variety of sources from the general web and social media to generate topic-specific collections. It also allows the user to schedule social media posts to Facebook, Twitter, and LinkedIn. Where Paper.li appears to be focused on a single user, ContentGems and DrumUp easily stretch into customer engagement, and UpContent offers different capabilities depending on the tier to which the user has subscribed.
(left) The Tweeted Times shows some of the tweets from the author's Twitter feed.
(right) Tagboard requires that a user supply a hashtag as input before creating a collection. 
The Tweeted Times and Tagboard both focus on content from social media. The Tweeted Times attempts to summarize a user's Twitter feed and publishes that summary at a URI for the end user to consume. Tagboard uses hashtags from Facebook or Twitter as seeds to its content curation system.

The tools in this section focus on content from the present. They do not allow a user to supply a list of URIs to be stored in a collection, and hence are not suitable for inclusion in the Dark and Stormy Archives.

Sharing and the Lack of Social Cards


There is a spectrum of sharing. Storify allows one to share their collection publicly for all to see. Other tools expect only subscribed accounts to view their collections. In these cases, subscribed accounts may be acquired for free or at cost. Feedly supports sharing of collections only for other users in one's team, a grouping of users that can view each other's content. Pinboard and Pocket are slightly less restrictive, permitting other portal users to view their content. In addition, both Pinboard and Pocket promise paying customers the ability to archive their saved web resources for later viewing. Shareist only shares content via email and on social media, not producing a web-based visualization of a collection. We are interested in tools that allow us to not only share collections of mementos on the web, but also share them with as few barriers to viewing as possible.

Huzzaz and Vidinterest only support URIs of web resources that contain video. Both support YouTube and Vimeo URIs, but only Vidinterest supports Dailymotion. Neither supports general URIs, let alone URI-Ms from Archive-It. Instagram and Flickr work specifically with images, and they do not create social cards for URIs. Sutori allows one to curate URIs, but does not create social cards. Even though Twitter may render a social card in a tweet, the card is not present when the tweets are visualized in a collection using Togetter.

A screenshot of a Tweet containing a social card for http://www.history.com/topics/boston-marathon-bombings.
A screenshot of a Togetter collection of live links containing the Tweet from above as the fourth in the collection. Note that none of these URIs show a social card, even the Tweet that renders one on Twitter itself.
This screenshot shows a live link http://www.history.com/topics/boston-marathon-bombings inserted into a Sutori story, with no social card.

A test post in Instagram where I attempted to add several URIs as comments, including the URI http://www.history.com/topics/boston-marathon-bombings used in the Twitter example above. Instagram produced no social cards for these URIs and did not make them links either.

Card Size Matters


Some tools change the size of a card for effect, or to allow extra data in one card rather than another. These size changes interrupt the visual flow of the small multiples paradigm I mentioned in the introduction. While good for newspapers or other tools that collect articles, such size changes make it difficult to follow the flow of events in a story. They create additional cognitive load on the user, forcing her to constantly ask "does this different-sized card come before or after the other cards in my view?" and "how does this card fit into the story timeline?"

Flipboard


Flipboard orders the social cards from left to right then up and down, but changes the size of some of the cards.

Flipboard often makes the first social card the largest, dominating the collection as seen in the screenshot above. Sometimes it will choose another card in the collection and increase its size as well. Flipboard also has other issues. In the screenshot below, we see a social card rendered for a live link, but in the screenshot below that we see that Flipboard does not do so well with mementos.
A social card generated in Flipboard for the live URI http://www.history.com/topics/boston-marathon-bombings.
A screenshot of a collection of mementos about the Boston Marathon Bombing stored in Flipboard.

Scoop.it

In this Scoop.it collection, Scoop.it changes the size of some social cards based on the amount of data present in the card.
Scoop.it changes the size of some social cards due to the presence of large images or extra text in the snippet. These changes distort the visual flow of the collection. There are also restrictions, even for paying users, on the amount of content that can be stored, with even a top subscription of $33 per month being limited to only 15 collections.

Flockler

Flockler alters the sizes of some cards based on the information present. Note: because I only had a trial account, this Flockler collection may no longer be accessible.
Flockler alters the size of its cards based on the information present. Cards with images, titles, and snippets are larger than those with just text. As shown below, sometimes Flockler cannot extract any information and generates empty cards or cards whose title is the URI.


A screenshot of social cards generated from Archive-It mementos in a Flockler collection about the Boston Marathon Bombing. The one on top just displays the link while the one in the middle is empty. Links to mementos: top, middle, bottom.

Pinterest


The same mementos visualized in social cards in this Pinterest collection. Pinterest supports collections, but does not typically generate social cards, favoring images.

Pinterest has a distinct focus on images, but does create social cards (i.e., "pins" in the Pinterest nomenclature) for web resources. The system requires a user to select an image for each pin. Interestingly, the first image presented when a user is generating a pin is often the same one selected by Storify when it generates social cards. Unfortunately, the images are all different sizes, making it difficult to follow the sequence of events in the story.

In addition to the size issue, if Pinterest cannot find an image in a page or if the image is too small, it will not create a social card. It could not find any images for URI-M http://wayback.archive-it.org/3649/20140404170835/https://sites.tufts.edu/museumstudents/2013/06/27/help-create-the-boston-marathon-bombing-archive/ and all images for http://wayback.archive-it.org/3649/20130422044829/https://twitter.com/LadieAuPair/status/325365298196795394/ were too small.

If an image is too small, Pinterest will issue an error and refuse to post the link.
Pinterest also presents another problem. During the processing of some social cards, Pinterest converts the URI-M into a URI-R. For example, in the screenshot above we see that one of the social cards bears the domain name "wayback.archive-it.org", but clicking on the card leads one to a card for "newyorker.com".

Juxtapost


As seen in this collection, Juxtapost changes the size of social cards and even moves them out of the way for advertisements (top right text says "--Advertisement--"). Which direction does the story flow?

Juxtapost is another tool that changes the size of social cards. In addition, it requires that the end user select an image and insert a description for every card. Even setting aside the changing card sizes, this manual labor alone may make it unsuitable for use in the DSA.

Juxtapost also refuses to add a resource (e.g., http://wayback.archive-it.org/3649/20140408194419/http://www.boston.com/yourtown/news/watertown/2014/01/digital_archive_exhibit_on_marathon_bombing_to_visit_waterto.html) for which it can find no images.

Google+


Google+ collection for the Boston Marathon Bombing viewed with a window size of 2033 x 1254.
The same Google+ collection viewed with a window size of 1080 x 1263.


The same Google+ collection viewed in a window resized to 945 x 1265.
As shown in the screenshots above, the direction and size of the cards in a Google+ collection change depending on the resolution used to view the collection. This is likely a result of adjusting the page for mobile screen sizes. Although Google+ had no problems generating cards for all of our test mementos, the first figure above does not indicate well in which direction the events in the story unfolded, and thus this information is lost in Google+.


Problems That APIs Might Solve

Of course, the Dark and Stormy Archives software generates its visualization automatically. This makes the use of a web API quite important for the tool. The DSA generates 28 links per Archive-It collection. Would it be acceptable for a human to submit these links to one of these tools much like I have done? What if the collection changes frequently and the DSA must be rerun to account for these changes?

In addition to freeing humans from creating stories, AlNoamany was able to use the Storify API to assist Storify in developing richer social cards, adding dates and favicons to override and improve upon the information that Storify extracted from mementos. The human interface for Storify also had some problems creating cards for mementos, and these problems could be overcome by using the Storify API.

Pearltrees has no API.  I could not find APIs for Symbaloo, eLink, ChannelKit, or BagTheWeb. Listly has an API, but it is not public.

BagTheWeb requires additional information supplied by the user in order to create a social card. As seen below, BagTheWeb does not generate any social card data based solely on the URI. If there were an API, the DSA platform might be able to address some of these shortcomings. Symbaloo is much the same. It chooses an image, but often favors the favicon over an image selected from the article.

This is a screenshot of a social card created by BagTheWeb for http://www.history.com/topics/boston-marathon-bombings.
A screenshot of a card created by Symbaloo for the same URI.
Pearltrees has problems that may be addressed by an API that allows the user to specify information. The example screenshot below displays a Firefox error instead of a selected image in the social card. This is especially surprising because the system was able to extract the title from the destination URI. Pearltrees also tends to convert URI-Ms to URI-Rs, linking to the live page instead of the archived Archive-It page.

A screenshot of two social cards created from Archive-It mementos by Pearltrees in a collection about the Boston Marathon. The one on the left displays a Firefox error instead of a selected image for the memento. Links to mementos: left, right.
The social cards generated by eLink have a selected image, a title, and a text snippet. Sometimes, however, they do not seem to find the image, as seen in the screenshot below. Scoop.it also has similar problems for some URIs, also shown below. An API call that allows one to select an image for the card would help improve this tool's shortcomings.

A screenshot of two social cards generated from Archive-It mementos from an eLink collection about the Boston Marathon Bombing. The one of the left shows a missing selected image while the one on the right displays fine. Links to mementos: left, right.
ChannelKit usually generates nice social cards, complete with a title, text snippet, and a selected image or web page thumbnail. Sometimes, as shown below, the resulting card contains no information and a human must intervene. Listly also has issues with some of the links submitted to it. It usually generates a title, text snippet, and selected image, but in some cases, as shown below, just lists the URI. Flockler also has similar problems, shown below. An API call that allows one to supply the missing information would be helpful in addressing these issues.

A screenshot of the social cards generated from Archive-It mementos in a ChannelKit collection about the Boston Marathon Bombing. The one on the right shows no useful information. Links to mementos: left, right.
A screenshot of social cards generated from Archive-It mementos in a Listly collection about the Boston Marathon Bombing. The one on the top has no information but the URI. The one on the bottom contains a title, selected image, and snippet. Links to mementos: top, bottom.


Curation Tools Useful for Visualization of Archive-It Collections

The final four tools have APIs, produce social cards, and allow for collections. I decided to review these tools in more detail using the mementos generated by the DSA tool against Archive-It collection 3649 about the Boston Marathon Bombing in 2013, corresponding to this Storify story. I created these collections by hand and did not use the associated APIs. Storify is already in use in the DSA, and hence I did not review it again here.

In this section I discuss these tools and their shortcomings. I also discuss how the DSA might be able to overcome some of those shortcomings with each tool's API.

Facebook


Selected mementos from a run of the Dark and Stormy Archives tool on Archive-It collection 3649 about the Boston Marathon Bombing as visualized in social cards in Facebook comments where the collection is stored as a Facebook post.

With 1.871 billion users, Facebook is the most popular social media tool. Facebook supports social cards in posts and comments. Facebook also supports creating albums of photos, but not of posts. Posts contain comments, however. In order to generate a series of social cards in a collection, I gave a post the title of the collection and supplied each URI-M in the story in a separate comment. In this way, I generated a story much like AlNoamany had done with Storify.

A screenshot of two Facebook comments. The URI-M below generated a social card, but the URI-M above did not.
As seen above, Facebook does occasionally fail to generate social cards for links. The Facebook API could be used to update such comments with a photo and a snippet, if necessary. Providing additional images is not possible, as Facebook posts and comments will not generate a social card if the post/comment already has an image.
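As an illustration of that repair step, the Graph API's comments edge takes a `message` and (per Facebook's documentation) an optional `attachment_url` naming the page to attach. The sketch below only builds the request rather than sending it; the post ID, access token, and API version are placeholders.

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com/v2.10"  # placeholder API version

def comment_request(post_id, access_token, message, attachment_url=None):
    """Build the Graph API request that adds one story link as a comment.

    Supplying attachment_url asks Facebook to attach (and card-render)
    the linked page; message carries any text the DSA wants to add.
    """
    params = {"message": message, "access_token": access_token}
    if attachment_url:
        params["attachment_url"] = attachment_url
    return f"{GRAPH}/{post_id}/comments", urlencode(params)

endpoint, body = comment_request(
    "1234567890",  # hypothetical post ID
    "EAAB...",     # hypothetical access token
    "Boston Marathon Bombings",
    "http://www.history.com/topics/boston-marathon-bombings")
```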

Tumblr

The same mementos visualized in social cards in Tumblr where the collection is denoted by a hashtag.
Weighing in with 550 million users is Tumblr. Tumblr is a complex social media tool supporting many different types of posts. A user selects which type of post they desire and then supplies the necessary data or metadata. For example, if a user wanted to generate something like a Facebook post or a Twitter tweet, they would choose "Text". The interface for selecting a type of post is shown below.

This screenshot shows the interface used by a user when they wish to post to Tumblr. It shows the different types of posts possible with the tool.
The post type "Link" produces a social card for the supplied link. In addition to the social card generated by Tumblr, the "Link" post can also be adorned with an additional photo, video, or text.

All of these post types are available as part of the Tumblr API. If a social card lacks an image, or if the DSA wants to supply additional text, the post can be updated appropriately.
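A sketch of what such a post looks like through the Tumblr API: a "link"-type post sent to the blog's post endpoint, with the card built from `url` and the collection denoted by a tag. The blog identifier below is a placeholder, and the OAuth 1.0a signing the API requires is omitted.

```python
def tumblr_link_post(blog, uri_m, title=None, tags=()):
    """Build endpoint and parameters for a Tumblr API v2 "link" post.

    Tumblr generates the social card from `url`; `title` lets the DSA
    override the card title, and `tags` file the post under the
    collection's hashtag.
    """
    params = {"type": "link", "url": uri_m}
    if title:
        params["title"] = title
    if tags:
        params["tags"] = ",".join(tags)  # Tumblr takes a comma-separated list
    return f"https://api.tumblr.com/v2/blog/{blog}/post", params

endpoint, params = tumblr_link_post(
    "dsa-demo.tumblr.com",  # hypothetical blog identifier
    "http://www.history.com/topics/boston-marathon-bombings",
    title="Boston Marathon Bombings",
    tags=("bostonmarathonbombing",))
```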

I use hashtags to create collections on Tumblr. The hashtags are confined to a specific blog controlled by that blog's user, hence posts outside of the blog do not intrude into the collection, as would happen with hashtags on Twitter or Facebook.

Twitter Moments


This Twitter Moment contains tweets that contain the URI-Ms from our Dark and Stormy Archives summary.

Twitter has 317 million users worldwide. While all tools required that the user title the collection in some way, Twitter Moments requires that the user upload an image separately in order to create a collection. This image serves as the striking image for the collection. The user is also compelled to supply a description.

Sadly, much like Flipboard, Twitter does not appear to generate social cards for URI-Ms from Archive-It. Shown below in a Twitter Moment, the individual URI-Ms are displayed in their tweets with no additional visualization.
Unfortunately, as we see in the same Twitter Moment, tweets do not render social cards for our Archive-It URI-Ms.
DSA could use the Twitter API to add images and additional text (up to 140 characters of course) to supplement these tweets. At that point, the DSA is building its own social cards out of tweets.
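A sketch of the length arithmetic involved, under 2017's rules in which Twitter wrapped every link with t.co at a fixed 23 characters regardless of the link's actual length; everything beyond those two constants is illustration:

```python
TCO_LENGTH = 23  # t.co wraps every link to 23 characters (2017 rules)

def compose_tweet(text, uri_m, limit=140):
    """Trim the descriptive text so that text + wrapped link fit in one tweet."""
    room = limit - TCO_LENGTH - 1  # one space between text and link
    if len(text) > room:
        text = text[:room - 1].rstrip() + "\u2026"  # ellipsis marks the trim
    return f"{text} {uri_m}"

tweet = compose_tweet("A" * 200,
                      "http://www.history.com/topics/boston-marathon-bombings")
```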

Other Paradigms


In this post, I tried to find tools that could replace Storify as it currently exists, but what about different paradigms of storytelling? The point of the DSA framework is to visualize an Archive-It collection. Other visualization techniques could make use of the tools I have discarded in this list. For example, Instagram has been used successfully by activist organizations and government entities as a storytelling tool. It is also being actively used by journalists. Even though it works primarily through photos, is there some way we can use it for storytelling as these organizations have been doing? What other paradigms can we explore for storytelling?

Summary


Considering how Storify is used in the Dark and Stormy Archives framework took me on a long ride through the world of online curation. I read about tools that are used purely for customer engagement, those that live in the perpetual now, those that do not provide public sharing, those that do not provide social cards, and those that do not support our use of small multiples. I reviewed tools that do seem to have some problems generating social cards from Archive-It mementos, and provide no API with which to address the issues.
I finally came down to three tools that may serve as replacements for Storify, with varying degrees of capability. The collections housing the same story derived from Archive-It collection 3649 are here:


Twitter does not appear to make social cards for Archive-It mementos, and hence passes this issue on to Twitter Moments. In this case, Twitter requires that the DSA supply more information than just a URI to create social cards, and hence is a poor choice to replace Storify. Facebook and Tumblr do create social cards for most URIs and provide APIs that can be used to augment these cards. These tools have 1.871 billion and 550 million users, respectively, so they also satisfy one of the other core requirements of the DSA: an interface that people already know how to use.

-- Shawn M. Jones

Acknowledgements: A special thanks to the folks at Flockler for extending my trial, Curata for producing so much trade literature on curation, and Sarah Zickelbrach at Cision, Jeffery at Falcon.io, and Chase Schlachter from Spredfast for answering my questions and helping me to understand the space where some of these tools live.

Monday, August 7, 2017

2017-08-07: rel="canonical" does not mean what you think it means

The rel="identifier" draft has been submitted to the IETF.  Some of the feedback we've received via Twitter and email has been variations of 'why don't you use rel="canonical" to link to the DOI?'  We discussed this in our original blog post about rel="identifier", but in fairness that post covered a great deal of ground and, through updates and comments, has become quite lengthy. 

The short answer is that rel="canonical" handles cases where there are two or more URIs for a single resource (AKA "URI aliases"), whereas  rel="identifier" specifies relationships between multiple resources.

Having two or more URIs for the same resource is also known as "DUST: different URLs, similar text".  This is commonplace with SEO and catalogs (see the 2009 Google blog post and help center article about rel="canonical").  RFC 6596 gives abstract examples, but below we will examine real world examples (only one of which I'm fully prepared to buy).
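To make the aliasing concrete, here is a small shell sketch that extracts the canonical URI from a canned link element (a stand-in for a live fetch, so the example is self-contained; the URI is taken from the Amazon example below):

```shell
# A canned rel="canonical" link element, standing in for a live page fetch.
html='<link rel="canonical" href="https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" />'

# Extract the canonical URI; a crawler would index this
# instead of whatever alias it happened to fetch.
canonical=$(printf '%s\n' "$html" | sed -n 's/.*rel="canonical" href="\([^"]*\)".*/\1/p')
echo "$canonical"
```

However many aliases a crawler encounters, they all collapse to this one URI for indexing purposes.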

Consider the two lexicographically different URIs for the same resource (in this case, Amazon's page for DJ Shadow's upcoming EP "The Mountain Has Fallen"):
  1. https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow
  2. https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q
The first URI is what I got when I searched amazon.com for "dj shadow" and clicked on a search result.  The second URI is the "canonical" version that should be indexed by Google et al.  The first URI uses an HTML <link> element to inform search engines about the second URI so they know they haven't found two different resources with two different URIs:

$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow" | grep -i canonical
<link rel="canonical" href="https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" />


We can see that the raw HTML is not exactly the same (which would be trivial for the search engines to dedup), but we can see the rendered HTML is essentially the same, with the exception of the navigation trail ("‹ Back to search results for "dj shadow"") vs. the categorization ("CDs & Vinyl › Dance & Electronic › Electronica") on the left-hand side, right above the EP artwork:

$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow" | wc
   12711   16648  446841


$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" | wc
   12802   17120  459761




It is clear there is no need for a search engine to index both pages.  The raw HTML is nearly (but not exactly!) the same, and unless your crawler is aware of amazon.com URI patterns, it would not easily discover that the two URIs refer to the same resource.  We can construct a similar example with ebay.com: again, the raw HTML differs slightly, but in this case I cannot tell a difference in the rendered HTML:

$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr" | fmt | grep --context canonical | tail -3
    hreflang="es-ni" /><link rel="canonical"
    href="http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713"
    /><lmeta Property="og:image"


$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713" | wc
    2678    9225  189098


$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr" | wc
    2688    9246  189235




So why can't we use rel="canonical" for, say, DOIs and publisher pages?  In the case of DOIs, a technical reason is that the resource identified by the DOI and the resource identified by the publisher's page are not the same resource.  Admittedly this is a detour into the esoteric realm of HTTP 303 semantics, but the HTTP URI of a DOI does not have a representation and the publisher's URI does; the resources identified by these URIs are related but fundamentally different.

Another reason would be when you wish to specify part-whole relationships between resources that comprise the resource identified by a DOI.  For example, XML vs. HTML, Zip file(s) of associated code and data, embedded (and "recontextualizable"!) images, sound, or video, etc.  This would be for the purpose of expressing identity, and would not preclude combination with navigation (e.g., rel="up") or SEO links (e.g., rel="canonical"). These identification patterns are presented in more detail at the Signposting web site.

Another argument against using rel="canonical" for linking to DOIs (and friends) is that publishers are already using canonical to manage SEO within their own domains.  In the example below, springer.com signals to search engines that the URI reached via the third redirect from the DOI is canonical, and not the two intermediate URIs:

$ curl -iL --silent http://dx.doi.org/10.1007/978-3-319-43997-6_35 | egrep -i "(HTTP/1.1 [0-9]|^location:|rel=.canonical)"
HTTP/1.1 303 See Other
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 301 Moved Permanently
Location: https://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 302 Found
Location: https://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
HTTP/1.1 200 OK
        <link rel="canonical" href="https://link.springer.com/chapter/10.1007/978-3-319-43997-6_35"/>


Furthermore, publishers are specifying DOIs with a variety of incompatible ad hoc approaches (see the prior blog post for examples), meaning there is demand for this function even though there is currently no standardized method of achieving it.

But there are other applications for rel="identifier" outside of scholarly content.   Consider the Wikipedia page for DJ Shadow.  As I type this, it has not yet been edited to include the upcoming EP mentioned above, but there's a good chance that by the time you read this that will have changed.


I can reference the particular version of the page using the "permalink", which yields the URI https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397.   That page will remain static, and never mention "The Mountain Has Fallen".  That page does use rel="canonical" to link back to the generic, current version of the page:

$ curl --silent -i "https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>


Which is entirely expected and desirable: we don't want Google to separately index the thousands of prior versions of this page, just the latest version.  The generic version of the page also asserts that it is canonical:

$ curl --silent -i "https://en.wikipedia.org/wiki/DJ_Shadow" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>

But if I were using a reference manager to cite https://en.wikipedia.org/wiki/DJ_Shadow, and if that page also had:

<link rel="identifier" href="https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397"/>

Then the reference manager would cite the specific version of the page, providing a machine-readable version of the human-readable guidance already provided under the "Cite This Page" link.  This use of rel="identifier" would not collide with the rel="canonical" which is already in place for SEO*.  In this Wikipedia example, the two rels coexist and specify URI preferences for different purposes:
  • rel="canonical": preferred for content indexing
  • rel="identifier": preferred for referencing
Herbert insisted on a New Mexico-specific example, so we'll consider the ubiquitous multi-page articles, designed to expand content to increase advertising revenue.  Of interest to us is page 5 of this particular article about TV continuity errors: http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5.  It uses rel="canonical" to inform search engines to strip off any common, superfluous arguments that might also be present (e.g., "&utm_source=...&utm_medium=...&utm_campaign=..."):

$ curl -i --silent "http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5" | grep canonical
<link rel="canonical" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5" />

Assuming for a moment that coolimba.com wanted to facilitate referencing of this page as part of an aggregation, it could include:

<link rel="identifier up" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/" />

In this case, rel="up" also serves as a simple navigation function, if you choose to view these pages as a tree and not a list (if this is indeed a list, then "up" is probably not applicable).  But note that rel="up" would not be applicable in the Wikipedia (or even DOI) example(s) above.  Also note that rel="up" and rel="identifier" sharing the same URI is something of a coincidence: if a multi-page article has more than two "levels" then we would expect the URIs to diverge.
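Since the HTML rel attribute is a space-separated list of link relation types, the combined rel="identifier up" above is a single link element carrying two relations, and a consumer can split them apart (the coolimba.com link element is the hypothetical one from the example above):

```shell
# The hypothetical coolimba.com link element carrying two relation types.
link='<link rel="identifier up" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/" />'

# rel is a space-separated token list; pull it out and enumerate the tokens.
rels=$(printf '%s\n' "$link" | sed -n 's/.*rel="\([^"]*\)".*/\1/p')
for r in $rels; do echo "relation: $r"; done
```

A consumer interested only in referencing would act on the "identifier" token and ignore "up", and a navigation-oriented consumer would do the reverse.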

In conclusion, SEO/indexing and referencing are different functions and thus require different rel types; cases where the target URIs overlap should be considered coincidences.  rel="canonical" is used to collapse multiple URIs that yield duplicative text into a single, preferred URI to facilitate indexing, and rel="identifier" is used to select a single URI from among multiple URIs that yield different text to facilitate referencing. 


--Michael & Herbert


P.S. To return to our original pop culture reference: "have fun storming the castle!"


* Note that rel="permalink" and rel="bookmark" (the former was never registered and ultimately supplanted by the latter) do different things and are not usable in HTTP Link headers; see the prior blog post for details.

2017-08-09 edit: See also this Twitter moment about rel="bookmark".   I'll try to turn this into a separate blog post in the future.