Tuesday, September 19, 2017

2017-09-19: Carbon Dating the Web, version 4.0

With this release of Carbon Date there are new features being introduced to track testing and force python standard formatting conventions. This version is dubbed Carbon Date v4.0.

We've also decided to switch from MementoProxy and take advantage of the Memgator Aggregator tool built by Sawood Alam.

Of course with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow for our team to catch and address these issues quicker than before as explained below.

The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has changed its API to only allow 30 day trials for its API with 1000 requests per month unless someone wants to pay. We also discovered a few more use cases for the Pubdate extraction by applying Pubdate to the mementos retrieved from Memgator. By default, Memgator provides the Memento-Datetime retrieved from an archive's HTTP headers. However, news articles can contain metadata indicating the actual publication date or time. This gives our tool a more accurate time of an article's publication.

Whats New

With APIs changing over time it was decided we needed a proper way to test Carbon Date. To address this issue, we decided to use the popular Travis CI. Travis CI enables us to test our application every day using a cron job. Whenever an API changes, a piece of code breaks, or is styled in an unconventional way, we'll get a nice notification saying something has broken.

CarbonDate contains modules for getting dates for URIs from Google, Bing, Bitly and Memgator. Over time the code has had various styles and no sort of convention. To address this issue, we decided to conform all of our python code to pep8 formatting conventions.

We found that when using Google query strings to collect dates we would always get a date at midnight. This is simply because there is not timestamp, but rather a just year, month and day. This caused Carbon Date to always choose this as the lowest date. Therefore we've changed this to be the last second of the day instead of the first of the day. For example, the date '2017-07-04T00:00:00' becomes '2017-07-04T23:59:59' which allows a better precision for timestamp created.

We've also decided to change the JSON format to something more conventional. As shown below:

Other sources explored

It has been a long term goal to continuously find and add available sources to the Carbon Date application that bring offer a creation date. However, not all the sources we explore bring what we expect. Below there is a list of APIs and other sources that were tested but were unsuccessful in returning a URI creation date. We explored URL shortener APIs such as:
The bitly URL shortener still remains the best as the Bitly API allows a lookup of full URLs not just shortened ones.

How to use

Carbon Date is built on top of Python 3 (most machines have Python 2 by default). Therefore we recommend installing Carbon Date with Docker.

We do also host the server version here: http://cd.cs.odu.edu/. However, carbon dating is computationally intensive, the site can only hold 50 concurrent requests, and thus the web service should be used just for small tests as a courtesy to other users. If you have the need to Carbon Date a large number of URLs, you should install the application locally via Docker.


After installing docker you can do the following:

2013 Dataset explored

The Carbon Date application was originally built by Hany SalahEldeen, mentioned in his paper in 2013. In 2013 they created a dataset of 1200 URIs to test this application and it was considered the "gold standard dataset." It's now four years later and we decided to test that dataset again.

We found that the 2013 dataset had to be updated. The dataset originally contained URIs and actual creation dates collected from the WHOIS domain lookup, sitemaps, atom feeds and page scraping. When we ran the dataset through the Carbon Date application, we found Carbon Date successfully estimated 890 creation dates but 109 URIs had estimated dates older than their actual creation dates. This was due to the fact that various web archive sites found mementos with creation dates older than what the original sources provided or sitemaps might have taken updated page dates as original creation dates. Therefore, we've taken taken the oldest version of the archived URI and taken that as the actual creation date to test against.

We found that 628 of the 890 estimated creation dates matched the actual creation date, achieving a 70.56% accuracy - originally 32.78% when conducted by Hany SalahEldeen. Below you can see a polynomial curve to the second degree used to fit the real creation dates.


Q: I can't install Docker on my computer for various reason. How can I use this tool?
A: If you can't use Docker then my recommendation is to download the source code from Github and install using a python virtual environment and installing the dependencies with pip from there. 

Q: After x amount of requests Google doesn't give a date when I think it should. What's happening?
A: Google is very good at catching programs (robots) that aren't using their APIs. Carbon Dating is not using an API but rather doing a string query, like a browser would be, and then looking at the results. You might have hit a Captcha so Google might lock Carbon Date out for a while.

Q: I sent a simple website like http://apple.com to Carbon Date to check the date of creation, but it says it was not found in any archive. Why is that?
A: Websites like apple.com, cnn.com, google.com, etc., all have an exceedingly large number of mementos. The Memgator tool is searching for tens of thousands of mementos for these websites across multiple archiving websites. This request can take minutes which eventually leads to a timeout, which in turn means Carbon Date will return zero archives.

Q: I have another issue not listed here, where can I ask questions?
A: This project is open source on github. Just navigate to the issues tab on Github, start a new issue and ask away!

Carbon Date 4.0? What about 3.0?

With this being Carbon Date 4.0, that means there has been three blogs previously for this project! You can find them here:
-Grant Atkins

Wednesday, September 13, 2017

2017-09-13: Pagination Considered Harmful to Archiving

Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015

Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages which comprised the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (or 28%)  had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursadar and Shipman (2017).  Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews, single page writings) and surveyed regarding their expectation for later access to additional content that was linked from or appeared on the main page.  Specifically, they investigated (1) how relationships between page content affect expectations and (2) how perceptions of content value relate to internet resources. In other words, if I save the main page as a resource, what else should I expect to be saved along with it?

I experienced this paradox first hand when I attempted to locate an historical entry from the 2016 edition of the U.S. News Best Global University Rankings.  As shown in Figure 1, October 30, 2015 is a particular date of interest because on the day prior, a revision of the original ranking for the University at Buffalo-SUNY was reported. The university's ranking was revised due to incorrect data related to the number of PhD awards. A re-calculation of the ranking metrics resulted in a change of the university's ranking position from a tie at No. 344 to a tie at No. 181.

Figure 3 - Summary of U.S. News


Figure 4 - Capture of U.S. News Revision for Buffalo-SUNY

A search of the Internet Archive, Figure 3, shows the U.S. News web site was saved 669 times between October 28, 2014 and September 3, 2017. We should first note that regardless of the ranking year you choose to locate via a web search, U.S. News reuses the same URL from year to year. Therefore, an inquiry against the live web will always direct you to their most recent publication. As of September 3, 2017, the redirect would be to the 2017 edition of their ranking list. Next, as shown in Figure 2, the 2016 U.S. News ranking list consisted of 750 universities presented in groups of 10 spread across 75 web pages. Therefore, the revised entry for the University at Buffalo-SUNY at rank No. 181 should appear on page 19, Figure 4.

Page No.
Start Date
End Date
Table 1 - Page Captures of U.S. News (Abbreviated Listing)

While I could readily locate the main page of the 2016 list as it appeared on October 30, 2015, I noted that subsequent pages were archived with diminishing frequency and over a much shorter period of time. We see in Table 1, after the first three pages, there can be a significant variance in the frequency with which the remaining site pages are crawled. And, as was noted earlier, more than a quarter (28%) of the ranking list cannot be reconstructed at all. Ainsworth and Nelson examined the degree of temporal drift that can occur during the display of sparsely archived pages using the Sliding Target policy allowed by the web archive user interface (UI); namely many years in just a few clicks. Since a substantial portion of the U.S. News ranking list is missing, it is very likely the web browsing experience will result in a hybrid list of universities that encompasses different ranking years as the user follows the page links.

Figure 5 - Frequency of Page Captures

Ultimately, we found page 19 had been captured three times during the specified time frame. However, the page containing the revised ranking that was of interest, Figure 4, was not available in the archive until March 14, 2016; almost five months after the ranking list had been updated. Further, in Figure 5, we note heavy activity for the first few and last few pages of the ranking list which may occur because, as shown on Figure 2, these links are presented prominently on page 1. The remaining pages 3 through 5 must be discovered manually by clicking on the next page. We note, in Figure 5, here the sporadic capture scheme for these intermediate pages.

Current web designs which feature pagination create a frustrating experience for the user when subsequent pages are omitted in the archive. It was my expectation that all pages associated with the ranking list would be saved in order to maintain the integrity of the complete listing of universities as they appeared on the publication date. My intuition is consistent with Poursadar and Shipman, who among their other conclusions, noted that navigational distance from the primary page can affect perceptions regarding what is considered to be viable content that should be preserved.  However, for multi-page articles, nearly 80% of the participants in their study considered linked information in the later pages as part of the resource. This perception was especially profound "when the content of the main and connected pages are part of a larger composition or set of information" as in perhaps a ranking list.

Overall, the findings of Poursadar and Shipman along with our personal observations indicate that archiving systems require an alternative methodology or domain rules that recognize when content spread across multiple pages represent a single collection or a composite resource that should be preserved in its entirety. From a design perspective, we can only wonder why there isn't a "view all" link on multi-page content such as the U.S. News ranking list. This feature might present a way to circumvent paginated design schemes so the Internet Archive can obtain a complete view of a particular web site; especially if the "view all" link is located on the first few pages which appear to be crawled most often. On the other hand, the use of pagination might also represent a conscious choice by the web designer or site owner as a way to limit page scraping even though people can still find a way to do so. Ultimately, the collateral damage associated with this type of design scheme is an uneven distribution in the archive; resulting in an incomplete archival record.


Scott G. Ainsworth and Michael L. Nelson. "Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive." International Journal on Digital Libraries 16: 129-144. DOI: 10.1007/s00799-014-0120-4

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Faryaneh Poursardar and Frank Shipman, "What Is Part of That Resource?User Expectations for Personal Archiving.", Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries, 2017.
-- Corren (@correnmccoy)

Sunday, August 27, 2017

2017-08-27: Four WS-DL Classes Offered for Fall 2017

An unprecedented four Web Science & Digital Library (WS-DL) courses will be offered in Fall 2017:
Finally, although they are not WS-DL courses per se, WS-DL member Corren McCoy is also teaching CS 462 Cybersecurity Fundamentals again this semester, and WS-DL alumnus Dr. Charles Cartledge is teaching CS 395 "Data Analysis Bootcamp".

I'm especially proud of this semester's breadth of course offerings and the participation by two alumni and one current WS-DL member.


2017-08-27: Media Manipulation research at the Berkman Klein Center at Harvard University Trip Report

A photo of me inside "The Yellow House" -
The Berkman Klein Center for Internet & Society
On June 5, 2017, I started work as an Intern at the Berkman Klein Center for Internet & Society at Harvard University under the supervision of Dr. Rob Faris, the Research Director for the Berkman Klein Center. This was a wonderful opportunity to conduct news media related research, and my second consecutive Summer research at Harvard. The Berkman Klein Center is an interdisciplinary research center that researches the means to tackle some of the biggest challenges on the Internet. Located in a yellow house at the Harvard Law School, the Center is committed to studying the development, dynamics, norms and standards of cyberspace. The center has produced many significant contributions such as the review of ICANN (Internet Corporation for Assigned Names and Numbers) and the founding of the DPLA (Digital Public Library of America).
During the first week of my Internship, I met with Dr. Faris to identify the research I would conduct in collaboration with Media Cloud at Berkman. Media Cloud, is an open-source platform for studying media ecosystems. The Media Cloud platform provides various tools for studying media such as Dashboard, Topic Mapper, and Source Manager
Media Cloud tools for visualizing and analyzing online news
Dashboard helps you see how a specific topic is spoken about in digital media. Topic Mapper helps you conduct topic in-depth analysis by identifying the most influential sources and stories. Source Manager helps explore the Media Cloud vast collection of digital media. The Media Cloud collection consists of about 547 million stories from over 200 countries. Some of the most recent Media Cloud research publications include: "Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election" and "Partisan Right-Wing Websites Shaped Mainstream Press Coverage Before 2016 Election."

Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election | Berkman Klein Center

In this study, we analyze both mainstream and social media coverage of the 2016 United States presidential election. We document that the majority of mainstream media coverage was negative for both candidates, but largely followed Donald Trump's agenda: when reporting on Hillary Clinton, coverage primarily focused on the various scandals related to the Clinton Foundation and emails.

Partisan Right-Wing Websites Shaped Mainstream Press Coverage Before 2016 Election, Berkman Klein Study Finds | Berkman Klein Center

The study found that on the conservative side, more attention was paid to pro-Trump, highly partisan media outlets. On the liberal side, by contrast, the center of gravity was made up largely of long-standing media organizations.

Rob and I narrowed my research area to media manipulation. Given, the widespread concern about the spread of fake news especially during the 2016 US General Elections, we sought to study the various forms of media manipulation and possible measures to mitigate this problem. I worked closely with Jeff Fossett, a co-intern on this project. My research about media manipulation began with a literature review of the state of the art. Jeff Fosset and I explored various research and news publications about media manipulation.


How the Trump-Russia Data Machine Games Google to Fool Americans

A year ago I was part of a digital marketing team at a tech company. We were maybe the fifth largest company in our particular industry, which was drones. But we knew how to game Google, and our site was maxed out.

Roger Sollenberger revealed a different kind of organized misinformation/disinformation campaign from the conventional publication of fake news (pure fiction - fabricated news). This campaign is based on Search Engine Optimization (SEO) of websites. Highly similar less popular (by Alexa rank) and often fringe news websites beat more popular traditional news sites in the Google rankings. They do this by optimizing their content with important trigger keywords, generate a massive volume of similar fresh (constantly updated) content, and link among themselves.

Case 2: Data & Society - Manipulation and Disinformation Online

Media Manipulation and Disinformation Online

Data & Society  introduced the subject of media manipulation in this report: media manipulators such as some far-right groups exploit the media's proclivity for sensationalism and novelty over newsworthiness. They achieve this through the strategic use of social media, memes, and bots to increase the visibility of their ideas through a process known as "attention hacking."

Case 3: Comprop - Computational Propaganda in the United States of America: Manufacturing Consensus Online 

Computational Propaganda in the United States of America: Manufacturing Consensus Online

This report by the Computational Propaganda Project at Oxford University, illustrated the influence of bots during the 2016 US General Elections and dynamics between bots and human users. They illustrated how armies of bots allowed campaigns, candidates, and supporters to achieve two key things during the 2016 election, first, to manufacture consensus and second, to democratize online propaganda.
COMPROP: 2016 US General Election sample graph showing network of humans (black nodes) retweeting bots (green nodes)
Their findings showed that armies of bots were built to follow, retweet, or like a candidate's content making that candidate seem more legitimate, more widely supported, than they actually are. In addition, they showed that the largest botnet in the pro-Trump network was almost 4 times larger than the largest botnet in the pro-Clinton network:
COMPROP: 2016 US General Election sample botnet graph showing the pro-Clinton and more sophisticated pro-Trump botnets

Based on my study, I consider media manipulation as:

calculated efforts taken to circumvent or harness the mainstream media in order to set agendas and propagate ideas, often with the utilization of social media as a vehicle.
On close consideration of the different mechanisms of media manipulation, I saw a common theme. This common theme of media manipulation was described by the Computational Propaganda Project as "Manufactured Consensus." The idea of manufactured consensus is: we naturally lend some credibility to stories we see at multiple different sources, and media manipulators know this. Consequently, some media manipulators manufacture consensus around a story, sometimes pure fabrication (fake news), and some other times, they take some truth from a story, distort it, and replicate this distorted version of the truth across multiple source to manufacture consensus. An example of this manufacture of consensus is Case 1. Manufactured consensus ranges from fringe disinformation websites to bots on Twitter which artificially boost the popularity of a false narrative. The idea of manufactured consensus motivated by attempt to first, provide a means of identifying consensus, second, learn to distinguish organic consensus from manufactured consensus both within news sources and Twitter. I will briefly outline the beginning part of my study to identify consensus specifically across news media sources.
I acknowledge there are multiple ways to define "consensus in news media," so I define consensus within news media:
 as a state in which multiple news sources report on the same or highly similar stories.
Let us use a graph (nodes and edges) to identify consensus in new media networks. In this graph representation, nodes represent news stories from media sources and consensus is captured by edges between the highly similar or the same news stories (nodes). For example, Graph 1 below shows consensus between NPR and BBC  for a story about a shooting in a Moscow court.
Graph 1: Consensus within NPR and BBC for "Shooting at a Moscow Court" story
Consensus may occur within the mainstream left news media (Graph 2) or the mainstream right media (Graph 3). 
Graph 2: Consensus on the left (CNN, NYTimes & WAPO) media for the "Transgender ban" story
Graph 3: Consensus on the right (Breitbart, The blaze, & Fox) for the "Republican senators who killed the Skinny repeal bill" story
Consensus may or may not be exclusive to left, center or right media, but if there is consensus across different media networks (e.g., mainstream left, center and right), we would like to capture or approximate the level of bias or "spin" expressed by the various media networks (left, center and right). For example, on August 8, 2017, during a roundtable in his Golf Club in Bedminster New Jersey, President Trump said if North Korea continues to threaten the US, "they will be met with fire and fury." Not surprisingly, various news media reported this story, in order words there was consensus within left, center and right news media organizations (Graph 4) for the "Trump North Korea fire and fury" story.
Graph 4: Consensus across left, center and right media networks for the "Trump North Korea fire and fury story"
Let us inspect the consensus Graph 4 closely, beginning with the left, then the center and right, consider the following titles:
  1. Calm down: we’re (probably) not about to go to war with North Korea, I’m at least, like, 75 percent sure.
  2. Trump now sounds more North Korea-y than North Korea
Politicus USA:
  1. Trump Threatens War With North Korea While Hanging Out In The Clubhouse Of His Golf Club
Huffington Post, Washington Post, and NYTimes, respectively:
The Hill:
Gateway Pundit, The Daily Caller, Fox, Breitbart, Conservative Tribune:
  1. WOW! North Korea Says It Is ‘Seriously Considering’ Military Strike on Guam (Gateway Pundit)
  2. North Korea: We May Attack Guam (The Daily Caller)
  3. Trump: North Korea 'will be met with fire and fury like the world has never seen' if more threats emerge (Fox)
  4. Donald Trump Warns North Korea: Threats to United States Will Be Met with ‘Fire and Fury’ (Breitbart)
  5. Breaking: Trump Promises Fire And Fury.. Attacks North Korea In Unprecedented Move (Conservative Tribune)
  6. North Korea threatens missile strike on Guam (Washington Examiner)
In my opinion, on the left, consider the critical outlook offered by Vox, claiming the President sounded like the North Korean Dictator Kim Jong Un. At the center, The Hill emphasized the unfavorable response of some senators due to the President's statement. I think one might say the left and some parts of the center painted the President as reckless due to his threats. On the right, consider the focus on the North Korean threat to strike Guam. The choice of words and perspectives reported on this common story exemplifies the "spin" due to the political bias of the various polarized media. We would like to capture this kind of spin. But it is important to note that our goal is NOT to determine what is the truth. Instead, if we can identify consensus and go beyond consensus to capture or approximate spin, this solution could be useful in studying media manipulation. It is also relevant to investigate if spin is related to misinformation or disinformation.
I believe the prerequisite for quantifying spin is identifying consensus. The primitive operation of identifying consensus is the binary operation of measuring the similarity (or distance) between two stories. I have began this analysis with an algorithm in development. This algorithm was applied to generate Graphs 1-4. Explanation of this algorithm is beyond the scope of this post, but you may see the algorithm in action through this polar media consensus graph, which periodically computes a consensus graph for left, center and right media.
Consensus graph generated by an algorithm in development
The second part of our study was to identify consensus on Twitter. I will strive to report the developments of this research as well as a formal introduction of the consensus identifying algorithm when our findings are concrete. 
In addition to researching media manipulation, I had the pleasure to see the 4th of July fireworks across the Charles River from Matt's rooftop, and attend Law and Cyberspace lectures hosted by Harvard Law School Professors - Jonathan Zittrain and Urs Gasser. I had the wonderful opportunity to teach Python and learn from my fellow interns, as well as present my media manipulation research to Harvard LIL.

Media manipulation is only going to evolve, making its study crucial. I am grateful for the constant guidance of my Ph.D supervisors, Dr. Michael Nelson and Dr. Michele Weigle, and am also very grateful to the Dr. Rob Faris at Media Cloud, and the rest of Berkman Klein community for providing me with the opportunity to research this pertinent subject.


Saturday, August 26, 2017

2017-08-26: rel="bookmark" also does not mean what you think it means

Extending our previous discussion about how the proposed rel="identifier" is different from rel="canonical" (spoiler alert: "canonical" is only for pages with duplicative text), here I summarize various discussions about why we can't use rel="bookmark" for the proposed scenarios.  We've already given a brief review of why rel="bookmark" won't work (spoiler alert: it is explicitly prohibited for HTML <link> elements or HTTP Link: headers) but here we more deeply explore the likely original semantics. 

I say "likely original semantics" because:
  1. the short phrases in the IANA link relations registry ("Gives a permanent link to use for bookmarking purposes") and the HTML5 specification ("Gives the permalink for the nearest ancestor section") are not especially clear, nor is the example in the HTML5 specification.
  2.  rel="bookmark" exists to address a problem, anonymous content, that has been so thoroughly solved that the original motivation is hard to appreciate. 
In our Signposting work, we had originally hoped we could use rel="bookmark" to mean "please use this other URI when you press control-D".  For example, we hoped the HTML at http://www.sciencedirect.com/science/article/pii/S038800011400151X could have:

<link rel="bookmark">http://dx.doi.org/10.1016/j.langsci.2014.12.003</link>

And when the user hit "control-D" (the typical keyboard sequence for bookmarking), the user agent would use the doi.org URI instead of the current URI at sciencedirect.com.  But alas, that's not why rel="bookmark" was created, and the original intention is likely why rel="bookmark" is prohibited from <link> elements.  I say likely because the motivation is not well documented and I'm inferring it from the historical evidence and context.

In the bad old days of the early web, newsfeeds, blogs, forums, and the like did not universally support deep links, or permalinks, to their content.  A blog would consist of multiple posts displayed within a single page.  For example page 1 of a blog would have the seven most recent posts, page 2 would have the previous seven posts, etc.  The individual posts were effectively anonymous: you could link to the "top" of a blog (e.g., blog.dshr.org), but links to individual posts were not supported; for example this individual post from 2015 is no longer on page 1 of the blog and without the ability to link directly to its permalink, one would have click backwards through many pages to discover it.

Of course, now we take such functionality for granted -- we fully expect to have direct links to individual posts, comments, etc.  The earliest demonstration I can find is from this blog post from 2000 (the earliest archived version is from 2003, here's the 2003 archived version of top-level link to the blog where you can see the icon the post mentions).   This early mention of a permalink does not use the term "permalink" or relation rel="bookmark"; those would follow later. 

The implicit model with permalinks appears to be that there would be > 1 rel="bookmark" assertions within a single page, thus the relation is restricted to <a> and <area> elements.  This is because <link> elements apply to the entire context URI (i.e., "the page") and not to specific links, so having > 1 <link> elements with rel="bookmark" would not allow agents to understand the proper scoping of which element "contains" the content that has the stated permalink (e.g., this bit of javascript promotes rel="bookmark" values into <link> elements, but scoping is lost).  An ASCII art figure is order here:

|                            |
|  <A href="blog.html"       |
|     rel=bookmark>          |
|  Super awesome alphabet    |
|  blog! </a>                |
|  Each day is a diff letter!|
|                            |
|  +---------------------+   |
|  | A is awesome!!!!    |   |
|  | <a href="a.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for A </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | B is better than A! |   |
|  | <a href="b.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for B </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | C is not so great.  |   |
|  | <a href="c.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for C </a>|   |
|  +---------------------+   |
|                            |

$ curl blog.html
Super awesome alphabet blog!
Each day is a diff letter!
A is awesome!!!!
permalink for A
B is better than A!
permalink for B 
C is not so great.
permalink for C
$ curl a.html
A is awesome!!!!
permalink for A
$ curl b.html
B is better than A!
permalink for B 
$ curl c.html
C is not so great.
permalink for C

In the example above, the blog has a rel="bookmark" to itself ("blog.html") and since the <a> element appears at the "top level" of the HTML, it is understood that the scope of the element applies to the entire page.  In the subsequent posts, the scope of the link is bound to some ancestor element (perhaps a <div> element) and thus it does not apply to the entire page.  The rel="bookmark" to "blog.html" is perhaps unnecessary, the user agent already knows its own context URI (in other words, a user agent typically knows the URL of the page it is currently displaying (but might not in some conditions, like being the response to a POST request), but surfacing the link with an <a> element makes it easy for the user to right-click, copy-n-paste, etc.  If "blog.html" had four <link rel="bookmark" > elements, the links would not be easily available for user interaction and scoping information would be lost.

And it's not just for external content ("a.html", "b.html", "c.html") like the example above.  In the example below, rel="bookmark" is used to provide permalinks for individual comments contained within a single post.

|                            |
|  <A href="a.html"          |
|     rel=bookmark>          |
|  A is awesome!!!!</a>      |
|                            |
|  +---------------------+   |
|  | <a name="1"></a>    |   |
|  | Boo -- I hate A.    |   |
|  | <a href="a.html#1"  |   |
|  |    rel=bookmark>    |   |
|  | 2017-08-01 </a>     |   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | <a name="2"></a>    |   |
|  | a series of tubes!  |   |
|  | <a href="a.html#2"  |   |
|  |    rel=bookmark>    |   |
|  | 2017-08-03 </a>     |   |
|  +---------------------+   |
|                            |

This style exposes the direct links of the individual comments, and in this case the anchor text for the permalink is the datestamp of when the post was made (by convention, permalinks often have anchor text or title attributes of "permalink", "permanent link", datestamps, the title of the target page, or variations of these approaches).  Again, it would not make sense to have three separate <link rel="bookmark" > elements here, obscuring scoping information and inhibiting user interaction. 

So why prohibit <link rel="bookmark" > elements?  Why not allow just a single <link rel="bookmark" > element in the <head> of the page, which would by definition enforce the scope to apply to the entire document?  I'm not sure, but I guess it stems from 1) the intention of surfacing the links to the user, 2) the assumption that a user-agent already knows the URI of the current page, and 3) the assumption that there would be > 1 bookmarks per page.  I suppose uniformity was valued over expressiveness. A 1999 HTML specification does not explicitly mention the <link> prohibition, but it does mention having several bookmarks per page.

An interesting side note is that while typing self-referential, scoped links with rel="bookmark" to differentiate them from just regular links to other pages seemed like a good idea ca. 1999, such links are now so common that many links with the anchor text "permalink" or "permanent link" often do not bother to use rel="bookmark" (e.g., Wikipedia pages all have "permanent link" in the left-hand column, but do not use rel="bookmark" in the HTML source, but the blogger example captured in the image above does use bookmark).  The extra semantics are no longer novel and are contextually obvious. 

In summary, in much the same way there is confusion about rel="canonical",  which is better understood as rel="hey-google-index-this-url-instead", perhaps a better name for rel="bookmark" would have been rel="right-click".  If you s/bookmark/right-click/g, the specifications and examples make a lot more sense. 

--Michael & Herbert

N.B. This post is a summary of discussions in a variety of sources, including this WHATWG issue, this tweet storm, and this IETF email thread

Friday, August 25, 2017

2017-08-25: University Twitter Engagement: Using Twitter Followers to Rank Universities

Figure 1 Summing primary and secondary followers for @ODUNow

Figure 1: Summing primary and secondary followers for @ODUNow
Our University Twitter Engagement (UTE) rank is based on the friend and extended follower network of primary and affiliated secondary Twitter accounts referenced on a university's home page. We show that UTE has a significant, positive correlation with expert university reputation rankings (e.g., USN&WR, THE, ARWU) as well as rankings by endowment, enrollment, and athletic expenditures (EEE). As illustrated in Figure 1, we bootstrap the process by starting with the URI for the university's homepage obtained from the detailed institutional profile information in the ranking lists. For each URI, we navigated to the associated webpage and searched the HTML source for links to valid Twitter handles. Once the Twitter screen name was identified, the Twitter GET users/Show API was used to retrieve the URI from the profile of each user name. If the domain of the URI matched exactly or resolved to the known domain of the institution, we considered the account to be one of the university's official, primary Twitter handles since the user had self-associated with the university via the URI reference.

As an example, the user names @NBA, @DukeAnnualFund, @DukeMBB, and @DukeU were extracted from the page source of the Duke University homepage (www.duke.edu). However, only @DukeAnnualFund and @DukeU are considered official primary accounts because their respective URIs, annualfund.duke.edu and duke.edu, are in the same domain as the university.  On the other hand, @DukeMBB maps to GoDuke.com/MBB, which is not in the same domain as duke.edu, so we don't include it among the official accounts. Ultimately, we delve deeper into the first and second degree relationships between Twitter followers to identify the pervasiveness of  the university community which includes not only academics, but sports teams, high profile faculty members, and other sponsored organizations.

We aggregated the rankings from multiple expert sources to calculate an adjusted reputation rank (ARR) for each university which allows direct comparison based on position in the list and provides a collective perspective of the individual rankings. In rank-to-rank comparisons using Kendall's Tau, we observed a significant, positive rank correlation (τ=0.6018) between UTE and ARR which indicates that UTE could be a viable proxy for ranking atypical institutions normally excluded from traditional lists.  We also observed a strong correlation (τ=0.6461) between UTE and EEE suggesting that universities with high enrollments, endowments, and/or athletic budgets also have high academic rank. The top 20 universities as ranked by UTE are shown in Table 1. We've highlighted a few universities where there is a significant disparity between the ARR and the UTE ranking which indicates larger Twitter followings than can be explained just by academic rank.

University UTE Rank ARR Rank
Harvard University 1 1
Stanford University 2 2
Cornell University 3 10
Yale University 4 7
University of Pennsylvania 5 8
Arizona State University--Tempe 6 59
Columbia University in the City of New York 7 4
Texas A&M University--College Station 8 39
Wake Forest University 9 74
University of Texas--Austin 10 16
Pennsylvania State University--Main Campus 11 24
University of Michigan--Ann Arbor 12 10
University of Minnesota--Twin Cities 13 16
Ohio State University--Main Campus 14 22
Princeton University 15 4
University of Wisconsin--Madison 16 14
University of Notre Dame 17 46
Boston University 18 21
University of California--Berkeley 19 3
Oklahoma State University--Main Campus 20 100

Table 1: Top 20 Universities Ranked by UTE for Comparison With ARR

We have prepared an extensive report of our findings as a technical report available on arXiv (linked below). We have also posted all of the ranking and supporting data used in this study which includes a social media rich dataset containing over 1 million Twitter profiles, ranking data, and other institutional demographics in the oduwsdl Github repository.

- Corren (@correnmccoy)

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Monday, August 14, 2017

2017-08-14: Introducing Web Archiving and Docker to Summer Workshop Interns

Last Wednesday, August 9, 2017, I was invited to give a talk to some summer interns of the Computer Science Department at Old Dominion University. Every summer our department invites some undergrad students from India and hosts them for about a month to work on some projects under a research lab here as summer interns. During this period, various research groups introduce their work to those interns to encourage them to become potential graduate applicants. Those interns also act as academic ambassadors who motivate their colleagues back in India for higher studies.

This year, Mr. Ajay Gupta invited a group of 20 students from Acharya Institute of Technology and B.N.M. Institute of Technology and supervised them during their stay at Old Dominion University. Like the last year, I was selected from the Web Science and Digital Libraries Research Group again to introduce them with the concept of web archiving and various researches of our lab. An overview of the talk can be found in my last year's post.

Recently, I have been selected as the Docker Campus Ambassador for ODU. I thought it would be a great opportunity to introduce those interns with the concept of software containerization. Among numerous other benefits, it would help them deal with the "works on my machine" problem (also known as "magic laptop" problem), common in students' life.

After finishing the web archiving talk, I briefly introduced them with the basic concepts and building blocks of Docker. Then I illustrated the process of containerization with the help of a very simple example.

I encouraged them to interrupt me during the talk to ask any relevant questions as both the topics were fairly new for them. Additionally, I tried to bring in references from Indian culture, politics, and cinema to make it more engaging for them. Overall, I was very happy with the kind of questions they were asking, which gave me the confidence that they were actually absorbing these new concepts and not asking questions just for the sake of grabbing some swags that included T-shirts and stickers from Docker and Memento.

Sawood Alam