Tuesday, December 20, 2016

2016-12-20: Archiving Pages with Embedded Tweets

I'm from Louisiana and used Archive-It to build a collection of webpages about the August 2016 flood there (https://www.archive-it.org/collections/7760/).

One of the pages I came across, Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy', highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page:

Live page, screenshot generated on Sep 9, 2016

To me, the most important resources here were the tweets and their pictures, so I'll focus here on how well they were archived.

First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on Twitter’s servers which converts the <blockquote> into a fully-rendered Tweet."

Here's the first embedded tweet (https://twitter.com/vernonernst/status/765398679649943552), with a picture of a long line of trucks pulling their boats to join the Cajun Navy.
First embedded tweet - live web

Here's the source for this embedded tweet:
<blockquote class="twitter-tweet" data-width="500"> <p lang="en" dir="ltr">
<a target="_blank" href="https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a> </p> — Vernon Ernst (@vernonernst) <a href="https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a> </blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

When the widgets.js script executes in the browser, it transforms the <blockquote class="twitter-tweet"> element into a <twitterwidget>:
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552">
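To make the two states concrete, here is a minimal Python sketch that reports which form of the embed a given piece of HTML contains: the untransformed <blockquote class="twitter-tweet">, or the rendered <twitterwidget>. The class and function names (EmbeddedTweetFinder, find_embedded_tweets) are hypothetical, and the detection rules are assumptions based only on the markup shown above, not on any Twitter API.

```python
from html.parser import HTMLParser


class EmbeddedTweetFinder(HTMLParser):
    """Counts embedded tweets in raw (<blockquote>) and rendered (<twitterwidget>) form."""

    def __init__(self):
        super().__init__()
        self.raw_embeds = 0            # untransformed <blockquote class="twitter-tweet">
        self.rendered_ids = []         # data-tweet-id of each rendered <twitterwidget>
        self.loads_widgets_js = False  # True if a <script> src mentions widgets.js

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Valueless attributes such as "async" arrive as None, hence the "or".
        classes = (attrs.get("class") or "").split()
        if tag == "blockquote" and "twitter-tweet" in classes:
            self.raw_embeds += 1
        elif tag == "twitterwidget" and "twitter-tweet-rendered" in classes:
            self.rendered_ids.append(attrs.get("data-tweet-id"))
        elif tag == "script" and "widgets.js" in (attrs.get("src") or ""):
            self.loads_widgets_js = True


def find_embedded_tweets(html):
    """Return (raw_embeds, rendered_ids, loads_widgets_js) for an HTML string."""
    finder = EmbeddedTweetFinder()
    finder.feed(html)
    finder.close()
    return finder.raw_embeds, finder.rendered_ids, finder.loads_widgets_js
```

Run against the embed source above, this should report one raw embed and no rendered tweets. Run against the transformed DOM, it should report the tweet ID from data-tweet-id instead. The same check can be applied to the source or DOM served by each archive.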

Now, let's consider how the various archives handle this.

Archive-It

Since I'd been using Archive-It to create the collection, that was the first tool I used to capture the page. Archive-It uses the Internet Archive's Heritrix crawler and Wayback Machine for replay. I set the crawler to archive the page and embedded resources, but not to follow links. No special scoping rules were applied.

http://wayback.archive-it.org/7760/20160818180453/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/
Archive-It, captured on Aug 18, 2016
Here's how the first embedded tweet displayed in Archive-It:
Embedded tweet as displayed in Archive-It


Here's the source (as rendered in the DOM) upon playback in Archive-It's Wayback Machine:
<blockquote class="twitter-tweet twitter-tweet-error" data-conversation="none" data-width="500" data-twitter-extracted-i1479916001246582557="true">
<p lang="en" dir="ltr"> <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/CajunNavy?src=hash" target="_blank" rel="external nofollow">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/LouisianaFlood?src=hash" target="_blank" rel="external nofollow">#LouisianaFlood</a> <a href="http://wayback.archive-it.org/7760/20160818180453/https://t.co/HaugQ7Jvgg" target="_blank" rel="external nofollow">pic.twitter.com/HaugQ7Jvgg</a> </p> <p>— Vernon Ernst (@vernonernst) <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/vernonernst/status/765398679649943552" target="_blank" rel="external nofollow">August 16, 2016</a></p></blockquote>
<p> <script async="" src="//wayback.archive-it.org/7760/20160818180453js_/http://platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script> </p>

Except for the links being rewritten to point to the archive, this is the same as the original embed source rather than the transformed version. Although widgets.js was archived (http://wayback.archive-it.org/7760/20160818180456js_/http://platform.twitter.com/widgets.js?4fad35), on playback it is not able to modify the DOM as it does on the live web, because widgets.js loads additional JavaScript that was not archived.
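The rewritten links in these snippets share a common shape: an archive prefix, a 14-digit capture timestamp, an optional modifier such as js_, and then the original URL. The Python sketch below splits a rewritten URL into those parts. The names ReplayUrl and parse_replay_url are hypothetical, and the pattern (in particular, treating the modifier as two lowercase letters plus an underscore) is an assumption drawn only from the URLs shown in this post, not from a specification of any replay system.

```python
import re
from typing import NamedTuple, Optional


class ReplayUrl(NamedTuple):
    prefix: str     # archive host plus any collection path
    timestamp: str  # 14-digit capture datetime, YYYYMMDDhhmmss
    modifier: str   # e.g. "js_" or "mp_"; empty string when absent
    original: str   # the URL as it appeared on the live web


# Assumed pattern: <prefix>/<14 digits><optional xx_ modifier>/<original URL>
_REPLAY_URL = re.compile(
    r"^(?P<prefix>.*?)/(?P<timestamp>\d{14})(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$"
)


def parse_replay_url(url: str) -> Optional[ReplayUrl]:
    """Split a rewritten archive URL into its parts, or return None if it does not match."""
    m = _REPLAY_URL.match(url)
    if m is None:
        return None
    return ReplayUrl(
        m.group("prefix"),
        m.group("timestamp"),
        m.group("modifier") or "",
        m.group("original"),
    )
```

A live-web URL such as the tweet's own address does not match the pattern, so the function returns None. That gives a rough way to spot resources in a replayed page that still point at the live web rather than at the archive.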

webrecorder.io

Next up is the on-demand service, webrecorder.io. Webrecorder.io is able to replay the embedded tweets as on the live web.

https://webrecorder.io/mweigle/south-louisiana-flood---2016/20160909144135/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/

Webrecorder.io, viewed Sep 29, 2016

The HTML source (https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/) looks similar to the original embed (except for re-written links):
<blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr"><a target="_blank" href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world!  <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a></p>&mdash; Vernon Ernst (@vernonernst) <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote>
<script async src="//wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

Upon playback, we see that webrecorder.io is able to fully execute the widgets.js script, so the transformed HTML looks like the live web (with the inserted <twitterwidget> element):
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552"></twitterwidget>
<script async="" src="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

Note that widgets.js is archived and is loaded from webrecorder.io, not the live web.

archive.is

archive.is is another on-demand archiving service.  As with webrecorder.io, the embedded tweets are shown as on the live web.

http://archive.is/5JcKx
archive.is, captured Sep 9, 2016

archive.is executes and then flattens JavaScript, so although the embedded tweet looks similar to how it's rendered in webrecorder.io and on the live web, the source is completely different:
<article style="direction:ltr;display:block;">
...

<a href="https://archive.is/o/5JcKx/twitter.com/vernonernst/status/765398679649943552/photo/1" style="color:rgb(43, 123, 185);text-decoration:none;display:block;position:absolute;top:0px;left:0px;width:100%;height:328px;line-height:0;background-color: rgb(255, 255, 255); outline: invert none 0px; "><img alt="View image on Twitter" src="http://archive.is/5JcKx/fc15e4b873d8a1977fbd6b959c166d7b4ea75d9d" title="View image on Twitter" style="width:438px;max-width:100%;max-height:100%;line-height:0;height:auto;border-width: 0px; border-style: none; border-color: white; "></a>
...

</article>
...
<blockquote cite="https://twitter.com/vernonernst/status/765398679649943552" style="list-style: none outside none; border-width: medium; border-style: none; margin: 0px; padding: 0px; border-color: white; ">
...

<span>#</span><span>CajunNavy</span></a>
on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://archive.is/o/5JcKx/https://twitter.com/hashtag/LouisianaFlood?src=hash" style="direction:ltr;background-color: transparent; color:rgb(43, 123, 185);text-decoration:none;outline: invert none 0px; "><span>#</span><span>LouisianaFlood</span></a>
</p>
...
</blockquote>


WARCreate

WARCreate is a Google Chrome extension that our group developed to allow users to archive the page they are currently viewing in their browser.  It was last actively updated in 2014, though we are currently working on updates to be released in 2017.

The image below shows the result of the page being captured with WARCreate and replayed in webarchiveplayer:

WARCreate, captured Sep 9, 2016, replayed in webarchiveplayer
Upon replay, the WARCreate capture does not display the tweets at all. Here's a close-up of where the tweets should be:

WARCreate capture replayed in webarchiveplayer, with tweets missing
Examining both the WARC file and the source of the archived page helps to explain what's happening.

Inside the WARC, we see:
<h4>In stepped a group known as the “Cajun Navy”:</h4>
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
<p><script async="" src="//platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>


This is the same markup that's in the DOM upon replay in webarchiveplayer, except that the script source is rewritten to point to localhost:
<h4>In stepped a group known as the “Cajun Navy”:</h4>
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
<p><script async="" src="//localhost:8090/20160822124810js_///platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>


WARCreate captures the HTML after the page has fully loaded. So the page loads, widgets.js executes, the DOM changes (hence the <twitterwidget> tag), and then WARCreate saves the transformed HTML. What it does not capture is the widgets.js script itself, which is needed to display <twitterwidget> properly. We expect that once WARCreate is fixed to archive the loaded JavaScript, the embedded tweet will be displayed as on the live web.
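One rough way to confirm that a script is missing from a capture is to list the WARC-Target-URI headers in the WARC file and look for widgets.js. The Python sketch below does this with a plain line scan. It assumes an uncompressed WARC held in memory, and because it does not parse record boundaries, it could also pick up a matching line inside a payload. The function names are hypothetical, and a dedicated WARC parsing library would be more robust.

```python
def warc_target_uris(warc_bytes):
    """Return every WARC-Target-URI value found in an uncompressed WARC.

    This is a line scan, not a real WARC parser: it does not track record
    boundaries, so treat the result as a quick diagnostic only.
    """
    uris = []
    for line in warc_bytes.splitlines():
        if line.lower().startswith(b"warc-target-uri:"):
            value = line.split(b":", 1)[1].strip()
            uris.append(value.decode("utf-8", "replace"))
    return uris


def has_capture(uris, needle):
    """True if any captured URI contains the given substring."""
    return any(needle in uri for uri in uris)
```

For example, has_capture(warc_target_uris(data), "widgets.js") returning False for a capture would be consistent with the missing tweets described above.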

Discussion
 
Each of these four archiving tools operates on the embedded tweet in a different way, highlighting the complexities of archiving asynchronously loaded JavaScript and DOM transformations.
  • Archive-It (Heritrix/Wayback) - archives the HTML returned in the HTTP response and JavaScript loaded from the HTML
  • Webrecorder.io - archives the HTML returned in the HTTP response, JavaScript loaded from the HTML, and JavaScript loaded after execution in the browser
  • Archive.is - fully loads the webpage, executes JavaScript, rewrites the resulting HTML, and archives the rewritten HTML
  • WARCreate - fully loads the webpage, executes JavaScript, and archives the transformed HTML
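
The four behaviors above can be summarized as a small model. This Python sketch is a simplification of what was observed for this one page, not a general rule about these tools. The Capture fields and the tweet_renders_on_replay function are hypothetical names, and the archive.is flags simply reflect that its flattened HTML needs no script on replay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    tool: str
    html_state: str            # "original", "transformed", or "flattened"
    widgets_js_archived: bool  # script referenced directly from the HTML
    dynamic_js_archived: bool  # JavaScript that widgets.js loads at run time


def tweet_renders_on_replay(capture):
    """Simplified rule: flattened HTML needs no script; otherwise all JS must be archived."""
    if capture.html_state == "flattened":
        return True
    return capture.widgets_js_archived and capture.dynamic_js_archived


# Values taken from the observations in this post.
CAPTURES = [
    Capture("Archive-It", "original", True, False),
    Capture("Webrecorder.io", "original", True, True),
    Capture("archive.is", "flattened", False, False),
    Capture("WARCreate", "transformed", False, False),
]

RESULTS = {c.tool: tweet_renders_on_replay(c) for c in CAPTURES}
```

Under this model, only the Webrecorder.io and archive.is captures render the tweet, which matches the screenshots above.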
It is useful to examine how different archiving tools and playback engines render complex webpages, especially those that contain embedded media.  Our forthcoming update to the Archival Acid Test will include tests for embedded content replay.

-Michele