Tuesday, April 24, 2018

2018-04-24: Let's Get Visual and Examine Web Page Surrogates


Why visualize individual web pages? A variety of visualizations of individual web pages exist, but why do we need them when we can just choose a URI from a list and put it in our web browser? URIs are intended to be opaque: text from the underlying web resource does not need to exist in the URI.

Consider http://dx.doi.org/10.1007/s00799-016-0200-8. Where does it go? Should we click on it? What content exists under the veil of the URI? Will it meet our needs?

Now consider this web page surrogate produced by embed.ly for the same URI:

Avoiding spoilers: wiki time travel with Sheldon Cooper

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if...
If we were looking for research papers about avoiding spoilers for TV shows, then we know that clicking on this surrogate will take us to something that meets our information needs. If we were searching for marine mammals, then this surrogate shows us that the underlying page will not be very satisfying. In this case, the surrogate is intended to give the user enough information to answer the question: should I click on this?

Last year, when I reviewed a number of live web curation and social media tools, I was primarily focused on tools that produce social cards like the one above. This was because social cards appeared to be the lingua franca of web page surrogates. Social cards are not the only surrogate in use today and definitely not the only surrogate evaluated in literature. In this post, I cover several surrogates that have been evaluated and then talk about the studies in which they played a part. I was curious as to which surrogate might be best for collections of mementos.

Different Web Page Surrogates




Text Snippet


Text snippets are one of the earliest surrogates. They only require fetching a given web page before selecting the text to be used in the snippet. The text selection can be done via many different methods like El-Beltagy's "KP-Miner: A Keyphrase Extraction System for English and Arabic Documents" and Chen's "A Practical System of Keyphrase Extraction for Web Pages". Text snippets are typically used by search engines for displaying results.

The Google search result text snippet for Michele Weigle's ODU CS page.
The Bing search result text snippet for Michele Weigle's ODU CS homepage. Note that Bing did not capture the last modified date, but does list a series of links at the bottom of the snippet, drawn from the menu of the homepage.
The DuckDuckGo search result for Michele Weigle's ODU CS homepage. Note that DuckDuckGo displays the favicon and generates a different text snippet from Google and Bing.

In the above search results for Michele Weigle's ODU CS homepage, the text snippets are slightly different depending on the search engine. Because there is a lot of variation in web pages, there are a lot of possibilities when building text snippets.

Text snippets still receive a bit of research, with Maxwell evaluating the effectiveness of snippet length in 2017 as part of "A Study of Snippet Length and Informativeness" (university repository copy).

As a group, text snippets are listed one per row on a web page. This is optimal for search results, as the position of the result conveys its relevance. This format also limits how many surrogates can be viewed at once: where text snippets are viewed one per row, more thumbnails can fit into the same amount of space.

Thumbnail


A thumbnail is produced by loading the given page in a browser and taking a screenshot of the contents of the browser window. Thumbnails have been used in many forms. The Safari web browser uses them to display the content of tabs.

The Safari web browser uses thumbnails to show surrogates for web pages  that are currently loaded in its tabs.
In "Visual preview for link traversal on the World Wide Web", Kopetzky demonstrated that thumbnails could be used to provide a preview of a linked page via a mouseover effect so that users could decide if a link was worth clicking. In "Data Mountain: Using Spatial Memory for Document Management" (Microsoft Research copy), Robertson proposed using a 3D virtual environment for organizing a corpus of web pages where each page is visualized as a thumbnail. Outside of the web, file management tools, such as macOS's Finder, use thumbnails to provide visual previews of documents.

An example of the interface for Data Mountain, a 3D environment for browsing web pages via thumbnails.

macOS Finder displaying thumbnails of file contents.

In the web archiving world, the UK Web Archive uses thumbnails to show a series of mementos so one can compare the content of each memento, effectively viewing the content drift over time. Thumbnails are also used in our own What Did It Look Like?, a platform that animates thumbnails so one can watch the changes to a web page over the years. Our group is also investigating the use of thumbnails for summarizing how a single webpage has changed over time, using three different visualizations: an animation, a grid view, and an interactive timeline view.

The UK Web Archive uses thumbnails to show different mementos for the same resource, allowing the user to view web page changes over time.

What Did It Look Like? allows the user to watch a web page change over time by animating the thumbnails of the mementos of a resource.

The size of thumbnails has a serious effect on their utility. If the thumbnail is too large, it does not provide room for comparison of surrogates. If the thumbnail is too small, users cannot see what is in the image. Thumbnails are also difficult for users to understand if a page consists mostly of text or has no unique features. In "How People Recognize Previously Seen Web Pages from Titles, URLs and Thumbnails", Kaasten established that the optimal thumbnail size is 208x208 pixels.

The viewport of a thumbnail is also an important part of its construction. Depending on what we want to emphasize on a web page, we may need to generate a thumbnail from content "below the fold". Aula evaluated the use of thumbnails that were the same size, but had magnified a portion of a web page at 20% versus 38%. She found that users performed better with thumbnails at a magnification of 20%.

Enhanced Thumbnail


In 2001, Woodruff introduced the enhanced thumbnail in "Using Thumbnails to Search the Web" (author copy). Prior to taking the screenshot of the browser as with a normal thumbnail, the HTML of the page is modified to make certain terms stand out. In the example below, changes in font size and background color emphasize certain terms of a page. The goal is to draw attention to these terms in hopes that search engine users could find relevant pages faster.

Examples of Thumbnails and Enhanced Thumbnails:
(a) Plain thumbnail
(b) Enhanced Thumbnail using HTML modification to emphasize the words "Recipe" and "Pound Cake"
(c) Enhanced Thumbnail using HTML and image modification to make "Recipe" and "Pound Cake" stand out more
(d) Emphasis on "MiniDisc Player"
(e) Emphasis on "hybrid", "car", and "mileage"
(f) Emphasis on "Hellerstein"
(g) Plain thumbnail of a page only consisting of text
(h) Enhanced thumbnail emphasizing specific terms in the text page


Even though enhanced thumbnails have performed well, they are computationally expensive to create. This likely explains why they have not been seen in use outside of laboratory studies.

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali developed something similar by adding a tag cloud to each thumbnail and named the concept a "visual tag".

Internal Image


An internal image is an image embedded within the web page. For some web pages, like news stories and product pages, these internal images can be good surrogates because of their uniqueness. Pinterest uses internal images as surrogates.

Pinterest uses internal images as surrogates for web pages.

The key is identifying which embedded image is best for representing the page. Hu identified the issues with solving this problem as part of "Categorizing Images in Web Documents", identifying a number of features such as using the text surrounding an image and evaluating the number of colors in the image. Maekawa worked on classifying images and achieved an 83.1% accuracy in "Image Classification for Mobile Web Browsing" (conference copy). While these studies provided solutions for classifying images, we really need to know which images are unique and relevant to the web page. Research does exist to address this issue, such as the work described in Li's "Improving relevance judgment of web search results with image excerpts" (conference copy). These solutions are imperfect, which may be why Pinterest and other sites ask the user to choose an image from those embedded in the page.


Visual Snippet


In 2009, Teevan introduced visual snippets as part of "Visual snippets: summarizing web pages for search and revisitation" (Microsoft Research copy, conference slides). Teevan gave 20 web pages to a graphic designer and asked him to generate a small 120x120 image representing each page. She observed a pattern in the resulting images and derived a template to use as a surrogate. These surrogates combine the internal image, placed within the background of the surrogate, with a title running across the top of the page, and a page logo.

Examples of thumbnails on the bottom and their corresponding visual snippets on top.
She used machine learning to choose a good internal image and logo. This is more complex than merely selecting a salient internal image as noted in the previous section. Not only does the visual snippet require two images, but two different types of images.

External Image


In 2010, Jiao put forth the idea of using external images in "Visual Summarization of Web Pages". Jiao notes that detecting the internal image may be difficult if not impossible for some pages. Instead, he suggests using image search engines to find a representative image to use as a surrogate.

A simplified version of his algorithm is:

  1. Extract key phrases from the target web page using Chen's KEX algorithm
  2. Use these phrases as queries for an image search engine
  3. Rerank the search engine results based on textual similarity to the target web page
  4. Choose the top ranked image
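
The four steps above can be sketched in Python. This is only an illustration under loud assumptions: key phrase extraction is replaced with a simple term-frequency count rather than Chen's KEX algorithm, the image search engine is a hypothetical callable (`search_images`) supplied by the caller, and textual similarity is approximated with cosine similarity over word counts of the text that accompanies each image result. None of these function names come from Jiao's paper.

```python
import math
import re
from collections import Counter

# A small, hand-picked stop word list; an assumption for this sketch only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text and return its non-stop-word tokens."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def extract_key_phrases(page_text, k=3):
    """Step 1 stand-in: the k most frequent terms (NOT Chen's KEX algorithm)."""
    return [term for term, _ in Counter(tokenize(page_text)).most_common(k)]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def choose_external_image(page_text, search_images, k=3):
    """Steps 2-4: query, rerank by similarity to the page, return the top image URL.

    search_images is a hypothetical callable that takes a query string and
    returns a list of (image_url, surrounding_text) tuples.
    """
    candidates = []
    for phrase in extract_key_phrases(page_text, k):
        candidates.extend(search_images(phrase))
    if not candidates:
        return None
    best = max(candidates, key=lambda c: cosine_similarity(page_text, c[1]))
    return best[0]
```

Passing the search engine in as a callable keeps the sketch independent of any particular image search API, and makes it easy to swap in a time-restricted source when working with mementos.
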
Though this would likely work well for live web pages about products, it may be a poor fit for mementos due to the temporal nature of words. Consider a memento from the late 1990s where one of the key phrases extracted contains the word Clinton. In the 1990s, the document was likely referring to US President Bill Clinton. If we use a search engine in 2018, it may return an image of 2016 presidential candidate Hillary Clinton. Some of these temporal issues have been detailed as part of the Longitudinal Analytics on Web Archive Data (LAWA) project.

Text + Thumbnail


In "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages" (Google Research copy) by Aula and "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" by Dziadosz, the authors consider the combination of text with a thumbnail as a surrogate.

The Internet Archive uses text and thumbnails for its search results, seen in the screenshot below.

The Internet Archive uses thumbnails and text together as part of its search results.
Al Maqbali further extended this concept with a text + visual tags combination.

Social Card


The social card goes by many names: rich link, snippet, social snippet, social media card, Twitter card, embedded representation, rich object, or social card. The social card typically consists of an image, a title, and a text snippet from the web page it visualizes.

The data within the social card is typically drawn from data within the meta tags of the HTML of the target web page. As an artifact of social media, different social media platforms consult different meta tags within the target page.
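
As a rough illustration of how a platform might read this markup, the sketch below collects Open Graph (`og:*`) and Twitter (`twitter:*`) meta tags from a page using only Python's standard library. The class name, function name, and sample markup are made up for this example and are not the tags of any real page; actual platforms apply their own precedence and fallback rules that are not modeled here.

```python
from html.parser import HTMLParser


class CardMetaParser(HTMLParser):
    """Collect Open Graph (og:*) and Twitter (twitter:*) meta tags."""

    def __init__(self):
        super().__init__()
        self.card_data = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph conventionally uses property=, Twitter uses name=
        key = attributes.get("property") or attributes.get("name") or ""
        if key.startswith(("og:", "twitter:")) and "content" in attributes:
            self.card_data[key] = attributes["content"]


def extract_card_data(html):
    """Return a dict of the social card meta tags found in the HTML."""
    parser = CardMetaParser()
    parser.feed(html)
    return parser.card_data


# Made-up markup for illustration only.
sample_html = """
<html><head>
<meta property="og:title" content="Example Title">
<meta property="og:image" content="https://example.com/image.png">
<meta name="twitter:card" content="summary">
<meta name="description" content="Ignored by this sketch">
</head><body></body></html>
"""

card = extract_card_data(sample_html)
```

A page with none of these tags yields an empty dictionary, which is the case where a platform would have to fall back on extracting its own text snippet and internal image.
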

For example, for Twitter, I used the following tags to produce the card below:

Social card for https://www.shawnmjones.org as seen on Twitter.


For Facebook, I used the following tags to produce the card below:
Social card for https://www.shawnmjones.org as seen on Facebook.


Note how the HTML tags are different for each service. Facebook supports the Open Graph Protocol, developed around 2009 (according to the CarbonDate service) whereas Twitter's features were developed around 2010 (according to CarbonDate). There are pages that lack this kind of assistive markup. To produce those cards, social media platforms will often use other methods, like those mentioned above, to extract a text snippet and an internal image. Any mementos captured prior to 2009 will not have the benefit of this assistive markup.

Though most platforms generate social cards in landscape form, some generate a portrait form as well. The intended use of the social cards on the platform and the nature of other visual cues on the platform often drive the decision as to which form the social card should take. All of the studies in this blog post evaluated social cards in their landscape form.
A landscape social card from Facebook.
A portrait social card from Google+.


Social cards are not just used by social media. Wikipedia uses social cards to provide a preview of links if the user hovers over the link, like what Kopetzky had envisioned with thumbnails. Google News often uses social cards for individual stories. Social cards sometimes include additional information beyond text snippet and image. In "What's Happening and What Happened: Searching the Social Web" Omar Alonso detailed the use of social cards in a prototype for Bing search results. Those cards also incorporated lists of users who shared the target web page as well as associated hashtags.

When a user hovers over an internal link, Wikipedia uses social cards  to display a preview of the linked web page.
Google News often uses social cards to list individual news articles.

There are similar concepts that are not instances of the social card. Some of the cards used by Google News are not social cards because each is a surrogate for a news story spanning multiple resources, rather than a single resource. Likewise, search engines use entity cards to display information about a specific entity drawn from multiple sources. Entity cards have been found to be useful by Bota's 2016 study "Playing Your Cards Right: The Effect of Entity Cards on Search Behaviour and Workload". I do not consider entity cards to be social cards because each social card is a surrogate for a single web resource, whereas an entity card is a surrogate for a conceptual entity and is drawn from multiple sources.
This card used by Google News is not a surrogate for a single web resource, and hence I do not consider it a social card.
This card format, used by Google, is also not a surrogate for a single web resource. This is an entity card, drawing from multiple web resources.

The creation of social cards can also be a lucrative market, with Embed.ly offering plans for web platforms ranging from $9 to $99 per month. They provide embedding services for the long form blogging service Medium, supporting a limited number of source websites. Individual cards can be made on their code generation page.

Evaluations of these Surrogates


Web page surrogates have been of great interest to those studying search engine result pages. I have reviewed eight studies on web surrogates, most mentioned above. I focused on how these studies compared surrogates with each other.


Author & Year | Text Snippet | Internal/External Image | Visual Snippet | Thumbnail | Enhanced Thumbnail/Visual Tags | Text + Thumbnail | Social Card
Woodruff 2001 X X X
Dziadosz 2002 X X X
Li 2008 X X
Teevan 2009 X X X
Jiao 2010 X X X
Aula 2010 X X X
Al Maqbali 2010 X X X X X
Loumakis 2011 X X X
Capra 2013 X X X


As noted above Woodruff introduced the concept of enhanced thumbnails in "Using Thumbnails to Search the Web". To evaluate their effectiveness, she generated questions based on tasks users commonly perform on the web. The questions were divided into 4 categories and 3 questions per category were each given to 18 participants. The participants were presented with search engine result pages consisting of 100 text snippets, thumbnails, or enhanced thumbnails. In their attempt to find web resources that would address their assigned questions, participants were evaluated based on their response times. The results indicated that enhanced thumbnails provided the fastest response times overall, but the results varied depending on the type of task. For locating an entity's homepage, text snippets and enhanced thumbnails performed roughly the same. For finding the picture of an entity, thumbnails and enhanced thumbnails performed roughly the same. All three surrogate types performed just as well for e-commerce or medical side-effect questions.

Dziadosz tested the concept of text snippets combined with thumbnails in "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" In this study of 35 participants, each participant was given 2 queries and 2 tasks. Each participant was given a different surrogate type. Their first task was to identify all search engine results on the page that they assumed to be relevant to their query. Their second task was to visit the pages behind the surrogates and identify which were actually relevant. The number of correct decisions for text snippets combined with thumbnails was higher than for text alone or thumbnails alone. Aula, in "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages", also evaluated text snippets, thumbnails, and a combination. She discovered that both were effective in making relevance judgements.

Teevan evaluated the effectiveness of visual snippets in "Visual snippets: summarizing web pages for search and revisitation". Her study consisted of 276 participants who were each given 12 search tasks and a set of 20 search results, with 4 of the 12 tasks completed with different surrogates. She discovered that text snippets required the fewest clicks, whereas thumbnails required the most. This indicates a lot of false positive matches for participants when using thumbnails. Participants preferred visual snippets and text snippets equally over thumbnails and preferred visual snippets for shopping tasks. Most participants found thumbnails to be too small to be useful.

Jiao introduced the concept of using external images as a surrogate in "Visual Summarization of Web Pages". He compared the use of internal images, external images, thumbnails, and visual snippets. Like Dziadosz's study, participants were asked to guess the relevance of the web page behind the surrogate and then later evaluate if their earlier guess was correct. To generate search results, they randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Bing. His results show that none of the surrogates works for all types of pages. Overall internal images were best for pages that contained a dominant image whereas thumbnails or external images were best for understanding pages that did not contain a dominant image.

In "Improving relevance judgment of web search results with image excerpts", Li was interested in identifying dominant images in web pages. I focus here on the second study in his work which compares text snippets and social cards. They randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Google. The search engine results were then evaluated and reformatted into either text snippets or social cards. Two groups of 12 students each were given the queries either classified by their functionalities or semantic categories. The participants were evaluated based on the number of clicks of relevant results and also on the amount of time they took with each search. Social cards were the clear winner over text snippets in terms of time and clicks.

Loumakis, in "This Image Smells Good: Effects of Image Information Scent in Search Engine Results Pages" (university copy) attempted to compare the performance of images, text snippets, and social cards. Using preselected queries and 81 participants, Loumakis also reformatted Google search results. He did not get the same level of performance in his study, noting that "Adding an image to a SERP result will not significantly help users in identifying correct results, but neither will it significantly hinder them if an image is placed with text cues where the scents may conflict."

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali explored the use of different image augmentations for visual snippets, text + thumbnail, social card, text + visual snippet, and a text + tag cloud/thumbnail combination. Al Maqbali had 65 participants evaluate the relevance of search engine result pages as in the prior studies. This study reached the same conclusion as Loumakis: adding images to text snippets does not appear to make a difference to the performance of search engine users.

To further understand the disagreement between the results of Loumakis, Al Maqbali, and Li, in "Augmenting web search surrogates with images", Capra explored the effectiveness of text snippets and social cards. He wanted to determine if the quality or relevance of the image used in the social card had any effect on performance. Prior to any relevance study, he had one set of participants rate individual internal images for a social card as good, bad, or mixed. For individual surrogates, Capra discovered that text snippets with good images have a slightly higher, statistically significant accuracy score than text snippets alone, at the cost of judgement duration for each surrogate. The accuracy for text snippets was 0.864, the accuracy for social cards with bad images was also 0.864, and the accuracy for social cards with good images was 0.884. If the search engine result pages were evaluated overall, then there was evidence that good images showed improvement in accuracy with ambiguous queries (e.g., jaguar the car or the cat?), but in this case the improvements were not statistically significant.

Deciding on the best surrogate for use with web pages depends on a number of factors, and the studies comparing these surrogates have some disagreement. Text snippets continue to endure for search results, likely due to Capra's, Al Maqbali's, and Loumakis' results. Social cards are preferred by users, but the minor improvement in search time and relevance accuracy does not warrant the effort necessary to select a good internal image for the card. This means that social cards are effectively relegated to use in social media, where each can be generated individually rather than alongside hundreds of search results. This also means that thumbnails are relegated to other tasks, such as serving as a surrogate for a file on a filesystem or within a browser's interface. As most of these studies focused primarily on search engine results, it is likely that many of these surrogates work better with other use cases.

Surrogates for Mementos


There are more uses for surrogates than search engine results. When grouped together, some surrogates provide more information than the answer to the question "should I click on this?"

Enhanced thumbnails often reflect the search terms of the query provided by the user. Most memento applications do not have a query, and hence there are no words or phrases to enhance within the thumbnail. Al Maqbali's tag cloud concept may be of interest here. I am examining other ways to expose words and phrases of interest from archived collections, so this surrogate type may find new life in mementos.

Internal images are often used as part of social cards. If one could expose the images that tie to a particular theme in a web archive collection, then it is possible that we could select images for use as memento surrogates within the theme of the collection. This would likely require some form of machine learning to be viable. This same process goes for visual snippets.

As noted above, external images are problematic surrogates for mementos due to the temporal nature of words. If we could divide a web archive into specific time periods, then external images could be extracted from pages around the same time, limiting the amount of temporal shift.

Thumbnails are often useful in groups to demonstrate the content drift of a single web resource. For this surrogate group to be useful, the consumer of such a thumbnail gallery needs to understand the direction that time flows in the visualization. Thumbnails are not limited to the "one-per-row" paradigm of landscape social cards or text snippets, and hence thumbnails can be presented in a grid formation. This can be confusing to the user trying to compare the content drift of a resource, but textual cues, such as the memento-datetime, placed above or below the thumbnail can clear up this confusion.

Storytelling often uses surrogates in the form of social cards to tell a story. In this case, the surrogates are visualizations of the underlying web pages. When provided as a series of social cards, one per row, in order of publication date or memento-datetime, collections of these surrogates can convey information about an unfolding news story, such as in AlNoamany's collection summarization work (preprint version, dissertation). Many mementos do not have the metadata that might assist in finding a good internal image. This means that any service providing social cards for mementos must instead rely upon a number of image selection algorithms with differing levels of success. Because text snippets are essentially social cards lacking an image, is it possible that they, too, would be suitable in this context?

Conclusion


I started on this journey looking for the best surrogate for use with mementos. I discovered many different surrogates for web resources. The studies evaluating these different surrogates focused on the success of users finding relevant information in search engine results. It appears that the search engine industry has largely focused on text snippets, as they are the least expensive surrogate to produce and studies indicate that the addition of images has minimal impact on their effectiveness. Mementos have many different uses, and it is possible that one or more of these surrogates may be a better fit for their temporal nature. Now that I am developing a vocabulary for these surrogates, I can start to explore how they might best be used with mementos, bringing other useful visualizations to web archive collections.

-- Shawn M. Jones

Monday, April 23, 2018

2018-04-23: "Grampa, what's a deleted tweet?"


In early February 2018, Breitbart News made a splash with its inflammatory tweet suggesting Muslims will end Super Bowl, which they deleted twelve hours later, stating that it did not meet their editorial standards. The deleted tweet had an imaginary conversation between a Muslim child and a grandparent about the Super Bowl and linked to an article on the declining TV ratings of the National Football League (NFL) for the annual championship game. News articles from The Hill, Huffington Post, Politico, Independent, etc., discussed the deleted tweet controversy in detail.

Being web archiving researchers, we decided to look into the deleted tweet incident of Breitbart News to shed some light on the pattern of their deleted tweets over recent months.

Role of web archives in finding deleted tweets   


Hany M. SalahEdeen and Michael L. Nelson in their paper, "Losing my revolution: How many resources shared on social media have been lost?",  talk about the amount of resources shared in social media that is still live or present in the public web archives. They concluded that nearly 11% of the shared resources are lost in their first year and after that we lose the shared resources at a rate of 0.02% per day.
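
To make these two figures concrete, here is a small worked calculation in Python. It assumes the simplest possible reading of the numbers quoted above: a flat 11% loss at the one-year mark and a constant 0.02% per day afterwards. The function name and the linear model are assumptions of this sketch, not something taken from the paper.

```python
FIRST_YEAR_LOSS_PERCENT = 11.0   # "nearly 11%" lost in the first year
DAILY_LOSS_PERCENT = 0.02        # lost per day after the first year


def estimated_loss_percent(days_since_shared):
    """Estimate the percentage of shared resources lost after a number of days."""
    if days_since_shared < 365:
        # The quoted figures say nothing about how loss accrues within year one
        raise ValueError("this sketch only covers resources shared at least one year ago")
    extra_days = days_since_shared - 365
    # Cap at 100% because the linear model would otherwise grow without bound
    return min(100.0, FIRST_YEAR_LOSS_PERCENT + DAILY_LOSS_PERCENT * extra_days)
```

Under this reading, resources shared two years ago (730 days) would have lost roughly 11 + 0.02 × 365 = 18.3% of their number.
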

Web archives such as Internet Archive, Archive-It, UK Web Archives, etc., have an important role in the preservation of resources shared in social media. Using web archives, sometimes we can recover deleted tweets. For example, Miranda Smith in her blog post, "Twitter Follower Count History via Internet Archive", talks about using Internet Archive to fetch historical Twitter data to graph follower count over time. She also explains the advantages of using web archives over the Twitter API for finding historical data of users.

The only caveat in using web archives to uncover deleted tweets is their limited coverage of Twitter. But for popular Twitter accounts with a high number of mementos, such as RealDonaldTrump, Barack Obama, BreitbartNews, CNN, etc., we can often uncover deleted tweets. The issue of "How Much of the Web Is Archived?" has been discussed by Ainsworth et al., but there has been no separate analysis of how much of Twitter is archived, which would help us estimate the accuracy of finding deleted tweets using web archives.

Web services like Politwoops track deleted tweets of public officials, including people currently in office and candidates for office in the USA and some EU nations. However, tweets deleted before a person becomes a candidate or after a person leaves office will not be covered. Although Politwoops tracks elected officials, it misses appointed government officials like Michael Flynn. For these Twitter accounts, web archives are the lone solution for finding their deleted tweets. The most important reason not to rely on these web services alone to find deleted tweets is that they can be banned by Twitter. This happened in June 2015, with Twitter citing a violation of the developer agreement. It took Politwoops six months to resume its services, in December 2015. Such bans suggest that we should explore web archives to uncover deleted tweets in case services like Politwoops are banned again.

Why are deleted tweets important?


With the surge in the usage of social media sites like Twitter, Facebook, etc., researchers have been using social media sites to study patterns of online user behaviour. In the context of Twitter, deleted tweets play an important role in understanding users' behavioural patterns. In the paper "An Examination of Regret in Bullying Tweets", Xu et al. built an SVM-based classifier to predict which bullying-related tweets their authors would later regret and delete. Petrovic et al., in their paper "I Wish I Didn’t Say That! Analyzing and Predicting Deleted Messages in Twitter", discuss the reasons for deleted tweets and use a machine learning approach to predict them. They concluded that tweets with swear words have a higher probability of being deleted. Zhou et al., in their papers "Tweet Properly: Analyzing Deleted Tweets to Understand and Identify Regrettable Ones" and "Identifying Regrettable Messages from Tweets", mention that the impact of published tweets cannot be undone by deletion, as other users may have noticed and cached the tweets before they are deleted.


How were deleted tweets found?


To begin our analysis, we used the Twitter API to fetch the most recent 3200 tweets from Breitbart News' Twitter timeline. The live tweets fetched from the Twitter API spanned from 2017-10-22 to 2018-02-18. We then retrieved the TimeMap for Breitbart's Twitter page using MemGator, the Memento aggregator service built by Sawood Alam. Using the URI-Ms from the fetched TimeMap, we collected mementos for Breitbart's Twitter page within the time range of the live tweets fetched using the Twitter API.

Code to fetch recent tweets using Python-Twitter API

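import twitter

# Twitter handle whose timeline is fetched (placeholder)
screen_name = '<screen-name>'

# Credentials come from a registered Twitter application
api = twitter.Api(consumer_key='xxxxxx',
                  consumer_secret='xxxxxx',
                  access_token_key='xxxxxx',
                  access_token_secret='xxxxxx',
                  sleep_on_rate_limit=True)
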
twitter_response = api.GetUserTimeline(screen_name=screen_name, count=200, include_rts=True)
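
A single timeline request returns at most 200 tweets, so reaching the 3200 most recent tweets requires paging backwards through the timeline. The following is a minimal sketch of that paging loop, not the exact code used for this post. The `fetch_page` callable is a hypothetical wrapper; with Python-Twitter it could be `lambda count, max_id: api.GetUserTimeline(screen_name=screen_name, count=count, max_id=max_id, include_rts=True)`.

```python
def fetch_timeline(fetch_page, max_tweets=3200, page_size=200):
    """Collect up to max_tweets tweets by walking backwards through a timeline.

    fetch_page(count, max_id) is a hypothetical callable that returns a list
    of tweet objects, newest first, each with an integer .id attribute.
    """
    tweets = []
    max_id = None
    while len(tweets) < max_tweets:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            # No older tweets are available.
            break
        tweets.extend(page)
        # Next request: only tweets strictly older than the oldest seen so far.
        max_id = min(t.id for t in page) - 1
    return tweets[:max_tweets]
```

Passing the fetch function in as a parameter keeps the loop independent of the API client, so it can be exercised with canned data before spending rate-limited requests.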

Shell command to run Memgator locally 

$ memgator --contimeout=10s --agent=XXXXXX server 
MemGator 1.0-rc7
   _____                  _______       __
  /     \  _____  _____  / _____/______/  |___________
 /  Y Y  \/  __ \/     \/  \  ___\__  \   _/ _ \_   _ \
/   | |   \  ___/  Y Y  \   \_\  \/ __ |  | |_| |  | \/
\__/___\__/\____\__|_|__/\_______/_____|__|\___/|__|

TimeMap   : http://localhost:1208/timemap/{FORMAT}/{URI-R}
TimeGate  : http://localhost:1208/timegate/{URI-R} [Accept-Datetime]
Memento   : http://localhost:1208/memento[/{FORMAT}|proxy]/{DATETIME}/{URI-R}

# FORMAT          => link|json|cdxj
# DATETIME        => YYYY[MM[DD[hh[mm[ss]]]]]
# Accept-Datetime => Header in RFC1123 format

Code to fetch TimeMap for any twitter handle

import requests

# Memgator must be running locally (see the shell command above)
url = "http://localhost:1208/timemap/"
data_format = "cdxj"
command = url + data_format + "/http://twitter.com/<screen-name>"
response = requests.get(command)
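
Once the TimeMap is fetched, the mementos that fall inside the time range of the live tweets have to be selected. The sketch below shows one way to do that. It assumes each CDXJ entry is a 14-digit datetime followed by a space and a JSON object with a "uri" key, and that metadata lines start with "!"; the function name and the sample boundaries are illustrative only.

```python
import json


def mementos_in_range(cdxj_text, start, end):
    """Return (datetime, URI-M) pairs whose 14-digit datetime lies in [start, end].

    start and end are YYYYMMDDhhmmss strings, so plain string comparison
    orders them chronologically.
    """
    selected = []
    for line in cdxj_text.splitlines():
        line = line.strip()
        # Skip blank lines and metadata lines (assumed to start with "!").
        if not line or line.startswith("!"):
            continue
        timestamp, _, payload = line.partition(" ")
        if start <= timestamp <= end:
            selected.append((timestamp, json.loads(payload)["uri"]))
    return selected
```

For the Breitbart News timeline this could be called as `mementos_in_range(response.text, "20171022000000", "20180218235959")`.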

We parsed tweets and their tweet ids from each memento and compared each archived tweet id with the live tweet ids fetched using the Twitter API. For every tweet id that was present in the web archives but missing from the live timeline, we queried the Twitter API for its status to confirm that the tweet was deleted. On comparing the live and archived versions of tweets, we discovered 22 deleted tweets from Breitbart News.

Code to parse tweets, their timestamps and tweet ids from mementos


import bs4

soup = bs4.BeautifulSoup(open(<HTML representation of Memento>),"html.parser")
match_tweet_div_tag = soup.select('div.js-stream-tweet')
for tag in match_tweet_div_tag:
   if tag.has_attr("data-tweet-id"):
       # Get Tweet id
       ...........
       # Parse tweets
       match_timeline_tweets = tag.select('p.js-tweet-text.tweet-text')
       ...........
       # Parse tweet timestamps
       match_tweet_timestamp = tag.find("span", {"class": "js-short-timestamp"})
       ...........
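
The comparison step can be sketched as a set difference between the archived and the live tweet ids. This is an illustrative sketch rather than the code used for the analysis. It assumes tweet ids increase over time, so the smallest and largest live ids approximate the time range covered by the API response. The ids it returns are only candidates and still need to be confirmed against the Twitter API.

```python
def find_deletion_candidates(archived_ids, live_ids):
    """Return archived tweet ids that are missing from the live timeline.

    Only ids between the oldest and newest live id are considered, since
    archived tweets outside that window cannot be compared.
    """
    live = {int(i) for i in live_ids}
    if not live:
        return []
    oldest, newest = min(live), max(live)
    archived = {int(i) for i in archived_ids}
    return sorted(i for i in archived if oldest <= i <= newest and i not in live)
```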

Analysis of Deleted Tweets from Breitbart News


The most prominent of the 22 deleted tweets was the Super Bowl tweet mentioned above. For people who are unaware of the role of web archives: taking screenshots out of fear that something might be lost in the future is smart, but it is even better to push the page to web archives, where it will be preserved longer than in someone's private archive. For further information, refer to Plinio Vargas's blog post "Links to Web Archives, not Search Engine Caches", where he discusses the difference between archived pages and search engine caches in terms of the decay period of web pages.

Fig 1 - Super Bowl tweet on Internet Archive
Tweet Memento at Internet Archive
There is another tweet which was initially tweeted by Allum Bokhari, a senior Breitbart correspondent, and retweeted by Breitbart News but was later un-retweeted. The original tweet from Allum Bokhari is still present on the live web, but the retweet is missing. A plausible reason is that Breitbart News later retweeted a similar post from Allum Bokhari.
Undo retweet of Breitbart News
Fig 2 - Archived version of unretweeted tweet by Breitbart News
Tweet memento at the Internet Archive

Fig 3 - Live version of unretweeted tweet by Breitbart News
Live Tweet Status
Of the 22 deleted tweets, 20 were of the form where Breitbart News retweeted someone's tweet but the original tweet was lost. Of those 20 tweets, 18 were from two affiliates of Breitbart News, NolteNC and John Carney. Therefore, we decided to have a look at both the accounts to determine the reason for their deleted tweets.

Analysis of deleted tweets from John Carney and NolteNC


We fetched live tweets for John Carney using the Twitter API, then fetched the TimeMap for John Carney's Twitter page using Memgator and collected the mementos within the time range of the live tweets. Due to the low number of mementos within that time range, the analysis showed no deleted tweets. We then fetched live tweets for John Carney from the Twitter API every day for a week and found deleted tweets by comparing each response with all the previous responses. We discovered that tweets older than seven days are automatically deleted on Tuesday and Saturday. The regularity of the deletions suggests the use of an automated tweet deletion service. There are a number of such services, like Twitter Deleter, Tweet Eraser, etc., which delete tweets based on the lifespan of the tweet or on the number of tweets allowed in the Twitter timeline at any given instance.
Fig 4 - John Carney's tweet deletion pattern shown with 50 tweet ids
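
The idea of comparing successive Twitter API responses to find a deletion schedule can be sketched as follows. This is a hedged illustration, not the script used for the analysis. The `snapshots` mapping (collection date to the set of tweet ids returned that day) is a hypothetical data layout, and the sketch assumes every snapshot covers the same portion of the timeline, so an id that disappears was deleted rather than pushed out of the API's window.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def deletions_by_weekday(snapshots):
    """Count tweets that disappear between consecutive snapshots, per weekday.

    snapshots maps a datetime.date to the set of tweet ids seen on that day.
    A deletion is attributed to the day of the snapshot in which the id is
    first missing.
    """
    days = sorted(snapshots)
    counts = {}
    for previous, current in zip(days, days[1:]):
        deleted = snapshots[previous] - snapshots[current]
        if deleted:
            name = DAY_NAMES[current.weekday()]
            counts[name] = counts.get(name, 0) + len(deleted)
    return counts
```

If deletions cluster on the same one or two weekdays across several weeks, manual deletion becomes an unlikely explanation.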
We fetched live tweets for NolteNC using the Twitter API and then fetched the TimeMap for NolteNC's Twitter page using Memgator and mementos within the time range of live tweets fetched using the Twitter API. As for NolteNC, we had a considerable number of mementos within the specified time range to discover his deleted tweets. Our analysis provided us with 169 live tweets and 3569 deleted tweets from 2017-11-03 to 2018-02-17.
Fig 5 - NolteNC's original tweet


Fig 6 - Breitbart News retweeting NolteNC's tweet.
With thousands of deleted tweets, it seemed unlikely that he was deleting tweets manually. We had every reason to believe that, like John Carney, NolteNC deleted tweets automatically using a tweet deletion service. We collected live tweets for his account over a week and compared each Twitter API response with all the previous ones. We concluded that all of his tweets that were more than seven days old on Wednesday and Saturday were deleted.
Fig 7 - NolteNC's tweet deletion pattern shown with 50 tweets 

Conclusions

  1. It is not enough to take screenshots of controversial tweets. Web content that we wish to preserve, fearing its loss, should be pushed to web archives, which retain it longer than our personal archives.
  2. For finding deleted tweets, web archives work effectively for popular accounts because they are archived often, but this approach will not work for less popular accounts with fewer mementos.
  3. Although Breitbart News does not delete tweets often, some of its correspondents automatically delete their tweets, effectively deleting the corresponding retweets.
--
Mohammed Nauman Siddique (@m_nsiddique)

Friday, April 13, 2018

2018-04-13: Web Archives are Used for Link Stability, Censorship Avoidance, and Traffic Siphoning

ISIS members immolating captured Jordanian pilot
Web archives have been used for purposes other than digital preservation and browsing historical data. These purposes can be divided into three categories:

  1. Uploading content to web archives to ensure continuous availability of the data.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of direct links, for news sites with opposing ideologies to avoid increasing their web traffic and deprive them of ad revenue.

1. Uploading content to web archives to ensure continuous availability of the data


Web archives, by design, are intended to solve the problem of digital data preservation so people can access data when it is no longer available on the live web. In the paper Who and What Links to the Internet Archive (Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, and Michael L. Nelson, 2013), the authors show that 65% of the requested archived pages no longer exist on the live web. The paper also determines where the users of the Internet Archive's Wayback Machine come from. The following table, from the paper, contains the top 10 referrers that link to IA’s Wayback Machine. These top 10 referrers represent 51.9% of all the referrers. en.wikipedia.org outnumbers all other sites, including search engines and the home page of the Internet Archive (archive.org).
The top 10 referrers that link to IA’s Wayback Machine
Who and What Links to the Internet Archive, (AlNoamany et al. 2013) Table 5

Sometimes the archived data is controversial and the user wants to make sure that he or she can refer back to it later in case it is removed from the live web. A clear example of that is the deleted tweets from U.S. president Donald Trump.
Mr. Trump's deleted tweets on politwoops.eu


2. Avoiding governments' censorship or websites' terms of service


Using the Internet Archive to find a way around terms of service for file sharing sites was addressed by Justin Littman in a blog post, Islamic State Extremists Are Using the Internet Archive to Deliver Propaganda. He stated that ISIS sympathizers are using the Internet Archive as a web delivery platform for extremist propaganda, posing a threat to the archival mission of the Internet Archive. Mr. Littman did not evaluate the content to determine if it is extremist in nature since much of it is in Arabic. This behavior is not new. It was noted with some of the data uploaded by Al-Qaeda sympathizers long before ISIS was created. Al-Qaeda uploaded this file https://archive.org/details/osamatoobama to the Internet Archive on February 16, 2010 to circumvent file sharing sites' content removal policies. ISIS sympathizers upload clips documenting battles, executions, or even video announcements by ISIS leaders to the Internet Archive because video sharing sites like Youtube automatically remove that type of content to prevent the spread of extremist propaganda.

On February 4th of 2015, ISIS uploaded a video to the Internet Archive featuring the execution by immolation of captured Jordanian pilot Muath Al-Kasasbeh; that's only one day after the execution! This video violates Youtube's terms of service and is no longer on Youtube.
https://archive.org/details/YouTube_201502
ISIS members immolating captured Jordanian pilot (graphic video)
In fact, Youtube's algorithm is so aggressive that it removed thousands of videos documenting the Syrian revolution. Activists argued that the removed videos were uploaded for the purpose of documenting atrocities during the Syrian government's crackdown, and that Youtube killed any possible hope for future war crimes prosecutions.

Hani Al-Sibai, a lawyer, Islamic scholar, Al-Qaeda sympathizer, and a former member of The Egyptian Islamic Jihad Group who lives in London as a political refugee, uploads his content to the Internet Archive. Although he is anti-ISIS, his content more often than not does not encourage violence, and he has had only a few issues with Youtube, he pushes his content to multiple sites on the web, including web archiving sites, to ensure continuous availability of his data.

For example, this is an audio recording from Hani Al-Sibai condemning the immolation of the Jordanian pilot, Muath Al-Kasasbeh. Mr. Al-Sibai uploaded this recording to the Internet Archive a day after the execution.
https://archive.org/details/7arqTayyar
An audio recording by Hani Al-Sibai condemning the execution by burning (uploaded to IA a day after the execution)

These are some examples where the Internet Archive is used as a file sharing service. Clips are simultaneously uploaded to Youtube, Vimeo, and the Internet Archive for the purpose of sharing.
Screenshot from justpaste.it where links to videos uploaded to IA are used for sharing purposes
Both videos shown in the screen shot were removed from Youtube for violating terms of service, but they are not lost because they have been uploaded to the Internet Archive.

https://www.youtube.com/watch?v=Cznm0L5X9LE
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (removed from Youtube)

https://archive.org/details/Fajr3_201407
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (uploaded to IA)

https://www.youtube.com/watch?v=VuSgxhBtoic
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (removed from Youtube)

https://archive.org/details/Ta3liq_Hadi
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to IA)
The same video was not removed from Vimeo
https://vimeo.com/111975796
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to Vimeo)
I am not sure if web archiving sites have content moderation policies, but even with sharing sites that do, they are inconsistent! Youtube is a perfect example; no one knows what YouTube's rules even are anymore.

Less popular uses of the Internet Archive include browsing the live web using Internet Archive links to bypass governments' censorship. Sometimes, governments block sites with opposing ideologies, but their archived versions remain accessible. When these governments realize that their censorship is being evaded, they block the Internet Archive entirely to prevent access to the same content they blocked on the live web. In 2017, the IA’s Wayback Machine was blocked in India, and in 2015, Russia blocked the Internet Archive over a single page!

3. Using URLs from web archives instead of direct links for news sites with opposing ideologies to deprive them of ad revenue

Even when the live web version is not blocked, there are situations where readers want to deny traffic and the resulting ad revenue to web sites with opposing ideologies. In a recent paper, Understanding Web Archiving Services and Their (Mis)Use on Social Media (Savvas Zannettou, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini, 2018), the authors presented a large-scale analysis of Web archiving services and their use on social networks, the archived content, and how it is shared/used. They found that contentious news and social media posts are the most common types of content archived. Also, URLs from web archiving sites are widely posted on “fringe” groups in Reddit and 4chan to preserve controversial data that might disappear; this case also falls under the first category. Furthermore, the authors found evidence of groups' admins forcing members to use URLs from web archives instead of direct links to sites with opposing ideologies, in order to refer to them without increasing their traffic and to deprive them of ad revenue. For instance, The_Donald subreddit systematically targets the ad revenue of news sources with adverse ideologies using moderation bots that block URLs from those sites and prompt users to post archive URLs instead.
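
The behavior of such a moderation bot can be illustrated with a short sketch. This is not the code of any actual bot. The blocked domain in the usage example is made up, and pointing to the Wayback Machine by prefixing the link with https://web.archive.org/web/ is an assumption of this sketch (the communities studied in the paper may rely on other archives).

```python
import re
from urllib.parse import urlparse

WAYBACK_PREFIX = "https://web.archive.org/web/"


def suggest_archive_links(text, blocked_domains):
    """Map each link to a blocked domain onto an archive URL to post instead."""
    suggestions = {}
    for raw in re.findall(r"https?://\S+", text):
        # Drop punctuation that commonly trails a link in running text.
        url = raw.rstrip(".,;:!?)")
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in blocked_domains):
            suggestions[url] = WAYBACK_PREFIX + url
    return suggestions
```

For example, `suggest_archive_links("See https://www.news.example/story.", {"news.example"})` maps the link to `https://web.archive.org/web/https://www.news.example/story`, so readers reach the content without visiting the original site.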

The authors also found that web archives are used to evade censorship policies in some communities: for example, /pol/ users post archive.is URLs to share content from 8chan and Facebook, which are banned on the platform, or to dodge word-filters (e.g., ‘smh’ becomes ‘baka’, so links to smh.com.au point to baka.com.au instead).
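
To see why a word filter breaks direct links, consider this minimal sketch. It assumes the filter is a plain case-insensitive substring replacement applied to the whole post, which is a simplification; the archive link in the usage note is a made-up placeholder.

```python
import re


def apply_word_filter(text, replacements):
    """Replace every case-insensitive occurrence of each filtered word."""
    for word, substitute in replacements.items():
        # A function replacement avoids treating the substitute as a regex template.
        text = re.sub(re.escape(word), lambda match, s=substitute: s,
                      text, flags=re.IGNORECASE)
    return text
```

With `{"smh": "baka"}`, the text `http://www.smh.com.au/news` becomes `http://www.baka.com.au/news`, while a link such as `http://archive.example/AbCdE` passes through unchanged because it does not contain the filtered word.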

According to the authors, Reddit bots are responsible for posting a huge portion of archive URLs in Reddit due to moderators trying to ensure the availability of the data, but this practice affects the amount of traffic that the source sites would have received from Reddit.

I went on 4chan to look for a few examples similar to those examined in the paper. Despite not knowing what 4chan was prior to reading the paper, I was able to find a couple of examples of sharing archived links on 4chan in just under 2 minutes. I took screenshots of both examples; the threads have since been deleted because 4chan removes threads after they reach page 10.

Pages are archived on archive.is then shared on 4chan
Sharing links to archive.org in a comment on 4chan

The takeaway message is that web archives have been used for purposes other than digital preservation and browsing historical data. These purposes include:
  1. Uploading content to web archives to mitigate the risk of data loss.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of original source links, for news sites with opposing ideologies to deprive them of ad revenue.
--
Hussam Hallak