Monday, April 23, 2018

2018-04-23: "Grampa, what's a deleted tweet?"


In early February 2018, Breitbart News made a splash with an inflammatory tweet suggesting that Muslims will end the Super Bowl, which it deleted twelve hours later, stating that the tweet did not meet its editorial standards. The deleted tweet presented an imaginary conversation between a Muslim child and a grandparent about the Super Bowl and linked to one of its articles on the declining TV ratings of the National Football League (NFL) annual championship game. News articles from The Hill, Huffington Post, Politico, Independent, etc., discussed the deleted tweet controversy in detail.

As web archiving researchers, we decided to look into the Breitbart News deleted tweet incident to shed some light on the pattern of its deleted tweets over recent months.

Role of web archives in finding deleted tweets   


Hany M. SalahEdeen and Michael L. Nelson, in their paper "Losing my revolution: How many resources shared on social media have been lost?", examine how many of the resources shared on social media are still live or present in public web archives. They concluded that nearly 11% of shared resources are lost in the first year and that, after the first year, shared resources continue to be lost at a rate of 0.02% per day.
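
To make these loss figures concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the 0.02% daily loss accumulates linearly on top of the 11% first-year loss; the function name and that linear reading are our assumptions, not a model taken from the paper.

```python
def estimated_fraction_lost(days_since_shared):
    """Rough estimate of the fraction of shared resources lost after N days.

    Assumption: 11% is lost by the end of the first year, and a further
    0.02% is lost on each day after that (simple linear accumulation).
    """
    first_year_loss = 0.11  # nearly 11% lost in the first year
    daily_loss = 0.0002     # 0.02% lost per day afterwards
    if days_since_shared < 365:
        raise ValueError("this sketch only covers resources at least one year old")
    extra_days = days_since_shared - 365
    return min(1.0, first_year_loss + daily_loss * extra_days)


# Two years after sharing: 0.11 + 0.0002 * 365
print(round(estimated_fraction_lost(730), 3))
```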

Web archives such as the Internet Archive, Archive-It, UK Web Archive, etc., play an important role in preserving resources shared on social media. Using web archives, we can sometimes recover deleted tweets. For example, Miranda Smith, in her blog post "Twitter Follower Count History via Internet Archive", describes using the Internet Archive to fetch historical Twitter data and graph follower counts over time. She also explains the advantages of using web archives over the Twitter API for finding historical data about users.

The main caveat in using web archives to uncover deleted tweets is their limited coverage of Twitter. But for popular Twitter accounts with a high number of mementos, such as RealDonaldTrump, Barack Obama, BreitbartNews, CNN, etc., we can often uncover deleted tweets. The question "How Much of the Web Is Archived?" has been discussed by Ainsworth et al., but there has been no separate analysis of how much of Twitter is archived, which would help us estimate the accuracy of finding deleted tweets using web archives.

Web services like Politwoops track deleted tweets of public officials, including people currently in office and candidates for office in the USA and some EU nations. However, tweets deleted before a person becomes a candidate or after a person has left office are not covered. Although Politwoops tracks elected officials, it misses appointed government officials like Michael Flynn. For these Twitter accounts, web archives are the only way to find deleted tweets. The most important reason not to rely solely on such web services is that Twitter can ban them. This happened in June 2015, when Twitter cited a violation of its developer agreement, and it took Politwoops six months to resume its services in December 2015. This ban suggests that we should explore web archives as a way to uncover deleted tweets in case services like Politwoops are banned again.

Why are deleted tweets important?


With the surge in the usage of social media sites like Twitter, Facebook, etc., researchers have been using them to study patterns of online user behaviour. In the context of Twitter, deleted tweets play an important role in understanding users' behavioural patterns. In the paper "An Examination of Regret in Bullying Tweets", Xu et al. built an SVM-based classifier to predict which bullying-related tweets their authors would later regret and delete. Petrovic et al., in their paper "I Wish I Didn’t Say That! Analyzing and Predicting Deleted Messages in Twitter", discuss the reasons for deleting tweets and use a machine learning approach to predict deletions. They concluded that tweets with swear words have a higher probability of being deleted. Zhou et al., in their papers "Tweet Properly: Analyzing Deleted Tweets to Understand and Identify Regrettable Ones" and "Identifying Regrettable Messages from Tweets", note that the impact of a published tweet cannot be undone by deleting it, because other users may have noticed and cached the tweet before it was deleted.


How were deleted tweets found?


To begin our analysis, we used the Twitter API to fetch the most recent 3200 tweets from Breitbart News' Twitter timeline. The live tweets fetched from the Twitter API spanned 2017-10-22 to 2018-02-18. Next, we fetched the TimeMap for Breitbart's Twitter page using MemGator, the Memento aggregator service built by Sawood Alam. Using the URI-Ms from the fetched TimeMap, we collected the mementos of Breitbart's Twitter page that fell within the time range of the live tweets fetched using the Twitter API.

Code to fetch recent tweets using Python-Twitter API

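import twitter

screen_name = 'BreitbartNews'  # Twitter handle whose timeline we want

# Authenticate; sleep_on_rate_limit makes the client wait when rate limited
api = twitter.Api(consumer_key='xxxxxx',
                  consumer_secret='xxxxxx',
                  access_token_key='xxxxxx',
                  access_token_secret='xxxxxx',
                  sleep_on_rate_limit=True)

# Fetch up to 200 of the most recent tweets in one request, including retweets
twitter_response = api.GetUserTimeline(screen_name=screen_name, count=200, include_rts=True)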
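
A single request returns at most 200 tweets, so reaching the most recent 3200 tweets means paging backwards through the timeline. The sketch below shows one way this could be done; it is an illustration, not necessarily the exact script behind the analysis. The names `collect_timeline` and `fetch_page` are hypothetical, and `fetch_page` is assumed to return tweets newest first, each with an integer `id` attribute.

```python
def collect_timeline(fetch_page, limit=3200):
    """Page backwards through a timeline until `limit` tweets or no more pages.

    `fetch_page(max_id)` is a hypothetical callable returning a list of
    tweets (newest first) whose ids are no greater than `max_id`, or the
    newest tweets when `max_id` is None.
    """
    tweets = []
    max_id = None
    while len(tweets) < limit:
        page = fetch_page(max_id)
        if not page:
            break  # reached the oldest tweet the API will return
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(tweet.id for tweet in page) - 1
    return tweets[:limit]
```

With python-twitter, `fetch_page` could be a small wrapper such as `lambda max_id: api.GetUserTimeline(screen_name=screen_name, count=200, include_rts=True, max_id=max_id)`.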

Shell command to run Memgator locally 

$ memgator --contimeout=10s --agent=XXXXXX server 
MemGator 1.0-rc7
   _____                  _______       __
  /     \  _____  _____  / _____/______/  |___________
 /  Y Y  \/  __ \/     \/  \  ___\__  \   _/ _ \_   _ \
/   | |   \  ___/  Y Y  \   \_\  \/ __ |  | |_| |  | \/
\__/___\__/\____\__|_|__/\_______/_____|__|\___/|__|

TimeMap   : http://localhost:1208/timemap/{FORMAT}/{URI-R}
TimeGate  : http://localhost:1208/timegate/{URI-R} [Accept-Datetime]
Memento   : http://localhost:1208/memento[/{FORMAT}|proxy]/{DATETIME}/{URI-R}

# FORMAT          => link|json|cdxj
# DATETIME        => YYYY[MM[DD[hh[mm[ss]]]]]
# Accept-Datetime => Header in RFC1123 format

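Code to fetch TimeMap for any Twitter handle

import requests

url = "http://localhost:1208/timemap/"
data_format = "cdxj"
screen_name = "<screen-name>"  # Twitter handle to look up
# Build the TimeMap request for the handle's Twitter page
command = url + data_format + "/http://twitter.com/" + screen_name
response = requests.get(command)
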
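
Once the TimeMap is fetched, the URI-Ms that fall inside the time range of the live tweets have to be picked out. The following is a minimal sketch of that filtering step. It assumes that each memento line of the CDXJ TimeMap starts with a 14-digit datetime followed by a JSON object containing a "uri" key, and that metadata lines start with "!"; the function name is hypothetical.

```python
import json


def urims_in_range(cdxj_text, start, end):
    """Return URI-Ms whose 14-digit memento datetime lies in [start, end].

    Assumes memento lines look like '<YYYYMMDDhhmmss> {"uri": "...", ...}'
    and metadata lines start with '!'. `start` and `end` are
    'YYYYMMDDhhmmss' strings.
    """
    urims = []
    for line in cdxj_text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and TimeMap metadata
        timestamp, _, payload = line.partition(" ")
        # Fixed-width digit strings compare correctly as plain strings.
        if start <= timestamp <= end:
            urims.append(json.loads(payload)["uri"])
    return urims
```

For the Breitbart News timeline, a call might look like `urims_in_range(response.text, "20171022000000", "20180218235959")`.
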
We parsed tweets and their tweet ids from each memento and compared each archived tweet id with the live tweet ids fetched using the Twitter API. For every tweet id that was present in the web archives but missing from the live timeline, we then queried the Twitter API for its status to confirm that the tweet had been deleted. On comparing the live and archived versions of tweets, we discovered 22 deleted tweets from Breitbart News.
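
The comparison step can be sketched as a set difference. This is a minimal illustration that assumes tweet ids are integers that grow over time, so archived ids older than the oldest live tweet are ignored because they fall outside the window the API returned. The function name is hypothetical, and every id it returns is only a candidate that still needs to be confirmed against the Twitter API.

```python
def deletion_candidates(archived_ids, live_ids):
    """Return archived tweet ids that are missing from the live timeline.

    Only ids at least as new as the oldest live tweet are considered,
    because the absence of older tweets says nothing about deletion.
    """
    live = set(live_ids)
    if not live:
        return []
    oldest_live = min(live)
    return sorted(
        tweet_id
        for tweet_id in set(archived_ids)
        if tweet_id >= oldest_live and tweet_id not in live
    )


# Toy ids: 12 is archived, inside the live window, and missing from it.
print(deletion_candidates([5, 10, 12, 15, 20], [10, 15, 20]))
```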

Code to parse tweets, their timestamps and tweet ids from mementos


import bs4

memento_path = "<HTML representation of Memento>"  # path to a saved memento
with open(memento_path) as memento_file:
    soup = bs4.BeautifulSoup(memento_file, "html.parser")

# Each tweet on the archived timeline is wrapped in a div.js-stream-tweet
match_tweet_div_tag = soup.select('div.js-stream-tweet')
for tag in match_tweet_div_tag:
    if tag.has_attr("data-tweet-id"):
        # Get Tweet id
        ...
        # Parse tweets
        match_timeline_tweets = tag.select('p.js-tweet-text.tweet-text')
        ...
        # Parse tweet timestamps
        match_tweet_timestamp = tag.find("span", {"class": "js-short-timestamp"})
        ...

Analysis of Deleted Tweets from Breitbart News


The most prominent of the 22 deleted tweets was the Super Bowl tweet mentioned above. For readers who are unaware of the role of web archives, this tweet makes a useful point: taking a screenshot because something might be lost in the future is smart, but it is even better to push the page to a web archive, where it will be preserved for longer than in someone's private archive. For further information, refer to Plinio Vargas's blog post, "Links to Web Archives, not Search Engine Caches", where he discusses the difference between archived pages and search engine caches in terms of the decay period of the web pages.

Fig 1 - Super Bowl tweet on Internet Archive
Tweet Memento at Internet Archive

Another tweet was originally posted by Allum Bokhari, a senior Breitbart correspondent, and retweeted by Breitbart News, but the retweet was later undone. The original tweet from Allum Bokhari is present on the live web but the retweet is missing from it. A plausible reason is that Breitbart News later retweeted a similar post from Allum Bokhari.

Undo retweet of Breitbart News
Fig 2 - Archived version of unretweeted tweet by Breitbart News
Tweet memento at the Internet Archive

Fig 3 - Live version of unretweeted tweet by Breitbart News
Live Tweet Status

Of the 22 deleted tweets, 20 were cases in which Breitbart News had retweeted someone's tweet and the original tweet was later lost. Of those 20 tweets, 18 were from two affiliates of Breitbart News, NolteNC and John Carney. Therefore, we decided to look at both accounts to determine the reason for their deleted tweets.

Analysis of deleted tweets from John Carney and  NolteNC


We fetched live tweets for John Carney using the Twitter API, then fetched the TimeMap for his Twitter page using MemGator and collected the mementos within the time range of the live tweets. Because there were few mementos within that time range, the analysis showed no deleted tweets. We then fetched his live tweets from the Twitter API repeatedly over a week and compared each response with all the previous responses to find deleted tweets. We discovered that tweets older than seven days are automatically deleted on Tuesdays and Saturdays. The regularity of these deletions suggests the use of an automated tweet deletion service. There are a number of tweet deletion services, like Twitter Deleter, Tweet Eraser, etc., which delete tweets under certain conditions based on the lifespan of the tweet or the number of tweets allowed in the Twitter timeline at any given time.
Fig 4 - John Carney's tweet deletion pattern shown with 50 tweet ids
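
The snapshot comparison described above can be sketched as follows. This is a minimal illustration with hypothetical names and toy data, assuming one set of tweet ids is saved from the Twitter API per day. Note that in real data a tweet can also disappear from a snapshot simply because it fell out of the most recent 3200 tweets, which this sketch does not try to distinguish.

```python
from datetime import date


def deletions_by_day(snapshots):
    """Map each snapshot date to the tweet ids that vanished by that day.

    `snapshots` is a dict {date: set of tweet ids returned by the API on
    that day}. Each snapshot is compared with everything seen earlier.
    """
    disappeared = {}
    known = set()  # ids seen in earlier snapshots and not yet reported missing
    for day in sorted(snapshots):
        current = snapshots[day]
        missing = known - current
        if missing:
            disappeared[day] = missing
        known = (known - missing) | current
    return disappeared


# Toy data: ids 1 and 2 vanish between the Monday and Tuesday snapshots.
snapshots = {
    date(2018, 2, 12): {1, 2, 3},
    date(2018, 2, 13): {3, 4},
    date(2018, 2, 14): {3, 4, 5},
}
for day, ids in deletions_by_day(snapshots).items():
    print(day.isoformat(), day.weekday(), sorted(ids))
```

Grouping the resulting dates by weekday is what exposes a fixed deletion schedule.
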
We fetched live tweets for NolteNC using the Twitter API, then fetched the TimeMap for his Twitter page using MemGator and collected the mementos within the time range of the live tweets. For NolteNC, there were enough mementos within that time range to discover his deleted tweets. Our analysis yielded 169 live tweets and 3569 deleted tweets from 2017-11-03 to 2018-02-17.
Fig 5 - NolteNC's original tweet


Fig 6 - Breitbart News retweeting NolteNC's tweet.
With thousands of deleted tweets, it seemed unlikely that he was deleting tweets manually. We had every reason to believe that, like John Carney, NolteNC deleted tweets automatically using a tweet deletion service. We collected live tweets for his account over a week and compared each response with all the previous responses from the Twitter API. We concluded that every Wednesday and Saturday, all of his tweets older than seven days were deleted.
Fig 7 - NolteNC's tweet deletion pattern shown with 50 tweets 
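
One way to check whether observed deletions fit a "delete everything older than seven days" schedule is to compute each tweet's age at the time of a deletion run. The sketch below assumes the tweet ids are Twitter snowflake ids, which encode the creation time in milliseconds in the bits above the lowest 22. The function names and the `run_time` parameter are hypothetical, and `run_time` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # offset used by Twitter's snowflake ids


def tweet_created_at(tweet_id):
    """Recover the UTC creation time encoded in a snowflake tweet id."""
    milliseconds = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


def expected_deleted(tweet_ids, run_time, max_age_days=7):
    """Return the ids that an 'older than N days' deletion run would remove."""
    cutoff = run_time - timedelta(days=max_age_days)
    return [t for t in tweet_ids if tweet_created_at(t) < cutoff]
```

If the ids returned by `expected_deleted` for a given Wednesday or Saturday match the ids that actually disappeared, the seven-day rule is a good description of the account's behaviour.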

Conclusions

  1. It is not enough to take screenshots of controversial tweets; we also need to push web content that we wish to preserve to the web archives, which retain it for longer than our personal archives.
  2. For finding deleted tweets, web archives work effectively for popular accounts because they are archived often, but for less popular accounts with fewer mementos this approach will not work.
  3. Although Breitbart News does not delete tweets often, some of its correspondents automatically delete their tweets, effectively deleting the corresponding retweets.
--
Mohammed Nauman Siddique (@m_nsiddique)
