2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive in San Francisco, CA. The workshop was two days long, February 23-24, 2017, and was held in conjunction with a National Web Symposium, hosted at the Internet Archive, February 23 – 24. Four members of the Web Science and Digital Libraries group (WSDL) from Old Dominion University had the opportunity to attend: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and Web Archive Hackathon 2.0.

@WebSciDL at @internetarchive after Archives Unleashed 3.0 wrap up. We have a winner of #HackArchives pic.twitter.com/vYLi89yap0
— Sawood Alam (@ibnesayeed) February 25, 2017

This workshop was supported by the Internet Archive, Rutgers University, and the University of Waterloo. It brought together a small group of around 20 researchers who worked together to develop new open source tools for web archives. The three organizers of this workshop were Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (Assistant Professor, Department of History, University of Waterloo), and Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo).

It was a big moment for me when I first saw the Internet Archive building, with an Internet Archive truck parked outside. Since 2009, the IA headquarters have been at 300 Funston Avenue in San Francisco, a former Christian Science Church. Inside the building, in the main hall, there were multiple mini statues, one for every archivist who has worked at the IA for over three years.

On Wednesday night, we had a welcome dinner and a short round of introductions for the members who had arrived.

Day 1 (February 23, 2017)

On Thursday, we started with breakfast and headed to the main hall, where several presentations took place. Matthew Weber presented “Stating the Problem, Logistical Comments”. Dr. Weber started by stating the goals, which included developing a common vision of web archiving and tool development, and learning to work with born-digital resources for humanities and social science research.

Next, Ian Milligan presented “History Crawling with Warcbase”. Dr. Milligan gave an overview of Warcbase. Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The tool is used to analyze web archives using Spark, and to take advantage of HBase to provide random access as well as analytics capabilities.

Archives Unleashed 3 underway w an overview of Warcbase @ianmilligan1 #hackarchives thx @NSF @RutgersCommInfo @internetarchive for support pic.twitter.com/AaJ8Nemvd8
— Matthew Weber (@docmattweber) February 23, 2017

Next, Jefferson Bailey (Internet Archive) presented “Data Mining Web Archives”. He talked about the conceptual issues in access to web archives, which include: Provenance (much data, but not all as expected), Acquisition (highly technical; crawl configs; soft 404s), Border issues (the web never really ends), Lure of evermore data (more data is not better data), and Attestation (higher sensitivity to elision than in traditional archives?). He also explained the different formats in which the Internet Archive can save its data, which include CDX, the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). In addition, he presented an overview of some research projects based on collaboration with the IA. The projects he mentioned included the ALEXANDRIA project, Web Archives for Longitudinal Knowledge, Global Event and Trend Archive Research & Integrated Digital Event Archiving, and many more.

Next, Vinay Goel (Internet Archive) presented “API Overview”. He presented the Beta Wayback Machine, which searches the IA based on a URL or a word related to a site's home page. He mentioned that search results are ranked based on anchor text search.

Justin Littman (George Washington University Libraries) presented “Social Media Collecting with Social Feed Manager”. SFM is open source software that collects social media from the APIs of Twitter, Tumblr, Flickr, and Sina Weibo.

The final talk was by Ilya Kreymer (Rhizome), who presented an overview of the tool “Webrecorder”. The tool provides an integrated platform for creating high-fidelity web archives while browsing, sharing, and disseminating archived content.

After that, we had a short coffee break and started to form three groups. To form the groups, all participants were encouraged to write a few words on the topic they would like to work on; some of the words that appeared were fake news, news, and Twitter. Similar notes were then grouped together, along with the members who wrote them. The resulting groups were Local News, Fake News, and End of Term Transition.

Group Name	Group Members
Local News: Good News/Bad News	Sawood Alam (Old Dominion University); Lulwah Alkwai (Old Dominion University); Mark Beasley (Rhizome); Brenda Berkelaar (University of Texas at Austin); Frances Corry (University of Southern California); Ilya Kreymer (Rhizome); Nathalie Casemajor (INRS); Lauren Ko (University of North Texas)
Fake News	Erika Siregar (Old Dominion University); Allison Hegel (University of California, Los Angeles); Liuqing Li (Virginia Tech); Dallas Pillen (University of Michigan); Melanie Walsh (Washington University)
End of Term Transition	Mohamed Aturban (Old Dominion University); Justin Littman (George Washington University); Jessica Ogden (University of Southampton); Yu Xu (University of Southern California); Shawn Walker (University of Washington)

Every group started to work on its dataset, brainstormed different research questions to answer, and formed a plan of work. Then we worked all through the day and ended the night with a working dinner at the IA.

Archives Unleashed day 1: Team EOT-Transition using @eotarchive to try to detect changes in web presence of fed agencies. #hackarchives
— Justin Littman (@justin_littman) February 24, 2017

Day 2 (February 24, 2017)

On Friday we started by eating breakfast, and then each team continued to work on their projects.

Every Friday the IA hosts a free lunch where hundreds of people come together, including artists, activists, engineers, librarians, and many more. After that, a public tour of the IA takes place.

We had some lightning talks after lunch. The first talk was by Justin Littman, where he presented an overview of his new tool called “Fbarc”. This tool archives webpages from Facebook using the Graph API.

.@justin_littman gives a lightning intro to the Facebook Graph API and his new tool f(b)arc (https://t.co/7UQtNOiPjZ). #hackarchives pic.twitter.com/uJn17UFGmh
— Ian Milligan (@ianmilligan1) February 24, 2017

Nick Ruest (Digital Assets Librarian at York University) gave a talk on “Twitter”. Next, Shawn Walker (University of Washington) presented “We are doing it wrong!”. He explained how the current process of collecting social media does not reflect how people view social media.

Now @walkeroh is showing us how we are collecting social media wrong - users see websites, researchers see scrolling JSON. #hackarchives pic.twitter.com/SWCMT3qq0Z
— Ian Milligan (@ianmilligan1) February 24, 2017

After that, all the teams presented their projects, starting with our team. We called our project "Good News/Bad News". We used historical captures (mementos) of various local news sites' homepages from Archive-It to prepare our seed dataset. To transform the data for our usage we used Webrecorder, WAT converter, and some custom scripts. We extracted the headlines featured on the homepage of each site for each day. With the extracted headlines we analyzed sentiment on various levels, including individual headlines, individual sites, and the whole nation, using the VADER-Sentiment-Analysis Python library. To leverage more machine learning capabilities for clustering and classification, we built a Latent Semantic Indexer (LSI) model using a Ruby library called Classifier Reborn. Our LSI model helped us convey the overlap of discourse across the country. We also explored the possibility of building a Word2Vec model using TensorFlow for advanced machine learning but, despite its great potential, we could not pursue it in the limited time available. To distinguish between the local and the national discourse we planned on using Term Frequency-Inverse Document Frequency, but could not put it together in time. For the visualization we planned on showing an interactive US map along with a heat map of the newspaper locations, with the newspaper ranking as the size of the spot and the color indicating whether it is good news (green) or bad news (red). Also, when a newspaper is selected, the following are revealed: a list of associated headlines (color coded as Good/Bad), a pie chart showing the overall percentage of Good/Bad/Neutral, related headlines from various other news sites across the country, and a word cloud of the top 200 most frequently used words. This visualization could also have a time slider that shows the change in sentiment for the newspapers over time.
We had many more interesting visualization ideas to express our findings, but the limited amount of time only allowed us to go this far. We have made all of our code and necessary data available in a GitHub repo and are trying to make a live installation available for exploration soon.
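
As a rough illustration of the Good/Bad/Neutral breakdown behind the planned pie chart, the following minimal sketch buckets per-headline sentiment scores and computes percentages per site. It is not the team's code. It assumes each headline already has a compound score in [-1, 1], such as the "compound" value that VADER returns; the ±0.05 cutoffs follow the convention commonly used with VADER, and the site names, headlines, and scores are made up for the example.

```python
from collections import Counter

# Assumed cutoffs on the compound score (the convention commonly used with VADER).
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def label(compound):
    """Map a compound sentiment score in [-1, 1] to Good, Bad, or Neutral."""
    if compound >= POSITIVE_THRESHOLD:
        return "Good"
    if compound <= NEGATIVE_THRESHOLD:
        return "Bad"
    return "Neutral"


def sentiment_shares(scored_headlines):
    """Return, for each site, the percentage of Good/Bad/Neutral headlines.

    scored_headlines: iterable of (site, headline, compound_score) tuples.
    """
    counts = {}
    for site, _headline, compound in scored_headlines:
        counts.setdefault(site, Counter())[label(compound)] += 1

    shares = {}
    for site, counter in counts.items():
        total = sum(counter.values())
        shares[site] = {
            name: 100.0 * counter[name] / total
            for name in ("Good", "Bad", "Neutral")
        }
    return shares


# Hypothetical input: sites, headlines, and scores are illustrative only.
sample = [
    ("site-a.example", "Local team wins championship", 0.6),
    ("site-a.example", "Storm damages downtown", -0.5),
    ("site-a.example", "Council meets Tuesday", 0.0),
    ("site-a.example", "Festival draws record crowd", 0.4),
    ("site-b.example", "Factory closure announced", -0.3),
]
shares = sentiment_shares(sample)
print(shares)
```

The same function could be applied to all headlines at once (using a single placeholder site name) to get a nation-wide breakdown.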

Final presentations at #hackarchives are underway @internetarchive wrapping up archives unleashed 3!! pic.twitter.com/KbV4oFfurh
— Matthew Weber (@docmattweber) February 25, 2017

Next, the team “Fake News” presented their work. The team started with the research questions: “Is it fake news to misquote a presidential candidate by just one word? What about two? Three? When exactly does fake news become fake?”. Based on these questions, they hypothesized that “Fake news doesn’t only happen from the top down, but also happens at the very first moment of interpretation, especially when shared on social media networks”. With this in mind, they wanted to determine how Twitter users were recording, interpreting, and sharing the words spoken by Donald Trump and Hillary Clinton in real time. Furthermore, they also wanted to find out how the “facts” (the accurate transcription of the words) began to evolve into counter-facts or alternate versions of their words. They analyzed the Twitter data from the second presidential debate and focused on the most famous keywords such as "locker room", "respect for women", and "jail". The results were visualized using a word tree and a bar chart. They also conducted a sentiment analysis, which produced a surprising result: most tweets expressed positive sentiment toward the locker-room talk. Further analysis showed that sarcastic or insincere comments had apparently skewed the sentiment analysis, hence the positive sentiment.

Next team used Twitter to track "fake news" - how were Trump, Clinton debate quotes shared. How did alternatives appear? #hackarchives pic.twitter.com/NsOtS5eMnt
— Ian Milligan (@ianmilligan1) February 25, 2017

After that, the team “End of Term Transition” presented their project. The group tried to use public archives to estimate change in the main government domains at the time of each US presidential administration transition. For each of these official websites, they planned to identify the kind and the rate of change using multiple techniques including Simhash, TF–IDF, edit distance, and efficient thumbnail generation. They investigated each of these techniques in terms of its performance and accuracy. The datasets were collected from the Internet Archive Wayback Machine around the 2001, 2005, 2009, 2013, and 2017 transitions. The team made their work available on GitHub.
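
To give a feel for one of these techniques, here is a minimal Simhash sketch for comparing two captures of a page. It is not the team's implementation. The tokenization (lowercased words), the use of MD5 as the per-token hash, and the 64-bit fingerprint size are all assumptions made for the example. Two captures with similar text should yield fingerprints that differ in few bits, so the Hamming distance can serve as a cheap estimate of how much a page changed.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed fingerprint size for this sketch


def simhash(text):
    """Compute a 64-bit Simhash fingerprint over the words in text."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(FINGERPRINT_BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1

    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical page text from two captures of the same URL.
before = "Welcome to the agency home page. Read our climate change report."
after = "Welcome to the agency home page. Read our energy policy report."
print(hamming_distance(simhash(before), simhash(after)))
```

Because every token votes independently, word order does not affect the fingerprint; a variant that hashes overlapping word n-grams instead of single words would be sensitive to reordering as well.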

Last up! Analysis of changes in web content across end of term crawls @internetarchive #hackarchives pic.twitter.com/dWrfChnTFF
— Matthew Weber (@docmattweber) February 25, 2017

Finally, a surprise new team joined: team “Nick”, presented by Nick Ruest (Digital Assets Librarian at York University). Nick has been exploring Twitter API mysteries, and he showed some visualizations with odd peaks in the data.

The final mystery 'team' is @ruebot, who has been exploring Twitter API mysteries. Odd plateaus and peaks in data. #hackarchives pic.twitter.com/Q4dCGBvcTT
— Ian Milligan (@ianmilligan1) February 25, 2017

After the teams presented their work, the judges announced the team with the most points, and the winning team was “End of Term Transition”.

And the winner is Team Transition @internetarchive #hackarchives !!! That's a wrap folks. pic.twitter.com/bGili0d7WT
— Matthew Weber (@docmattweber) February 25, 2017

This workshop was extremely interesting and I enjoyed it fully. The fourth datathon, Archives Unleashed 4.0: Web Archive Datathon, was announced and will take place at the British Library, London, UK, on June 11 – 13, 2017. Thanks to Matthew Weber, Ian Milligan, and Jimmy Lin for organizing this event, and to Jefferson Bailey, Vinay Goel, and everyone at the Internet Archive.

And we're proud to announce archives unleashed 4 @britishlibrary jun 11-13. Details at https://t.co/OCbha8LDmk #hackarchives
— Matthew Weber (@docmattweber) February 25, 2017

-Lulwah M. Alkwai
