Thursday, June 30, 2016

2016-06-30: JCDL 2016 Doctoral Consortium Trip Report



Traditionally, the Joint Conference on Digital Libraries (JCDL) has hosted a workshop session called the Doctoral Consortium (DC), specific to PhD students in the digital library research field, and this year (JCDL 2016) was no exception. The workshop was intended for students who are in the early stages of their dissertation work. Several WS-DL group members have attended and reported on past DC workshops, and this time it was my turn.

This year's doctoral consortium (June 19, 2016) was chaired by George Buchanan, J. Stephen Downie, and Uma Murthy. Committee members included Sally Jo Cunningham, Michael Nelson, Martin Klein, and Edie Rasmussen. A total of six PhD students participated in the workshop and presented their work. The doctoral consortium session is generally not open for public participation; however, Michele C. Weigle (JCDL 2016 program chair), Mat Kelly (last year's DC participant), and Alexander Nwala (potential future participant) also attended the session. Each presenter was given about 20 minutes for the talk and 10 minutes for questions and comments.

I, Sawood Alam, from Old Dominion University, was the first presenter of the session with my research topic, "Web Archive Profiling for Efficient Memento Aggregation". I elaborated on the problem space of my research by giving an example of collection building, indexing, updating indexes, and profiling or summarizing the collection for better collection understanding and efficient lookup routing. With the help of real-life examples and events, I established the importance of small web archives and the need for efficient means of aggregating them. I further explained the methodology and various approaches to web archive profiling, depending on available resources and the desired level of detail. I briefly described the evaluation plan and preliminary results published/accepted in TPDL15, IJDL16, and TPDL16. Finally, I presented my tentative timeline of work and publication plans. My work is supported in part by the IIPC.
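To make the idea of profile-based lookup routing concrete, here is a minimal sketch, not the actual implementation: the profile structure, the simplified SURT-style keys, and the archive names are all hypothetical. The idea is that an aggregator consults each archive's profile, a summary of the URI-key prefixes the archive is known to hold, and forwards a lookup only to the archives whose profile matches, instead of broadcasting every request to every archive.

```python
# Illustrative sketch of profile-based lookup routing for a Memento
# aggregator. The profiles, key scheme, and archive names are invented
# for this example.

from urllib.parse import urlparse

def surt_key(uri):
    """Convert a URI to a simplified SURT-style key, e.g.
    'http://example.com/a' -> 'com,example)/a'."""
    parsed = urlparse(uri)
    host = ",".join(reversed(parsed.netloc.split(".")))
    return host + ")" + (parsed.path or "/")

# Hypothetical profiles: archive name -> set of key prefixes it is
# known (or believed) to hold, at some chosen level of detail.
profiles = {
    "archive-a": {"com,example)"},
    "archive-b": {"uk,"},
}

def route(uri):
    """Return the archives whose profile suggests they may hold this URI."""
    key = surt_key(uri)
    return sorted(name for name, prefixes in profiles.items()
                  if any(key.startswith(p) for p in prefixes))
```

In practice, profiles are far richer (holding counts, multiple levels of detail, and trade-offs between profile size and routing accuracy), but the routing decision ultimately reduces to this kind of prefix match.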


Adeleke Olateju Abayomi from the University of KwaZulu-Natal, South Africa, presented her work entitled, "An Investigation of the Extent of Automation of Public Libraries in South-West Nigeria". It was survey-based research for which she administered interviews and questionnaires to randomly selected librarians. Attendees of the workshop asked questions about the scope of the automation she was studying: whether it was limited to back-end library management processes or also included public-facing services, and whether library patrons were also interviewed. Committee members suggested that she also include case studies from nearby countries that have invested in library automation to strengthen her arguments.


Bakare Oluwabunmi Dorcas from the University of KwaZulu-Natal, South Africa, presented her research on "The Usage of Social Media Technologies among Academic Librarians in South Western Nigeria". She used a survey-based research approach, conducting interviews and questionnaires with academic librarians at six universities. She described how librarians are using social media, such as Facebook groups, to deliver library and information services to library clientele. The outcome of the study is expected to improve practice, inform policy, and extend theory in the field of Social Media Technologies (SMT) use in academic libraries in a developing-country context. It was suggested that she examine whether the material posted on social media is periodically archived elsewhere.


Prashant Chandrasekar from Virginia Tech presented his work on "A DL System to Facilitate Behavioral Studies of Social Networks". He is designing a framework that would enable researchers to conduct hypothesis testing on information related to the study of human behavior in a clinical trial involving social networks. The system is being built with the immediate aim of serving a team of researchers who are part of the Social Interactome project. However, the design of the framework and the scenarios of use will be generalized for psychologists and sociologists more broadly.


Lei Li from the University of Pittsburgh and the China Scholarship Council presented "A Judgement Model for Academic Content Quality on Social Media". Initially, there was some lack of clarity about what she meant by "academic content" on social media, but she clarified that her study centers on ResearchGate posts and comments. The general comment from the committee on her quality assessment approach can be summarized as "it would be great to establish something of this sort, but she should limit the scope for her PhD work." Edie Rasmussen nicely put the challenge of creating a data set for quality measurement as, "Life's too short to generate your own data set," which was in line with Yasmin AlNoamany saying, "I will never do it again!" while describing a manually labeled data set during her recent PhD defense. Dr. Michael Nelson suggested that Lei Li pick a good example and walk through it to elaborate on the process.


The final presentation of the day was from Jessica Ogden of the University of Southampton. Jessica presented her ongoing PhD research entitled "Interrogating the Politics and Performativity of Web Archives", which is centered on web archival practice, specifically looking at selection and collection practices across different web archiving communities in the field. Jessica prefaced the presentation with some information about her academic background to provide context for how the interdisciplinary project is being approached. Some philosophical questions were raised regarding the nature of web archives (and assumptions about the Web itself), as well as the importance of documenting the assumptions made during the selection and collection of web archives (which are often left undocumented). For more details on Jessica's research and the presentation at the JCDL 2016 DC, see her blog post.


Once all six participants had presented their work, committee members offered some general comments, such as that every presenter did a good job of wrapping up their talk in the allotted time and leaving enough time for questions and comments. They also noted that slides should not have so many words that the presenter ends up reading them verbatim; on the other hand, they should not be at the other extreme either, where every slide has nothing but pictures. After these comments, the session was opened for everyone to provide feedback or ask general questions of the presenters or the committee members. I noted that this year's doctoral consortium was dominated by "social media"-based studies.

On our way back to the hotel Alexander said, "the committee members had such a deep understanding of the subject and provided very useful comments." Mat and I replied, "yes they did indeed." I strongly recommend that, if possible, every PhD candidate participate in a Doctoral Consortium workshop in their field at least once to gain insights and perspective from people outside their thesis committee.

You may also like to read the main JCDL 2016 conference coverage by Alexander and the WADL 2016 workshop coverage by Mat.

Update (July 2, 2016): Added Bakare's slides and updated the description of her talk.

--
Sawood Alam

Monday, June 27, 2016

2016-06-27: Symposium on Saving The Web at the Library of Congress

On June 16, 2016, the Library of Congress hosted a one-day symposium entitled Saving the Web: The Ethics and Challenges of Preserving What's on the Internet. The symposium came at the end of the Archives Unleashed 2.0 Web Archive Datathon. The Datathon itself is covered in an earlier report. In addition to presenting the results of the datathon projects, the wide variety of speakers at the symposium defended the need for preservation, discussed the special issues associated with the preservation of data, and highlighted the importance and concepts surrounding the preservation of multimedia.

Keynote by Vint Cerf


The symposium opened with Dame Wendy Hall, Professor of Computer Science at the University of Southampton and Kluge Chair in Science and Technology at the Library of Congress. She mentioned that the Internet has only been available for a few decades and we need to preserve its openness and freedom, because that openness and freedom are always in peril.  She used those points to introduce one of the inventors of TCP/IP, Vint Cerf.

Vint Cerf talked about the instability of the current Internet. He introduced the idea of "Digital Vellum", referring to vellum, a form of parchment created from animal skin that was once used to create fine quality documents. The goal is to not only capture the documents and data that make up the Internet, but also be able to recreate them in the distant future.

He highlighted a number of problems with the current Internet. URLs associated with domain names are not stable; a change in ownership or solvency of a company or organization can cause a domain name to stop responding.

To understand the context surrounding them, digital objects also require a lot of metadata to be captured in addition to their original content. Users need enough information to correctly interpret the content that has been preserved because its context may be lost to time.

Copyright law needs to be amended to give rights and protections to archival organizations, like the Library of Congress. Serious questions about protections for archival and replay of archived content still exist.

He explained the importance of multiple web archives, noting historical issues with libraries and archives being lost to natural disasters or war. He thinks there is something "delicious" about the Library of Alexandria being a backup site for the Internet Archive. He said that we still have many of the artifacts of the ancient world purely by luck or accident, and that we can do better.


Talking about other digital objects, Vint Cerf then discussed efforts to preserve software. He mentioned the OLIVE preservation project for archiving and replaying executable content.  They are currently looking into streaming virtual machines in order to replay old executables on modern systems. He did confess that we have a long way to go before we're able to reproduce old results in some cases.



Vint Cerf mentioned the idea of the "self-archiving Web", indicating that we need collaboration, open design, and new business models. He said that, due to its success, the Internet could serve as a good source of lessons for how one would go about designing the self-archiving Web. Participating in the current Internet is done by just following the agreed-upon protocols. It works largely because of its modularity and its capacity for layered evolution. The self-archiving Web should also try to embrace these strengths.

He listed outstanding questions about the approach of archiving the Internet. He said that he had some issues with contemplating the idea of the Internet containing itself. Is there a better replacement for hyperlinks, given their deterioration? Should we be using something like DOIs instead? When should a snapshot be taken? How do we know when a change has occurred in a resource? How do we ensure that old formats, like old versions of HTML, will render well for future users? Do we store malware or encryption keys? How do we handle access control for resources?




The other inventor of TCP/IP, Bob Kahn of the Corporation for National Research Initiatives (CNRI), was unable to attend. He was scheduled to present Digital Object Architecture (DOA), so Vint Cerf continued by presenting that work as well. He talked about the existing handle systems, such as DOI, where identifiers, rather than locations, are used for digital objects. These identifiers are then submitted to a resolution system that locates the object and then delivers it to the user. Of course, this resolution system is an additional layer of infrastructure that must be managed. There are quite a few handle system implementations, including those supported by the Library of Congress, CrossRef, and mEDRA.

He finished up by accepting a few questions from the audience. From these exchanges came additional insight - and funny moments - shown in the tweets below.




Archives Unleashed Presentation



Ian Milligan is an Assistant Professor in the Department of History at the University of Waterloo. He discussed the use of web archives in studying history, highlighting how he used warcbase in a study of 186 million archived pages from geocities.com. He spoke about the importance of studying online communities in order to understand a period of history.




Then Ian and some team representatives presented our hard work from the Archives Unleashed Hackathon. We had worked on a variety of projects with different datasets. Combining natural language processing with temporal metadata and modeling allowed our groups to study sentiment in elections, uncover differences in media reporting based on country, discover documents related to terrorism, and more.









The Need for Preservation


Next was a series of presentations and a panel discussion moderated by David Lazer (Northeastern University) with Abbie Grotke (Library of Congress), Jefferson Bailey (Internet Archive), Richard Marciano (University of Maryland), and Richard Price (British Library). Their topic was "The Need for Preservation". David Lazer started by discussing the curation of archives and posed the open question of how to determine which archived pages are valuable.

David Lazer


David Lazer is a Professor in Political Science and Computer and Information Science at Northeastern University. He started the session by discussing the quality of what has been archived. He mentioned that digital media allows us to think of documents and data in a different way. He discussed the issues with finding useful information in Twitter data, due to the presence of bots and other sources of noise.

Jefferson Bailey


Jefferson Bailey is the Director of Web Archiving Programs for the Archive-It Team at the Internet Archive. He highlighted some statistics about its current holdings as well as its multimodal crawling strategy, including work with libraries. At the moment, researchers must develop problem-specific tools to work with the Internet Archive. They are currently gathering information on research interests in an attempt to create a set of general-purpose tools for research.

Richard Marciano


Richard Marciano leads the Digital Curation Innovation Center (DCIC) at the University of Maryland. He spoke about DCIC's work with big data and how it was related to digital archives. He finished up with some thought-provoking questions shown below.

Richard Price


Richard Price is the Head of Contemporary British Collections at the British Library. He discussed the mission of saving the web and stressed that advocacy has always been important for libraries and archives. He mentioned that users are often the best advocates and that choosing the right language matters when advocating for web archiving; he prefers the term "time travel" because it seems to engender more interest from the public.

Abbie Grotke


Abbie Grotke is the web archiving team lead for the Library of Congress. She discussed the curated web archive maintained by the Library of Congress. They perform regular crawls of specific websites and use RSS feeds to inform their crawling. Currently the team is focused on acquiring web content, but they do not yet have the resources to make it all accessible. She said that there are challenges to archiving the web in the United States, because most sites do not sit under a country-specific top level domain. The question and answer session afterwards brought up a number of good thoughts. What are the ethics of archiving? Many archives have a national focus, but many topics are international; how do we curate topics so that they are available across archives? Do people have a right to be forgotten?

Putting Data to Work


The next session was moderated by Dame Wendy Hall. The speakers for this session were Lee Rainie (Pew Research Center), Katy Börner (Indiana University Bloomington), James Hendler (Rensselaer Polytechnic Institute), and Philip E. Schreur (Stanford University).

Lee Rainie


Lee Rainie is the director of Internet, Science and Technology for the Pew Research Center. He stated that he was happy that so many large-scale projects involving Internet data, and especially archived data, have a civic focus. He bemoaned the decline of civic news provided by newspapers, but said that librarians and archivists can play an important role in ensuring that civic information gets archived in web archives. He did warn that, though many research projects acquire data from Twitter, only 20% of Americans use Twitter, meaning that many perspectives are lost.

Katy Börner


Katy Börner is a Distinguished Professor of Information Science at the School of Informatics and Computing at Indiana University Bloomington. She discussed the exciting world of visualizing (web) science. She featured some of the work at scimaps.org, a site dedicated to visualizations of scientific data. I was surprised to see her highlight the "Clickstream Map of Science" that was "near and dear" to her, with which I was very familiar because it was created by Johan Bollen, Herbert Van de Sompel, and others as part of "Clickstream Data Yields High-Resolution Maps of Science". She mentioned the need to not only create tools for visualizing web data, but also the importance of pursuing information literacy so that many can use these tools as well.

James Hendler


James Hendler is Director of the Institute for Data Exploration and Applications and the Tetherless World Professor of Computer, Web and Cognitive Sciences at Rensselaer Polytechnic Institute (RPI). He is one of the originators of the Semantic Web. He discussed data and how important it is to ensure that the data we use for research is suitable for others to consume as well. He mentioned the importance of metadata for making sense of data in context, echoing earlier points made by Vint Cerf. He talked about the temporal nature of data and how accessing datasets at different points in time is in itself useful. I spoke to him during one of the breaks about work the LANL Prototyping Team has been doing regarding temporal access to semantic web data.

Philip Schreur


Philip Schreur is the Assistant University Librarian for Technical and Access Services at Stanford University Library. He discussed the issues of metadata and how libraries are engaged in a migration to linked data. He mentioned the importance of metadata in understanding historical context. He said that shifting from MARC and other metadata formats will be difficult, but necessary for the future of libraries. He sees a future where libraries will be creating metadata for the purposes of sharing it with the web. He also agrees that libraries will continue to curate data, but acquisition of content will be automated.

Dame Wendy Hall


Dame Wendy Hall then began talking about where she would take libraries, emphasizing that it is data that patrons are looking for. That data may take the form of documents, datasets, etc., but is more than just articles. She mentioned that librarians need to become more data-savvy and that discovery will become more and more important.

Saving Media


The last session was moderated by Matthew Weber (Rutgers University). This session included Philip Napoli (Rutgers University), Ramesh Jain (University of California, Irvine), and Katrin Weller (GESIS Leibniz Institute for the Social Sciences and former Kluge Fellow in Digital Studies).

Matthew Weber


Matthew Weber is an Assistant Professor in the School of Communication and Information at Rutgers University. He began the session by talking about how web content changes and how it is possible, because of these changes, to view the perceptions of a group at a specific point in time.

Philip Napoli


Philip Napoli is a Professor of Journalism and Media Studies in the Rutgers School of Communication and Information. He began by echoing one of Vint Cerf's points: there is so much diverse content that it is more difficult to study media from the early 2000s than to study media from the more distant past. He mentioned that there needs to be a focus on archiving local news because it is getting lost. It is also an area in which local libraries can participate.

Ramesh Jain


Ramesh Jain is a Professor at the Bren School of Information and Computer Sciences at the University of California, Irvine. His area of research includes multimedia information systems. He spoke about multimedia and how the growth of cameras has created an unprecedented capability for capturing events. He mentioned how a change is occurring, in part thanks to social media, whereby we now produce "visual documents" that contain text rather than textual documents that begrudgingly contain photos. He emphasized that we have begun creating not just a web of documents, but a "web of events".

Katrin Weller


Katrin Weller is an information scientist working at the GESIS Leibniz Institute for the Social Sciences. She discussed the issue of context in social media. Will present hashtags have any meaning in the future? She mentioned that future historians may use past instructional texts, like "Twitter for Dummies", to understand how our current tools are used. In some cases, it is important to understand that people change social media accounts over time.

Conclusion by Dame Wendy Hall



Dame Wendy Hall concluded the symposium by discussing the growth of the Internet and how it has changed the world. Her group at the University of Southampton hosts the Web Science Trust, with the goal of facilitating the development of Web Science. She explained that while libraries will continue to maintain physical collections, data has also become important to researchers, requiring librarians to learn new data science skills. This led her to introduce the Web Observatory, a place to share and link datasets so that researchers can answer questions about the web. The goal is to have metadata in a standard format that will support discovery, but also allow libraries to share each other's data rather than having to collect all of it themselves.

Thoughts and Thanks


All in all, this was an excellent experience and I am glad I attended. I was able to make contact with some of the best minds from a variety of fields while learning about their really fascinating work.

Thanks to Vint Cerf, Ian Milligan, David Lazer, Abbie Grotke, Jefferson Bailey, Richard Marciano, Richard Price, Lee Rainie, Katy Börner, James Hendler, Philip E. Schreur, Dame Wendy Hall, Matthew Weber, Philip Napoli, Ramesh Jain, and Katrin Weller for the excellent thought-provoking presentations.

Thanks to Matthew Weber, Ian Milligan, Jimmy Lin, Noshir Contractor, David Lazer, Wendy Hall, Nicholas Taylor, and Jefferson Bailey for making Archives Unleashed a reality and connecting it to the Saving the Web Symposium. Also, thanks to all of the Archives Unleashed attendees who made the experience quite rewarding.

And final thanks go to Dame Wendy Hall and the John W. Kluge Center at the Library of Congress for hosting the event.


Thanks much for tweets from @DameWendyDBE, @EpistolaryBrown, @joecar25, @kwelle, @nullhandle, @websitemgmt, @lljohnston, @tsuomela, @jesseajohnston, @kzwa, @jillreillyjames, @lrainie, @ianmilligan1, @KlugeCtr, @NEH_PresAccess, @KingsleySteph, @justin_littman, @jahendler, @MikeNelson, @docmattweber, @raiminetinati, @earlymodernpost

Many others have also written articles about this event.

-- Shawn M. Jones

2016-06-27: Archives Unleashed 2.0 Web Archive Hackathon Trip Report


Members from WSDL who participated in the Hackathon 2.0
Last week, June 13-15, 2016, six members of the Web Science and Digital Library group (WSDL) from Old Dominion University had the opportunity to attend the second Archives Unleashed, 2.0, at the Library of Congress in Washington, DC. This event was a follow-up to Archives Unleashed (Web Archive Hackathon 1.0), held in March 2016 at the University of Toronto Library, Toronto, Ontario, Canada. We (Mat Kelly, Alexander Nwala, John Berlin, Sawood Alam, Shawn Jones, and Mohamed Aturban) met other participants from various countries and with different backgrounds -- librarians, historians, computer scientists, etc. The main goal of the event was to build tools for web archives and to support an ongoing community with a common vision of how to access and extract data from web collections.

This event was made possible with generous support from the National Science Foundation, the Social Sciences and Humanities Research Council of Canada, the University of Waterloo’s Department of History, the David R. Cheriton School of Computer Science and the University of Waterloo, and the School of Communication and Information at Rutgers University.

The event was organized by Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (Assistant Professor, Department of History, University of Waterloo), Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo), Nicholas Worby (Government Information & Statistics Librarian, University of Toronto), and Nathalie Casemajor (Assistant Professor, Department of Social Sciences, University of Québec). Here are some details about the different activities over the three days of Hackathon 2.0.


Day 1 (June 13, 2016)

Our evening gathering on the first day of Hackathon 2.0 was at Gelman Library, George Washington University. Its purpose was (1) for the participants to briefly introduce themselves and their areas of research, and (2) to form multiple groups to work on different Hackathon projects. To form the groups, all participants were encouraged to write a few words on three separate sticky notes describing a general topic they were interested in (e.g., topic modeling, extracting metadata, studying tweets, or analyzing Supreme Court nominations), the kind of dataset they wanted to work on (e.g., collected tweets, or data from the 2004/2008 elections), and what they wished to accomplish with the selected dataset.



Participants then clustered sticky notes with similar ideas together. After that, the initial groups were formed, and every group was given a few minutes to introduce its project idea. By the end of the first day, we all went to a restaurant to have dinner and socialize. Here is a list of the groups formed during the first day after the brainstorming session; I will explain later in some detail what each group accomplished:

Twitter Political Organization: Allie Kosterich, Nich Worby, John Berlin, Laura Wrubel, Gregory Wiedeman
Mojitos: Nathalie Casemajor, Federico Nanni, Alexander Nwala, Sylvain Rocheleau, Jana Hajzlerova, Petra Galuscakova
Museum: Ed Summers, Emily Maemura, Sawood Alam, Jefferson Bailey
I Know What You Hid Last Summer: Mat Kelly, Shawn Walker, Keesha Henderson, Jaimie Murdock, Jessica Ogden, Ramine Tinati, Niko Tsakalakis
The Supremes: Nicholas Taylor, Ian Milligan, Jimmy Lin, Patrick Rourke, Todd Suomela, Andrew Weber
Team Turtle: Mohamed Aturban, Niel Chah, Steve Marti, Imaduddin Amin
Counter-Terrorism: Daniel Kerchner, Emily Gade
Campaign: Origins: Allison Hegel, Debra A. Riley-Huff, Justin Littman, Shawn M. Jones, Kevin Foley, Ericka Menchen-Trevino, Nick Bennett, Louise Keen


Day 2 (June 14, 2016)

Colleen Shogan, from the Library of Congress, declared the Hackathon open. Colleen mentioned that researchers who have questions about politics, history, or other aspects of cultural memory really need specialists like us to help them access data available in repositories such as the Internet Archive and the Library of Congress. She emphasized the importance of such events, and finally she thanked the people who made the event possible, including the organizers and the steering committee.


Matthew Weber presented the agenda for the day, including presentations, a brief tour of the Library of Congress, and revising the groups formed the day before. Matthew gave an example from his own dissertation work illustrating how difficult it is to use web archives to answer research questions without building tools. He stated that this ongoing community aims to build a common vision for web archive tool development to help access and extract data and uncover important stories from web archives. Finally, Matthew listed several kinds of datasets available for the participants to work on; the 2004, 2008, and 2010 election data and the Supreme Court nominations are examples of such datasets.


Ian Milligan introduced Warcbase (slides), which was developed by a team of five historians, three computer scientists, and a network scholar. Ian showed how slow it is to browse web archives the traditional way, by entering a URL in the Wayback Machine (remembering that requiring the URL itself limits what you can find in the archives). Warcbase goes beyond that: it can be used to access, extract, manage, and process data from WARC files (e.g., extracting names, locations, plain text, and URIs from WARC files and generating different formats like network graphs or metadata in JSON). Warcbase supports filtering data based on dates, domain names, languages, etc. In addition, Warcbase is scalable, meaning it may run on a laptop, a powerful desktop, or a cluster. Users may run Warcbase from command-line tools as well as through an interactive web-based interface.
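Warcbase itself is written in Scala and runs on Hadoop/Spark, so the following is not its API; it is only a minimal Python sketch of the kind of date and domain filtering described above, applied to already-extracted records (the record fields and sample data are invented for this example):

```python
# Illustrative sketch of filtering archived records by domain and
# timestamp range, in the spirit of what Warcbase offers for WARC data.
# Records here are plain dicts with hypothetical 'url' and 'timestamp'
# (YYYYMMDDhhmmss) fields.

from urllib.parse import urlparse

def filter_records(records, domain=None, start=None, end=None):
    """Yield records matching an optional domain and timestamp range."""
    for rec in records:
        if domain and urlparse(rec["url"]).netloc != domain:
            continue
        if start and rec["timestamp"] < start:
            continue
        if end and rec["timestamp"] > end:
            continue
        yield rec

# Invented sample data for demonstration.
records = [
    {"url": "http://example.com/a", "timestamp": "20040715000000"},
    {"url": "http://example.org/b", "timestamp": "20081101000000"},
]
hits = list(filter_records(records, domain="example.com"))
```

Because timestamps in the `YYYYMMDDhhmmss` form sort lexicographically, plain string comparison suffices for the date-range check.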




Jefferson Bailey and Vinay Goel from the Internet Archive presented the Archive Research Services Workshop. Jefferson mentioned that the Internet Archive focuses on collecting web resources and providing access to those collections. The Internet Archive does not allow researchers to access its infrastructure for intensive research like data mining activities. The Internet Archive holds huge web collections, about 13 terabytes, and it collects about a billion URIs a week. Jefferson also indicated that WARC files are huge and difficult to work with; researchers might request huge collections in WARC format but end up using only a small portion. For these reasons, the Internet Archive is trying to support specific research questions: instead of providing data in WARC format, they will give users access to datasets in derived formats like CDX, which consists of metadata about the original WARC files. Other formats include the Web Archive Transformation (WAT), Longitudinal Graph Analysis (LGA), and Web Archive Named Entities (WANE) datasets. These derived formats are much smaller; for example, a CDX dataset is only about one percent of the size of the corresponding WARC files.
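As an illustration of why CDX is so much smaller than WARC, here is a sketch that parses one line of a commonly used 11-field CDX layout. Note that field order varies between CDX producers, and the sample line below is fabricated; the point is that each record is a dozen short metadata fields rather than the full archived response.

```python
# Hedged sketch: parse one line of an 11-field CDX layout. The field
# order shown here is a frequently used arrangement, but real CDX files
# declare their own field order in a header line.

CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
              "digest", "redirect", "metatags", "length", "offset", "filename"]

def parse_cdx_line(line):
    """Split a space-delimited CDX line into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError("unexpected number of CDX fields: %d" % len(values))
    return dict(zip(CDX_FIELDS, values))

# Fabricated example record.
example = ("com,example)/ 20160601120000 http://example.com/ text/html 200 "
           "SHA1DIGESTPLACEHOLDER - - 1043 333 example.warc.gz")
record = parse_cdx_line(example)
```

The `offset` and `filename` fields are what let replay tools seek directly into the right WARC file, which is why CDX works as a compact index over the much larger archive.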




Next, Vinay Goel from the Internet Archive continued on the same topic (Archive Research Services Workshop). Vinay gave a quick overview of ArchiveSpark, a tool that can help the community search and filter Internet Archive collections (e.g., filtering by date, MIME type, or HTTP response code). A research paper about ArchiveSpark was accepted and will be presented at JCDL 2016.





Abigail Grotke and Andrew Weber introduced the Library of Congress Data Resources. Abigail and Andrew work on the web archiving team along with other members at the Library of Congress. They indicated that most of the crawling process is done at the Internet Archive. The Library of Congress has been collecting web resources for more than 16 years and has made some collections available on the web. These collections are searchable (though not full-text searchable), indexed, and accessible via the Wayback Machine. In addition, the Library of Congress archive supports Memento. Most of the collections cannot be accessed on the web due to copyright issues and permissions policies, so researchers must be physically present at the Library of Congress to access them.






During the coffee break, we had the opportunity to take a short tour of the Library of Congress Jefferson Building, which was the first building in DC with electricity and an elevator in use.



After the coffee break, each group briefly explained their project idea and what kind of dataset they were going to use. At this point, some participants moved to other groups whose ideas they found more interesting.



While having lunch, we listened to five-minute lightning talks. Nicholas Taylor (Stanford University) introduced WASAPI, Jefferson Bailey (Internet Archive) gave a short talk about Researcher Services, Ericka Menchen-Trevino (American University) presented Web Historian, Nathalie Casemajor, Petra Galuscakova, and Sylvain Rocheleau briefly explained NUAGE, Alexander Nwala (Old Dominion University) introduced Generating Collections for Stories and Events, and finally John Berlin (Old Dominion University) presented Are Wails Electric?



After that, the groups were assigned to different rooms based on what kind of equipment each team might need for its project. Each group met and had the opportunity to work for about 5 hours (with a 30-minute coffee break after the first 2 hours) on their project ideas. By the end of the second day, we all came together around 6 PM, and each group's representative gave an update on the team's progress. Then, all participants were invited to dinner.





Day 3 (June 15, 2016)

Most of the last day was reserved for the groups to work intensively on their projects and produce final results. From breakfast at 8:30 AM in the Madison Atrium until the end of the day at 6:30 PM, we worked on our projects, pausing only for coffee breaks and lunch. Some participants gave five-minute lightning talks during lunch. The acoustics in the Madison Atrium were poor; Justin Littman stood on a chair to deliver his talk, yet it was still hard to hear. For this reason, I will only briefly mention what those talks were about.


Laura Wrubel, Daniel Kerchner, and Justin Littman from George Washington University presented an introduction to the new Social Feed Manager, a sampling of research projects supported by Social Feed Manager, and the provenance of a tweet (as inspired by web archiving). Sawood Alam from Old Dominion University introduced MemGator – A Memento Aggregator CLI and Server in Go. Jaimie Murdock from Indiana University presented Polygraphic and Polymathic: Into Thomas Jefferson's Mind. Finally, Mat Kelly from Old Dominion University gave a short talk about Exploring Aggregation of Personal, Private, and Institutional Web Archives.




Final presentations

By the end of the day, each group presented the findings of the project that they had been working on for the last couple of days:




  • Mojitos (Slides)

  • The team's goal was to detect and track events as discussed by polarized media covering Cuba. This was done by processing news data from the state-controlled Cuban media (Granma) and a Florida-based outlet that caters to Cuba (El Nuevo Herald).




  • Campaign: Origins (Slides)

  • Using tweets with #election2016, @realDonaldTrump, and @HillaryClinton, this team searched for narratives using the content of the web pages linked to from these tweets, rather than just the tweets themselves. The tweets were collected on June 14-15. The team intends to use the Internet Archive's Save Page Now feature to capture the web pages as they are tweeted so that such a study can be repeated on a larger set of tweets in the future. They also produced a streamgraph visualization of the results.





  • The Supremes (Slides and more details)

  • This group analyzed web archive data, provided in ARC format by the Library of Congress, about the Supreme Court nominations of Justice Alito and Justice Roberts. The dataset is 92 GB, containing 2.2 million pages about both Alito and Roberts. The goal of the team was to explore and analyze the data and generate further research questions. They used Warcbase to extract datasets from the ARC files. In addition, Warcbase can produce files in formats that can be opened directly in other platforms like Gephi.





  • I Know What You Hid Last Summer (Slides)

  • The team took Twitter datasets from UK and Canadian Parliament members, identified the deleted tweets, noted which tweets contained links, checked whether those links died after the tweets were deleted, and tried to derive meaning from the deletions. They also produced visualizations.




  • Museum

  • This team analyzed CDX files from the Internet Archive's IMLS Museums crawl, consisting of over 219 million captures. They also utilized the Museum Universe Data File from IMLS to enrich their findings. They evaluated the proportion of various content types (such as images or PDFs) that were crawled, quantified the term frequencies in the URLs of each content type, and visualized the domain name distribution of the collection as a hierarchical treemap chart. Part of their analysis is published on GitHub.
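A rough sketch of the content-type proportion step, assuming the MIME types have already been extracted from the CDX records (the sample values below are made up, not the team's actual data):

```python
# Illustrative sketch: compute the proportion of each content type
# in a crawl, given the MIME-type column pulled from a CDX index.

from collections import Counter

# Hypothetical MIME types extracted from CDX records.
mimetypes = ["text/html", "image/jpeg", "text/html",
             "application/pdf", "image/jpeg", "text/html"]

counts = Counter(mimetypes)
total = sum(counts.values())
proportions = {mime: count / total for mime, count in counts.items()}

print(proportions["text/html"])  # → 0.5
```

The same Counter-based tally works for the team's other analyses, such as term frequencies over URL tokens or capture counts per domain.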





  • Counter-Terrorism (Slides)

  • This team collected 383,527 tweets (between 2013 and 2016) from 1,153 accounts of suspected extremists. Approximately 300 people are associated with these accounts. The tweets are a mix of English and Arabic. The goal is to identify ISIS supporters by running the tweets through an ideology classifier.




  • Team Turtle (Slides)

  • The team used a dataset from the 2004 Presidential Election provided by the Library of Congress. The dataset was collected on the day before the election, election day, and the day after the election. The goal of this team was to answer questions like (1) if one candidate spends more time talking about issues related to a particular state than the other candidate does, would this lead him to win the state? (2) would candidates give more time to the "swing" states than others? and (3) what is the most important topic for each state? The dataset was available in ARC format, and the Warcbase tool was used to extract text from those files. After that, the dataset was analyzed with techniques like the Stanford NER tagger to tag places, people, and organizations, and the LDA model and TF-IDF to identify topics. Finally, the team produced an interactive visualization using D3.js.
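As an illustration of the TF-IDF step, here is a toy version with invented per-state documents (the team used full NLP toolkits; this only shows the scoring idea of weighting a term by how characteristic it is of one document relative to the collection):

```python
# Toy TF-IDF: score how characteristic a term is for one state's
# text relative to the whole collection. Documents are made up.

import math
from collections import Counter

docs = {
    "ohio":    "economy jobs economy trade",
    "florida": "immigration jobs healthcare",
    "iowa":    "farming trade economy",
}

def tf_idf(term, doc_id):
    """Term frequency in one doc times log inverse document frequency."""
    words = docs[doc_id].split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for text in docs.values() if term in text.split())
    idf = math.log(len(docs) / df)
    return tf * idf

# "economy" appears twice in Ohio's 4-word doc and in 2 of 3 docs,
# so its score is 0.5 * ln(3/2).
score = tf_idf("economy", "ohio")
print(round(score, 3))  # → 0.203
```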







  • Twitter Political Organization (Slides)

  • The team created a timeline of mentions in candidate tweets of donations to the Service Employees International Union (SEIU) on Twitter and a graph of retweets per day for each candidate, and performed sentiment analysis (with a Naive Bayes classifier) on the candidates' tweets to see whether donation amounts over time correlated with how positively or negatively the candidates tweeted.
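A toy Naive Bayes sentiment classifier in the spirit of the team's approach (the training tweets below are invented; real work would train on a labeled corpus):

```python
# Minimal multinomial Naive Bayes with Laplace smoothing for
# positive/negative sentiment. Training data is made up.

import math
from collections import Counter, defaultdict

train = [("great win for our campaign", "pos"),
         ("proud of our supporters", "pos"),
         ("terrible failed policies", "neg"),
         ("sad dishonest attacks", "neg")]

word_counts = defaultdict(Counter)   # per-class word frequencies
class_counts = Counter()             # class priors
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest log-posterior."""
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace (+1) smoothing keeps unseen words from zeroing out.
            score += math.log((word_counts[label][w] + 1)
                              / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("great supporters"))  # → pos
```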


    After all the groups presented their work, Jimmy Lin announced ArchivesUnleashed Inc., a Delaware non-profit corporation aiming to create knowledge around the scholarly use of web archives. The board of directors of this new organization includes:
    • Ian Milligan (Assistant Professor, Department of History, University of Waterloo)
    • Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University)
    • Jimmy Lin (David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo)
    • Nathalie Casemajor (Assistant Professor, Department of Social Sciences, University of Québec)
    • Nicholas Worby (Government Information & Statistics Librarian, University of Toronto)

    The winning team

    Ian Milligan announced the winning team, Counter-Terrorism (congratulations to Daniel Kerchner and Emily Gade). In addition, the top four teams (Counter-Terrorism, Team Turtle, I Know What You Hid Last Summer, and Mojitos) were selected to present their work during the next day's event, Saving the Web.




    --Mohamed Aturban