Monday, October 31, 2016

2016-10-31: Two Days at IEEE VIS 2016

I attended a brief portion of IEEE VIS 2016 in Baltimore on Monday and Tuesday (returning to ODU on Wednesday for our celebration of the Internet Archive's 20th anniversary).  As the name might suggest, VIS is the main academic conference for visualization researchers and practitioners. I've taught our graduate Information Visualization course for several years (project blog posts, project gallery), so I've enjoyed being able to attend VIS occasionally. (Mat Kelly presented a poster in 2013 and wrote up a great trip report.)

This year's conference was held at the Baltimore Hilton, just outside the gates of Oriole Park at Camden Yards.  If there had been a game, we could have watched it during our breaks.





My excuse to attend this year (besides the close proximity) was that another class project was accepted as a poster. Juliette Pardue, Mridul Sen (@mridulish), and Christos Tsolakis took a previous semester's project, FluNet Vis, and generalized it. WorldVis (2-pager, PDF poster, live demo) allows users to load and visualize datasets of annual world data with a choropleth map and line charts. It also includes a scented widget in the form of a histogram showing the percentage of countries with reported data for each year in the dataset.

Before I get to the actual conference, I'd like to give kudos to whoever picked out the conference badges. I loved having a pen (or two) always handy.
Monday, October 24

Monday was "workshop and associated events" day.  If you're registered for the full conference, then you're able to attend any of these pre-main conference events (instead of having to pick and register for just one). This is nice, but results in lots of conflict in determining which interesting session to attend.  Thankfully, the community is full of live tweeters (#ieeevis), so I was able to follow along with some of the sessions I missed. It was a unique experience to be at a conference that appealed not only to my interest in visualization, but also to my interests in digital humanities and computer networking.

I was able to attend parts of 3 events:
  • VizSec (IEEE Symposium on Visualization for Cyber Security) - #VizSec
  • Vis4DH (Workshop on Visualization for the Digital Humanities) - #Vis4DH
  • BELIV (Beyond Time and Errors: Novel Evaluation Methods for Visualization) Workshop - #BELIV
VizSec

The VizSec keynote, "The State of (Viz) Security", was given by Jay Jacobs (@jayjacobs), Senior Data Scientist at BitSight, co-author of Data-Driven Security, and host of the Data-Driven Security podcast. He shared some of his perspectives as Lead Data Analyst on multiple Data Breach Investigations Reports. His data-driven approach focused on analyzing security breaches to help decision makers (those in the board room) better protect their organizations against future attacks. Rather than detecting a single breach, the goal is to determine how analysis can help them shore up their security in general. He spoke about how configuration (TLS, certificates, etc.) can be a major problem and that having a P2P share on the network indicates the potential for botnet activity.

In addition, he talked about vis intelligence and how CDFs and confidence intervals are often lost on the general public.

He also mentioned current techniques in displaying IT risk and how some organizations allow for manual override of the analysis.

During question time came a book recommendation: How to Measure Anything in Cybersecurity Risk.

In addition to the keynote, I attended sessions on Case Studies and Visualizing Large Scale Threats.  Here are notes from a few of the presentations.

"CyberPetri at CDX 2016: Real-time Network Situation Awareness", by Dustin Arendt, Dan Best, Russ Burtner and Celeste Lyn Paul, presented analysis of data gathered from the 2016 Cyber Defense Exercise (CDX).

"Uncovering Periodic Network Signals of Cyber Attacks", by Ngoc Anh Huynh, Wee Keong Ng, Alex Ulmer and Jörn Kohlhammer, looked at analyzing network traces of malware and provided a good example of evaluation using a small simulated environment and real network traces.

"Bigfoot: A Geo-based Visualization Methodology for Detecting BGP Threats", by Meenakshi Syamkumar, Ramakrishnan Durairajan and Paul Barford, brought me back to my networking days with a primer on BGP.

"Understanding the Context of Network Traffic Alerts" (video), by Bram Cappers and Jarke J. van Wijk, used machine learning on PCAP traces and built upon their 2015 VizSec paper "SNAPS: Semantic Network traffic Analysis through Projection and Selection" (video).


Vis4DH

DJ Wrisley (@djwrisley) put together a great Storify with tweets from Vis4DH.
Here's a question we also ask in my main research area of digital preservation:

A theme throughout the sessions I attended was the tension between the humanities and the technical ("interpretation vs. observation", "rhetoric vs. objective"). Speakers advocated for technical researchers to attend digital humanities conferences, like DH 2016, to help bridge the gap and get to know others in the area.

There was also a discussion of close reading vs. distant reading.

Distant reading, analyzing the structure of a work, is relatively easy to visualize (frequency of words, parts of speech, character appearances), but close reading is about interpretation and is harder to fit into a visualization. The discussion did, however, bring up the promise of distant reading as a way to navigate to close reading.

BELIV
I made a point of attending the presentation of the 2016 BELIV Impact Award so that I could hear Ben Shneiderman (@benbendc) speak.  He and his long-time collaborator, Catherine Plaisant, were presented with the award for their 2006 paper, "Strategies for Evaluating Information Visualization Tools: Multidimensional In-depth Long-term Case Studies".
Ben's advice was to "get out of the lab" and "work on real problems".

I also attended the "Reflections" paper session, which consisted of position papers from Jessica Hullman (@JessicaHullman), Michael Sedlmair, and Robert Kosara (@eagereyes).  Jessica Hullman's paper focused on evaluations of uncertainty visualizations, and Michael Sedlmair presented seven scenarios (with examples) for design study contributions:
  1. propose a novel technique
  2. reflect on methods
  3. illustrate design guidelines
  4. transfer to other problems
  5. improve understanding of a VIS sub-area
  6. address a problem that your readers care about
  7. strong and convincing evaluation
Robert Kosara challenged the audience to "reexamine what we think we know about visualization" and looked at how some well-known vis guidelines have either recently been questioned or should be questioned.

Tuesday, October 25

The VIS keynote was given by Ricardo Hausmann (@ricardo_hausman), Director at the Center for International Development & Professor of the Practice of Economic Development, Kennedy School of Government, Harvard University. He gave an excellent talk and shared his work on the Atlas of Economic Complexity and his ideas on how technology has played a large role in the wealth gap between rich and poor nations.





After the keynote, in the InfoVis session, Papers Co-Chairs Niklas Elmqvist (@NElmqvist), Bongshin Lee (@bongshin), and Kwan-Liu Ma described a bit of the reviewing process and revealed even more details in a blog post. I especially liked the feedback and statistics (including distribution of scores and review length) that were provided to reviewers (though I didn't get a picture of the slide). I hope to incorporate something like that in the next conference I have a hand in running.

I attended parts of both InfoVis and VAST paper sessions.  There was a ton of interesting work presented in both.  Here are notes from a few of the presentations.

"Visualization by Demonstration: An Interaction Paradigm for Visual Data Exploration" (website with demo video), by Bahador Saket (@bahador10), Hannah Kim, Eli T. Brown, and Alex Endert, presented a new interface for allowing relatively novice users to manipulate their data.  Items start out as a random scatterplot, but users can rearrange the points into bar charts, true scatterplots, add confidence intervals, etc. just by manipulating the graph into the idea of what it should look like.

"Vega-Lite: A Grammar of Interactive Graphics" (video), by Arvind Satyanarayan (@arvindsatya1), Dominik Moritz (@domoritz), Kanit Wongsuphasawat (@kanitw), and Jeffrey Heer (@jeffrey_heer), won the InfoVis Best Paper Award.  This work presents a high-level visualization grammar for building rapid prototypes of common visualization types, using JSON syntax. Vega-Lite can be compiled into Vega specifications, and Vega itself is an extension to the popular D3 library. Vega-Lite came out of the Voyager project, which was presented at InfoVis 2015. The authors mentioned that this work has already been extended - Altair is a Python API for Vega-Lite. One of the key features of Vega-Lite is the ability to create multiple linked views of the data.  The current release only supports a single view, but the authors hope to have multi-view support available by the end of the year.  I'm excited to have my students try out Vega-Lite next semester.

"HindSight: Encouraging Exploration through Direct Encoding of Personal Interaction History", by Mi Feng, Cheng Deng, Evan M. Peck, and Lane Harrison (@laneharrison), allows users to explore visualizations based on their own history in interacting with the visualization.  The tool is also described in a blog post and with demo examples.
"PowerSet: A Comprehensive Visualization of Set Intersections" (video), by Bilal Alsallakh (@bilalalsallakh) and Liu Ren, described a new method for visualizing set data (typically shown in Venn or Euler diagrams) in a rectangle format (similar to a treemap). 


On Tuesday night, I attended the VisLies meetup, which focused on highlighting poor visualizations that had somehow made it into popular media.  This website will be a great resource for my class next semester. I plan to ask each student to pick one of these and explain what went wrong.

Although I was only able to attend two days of the conference, I saw lots of great work that I plan to bring back into the classroom to share with my students.

(In addition to this brief overview, check out Robert Kosara's daily commentaries (Sunday/Monday, Tuesday, Wednesday/Thursday, Thursday/Friday) at https://eagereyes.org.)

-Michele (@weiglemc)

Friday, October 28, 2016

2016-10-27: UrduTech - The GeoCities of Urdu Blogosphere


On December 12, 2008, an Urdu blogger, Muhammad Waris, reported an issue in Urdu Mehfil about his lost blog, which had been hosted on UrduTech.net. Not just Waris, but many other Urdu bloggers of that time were anxious about their lost blogs due to a sudden outage of the blogging service UrduTech. The downtime lasted for several weeks, and it changed the shape of the Urdu blogosphere.

Before diving into the UrduTech story, let's take a brief look at the Urdu language and the role of the Urdu Mehfil forum in promoting Urdu on the Web. Urdu is a language spoken by more than 100 million people worldwide (about 1.5% of the global population), primarily in India and Pakistan. It has a rich literature and has been one of the premier languages of poetry in South Asia for centuries. However, the digital footprint of Urdu has been relatively small compared with that of other languages like Arabic or Hindi. In the early days of the Web, computers were not easily available to the masses of the Urdu-speaking community. Urdu input support was often not built in or required additional software installation and configuration. The right-to-left (RTL) direction of text flow in the Urdu script was another obstacle to writing and reading it on devices optimized for left-to-right languages. There were not many fonts that supported the Urdu character set completely and properly. The most commonly used Nastaleeq typeface was initially only available in a proprietary page-making software called InPage, which did not support Unicode and locked in the content of books and newspapers. Early online Urdu news sites used to export their content as images and publish those on the Web.

The Urdu community initially used to write Urdu text in Roman script on the Web, but efforts to promote Unicode Urdu were happening on a small scale; one such early effort was the Urdu Computing Yahoo Group by Eijaz Ubaid. In 2005, some people from the Urdu community, including Nabeel, Zack, and many others, took the initiative to build a platform to promote Unicode Urdu on the Web. They created UrduWeb and, under it, a discussion board named Urdu Mehfil. It quickly became the hub for Urdu-related discussion, development, and idea exchange. The community created tools to ease the process of reading and writing Urdu on computers and on the Web. They created many beautiful Urdu fonts and keyboard layouts, translated various software and CMS systems and customized themes to make them RTL friendly, created dictionaries and encyclopedias, developed plugins to enable Urdu in various software, developed Urdu variants of Linux, provided technical help and support, digitized printed books, created an Urdu blog aggregator (Saiyarah) to promote blogging and increase the visibility of new bloggers, and gave people a platform to share literary work. These are just a few of UrduWeb's many contributions. These efforts played a significant role in shaping the presence of Urdu on the Web.

I, Sawood Alam, have been associated with UrduWeb since early 2008, with a continuing interest in getting the language and culture online. For the last seven years I have been administering UrduWeb. In this period I have mentored various projects, developed many tools, and taken various initiatives. I recently collaborated with Fateh, another UrduWeb member, to publish a paper entitled "Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages" (PDF), in an effort to enable easy and fast lookup in many classical and culturally significant Urdu dictionaries that are available in scanned form in the Internet Archive.

To give a sense of the increased activity and presence of Urdu on the Web, we can take a couple of examples. In 2007, when UrduTech was introduced as a blogging platform, Urdu Wikipedia was in the third group of languages on Wikipedia by number of articles, with only 1,000+ articles. Fast forward to 2016: it has jumped to the second group of languages, with 100,000+ articles and actively growing.


In May 2015, the Google Translate Community hosted a translation challenge in which Urdu surfaced among the top ten most contributing languages, which Google Translate highlighted as, "Notably Bengali and Urdu are in the lead along with some larger languages."


Now, back to the Urdu blogging story. In 2007, the WordPress CMS was the most popular blogging software for those who could afford to host their own site and make it work. For those who were not technically inclined or did not want to pay for hosting, WordPress and Blogger were among the most popular free hosted blogging platforms. However, when it came to Urdu, both platforms had limitations. WordPress allowed flexible options for plugins, translations, theming, etc., but only if one ran the CMS on one's own server; the free hosted service, in contrast, had a limited number of themes, none of which were RTL friendly, and it did not allow custom plugins either. This meant that changing the CSS to better suit the rendering of mixed bidirectional content was not possible, so lines containing bidirectional text (which is not uncommon in Urdu) would render in an unnatural and unreadable order. The lack of custom plugin support also meant that providing JavaScript-based Urdu input support in the reply form was not an option; as a result, articles would receive more comments in Roman script than in Urdu. Blogger, on the other hand, allowed theme customization, but its comment form was rendered inside an iframe, which offered no way to inject external JavaScript to enable Urdu input support. As a result, Urdu bloggers who chose one of these free hosted blogging services had to make some compromises.

The technical friction of getting things to work for Urdu was a big reason for the slow adoption of Urdu blogging. To make it easier, Imran Hameed, a member of UrduWeb, introduced the UrduTech blogging service. People from UrduWeb, including Mohib, Ammar, Mawra, and some others, encouraged many people to start Urdu blogging. UrduTech used WordPress MU to allow multi-user blogging on a single installation and was hosted on a shared hosting service. Creating a new blog was as simple as filling out an online form with three fields and hitting the "Next" button. From there, one could choose from a handful of beautiful RTL-friendly themes and enable pre-installed add-ons for Urdu input support, both in the dashboard for post writing and on the public-facing site for comments. By removing all the friction WordPress and Blogger had, UrduTech gave a big boost to the Urdu community, and many people started creating their blogs.


It turned out that creating a new blog on UrduTech was easy not just for legitimate people, but for spammers as well. This is evident from the earliest capture of UrduTech.net in the Internet Archive. Unfortunately, the stylesheets, images, and other resources were not well archived, so please bear with the ugly looking (damaged memento) screenshots.


Later captures in the web archive show that as the Urdu blogger community grew on UrduTech, so did the attacks from spam bots. This increased the burden on the moderators, who had to actively and regularly clean up spam registrations.


The service ran for a little over a year with occasional minor downtimes. The Urdu blogosphere slowly started evolving, and the diversity of the content increased. During this period, some people gradually started migrating to other blogging platforms, such as their own free or paid hosting, other Urdu blogging offerings, or the free hosted services of WordPress and Blogger. This is evident from the blogrolls of various bloggers in their archived copies.

Increasing activity on UrduTech from both humans and bots led to the point where the shared hosting provider decided to shut the service down without any warning. People were anxious about the sudden loss of their content and demanded backups. Who makes backups? (Hint: Web archives!) Imran, the founder of the service, was busy with other priorities, and it took him more than a month to bring the service back online. In the interim, people either decided never to blog again or swiftly moved on to other, more robust options to start over from scratch (as Waris did), having learned the hard way to back up their content regularly.


"Did Waris really lost all his hard work and hundreds of valuable articles he wrote about Urdu and Persian literature and poetry?" I asked myself. The answer was perhaps to be found somewhere in 20,000 hard drives of the Internet Archive. However, I didn't know his lost blog's URL, but the Internet Archive was there to help. I first looked through a few captures of the UrduTech in the archive, from there I was able to find his blog link. I was happy to discover that his blog's home page a was archived a few times, however the permalinks of individual blog posts were not. Also, the pages of the blog home with older posts were not archived either. This means, from the last capture, only the 25 latest posts can be retrieved (without comments). When other earlier captures of the home page are combined, a few more posts can be archived, but perhaps not all of them. Although the stylesheet and various template resources are missing, the images in the post are archived, which is great.


What happened to the UrduTech service? When it came back online after the long outage, many people had already lost interest and trust in the service. In less than three months, the service went down again, but this time it was the ultimate death of the service; eventually the domain name registration expired.

Due to its popularity and search engine ranking, the domain was a good target for drop catching. Mementos (captures) between November 27, 2011 and December 18, 2014 show a blank page when viewed in the Wayback Machine. A closer inspection of the page source reveals what is happening: using JavaScript, the page forces itself into the top frame (if it is not there already), and it uses frames to load more content. Unfortunately, the resources inside the frames were not archived, so it is difficult to say how the page looked during that period. However, there is some plain-text "noframes" fallback revealing that the domain drop catchers were trying to exploit the "tech" keyword in the UrduTech name, though they had nothing to do with Urdu.


Sometime before March 25, 2015, the domain name presumably went through another drop catch. Alternatively, it is possible that the same domain owner decided to host a different type of content on the domain. Whatever the case, since then the domain has been serving a health-related "legitimate-looking fake" site; it is still live and adds new content every now and then. However, the content of the site has nothing to do with either "Urdu" or "tech".


UrduTech simplified what was then a challenging task, made it accessible to people with little technical skill, and grew the community; the service itself died, but the community moved on (albeit the hard way) and transformed into a more mature and stable blogosphere. It played the same role for Urdu blogging that GeoCities did for personal home page hosting, only on a smaller scale and for a specific community. Over time, Web technology matured, support for Urdu on computers and smartphones improved, awareness of the tools and technologies grew in the community in general, and new communication media such as social media sites helped spread the word and connect people. Now the Urdu blogosphere has grown significantly, and people in the community organize regular meetups and Urdu blogger conferences. Manzarnamah, another initiative from UrduWeb members, introduces new bloggers to the community, publishes interviews with regular bloggers, and distributes annual awards to bloggers. Bilal, another member of UrduWeb, is independently creating tools and guides to help new bloggers and the Urdu community in general. UrduTech was certainly not the only driving force for Urdu blogging, but it did play a significant role.


On the occasion of the Internet Archive's 20th birthday celebration (#IA20), on behalf of the WS-DL Research Group and the Urdu community, I extend my gratitude for preserving the Web for 20 long years. Happy birthday, Internet Archive; keep preserving the Web for many more years to come. I could only wish that the preservation were more complete and less damaged, but having something is better than nothing, and as DSHR puts it, "You get what you get and you don't get upset." Without these archived copies I would not be able to augment my own memories and tell the story of the evolution of a community that is very dear to me and to many others. I can only imagine how many more such stories are buried in the spinning disks of the Internet Archive.

--
Sawood Alam

Wednesday, October 26, 2016

2016-10-26: They should not be forgotten!

Source: http://www.masrawy.com/News/News_Various/details/2015/6/7/596077/أسرة-الشهيد-أحمد-بسيوني-فوجئنا-بصورته-على-قناة-الشرق-والقناة-نرفض-التصريح
I remember his face and smile very well. It was very tough for me to look at his smile and realize that he will not be in this world again. It got worse when I read his story and the stories of many others who died defending the future of my home country, Egypt, hoping to draw a better future for their kids. Ahmed Basiony, one of Egypt's great artists, was killed by the Egyptian regime on January 28, 2011. One of the main reasons that drove Basiony to participate in the protests was to film police beatings and document the protests. While filming, he also used his camera during the demonstration to zoom in on the soldiers and warn the people around him so they could take cover before the gunfire began. Suddenly, his camera fell to the ground.

Basiony was the dad of two kids, one and six years old. He was loved by everyone who knew him.  I hope Basiony's story and the stories of others will remain for future generations.


Basiony was among the protesters in the first days of the Egyptian Revolution.
Source: https://www.facebook.com/photo.php?fbid=206347302708907&set=a.139725092704462.24594.100000009164407&type=3&theater
curl -I http://1000memories.com/egypt 
HTTP/1.1 404 Not Found 
Date: Tue, 25 Oct 2016 16:53:04 GMT 
Server: nginx/1.4.6 (Ubuntu) 
Content-Type: text/html; charset=UTF-8


Basiony's information, along with that of many other martyrs, was documented at the site 1000memories.com/egypt. The 1000memories site contained a digital collection of around 403 martyrs, with information about their lives. The entire Web site is unavailable now, and the Internet Archive is the only place where it was archived. 1000memories is not the only site that has disappeared; many other repositories containing videos, images, and other materials documenting the 18 days of the Egyptian Revolution have disappeared as well. Examples are iamtahrir.com (archived version), which contained the artwork produced during the Egyptian Revolution, and 25Leaks.com (archived versions), which contained hundreds of important papers posted by people during the revolution. Both sites were created for collecting content related to the Egyptian Revolution.
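A quick way to check whether a now-dead URL like this has an archived copy is the Internet Archive's Wayback availability API; the following is a minimal sketch (the URL is the one discussed above, everything else is illustrative).

# A minimal sketch: ask the Wayback Machine whether a dead URL has an
# archived copy, using the public availability API.
import requests

def closest_memento(url, timestamp=None):
    """Return info about the closest archived snapshot of a URL, or None."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss, optional
    resp = requests.get("https://archive.org/wayback/available",
                        params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

snapshot = closest_memento("http://1000memories.com/egypt", "20110301")
if snapshot:
    print(snapshot["url"], snapshot["timestamp"], snapshot["status"])
else:
    print("No archived copy found")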

The January 25 Egyptian Revolution is one of the most important events in recent history. Several books and initiatives have been published to document the 18 days of the revolution. These books cite many digital collections and other sites that were dedicated to documenting the Egyptian Revolution (e.g., 25Leaks.com). Unfortunately, the links to many of these Web sites are now broken, and there is no way (without the archives) to know what they contained.

Luckily, 1000memories.com/egypt has multiple copies in the "Egypt Revolution and Politics" collection in Archive-It, a subscription service from the Internet Archive that allows institutions to develop, curate, and preserve collections of Web resources. I'm glad I found information about Basiony and many more martyrs archived!


Archiving Web pages is a method for ensuring these resources are available for posterity. My PhD research focused on exploring methods for summarizing and interacting with collections in Archive-It, and recording the events of the Egyptian Revolution spurred my initial interest in web archiving. My research necessarily focused on quantitative analysis, but this post has allowed me to revisit the humanity behind these web pages that would be lost without web archiving.



--Yasmin

Tuesday, October 25, 2016

2016-10-25: Web Archive Study Informs Website Design

Shortly after beginning my Ph.D. research with the Old Dominion University Web Science and Digital Libraries team, I also rediscovered a Hampton Roads folk music non-profit I had spent a lot of time with years before.  Somehow I was talked into joining the board (not necessarily the most sensible thing when pursuing a Ph.D.).

My research area being digital preservation and web archiving, I decided to have a look at the Tidewater Friends of Folk Music (TFFM) website and its archived web pages (mementos).  Naturally, I looked at the oldest copy of the home page available, from 2002-01-25.  What I found is definitely reminiscent of early, mostly hand-coded HTML:

tffm.org 2002-01-25 23:57:26 GMT (Internet Archive)
https://web.archive.org/web/20020125235726/http://tffm.org/


Of course, the most important thing for most people is concerts, so I had a look at the concerts page too. (Interestingly, the oldest concerts page available is five years newer than the oldest home page; this phenomenon was the subject of my JCDL 2013 paper.)


tffm.org/concerts 2007-10-07 06:17:32 GMT (Internet Archive)
https://web.archive.org/web/20071007061732/http://tffm.org/concerts.html

Clicking my way through the home and concerts page mementos, I found little had changed over time other than the masthead image.

Mementos captured 2005-08-26 21:05:28 GMT, 2005-12-11 09:23:55 GMT, and 2009-08-31 06:31:40 GMT

The end result is that I became, and remain, TFFM's webmaster.  However, studying web archive quality, that is, completeness and temporal coherence, has greatly influenced my redesigns of the TFFM website.  First up was bringing the most important information to the forefront in a much more readable and navigable format.  Here is a memento captured 2011-05-23:

tffm.org 2011-05-23 11:10:54 GMT (Internet Archive)
https://web.archive.org/web/20110523111054/http://www.tffm.org/concerts.html

As part of the redesign, I put my new-found knowledge of archival crawlers to use.  The TFFM website now had a proper sitemap, and every concert had its own URI with very few URI aliases.  This design lasted until the TFFM board decided to replace "Folk" with "Acoustic," changing the name to Tidewater Friends of Acoustic Music (TFAM).
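As an illustration of the kind of sitemap a crawler-friendly redesign calls for, here is a minimal sketch in Python; the concert URIs are hypothetical examples, not the actual TFFM ones.

# A minimal sketch: generate a sitemap.xml so archival (and search) crawlers
# can discover every concert page. The URIs below are hypothetical examples.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML for the given list of absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# One canonical URI per concert, plus the home page (hypothetical paths).
print(build_sitemap([
    "http://tffm.org/",
    "http://tffm.org/concerts/2011-06-04-example-artist",
    "http://tffm.org/concerts/2011-07-09-another-artist",
]))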

Along with the change came a brighter look and a mobile-friendly design.  Again putting knowledge from my Ph.D. studies to work, the mobile-friendly design is responsive, adapting to the user's device, rather than incorporating a second set of URIs and an independent design.  With the responsive approach, archived copies replay correctly in both mobile and desktop browsers.

tidewateracoustic.org 2014-10-07 01:56:07 GMT
https://web.archive.org/web/20141007015607/http://tidewateracoustic.org/

After watching several fellow Ph.D. students struggle with the impact of JavaScript and dynamic HTML on archivability, I elected to minimize the use of JavaScript on the TFAM site.  JavaScript greatly complicates web archiving and reduces archive quality significantly.

So, the sensibility of taking on a volunteer website project while pursuing my Ph.D. aside, I can say that in some ways the two have synergy.  My Ph.D. studies have influenced the design of the TFAM website and the TFAM website is a small, practical, and personal proving ground for my Ph.D. work.  The two have complemented each other well.

Enjoy live music? Check out http://tidewateracoustic.org!

— Scott G. Ainsworth



2016-10-26: A look back at the 2008 and 2012 US General Elections via Web Archives

Web archives perform the crucial service of preserving our collective digital heritage. October 26, 2016 marks the 20th anniversary of the Internet Archive, and the United States presidential election will take place on November 8, 2016.  To commemorate both occasions, let us look at the 2008 and 2012 US general elections as told by Web archives, from the perspectives of CNN and Fox News. We started with three news outlets (MSNBC, CNN, and Fox News) in order to capture both ends of the political spectrum. However, msnbc.com has redirected to various URLs in the past (e.g., msnbc.msn.com, nbcnews.com), and as a result the site is not well archived.

Obama vs McCain - Fox News (2008)
Obama vs McCain - CNN (2008)

Obama vs Romney - Fox News (2012)
The archives show that the current concerns about voter fraud and election irregularities are not new (at least on Fox News; we did not find corresponding stories at CNN).
This Fox News page contains a story titled: "Government on High Alert for Voter Fraud" (2008)

Fox News: "Trouble at the ballot box" (2008)

Fox News claims that a mural of Obama at a Philly polling station, which a judge had ordered covered, was not properly covered (2012)
Obama vs Romney - CNN (2012)
We appreciate the ability to tell these stories by virtue of the presence of public Web archives such as the Internet Archive. We also appreciate frameworks such as the Memento protocol, which provides a means to access multiple web archives, and tools such as Sawood's MemGator, which implements the Memento protocol. For the comprehensive list of mementos (extracted with MemGator) for these stories, see the Table vis or Timeline vis.
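As an illustration of how such memento lists can be gathered, here is a rough sketch of fetching and parsing a Memento TimeMap in Python. The aggregator base URL follows the pattern a locally running MemGator server uses, but treat the exact endpoint as an assumption and substitute your own instance (or any other Memento TimeMap endpoint).

# A rough sketch: fetch a Memento TimeMap (link format) and list its mementos.
# The aggregator base URL is an assumption; point it at your own MemGator
# instance or another Memento TimeMap endpoint.
import re
import requests

AGGREGATOR = "http://localhost:1208"   # assumed local MemGator server
URI_R = "http://www.foxnews.com/"      # original resource of interest

timemap = requests.get(f"{AGGREGATOR}/timemap/link/{URI_R}", timeout=60).text

# Each memento entry looks roughly like:
#   <urim>; rel="memento"; datetime="Tue, 04 Nov 2008 12:00:00 GMT",
for line in timemap.splitlines():
    if 'datetime="' not in line:
        continue  # skip original, timegate, and timemap links
    urim = re.search(r"<([^>]+)>", line)
    dt = re.search(r'datetime="([^"]+)"', line)
    if urim and dt:
        print(dt.group(1), urim.group(1))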
--Nwala

2016-10-25: Paper in the Archive

Mat reports on his journalistic experience and how we can relive it through Internet Archive (#IA20)                            

We have our collections, the things we care about, the mementos that remind us of our past. Many of these things reside on the Web. For those we want to recall and should have (in hindsight) saved, we turn to the Internet Archive.

As a computer science (CS) undergrad at the University of Florida, I worked at the student-run university newspaper, The Independent Florida Alligator. This experience became particularly relevant with my recent scholarship on preserving online news. At the paper, we reported mostly on the university community, but also on news that catered to the ACRs through reports about Gainesville (e.g., city politics).

News is compiled late in the day to maximize temporal currency. I started at the paper as a "Section Producer" and eventually evolved to be a Managing Editor. I was in charge of the online edition, the "New Media" counterpart of the daily print edition -- Alligator Online. The late shift fit well with my already established coding schedule.

Proof from '05, with the 'thew' still intact.

The Alligator is an independent newspaper -- the content we published could conflict with the university without fear of being censored by the university. Typical university-affiliated college newspapers have this conflict of interest, which potentially limits their content to only that which is approved. This was part of the draw of the paper for me and, I imagine, for the student readers seeking less biased reporting. The orange distribution boxes were often empty well before day's end. Students and ACRs read the print paper. As a CS student, I preferred Alligator Online.

With a unique technical perspective among my journalistic peers, I introduced a homebrewed content management system (CMS) into the online production process. This allowed Alligator Online to focus on porting the print content and not on futzing with markup. This also made the content far more accessible and, as time has shown thanks to Internet Archive, preservable.

Internet Archive's capture of Alligator Online at alligator.org over time with my time there highlighted in orange.

After graduating from UF in 2006, I continued to live and work elsewhere in Gainesville for a few years. Even though I was then technically an ACR, I still preferred Alligator Online to print. A new set of students transitioned into production of Alligator Online and eventually deployed a new CMS.

Now, as a CS PhD student studying the past Web, I have observed a decline in accessibility that occurred after I had moved on from the paper. This corresponds with our work On the Change in Archivability of Websites Over Time (PDF). Thankfully, adaptations at Alligator Online, and possibly at IA, have allowed the preservation rate to recover (see above, post-tenure).

alligator.org before (2004) and after (2006) I managed, per captures by Internet Archive.

With the Internet Archive celebrating 20 years in existence (#IA20), IA has provided the means for me to see the aforementioned trend over time. My knowledge of web standards and accessibility in the mid-2000s facilitated preservation. Because of this, and with special thanks to IA, the collections of pages I care about -- the mementos that remind me of my past -- are accessible and well preserved.

— Mat (@machawk1)

NOTE: Only after publishing this post did I think to check alligator.org's robots.txt file as archived by IA. The final capture of alligator.org in 2007, before the next temporally adjacent one in 2009, occurred on August 7, 2007. At that time (and prior), no robots.txt file existed for alligator.org, despite IA preserving the 404. Around late October of that same year, a robots.txt file was introduced with the lines:
User-Agent: *
Disallow: /

Monday, October 24, 2016

2016-10-24: 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) Trip Report


"Dad, he is pushing random doorbell buttons", Dr. Herzog's daughter complained about her brother while we were walking back home late night after having dinner in the city center of Potsdam. Dr. Herzog smiled and replied, "it's rather a cool idea, let's all do it". Repeating the TPDL 2015 tradition, Dr. Michael Herzog's family was hosting me (Sawood Alam) at their place after the TPDL 2016 conference in Hannover. Leaving some cursing people behind (who were disturbed by false doorbells), he asked me, "how was your conference this year?"

Day 1



Of the first day's two parallel sessions, I attended the Doctoral Consortium as a participant. The chair, Kjetil Nørvåg of the Norwegian University of Science and Technology, Norway, began the session with a formal introduction of the session structure and timeline. Out of the seven accepted Doctoral Consortium submissions, only five could make it to the workshop.
My talk was mainly praised for its good content organization, an easy-to-follow story for the problem description, a tiered approach to solving problems, and the inclusion of work and publication plans. Konstantina's talk on political bias identification generated the most discussion during the Q&A session. I owe her references to A visual history of Donald Trump dominating the news cycle and Text analysis of Trump's tweets confirms he writes only the (angrier) Android half.


Each presenter was assigned a mentor to give more in-depth feedback on their work and provide an outsider's perspective that would help define the scope of the thesis and identify parts that might need more elaboration. After the formal presentation session, presenters spread out for one-on-one sessions with their corresponding mentors. Nattiya Kanhabua, from Aalborg University, Denmark, was my mentor. She provided great feedback and some useful references relevant to my research. We also talked about possibilities for future collaboration where our research interests intersect.


After the conclusion of the Doctoral Consortium Workshop, we headed to the Technische Informationsbibliothek (TIB), where Mila Runnwerth welcomed us to the German National Library of Science and Technology. She gave us an insightful presentation followed by a guided tour of the library facilities.


Day 2


The main conference started on the second day with David Bainbridge's keynote presentation on "Mozart's Laptop: Implications for Creativity in Multimedia Digital Libraries and Beyond". He introduced a tool named Expeditee that provides a universal UI for text, image, and music interaction. The talk was full of interesting references and demonstrations, such as querying music by humming. Following the keynote, I attended the Digital Humanities track while missing the other two parallel tracks.
Then I moved to another track for Search and User Aspects sessions.
Following the regular presentation tracks, the Posters and Demos session was scheduled. It came as a surprise to me that all the Doctoral Consortium submissions were automatically included in the Posters session (alongside the regular poster and demo submissions) and assigned reserved places in the poster hall, which meant I had to do something for the traditional Minute Madness event that I was not prepared for. So I ended up reusing the #IAmNotAGator gag that I had prepared for the JCDL 2016 Minute Madness and used the poster time to advertise MemGator and Memento.


Day 3


On the second day of the conference I had two papers to present, so I decided to wear business formal attire. As a consequence, the conference photographer stopped me at the building entrance and asked me to pose for him near the information desk. The lady at the information desk tried to explain routes to various places in the city to me, but the modeling session went on so long that it became awkward and we both started smiling.

The day began with Jan Rybicki's keynote talk on "Pretty Things Done with (Electronic) Texts: Why We Need Full-Text Access". This was the first time I came across the term stylometry. His slides were full of beautiful visualizations. The tool used to generate the data for the visualizations is published as an R package called stylo. After the keynote, I attended the Web Archives session.


After the lunch break I moved to the Short Papers track where I had my second presentation of the day.


After the coffee break I attended the Multimedia and Time Aspects track while missing the panel session on Digital Humanities and eInfrastructures.
In the evening we headed to the XII Apostel Hannover for the conference dinner. The food was good. During the dinner they announced Giannis Tsakonas and Joffrey Decourselle as the best paper and best poster winners, respectively.


Day 4


On the last day of the main conference I decided to skip the panel and tutorial tracks in favor of the Digital Library Evaluation research track.
After a brief coffee break, everyone gathered for the closing keynote presentation by Tony Veale on "Metaphors All the Way Down: The many practical uses of figurative language understanding". The talk was very informative, interesting, and full of hilarious examples. He mentioned the Library of Babel, which reminded me of a digital implementation of it and a video talking about it. The slides looked more like a comic strip, which was very much in line with the theme of the talk, and it ended with various Twitter bots such as MetaphorIsMyBusiness and MetaphorMirror.

Following the closing keynote, the main conference concluded with some announcements. Next year, TPDL 2017 will be hosted in Thessaloniki, Greece, during September 17-21, 2017. TPDL is willing to expand its scope and is encouraging young researchers to come forward with session ideas, chair events, and take the lead. People who are active on social media and in scientific communities are encouraged to spread the word to bring more awareness and participation. This year's Twitter hashtag was #TPDL2016, where all the relevant tweets can be found.


The rest of the afternoon I spent in the Alexandria Workshop.

Day 5


It was my last day in Hannover. I checked out of the conference hotel, the Congress Hotel am Stadtpark Hannover. The hotel was located next to the conference venue, and the views from it were good. However, the experience at the hotel was not very good. It was far from the city center, and there were no restaurants nearby. Despite my complaints, I found an insect jumping on my laptop and bed on the fifteenth floor, late at night, for two consecutive nights. The basic Wi-Fi was useless and unreliable; in my opinion, high-speed Wi-Fi in hotels should no longer be counted as a luxury amenity, especially for business visitors. The hotel was not cheap either. Organizers should consider these factors when choosing a conference venue and hotel.

I realized I still had some time to spare before beginning my journey, so I decided to go to the conference venue, where the Alexandria Workshop was ongoing. I was able to catch the keynote by Jane Winters, in which she talked about many familiar Web archiving related projects. Then I headed to the Hannover city center to catch the train to Stendal.

"I know the rest of the story, since I received you in Stendal", Dr. Herzog interrupted me. We have reached home and it was already very late, hence, we called it a night and went to our beds.

Post-conference Days


After the conference, I spent a couple of days with Dr. Herzog's family on my way back. We visited Stendal University of Applied Sciences, met some interesting people for lunch at Schlosshotel Tangermünde, explored Potsdam by walking and biking, did some souvenir shopping and kitchen experiments, visited Dr. Herzog's daughter's school and the Freie Universität Berlin campus along with many other historical places on our way, and had dinner in Berlin, where I finally revealed the secret of the disappearing-earphone magic trick to Mrs. Herzog. On Sunday morning Dr. Herzog dropped me off at the Berlin airport.


Dr. Herzog is a great host and tour guide. He has a beautiful, lovely, and welcoming family. Visiting his family is by itself a sufficient reason for me to visit Germany anytime.

--
Sawood Alam