On Saturday, my colleague, Justin Brunelle, and I took off on a road trip to attend this year’s JCDL conference in Washington, D.C. We arrived at the nation’s capital earlier that evening and began preparing our presentations after settling in at the George Washington University Inn. Both of us were accepted to present our work at the conference’s Doctoral Consortium. Justin has already blogged about the consortium and our experience in his brilliant blog post.
The conference started on the following Monday (June 10th). Registration went smoothly and we all took our seats at the Betts Theatre in the Marvin Center, which sits in the heart of George Washington University. Barrie Howard and Karim Boughida (the conference co-chairs) gave the welcoming remarks and were followed by Leo M. Chalupa, the Vice President of Research at the university.
Michael Nelson opened the session and introduced the keynote speaker, Jason Scott. Winning the award for the Most “Optimistic” Title (“All you Cared About Is Gone and All your Friends are Dead: Fun Frolic of Preservation Activism”), he described his work as a technology historian, starting from textfiles.com and computer bulletin board systems. Working to shift our perception from just collecting pieces of history to collecting stories, he interviewed John Sheets from Bell Labs to tell his story. Jason also discussed corporations closing down services with either no export facilities or not enough time to migrate. He elaborated with examples: Yahoo closing Geocities and Yahoo Pets; the Magnolia bookmarking service going down; the Tabblo social networking service being bought by HP; and finally MobileMe by Apple. He concluded by speaking about his team (The Archive Team) and their efforts to preserve the web and to crowdsource the archiving task using Archive Team Warrior.
The conference was designed to host two tracks simultaneously, so it was fairly hard to choose which sessions to attend. Even though all the titles seemed interesting, I decided to attend the sessions closest to my field of expertise. Robert Sanderson chaired the first session and introduced Cathy Marshall from Microsoft Research. Cathy presented her full paper entitled, “On the Institutional Archiving of Social Media”. In it she described the work she has done to shed light on the controversy surrounding archiving social media. She started by examining a video filmed by the Library of Congress in which young students were interviewed, depicting the problem. After that she conducted six surveys of users, inquiring about their attitudes and practices and asking for their opinions. She used Amazon’s Mechanical Turk in this study and asked participants what they thought about the announcement that the Library of Congress has all the Tweets from the day Twitter started (in 2006). Following that, she investigated their opinions on releasing this feed to researchers now, to the public, or to the public after 50 years. Her study concluded that the concept of ownership in social media is fuzzy and that social norms regarding reuse are evolving.
The next presentation was by Maristella Agosti from Padua University in Italy. The talk was entitled, “To Envisage and Design the Transition from a Digital Archive System Developed from Domain Experts to One for Non-Domain Users”. As the title suggests, she stressed the need to shift our focus from specialized users to the general public in digital cultural heritage collections. She also suggested that those collection models should be open to researchers from outside the field and that this culture should be made accessible to the public. She described the different levels of the model, starting from two open collections from Dublin and Padua in a pilot study. Following this presentation, our colleague, Kalpesh Padia, presented his paper entitled, “Visualizing Digital Collections at Archive-It”. In this presentation, he illustrated his work on a visualization tool that describes collections on Archive-It (the subscription service from the Internet Archive).
The next presentation was by Jillian Wallis and was entitled, “Data, Data Use & Inquiry: A New Point of View on Data Curation”. She took a case study of two branches of science (astronomy and environmental science) and interviewed researchers from both fields about the data they collected and used in their analysis. She also inquired about its size and where it was collected from, in order to give a better idea of how this data will be reused across different disciplines. She defined what foreground and background data are and explained how background data for one researcher could be foreground data for another. She asked a question: Are we undervaluing data by discarding and not citing it? Finally, she concluded that the use of data, and thus its citation, depends heavily on the type of research. Next, Dharitri Misra presented a paper entitled, “Digital Preservation and Knowledge Discovery Based on Documents from an International Health Science Program”. In this paper, she showed that biomedical documents are important not only for the explicit data they contain but also for detecting hidden trends across masses of documents, and she explained how useful these trends are. Her work described a case study of the U.S.-Japan CMSP, which holds more than 50 years of medical data.
Following a break, another pair of sessions started. I chose a panel session that caught my attention, entitled, “Big Data Is Already Here, and It’s Not Always What We Think”. This session was conducted by representatives from the Library of Congress. Leslie Johnston discussed the shift in perception from the term “records” to “data” and how library collections could be mined like data. She defined “big data” as an amount of data that is relatively hard to manage, and she demonstrated how fluid this definition has been over the years. She gave three case studies, including the Historic Newspaper collection, which has compiled 5 million pages from 25 states, digitized from microfilm. This collection gets 5 million views per day. She continued to the next case, the most anticipated and controversial collection: the Twitter dataset. Leslie said she had the complete Twitter dataset from 2006 to 2010, which spans 21 billion tweets. They are stored in JSON format and reach 20TB in size. This collection has been stored with a large amount of metadata that would be invaluable to researchers in the field. She gladly announced that this data will be released to researchers later this year. Following Leslie, James Snyder (the director of the Motion Picture and Broadcasting Division) discussed his work and his division’s collection of more than 6.75 million unique items in various file formats. He explained that his division is responsible for preserving these collections for the 125-year duration of the copyright period. Jane Mandelbaum followed James and discussed accessibility and how users will eventually be able to find the information they need from the portals of the Library of Congress. She emphasized that we need a better understanding of the data and the metadata in order to maximize the benefit.
Trevor Owens (also from the Library of Congress) said he is currently working on the NDIIPP project and discussed visualization and exploration of data. Finally, the panel took some questions about their projects.
After the final break of the day, I attended a session directed by Sally Jo Cunningham. The first paper was presented by Timothy Cole and was entitled, “Descriptive Metadata, Iconclass, and Digitized Emblem Literature”. He began by explaining what emblems are and how they differ in size, form, and shape. The objectives of his paper were to expand the corpus of digitized emblems and build a cross-repository portal for emblem studies. He emphasized multi-granular types of discovery and access, aiming for more international interoperability. Finally, he presented Emblematica Online, a web-based resource for searching and browsing emblems. Next, Jin Ha Lee and Xiao Hu presented their paper entitled, “Generating Ground Truth for Music Mood Classification Using Mechanical Turk”. This is a really interesting study, which began by analyzing and differentiating between mood, affect, and emotion. Jin Ha explained the MIREX project, which has been classifying music mood since 2007, and its differences. Because mood classification is a user-based analysis, it is more difficult and hard to replicate, hence the need for ground truth data. Xiao explained that they used 5 clusters and performed their experiments on Amazon’s Mechanical Turk by selecting good workers. They picked 1,250 songs and registered 2 judgments each. This experiment ran for 19 days and cost approximately $60.00. They tested the performance against 9 algorithms and displayed the results on Russell’s model of mood affect.
The next presentation was by Yinlin Chen and was entitled, “Categorization of Computing Education Resources into ACM Computing Classification System”. In his presentation, Paul talked about the Ensemble Project and his approach, which used existing ACM DL metadata to build classifiers for harvested resources. Tadashi Nomoto followed Paul and presented his paper entitled, “Re-ranking Bibliographic Records for Personalized Library Search”. He presented an approach to accessing bibliographic records in a way that responds to the user’s preferences.
Shortly after the break, the minute madness started! The minute madness is a 60-second teaser in which each author can “advertise” their poster or demo for the next session. The poster session was held side by side with the demo session, and it was a great opportunity to speak with the authors and also to support our colleague, Mat Kelly, in his demo/poster.
Day two followed the same pattern, and after breakfast all the attendees gathered at the Betts Theatre. Herbert Van de Sompel welcomed the attendees and introduced the keynote speaker, Carole Goble. Carole is a professor in the School of Computer Science at the University of Manchester, UK and the Director of the myGrid Project. In a speech entitled, “The Reality of Reproducibility for in Silico”, she started by describing the concept of reproducibility in the scientific field. She argued that it is a form of virtual witnessing of the scientific method and that its documentation incorporates the materials used and the method of proof. Citing Nature Magazine, she reported that 47 out of 53 landmark publications are irreproducible (an astonishing number). She also described the role of libraries in software sustainability, including providing special codes, data collections, service-based science, and cloud-hosted services. Claiming that dependency is the root of decay, she explained the difference between replication/repetition and reproduction. She also described the areas where librarianship is crucial in preserving the scientific workflow, restoring it, and finally reconstructing it.
Shortly after the break, another pair of sessions started. I attended the first two presentations from Session 7 and later switched to the panel that was running concurrently with Session 7. The first presentation was by Jurgen Bernard and was entitled, “Content-based Layouts for Exploratory Metadata Search in Scientific Research Data”. In this paper, Jurgen attempted to answer the question: Can we build relations of metadata based on the content? He conducted several experiments analyzing and measuring similarities between the results of different scientists experimenting with the same phenomena. He presented a way to visualize similarities of results based on the metadata in the “Metadata Entity Glyph”. Paul Bogen gave the next presentation, entitled, “A Quantitative Evaluation of Techniques for Detection of Abnormal Change Events in Blogs”. In this paper, Paul analyzed the patterns of change in blogs by conducting a survey of popular blogs from a large collection of social bookmarks to detect abnormal changes. He argued that the results of his method show a statistically significant improvement over traditional threshold techniques for the same collection. Following this presentation, I hurried to join the round table discussion, “Digging into Data”, in hopes of learning more about this international grant competition and hearing the questions posed to the representatives of the eight research funding agencies from the US, UK, Canada, and the Netherlands.
The next session was chaired by George Buchanan. He introduced Anderson Ferreira and his work entitled, “Active Association Sampling for Author Name Disambiguation”. In this presentation, Anderson described the problem of author name ambiguity. He suggested a possible solution based on supervised machine learning techniques, aiming to reduce the set of examples needed to produce the training data. Following Anderson, Madian Khabsa gave a very interesting presentation entitled, “AckSeer: A Repository and Search Engine for Automatically Extracted Acknowledgements from Digital Libraries”. In his talk, Madian explained the architecture of the search engine and the repository he developed to mine acknowledgements. Following Madian, Sujatha Das Gollapalli gave a presentation entitled, “Similar Researcher Search in Academic Environments”, which addressed the researcher recommendation and similarity problem. Nuno Freire followed Sujatha and presented his paper entitled, “An Analysis of the Named Entity Recognition Problem in Digital Libraries Metadata”. He discussed the information extraction task of dealing with references to entities made by names that occur in the texts.
After the break, the last group of sessions was started by Jannik Strötgen, who presented his award-nominated paper, “Event-centric Search and Exploration in Document Collections”. Jannik began his presentation by arguing that current search engines are great at extracting a set of documents, but fairly inadequate at extracting event-related documents. Temporal and semantic events have always been treated as words, whereas time and space are well defined and can be compared to enhance the concept of “event”. He went on to explain that queries are multidimensional and that adding the geographic and temporal dimensions to the textual dimension will enhance the results (if ranked by event). After the presentation, I switched sessions to attend the remaining portion of Session 12, which was directed by WS-DL alumnus Martin Klein. Feng Liang then presented his paper entitled, “Exploiting Real-time Information Retrieval in Microblogosphere”. In this paper, Feng discussed the problem of the semantic aspect (what is the most relevant?) versus the temporal aspect (what is the most recent?). For example, he argued that in Twitter the problem is very challenging due to the short length of the document, the abundance of shortened URLs, and the tradeoff between recency and relevance. Using the TREC2011 Tweets dataset and the ICL-divergence model, he analyzed the query model along with the document model, which provided a significant improvement in the re-ranking task. Prat Tanapaisankit spoke next with a presentation entitled, “Personalized Query Expansion in QIC System”, in which he tackled the query-in-context problem. Robert Mercer followed Prat with a presentation entitled, “Investigating Keyphrase Indexing with Text Denoising”, in which he discussed how removing the noisy parts of texts performed as well as or better than the benchmark indexer.
After this session ended, the attendees headed to the awards banquet which was held at Sequoia Restaurant. Michael Nelson, along with Karim Boughida, was awarded the “Spark Performance” Award. After that Michael Nelson announced the winners of the awards for Best Full Paper, Short Paper, and Poster. Then we all spent a lovely evening by the river talking and socializing with the other researchers and attendees.
The next morning marked the last day of the conference. It started with the last pair of sessions, which were conducted by Edie Rasmussen and Pertti Vakkari. Sally Jo Cunningham presented the first paper entitled, “Book Selection Behavior in the Physical Library: Implications for the EBook Selection”. Her experiments aimed to gain insights into people’s book selection strategies, which may inform the design of software support for ebook selection. Jesse Gozali followed Sally and presented a paper entitled, “How Do People Organize Their Photos in Each Event and How Does it Affect Storytelling, Searching, and Interpretation Tasks?” In this paper, he analyzed four different methods of photo organization and conducted a study to inform this analysis. Finally, George Buchanan presented a paper by his student, Jennifer Pearson, entitled, “Co-reading: Investigating Collaborative Group Reading”. In the group-reading setting, Jennifer investigated the differences between users’ behavior with paper and their behavior with the new interface they developed on the iPad.
The closing keynote speech was presented by George Dyson and was entitled, “The Sensible Moment: 1680-2012”. He took us on an amazing historic journey of technology. He began in the 1600s, when Francis Bacon encoded the alphabet in five placements and paved the way for binary encoding, and when Thomas Hobbes declared that computing machines should have adding and remembering capabilities. He mentioned Alfred Smee’s work in differentiating the reality, the thought, and the conscious; and H. G. Wells’s book, “World Brain” (1938), in which he imagined the Internet; and finally he spoke about the amazing efforts of Alan Turing and John von Neumann in creating the computing machines we have today.
Finally, Barrie Howard (the conference co-chair) thanked the audience and presented the JCDL 2013 co-chairs, who announced that the venue will be in Indianapolis, Indiana. I am eagerly looking forward to the 2013 conference and I hope to be able to submit some of my work.
Other Perspectives on JCDL 2012:
- Robin Camille Davis's notes on digital preservation and Jason Scott's keynote speech.
- THATCamp's article about Bridging the Gap between the CS DL community and the LIS DL community.
- The Library of Congress's notes from Behind the scenes of the JCDL2012 conference.
- The report about Digging into Data by CNI and another by JISC.
- Also you can follow news, insights, reports, and more on the Twitter Feed.
Special thanks to Erin E. Ralston for editing and refining this post.