Friday, April 19, 2013

2013-04-19: Carbon Dating the Web

(note: Carbon Date 2.0 was released on 2014-11-14)

In the course of our research we often need to determine when a certain web resource was created. In numerous cases, this question is fairly straightforward to answer by examining the resource itself. Articles often have publishing datetime stamps, social media contributions have posting times, and for other resources the creation date can be estimated by reading the content. This process is simple when examining a resource manually, but it is harder to automate when the dataset of resources is large.
To solve this problem we conducted several experiments to automatically determine when a resource was created. When a resource is created it often gets indexed by search engines, archived in public web archives, and shared in social media, thus leaving trails of existence. We trace those trails and use the earliest appearance across all of them as a close estimate of the creation date. The timeline below illustrates a common scenario of the lifetime of a resource.




We also examined whether a Last-Modified timestamp exists in the resource’s HTTP response header and the feasibility of using it as an estimate of the creation date. In addition, we examined the resource’s backlinks and estimated their creation dates, which can be easier to extract and which give further insight into when the resource itself was created.
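The Last-Modified check can be sketched in a few lines of Python. This is only an illustration and not the code used in our experiments: the function name `last_modified` and the example header dictionary are hypothetical, and the sketch assumes the headers have already been fetched into a plain dictionary.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if absent or malformed.

    Hypothetical helper for illustration; header names are matched
    case-insensitively because HTTP header names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "last-modified":
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                # Older Python versions raise TypeError on bad input,
                # newer ones raise ValueError.
                return None
    return None


# Example headers (assumed values, shaped like a typical HTTP response).
example_headers = {
    "Content-Type": "text/html",
    "Last-Modified": "Fri, 20 Apr 2012 21:52:07 GMT",
}
parsed = last_modified(example_headers)
print(parsed.isoformat() if parsed else "no usable Last-Modified header")
```

A missing or unparsable header simply yields no estimate from this source, so the other sources can still contribute.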

To test the accuracy of our estimation we collected 1200 resources whose creation dates we could extract manually, drawn from different sources. We tested our model and were able to estimate a creation date for over 75% of the resources, and obtained the exact creation date for 33%. After validating our model we used it to build an age estimation service which, given a resource’s URL, returns a JSON object containing the creation dates from each source (search engines, archives, social media, backlinks, and others) and the earliest estimated creation date. You can use the service at: http://cd.cs.odu.edu/cd/<YOUR_URL_HERE>

curl -i http://cd.cs.odu.edu/cd/http://www.mementoweb.org

HTTP/1.0 200 OK
Date: Fri, 01 Mar 2013 04:44:47 GMT
Server: WSGIServer/0.1 Python/2.6.5
Content-Length: 550
Content-Type: application/json; charset=UTF-8

{
      "URI": "http://www.mementoweb.org",
      "Estimated Creation Date": "2009-09-30T11:58:25",
      "Last Modified": "2012-04-20T21:52:07",
      "Bitly": "2011-03-24T10:44:12",
      "Topsy.com": "2009-11-09T20:53:20",
      "Backlinks": "2011-01-16T21:42:12",
      "Google.com": "2009-11-16",
      "Archives": {
            "Earliest": "2009-09-30T11:58:25",
            "By Archive": {
                  "wayback.archive-it.org": "2009-09-30T11:58:25",
                  "api.wayback.archive.org": "2009-09-30T11:58:25",
                  "webarchive.nationalarchives.gov.uk": "2010-04-02T00:00:00"
            }
      }
}
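As a rough illustration of how the estimate relates to the individual sources, the Python sketch below takes the response above and picks the earliest timestamp reported by any source. It assumes the estimate is simply the minimum over all reported dates, which is consistent with this sample response; the helper names `parse_timestamp` and `collect_dates` are hypothetical and are not taken from the released code.

```python
import json
from datetime import datetime

# The sample response shown above, embedded as a string for the example.
response_body = """
{
  "URI": "http://www.mementoweb.org",
  "Estimated Creation Date": "2009-09-30T11:58:25",
  "Last Modified": "2012-04-20T21:52:07",
  "Bitly": "2011-03-24T10:44:12",
  "Topsy.com": "2009-11-09T20:53:20",
  "Backlinks": "2011-01-16T21:42:12",
  "Google.com": "2009-11-16",
  "Archives": {
    "Earliest": "2009-09-30T11:58:25",
    "By Archive": {
      "wayback.archive-it.org": "2009-09-30T11:58:25",
      "api.wayback.archive.org": "2009-09-30T11:58:25",
      "webarchive.nationalarchives.gov.uk": "2010-04-02T00:00:00"
    }
  }
}
"""

# Keys that do not carry a per-source date and must not feed the minimum.
SKIPPED_KEYS = {"URI", "Estimated Creation Date"}


def parse_timestamp(value):
    """Parse a full timestamp or a date-only value; return None otherwise."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


def collect_dates(node, label=""):
    """Walk the nested JSON and return (datetime, source label) pairs."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key not in SKIPPED_KEYS:
                found.extend(collect_dates(value, key))
    elif isinstance(node, str):
        timestamp = parse_timestamp(node)
        if timestamp is not None:
            found.append((timestamp, label))
    return found


record = json.loads(response_body)
earliest, source = min(collect_dates(record), key=lambda pair: pair[0])
print(earliest.isoformat(), "first reported by", source)
```

For this response the minimum comes from the archives and equals the "Estimated Creation Date" field returned by the service.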


We also published the code on GitHub. You can download it from https://github.com/HanySalahEldeen/CarbonDate along with the installation instructions. To run the service yourself, first register with Bitly and Topsy and get their corresponding API keys. Second, modify the config file by adding your keys. Finally, launch server.py on your designated IP and port.

This work was published at the third annual TempWeb workshop at the WWW 2013 conference in Rio de Janeiro, Brazil.


- Hany M. SalahEldeen, Michael L. Nelson, Carbon Dating The Web: Estimating the Age of Web Resources, Proceedings of TempWeb03, WWW 2013. (Also available as a Technical Report http://arxiv.org/abs/1304.5213).

Wednesday, April 10, 2013

2013-04-08: Grad Cohort Workshop (CRA-W) 2013

On April 5-6, I was pleased to have the opportunity to meet and network with many successful senior women as well as graduate students from other universities at the CRA-W Graduate Cohort Workshop, which was held in Boston, MA. Grad Cohort, which began in 2004, aims to increase the ranks of senior women in computing by building and mentoring nationwide communities of women through their graduate studies. Grad Cohort accepts women students in their first, second, or third year of graduate school in computer science and engineering, and provides sessions for each of the three years. Since I am now in the third year of my computer science Ph.D., I attended the third-year sessions, which I describe in the rest of this blog post. The workshop included a mix of formal presentations and informal discussions.

On the afternoon of the first day, there was a Poster Session for participants to talk about their research. I presented a poster entitled "Access Patterns for Robots and Humans in Web Archives". The poster contains an analysis of the user access patterns of web archives using the Internet Archive's Wayback Machine access logs, the details of which will be published at JCDL 2013.


Day One:
After registration, breakfast was provided, followed by the welcome session by Lori Clarke (University of Massachusetts-Amherst), Sandhya Dwarkadas (University of Rochester), and Lori Pollock (University of Delaware). They started with an overview of the CRA-W Grad Cohort Workshop, which is held annually, emphasizing its goals. The main goals of the workshop are to learn the process of doing research, to gain insights into career paths, and to communicate and connect with people. The workshop has three tracks corresponding to the students' first, second, and third year of graduate school in computer science and engineering. They introduced the speakers and thanked the sponsors of the 2013 CRA-W Grad Cohort Program.


The first session, which was titled "Preparing Your Thesis Proposal and Becoming a Ph.D. Candidate", was given by Julia Hirschberg (Columbia University). Julia opened her presentation with a nice photo of her cute cats, followed by a piece of advice I admired: "It's fine to have a family and successful career". She emphasized the reasons for writing a Ph.D. proposal and described the proposal process from the brainstorming and planning stage to getting the proposal to the committee. Julia also spoke about presenting the proposal and gave some tips on it. She gave hints for handling hard questions and stressed how important it is to be prepared for them. At the end, I learned a lot from the open discussion about how other schools handle Ph.D. proposals.

In the second session, since I have to specify a topic for my proposal in the next couple of months, I attended the "Finding a Research Topic" session with the second-year Grad Cohort. The session was given by Carla Brodley (Tufts University). Carla discussed strategies for choosing a topic and addressed how this choice may affect our career plans (e.g., academic research or industrial research). She explained the thesis equation (Advisor + Topic = Dissertation). Before she ended the discussion, she contrasted "what is Ph.D. Research" with "what is not Ph.D. Research". After the session, it was time for lunch and networking with grad students, grouped by similar areas of research. To facilitate this, each table was labeled with a different field of computer science and engineering.


After lunch, the third session, titled "Ph.D. Academic Career Paths: Research, Teaching, and Administration", focused on the different career paths in academia. The session was given by Mary Lou Soffa (University of Virginia), Erin Solovey (MIT), and Tiffani Williams (Texas A&M University). The speakers started with short bios explaining how they got their jobs. They spoke of their roles in research, teaching, and service, and how these roles differ across academic institutions. Furthermore, they discussed the challenges of research, teaching, and service, and the skills required for success in each one.

After the second break, Hillery Hunter (IBM), Kathryn McKinley (Microsoft Research), and Amanda Stent (AT&T) shared personal stories with all of the Grad Cohorts in a session titled "Strategies for Human-Human Interaction". They focused on strategies for interacting with colleagues and the challenges of being a woman in a computing technology career. The stories were about uncomfortable situations that may arise and how to react to them. At the end of the session, they held a panel to answer the audience's questions.

After the fourth session closed, the workshop held a poster session. As mentioned above, I presented my poster titled "Access Patterns for Robots and Humans in Web Archives". Along with my poster, 83 other attendees presented posters from a wide range of computing technology fields. At the end of the day, we had a great time at the reception, which was hosted by Microsoft Research and Google.

Day Two:
Day two started off with breakfast, then a session titled "Ph.D. Non-Academic Career Paths: Industrial Research and Development" by Maryam Kamvar Garrett (Google), Suju Rajan (Yahoo!), and Amanda Stent (AT&T). The focus of the session was on the different career paths and job opportunities in industry for Ph.D. students. They spoke of the challenges, skills, and experiences needed for success in industry careers. They gave tips to help grad students choose a job (e.g., the importance of knowing the number of working hours each job requires and whether that suits your life).

After a short break, Natalie Enright Jerger (University of Toronto), Hillery Hunter (IBM), and Erin Solovey (MIT) presented the second session, "Ph.D. Job Search". Each of the speakers gave us an idea of her job timeline and how she got her current job, describing the obstacles she faced and how she overcame them. They raised the question "Why should we search for a job now?" and then said you have to remember three things:
  1. You are intelligent.
  2. You are an expert in something.
  3. You can do it.
They gave tips for finding jobs in academia, such as collaborating actively by attending conferences and communicating with people. They spoke of how to prepare for a job search: preparing the application, preparing for the interview, what to do after the interview, and deciding between different offers. They highlighted how important it is to have an external expert in your field who will also have time to write reference letters for you when you apply for a job.


The last session of the second day was "Balancing Graduate School and Personal Life" by A.J. Brush (Microsoft Research). The session addressed different strategies for balancing TA duties, course work, the research program, and personal life. She said that it is important to know when to stop and take a break, so that you can resume your work with a fresh mind.


At the end, "Wrap-Up and Final Remarks" were given by Lori Clarke, Sandhya Dwarkadas, and Lori Pollock. They thanked the speakers and the sponsors of the 2013 CRA-W Grad Cohort Program. Furthermore, they mentioned that they will participate in the Grace Hopper Celebration of Women in Computing (GHC), the largest conference for technical women in the world.


It was great to meet all of these senior women in the computing technology field, in addition to networking with graduate students and exchanging experiences with them about our research areas. At the end of the second day, our friends from MIT took us on a tour of the Computer Science building.

---
Yasmin

Special thanks to Mat Kelly for editing this post