Wednesday, March 27, 2013

2013-03-27: ResourceSync Meeting and JCDL 2013 PC Meeting

On March 21 & 22, members of the ResourceSync technical group met in Ann Arbor, Michigan to work on the 0.5 version of the ResourceSync specification.  In case you're not familiar, ResourceSync is a framework, intended to replace OAI-PMH, for specifying how a destination ("harvester" in PMH terms) can synchronize the web resources of a source ("repository" in PMH terms).  The source publishes a list of resources that it makes available via ResourceSync (which may be a subset of valid resources at the web site) using Sitemaps, with the idea that if you're already using Sitemaps then you are already minimally compliant; the more advanced features of ResourceSync also use the Sitemap syntax for consistency.  Although the syntactic details are in flux, Herbert's presentation at the September 2012 NISO Forum is a good introduction to the framework, as are the two recent D-Lib Magazine articles (Sept/Oct 2012 and Jan/Feb 2013).

Some important but nuanced results came from the March meeting, many of which are contained in the figure below (collaboratively drawn at the whiteboard and then Omnigraffled by Graham Klyne):

Some highlights:
  • The use of <sitemapindex> files in ResourceSync is now consistent.  The purpose of sitemap indexes is to provide pagination for large sitemaps (50K URIs or 10MB total), but those are engineering limitations; logically, two physical sitemaps listed in a sitemap index are a single logical sitemap.  In the 0.5 version of the ResourceSync specification, sitemap indexes were used both for pagination and for specifying archive functionality and grouping capability lists.  That dual use has been removed, and indexes are now used only for pagination (shown as the dashed boxes in each of the four columns in the figure above).
  • A resource set capability list can describe an entire site, or a site can have multiple resource set capability lists (e.g., the various collections in physics, mathematics, etc. in arXiv).  This ability to subdivide the site is analogous to "sets" in PMH.
  • The top part of the figure contains a few changes as well: the set of capability lists is now contained within a sitemap called a resource set list.  Although the number of capabilities for a single resource set list will never grow to need an index, it is possible that the number of resource set lists will grow to need an index (the dashed box at the very top of the figure).  An example would be creating a resource set for each category in a large wiki; there could easily be more than 50K such resource set capability lists.
  • You'll notice that only four capabilities are listed: Resource List (i.e., conventional Sitemaps), Change List, Resource Dump, and Change Dump (just a quick reminder that these capabilities are orthogonal and optionally implementable; for example, you can implement a Change List without implementing a Resource List).  The discussion of archiving these capabilities will move to a (yet to be published) separate document, and will borrow heavily from Memento to describe archiving.  The subtle distinction is that to make Change Lists and Change Dumps useful, a source will likely have to provide some memory: a single change is not useful (memory=1); a list (memory=n) is required for a destination to make use of this capability.  But there is a distinction between the level of memory provided (i.e., ResourceSync semantics) and the possible archiving of these capabilities (i.e., Memento semantics).
  • Also to be described in a separate document is the server push version of ResourceSync.  The current document will continue to cover only client pull.
  • The manifests inside the dumps (i.e., inside the zip files) are to be provided as a single Sitemap, even if there are > 50K URIs.  Providing indexes inside a dump just isn't worth the trouble.  Also note that although zip files are mentioned by name in the specification, we've left the door open for other formats and encodings (e.g., .tar.gz, .tar.bz2).
  • Finally, even if you have only a single Resource Set List, it will be wrapped with a Resource Set List Index, so the /.well-known/resourcesync URI can point to the same thing over time.  Yes, that's not ideal, but if you don't do it that way, then adding your second Resource Set List (which you'll eventually do) is an even bigger problem than having an index with just a single member.
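To make the pagination-only role of sitemap indexes concrete, here is a rough Python sketch of splitting a large resource list into sitemap pages with a single sitemap index pointing at them.  The function name, base URI, and file-naming scheme are all illustrative, not taken from the specification:

```python
# Illustrative sketch only: paginate a resource list into pages of at
# most max_uris, plus one <sitemapindex>.  Names and URIs are made up.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def paginate(uris, base="http://example.org/resourcelist", max_uris=50000):
    """Return (sitemapindex_xml, [sitemap_page_xml, ...])."""
    pages = [uris[i:i + max_uris] for i in range(0, len(uris), max_uris)]
    page_xmls = []
    for page in pages:
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for uri in page:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = uri
        page_xmls.append(ET.tostring(urlset, encoding="unicode"))
    # One physical sitemap per page; together they are one logical sitemap.
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(len(pages)):
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = (
            "%s-%d.xml" % (base, n + 1))
    return ET.tostring(index, encoding="unicode"), page_xmls
```

The point of the sketch is only that the index adds nothing semantically -- it exists to work around the 50K/10MB engineering limits.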
We discussed a host of other changes, such as how to specify the temporal coverage of a Change List (see figure 1 in the specification -- does Change List 2's coverage begin at t3, or at the time of the first observed change, which might be > t3?), but I'll leave those details for the 0.6 version of the specification.

On Friday after the ResourceSync meeting, Rob, Martin, and I hopped in a rental car and drove from Ann Arbor to Chicago for the JCDL 2013 Program Committee meeting.  The program will be released soon, but it was a full weekend of reviewing, meta-reviewing, and planning.  Accepted for presentation were 28 long and 22 short papers (for an acceptance rate of ~29% in each category; exact numbers will be released later), plus a large number of posters.  This year each paper had three regular reviews and two meta-reviewers.  Last year was the first year of meta-reviewing for JCDL (with a single meta-reviewer), and I think the quality of the reviews is better as a result.  Although it was a lot of work and I was anxious about which of my own papers made it in, this year Herbert and I weren't responsible for chairing the program, so things were much less stressful.


Saturday, March 23, 2013

2013-03-22: NTRS, Web Archives, and Why We Should Build Collections

At the ResourceSync meeting this week, Simeon Warner brought my attention to the fact that the NASA Technical Report Server (NTRS) digital library had gone offline on March 19.  Although I have not been involved with it since about 2004, I was the creator of NTRS and it was a central part of my early career.

If you visit the site now, you get a message saying the service is down.  Technically, you get an "HTTP/1.1 503 Service Temporarily Unavailable" response:

$ curl -I
HTTP/1.1 503 Service Temporarily Unavailable
Date: Sat, 23 Mar 2013 04:00:14 GMT
Server: Apache/2.2.3 (Red Hat)
Last-Modified: Fri, 22 Mar 2013 12:50:02 GMT
ETag: "720003-300-4d882e4c05280"
Accept-Ranges: bytes
Content-Length: 768
Connection: close
Content-Type: text/html; charset=UTF-8

And the body of the page says:
The NASA technical reports server will be unavailable for public access
while the agency conducts a review of the site's content to ensure that it
does not contain technical information that is subject to U.S. export control laws
and regulations and that the appropriate reviews were performed.
The site will return to service when the review is complete.
We apologize for any inconvenience this may cause
Mark Phillips described it perfectly:

Presumably the shutdown is in response to a security incident involving a NASA LaRC contractor who is a Chinese national.  I won't address that issue, but I will say that shutting down a public web server in response demonstrates a profound misunderstanding of how the web works: you can't put the pdfs back in the server.  When I discovered that NTRS was down, I searched Twitter, and here is the first tweet I came across with a link to an NTRS report:

Clicking on the link produces this image; note that the response stays the same and the URI is not rewritten:

Using MementoFox, I moved my slider to go back in time and I was able to find the report in Archive-It.  Here's the screen shot of the PDF in Preview overlapping the web browser:

Here's the TimeMap for those who are interested in the Memento details:

$ curl -i
HTTP/1.1 200 OK
Date: Sat, 23 Mar 2013 04:24:50 GMT
Server: Apache/2.2.15 (Red Hat)
Link: <http://>;rel="timemap";type="application/link-format";anchor=""
Connection: close
Transfer-Encoding: chunked
Content-Type: application/link-format

 <>;rel="first last memento";datetime="Fri, 07 May 2010 00:00:00 GMT"

Exactly how many pdfs are available in Memento-compliant archives?  I'm not sure; it could be just a few, since most general web archives prefer html pages to pdfs.  MementoFox can make the rediscovery process seamless, but the point is that the pdfs are out there and shutting down the server won't bring them back.
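For readers who want to poke at TimeMaps themselves, here is a simplified Python sketch (not a complete link-format parser) that pulls the URI, rel, and datetime out of each entry of an application/link-format TimeMap like the one above:

```python
# Simplified sketch: split an application/link-format TimeMap body into
# (uri, rel, datetime) tuples.  Real TimeMaps have more attributes.
import re

def parse_timemap(text):
    entries = []
    # entries are comma-separated; split only before the next <uri>
    for link in re.split(r",(?=\s*<)", text.strip()):
        m = re.search(r"<([^>]*)>", link)
        if not m:
            continue
        rel = re.search(r'rel="([^"]*)"', link)
        dt = re.search(r'datetime="([^"]*)"', link)
        entries.append((m.group(1),
                        rel.group(1) if rel else None,
                        dt.group(1) if dt else None))
    return entries
```

Feeding it the memento line above would yield a single entry whose rel value contains "first last memento" and whose datetime is the May 2010 capture.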

Although I can't help you find all the material NASA has taken offline, I can point you to two resources.  First is a mirror of some NACA (1917-1958) reports that I helped set up with Paul Needham at Cranfield University in 2001:

I helped establish that mirror when the NASA websites were taken down after September 11, 2001.  That made it clear to me that NASA information was too important to be left solely on * computers.  Although that mirror required a number of emails to coordinate, the bulk transfer was done over the web.  I'm surprised it is still up and running -- it is a testament to good, simple web design, and perhaps it is proof that benign neglect can be helpful on the web as well as in the physical world.

Of more recent and larger scale is the Internet Archive's "NASA Technical Documents" collection:

I'm not sure about the size of their collection either, but there appears to be a good deal there.

I've also just discovered that Mark Phillips has a NACA collection as well.  I'm not sure if this is related to the IA collection or is totally separate:

Returning to Mark's point, it is events like this that demonstrate the value of copying by-value and not just by-reference.  I'm not concerned about popular culture artifacts disappearing (e.g., see our TPDL 2011 paper about music redundancy in YouTube), but it is not clear that long-tail content like NASA reports will enjoy the same level of uncoordinated refreshing and migration.  The moral of the story: make copies of the content, and let services like Google Scholar cluster the pdfs together (e.g., a 1994 NASA TM of mine is on at least six different hosts, none of which are *).

David Rosenthal has often mentioned that preservation threats include legal, bureaucratic, and political threats.  If NTRS was a LOCKSS participant then access would be uninterrupted, but even LOCKSS assumes that the organization responsible for the content is not the primary threat to the content.


2013-05-10 Edit: According to NASA Watch, NTRS came back online May 8, 2013 -- without 85% of its full-text reports.  That same article also pointed out that at least some of the material is available at the Aerospace Research Information Center in South Korea.

Saturday, March 2, 2013

2013-03-02: NFL 2013 Salary Cap

The NFL salary cap for 2013 has been calculated to be about $123 million. All NFL teams must be in compliance with the salary cap by March 12th, when the new league year starts. March 12th also marks the start of the NFL free agent market. Teams that are over the salary cap must let some players go, and teams that are under it are looking to add new players to their rosters.

The process sounds simple on the surface, but in reality it becomes confusing rather quickly. Many teams routinely exceed the salary cap by manipulating contracts. The Pittsburgh Steelers were about $14 million over the cap until they modified Ben Roethlisberger's contract and converted most of his pay into a signing bonus. Signing bonuses can be amortized over the life of a contract. Instead of receiving an $18 million salary, the player gets a $2 million salary and a $16 million bonus. The bonus is divided by the number of years in the contract, reducing the impact on the salary cap.
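The proration arithmetic is simple enough to spell out; in this toy sketch the 4-year contract length is an assumption, since the paragraph doesn't give one:

```python
# Toy illustration of signing-bonus proration under the cap.
# The 4-year term below is assumed for the sake of the example.
def cap_hit(salary, bonus, years):
    """Yearly cap charge: base salary plus the prorated signing bonus."""
    return salary + bonus / years

# A $2M salary plus a $16M bonus over 4 years counts only $6M per year
# against the cap, rather than the full $18M at once.
```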

This type of contract modification is taking place with many of the top players this year.
Tom Brady just signed a new contract that gives him a $7 million salary and $30 million in bonuses over five years.

At its core, the salary cap is about revenue sharing between the players and the team owners. One of the purported benefits of the salary cap is to help level the playing field between teams and improve competitive balance. The NFL instituted the salary cap in 1994 in part to prevent some teams from buying up the best players and dominating the game.

It is now close to twenty years since the salary cap was implemented, and many people argue about whether the salary cap promotes equality on the field or brings mediocrity to the game. One way we can look at this is to compare the number of wins per year between teams. Each game has a winner and a loser, and there are 16 regular season games for each of the 32 teams every year. Therefore the mean number of wins per team will be 8 (controlling for fewer teams in pre-expansion years). We are going to look at the distribution of the number of wins. If team parity is low, you will see a larger standard deviation as the number of wins spreads out. Conversely, if the standard deviation is small, most of the teams will be clustered around the mean of 8 wins per year. Using data from 1990 (before the cap) to 2012, we plotted the standard deviations from each year with a trend line.
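As a sketch of the measure, here is the computation on a made-up 32-team season; the win totals are invented for illustration, constrained only so that wins sum to 256 (every one of the 256 games produces exactly one winner):

```python
# Made-up win totals for one 32-team season, symmetric around the
# mean of 8 so that total wins equal total losses.
import statistics

wins = [11, 5] * 4 + [10, 6] * 4 + [9, 7] * 4 + [8, 8] * 4  # 32 teams
assert sum(wins) == 16 * 16     # 256 games, each with exactly one winner
sd = statistics.pstdev(wins)    # the per-season dispersion measure above
```

A season like this one comes out well under the ~3.0 average the post reports, i.e., a more "equal" season than the real data shows.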

Our data shows that the standard deviation averages about 3.0 and clearly displays an increasing trend. The spread in the number of wins per year has been increasing slightly since the salary cap has been in place. In the book The Wages of Wins, the authors introduce a metric they call the Noll-Scully measure, which compares the idealized standard deviation to the actual standard deviation. For the NFL the idealized standard deviation is 2; to calculate Noll-Scully, divide the actual standard deviation by the idealized one. Using the trend line, this results in 1.45 in 1990, increasing to 1.6 in 2012. This is similar to the overall score the authors found in The Wages of Wins. By itself the Noll-Scully does not mean much, but when the authors compared it to American baseball, basketball, and other sports, the NFL displayed more parity. Finding this not quite convincing, we decided to try another metric, the Gini Index.
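The Noll-Scully arithmetic is a one-liner; the trend-line standard deviations of roughly 2.9 and 3.2 below are implied by the 1.45 and 1.6 values quoted above, not taken directly from the data:

```python
# Noll-Scully: actual standard deviation over the idealized one.
# For a 16-game season of coin-flip outcomes, the idealized sd is
# sqrt(16 * 0.5 * 0.5) = 2 wins.
import math

def noll_scully(actual_sd, games=16):
    ideal_sd = math.sqrt(games) / 2  # binomial sd with p = 0.5
    return actual_sd / ideal_sd
```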

The Gini Index is a measure of statistical dispersion commonly used by economists to measure inequality of wealth. The index ranges from 0 to 1 (or 0 to 100). A score of 0 means perfect equality; a score of 100 means perfect inequality. In our case, when measuring 32 teams that either win or lose, the worst case would be 16 teams with 16 wins and 16 teams with 0 wins. This would result in a Gini Index of 50, so our score ranges from 0 (every team has 8 wins) to 50. The Gini Index was calculated using the same data from 1990 to 2012 and plotted with a trend line. All of the values are below 25, which economists consider very good for wealth distribution. For our purposes it is still quite good; however, the data displays almost the same upward trend as the standard deviation.
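Here is a sketch of the Gini computation on win totals, using the mean-absolute-difference formulation; the extreme season described above (16 teams with 16 wins, 16 with 0) comes out at exactly 50 on the 0-100 scale, matching the bound in the text:

```python
# Gini index on a 0-100 scale via mean absolute difference:
# G = sum(|x_i - x_j|) / (2 * n^2 * mean), scaled by 100.
def gini(values):
    n = len(values)
    mean = sum(values) / n
    diff_sum = sum(abs(x - y) for x in values for y in values)
    return 100 * diff_sum / (2 * n * n * mean)

worst = [16] * 16 + [0] * 16  # the worst-case season described above
```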

To find out if the upward trend was statistically significant, we used Mann-Kendall trend analysis. Mann-Kendall is a non-parametric test used for identifying trends in time series data. The data values are evaluated as an ordered time series, with each value compared to all subsequent values. The result of the analysis indicates that the upward trend is not statistically significant at the 95% level.
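For the curious, here is a bare-bones sketch of the Mann-Kendall S statistic and its normal approximation; it omits the tie correction, and a real analysis would use a statistics package:

```python
# Bare-bones Mann-Kendall: S counts concordant minus discordant pairs
# across the ordered series; Z is the normal approximation (no tie
# correction).  |Z| > 1.96 would indicate significance at 95%.
import math

def mann_kendall(series):
    n = len(series)
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z
```

A strictly increasing series maximizes S; for the yearly Gini/standard-deviation series in the post, Z evidently fell inside the ±1.96 band.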

Overall, the parity in the NFL appears to be rather good, especially compared to some of the other American sports. The salary cap is not the only tool used to improve team parity: draft picks and strength of schedule are both determined by last year's performance, so teams that did poorly are given better draft picks and easier schedules. The levers that the NFL has put in place appear to be working in promoting competitive balance without negatively impacting gameplay. The performance of the Indianapolis Colts and the New England Patriots in the 2000s proves that dynasties are still alive and that a well-managed team can still excel.

--Greg Szalkowski