Monday, April 30, 2018

2018-04-30: A High Fidelity MS Thesis, To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages

It is hard to believe that the time has come for me to write a wrap-up blog post about the adventure that was my Master's degree and the thesis that got me to this point. If you follow this blog with any regularity, you may remember two posts that I wrote that were the genesis of my thesis topic:

Bonus points if you can guess the general topic of the thesis from the titles of those two blog posts. However, it is ok if you cannot, as I will give an oh-so-brief TL;DR. The replay problems with cnn.com were, sadly, your typical here-today-gone-tomorrow replay issues involving this little thing I have come to know as JavaScript. What we also found out, when replaying mementos of cnn.com from the major web archives, was that each web archive has its own unique and subtle variation of this thing called "replay". The next post, about the curious case of mendely.com user pages (A State Of Replay), further confirmed that to us.

We found not only that variations exist in how web archives perform URL rewriting (URI-Rs → URI-Ms) but also that, depending on the replay scheme employed, web archives are modifying the JavaScript execution environment of the browser and the archived JavaScript code itself beyond URL rewriting! As you can imagine, this left us asking a number of questions that led to the realization that web archiving lacks the terminology required to effectively describe the existing styles of replay and the modifications made to an archived web page and its embedded resources in order to facilitate replay.

Thus my thesis was born and is titled "To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages".

Since I am known around the WS-DL headquarters for my love of deep diving into the secrets of (securely) replaying JavaScript, I will keep the length of this blog post to a minimum. The thesis can be broken down into three parts, namely Styles Of Replay, Memento Modifications, and Auto-Generating Client-Side Rewriters. For more detailed information about my thesis, I have embedded my defense slides below, and the full text of the thesis has been made available.

Styles Of Replay

The existing styles of replaying mementos from web archives can be broken down into two distinct models, namely "Wayback" and "Non-Wayback", and each has its own distinct styles. For the sake of simplicity and the length of this blog post, I will only (briefly) cover the replay styles of the "Wayback" model.

Non-Sandboxing Replay

Non-sandboxing replay is the style of replay that does not separate the replayed memento from the archive-controlled portion of replay, namely the banner. This style of replay is considered the OG (original gangster) way of replaying mementos simply because, introduced by the Internet Archive's Wayback Machine, it was at the time the only way to replay mementos. To both clarify and illustrate what we mean by "does not separate the replayed memento from the archive-controlled portion of replay", consider the image below displaying the HTML and frame tree for a http://2016.makemepulse.com memento replayed from the Internet Archive on October 22, 2017.

As you can see from the image above, the archive's banner and the memento exist together on the same domain (web.archive.org), implying that the replayed memento(s) can tamper with the banner (displayed during replay) and/or interfere with the archive's control over replay. For non-malicious examples of mementos containing HTML tags that can both tamper with the banner and interfere with archive control over replay, skip to the Replay Preserving Modifications section of this post. Now to address the recent claim that "memento(s) were hacked in the archive" and its correlation to non-sandboxing replay. Additional discussion on this topic can be found in Dr. Michael Nelson's blog post covering the case of blog.reidreport.com and in his presentation for the National Forum on Ethics and Archiving the Web (slides, trip report).

For a memento to be considered (actually) hacked, the web archive the memento is replayed (retrieved) from must have been compromised in a manner that requires the hack to be made within the data stores of the archive and does not involve user-initiated preservation. However, user-initiated preservation can only tamper with a non-hacked memento when it is replayed from an archive. The tampering occurs when an embedded resource, previously un-archived at the memento-datetime of the "hacked" memento, is archived from the future (the present datetime relative to the memento-datetime), and it typically involves the usage of JavaScript. Unlike non-sandboxing replay, the next style of Wayback replay, Sandboxed Replay, directly addresses this issue and the issue of how to securely replay archived JavaScript. PS. No signs of tampering, JavaScript-based or otherwise, were present in the blog.reidreport.com mementos from the Library of Congress. How do I know??? Read my thesis and/or look over my thesis defense slides; I cover in detail what is involved in the mitigation of JavaScript-based memento tampering and know what that actually looks like.

Sandboxed Replay

Sandboxed replay is the style of replay that separates the replayed memento from the archive-controlled portion of the page through replay isolation. Replay isolation is the usage of an iframe to sandbox the replayed memento, replayed from a different domain, away from the archive-controlled portion of replay. Because replay is split across two different domains (illustrated in the image seen below), one for the replay of the memento and one for the archive-controlled portion of replay (the banner), the memento cannot tamper with the archive's control over replay or the banner, due to the security restrictions the browser places on web pages from different origins, known as the Same-Origin Policy. Web archives employing sandboxed replay typically also perform the memento modification style known as Temporal Jailing. This style of replay is currently employed by Webrecorder and all web archives using Pywb (the open source, Python implementation of the Wayback Machine). For more information on the security issues involved in high-fidelity web archiving, see the talk entitled Thinking like a hacker: Security Considerations for High-Fidelity Web Archives given by Ilya Kreymer and Jack Cushman at WAC2017 (trip report), as well as Dr. David Rosenthal's commentary on the talk.
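To make the isolation concrete, here is a minimal sketch of what the outer, archive-controlled page might look like. The domain names, timestamp, and URL layout are made up for illustration and are not any particular archive's actual scheme:

<!-- Outer page served from the archive's "frame" origin, e.g. https://archive.example -->
<div id="banner"><!-- archive-controlled banner UI lives here --></div>
<!-- The memento itself is replayed from a second, content-only origin, so the
     Same-Origin Policy prevents it from reaching into the banner or archive UI -->
<iframe id="replay_iframe"
        src="https://content.archive.example/20171022000000/http://2016.makemepulse.com/">
</iframe>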

Memento Modifications

The modifications made by web archives to mementos in order to facilitate their replay can be broken down into three categories, the first of which is Archival Linkage.

Archival Linkage Modifications

Archival linkage modifications are made by the archive to a memento and its embedded resources in order to serve (replay) them from the archive. The archival linkage category of modifications is the most fundamental and necessary set of modifications made to mementos by web archives simply because they prevent the Zombie Apocalypse. You are probably already familiar with this category of memento modifications, as it is more commonly referred to as URL rewriting (URI-R → URI-M).

<!-- pre rewritten -->
<link rel="stylesheet" href="/foreverTime.css">
<!-- post rewritten -->
<link rel="stylesheet" href="/20171007035807cs_/foreverTime.css">
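To make the mapping concrete, here is a deliberately tiny JavaScript sketch of the kind of rewriting an archive could apply. The function name and the assumption that the URL is root-relative are mine, and a real rewriter also handles absolute URLs, protocol-relative URLs, srcset attributes, and so on:

// Illustrative only: prefix a root-relative URI-R with the capture timestamp and
// an optional modifier (e.g. "cs_" for stylesheets, "js_" for scripts, "im_" for images).
function rewriteUrl(urir, timestamp, modifier) {
  return '/' + timestamp + (modifier || '') + urir; // assumes urir begins with "/"
}

rewriteUrl('/foreverTime.css', '20171007035807', 'cs_');
// -> "/20171007035807cs_/foreverTime.css"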

URL rewriting (archival linkage modification) ensures that you can relive (replay) mementos not from the live web but from the archive. Hence the necessity and requirement for this kind of memento modification. However, it is becoming necessary to seemingly damage mementos in order to simply replay them.

Replay Preserving Modifications

Replay Preserving Modifications are modifications made by web archives to specific HTML element and attribute pairs in order to negate their intended semantics. To illustrate this, let us consider two examples, the first of which was introduced by our fearless leader Dr. Michael Nelson and is known as the zombie-introducing meta refresh tag, shown below.

<meta http-equiv="refresh" content="35;url=?zombie=666"/>

As you may be familiar, the meta refresh tag will, after 35 seconds, refresh the page with "?zombie=666" appended to the original URL. When a page containing this dastardly tag is archived and replayed, the refresh plus the appending of "?zombie=666" to the URI-M causes the browser to navigate to a new URI-M that was never archived. To overcome this, archives must arm themselves with the attribute prefixing shotgun in order to negate the tag and attribute's effects. A successful defense against the zombie invasion using the attribute prefixing shotgun is shown below.

<meta _http-equiv="refresh" _content="35;url=?zombie=666"/>

Now let me introduce to you a new, more insidious tag that does not introduce a zombie into replay but rather a demon, known as the meta CSP (Content-Security-Policy) tag, shown below.

<meta http-equiv="Content-Security-Policy"
 content="default-src http://notarchive.com; img-src ...."/>

Naturally, web archives do not want web pages delivering their own Content-Security-Policy via a meta tag because the results are devastating, as shown in the YouTube video below.

Readers, have no fear, this issue is fixed!!!! I fixed the meta CSP issue for Pywb and Webrecorder in pull request #274 submitted to Pywb. I also reported this to the Internet Archive, and they promptly got around to fixing it.
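For the curious, here is a minimal sketch of how a client-side rewriter could defang such a tag, in the same attribute-prefixing spirit as the meta refresh example above. This is an illustration of the idea, not the literal fix that went into Pywb:

// Illustrative sketch: neutralize page-supplied CSP meta tags while rewriting the
// HTML, before it is handed to the browser (once the browser parses a CSP meta tag
// into the live document the policy is already enforced, so this must happen first).
function neutralizeMetaCsp(html) {
  var doc = new DOMParser().parseFromString(html, 'text/html'); // detached document
  doc.querySelectorAll('meta[http-equiv]').forEach(function (meta) {
    if (meta.getAttribute('http-equiv').toLowerCase() === 'content-security-policy') {
      meta.setAttribute('_http-equiv', meta.getAttribute('http-equiv')); // preserve the original
      meta.removeAttribute('http-equiv');                                // negate its semantics
    }
  });
  return doc.documentElement.outerHTML;
}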

Temporal Jailing

The final category of modifications, known as Temporal Jailing, is the emulation of the JavaScript environment as it existed at the original memento-datetime through client-side rewriting. Temporal jailing ensures both the secure replay of JavaScript and that JavaScript cannot tamper with time (introduce zombies) by applying overrides to the JavaScript APIs provided by the browser in order to intercept un-rewritten URLs. Yes, there is more to it, a whole lot more, but because it involves replaying JavaScript and I am attempting to keep this blog post reasonably short(ish), I must force you to consult my thesis or thesis defense slides for more specific details. However, for more information about the impact of JavaScript on archivability and measuring the impact of missing resources, see Dr. Justin Brunelle's Ph.D. wrap-up blog post. The technique for the secure replay of JavaScript known as temporal jailing is currently used by Webrecorder and Pywb.
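To give a flavor of what those overrides look like, here is one deliberately simplified example: intercepting XMLHttpRequest so that un-rewritten URLs requested at runtime are routed back through the archive. The timestamp check and rewriting scheme here are illustrative placeholders, not wombat.js's actual logic:

// Simplified sketch of a temporal-jailing style override.
var originalOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method, url, async, user, password) {
  // If the page's JavaScript asks for a live-web URL, send it to the archive instead.
  var rewritten = url.indexOf('/20171007035807') === 0 ? url : '/20171007035807/' + url;
  return originalOpen.call(this, method, rewritten,
                           async === undefined ? true : async, user, password);
};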

Auto-Generating Client-Side Rewriters

Have I mentioned yet just how much I love JavaScript?? If not, lemme give you a brief overview of how I auto-generate client-side rewriting libraries, created a new way to replay JavaScript (currently used in production by Webrecorder and Pywb), and increased the replay fidelity of the Internet Archive's Wayback Machine.

First up, let me introduce to you Emu: Easily Maintained Client-Side URL Rewriter (GitHub). Emu allows any web archive to generate its own generic client-side rewriting library, one that conforms to the de facto standard implementation, Pywb's wombat.js, by supplying it with the Web IDL definitions for the JavaScript APIs of the browser. Web IDL was created by the W3C to describe interfaces intended to be implemented in web browsers, to allow the behavior of common script objects in the web platform to be specified more readily, and to specify how interfaces described with Web IDL correspond to constructs within ECMAScript execution environments. You may be wondering how I can guarantee that this tool will generate a client-side rewriter providing complete coverage of the JavaScript APIs of the browser and that we can readily obtain these Web IDL definitions. My answer is simple: consider the following excerpt from the HTML specification:

This specification uses the term document to refer to any use of HTML, ..., as well as to fully-fledged interactive applications. The term is used to refer both to Document objects and their descendant DOM trees, and to serialized byte streams using the HTML syntax or the XML syntax, depending on context ... User agents that support scripting must also be conforming implementations of the IDL fragments in this specification, as described in the Web IDL specification

Pretty cool, right? What is even cooler is that a good number of the major browsers/browser engines (Chromium, Firefox, and WebKit) generate and make publicly available Web IDL definitions representing the browser's/engine's conformity to the specification!
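To give a feel for what a generator like Emu consumes and what it can emit, consider a toy example. The IDL fragment below is abbreviated from the HTML specification, and the generated override is a simplified sketch of the idea rather than Emu's literal output (rewriteUrl stands in for whatever rewriting function the archive supplies):

// Input (Web IDL, abbreviated):
//   interface HTMLAnchorElement : HTMLElement {
//     attribute USVString href;
//     // ...
//   };
//
// A generator can walk such definitions and emit an override for every
// URL-carrying attribute, e.g. wrapping the href setter so assigned
// URLs are rewritten before the browser ever sees them:
var hrefDescriptor = Object.getOwnPropertyDescriptor(HTMLAnchorElement.prototype, 'href');
Object.defineProperty(HTMLAnchorElement.prototype, 'href', {
  get: function () { return hrefDescriptor.get.call(this); },
  set: function (value) { hrefDescriptor.set.call(this, rewriteUrl(value)); },
  configurable: true,
  enumerable: true
});

Next up: a new way to replay JavaScript.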

Remember the curious case of mendely.com user pages (A State Of Replay) and how we found out that Archive-It, in addition to applying archival linkage modifications, was rewriting JavaScript code to substitute a new, foreign, archive-controlled version of the JavaScript APIs it was targeting? This is shown in the image below.

Archive-It rewriting embedded JavaScript from the memento for the curious case of mendely.com user pages

Hmmmm, looks like Archive-It is rewriting only two out of four instances of the text string location in the example shown above. This JavaScript rewriting was targeting the Location interface, which controls the location of the browser. Ok, so how well would Pywb/Webrecorder do in this situation?? From the image shown below, not as good, and maybe a tad bit worse...

Pywb v0.33 replay of https://reacttraining.com/react-router/web/example/auth-workflow

That's right, folks: JavaScript rewrites in HTML. Why??? See below.

Bundling HTML in JavaScript, https://reacttraining.com/react-router/15-5fae8d6cf7d50c1c6c7a.js

Because the documentation site for React Router was bundling HTML containing the text string "location" inside of JavaScript (shown above), the rewrites were exposed in the documentation's HTML displayed to page viewers (second image above). In combination with how Archive-It was also rewriting archived JavaScript in a similar manner, I was like, this needs to be fixed. And fix it I did. Let me introduce to you a brand new way of replaying archived JavaScript, shown below.

// window proxy
new window.Proxy({}, {
  get (target, prop) {/*intercept attribute getter calls*/},
  set (target, prop, value) {/*intercept attribute setter calls*/},
  has (target, prop) {/*intercept attribute lookup*/},
  ownKeys (target) {/*intercept own property lookup*/},
  getOwnPropertyDescriptor (target, key) {/*intercept descriptor lookup*/},
  getPrototypeOf (target) {/*intercept prototype retrieval*/},
  setPrototypeOf (target, newProto) {/*intercept prototype changes*/},
  isExtensible (target) {/*intercept extensibility checks*/},
  preventExtensions (target) {/*intercept prevent extension calls*/},
  deleteProperty (target, prop) {/*intercept property deletion*/},
  defineProperty (target, prop, desc) {/*intercept new property definition*/},
})

// document proxy
new window.Proxy(window.document, {
  get (target, prop) {/*intercept attribute getter calls*/},
  set (target, prop, value) {/*intercept attribute setter calls*/}
})

The native JavaScript Proxy object allows an archive to perform runtime reflection on the proxied object. Simply put, it allows an archive to define custom or restricted behavior for the proxied object. I have annotated the code snippet above with additional information about the particulars of how archives can use the Proxy object. By using the JavaScript Proxy object in combination with the setup shown below, web archives can guarantee the secure replay of archived JavaScript and do not have to perform the kind of rewriting shown above. Yay! Less archival modification of JavaScript!! This method of replaying archived JavaScript was merged into Pywb on August 4, 2017 (contributed by yours truly) and has been used in production by Webrecorder since August 21, 2017. Now to tell you how I increased the replay fidelity of the Internet Archive and how you can too.

var __archive$assign$function__ = function(name) {/*return archive override*/};
{
  // archive overrides shadow these interfaces
  let window = __archive$assign$function__("window");
  let self = __archive$assign$function__("self");
  let document = __archive$assign$function__("document");
  let location = __archive$assign$function__("location");
  let top = __archive$assign$function__("top");
  let parent = __archive$assign$function__("parent");
  let frames = __archive$assign$function__("frames");
  let opener = __archive$assign$function__("opener");
  /* archived JavaScript */
}
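Putting the two pieces together: inside the block above, the bare identifiers window, document, location, and friends resolve to the archive's Proxy-backed overrides rather than the browser's real objects, so code like the following (an illustration, not any archive-specific API) is intercepted transparently:

// Inside the jail, "window" is the let-binding above, i.e. the archive's Proxy,
// so this property access hits the Proxy's get/set traps and the archive can
// rewrite the URL before any navigation happens.
window.location.href = 'http://example.com/next-page.html';

// Likewise, "document" resolves to the document proxy, so DOM access such as
// reading document.URL can be mediated by the archive.
var pageUrl = document.URL;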

Ok, so I generated a client-side rewriter for the Internet Archive's Wayback Machine using the code that is now Emu and crawled 577 Internet Archive mementos from the top 700 web pages found in the Alexa top 1 million web site list circa June 2017. The crawler I wrote for this can be found on GitHub. By using the generated client-side rewriter, I was able to increase the cumulative number of requests made by the Internet Archive mementos by 32.8%, a 45,051 request increase (graph of this metric shown below). Remember that each additional request corresponds to a resource that previously could not be replayed from the Wayback Machine.

Hey look, I also decreased the number of requests blocked by the Content-Security-Policy of the Wayback Machine by 87.5%, un-blocking 5,972 requests (graph of this metric shown below). Remember that each request un-blocked corresponds to a URI-R that the Wayback Machine could not rewrite server-side and that requires the usage of client-side rewriting (Pywb and Webrecorder are using this technique already).

Now you must be thinking this is impressive, to say the least, but how do I know these numbers were not faked or doctored in some way in order to give client-side rewriting the advantage??? Well, you know what they say, seeing is believing!!! The generated client-side rewriter used in the crawl that produced the numbers shown to you today is available as the Wayback++ Chrome and Firefox browser extension! Source code for it is on GitHub as well. And oh look, a video demonstrating the increase in replay fidelity gained if the Internet Archive were to use client-side rewriting. Oh, I almost forgot to mention that at the 1:47 mark in the video I make mementos of cnn.com replayable again from the Internet Archive. Winning!!

Pretty good for just a master's thesis, wouldn't you agree? Now it's time for the obligatory list of all the things I have created in the process of this research and my time as a master's student:

What is next, you may ask??? Well, I am going to be taking a break before I start down the path known as a Ph.D. Why??????? To become the senior backend developer for Webrecorder, of course! There is so, so much to be learned from actually getting my hands dirty facilitating high-fidelity web archiving that when I return, I will have a much better idea of what my research should focus on.

If I have said this once, I have said this a million times. When you use a web browser in the preservation process, there is no such thing as an un-archivable web page! Long live high-fidelity web archiving!

- John Berlin (@johnaberlin, @N0taN3rd)

Tuesday, April 24, 2018

2018-04-24: Why we need multiple web archives: the case of blog.reidreport.com


This story started in December 2017 with Joy-Ann Reid (of MSNBC) apologizing for "insensitive LGBT blog posts" that she wrote on her blog many years ago when she was a morning radio talk show host in Florida.  This apology was, at least in some quarters, (begrudgingly) accepted.  Today's update was news that Reid and her lawyers had in December claimed that her blog and/or the Internet Archive's record of the blog had been hacked (Mediaite, The Intercept).  Later today, the Internet Archive issued a blog post to deny the claim that it was hacked, stating:
This past December, Reid’s lawyers contacted us, asking to have archives of the blog (blog.reidreport.com) taken down, stating that “fraudulent” posts were “inserted into legitimate content” in our archives of the blog. Her attorneys stated that they didn’t know if the alleged insertion happened on the original site or with our archives (Reid’s claim regarding the point of manipulation is still unclear to us).
...
At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog.   
Checking the Internet Archive for robots.txt, we can see that on 2018-02-16 blog.reidreport.com had a standard robots.txt page that blocked the admin section of WordPress, but by 2018-02-21 there was a version that blocked all robots, and as of today (2018-04-24) there was a version that specifically blocked only the Internet Archive's crawler ("ia_archiver").  As of about 5pm EDT, the robots.txt file had been removed (probably because of the Internet Archive's blog post calling out the presence of the robots.txt; cf. a similar situation in 2013 with the Conservative Party in the UK), but it may take a while for the Internet Archive to register its absence.
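For reference, a robots.txt that blocks only the Internet Archive's crawler looks roughly like this (reconstructed for illustration; it is not the literal file that was on the blog):

User-agent: ia_archiver
Disallow: /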

2018-04-25 update: Thanks to Peter Sterne for pointing out that www.blog.reidreport.com/robots.txt still exists, even though blog.reidreport.com/robots.txt does not.  They technically can be two different URLs though the convention is for them to canonicalize to the same URL (which is what the Wayback Machine does).  HTTP session info provided below, but the summary is that robots.txt is still in effect and the need for other web archives is still paramount. 



Until the Internet Archive begins serving blog.reidreport.com again, this is a good time to remind everyone that there are web archives other than the Internet Archive.  The screen shot above shows the Memento Time Travel service, which searches about 26 public web archives.  In this case, it found mementos (i.e., captures of web pages) in five different web archives: Archive-It (a subsidiary of the Internet Archive), Bibliotheca Alexandrina (the Egyptian Web Archive), the National Library of Ireland, the archive.is on-demand archiving service, and the Library of Congress.  For a machine readable service, below I list the TimeMap (list of mementos) generated by our MemGator service; the details aren't important but it is the source of the URLs that will appear next.  

Beginning with the original tweets by @Jamie_Maz (2017-11-30 thread, 2018-04-18 thread), I scanned through the screen shots (no URLs were given) and looked for screen shots that had definitive datetimes (most images did not have them).  The datetimes are listed below (ones for which we have evidence are in bold, and ones that we inferred by matching text are marked with "(inferred)"):

2005-04-25
2005-07-16
2005-07-21
2006-01-20 (inferred)
2006-06-05
2006-06-13 (inferred)
2006-10-03
2006-12-23
2007-02-21
2008-07-04
2008-10-16
2009-01-15
(update: because of canonicalization errors, some of the URLs are not being excluded; see below)

Most of those dates are pretty early in web archiving times, when the Internet Archive was the only archive commonly available, and many (all?) of the mementos in other web archives were surely originally crawled by the Internet Archive, even if on a contract basis (e.g., for the Library of Congress).  Nonetheless, with multiple copies geographically and administratively dispersed throughout the globe, an adversary would have had to hack multiple web archives and alter their contents (cf. lockss.org), or have hacked the original site (blog.reidreport.com) approximately 12 years ago for adulterated pages to have been hosted at all the different web archives.  While both scenarios are technically possible, they are extraordinarily unlikely.  

While we don't know the totality of the hacking claims, we can offer five archived web pages, hosted at the Library of Congress web archive (webarchive.loc.gov), that corroborate at least some of the claims made by @Jamie_Maz.

2006-01-20


Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20060125004941/http://blog.reidreport.com/ 


2018-04-25 update: the above image is for the left-hand image in the tweet; the right-hand image in the tweet can be found about 1/3 of the way down at: https://web.archive.org/web/20070222030051/https://blog.reidreport.com/labels/Tim%20Hardaway.html


2006-06-05


Evidence for this tweet can be found at (approximately 2/3 down): http://webarchive.loc.gov/all/20060608144033/http://blog.reidreport.com/


2006-06-13

I'm not sure this evidence maps directly to one of the tweets, but it fits the general theme of anti-Charlie Crist: http://webarchive.loc.gov/all/20060615134635/http://blog.reidreport.com/


This memento also exists at archive.is; it is a copy of the Internet Archive's copy but it is not blocked by robots.txt because it is in another archive: http://archive.is/20060615134635/http://blog.reidreport.com/

2006-10-03



Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20061010125903/http://blog.reidreport.com/


2008-10-16


Evidence for this tweet can be found at (approximately 1/3 down): http://webarchive.loc.gov/all/20081018020856/http://blog.reidreport.com/ 



In summary, of the many examples that @Jamie_Maz provides, I can find five copies in the Library of Congress's web archive.  These crawls were probably performed on behalf of the Library of Congress by the Internet Archive (for election-based coverage); even though there are many different (and independent) web archives now, in 2006 the Internet Archive was pretty much the only game in town.  Even though these mementos are not independent observations, there is no plausible scenario for these copies to have been hacked in multiple web archives or at the original blog 10+ years ago.  There may be additional evidence in the other web archives, but I haven't exhaustively searched them.

We don't know the full details of what Reid's lawyers alleged, so perhaps there are facts we are missing.  But the analysis from the Internet Archive crawl engineers, plus evidence in separate web archives, suggests that the claim has no merit.

The case of blog.reidreport.com is another example of why we need multiple web archives.  


--Michael


Thanks to Prof. Michele Weigle and John Berlin for bringing this issue to my attention and uncovering some of the examples.   

Memento TimeMap for blog.reidreport.com:



2018-04-25 update: As noted above, Peter Sterne brought to my attention that the non-standard URL of www.blog.reidreport.com/robots.txt still exists (and is blocking "ia_archiver") even though the more standard blog.reidreport.com/robots.txt is 404. 



Another 2018-04-25 update: The NYT has covered the story ("MSNBC Host Joy Reid Blames Hackers for Anti-Gay Blog Posts, but Questions Mount"), and there was an interview with Reid's computer security expert ("Should We Believe Joy Reid’s Blog Was Hacked? This Security Consultant Says We Should"), Jonathon Nichols.  

I embed below a statement from Nichols (released by Erik Wemple) and a tweet from Nichols clarifying that they were not suggesting that the Wayback Machine's mementos were hacked, but rather that the hacked blog was crawled by the Internet Archive.

This is where it's important to note that there may be a discrepancy between the posts that Nichols is concerned with and those that @Jamie_Maz surfaced.  There is (semi-)independent evidence of @Jamie_Maz's pages, with the ultimate implication that for those pages to have been the result of a hack, blog.reidreport.com would have had to have been hacked as many as 12 years ago -- and for nobody to have noticed at the time.

Reid (& Nichols) could always unblock the Internet Archive and share the evidence of the hack. 




Yet another 2018-04-25 update: Apparently there are some holes in the http vs. https canonicalization wrt robots.txt blockage, allowing some of the posts to surface.  Here's an example (via @YanceyMc):
https://web.archive.org/web/20060225041734/https://blog.reidreport.com/2005/10/harriet-miers-and-lesbian-hair-check.html





Also, @wvualphasoldier deleted his tweets then protected his account, so that's the reason the above embed no longer formats correctly. (2018-04-26 update: @wvualphasoldier has now unprotected his account, but his earlier tweets about Joy Reid are still deleted.)

Yet, Yet Another 2018-04-25 update:

Thanks to Prof. Weigle and Mat Kelly for providing examples of some of the URLs that are slipping through the robots.txt exclusion.

Here's one: https://web.archive.org/web/20060805055643/https://blog.reidreport.com

and another: https://web.archive.org/web/20050728132003/https://blog.reidreport.com:443/

The latter has the following information that I thought I saw in the original @Jamie_Maz tweets, but now I can't find it, so perhaps I'm misremembering.  It certainly fits the overall theme.  Edit: it's in this post:



2018-04-26 update: I saw this last night but I'm adding it now.  In Erik Wemple's article "Is MSNBC’s Joy Reid the victim of malicious ‘screenshot manipulation’?", he links to PDFs of letters from Reid's lawyers to Google and the Internet Archive.  The letters do not have the attachments, but from the text of the letters I was able to infer some of the references they discuss, such as "...making sexual innuendos about Senator Orrin Hatch..." and "Things people say when they're on the Fox News Channel".  Those two quotes can be found at both:

http://webarchive.loc.gov/all/20060111221738/http://blog.reidreport.com/
http://web.archive.org/web/20060111221738/https://blog.reidreport.com/

Picking this apart further, the post "Things people say when they're on the Fox News Channel" is one of about 12 posts Reid made on January 10, 2006.  The above mementos at LC and IA were archived on January 11, 2006 (specifically, 2006-01-11T22:17:38Z).  Assuming the timestamp of "11:28" is EST (Reid was in Florida at the time), the difference between EST and GMT is 5 hours.  The interval between the posting time and the archiving time is pretty small, less than 30 hours:
2006-01-11T22:17:38Z archive time
2006-01-10T16:28:00Z posting time (I've assumed "00" for seconds)
-----------------------------
29 hours 49 minutes 38 seconds, or 1 day, 5 hours, 49 minutes, 38 seconds.

Call it 30 hours.  It's possible that the blog's timezone was GMT, so the interval would become 35 hours.  It's also possible the timezone was PST (Blogger was a Google service in 2006, so there's a good chance the machines were in CA), then the window shrinks to 27 hours.  But EST and 30 hours is a pretty good guess.
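For anyone who wants to double-check the arithmetic, here is a small JavaScript sketch of the first interval calculation; the date strings are the ones discussed above, with the posting time converted to UTC under the EST assumption:

// Illustrative only: verify the ~30 hour interval with Date objects.
var archiveTime = new Date('2006-01-11T22:17:38Z');
var postingTime = new Date('2006-01-10T16:28:00Z'); // 11:28 EST on 2006-01-10
var deltaSeconds = (archiveTime - postingTime) / 1000;
var hours = Math.floor(deltaSeconds / 3600);            // 29
var minutes = Math.floor((deltaSeconds % 3600) / 60);   // 49
var seconds = deltaSeconds % 60;                        // 38
console.log(hours + 'h ' + minutes + 'm ' + seconds + 's'); // "29h 49m 38s"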

Blogger does allow you to change creation dates on blog posts, so it is possible to go into a blog "today" and author a post that was "created" in 2006.  But the Internet Archive saw what it saw on 2006-01-11T22:17:38Z, so if the blog was hacked and posts were backdated to appear as 2006-01-10, then a hacker would have to have logged into the site and posted content without Reid or her readers noticing.  The archived page shows six posts on January 11, 2006, 12 posts on January 10, and 11 posts on January 09.  Even if we accept the premise that some of those posts are fraudulent, it is clear that Reid was a prolific blogger and regularly interacted with the blog (history note: blogging was very popular before Twitter & Facebook all but replaced it some years later).  I don't know what kind of readership Reid's blog had in 2006, but at the very least she was interacting with it many times per day, and would have had occasion to notice posts that she did not author.

Reid's lawyers also mention the post "Best Love Life EVER: Celebrity Wife-Swap Edition", which is also available at the above linked archived pages.  It purports to be published on January 11, 2006 at 9:23am, the first post (of six) of the day. Repeating the same analysis as above:

2006-01-11T22:17:38Z archive time
2006-01-11T14:23:00Z posting time (I've assumed "00" for seconds)
-----------------------------
7 hours 54 minutes 38 seconds

This establishes a boundary of 8 hours between when the allegedly fraudulent post was purportedly made and when the page was archived.  The last post of the day was 4:51pm ("The hypothetical situation room").  Repeating the above analysis again:


2006-01-11T22:17:38Z archive time
2006-01-11T21:51:00Z last posting time (I've assumed "00" for seconds)
-----------------------------
0 hours 26 minutes 38 seconds

In other words, the window between the last time that Reid interacted with the blog and when the blog was archived is less than 30 minutes.  It's entirely possible that Reid pressed "publish" and did not look back at the blog, and just moved on to the next task.  It's also possible that an adversary logged into the blog, posted the content before the 2006-01-11T22:17:38Z archival time, and changed the creation date to be earlier in the morning, before the presumably legitimate content was published.  But also keep in mind that such an adversary would not know in advance the time that the Internet Archive would visit the page (IA's "save page now" did not exist in 2006).

In summary, while we can't rule out an external adversary logging in and inserting fraudulent content right before archiving time, it would take an extraordinary set of circumstances for the content to appear before the page was archived and not be noticed by Reid (or brought to her attention by her readers).

(also, apologies if I've made datetime arithmetic or TZ conversion errors; corrections welcome)

2018-04-27 update: I've deduced why the number of comments for the top-level pages (at least the few that I've looked at) does not match some people's expectations.  In short, the html page and the corresponding javascript page that holds the index of which posts have what number of comments are out of sync.  In one example I looked at, the html page was crawled on 2006-01-11 and the js index was crawled on 2006-02-07.  When the js runs and can't find its post id in the js index, it assumes zero comments and prints the string "comments?".

I'm including a twitter thread here in lieu of a proper write-up.