Thursday, March 9, 2017

2017-03-09: A State Of Replay or Location, Location, Location

We have written blog posts about the time traveling zombie apocalypse in web archives and about how the lack of client-side JavaScript execution at preservation time prevented the SOPA protest of certain websites from being seen in the archive. A more recent post described how CNN's use of JavaScript to load and render the contents of its homepage has made it unarchivable since November 1st, 2016. That post detailed how the "tricks" used to circumvent CORS restrictions on HTTP requests made by JavaScript to talk to CNN's CDN were the root cause of the page being unarchivable / unreplayable. I will now present a variation of this problem that is more insidious and less obvious than what was occurring in the CNN archives.

TL;DR

In this blog post, I will show in detail what caused a particular web page to fail on replay. The replay failure occurred because the authentication and HTTP methods needed for the custom resources this page requires for viewing were missing. As a result, the page's JavaScript concluded that the page being viewed required the viewer to sign in, and it always redirects before the page has finished loading. Also, depending on the replay system's rewrite mechanisms, the JavaScript of the page can collide with the replay system's own code, causing undesired effects. The biggest issue highlighted in this post is that certain archives' replay systems employ unbounded JavaScript rewrites that, albeit only in certain situations, fundamentally destroy the original page's JavaScript, putting its execution into states its creators could not have prepared for or thought possible when viewing the page on the live web. It must be noted that this blog post is the result of my research into the modifications made to a web page in order to archive and replay it faithfully as it was on the live web.

Background

Consider the following URI, https://www.mendeley.com/profiles/helen-palmer, which when viewed on the live web behaves as you would expect any page not requiring a login to behave.
But before I continue, some background about mendeley.com, since you may not have known about this website, as I did not before it was brought to my attention. mendeley.com is a LinkedIn of sorts for researchers which provides additional services geared specifically towards them. Like LinkedIn, mendeley.com has publicly accessible profile pages listing a researcher's interests, publications, educational history, professional experience, and following/follower network. All of this is accessible without a login, and only the features you would expect to require a login, such as following the user or reading one of their listed publications, take you to a login page. But the behavior of the live web page is not maintained when replayed after being archived.

A State of Replay

Now consider the memento of https://www.mendeley.com/profiles/helen-palmer from Archive-It on 2016-12-15T23:19:00. When the page starts to load and becomes partially rendered, an abrupt redirection occurs taking you to
www.mendeley.com/sign-in/?routeTo=https%3A%2F%2Fwww.mendeley.com%2Fprofiles%2Fhelen-palmer
which is a 404 in the archive.
Obviously, this should not be happening, since this is not the behavior of the page on the live web. It is likely that the page's JavaScript is misbehaving when running on the host wayback.archive-it.org. Before we investigate what is causing this, let us see if the redirection also occurs when replaying a memento from the Internet Archive on 2017-01-26T21:48:31 and a memento from Webrecorder on 2017-02-12T23:27:56.
Webrecorder
Internet Archive
The video below shows this occurring in all three archives



and as seen in the video below this happens on other pages on mendeley.com


Comparing The Page On The Live Web To Replay On Archive-It

Unfortunately, the Internet Archive and Webrecorder mementos also fail to replay the page because of the redirection, which lends credibility to the original assumption that the page's JavaScript is causing it. Before diving into JavaScript detective mode, let us see if the output from the developer console can give us any clues. Seen below is the browser console with XMLHttpRequest (XHR) logging enabled when viewing
https://www.mendeley.com/profiles/helen-palmer
on the live web. Besides the Optimizely (user experience/analytics platform) XHR requests, the page's own JavaScript makes several requests to the site's backend at
https://api.mendeley.com
and a single GET request for
https://www.mendeley.com/profiles/helen-palmer/co-authors
A breakdown of the requests to api.mendeley.com is listed below:
  • GET api.mendeley.com/catalog (x8)
  • GET api.mendeley.com/documents (x1)
  • GET api.mendeley.com/scopus/article_authors (x8)
  • POST api.mendeley.com/events/_batch (x1)
From these network requests, we can infer that the live web page is dynamically populating the publications list of its profile pages and perhaps some other elements of the page. Now let's check the browser console from the Archive-It memento on 2016-12-15T23:19:00.
Many errors are occurring as seen in the browser console from the Archive-It memento but it is the XHR request errors and lack of XHR requests made that are significant. The first significant XHR error is a 404 that occurred when trying to execute a GET request for
http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmerco-authors/
This is a rewrite error (URI-R -> URI-M). The live web page's JavaScript requested
https://www.mendeley.com/profiles/helen-palmer/co-authors
but when replayed the archived JavaScript made the request for
https://www.mendeley.com/profiles/helen-palmerco-authors
Stranger yet is that the XHR finished loading console entry indicates it was made to
http://wayback.archive-it.org/profiles/helen-palmerco-authors
not to the URI-M that received the 404. Thankfully, we can consult the developer tools included in our web browsers to see the request/response headers for each request. The corresponding headers for
http://wayback.archive-it.org/profiles/helen-palmerco-authors
are seen below
The request really received a 302 and was indeed made to
http://wayback.archive-it.org/profiles/helen-palmerco-authors
but the actual location indicated in the response is to the "correct" URI-M
http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmerco-authors
The other significant difference from the live web's XHR requests is that the archived page's JavaScript is no longer requesting the resources from api.mendeley.com. We now have a single request for
http://wayback.archive-it.org/profiles/refreshToken
This request suffered the same fate as the previous request: a 302 with a Location of
http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/refreshToken
and then the redirection happens. Now we have a better understanding of what is happening with the Archive-It memento. The question about the Internet Archive and Webrecorder mementos remains.
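To make the URI-R to URI-M relationship concrete, here is a minimal sketch of how a Wayback-style replay URI is put together and taken apart. The helper names are hypothetical, and the prefix layout is inferred only from the Archive-It URIs shown above; this is not Archive-It's rewriting code.

```javascript
// Hypothetical helpers; the URI-M layout is inferred from the URIs in this post.
const REPLAY_PREFIX = "http://wayback.archive-it.org/8130/";

// Build a URI-M from a 14-digit timestamp and the original URI (URI-R).
function toUriM(timestamp, uriR) {
  return `${REPLAY_PREFIX}${timestamp}/${uriR}`;
}

// Split a URI-M back into its timestamp and URI-R, or return null
// when the URI does not carry the replay prefix and timestamp.
function fromUriM(uriM) {
  if (!uriM.startsWith(REPLAY_PREFIX)) return null;
  const rest = uriM.slice(REPLAY_PREFIX.length);
  const match = /^(\d{14})\/(.+)$/.exec(rest);
  return match ? { timestamp: match[1], uriR: match[2] } : null;
}

const good = toUriM(
  "20161215231900",
  "https://www.mendeley.com/profiles/helen-palmer/co-authors"
);
console.log(good);
console.log(fromUriM(good).uriR);

// A request that lost its replay prefix is not a URI-M at all.
console.log(fromUriM("http://wayback.archive-it.org/profiles/refreshToken"));
```

The last call returns null, which is the situation the 302 responses above appear to recover from: the request arrives without the collection and timestamp, and the replay system has to redirect it to a proper URI-M.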

Does This Occur In Other Archives

The console output from the Internet Archive's memento on 2017-01-26T21:48:31, seen below, shows that the requests to api.mendeley.com are not made. The request for the refresh token is made, but unlike in the Archive-It memento, the request to co-authors is rewritten successfully and does not receive a 404. The page still redirects, as seen below:
Likewise, with the memento from Webrecorder on 2017-02-12T23:27:56, seen below, the request made to co-authors is rewritten successfully and we have the request for the refresh token, but the page still redirects to the sign-in page like the others.
As the redirection occurs for the Internet Archive and Webrecorder mementos as well, we can now finally ask what happened to the api.mendeley.com requests and what in the page's JavaScript is making replay fail.

Location, Location, Location

The Mendeley website defines a global object that contains definitions for URLs to be used by the page's JavaScript when talking to the backend. That global object, seen below (from the Archive-It memento), is untouched by the archive's rewriting mechanisms. There is another inline script tag that adds some preloaded state for use by the page's JavaScript, seen below (also from Archive-It), and here we find our first instance of erroneous JavaScript rewriting. As you can see, the __PRELOADED_STATE__ object has a key of WB_wombat_self_location, which is a rewrite targeting window.location or self.location. Clearly, this is not correct when you consider the contents of this object, which describe a physical location. When comparing it with the live web key for this entry, seen below, the degree of error in this rewrite becomes apparent. Some quick background on the WB_wombat prefix before continuing. The WB_wombat prefix normally indicates that the replay system is using the wombat.js library from PyWb and, by extension, Webrecorder. Archive-It is not; rather, it uses its own rewrite library called ait-client-rewrite.js. The only similarity between the two is the usage of the name wombat.
Finding the refresh token code in the page's JavaScript was not difficult. Seen below is the section of code that likely causes the redirect. You will notice that the redirect occurs when it is determined that the viewer is not authorized to see this page. This becomes clearer when looking at the code that retrieves the refresh token. Here, we see two things: first, mendeley.com has a maximum number of retries for actions that require some form of authentication (this holds for the majority of the resources the page's JavaScript requests), and second, another instance of erroneous JavaScript rewriting:
e = t.headers.WB_wombat_self_location;
It is clear that Archive-It is using regular expressions to rewrite any <pojo>.location to WB_wombat_self_location, because on inspection of that code section you can see that the page's JavaScript is looking for the Location sent in the response headers, commonly used for 3xx or 201 responses (RFC 7231#6.3.2). This is further confirmed by the following line from the code seen above
e && this.settings.followLocation && 201 === t.status
The same can be seen in this code section from the Webrecorder memento. That leaves the Internet Archive's memento, but the Internet Archive does not do such rewrites, making this a non-issue for them. These files can be found in a gist I created if you desire to inspect them for yourself. At this point you must be thinking case closed, we have found out what went wrong. So did I, but I was not so sure, as the redirection occurs in the Internet Archive's memento as well.
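To illustrate how an unbounded rewrite produces a line like the one quoted above, here is a small sketch. Both regular expressions are my own illustrations and are not the rules used by ait-client-rewrite.js or wombat.js, and I am assuming the original, unrewritten statement read t.headers.location.

```javascript
// Illustration only: these are NOT the rules used by ait-client-rewrite.js
// or wombat.js. The input statement is an assumed original of the quoted line.

// An unbounded rule: rewrite every ".location" property access.
function naiveRewrite(source) {
  return source.replace(/\.location\b/g, ".WB_wombat_self_location");
}

// A bounded rule: rewrite only when the object is window, self, or document.
function boundedRewrite(source) {
  return source.replace(
    /\b(window|self|document)\.location\b/g,
    "$1.WB_wombat_self_location"
  );
}

const headerRead = "e = t.headers.location;";
const navigation = "window.location = '/sign-in/';";

console.log(naiveRewrite(headerRead));   // the header lookup is corrupted
console.log(boundedRewrite(headerRead)); // the header lookup is left alone
console.log(boundedRewrite(navigation)); // the real navigation is still rewritten
```

Even the bounded rule is still a regular expression and knows nothing about scope, strings, or aliases such as var w = window, so it only narrows the problem rather than solving it.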

Digging Deeper

I downloaded the Webrecorder memento, loaded it into my own instance of PyWb, and used its fuzzy match rewrite rules (via regexes) to insert print statements at locations in the code I believed would surface additional errors. The fruit of this labor can be seen below.
As seen above, the requests to
api.mendeley.com/documents and api.mendeley.com/events/_batch
are actually being made but are not shown as even going through by the developer tools, which is extremely odd. However, the effects of this can be seen in the two errors shown after the console entries for
/profiles/helen-palmer/co-authors
and
anchor_setter href https://www.mendeley.com/profiles/helen-palmer/co-authors
which are store:publications:set.error and data:co-authors:list.error. These are the errors which I believe to be the root cause of the redirection. Before I address why that is and what the anchor_setter console entry means, we need to return to the HTTP requests made by the browser when viewing the live web page, and not just those the browser's built-in developer tools show us.

Understanding A Problem By Proxy

To achieve this I used an open-source alternative to Charles called James. James is an HTTP Proxy and Monitor that will allow us to intercept and view the requests made from the browser when viewing
https://www.mendeley.com/profiles/helen-palmer
on the live web. The image below displays the HTTP requests made by the browser starting at the time when the request for co-authors was made.
The blue rectangle highlights the requests made when replayed via Archive-It, Internet Archive and Webrecorder, which include the request for co-authors (data:co-authors:list.error). The red rectangle highlights the requests made for retrieving the publications (store:publications:set.error). The pinkish purple rectangle highlights a block of HTTP OPTIONS (RFC 7231#4.3.7) requests made when requesting resources from api.mendeley.com. The requests in the red rectangle also have an OPTIONS request made before the
GET request for api.mendeley.com/catalog?=[query string]
This is happening because, to quote from the MDN entry for the HTTP OPTIONS request:
Preflighted requests in CORS
In CORS, a preflight request with the OPTIONS method is sent, so that the server can respond whether it is acceptable to send the request with these parameters. The Access-Control-Request-Method header notifies the server as part of a preflight request that when the actual request is sent, it will be sent with a POST request method. The Access-Control-Request-Headers header notifies the server that when the actual request is sent, it will be sent with a X-PINGOTHER and Content-Type custom headers. The server now has an opportunity to determine whether it wishes to accept a request under these circumstances.
What they mean by preflighted is that this request is made implicitly by the browser, and the reason it is sent before the actual request made by the JavaScript is that the content type they are requesting is
application/vnd.mendeley-document.1+json
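As a rough illustration of when a browser decides to preflight a cross-origin request, here is a simplified sketch of the "simple request" test. It ignores several details of the actual rules (for example header value limits and the Range header), and whether Mendeley sends this media type in Content-Type or together with another non-safelisted header such as Authorization is an assumption on my part.

```javascript
// Simplified approximation of the browser's CORS "simple request" test.
// It only applies to cross-origin requests and skips several real-world rules.
const SIMPLE_METHODS = new Set(["GET", "HEAD", "POST"]);
const SAFELISTED_HEADERS = new Set([
  "accept",
  "accept-language",
  "content-language",
  "content-type",
]);
const SIMPLE_CONTENT_TYPES = new Set([
  "application/x-www-form-urlencoded",
  "multipart/form-data",
  "text/plain",
]);

// Return true when a cross-origin request would be preceded by an OPTIONS preflight.
function needsPreflight(method, headers = {}) {
  if (!SIMPLE_METHODS.has(method.toUpperCase())) return true;
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SAFELISTED_HEADERS.has(lower)) return true;
    if (lower === "content-type") {
      // Compare only the media type, not parameters such as charset.
      const mime = value.split(";")[0].trim().toLowerCase();
      if (!SIMPLE_CONTENT_TYPES.has(mime)) return true;
    }
  }
  return false;
}

console.log(needsPreflight("GET"));
console.log(
  needsPreflight("POST", {
    "Content-Type": "application/vnd.mendeley-document.1+json",
  })
);
console.log(needsPreflight("GET", { Authorization: "Bearer <token>" }));
```

Only the first call is a simple request; the other two would trigger the implicit OPTIONS request that the developer tools did not show.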
A full list of the content-types the Mendeley pages request is enumerated in a gist, along with the JavaScript that makes the requests for each content-type. Again, let's compare the browser requests as seen by James on the live web to the archived versions, to see whether what our browser was not showing us for the live web version is happening in the archive. Seen below are the browser-made HTTP requests as seen by James for the Archive-It memento on 2016-12-15T23:19:00.
The
helen-palmer/co-authors -> helen-palmerco-authors
rewrite issue is indeed occurring, and the requests are not made for the URI-M but hit wayback.archive-it.org first; the same goes for profiles/refreshToken. We do not see any of the requests for api.mendeley.com, as you would expect. Another strange thing is that both of the requests for refreshToken get 302 statuses until a 200 response comes back, but now from a memento on 2016-12-15T23:19:01. The memento from the Internet Archive on 2017-01-26T21:48:31 suffers similarly, as seen below, but the request for helen-palmer/co-authors remains intact. The biggest difference here is that the memento from the Internet Archive is bouncing through time much more than the Archive-It memento.
The memento from Webrecorder on 2017-02-12T23:27:56 suffers similarly as did the memento from Archive-It, but this time something new happens as seen below.
The request for refreshToken goes through the first time and resolves to a 200, but we have the
helen-palmer/co-authors -> helen-palmerco-authors
rewrite error occurring. Only this time the request stays a memento request but promptly resolves to a 404 due to the rewrite error. Both the Archive-It memento and the Webrecorder memento share this rewrite error, and both use wombat to some extent, so what gives? The explanation likely lies with the use of wombat (at least for the Webrecorder memento), as the library overrides a number of the global DOM elements and friends at the prototype level (enumerated for clarity via this link). This is done to bring the URL rewrites to the JavaScript level and to ensure that requests are rewritten at request time. To better understand the totality of this, recall the image seen below (this time with sections highlighted), which I took after inserting print statements into the archived JavaScript via PyWb's fuzzy match rewrite rules.
The console entry anchor_setter href represents an instance when the archived JavaScript for mendeley.com/profiles/helen-palmer is about to make an XHR request, and it is logged from the wombat override of the a tag's href setter method. I added this print statement to my instance of PyWb's wombat because the Mendeley JavaScript uses a promise-based XHR request library called axios. The axios library uses an anchor tag to determine whether the URL for the request being made is same origin, and it does its own processing of the URL to be tested after using the anchor tag to normalize it. As you can see from the image above, the URL being set is relative to the page but becomes a normalized URL after being set on the anchor tag (I logged the before and after of just the set method). It must be noted that the version of wombat I used likely differs from the versions being employed by Webrecorder and maybe Archive-It. But from the evidence presented, it appears to be a collision between the rewriting code and the axios library's own code.
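The normalization step can be sketched outside the browser. Node has no anchor tag, so in this sketch the WHATWG URL class stands in for assigning to a.href; the function names are hypothetical, this is not axios's code, and I am assuming the page requests root-relative paths such as /profiles/refreshToken.

```javascript
// Stand-in for the anchor-tag trick: the WHATWG URL class resolves a possibly
// relative URL against a base, much as assigning to a.href does in a browser.
function normalize(url, pageUrl) {
  return new URL(url, pageUrl).href;
}

// Same-origin check on the normalized URL.
function isSameOrigin(url, pageUrl) {
  return new URL(url, pageUrl).origin === new URL(pageUrl).origin;
}

const live = "https://www.mendeley.com/profiles/helen-palmer";
const replay =
  "http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmer";

// On the live web the path resolves against www.mendeley.com.
console.log(normalize("/profiles/helen-palmer/co-authors", live));

// The API host is a different origin, hence the CORS machinery.
console.log(isSameOrigin("https://api.mendeley.com/documents", live));

// On replay, the same kind of path resolves against the archive's host and
// loses the collection and timestamp unless the replay system intervenes.
console.log(normalize("/profiles/refreshToken", replay));
```

The last line prints http://wayback.archive-it.org/profiles/refreshToken, which matches the un-prefixed request seen in the Archive-It console output and shows why the href setter is one of the places a replay system has to override.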

HTTP Options Request Replay Test

Now I can imagine that the readers' heads may be hurting, or that I have lost a few of you along the way. I apologize for that. However, I have one more outstanding issue to clear up: what happened to the api.mendeley.com requests, especially the OPTIONS requests? The OPTIONS requests were not executed for one of two reasons. The first is that the page's JavaScript could not receive the expected responses because the Authflow requests failed when replayed from an archive. The second is that one of the requests for content-type
application/vnd.mendeley-document.1+json
failed because HTTP OPTIONS requests are not replayed, or because the replayed request did not return what the page expected. To test this, I created a page hosted on GitHub Pages called replay test. This page's goal is to throw some gotchas at archival and replay systems. One of those gotchas is an HTTP OPTIONS request (using axios) to https://n0tan3rd.github.io/replay_test, which GitHub promptly answers with a 405 Method Not Allowed. An interesting property of GitHub's response is that the body is HTML, which the live web page displays once the request is complete. We may assume a service like Webrecorder would be able to replay this. Wrong: it does not, nor does the Internet Archive. What does happen is the following, as seen when replayed via Webrecorder, which created the capture.
The same can be seen in the capture from the Internet Archive below
What you are seeing is the response to my OPTIONS request, which is answered as if my browser had made a GET request to view the capture. This means the headers and status code I was expecting were never sent; instead I saw a 200 response for viewing the capture, not a response to the request I made. This implies that Mendeley's JavaScript will never be able to make the requests for its resources that are content-type
application/vnd.mendeley-document.1+json
when replayed from an archive. Phew, this now concludes this investigation, and I leave what else my replay_test page does as an exercise for the reader.
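The comment at the end of this post attributes the missing 405 to the replay index not recording the HTTP verb. A toy sketch of that effect follows; the in-memory index, its field names, and the 200 status for the GET capture are assumptions for illustration and do not reflect any real index format.

```javascript
// Hypothetical in-memory index; real replay indexes (for example CDXJ) differ.
const captures = [
  { url: "https://n0tan3rd.github.io/replay_test", method: "GET", status: 200 },
  { url: "https://n0tan3rd.github.io/replay_test", method: "OPTIONS", status: 405 },
];

// Lookup keyed by URL only: the first capture for the URL wins.
function lookupByUrl(url) {
  return captures.find((c) => c.url === url) || null;
}

// Lookup keyed by URL and HTTP method.
function lookupByUrlAndMethod(url, method) {
  return captures.find((c) => c.url === url && c.method === method) || null;
}

const target = "https://n0tan3rd.github.io/replay_test";

// An OPTIONS request answered from a URL-only index gets the GET capture.
console.log(lookupByUrl(target).status);

// With the verb in the key, the 405 response can be replayed.
console.log(lookupByUrlAndMethod(target, "OPTIONS").status);
```

Under these assumptions the URL-only lookup hands back the GET capture for an OPTIONS request, which is consistent with the 200 response I observed on replay.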

Conclusions

So what is the solution for this? But first we must consider.... I'm joking. I can see only two solutions. The first is that replay systems used by archives that rely on regular expressions for JavaScript rewrites need to start thinking like JavaScript compilers such as Babel when doing the rewriting. Regular expressions cannot understand the context of the statement being rewritten, whereas compilers like Babel can. This would ensure the validity of the rewrite and avoid rewriting JavaScript code that has nothing to do with the window's location. The second is to archive and replay the full server-client HTTP request-response chain.
- John Berlin
2017-03-19 Update:
The rewrite error occurring on Webrecorder has been corrected.
Thanks to Ilya Kreymer for his help in diagnosing the issue on Webrecorder. The getToken code of the page retrieves the token from a cookie via document.cookie which was not preserved thus the redirect. Ilya has created a capture with the preserved cookie. When replayed the capture will not redirect.
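The cookie dependency can be sketched as follows. The cookie name and function bodies are hypothetical and are not taken from Mendeley's code; only the shape of the sign-in URL comes from the redirect observed earlier in this post.

```javascript
// Hypothetical sketch: the cookie name and logic are assumptions,
// not taken from Mendeley's JavaScript.
const TOKEN_COOKIE = "accessToken";

// Read one cookie value from a document.cookie-style string.
function getToken(cookieString) {
  for (const pair of cookieString.split(";")) {
    const [name, ...rest] = pair.trim().split("=");
    if (name === TOKEN_COOKIE) return rest.join("=");
  }
  return null;
}

// Return the sign-in URL to redirect to, or null when a token is present.
function signInRedirect(cookieString, pageUrl) {
  if (getToken(cookieString)) return null;
  return (
    "https://www.mendeley.com/sign-in/?routeTo=" + encodeURIComponent(pageUrl)
  );
}

const page = "https://www.mendeley.com/profiles/helen-palmer";

// No cookie preserved in the capture: the viewer is sent to sign in.
console.log(signInRedirect("", page));

// Cookie preserved: no redirect.
console.log(signInRedirect("theme=dark; accessToken=abc123", page));
```

With an empty cookie string the sketch yields the same routeTo URL seen in the redirect at the top of this post, and with the cookie present it yields null, mirroring the behavior of the capture Ilya created.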

1 comment:

  1. I appreciate this thorough investigation into replay issues, though I’m not sure I understand the conclusions.

    It sounds like you’re saying "well its all messed up on this page and there aren’t any solutions, so lets abandon this approach altogether. Maybe this other tool (Babel) can do the trick!"
    This is not really helpful to those of us working to solve the issues of high-fidelity replay and not the right attitude to move forward.

    Also "archive and replay the full server client HTTP request-response chain" is exactly what tools like Webrecorder do, although sometimes matching request->response can be improved.

    If you look at wombat.js, regex rewriting is only a small portion of what the system does.. the rest is a dynamic override system to emulate the JS of the original page, not unlike a JS interpreter might do.

    (Though, It would certainly be interesting to see what could be accomplished with a system like Babel, I imagine it would solve some issues but possibly result in new ones).

    The right solution to the problems you mention is to carefully identify the exact source of the issues and how/if they could be solved. This is a very hard problem with a moving target, and finding what is going wrong can be tricky. Progress can be made incrementally, as it has been thus far:

    - One of the issues, a rewriting error due to absolute vs relative paths (helen-palmer/co-authors -> helen-palmerco-authors) has already been fixed and is not happening on the capture you mention.

    - The main issue on that page, lack of auth token being available, is actually due to cookies. You can see this by stepping into getToken() function to see that the token is obtained from document.cookie

    The Set-Cookie header was not being recorded by Webrecorder as a matter of policy. This causes the auth token to not be available, causing a redirect. This was done for security reasons, and something we are looking at. One viable solution is to record Set-Cookie that do not have HttpOnly.

    - By creating a recording that includes cookies, we can see that the page replays correctly:

    https://webrecorder.io/ilya/warc-with-cookies/20170308020246/https://www.mendeley.com/profiles/helen-palmer/

    - Pre-flight OPTIONS requests are something that are outside of the control of the page. Fortunately, the replay/recording system transforms the page so that cross-domain requests are no longer cross-domain. Thus, preflight OPTIONS request that is sent normally is often not sent during recording/replay.

    The new replay-test test suite is definitely useful in figuring out issues, particularly with newer frameworks. Thank you for making -- I think it could be a great benchmark for further improvements. A few more points after taking a quick look:

    - The lack of 405 on replay is due to a limitation of what is stored in the index: a request is made with a GET and an OPTIONS to the same url, but the index does not store the verb (CDXJ), and so returns a GET capture for an OPTIONS request. This is a great test and something that will be fixable with improvements to the index.

    - Service workers are definitely a tricky area. Over the years, they have become more prevalent and replay tech needs to be able to keep up with them. The web archiving replay system needs to continue to evolve to address new features like service workers. I have a few ideas on how to do that, but that’s enough for this comment :)
