Monday, March 20, 2017

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (Junk text), HTML, Javascript, comments, and CSS text.
Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include Javascript, HTML, comments, or CSS text.
Fig. 3: Boilerplate removal result for Justext method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text() and NLTK's (OLD) clean_html() methods, but the page title is absent.
Fig. 4: Boilerplate removal result for Python-goose method for this news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but page title and first paragraph are absent.
Fig. 5: Boilerplate removal result for Python-boilerpipe (ArticleExtractor) method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext.
Boilerplate removal refers to the task of extracting the main text content of webpages. This is done through the removal of content such as navigation links, header and footer sections, etc. Even though this task is a common prerequisite for most text processing tasks, I have not found an authoritative, versatile solution. In order to better understand how some common options for boilerplate removal perform against one another, I developed a simple experiment to measure how well the methods perform when compared to a gold standard text extraction method (myself). Python-boilerpipe (ArticleExtractor mode) performed best on my small sample of 10 news documents with an average Jaccard Index score of 0.7530, and median Jaccard Index score of 0.8964. The Jaccard score for each document for a given boilerplate removal method was calculated over the sets (bag of words) created from the text extracted from the news document and its gold standard text.

Some common boilerplate removal methods
  1. BeautifulSoup's get_text()
    • Description: BeautifulSoup is a very (if not the most) popular Python library used to parse HTML. It offers a boilerplate removal method - get_text() - which can be invoked with a tag element such as the body element of a webpage. Empirically, the get_text() method does not do a good job of removing all the Javascript, HTML markup, comments, and CSS text of webpages, and includes extraneous text along with the extracted text.
    • Recommendation: I don't recommend exclusive use of get_text() for boilerplate removal.
  2. NLTK's (OLD) clean_html()
    • Description: The Natural Language Toolkit (NLTK) used to provide a method called clean_html() for boilerplate removal. This method used regular expressions to parse and subsequently remove HTML, Javascript, CSS, comments, and white space. However, NLTK has since deprecated this implementation and suggests the use of BeautifulSoup's get_text() method, which as we have already seen does not do a good job.
    • Recommendation: This method does a good job removing HTML, Javascript, CSS, comments, and white spaces. However, it includes the boilerplate text such as the navigation link text, as well as header and footer sections text. Therefore, if your application is not sensitive to extraneous text, and you just care about including all text from a page, this method is sufficient.
  3. Justext
    • Description: According to Mišo Belica, the creator of Justext, it was designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources. Justext also provides an online demo.
    • Recommendation: Justext is a decent boilerplate removal method that performed almost as well as the best boilerplate removal method from our experiment (Python-boilerpipe). But note that Justext may omit page titles.
  4. Python-goose
    • Description: Python-goose is a Python rewrite of an application originally written in Java and subsequently Scala. According to the author, the goal of Goose is to process news articles or article-type pages and extract the main body text, metadata, and most probable image candidate.
    • Recommendation: Python-goose is a decent boilerplate removal method, but it was outperformed by Python-boilerpipe. Also note that Python-goose may omit page titles just like Justext.
  5. Python-boilerpipe
    • Description: Python-boilerpipe is a Python wrapper of the original Java library for boilerplate removal and text extraction from HTML pages.
    • Recommendation: Python-boilerpipe outperformed all the other boilerplate removal methods in my small test sample. I currently use this method as the boilerplate removal method for my applications.
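To make the difference between markup removal and boilerplate removal concrete, here is a minimal sketch of a regular-expression cleaner in the spirit of what the description of NLTK's old clean_html() says it did. This is not NLTK's actual code; the function name strip_markup and the exact patterns are my own assumptions for illustration.

```python
import html
import re


def strip_markup(raw_html: str) -> str:
    """Naive regex-based markup removal (illustrative sketch, not NLTK's code)."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw_html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Drop every remaining tag but keep the text between tags.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; and collapse runs of white space.
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


page = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><!-- note --><nav>Home</nav><p>Main &amp; story.</p></body></html>"
)
print(strip_markup(page))
```

Note that the navigation text ("Home") survives: a cleaner like this removes markup, Javascript, CSS, and comments, but it has no notion of which text is the main content. That is the gap the dedicated boilerplate removal libraries try to close.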
With the following corresponding gold standard text documents:

  1. Gold standard text for news document - 1
  2. Gold standard text for news document - 2
  3. Gold standard text for news document - 3
  4. Gold standard text for news document - 4
  5. Gold standard text for news document - 5
  6. Gold standard text for news document - 6
  7. Gold standard text for news document - 7
  8. Gold standard text for news document - 8
  9. Gold standard text for news document - 9
  10. Gold standard text for news document - 10
The HTML for the 10 news documents was retrieved by dereferencing each of the 10 URLs with curl. This means the boilerplate removal methods operated on just HTML (without running Javascript). I also ran the boilerplate removal methods on archived copies from archive.is for the 10 documents. This was based on the rationale that since archive.is runs Javascript and transforms the original page, this might impact the results. My experiment showed that boilerplate removal run on archived copies reduced the similarity between the gold standard texts and the output texts of all the boilerplate removal methods except BeautifulSoup's get_text() method (Table 2).
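For readers who prefer to stay in Python, the dereferencing step can be approximated with the standard library. This is only a sketch: I used curl in the experiment, and the function name fetch_html, the User-Agent string, and the timeout are assumptions made for illustration. Like curl, it returns the raw HTML and does not run Javascript.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Dereference a URL and return the response body as text (no Javascript is run)."""
    # Hypothetical User-Agent value; some sites reject requests without one.
    request = Request(url, headers={"User-Agent": "boilerplate-survey-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string is what each boilerplate removal method would receive as input.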

Second, for each document, I manually copied the text I considered to be the main body of the document to create a total of 10 gold standard texts. Third, I removed the boilerplate from the 10 documents using the 8 methods outlined in Table 1. This led to a total of 80 extracted text documents (10 for each boilerplate removal method). Fourth, for each of the 80 documents, I computed the Jaccard Index (the size of the intersection divided by the size of the union of both sets) over the document and its respective gold standard. Fifth, for each of the 8 boilerplate removal methods outlined in Table 1, I computed the average and median of the Jaccard scores for the 10 news documents (Table 1).
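The Jaccard computation in the fourth step can be sketched as follows. The tokenization (lowercasing and splitting on word characters) and the function names are assumptions for illustration and may differ from what the experiment's code does.

```python
import re


def word_set(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_index(extracted: str, gold: str) -> float:
    """Size of the intersection divided by size of the union of the two word sets."""
    a, b = word_set(extracted), word_set(gold)
    if not a and not b:
        return 1.0  # Two empty texts are treated as identical.
    return len(a & b) / len(a | b)


gold = "the cat sat on the mat"
extracted = "home news the cat sat on the mat share"
print(jaccard_index(extracted, gold))  # 5 shared words / 8 distinct words = 0.625
```

In this toy example the extractor kept every gold standard word but also kept three words of navigation text, so the score drops from 1.0 to 0.625.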

Result

Table 1: Boilerplate removal results for live web news documents

Rank  Method                                      Mean Jaccard index (10 documents)  Median Jaccard index (10 documents)
1     Python-boilerpipe.ArticleExtractor          0.7530                             0.8964
2     Justext                                     0.7134                             0.8339
3     Python-boilerpipe.DefaultExtractor          0.6706                             0.7073
4     Python-goose                                0.7009                             0.6822
5     Python-boilerpipe.CanolaExtractor           0.6227                             0.6472
6     Python-boilerpipe.LargestContentExtractor   0.6188                             0.6444
7     NLTK's (OLD) clean_html()                   0.3847                             0.3479
8     BeautifulSoup's get_text()                  0.1959                             0.2201


Table 2: Boilerplate removal results for archived news documents showing lower similarity compared to live web version (Table 1)
Rank  Method                                      Mean Jaccard index (10 documents)  Median Jaccard index (10 documents)
1     Python-boilerpipe.ArticleExtractor          0.6240                             0.7121
2     Python-boilerpipe.DefaultExtractor          0.5534                             0.7010
3     Justext                                     0.5956                             0.6414
4     Python-boilerpipe.CanolaExtractor           0.5028                             0.5274
5     Python-boilerpipe.LargestContentExtractor   0.4961                             0.4669
6     Python-goose                                0.4209                             0.4289
7     NLTK's (OLD) clean_html()                   0.3365                             0.3232
8     BeautifulSoup's get_text()                  0.2630                             0.2687
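The row order in both tables is consistent with sorting by the median column rather than the mean. A minimal sketch of that summary step is shown below; the per-document scores are made-up values (the tables only report the aggregates), and the names summarize, method_a, and method_b are hypothetical.

```python
from statistics import mean, median


def summarize(scores_by_method):
    """Return (method, mean, median) rows sorted by median, highest first."""
    rows = [(name, mean(s), median(s)) for name, s in scores_by_method.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up per-document Jaccard scores, for illustration only.
scores = {
    "method_a": [0.9, 0.9, 0.1],  # one badly extracted document drags the mean down
    "method_b": [0.7, 0.7, 0.7],
}
for rank, (name, avg, med) in enumerate(summarize(scores), start=1):
    print(rank, name, round(avg, 4), round(med, 4))
```

Here method_a ranks first by median even though its mean is lower, which is the same pattern as Python-boilerpipe.DefaultExtractor and Python-goose in Table 1. Reporting both numbers helps show when a few poorly extracted documents are responsible for a low average.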

Python-boilerpipe (ArticleExtractor mode) outperformed all the other methods. I acknowledge that this experiment is by no means rigorous for important reasons which include:
  • The test sample is very small.
  • Only news documents were considered.
  • The use of the Jaccard similarity measure forces documents to be represented as sets. This eliminates order (the permutation of words) and duplicate words. Consequently, if a boilerplate removal method omits some occurrences of a word, this information will be lost in the Jaccard similarity calculation.
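The last limitation can be illustrated by comparing the set-based Jaccard index with a multiset (bag with counts) variant. The multiset variant was not used in this experiment; it is shown only as a contrast, and the function names are my own.

```python
from collections import Counter


def set_jaccard(a_tokens, b_tokens):
    """Jaccard index over sets: word order and duplicate words are ignored."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def multiset_jaccard(a_tokens, b_tokens):
    """Jaccard index over multisets: repeated words count toward the score."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    union = sum((a | b).values())  # per-word maximum of the two counts
    return sum((a & b).values()) / union if union else 1.0  # per-word minimum


gold = "to be or not to be".split()
extracted = "to be or not".split()  # the extractor dropped the repeated words
print(set_jaccard(extracted, gold))       # 1.0
print(multiset_jaccard(extracted, gold))  # 4 / 6
```

The set-based score reports a perfect match even though a third of the gold standard tokens are missing from the extracted text.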
Nevertheless, I believe this small experiment sheds some light on the different behaviors of the boilerplate removal methods. For example, BeautifulSoup's get_text() does not do a good job of removing HTML, Javascript, CSS, and comments, unlike NLTK's clean_html(), which does a good job of removing these but includes extraneous text. Also, Justext and Python-goose do not include a large body of extraneous text, even though they may omit a news article's title. Finally, based on these experiment results, Python-boilerpipe is the best boilerplate removal method.

2017-04-13 Edit: At the request of Ryan Baumann, I included Python-readability in this survey.
Fig. 6: Boilerplate removal result for Python-readability method for a news website. Extracted text does not include Javascript, HTML comments, and CSS. However, the extracted text includes non-contiguous segments of extraneous HTML. Also, the title and first paragraph were omitted.
Description: Python-readability was developed by Yuri Baburov. It is a Python port of a Ruby port of arc90's readability project. It attempts to pull out the main body text of a document and clean it up.
Recommendation: Python-readability ranked 5th in Table 1 with an average Jaccard index score and median Jaccard index score of 0.5990 and 0.6567, respectively. Similarly, it ranked 5th in Table 2 with an average Jaccard index score and median Jaccard index score of 0.5021 and 0.5236, respectively. This library removes Javascript, HTML comments, and CSS, but it does not do a good job of removing all the HTML from the output text. Also note that Python-readability may omit page titles. If you choose to use this library, consider further cleanup operations to remove the extraneous HTML.

2017-10-07 Edit: I included Scrapy, Newspaper, and news-please in the boilerplate removal survey. Please note that these libraries were not designed exclusively for boilerplate removal - boilerplate removal is a single feature among a collection of other primary functionalities. Therefore, my recommendation on the use of any library only considers the effectiveness of the library for boilerplate removal.


Additional methods for boilerplate removal

  1. Scrapy
    • Description: Scrapy is a very popular Python library used for crawling and extracting structured data from websites. Boilerplate removal is provided by the remove_tags() function. This method performed poorly in the survey since its output combined extraneous text with JavaScript and CSS text and empty spaces. Scrapy ranked 8th in Table 1 with an average Jaccard index score and median Jaccard index score of 0.2140 and 0.2235, respectively. It also ranked 8th in Table 2 with an average Jaccard index score and median Jaccard index score of 0.2635 and 0.2692, respectively.
    • Recommendation: I don't recommend exclusive use of remove_tags() for boilerplate removal.
  2. Newspaper
    • Description: Newspaper is a python library developed by Lucas Ou-Yang, designed primarily for news article scraping and curation. Some Newspaper features include: a multi-threaded article download capability, news URL identification, text extraction from HTML, top/all image extraction from HTML, and summary/keyword extraction from text.
    • Recommendation: Newspaper is a decent boilerplate removal method, but it was outperformed by Python-boilerpipe. Also note that Newspaper may omit page titles. Newspaper ranked 5th in Table 1 (marginally outperformed by Python-goose) with average Jaccard index score and median Jaccard index score of 0.6709 and 0.6822, respectively. It ranked 5th in Table 2 also with average Jaccard index score and median Jaccard index score of 0.4941 and 0.4741, respectively.
  3. news-please
    • Description: Felix Hamborg introduced me to news-please based on his research on news crawling and extraction. news-please is a multi-language, open-source crawler and extractor of news articles. It is designed to help users crawl news websites and extract metadata such as titles, lead paragraphs, main content, publication date, author, and main image. news-please combines Scrapy, Newspaper, and Python-readability.
    • Recommendation: See Newspaper recommendation because news-please had the same performance scores as Newspaper. This is no surprise because news-please utilizes Newspaper.

Fig. 7: Boilerplate removal result for Scrapy method for a news website. Extracted text includes extraneous text (Junk text), Javascript, CSS and empty spaces.
Fig. 8: Boilerplate removal result for Newspaper and news-please methods for a news website. No extraneous text, but some missing text such as the title.
--Nwala

4 comments:

  1. Hi, do you mind sharing pieces of code used to extract contents and calculate the Jaccardian indexes? I have repeated your experiment on the same documents checking other libraries I found and wanted to do a comparison, but I get a bit different values. To be precise justext is doing far better than boilerpipe and I am not sure if I am not doing something wrong.

    1. Hello Anne,

      here is the source code: https://github.com/anwala/experiment/tree/master/BoilerplateRM

    2. Thank you very much Alexander, I will look at it and check what are the differences. I will get back to you, thanks.

  2. Hello Anne,
    I will prepare the code to share with you before the end of the week. I'm very interested in seeing your results.
    Thanks.
