Yesterday we told you that we had updated Churnalism with fresher Wikipedia content, making it more reliable than ever at telling you whether a news article contains boilerplate language. Today, we’ll tell you about the technical challenges involved — and about a new tool spawned from that effort.
Periodically loading a new copy of Wikipedia represents a substantial effort, but it’s critical to maintaining good results. Wikipedia dumps its entire database once per month as a huge, compressed XML file. The first time we loaded the Wikipedia corpus, we used the WikiExtractor.py script from the Tanl package. Since there is no formal grammar for the wikitext markup style, this script has become a de facto standard for the Python crowd. It worked well, and we were thankful to avoid implementing it from scratch. Still, it did have a few rough edges that we wanted to work around — so we’ve forked that script into a more flexible tool.
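To give a sense of the shape of the problem: the dump is one enormous XML file, so it has to be processed as a stream rather than loaded whole. The snippet below is a minimal sketch of that idea using Python’s standard library — it is illustrative, not the code of our forked extractor, and `iter_pages` is a hypothetical name.

```python
import io
from xml.etree import ElementTree


def iter_pages(xml_file):
    """Stream (title, wikitext) pairs from a MediaWiki XML export,
    clearing each element as we go so a multi-gigabyte dump never
    has to fit in memory at once."""
    title, text = None, None
    for _event, elem in ElementTree.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # release the parsed subtree
```

For a real dump you would wrap the compressed file with something like `bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb")` and pass it in; the streaming loop is the same either way.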
When we load the Wikipedia corpus into our document database, we leave out a large number of documents that are unlikely to match any news articles. For example, we eliminate disambiguation pages, redirect pages, very short articles and “list of” pages such as “List of Districts of Serbia,” as well as de facto listing pages such as “Alien.” One of our frustrations with the original extractor was that we had to perform all of our filtering after the documents were converted to plain text. We were wasting computing time converting “List of” pages to plain text despite knowing that we would later discard them. Worse, we couldn’t directly filter out these types of documents at all, since the information needed to identify them was lost in the conversion to plain text. We had to fall back on cruder heuristics, such as eliminating documents shorter than a certain length. Our new tool lets us filter and transform each document in an arbitrary order while it’s being extracted, saving valuable time and allowing us to iterate more quickly.
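The key point is that the filters run against the raw wikitext and title, before any expensive plain-text conversion. A minimal sketch of that pattern, with illustrative predicate names that are not the actual API of our forked extractor:

```python
# Hypothetical filter predicates applied to the raw wikitext *before*
# the expensive plain-text conversion step. Each returns True when the
# page should be discarded.

def is_redirect(title, wikitext):
    """Redirect pages start with a #REDIRECT directive."""
    return wikitext.lstrip().lower().startswith("#redirect")


def is_list_page(title, wikitext):
    """'List of ...' pages are unlikely to match news articles."""
    return title.lower().startswith("list of")


def is_too_short(title, wikitext, min_chars=500):
    """Very short articles rarely produce meaningful matches."""
    return len(wikitext) < min_chars


FILTERS = [is_redirect, is_list_page, is_too_short]


def keep(title, wikitext):
    """Keep a page only if no filter rejects it."""
    return not any(f(title, wikitext) for f in FILTERS)
```

Because the predicates see the title and markup directly, a “List of” page can be dropped immediately instead of being converted first and discarded later by a length heuristic.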
The last issue we wanted to solve with this fork involved block quotes. Sometimes our users would visit news articles that were quite old; between the article being published and our users visiting it, the article had been quoted in Wikipedia. Since our system can’t reliably determine which text was published first, we would display a false positive: Wikipedia was properly attributing the text to the original article, yet our tool was claiming that the original author may have copied it inappropriately. This was frustrating for both us and our users. Our new extraction tool allows us to preserve those block quote tags and post-process the matching results to eliminate them. The wider web has looser rules surrounding block quote tags, so we still can’t (yet!) avoid this when an article properly quotes Wikipedia.
While we are sure to get more use out of this new tool for our work on Churnalism and other projects, hopefully it will be useful to others as well. The code carries the GPLv3 license, in line with the original project.