Illuminating the jargon of criminal justice data with Elasticsearch
Last year, a team of researchers here at the Sunlight Foundation started putting together a wide-ranging but centralized inventory of criminal justice data. As we’ve been doing this important work, other efforts to address issues with criminal justice data have popped up nationwide. There’s been a call for better data by organizations as high up as the White House, where a Police Data Initiative has been organized. (We are honored to be involved with that effort.)
We’ve collected by hand the locations of thousands of datasets, along with information about each one: its category, its format, how often it is updated and about 20 other data points that help people find and navigate the data they’re looking for. The next step is to build a user-facing product, which we are currently working on.
Sunlight has worked with government data at the federal and state levels for nine years now, so we regularly clean up data that is incomplete or fragmented. Our work gathering information about criminal justice datasets across the country has yielded some familiar challenges, as well as some interesting new ones.
As the Web developer working with the criminal justice team, I’ve been trying to learn about the domain of criminal justice information, and its terminology in particular. One issue that came up as soon as I started looking at the data the team is collecting is the variation in the terminology used by researchers, practitioners and journalists. For example, the terms “close management,” “solitary housing unit” and “shu” all refer to the concept I knew as “solitary confinement,” but various jurisdictions were using different terms in their datasets. How do you help people uncover information across the country when different people use different language to refer to the same things?
Thanks to the thorough work of the criminal justice research team, I have access to a rich (and growing) list of terms and synonyms related to criminal justice data. To make use of this information, I’ve been working with [Elasticsearch](https://github.com/elastic/elasticsearch), an open source search engine with a high degree of customization. In the past, I’ve used Elasticsearch along with [Haystack](http://haystacksearch.org) (a tool to connect Elasticsearch with a Django website) to quickly and easily add search functionality to websites. For this project, however, I needed to dive into the [text analysis features](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) that allow you to transform human-readable text into search-optimized *tokens*.
With Elasticsearch, you can specify tasks that analyze and transform text when it is put into the search index (a database, essentially) and also when search queries are performed. A number of natural language processing tasks are built into Elasticsearch, such as various “stemmer” filters for matching different forms of a word to a root (such as “questioning” to “question”) and filters for removing extremely common “stop” words like “and,” “the” and “is.”
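To make the idea of an analysis chain concrete, here is a minimal sketch in plain Python. It is not how Elasticsearch implements these steps: the stop word list and the suffix-stripping “stemmer” are simplified stand-ins, and every function name here is hypothetical. It only illustrates how text might pass through a tokenizer, a stop word filter and a stemmer on its way to becoming tokens.

```python
import re

# Assumption: a tiny illustrative stop word list, not Elasticsearch's own.
STOP_WORDS = {"and", "the", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop extremely common words that add little to a search."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a couple of common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full chain: tokenize, remove stop words, then stem."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(analyze("The officer is questioning the suspect"))
```

Running the same chain on both the indexed text and the search query is what lets “questioning” in a query match “question” in a document.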
I experimented with a synonym filter to create sets of synonyms so that one word expands to a list of words (a search for “arrests” becomes a search for “arrests,bookings”). I’ve also used the synonym filter to map multiple phrases to a single phrase, or vice versa. By creating a custom synonym filter, I made it possible to search for “close management” or “solitary confinement” and also get results for “shu,” simply by mapping these terms so that “shu” gets stored in the search index along with the terms I’ve determined are related.
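A synonym filter along these lines could be declared in an index’s settings. The sketch below builds those settings as a Python dictionary and prints them as JSON. It is an illustration under assumptions, not the configuration used in the project: the names `criminal_justice_synonyms` and `jargon_analyzer` are hypothetical, the two synonym rules are examples drawn from the terms mentioned above, and the choice of the `standard` tokenizer and `lowercase` filter is mine.

```python
import json

# Assumption: illustrative rules only. Comma-separated terms on one line
# are treated as equivalent to one another by the synonym filter.
SYNONYM_RULES = [
    "arrests, bookings",
    "close management, solitary confinement, solitary housing unit, shu",
]

# Hypothetical filter and analyzer names; "synonym", "custom", "standard"
# and "lowercase" are built-in Elasticsearch analysis components.
index_settings = {
    "analysis": {
        "filter": {
            "criminal_justice_synonyms": {
                "type": "synonym",
                "synonyms": SYNONYM_RULES,
            }
        },
        "analyzer": {
            "jargon_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                # Lowercase first so the rules match regardless of case.
                "filter": ["lowercase", "criminal_justice_synonyms"],
            }
        },
    }
}

if __name__ == "__main__":
    # This JSON would be supplied as the "settings" when creating an index.
    print(json.dumps({"settings": index_settings}, indent=2))
```

Fields that should honor the synonyms would then need to be mapped to use the custom analyzer; that mapping is left out here because it depends on how the inventory’s fields are structured.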
So far, this approach has helped me show my colleagues one way our criminal justice inventory could be made into a searchable website, and it has given me a path to better understanding the complex nature of criminal justice data. There are many other challenges in working with a catalog of criminal justice datasets housed on various state, local, national and university websites: Jargon is only the tip of the iceberg. Nevertheless, it’s exciting to be able to leverage technology to create better ways of understanding the landscape of criminal justice data.