Earlier this year, I participated in Sunlight Foundation’s PDF Liberation hackathon, an event created (as the name suggests) to liberate valuable information, both government and otherwise, from Portable Document Format (PDF) files. Working under the umbrella of OpenJC, a Code for America local chapter that supports civic hacking in Jersey City, NJ, we created an interactive budget for the City of Jersey City as one of the PDF Liberation challenge projects. This interactive budget was the first project of the newly formed Code for America chapter and was developed to introduce and foster collaboration between city government, residents and the local technology community.
The main goal of the project was to help the public become more educated about the budget process and city finances and, ultimately, more involved in budget discussions. It was also meant to illustrate how government can be open and how open government can bring value to the community. Inspired by the OpenSpending project and the participatory Open Budget in Oakland, the OpenJC team began to build an interactive web-based visualization and soon realized that preparing the data would be laborious and full of obstacles. Historical financial data for the city was available on the official city website in the form of 37 scanned PDF documents totaling 3,871 pages. The city budget alone was a 100+ page PDF document, which needed to be converted to a machine-readable format. OpenJC contemplated crowdsourcing the work by asking residents to volunteer to manually convert the data to spreadsheets. Fortunately, on January 17th 2014, Sunlight Foundation, together with Public Sector Credit Solutions, organized a national PDF Liberation Hackathon dedicated to improving open source tools for PDF extraction. The event was hosted in six cities, including New York, Washington DC, San Francisco, Chicago and Oklahoma. In addition to providing a list of tools for working with PDF documents, the organizers arranged on-site support in New York from the developers of Tabula.
Over the course of the weekend, we built a framework that automated the conversion of the 37 official city documents using several tools. The ABBYY Cloud OCR SDK API, which provided 50,000 free pages to all participants, was used to convert the scanned image PDFs into text-based PDFs. Next, Tabula, developed by a team at the New York Times, was used to convert the text PDFs to spreadsheets: its non-interactive page parser converted each page of a PDF into a single spreadsheet. The results of the table parser were not completely accurate but could be cleaned up by programming some higher-level heuristics. The project is open source and can be found on GitHub. The liberation of the Jersey City budget won first prize at the New York event.
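To give a sense of what such cleanup heuristics can look like, here is a minimal sketch in Python. It is not the code from the project: the function names, the assumed row layout (a label followed by amount cells) and the list of OCR character fixes are all illustrative assumptions.

```python
import re
from typing import List, Optional, Tuple

# Letters that OCR commonly confuses with digits.
# This mapping is an assumption for illustration, not taken from the project.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def parse_amount(cell: str) -> Optional[float]:
    """Return the cell as a number, or None if it does not look like an amount."""
    # Only attempt OCR fixes on cells that already contain a digit,
    # so that plain words are never turned into numbers.
    if not any(ch.isdigit() for ch in cell):
        return None
    text = cell.strip().translate(OCR_DIGIT_FIXES)
    # Accounting-style negatives are written in parentheses: (2,000)
    negative = text.startswith("(") and text.endswith(")")
    text = re.sub(r"[$,\s()]", "", text)
    if not re.fullmatch(r"-?\d+(\.\d+)?", text):
        return None
    value = float(text)
    return -value if negative else value


def clean_rows(rows: List[List[str]]) -> List[Tuple[str, List[Optional[float]]]]:
    """Keep rows that have a label and at least one parsable amount.

    Assumes (hypothetically) that the first cell is a label and the
    remaining cells are amounts.
    """
    cleaned = []
    for row in rows:
        if not row:
            continue
        label = " ".join(row[0].split())  # collapse stray whitespace
        amounts = [parse_amount(cell) for cell in row[1:]]
        if label and any(a is not None for a in amounts):
            cleaned.append((label, amounts))
    return cleaned
```

Rows such as page footers or stray fragments, which have no label or no parsable amount, are dropped, while a cell like `1O,5OO` is read as a number. Real budget documents would likely need rules tuned to their own layout.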
To complete the project, the tabular data extracted during the PDF Liberation event was converted into a hierarchical data model and used to create a D3 visualization of the 2013 budget. The interactive project, which was published by the mayor of Jersey City on March 20th this year, is available on the Jersey City website. When the city published the proposed 2014 budget, again in PDF format, the set of tools developed during the event was used to convert the data in less than one day.
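The step from flat spreadsheet rows to a hierarchy can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's actual code: the column layout (department, division, line item, amount), the sample figures and the function names are hypothetical, and the nested `name`/`children`/`value` shape is simply one that D3's hierarchical layouts commonly consume.

```python
import json


def build_hierarchy(rows, root_name="Budget"):
    """Nest flat rows of (level_1, ..., level_n, amount) into a tree of dicts."""
    root = {"name": root_name}
    for *path, amount in rows:
        node = root
        for name in path:
            children = node.setdefault("children", [])
            # Reuse an existing child with this name, or create it.
            child = next((c for c in children if c["name"] == name), None)
            if child is None:
                child = {"name": name}
                children.append(child)
            node = child
        # Repeated rows for the same line item are summed.
        node["value"] = node.get("value", 0) + amount
    return root


def total(node):
    """Sum the leaf values beneath a node."""
    if "children" in node:
        return sum(total(child) for child in node["children"])
    return node.get("value", 0)


if __name__ == "__main__":
    # Hypothetical sample rows, not real Jersey City figures.
    sample = [
        ("Public Safety", "Police", "Salaries", 100.0),
        ("Public Safety", "Police", "Equipment", 50.0),
        ("Public Safety", "Fire", "Salaries", 80.0),
    ]
    print(json.dumps(build_hierarchy(sample), indent=2))
```

Writing the result out as JSON keeps the conversion separate from the visualization, so the same script can be rerun when a new budget is published.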
To further the understanding of the city's finances, OpenJC plans to add interactive visualizations that highlight year-to-year differences, link revenues to spending and explain municipal debt, and to expand the data sets to include the budgets of the city’s six autonomous agencies and the board of education. To increase public participation in the budget process, the civic groups in Jersey City, including Civic JC, Civic Parent, Sustainable JC and Open JC, are collaborating with the help of CitizensCampaign to host a Budget Forum in Jersey City on May 15th. The goal of the event is to bring residents and the city council together to discuss the city budget in detail. The Jersey City Budget visualization will be used as one of the tools to help understand the 2014 budget proposal.
Anna Lukasiak is a co-founder of Open JC, a local community group collaborating with the Jersey City Government to build open data, open source and open government in Jersey City. She is a supporter of public sector innovation and the value of digital-government projects. As part of her work with Open JC, Anna created the prototype for the Jersey City Budget Visualization application, which won first prize at the PDF Liberation Hackathon. Other projects Anna has overseen with Open JC include the development of updated Ward Boundary maps and partnerships with city agencies, including the Jersey City Police Department and the Jersey City Department of Recreation. Anna has 16 years of information technology experience and obtained her Bachelor's and Master's degrees in Civil Engineering and Economics from MIT. You can reach her at email@example.com.
Interested in writing a guest blog for Sunlight? Email us at firstname.lastname@example.org