OpenGov Voices: Exploring the transparency and open data landscape

Kamil Gregor, Data Analyst at Masaryk University and Image credit:

After this year’s TransparencyCamp, I’m convinced the transparency and open data community has moved to a new stage of its development. As recently as three years ago, we were excited by the first civic organizations popping up in various countries and by governments opening large datasets for the first time.

Since then, the community has grown considerably larger and stronger. The discussion has largely shifted from how to get the data to what to do with it and how to connect better internationally. Cross-country cooperation, advocacy campaigns and data standards are becoming the hot topics at conferences and on high-profile mailing lists.

With this shift in focus, there’s a new need in the community: a need for rich, accurate and timely information on what kinds of organizations, projects, tools, stories and experts exist, where they are and what they’re doing. The phrase I heard probably most often at TCamp, as well as at last year’s OKCon, was something like: “Gee, I wish there was a website listing projects like these,” or “Where can I find all these awesome tools?”

As the community grows increasingly complex, more and more people seem to realize that someone somewhere has most likely already solved the problem they’re dealing with, developed an app they need or written a great story that gets their message across.

It’s not surprising there are many maps and catalogs of transparency and open data organizations, projects, experts, websites etc. For example, Sunlight Foundation has been keeping track of international transparency organizations in one impressive spreadsheet. The OpeningParliament community maintains a fairly comprehensive global list of parliamentary monitoring organizations. Open Government Partnership maps many organizations and individuals involved in the initiative.

I believe, however, that these attempts to explore the community employ an outdated and ineffective method: they create phone books. Usually, the goal is to put together as comprehensive a list, map or wiki as possible. Data on individual items, be they organizations, people, projects, websites or tools, are gathered manually by researchers. As soon as these endeavors run out of funding, they are usually no longer expanded or updated. The end users of the product, moreover, have little incentive to collaborate in creating and maintaining it.

I argue that instead of phone books, we need dating websites. They are nice examples of a more progressive method of organizing knowledge. A user provides information describing himself or herself and is offered potential partners based on how well that information matches other users’ information. Even better examples of this approach are websites such as What Should I Read Next, which recommend books or movies.

If I tell the website that I’ve read “2001: A Space Odyssey”, it suggests I read Isaac Asimov’s “I, Robot” and not, for example, Jane Austen’s “Pride and Prejudice”. Why? Because the former is much more similar to the book I’ve read than the latter.

What’s great about this method is that it’s automated and content-blind. People running the website don’t actually have to read all the books and then manually decide what’s similar to what. The decision is made by a robot.

And how does a robot unable to understand content know what to recommend? Because there are more users who like “2001: A Space Odyssey” and “I, Robot” at the same time than users who enjoy “2001: A Space Odyssey” and “Pride and Prejudice”. This is the essence of so-called relational data.
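The counting behind such a recommendation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reading histories are made up, the function names are hypothetical, and real recommendation websites certainly use more refined scoring than raw co-occurrence counts.

```python
from collections import Counter
from itertools import permutations

# Hypothetical reading histories: one set of liked books per user.
likes = [
    {"2001: A Space Odyssey", "I, Robot"},
    {"2001: A Space Odyssey", "I, Robot", "Dune"},
    {"2001: A Space Odyssey", "Pride and Prejudice"},
    {"Pride and Prejudice", "Emma"},
]

def co_occurrence(histories):
    """Count, for every ordered pair of books, how many users like both."""
    counts = Counter()
    for liked in histories:
        for a, b in permutations(sorted(liked), 2):
            counts[(a, b)] += 1
    return counts

def recommend(book, histories, k=3):
    """Return up to k books most often liked together with `book`."""
    counts = co_occurrence(histories)
    scored = [(n, other) for (b, other), n in counts.items() if b == book]
    # Highest count first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

print(recommend("2001: A Space Odyssey", likes))
```

Nothing in the code looks at what the books are about. It only counts how often two titles are liked by the same person.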

I use the same approach to explore the transparency and open data community. I aggregate already existing lists, catalogs and maps of organizations in the community to measure their similarity. Two organizations that appear on the same lists are usually very similar. And conversely, two organizations that are rarely included in a list together probably have little in common.

Note that I don’t necessarily have to know anything about the organizations or even the lists in question. All I need to know is the organizations’ memberships in lists. Similarity of the organizations is automatically calculated by a robot called principal component analysis (note that a robot called network analysis can also be used).
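For readers who want to see what that robot does, here is a minimal sketch of principal component analysis on a membership table, written in plain Python. It is not the software used for the analysis in this post: the four organizations and their lists are invented, the function names are hypothetical, and the components are found with simple power iteration where a statistics package would use a more robust routine.

```python
import math

# Hypothetical membership data: the lists each organization appears on.
memberships = {
    "Org A": {"advocacy", "monitoring", "research"},
    "Org B": {"advocacy", "monitoring"},
    "Org C": {"open data", "technology"},
    "Org D": {"open data", "technology", "research"},
}

def membership_matrix(memberships):
    """Build a 0/1 table: one row per organization, one column per list."""
    orgs = sorted(memberships)
    lists = sorted(set().union(*memberships.values()))
    rows = [[1.0 if name in memberships[o] else 0.0 for name in lists]
            for o in orgs]
    return orgs, lists, rows

def center(rows):
    """Subtract each column's mean so that only variation is analyzed."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_component(rows, iterations=500):
    """Power iteration on X^T X: approximates the leading direction."""
    dim = len(rows[0])
    # Non-uniform start vector; an unlucky start could miss the direction.
    v = [float(i + 1) for i in range(dim)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        v = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(dim)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0.0:
            break
        v = [w / norm for w in v]
    return v

def pca_2d(rows):
    """Return (x, y) coordinates of each row on the first two components."""
    x = center(rows)
    coords = []
    for _ in range(2):
        v = top_component(x)
        scores = [sum(a * w for a, w in zip(row, v)) for row in x]
        coords.append(scores)
        # Deflate: remove the component just found before the next pass.
        x = [[a - s * w for a, w in zip(row, v)]
             for row, s in zip(x, scores)]
    return list(zip(*coords))

orgs, lists, rows = membership_matrix(memberships)
for org, (px, py) in zip(orgs, pca_2d(rows)):
    print(f"{org}: x={px:+.2f}, y={py:+.2f}")
```

In this toy example, Org A and Org B land on one side of the first axis and Org C and Org D on the other, purely because of which lists they share. The sign of each axis is arbitrary, so a chart may come out mirrored without changing its meaning.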

To prove this concept, I analyzed data from Sunlight Foundation’s spreadsheet of transparency organizations refined by Mor Rubinstein. I took 139 non-governmental organizations and assigned them memberships in 36 topical lists. Every list corresponds to one concept, usually an activity or an issue. If an organization engages in an activity or focuses on a topic, it is included in the corresponding concept’s list:

The principal component analysis returns two charts. The first chart visualizes the landscape of concepts:

Each bubble represents one concept. The size of a bubble corresponds to the number of organizations on the concept’s list. If two concepts are close together in the chart, it means that organizations assigned to one concept are very often also assigned to the other. For example, “research” and “monitoring” are close. This means that organizations that conduct research often also engage in monitoring. (You can easily select concepts in the menu.)
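The two quantities the chart encodes can also be computed directly. The sketch below uses invented assignments and hypothetical function names. The overlap measure (shared organizations divided by all organizations on either list, known as the Jaccard index) is a simpler stand-in for the distances in the chart, not the calculation the principal component analysis performs.

```python
from itertools import combinations

# Hypothetical assignments of organizations to concepts.
concepts = {
    "research": {"Org A", "Org B", "Org C"},
    "monitoring": {"Org A", "Org B", "Org D"},
    "open data": {"Org E", "Org F"},
}

def bubble_sizes(concepts):
    """Number of organizations on each concept's list."""
    return {name: len(orgs) for name, orgs in concepts.items()}

def overlap(concepts):
    """Jaccard index for every pair of concepts (0 = disjoint, 1 = identical)."""
    result = {}
    for a, b in combinations(sorted(concepts), 2):
        union = concepts[a] | concepts[b]
        shared = concepts[a] & concepts[b]
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result

print(bubble_sizes(concepts))
for pair, score in overlap(concepts).items():
    print(pair, round(score, 2))
```

Here “monitoring” and “research” share two of four organizations, so they would sit close together, while “open data” shares none with either and would sit far away.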

The substantive meaning of the X and Y axes is not determined by the researcher, but it can be interpreted a posteriori based on the configuration of concepts. I argue that the X axis captures the division between traditional and progressive activities. On the left, there are concepts such as “advocacy”, “monitoring” and “research”. On the right, there are concepts such as “technology” and “open data”:

This suggests that organizations that engage in traditional activities tend not to develop and use progressive tools and methods, and vice versa. The Y axis is much more difficult to interpret, but it may capture what I call society-oriented versus state-oriented activities. On the top, there are concepts such as “democracy” or “direct action” that are fundamentally oriented towards citizens. On the bottom, there are concepts like “lobbying”, “corruption” and “info access” that entail interactions with the state.

The landscape reveals several surprising facts. For example, “participation” is located closer to the progressive end of the horizontal axis, suggesting that many organizations focusing on participation use open data and modern technologies.

The second output of the principal component analysis is a chart that visualizes the landscape of organizations:

This time, each point represents one organization. The same meaning of axes is retained. The horizontal axis separates traditional organizations, such as many Transparency International chapters, on the left and progressive open data and technology organizations on the right:

The vertical axis separates organizations oriented towards interactions with the state on the top, such as NDI and organizations monitoring parliaments, elections and party finances, from organizations oriented towards interactions with society on the bottom, for example OKF and a number of FOI organizations:

Again, this chart reveals several interesting, previously unknown facts that would be difficult to find and show using other methods. For example, it seems that almost all state-oriented organizations are fairly progressive: the top right corner of the chart is filled with data points, while the top left corner, which would be occupied by traditional and state-oriented organizations, is empty.

The value of this automated way of measuring similarity is supported by the fact that organizations known to be similar are clustered on the landscape. For example, various Transparency International chapters engage in similar activities, and they are clustered in the chart:

It should be noted that this analysis is very preliminary. Even a brief glance over the raw data above reveals that it needs serious cleanup and updating. I intend to expand the analysis by including additional lists, such as those mentioned above. Even such low-quality data, however, returns results that seem sensible to someone with intimate knowledge of the community. This suggests the method is fairly robust and further supports its usefulness.

But creating fancy charts is not the goal here. I envision a simple Google-like website where a member of the transparency and open data community quickly fills in key information about his or her organization or project. The website then advises the user to check out five or so organizations, people, projects and tools because they have a similar agenda, are based in the same country, receive funding from the same sources etc. Growing demand for such a service makes me believe we’ll see a beta version very soon.
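The matching step of such a website could start out very simply. The following Python sketch is hypothetical throughout: the profiles, the "kind:value" attribute format and the `suggest` function are invented for illustration, and ranking by the raw number of shared attributes is only the most basic scoring one could choose.

```python
# Hypothetical profiles: each entry is a set of "kind:value" attributes.
profiles = {
    "Org A": {"topic:parliament", "country:CZ", "funder:X"},
    "Org B": {"topic:parliament", "country:SK", "funder:X"},
    "Org C": {"topic:budget", "country:CZ"},
    "Org D": {"topic:health", "country:US"},
}

def suggest(query, profiles, k=5):
    """Rank known profiles by the number of attributes shared with `query`.

    Returns up to k (name, shared attributes) pairs so the website can
    also tell the user why each suggestion was made.
    """
    scored = []
    for name, attrs in profiles.items():
        shared = query & attrs
        if shared:
            scored.append((len(shared), name, sorted(shared)))
    # Most shared attributes first; ties are broken by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, shared) for _, name, shared in scored[:k]]

newcomer = {"topic:parliament", "country:CZ"}
for name, shared in suggest(newcomer, profiles):
    print(name, "because of", ", ".join(shared))
```

Every profile a user fills in also becomes a new entry for future matches, which gives users the incentive to contribute that phone books lack.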

Kamil Gregor, CC BY 2014.

Kamil Gregor is a data analyst with Masaryk University and, a Czech and Slovak non-profit and politically independent parliamentary monitoring organization, founded with the aim of promoting political transparency. Kamil focuses mainly on parliamentary data openness. He is also involved in the Reconstruction of the State project, a joint Czech initiative of anti-corruption organizations, experts, businesses and local supporters with the specific goal of adopting nine realistic legislative measures.
