Wrangling Open Government Data: A Case Study with Data Preparer

You are currently viewing Wrangling Open Government Data: A Case Study with Data Preparer

Wrangling Open Government Data: A Case Study with Data Preparer

Open Government Data

Governments collect and analyse significant amounts of data, for example for monitoring the effect of their policies and to support decision making. This data is often useful for other applications, and it has been recognized that making such data open could have significant benefits. There are even long-standing principles to inform the publishing of open government data

This open data movement has given rise to a variety of open data portals and repositories. These may be international, national (e.g., UK, US), regional (e.g., London, New York) or by topic (e.g. house price data).

The features of open government data vary from collection to collection. For example, regional and national data sets likely have wider coverage. However, they may also be associated with more publishers, and thus be of more variable quality. Here we look at some examples using the London Datastore. This site brings together over 800 data sets, on topics including transport, environment, health, housing, jobs and the economy. These are released under several different licenses, but many use the UK Open Government License, which supports flexible reuse of the data in exchange for an acknowledgement. The data sets show signs of having been curated. Recurring names tend to be consistently formatted, and standard reference codes for geographic areas are widely used (if sometimes not consistently named).

Preparing the Data

Wrangling open government data typically involves bringing several data sets together. Insights can often be obtained by looking for patterns involving different types of data (e.g., environment and health, education and social factors). Here we describe how such open government data can be wrangled using Data Preparer. For the specific problem to be addressed, we assume that the goal is to relate educational attainment in schools with different social factors, at the level of the borough. A borough is an administrative division in local government. There are 32 boroughs in London. We assume that several data sets have been imported into Data Preparer, not all of which are relevant to the problem. The imported sources can be viewed and explored in more detail from the sources tab.

Browsing sources

These sources can then be browsed to identify relevant attributes, which can then be added to the target. It is important in this example that the attribute names for the target are closely related to those in the sources, as we do not assume in this example that we have access to representative data for the target.

Defining the target

Then selecting wrangle in the left menu will cause Data Preparer to populate the end product.

Wrangling results

In this case the initial end product is quite good. The result is produced by joining tables together (on the borough name or code) that provide the data for each of the attributes. The user can see how the result was produced by selecting explain in the left menu.

This is an example of an integration in which the different data sets required each provide data for different columns. This is sometimes referred to as vertical integration, as the columns are brought together from different sources. In horizontal integration, the rows are brought together from different sources. In the London Datastore, the same type of data is rarely available in different data sets, so vertical integration is more commonly required.