Wrangling Open Government Data: A Case Study with Data Preparer
Open Government Data
Governments collect and analyse significant amounts of data, for example for monitoring the effect of their policies and to support decision making. This data is often useful for other applications, and it has been recognized that making such data open could have significant benefits. There are even long-standing principles to inform the publishing of open government data.
This open data movement has given rise to a variety of open data portals and repositories. These may be international, national (e.g., UK, US), regional (e.g., London, New York) or by topic (e.g. house price data).
The features of open government data vary from collection to collection. For example, regional and national data sets likely have wider coverage. However, they may also be associated with more publishers, and thus be of more variable quality. Here we look at some examples using the London Datastore. This site brings together over 800 data sets, on topics including transport, environment, health, housing, jobs and the economy. These are released under several different licenses, but many use the UK Open Government License, which supports flexible reuse of the data in exchange for an acknowledgement. The data sets show signs of having been curated. Recurring names tend to be consistently formatted, and standard reference codes for geographic areas are widely used (if sometimes not consistently named).
Preparing the Data
Wrangling open government data typically involves bringing several data sets together. Insights can often be obtained by looking for patterns involving different types of data (e.g., environment and health, education and social factors). Here we describe how such open government data can be wrangled using Data Preparer. For the specific problem to be addressed, we assume that the goal is to relate educational attainment in schools with different social factors, at the level of the borough. A borough is an administrative division in local government. There are 32 boroughs in London. We assume that several data sets have been imported into Data Preparer, not all of which are relevant to the problem. The imported sources can be viewed and explored in more detail from the sources tab.
These sources can then be browsed to identify relevant attributes, which can then be added to the target. It is important in this example that the attribute names for the target are closely related to those in the sources, as we do not assume in this example that we have access to representative data for the target.
Then selecting wrangle in the left menu will cause Data Preparer to populate the end product.
In this case the initial end product is quite good. The result is produced by joining tables together (on the borough name or code) that provide the data for each of the attributes. The user can see how the result was produced by selecting explain in the left menu.
This is an example of an integration in which the different data sets required each provide data for different columns. This is sometimes referred to as vertical integration, as the columns are brought together from different sources. In horizontal integration, the rows are brought together from different sources. In the London Datastore, the same type of data is rarely available in different data sets, so vertical integration is more commonly required.