Data preparation challenges
Data preparation is necessary for many analysis tasks, but it is often reported to take 60% to 80% of a data scientist’s time. Why is this percentage so high? Intuitively, there are several reasons why data preparation can be expected to be challenging and time-consuming:
- Data preparation may involve a significant number of steps.
- These steps may require both technical and domain knowledge.
- The individual steps may be difficult to get right.
- There may be many options, for example many candidate data sets and many ways of combining them.
Data preparation process
The following figure illustrates a typical data preparation pipeline. For each of the steps, the role of the person carrying out the data preparation task is described.
Data preparation steps
In the pipeline above, there are six steps in the data preparation process. Each step may be supported by a specific tool or component, which must be learned and applied. In addition, applying these tools often requires the user to write rules or set configuration parameters. For example, mappings specify how data sources should be combined. Combining property data with open government data about the property's location might involve the following tables:
In this example, the most promising action is likely to be joining the tables on the Postcode. However, there are potential complications. What if the regions covered by Property Sales are different from those covered by Social Data? The result could then contain a large amount of missing data. The user must therefore not only be able to specify a suitable join, but must also be familiar with the data.
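The coverage problem described above can be checked before committing to a join. The following is a minimal sketch in plain Python, assuming two small in-memory tables. The rows, the `price` and `crime_rank` columns, and the helper names are all hypothetical and are not taken from the original tables. In practice a dataframe library would typically do this work.

```python
# Hypothetical rows: the postcodes, column names and values are illustrative only.
property_sales = [
    {"postcode": "M1 1AA", "price": 250000},
    {"postcode": "M2 2BB", "price": 310000},
    {"postcode": "LS1 3CC", "price": 180000},
]
social_data = [
    {"postcode": "M1 1AA", "crime_rank": 12},
    {"postcode": "M2 2BB", "crime_rank": 40},
    {"postcode": "OX1 4DD", "crime_rank": 95},
]


def left_join_on_postcode(left, right):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    right_by_postcode = {row["postcode"]: row for row in right}
    right_columns = sorted(
        {key for row in right for key in row if key != "postcode"}
    )
    joined = []
    for row in left:
        match = right_by_postcode.get(row["postcode"])
        merged = dict(row)
        for column in right_columns:
            merged[column] = match[column] if match else None
        joined.append(merged)
    return joined


def unmatched_fraction(joined, column):
    """Fraction of joined rows for which `column` is missing."""
    if not joined:
        return 0.0
    missing = sum(1 for row in joined if row[column] is None)
    return missing / len(joined)


joined = left_join_on_postcode(property_sales, social_data)
coverage_gap = unmatched_fraction(joined, "crime_rank")
print(f"{coverage_gap:.0%} of property sales have no matching social data")
```

A left join is used here so that unmatched sales stay visible. An inner join would silently drop them, which hides exactly the regional mismatch the user needs to know about.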
The example above identifies a potential issue when joining two given tables. However, there could be many data sets, and many possible ways of combining them. There could be property data from many real estate agencies, and open government data repositories can contain thousands of data sets. For instance, data.gov.uk contains around 35,000 data sets, and data.gov contains over 130,000.
There may also be trade-offs to be made when selecting the data. For example, in the open data repository for London, different information is available at different geographical scales, such as postcodes, electoral wards, or local government boroughs. As a result, even deciding on the most suitable granularity for an analysis requires significant insight into the available data sources.
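One way to reconcile data held at different scales is to aggregate the finer-grained data to the coarser level. The following is a minimal sketch, assuming a lookup from postcode to electoral ward is available. The lookup, the ward names, the prices, and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lookup from postcode to electoral ward.
postcode_to_ward = {
    "AA1 1AA": "Ward A",
    "AA1 1AB": "Ward A",
    "BB2 2BB": "Ward B",
}

# Hypothetical postcode-level sale prices.
sales = [
    ("AA1 1AA", 200000),
    ("AA1 1AB", 300000),
    ("BB2 2BB", 150000),
    ("CC3 3CC", 400000),  # postcode missing from the lookup
]


def mean_price_by_ward(sales, lookup):
    """Aggregate postcode-level prices to ward level, skipping unmapped postcodes."""
    prices = defaultdict(list)
    for postcode, price in sales:
        ward = lookup.get(postcode)
        if ward is not None:
            prices[ward].append(price)
    return {ward: mean(values) for ward, values in prices.items()}


ward_prices = mean_price_by_ward(sales, postcode_to_ward)
print(ward_prices)
```

Aggregating upwards loses detail within each ward, and postcodes that are missing from the lookup are dropped, so the choice of granularity affects both the precision and the coverage of the result.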