Data scientists are employed to obtain insights from data. This typically involves applying analysis, learning or visualization techniques to existing data sets. However, the most suitable data for a task may not be directly available, giving rise to the need for data preparation. Data preparation is sometimes referred to as data wrangling.
An example of data preparation for real estate data
Assume that a real estate agency wants to analyze pricing trends in its area. One approach could be to analyze data on properties it has sold. This data may be available in a spreadsheet, but manual data entry may have led to inconsistencies in how data is represented, and some relevant data may be missing. Consider the following fragment of data held within the real estate agency:
In this example, we can see that the town information is sometimes missing, and that street names are sometimes also associated with the part of the town in which the property is found. Any analysis that, for example, clusters the data by street or town is likely to produce unreliable results. As such, individual data sets may require some measure of data preparation before they are suitable for analysis.
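As a rough illustration of this kind of clean-up, the sketch below separates a district qualifier from the street name and fills in a missing town from other rows. The rows, the column names (`street`, `town`, `postcode`) and the rule of borrowing the town from records that share a postcode are all assumptions made for the example; they are not taken from the agency's data.

```python
# Hypothetical rows: values and column names are invented for illustration.
sales = [
    {"street": "Oak Road", "town": "Milltown", "postcode": "M1 1AA", "price": 250000},
    {"street": "Oak Road, Northside", "town": None, "postcode": "M1 1AA", "price": 245000},
    {"street": "  elm street", "town": "Milltown", "postcode": "M2 2BB", "price": 310000},
]


def clean_street(raw):
    """Return (street, district); district is None when no qualifier follows a comma."""
    street, _, district = raw.partition(",")
    street = " ".join(street.split()).title()  # collapse whitespace, unify casing
    district = district.strip() or None
    return street, district


# Assumed repair rule: a postcode maps to one town, so a known town can be
# reused for rows with the same postcode where the town is missing.
town_by_postcode = {row["postcode"]: row["town"] for row in sales if row["town"]}

cleaned = []
for row in sales:
    street, district = clean_street(row["street"])
    town = row["town"] or town_by_postcode.get(row["postcode"])
    cleaned.append({**row, "street": street, "district": district, "town": town})

for row in cleaned:
    print(row["street"], "|", row["district"], "|", row["town"])
```

After this step, both Oak Road rows share the same street value, so grouping by street or town no longer splits them. Rows whose town cannot be inferred keep `None` and can be reviewed by hand.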
Data integration in the real estate example
In addition, many analyses stand to benefit from bringing different data sets together. In our example, what if the purpose of the analysis is to identify factors that influence the pricing trends? Factors such as local crime or income levels may be available in open government data sets, with which the real estate data can be combined. Joining the existing real estate data with open government data augments the existing property sales records with new information. For example, we may have access to data about social features of neighbourhoods:
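A join of this kind could be sketched as follows. The shared key (`postcode_district`), the neighbourhood attributes (`crime_rank`, `median_income`) and every value are hypothetical, chosen only to show the mechanics of a left join that keeps every sale.

```python
# Hypothetical data: keys, column names and values are invented for illustration.
sales = [
    {"street": "Oak Road", "postcode_district": "M1", "price": 250000},
    {"street": "Elm Street", "postcode_district": "M2", "price": 310000},
    {"street": "Ash Lane", "postcode_district": "M9", "price": 180000},
]
neighbourhoods = [
    {"postcode_district": "M1", "crime_rank": 12, "median_income": 31000},
    {"postcode_district": "M2", "crime_rank": 40, "median_income": 42000},
]

# Index the open data by the join key for constant-time lookup.
by_district = {n["postcode_district"]: n for n in neighbourhoods}
EMPTY = {"crime_rank": None, "median_income": None}

# Left join: every sale is kept; missing neighbourhood data becomes None.
augmented = []
for sale in sales:
    extra = by_district.get(sale["postcode_district"], EMPTY)
    augmented.append(
        {
            **sale,
            "crime_rank": extra["crime_rank"],
            "median_income": extra["median_income"],
        }
    )

# Sales without a matching neighbourhood record are worth inspecting.
unmatched = [s["street"] for s in augmented if s["crime_rank"] is None]
print(unmatched)
```

Keeping unmatched sales, rather than dropping them as an inner join would, makes gaps in the external data visible instead of silently shrinking the data set.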
The data preparation process described above may provide the types of information needed for the analysis. However, relying only on the agency’s own data means that only a fraction of the relevant sales may be considered. Additional information about the properties for sale may be available from the websites of competitor agencies, as illustrated below:
Combining such external data sets provides additional sales data, but once again there are representational inconsistencies. For example, in this data set we can see that the house number, street name and town are stored together in a single Address column, whereas they are separate in Property Sales above. These inconsistencies will need to be resolved to enable analysis to be carried out in a dependable manner.
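One way to resolve this particular inconsistency is to split the combined address into separate fields. The sketch below assumes addresses shaped like "number street, town"; that format, the regular expression and the sample rows are assumptions for illustration, and real addresses will need more cases than this handles.

```python
import re

# Assumed format: "<house number> <street>, <town>", e.g. "12 Oak Road, Milltown".
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<number>\d+[A-Za-z]?)\s+(?P<street>[^,]+?)\s*,\s*(?P<town>.+?)\s*$"
)


def split_address(address):
    """Return a dict with number, street and town, or None if the format differs."""
    match = ADDRESS_PATTERN.match(address)
    if match is None:
        return None
    return match.groupdict()


# Hypothetical competitor rows; values are invented for illustration.
competitor_sales = [
    {"address": "12 Oak Road, Milltown", "price": 255000},
    {"address": "4b Elm Street , Milltown ", "price": 305000},
    {"address": "Rose Cottage, Milltown", "price": 410000},
]

aligned, needs_review = [], []
for row in competitor_sales:
    parts = split_address(row["address"])
    if parts is None:
        needs_review.append(row)  # do not guess: leave for manual inspection
    else:
        aligned.append({**parts, "price": row["price"]})

print(aligned)
print(needs_review)
```

Addresses that do not fit the assumed pattern, such as a named house without a number, are set aside rather than forced into the wrong columns.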
These different steps, involving cleaning individual data sets and/or integrating data sets, are typical of data preparation tasks. To give an existing definition, data wrangling “is the process of transforming and mapping data from one raw data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics”.
Additional information on why data preparation is a big deal is provided in the next post in this series.