A possible process
Data preparation preprocesses data for analysis. There is no widely accepted data preparation process, and where processes are discussed, this is often done in a rather abstract way (as here). The documentation for tools sometimes suggests a process, but tools tend not to impose a specific way of working. Indeed, the process presented here could be applied using a variety of data preparation platforms. Here we present an outline of an iterative process that could be followed during data preparation.
The steps in the process
This process involves the following steps:
- Design/Refine Target: It is necessary to have some outcome in mind, even if this evolves over time. A target table definition pins down the intended result of the data preparation process, and informs the subsequent steps as they seek to populate this target. Many data preparation tools are bottom-up (i.e. they work forward from the sources) and do not require an explicit target. Nevertheless, the data scientist/engineer must at least have some idea of the intended outcome.
- Discover Sources: Data sources need to be identified that can be used to populate the target. Key sources may already be known and available, but different sources will have different roles. For example, some sources may provide the type of data to be analyzed (e.g., properties, companies, suppliers), whereas other sources may be able to augment or validate such data (e.g., address lists, company registers, product catalogs).
- Select Sources: In the proposed methodology, a subset of the potential sources should be selected for further investigation, for example on the basis of profiling results. Experience with these sources may inform subsequent iterations.
- Repair Sources: The selected sources may have quality problems that are best dealt with at the source level. Deferring data cleaning to the populated target may replace several simpler steps with a single challenging one. For example, format transformation is easier when fewer different formats are present at the same time. In addition, combining the sources may be easier once they have been cleaned, for example because the join columns are more consistent.
- Integrate Sources: Several sources may need to be combined to populate the target. This step may identify several different ways of populating the target. A choice among these can be made at later stages on the basis of their quality or relevance.
- Repair Result: The identified ways of populating the target can be reviewed on the basis of their quality, to identify the most promising current result. This result may be passed on for downstream analysis, or issues with it may inform subsequent iterations.
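As a minimal sketch of how the Repair Sources, Integrate Sources and Repair Result steps interact, the following Python fragment joins two small sources with and without repairing the join column first, then compares the two candidate results by a simple completeness score. The data, the column names (`postcode`, `price`, `street`) and the helper functions are all hypothetical and chosen only for illustration; a real project would typically use a data preparation platform rather than hand-written code.

```python
# Hypothetical sources: property listings, plus an address list used to augment them.
listings = [
    {"postcode": "m1 1ae", "price": 250000},
    {"postcode": "M2  5BQ", "price": 310000},
    {"postcode": "M3 3EB ", "price": 180000},
]
addresses = [
    {"postcode": "M1 1AE", "street": "Station Road"},
    {"postcode": "M2 5BQ", "street": "King Street"},
]


def repair_postcode(value):
    """Repair at source level: normalize case and whitespace in the join column."""
    return " ".join(value.upper().split())


def repair(rows):
    """Return copies of the rows with a repaired 'postcode' column."""
    return [{**row, "postcode": repair_postcode(row["postcode"])} for row in rows]


def integrate(left, right, key):
    """Left join on `key`; rows without a match get street=None."""
    lookup = {row[key]: row for row in right}
    return [{**row, "street": lookup.get(row[key], {}).get("street")} for row in left]


def completeness(rows, column):
    """Fraction of rows that have a non-missing value in `column`."""
    if not rows:
        return 0.0
    return sum(row[column] is not None for row in rows) / len(rows)


# Two candidate ways of populating the target (assumed target: postcode, price, street).
candidates = {
    "raw": integrate(listings, addresses, "postcode"),
    "repaired": integrate(repair(listings), repair(addresses), "postcode"),
}

# Review the candidates on quality and keep the most promising one.
scores = {name: completeness(rows, "street") for name, rows in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy example the unrepaired join matches no rows because the postcodes differ in case and spacing, whereas repairing each source first lets two of the three listings find a street. The one listing that remains unmatched reflects missing coverage in the address list, which is the kind of issue that may send a later iteration back to source discovery.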
Although this methodology is not detailed, it likely manifests features that will often be important in practice. It seems that iteration will be necessary; the best selection of sources likely depends on quality details that will only become apparent later. Furthermore, combining data may reveal features that were less than obvious before. For example, the coverage or consistency of different data sets may not be apparent when each is viewed in isolation. Increased understanding of different sources may lead to previously missed or passed-over sources being revisited. In addition, there may be limits on the time available for data preparation, and incrementally improving results reduces the risk of missed deadlines and unmanaged expectations.
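The points about iteration and coverage can be illustrated with a small sketch. The Python fragment below assumes that each source can be summarized by the set of entity keys it describes, and that the number of iterations is bounded by a budget. Each iteration greedily adds the source that contributes the most new target entities and records the coverage reached so far. The source names, the keys and the greedy strategy are illustrative assumptions, not part of the process described above.

```python
def coverage(target_keys, covered_keys):
    """Fraction of target entities described by the keys covered so far."""
    if not target_keys:
        return 0.0
    return len(target_keys & covered_keys) / len(target_keys)


def prepare(target_keys, sources, max_iterations):
    """Greedy, budget-limited source selection (an illustrative strategy only).

    Each iteration adds the remaining source that contributes the most new
    target keys, and records the coverage reached so far.
    """
    remaining = dict(sources)
    covered = set()
    history = []
    for _ in range(max_iterations):
        if not remaining:
            break
        # The gain of a source depends on what the earlier selections already cover.
        name = max(remaining, key=lambda n: len((remaining[n] & target_keys) - covered))
        gain = (remaining.pop(name) & target_keys) - covered
        if not gain:
            break  # no remaining source adds anything new
        covered |= gain
        history.append((name, coverage(target_keys, covered)))
    return history


# Hypothetical entities the target should describe, and the keys in each source.
target_keys = {"A", "B", "C", "D", "E"}
sources = {
    "register": {"A", "B"},
    "catalog": {"B", "C", "D"},
    "web_extract": {"D", "E", "X"},
}

for budget in (1, 2, 3):
    print(budget, prepare(target_keys, sources, budget))
```

Viewed in isolation, `register` and `web_extract` each cover two of the five target entities, but once `catalog` has been selected each contributes only one new entity. Every iteration leaves a usable intermediate result, so a shorter budget yields lower coverage rather than no result at all.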