Approaches to Data Preparation
A variety of tools and techniques support data preparation. At the top level, there are broadly three options:
- Hand-craft it. For small-scale and stable problems, it may be practical to manually edit the data into a suitable format. Such a hands-on approach avoids the need to gain experience with specialised tools or techniques. However, it is unlikely to be cost-effective, or even realistic, in larger or rapidly changing applications.
- Program it. Preparing data can often be seen as a programming task. In this approach, code is written that captures the decisions in each of the steps in a data preparation process. Several programming languages, including those that are often used for analysis, include libraries that can be useful for data wrangling.
- Use a tool. There is a substantial market for tools that support data preparation, providing wrangling components, ways of using them together, and interactive development of data preparation tasks.
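To make the "Program it" option concrete, the following is a minimal sketch in Python, using only the standard library. The input data, the column names and the `prepare` function are hypothetical and chosen purely for illustration. Each commented step records one preparation decision in code.

```python
import csv
import io

# Hypothetical raw input: inconsistent headers, stray whitespace, a duplicate row.
RAW = """Name , Price,City
 Widget ,4.50,Leeds
Gadget, 12 ,  York
 Widget ,4.50,Leeds
"""

def prepare(raw_text):
    """Return cleaned rows as a list of dicts, one decision per step."""
    reader = csv.reader(io.StringIO(raw_text))
    # Step 1: normalise header names (trim, lower-case).
    header = [h.strip().lower() for h in next(reader)]
    cleaned, seen = [], set()
    for values in reader:
        if not values:
            continue  # skip blank lines
        # Step 2: trim whitespace from every cell.
        row = dict(zip(header, (v.strip() for v in values)))
        # Step 3: convert the (assumed) price column to a number.
        row["price"] = float(row["price"])
        # Step 4: drop exact duplicates, keeping the first occurrence.
        key = tuple(row[h] for h in header)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

rows = prepare(RAW)
for row in rows:
    print(row)
```

Because every decision is written down as code, the same steps can be re-run whenever the source changes, which is what hand-crafting lacks. In practice a dataframe library would often replace the explicit loop.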
There are three main approaches to tool support: workflow-based, dataset-based and automation-based.
Both workflow-based and dataset-based approaches treat data preparation as a visual programming task.
- In workflow-based tools, there is typically a large library of components, for example for reading from sources, reformatting columns, or combining sources. These components are placed on a canvas, and dependencies are specified by linking the components. Workflow-based tools are often descended from Extract-Transform-Load (ETL) platforms that were originally developed for populating data warehouses.
- In dataset-based tools, the main visualisation is of a single dataset, typically represented as a table, as in a spreadsheet. However, in contrast with a spreadsheet, the principal operations reformat or combine columns rather than express calculations. Dataset-based tools tend to be more recent, and are designed to support less technical users with self-service data preparation.
It is becoming common for products to combine the workflow-based and dataset-based approaches. Where this is the case, the dataset-based approach is used to manipulate a source or an intermediate result, and these datasets are then brought together as described in a workflow.
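The combination of the two approaches can be illustrated with a small sketch. It does not reflect the design of any particular product. The component names, the sample data and the `WORKFLOW` structure are all hypothetical, and the code assumes Python 3.9 or later for `graphlib`. Nodes stand for components on the canvas, the dependency lists stand for the links between them, and `tidy_names` plays the role of a dataset-style step applied to a single table.

```python
from graphlib import TopologicalSorter

# Components: each takes the outputs of its upstream nodes and returns a table
# (here a list of dicts). All names and data are hypothetical.
def read_customers():
    return [{"id": 1, "name": " ada "}, {"id": 2, "name": "Grace"}]

def read_orders():
    return [{"customer_id": 1, "total": 30}, {"customer_id": 2, "total": 45}]

def tidy_names(customers):
    # Dataset-style step: reformat one column of a single table.
    return [{**c, "name": c["name"].strip().title()} for c in customers]

def join(customers, orders):
    # Workflow-style step: combine two sources on the customer key.
    by_id = {c["id"]: c for c in customers}
    return [{"name": by_id[o["customer_id"]]["name"], "total": o["total"]}
            for o in orders if o["customer_id"] in by_id]

# The "canvas": node name -> (component, upstream node names in argument order).
WORKFLOW = {
    "customers": (read_customers, []),
    "orders": (read_orders, []),
    "tidy": (tidy_names, ["customers"]),
    "joined": (join, ["tidy", "orders"]),
}

def run(workflow):
    """Execute each node after its dependencies and return all results."""
    order = TopologicalSorter({n: deps for n, (_, deps) in workflow.items()})
    results = {}
    for node in order.static_order():
        component, deps = workflow[node]
        results[node] = component(*(results[d] for d in deps))
    return results

results = run(WORKFLOW)
print(results["joined"])
```

Keeping the dependency structure as data, separate from the components, is what lets a visual tool draw and edit the workflow. The topological sort only ensures that each component runs after the components it depends on.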
Because workflow-based and dataset-based tools are labour-intensive, recently proposed automation-based tools seek to have the system take more responsibility for decision-making, with evidence, guidance and feedback provided by users. Our DataPreparer system is in this category.