Data preparation is not new. Although data preparation platforms differ in how data preparation activities are specified, there is fairly broad agreement as to what steps are required. The following are representative data preparation steps:
Source discovery is the identification of data sources that may be relevant to a problem. The proliferation of data sources, both internal and public, has encouraged exploration of techniques for searching for data sets. These range from keyword search to the search for data sets that are similar to a given collection. Catalog products are predicted to exceed the already significant growth rate of other data preparation tools.
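Both flavours of search above can be sketched over a toy catalog: keyword search scores data sets by term overlap with a query, and collection similarity scores them by Jaccard similarity of column names. The catalog contents and scoring scheme are illustrative assumptions, not any particular product's behaviour.

```python
# Keyword search and similarity search over a toy data set catalog.
# The catalog and scoring are assumptions made for illustration only.

def keyword_search(catalog, query_terms):
    """Rank data sets by how many query terms appear in their metadata."""
    results = []
    for name, keywords in catalog.items():
        score = len(set(query_terms) & set(keywords))
        if score > 0:
            results.append((name, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))

def similar_datasets(catalog, target_columns):
    """Rank data sets by Jaccard similarity of column names to a target."""
    target = set(target_columns)
    scored = []
    for name, columns in catalog.items():
        cols = set(columns)
        jaccard = len(target & cols) / len(target | cols)
        scored.append((name, round(jaccard, 3)))
    return sorted(scored, key=lambda r: (-r[1], r[0]))

catalog = {
    "customers_2023": ["customer", "name", "email", "city"],
    "web_logs": ["timestamp", "url", "customer"],
    "products": ["sku", "name", "price"],
}
print(keyword_search(catalog, ["customer", "city"]))
```

Real catalog tools index much richer metadata (ownership, lineage, usage), but the ranking idea is the same.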
Data extraction is the process of obtaining structured data from diverse types of data source. Tools and techniques for data preparation tend to act over existing structured or semi-structured data sets. To accommodate this, tools typically provide libraries for accessing different file formats, database systems and APIs. However, access to sources such as web tables and the deep web may require specific data extraction tools for authoring or inferring rules for extracting web data, and several vendors provide such offerings. For example, web data extraction tools may support languages for accessing data within the HTML structure of web pages.
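As a minimal sketch of web table extraction, the standard library's `html.parser` can turn the cells of an HTML table into structured rows; commercial extraction tools support far richer rule languages than this, and the sample page below is invented for the example.

```python
# Extracting structured rows from an HTML table using only the standard
# library; a hedged sketch, not a production web data extraction tool.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

page = """<table>
<tr><th>city</th><th>population</th></tr>
<tr><td>Manchester</td><td>547000</td></tr>
</table>"""
extractor = TableExtractor()
extractor.feed(page)
print(extractor.rows)
```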
Data profiling involves deriving properties of data sets that may be relevant to their interpretation and use. For example, profiling on individual columns can identify those that uniquely identify rows (candidate keys), the fraction of missing values, the number of distinct values, syntactic or numeric outliers, and distributions of values. Profiling can also identify relationships between columns. For example, an inclusion dependency between a candidate key and a column in another table may represent a join opportunity.
Format transformation involves changing the representation of attribute values to improve consistency. Inconsistent data entry, or the combining of data from different sources, often leads to different representations for the same types of value. For example, dates may be formatted in different orders or with different separators (dd-mm-yy, mm/dd/yy), and names may be provided with or without titles, with or without middle initials, etc. Reformatting to improve consistency is often a prerequisite for other data preparation steps and for analyses. Reformatting requirements are often written as transformation rules, for example based on regular expressions, and may be authored manually or inferred from examples.
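A transformation rule of the kind described above can be written as a regular expression; the example below rewrites dd-mm-yy dates into mm/dd/yy form and is purely illustrative, not any particular tool's rule language.

```python
# A regular-expression transformation rule: reformat dd-mm-yy dates as
# mm/dd/yy. Illustrative only; real tools offer richer rule languages.
import re

DDMMYY = re.compile(r"\b(\d{2})-(\d{2})-(\d{2})\b")

def to_mmddyy(text):
    # Swap the day and month capture groups and change the separator.
    return DDMMYY.sub(r"\2/\1/\3", text)

print(to_mmddyy("Order placed 25-12-23, shipped 02-01-24"))
# Order placed 12/25/23, shipped 01/02/24
```

Tools that infer such rules from examples would synthesise an equivalent expression from a few input/output pairs supplied by the user.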
Data cleaning involves identifying and resolving quality problems. Two categories of quality problem have headings of their own in this list – format transformation and deduplication. In general, data quality handling has two aspects: (i) identifying quality problems; and (ii) resolving them. To identify data quality problems, a data set or candidate result can be evaluated in relation to expected patterns. Expected patterns may be in the form of database-style constraints (e.g., expected value ranges, dependencies) or business rules. When a problem is detected, this can either be brought to the attention of the user, or actions may be specified to correct the problem (e.g., by selecting the data from one source over another when there is an inconsistency).
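The two aspects above – detection against expected patterns, then resolution – can be sketched with value-range constraints expressed as predicates; the rule format and the repair action (nulling the bad value) are assumptions for illustration.

```python
# Constraint-based cleaning: detect violations of expected value ranges,
# then apply a simple repair action. The rule format is an assumption.
def check(rows, constraints):
    """Return (row_index, column, message) for each violated constraint."""
    violations = []
    for i, row in enumerate(rows):
        for col, predicate, message in constraints:
            if not predicate(row.get(col)):
                violations.append((i, col, message))
    return violations

constraints = [
    ("age", lambda v: v is not None and 0 <= v <= 130, "age out of range"),
]
rows = [{"name": "Ann", "age": 34}, {"name": "Bo", "age": -1}]
problems = check(rows, constraints)
# Resolution: here the violating value is nulled; alternatively the
# violations could simply be reported to the user.
for i, col, _ in problems:
    rows[i][col] = None
print(problems, rows)
```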
Matching involves identifying equivalent columns between sources, based on the similarity of metadata or values. Matches can be identified using algorithms, but there are often several plausible alternative ways of matching data sets, so the results of algorithms may need to be refined manually.
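An algorithmic matcher of the kind described can combine metadata and value evidence; the sketch below scores column pairs by name similarity (via `difflib`) and value overlap. The weights and threshold are assumptions, and in practice the candidate matches it returns would be refined manually.

```python
# An illustrative schema matcher: score column pairs by a weighted mix of
# name similarity and value overlap. Weights and threshold are assumptions.
from difflib import SequenceMatcher

def match_columns(table_a, table_b, threshold=0.5):
    matches = []
    for name_a, values_a in table_a.items():
        for name_b, values_b in table_b.items():
            name_sim = SequenceMatcher(None, name_a.lower(),
                                       name_b.lower()).ratio()
            va, vb = set(values_a), set(values_b)
            value_sim = len(va & vb) / len(va | vb) if va | vb else 0.0
            score = 0.5 * name_sim + 0.5 * value_sim
            if score >= threshold:
                matches.append((name_a, name_b, round(score, 2)))
    return matches

a = {"cust_name": ["ann", "bo"], "city": ["leeds", "york"]}
b = {"customer": ["ann", "bo", "cy"], "town": ["leeds", "hull"]}
print(match_columns(a, b))
```

Note that `city`/`town` is a plausible match a human would accept but this scorer rejects – exactly the kind of case where manual refinement is needed.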
Mapping involves identifying how to merge sources to populate a target. A target schema is the structure that is required for downstream analysis or exploration. There may be many ways in which source tables can be combined to populate a target. A mapping is an executable expression that combines sources, typically by unioning or joining them.
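A mapping as an executable expression can be sketched directly: below, two source tables are unioned and the result joined with a third to populate a target schema. The schemas are invented for the example.

```python
# Mappings as executable expressions: union two sources, then join with a
# third to populate a target (sku, qty, name). Schemas are illustrative.
def union(*tables):
    rows = []
    for table in tables:
        rows.extend(table)
    return rows

def join(left, right, key):
    # Index the right-hand table on the join key, then merge each
    # matching pair of rows into a single target row.
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

uk_sales = [{"sku": "a1", "qty": 2}]
us_sales = [{"sku": "b2", "qty": 5}]
products = [{"sku": "a1", "name": "pen"}, {"sku": "b2", "name": "ink"}]

target = join(union(uk_sales, us_sales), products, key="sku")
print(target)
```

The point of the example is that the same target could be populated by other combinations of these operators – choosing among the alternatives is the mapping-selection problem.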
Deduplication involves the removal of different representations of the same real world object. There may be many records in a single data set, or from different data sets to be integrated, that provide information about the same real world object. Identifying such potentially redundant data, and merging the duplicated records, is known as entity resolution or deduplication. Deduplication may involve several stages: blocking – fast but approximate comparison across large data sets; pairwise comparison – careful but slower comparison of candidate duplicates; and clustering – grouping the candidate duplicates together. Deduplication may involve user-specified comparison operations, or comparison criteria may be inferred from training data. Deduplication is often complicated by the presence of hard-to-specify parameters and thresholds.
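The three stages above can be sketched end to end: a cheap blocking key limits comparisons, `difflib` performs the pairwise comparison within each block, and union-find groups the candidate duplicates. The blocking key, similarity measure and threshold here are illustrative assumptions – and the threshold itself exemplifies the hard-to-specify parameters just mentioned.

```python
# Deduplication in three stages: blocking, pairwise comparison, and
# clustering via union-find. Key, measure and threshold are assumptions.
from difflib import SequenceMatcher
from itertools import combinations

def dedupe(records, threshold=0.8):
    # Blocking: only records sharing a cheap key (first letter of the
    # name) are ever compared.
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(rec["name"][:1].lower(), []).append(i)

    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    # Pairwise comparison within each block; clustering merges any pair
    # whose similarity clears the threshold.
    for block in blocks.values():
        for i, j in combinations(block, 2):
            sim = SequenceMatcher(None, records[i]["name"].lower(),
                                  records[j]["name"].lower()).ratio()
            if sim >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

records = [{"name": "John Smith"}, {"name": "Jon Smith"}, {"name": "Ann Lee"}]
print(dedupe(records))
```

A learned deduplicator would replace the hand-set threshold and similarity measure with criteria inferred from labelled duplicate/non-duplicate pairs.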
Although these steps are not always all present, data preparation is typically a multi-faceted problem. Defining a data preparation task thus often requires knowledge of the application as well as of the technology being used to prepare the data. Furthermore, as the individual steps tend to involve detailed user involvement, data preparation tends to be costly.