Data preparation is the process of discovering, cleaning and integrating data sets for analysis. Although the cleaning that may be required varies from application to application, the need to address formatting inconsistencies is certainly a common requirement. For example, names, addresses, product descriptions, etc, are prone to inconsistencies.
Inconsistencies are not hard to find. To look for examples, my first thought was to type the name of an author into Amazon, and see what emerges. Searching for John le Carre gave as the first result a book with the title Agent Running in the Field: A Novel. However, “A Novel” is not part of the title of the book, and not all novels have this suffix. Indeed, Amazon titles are liberally augmented with information that may be helpful for human readers, but problematic for computational processing. Sometimes the same title has multiple additional descriptions. For example, The Spy Who Came in from the Cold by the same author is listed as The Spy Who Came in from the Cold: A George Smiley Novel (George Smiley Novels Book 3). Any data scientists working with such data have their work cut out!
Such inconsistencies are problematic both for other data preparation steps and for analytics. To consider data preparation, it might be useful to combine two book data sources where they have common titles, but this is likely to require an exact match on the title. Furthermore, the detection or removal of duplicates might place considerable weight on title comparisons, but common similarity measures are undermined by formatting inconsistencies. The Jaccard similarity of the two variants of the The Spy Who Came in from the Cold is less than 0.5. Such metrics are also potentially relevant for analytics; for example, the distance function of a clustering algorithm could certainly depend on Jaccard similarities. As a result, resolving formatting inconsistencies is a prerequisite for a variety of other tasks.
Resolving such seemingly ad hoc inconsistencies in a systematic manner is unlikely to be straightforward. However, formatting inconsistencies often follow recurring patterns that can be identified and resolved in software. Format transformations can be written manually or inferred from examples. Here are five common types of inconsistency:
- Format reordering. Example: reorder the components of person names; Beechey, Sir William →Sir William Beechey (Artists in the Tate, Name column).
- Change of component representation. Example: represent first names by their initials to normalize the representation of values; Dan Brown → D. Brown (Book Ratings, Authors column).
- Component extraction. Example: extract the active ingredient name from drug records; IPECAC (300000347) → IPECAC (Active Pharmaceutical Ingredients, Active Substance column).
- Component insertion. Example: add “Mailto:” prefix to email addresses to create hyperlinks; firstname.lastname@example.org → Mailto:email@example.com (Manufacturing Contracts, Contact Email column).
- Punctuation and/or case alterations. Example: add quotation marks to quotes; Whatever you are, be a good one. → “Whatever you are, be a good one.” (Quotations, Quote column).
Expressing such transformations in a generic way, with a view to applying them to an entire collection, involves identifying source and target patterns, with a mechanism for transferring values between the source and the target. This is rarely straightforward, since a collection of source values (e.g., a table column that requires format normalization), can contain many different patterns, hence multiple transformation types from the table are required. Moreover, the editing from a source value to a desired target value may involve a combination of the above transformation types, making the process even more complex.
Format transformations are recognised to be difficult to get right, and different approaches are available and in use, including:
- Hand-crafting transformation programs. Format transformations can be written using general purpose programming languages, typically using string processing libraries, for example for defining patterns using regular expressions. Furthermore, domain specific languages have been designed especially for transforming strings. Such languages are provided with some data preparation products. Hand-crafting provides developers with significant control, but defining rules that apply to all available cases is a tricky business, in particular for large data sets.
- Synthesizing transformations from custom examples. The effort required to hand-craft format transformation programs has motivated the application of programming by example to format transformation. A prominent example is shipped with Microsoft Excel. Given a spreadsheet, a user can create a new column, and provide example values that should be obtained by applying a format transformation to selected values from an existing column. The system will then seek to generate a program that can complete the population of the new column. If the result is not satisfactory, additional examples can be provided. This approach is great when it works, but providing good and representative examples is not necessarily trivial, and thus the approach is most likely to be effective with small and reasonably regular data sets.
- Synthesizing transformations from existing examples. There are circumstances in which comprehensive collections of examples already exist. For the data about Artists in the Tate from the table above, we may have access to a separate list of artist names from a catalog that represents the intended format for the result of the transformation. In this case, the existing examples can be aligned with source data, and rules synthesised to transform from the source representation in the catalog. This approach can be applied to large sources, as there is no need for the user to scan the source looking for cases. However, the off-the-shelf examples need to cover the different cases in the source, and thus this approach is likely to be most effective with fairly regular sources.
Each approach from above merits a detailed discussion. Therefore, future blogs from this series will discuss each of them in turn.
This post was jointly authored with Alex Bogatu.