Format Transformation – writing rules

Format Transformation – writing rules

Format transformation is the process of carrying out changes to the representation of data values, such as values of a table column, with a view to reducing some of the existing inconsistencies. In this blog, we set out to show one approach of how format transformation is being carried out in practice. As an example, consider the scenario exemplified below (the full data set can be checked here), where the values for Name have to be formatted from their existing structure: last_name, first_names to a new format, say first_names last_name

Adams, Robert → Robert Adams

Blais, Jean-Charles → Jean-Charles Blais

Churchill, Sir Winston → Sir Winston Churchill

Cooper, Richard, Junior → Richard Cooper Junior

Such a task is commonly performed programmatically by writing rules that either rely on some programming libraries, e.g. string manipulation libraries, or use a specialized language, e.g., regular expressions. In this process, often performed by data scientists, the following steps are carried out:

  1. Become familiar with the data and the patterns present on the Name column. 
  2. Create a program/script that extracts each component, e.g., first_name, last_name, and concatenates them in the appropriate order. Another option is to split the Name column by comma and concatenate the resulting two columns in inverted order. 
  3. Validate the result by checking the correctness of resulting values for some rows, especially for ones that represent special cases, e.g., the last row in the table above.

Note that, in performing Step 1 above, the data scientist has to identify the major patterns that inform the subsequent steps. This rapidly becomes infeasible if the dataset exceeds a few hundred rows. Step 2 requires knowledge of string manipulation techniques, e.g., based on regular expressions, and knowledge of a language in which transformative rules can be written and executed. This makes format transformation accessible only to technical users. Furthermore, such a program has to be created every time a different column is transformed. Finally, in Step 3, another, potentially time-expensive, exploratory analysis is required to identify corner cases that have not been caught by the previous rules. Such challenges are widespread in data preparation, and dealing with them often delays or even hinders successful data analysis. 

Format Transformation Rules in Python

At its core, format transformation relies on transformation rules that are either written directly by a programmer in some programming language, or built through the interaction of the user with a graphical interface provided by a specialized data-preparation tool. The code snippet below is a Python program that, given the values of Name, uses regular expressions to obtain the values of New Name.

  1. import re
  2. def transformation_rule(pattern, in_value):
  3.     matches = pattern.findall(in_value)
  4.     if matches is not None and len(matches) > 1:
  5.     return matches[1][1] + ” ” + matches[0][1] + ” ” + “”.join([t[1] for t in matches[min(2, len(matches)):]])
  1.     return in_value
  2. in_values = [‘Adam, Robert’, ‘Blais, Jean-Charles’, ‘Churchill, Sir 8. Winston’, ‘Cooper, Richard, Junior’]
  3. pattern = re.compile(r'(,\s?)?([\w\-\s]+)’)
  4. transformed = [transformation_rule(pattern, value) for value in in_values]
  5. print(transformed)

Result: [‘Robert Adam ‘, ‘Jean-Charles Blais ‘, ‘Sir Winston Churchill ‘, ‘Richard Cooper Junior’]

The program, through the transformation_rule function, transforms each value that matches the pattern to its corresponding output. The pattern, defined at line 9, is a regular expression that is consistent with any value consisting of one or more words separated by comma and/or space and/or hyphen.  Identifying such a regular expression pattern or a set of such patterns that cover all input values is infeasible when there are many values. Furthermore, automatically inferring one or more such regular expressions from a given collection of matching strings is a non-trivial computing problem.

Format Transformation Rules in Trifacta

An alternative to writing transformation rules in the form of programs or scripts is to use a data wrangling tool such as Wrangler, commercialized by Trifacta. Most of the advantages offered by such tools over the more conventional programmatic transformations lies with Steps 1 and 3 above. For instance, wrangling tools offer data preview features that allow the identification of existing representation patterns automatically, and flagging of problematic strings that cannot be transformed by the current transformations. With respect to Step 2 above, some data wrangling tools offer domain specific transformation languages that are more grounded into natural language, or even go as far as to provide GUI representations of common transformation functions, therefore, making format transformation accessible to less technical users. Despite their simpler transformation language though, in performing data wrangling using specialized tools, Step 2 from above can still become infeasible quickly. This is because, even if sometimes the system is capable of providing suggestions regarding potentially suitable transformations, the user still has to know how to express the transformations using the underlying language (or what combination of GUI representations achieves the task).

As an example, consider the transformation script specific to Data Wrangler below.

splitpatterns col: Name on: /,\s/ ignoreCase: true

replacepatterns col: Name1 with: ” on: `{start}”|”{end}` global: true

replacepatterns col: Name2 with: ” on: `{start}”|”{end}` global: true

merge col: Name2,Name1 with: ‘ ‘ as: ‘Full Name’

The script splits the Name column into two sub-columns: Name1 and Name2, removes the quotes characters added when ingesting the data, and concatenates the values of the two sub-columns separated by space. Note that the script above will not correctly transform the last row of our example. So the user has to incrementally edit the script to include all the formats in the Name column. Note also that a more complex script, e.g., based on regular expressions that match most of the existing formats, can be created using data wrangling tools. In such a case the complexity of Step 2 above would be very similar to performing format transformations using Python.

It transpires that, although successful in speeding up format transformation, data preparation tools such as Data Wrangler do not eradicate the need for specialized knowledge of data and/or the language used in writing transformation rules. Subsequent posts of this series will discuss alternative approaches to format transformation that aim to facilitate transformations even in the absence of data and programming knowledge.

Summary

The traditional programmatic approach to data wrangling offers efficient and flexible transformation methods for experienced users familiar with both the data and the programming language used. As data wrangling has become a necessity for many analysis scenarios, specialized wrangling tools have been proposed, with the aim of enabling data wrangling by less-technical users. As shown in this blog, when using such tools, data wrangling tasks, such as format transformation, require less familiarity with the data, but similar levels of knowledge of the necessary transformations and of how to express them in the underlying language.

Subsequent blogs in this series will discuss proposed ideas, most of them actively researched, that aim to alleviate the user from having to be an experienced programmer in order to write transformation rules.