Data analysis is hard! Practitioners have testified time and again that extracting the much promised value out of data through analysis is no free lunch. There is a famous 80/20 rule of working with data that describes the state of affairs in the real-world quite well: data scientists tend to spend up to 80% of their time preparing the data and only 20% analysing it. Although preparation here includes everything from getting familiar with the data to cleaning, standardising and structuring unruly data to match the needs of an analysis or visualisation package, in this blog we focus on format transformations.
We have already covered how format transformation can be performed in practice, and one of the challenges that emerged from the past discussion was that transformation often requires at least some programming knowledge. In this blog we go a step further and explore two potential approaches to save the user from the need to write code.
FlashFill and transformation learning
Let’s start simple and note that the need for format transformation is older than the need for data wrangling. For instance, think Microsoft Excel, the swiss-army-knife of data standardisation and structuring, and consider the following example where the task is to extract the country name from the Participants column.
Surely, Excel users are not always experienced programmers and transforming the format representation of columns does not always involve writing complex VBA programs. Indeed, it doesn’t, or at least since Excel 2013. This is thanks to a feature called FlashFill: a special tool that analyzes the information users enter and automatically fills data when it identifies a pattern.
Performing the task from our example above in Excel using FlashFill is straightforward, as the tool can automatically learn the pattern transformation from the given examples and fill in the rest of the column as suggested in the picture above. These can be considered custom examples, as they are provided by the user.
Wrangler and predictive interaction
Getting back to the data analysis world, performing format transformations using some Excel features, be it a VBA script, is infeasible, since most of the data preparation tasks require the power and expressiveness of some programming libraries, e.g., in Python. We have seen before how Trifacta’s Wrangler aims to blend the ease of use of tools such as Excel and the power of programming languages such as Python, albeit still requiring some written code, at least until its predictive feature learns enough transformations. This predictive feature, much like FlashFill in Excel, intelligently assesses the data and synthesizes a transformation. As opposed to FlashFill though, this transformation stems from a combination of past user decisions and data formatting. Nevertheless, both approaches limit the need for written code for performing format transformations.
Here is an example of extracting the city name from a city-country pair in Trifacta’s Wrangler. For this relatively simple example the tool is able to synthesize the required transformation automatically. For more complex pattern transformations, the tool needs experience, in the form of previously used similar transformations, to be able to suggest a relevant one.
So there is life for format transformation tasks outside the, often inaccessible, world of programming. We have seen two examples of performing transformations only by providing custom examples. But looking at the need for such transformations from the perspective of data wrangling reveals one important shortcoming: the user has to be familiar with the data and able to identify different formatting patterns for which to provide examples: at least one, often more, examples per pattern to transform. This makes such tools suitable for spreadsheet-like scenarios where data tends to be of manageable sizes and the user can relatively quickly explore the data. For larger datasets, where there are more than a few transformation cases and spotting all these cases is hard, tools such as the ones discussed above come up short. A potential solution in such cases is Data Preparer, a data preparation tool that adopts a broader view on data wrangling and performs format transformation without assuming programming knowledge or high levels of data familiarity from the user. The next blog of this series will introduce Data Preparer and exemplify how format transformation can leverage existing data to discover the much needed examples.