The Data Value Factory has recently released its Data Preparer system. Data Preparer aims to significantly reduce the manual effort involved in developing programs for data integration and cleaning. What has influenced the approach and where has it come from?
The Data Value Factory is a spin-out company from The University of Manchester, where we have been investigating techniques for increasing automation within data integration for over 10 years.
It has long been recognized in the database research community that the manual construction of data preparation applications is problematic. For example, a vision paper from 2005 relating to dataspaces highlighted the high up-front costs of manual integration. These costs become increasingly problematic as ever more data becomes available, making it impractical to clean and integrate all this data manually in a systematic way.
This gave rise to the notion of pay-as-you-go data integration, in which an initial, algorithmically produced integration is incrementally refined to the level required for the task at hand. The following figure illustrates the iterative aspects of this approach.
In this approach, the sources are subject to automatic, best-effort integration to yield a preliminary integrated data set. This data set is then shown to the user, who provides feedback on the result, and that feedback is used to inform further automatic integration.
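The loop above can be sketched in a few lines of code. The sketch below is a toy illustration only, under loudly labeled assumptions: the functions `integrate`, `collect_feedback`, and `refine`, and the list-of-records data model, are hypothetical stand-ins, not Data Preparer's actual interfaces.

```python
# Toy sketch of the pay-as-you-go loop: automatic best-effort
# integration, then rounds of user feedback that refine the result.
# All names here are illustrative assumptions, not a real API.

def integrate(sources):
    """Automatic, best-effort integration: here, naively merge records."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def collect_feedback(result, rejected):
    """Stand-in for user feedback: flag records the 'user' rejects."""
    return [record for record in result if record in rejected]

def refine(result, feedback):
    """Use the feedback to improve the integration: drop flagged records."""
    return [record for record in result if record not in feedback]

def pay_as_you_go(sources, rejected, max_rounds=3):
    result = integrate(sources)                   # initial automatic integration
    for _ in range(max_rounds):
        feedback = collect_feedback(result, rejected)
        if not feedback:                          # good enough for the task at hand
            break
        result = refine(result, feedback)         # feedback informs re-integration
    return result

sources = [[("Ann", "NY"), ("Bob", "??")], [("Ann", "NY"), ("Cal", "LA")]]
bad_records = [("Bob", "??")]                     # records the 'user' rejects
print(pay_as_you_go(sources, bad_records))        # prints [('Ann', 'NY'), ('Ann', 'NY'), ('Cal', 'LA')]
```

In a real system the feedback would of course come from a person inspecting the integrated data, and refinement would re-run the integration algorithms with the feedback as evidence, rather than simply deleting records.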
Our original research in this area was funded by the UK Government through its Engineering and Physical Sciences Research Council (EPSRC). This gave rise to results on pay-as-you-go approaches to the selection and refinement of mappings between schemas, the identification of duplicate records, the identification of the most valuable feedback to obtain, and the sharing of feedback between users.
The subsequent emergence of big data, characterized in part by the V's of Variety (e.g., diverse data representations) and Veracity (i.e., variable quality and relevance), increased the need for cost-effective data preparation. This gave rise to new ways of discussing data preparation, such as data wrangling and self-service data preparation. These terms reflect a more interactive approach, in which more of the work is in the hands of domain experts. However, although innovative systems have been developed that describe themselves in these terms, they tend to leave users exercising fine-grained control over decision-making, at considerable cost in manual effort. There is usually no automated integration or cleaning of the kind proposed for dataspaces.
This pressing need motivated further research to produce comprehensive pay-as-you-go data wrangling solutions. Together with colleagues from the universities of Edinburgh and Oxford, we obtained support for research on Value Added Data Systems (VADA), again funded by the EPSRC.
Manchester led the work on data preparation architectures in VADA, which resulted in an end-to-end proposal, including automated techniques for resolving inconsistencies in the structure of different data sets and in the formatting of values within attributes.
The combination of the growing market for data preparation tools and a distinctive proposal motivated the setting up of a company to bring the Data Preparer approach to pay-as-you-go data preparation to market. We made contact with the university commercialization company, UMIP, following their Innovation Optimiser course in the fall of 2017, and won the Next Big Thing competition in spring 2018. Since then, we have been meeting potential customers with support from ICURe, and developing the product from the research prototype, as discussed in the next post.