Data Preparer – Research Prototype to Product


Research prototypes exist to provide proof-of-concept demonstrations, and to allow experiments to be carried out that evaluate both individual components and the complete system. However, they are often run only by the people who wrote them, and there is typically no expectation that third parties will ever use them in practice. As a result, evolving a research prototype into a product is likely to involve significant re-engineering.

In the case of Data Preparer, a key architectural feature is that its components are loosely coupled and share data through a knowledge base. This has been important both for the research and for the transition to product, as it makes it straightforward to replace specific components. Almost all the components in the architecture have evolved or been replaced at some point. These changes have been made to improve performance, bring on board new techniques, and/or avoid intellectual property issues.
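As a rough illustration of why this style of architecture makes components easy to replace, here is a minimal Python sketch. It is not Data Preparer's actual code or API: the `KnowledgeBase` class, the two matchers, the mapping generator and the key names are all hypothetical, chosen only to show components that communicate solely through a shared store.

```python
from typing import Any, Callable, Dict, List

class KnowledgeBase:
    """Hypothetical shared store: components only read and write named facts."""

    def __init__(self) -> None:
        self._facts: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._facts[key] = value

    def get(self, key: str) -> Any:
        return self._facts[key]

# A component is any callable that takes the knowledge base and updates it.
Component = Callable[[KnowledgeBase], None]

def exact_matcher(kb: KnowledgeBase) -> None:
    """Illustrative matcher: pairs columns whose names are equal ignoring case."""
    pairs = [(s, t)
             for s in kb.get("source_columns")
             for t in kb.get("target_columns")
             if s.lower() == t.lower()]
    kb.put("matches", pairs)

def prefix_matcher(kb: KnowledgeBase) -> None:
    """Illustrative replacement: pairs columns sharing a 4-character prefix."""
    pairs = [(s, t)
             for s in kb.get("source_columns")
             for t in kb.get("target_columns")
             if s.lower()[:4] == t.lower()[:4]]
    kb.put("matches", pairs)

def mapping_generator(kb: KnowledgeBase) -> None:
    """Reads whatever 'matches' a matcher wrote; unaware of which matcher ran."""
    kb.put("mapping", {s: t for s, t in kb.get("matches")})

def run(components: List[Component]) -> Dict[str, str]:
    """Run the components in order over a small made-up example."""
    kb = KnowledgeBase()
    kb.put("source_columns", ["Price", "post_code"])
    kb.put("target_columns", ["price", "PostCode", "street"])
    for component in components:
        component(kb)
    return kb.get("mapping")

mapping_with_exact = run([exact_matcher, mapping_generator])
mapping_with_prefix = run([prefix_matcher, mapping_generator])
```

Swapping `exact_matcher` for `prefix_matcher` requires no change to `mapping_generator`, because the only contract between them is the `"matches"` entry in the shared store.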

Our proof-of-concept demonstration version of the system was presented at the ACM SIGMOD conference in May 2017 [1]. Since this demonstration version, we have replaced the components for matching, mapping generation, data repair and user context, and added a new component for format transformation.


In transitioning to product, the following have been significant tasks:

  • Continuous integration: with several developers, and the need to deploy on several architectures, we have been using continuous integration to ensure that changes deploy on platforms other than the development machines and that every commit passes the tests.
  • Increased integration test coverage: because the components in the architecture come from several research and development strands, we have developed a significant number of additional tests that exercise the system as a whole once these components are brought together.
  • Change to ethos: the original pay-as-you-go approach was best-effort. Feedback was used to refine the results of automation, but users had quite limited control over how wrangling was carried out.  We needed to make all the decisions made by the system visible to users, and to provide users with the ability to steer these decisions directly.
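
The change to ethos in the last item can be pictured as recording every automated decision together with its rationale, and letting a user override it. The following Python sketch is only one way to do this, under assumed names: `Decision`, `DecisionLog`, `record`, `steer` and `report` are hypothetical and are not taken from Data Preparer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Decision:
    """Hypothetical record of one choice made by the automation."""
    name: str
    automatic: str                  # what the system chose on its own
    rationale: str                  # why, shown to the user
    override: Optional[str] = None  # set only when a user steers the decision

    @property
    def effective(self) -> str:
        """The user's choice wins over the automatic one when present."""
        return self.override if self.override is not None else self.automatic

class DecisionLog:
    """Keeps every decision visible and open to user steering."""

    def __init__(self) -> None:
        self._decisions: Dict[str, Decision] = {}

    def record(self, name: str, automatic: str, rationale: str) -> None:
        self._decisions[name] = Decision(name, automatic, rationale)

    def steer(self, name: str, choice: str) -> None:
        self._decisions[name].override = choice

    def effective(self, name: str) -> str:
        return self._decisions[name].effective

    def report(self) -> List[str]:
        """One human-readable line per decision, marking who made it."""
        lines = []
        for d in self._decisions.values():
            origin = "user" if d.override is not None else "system"
            lines.append(f"{d.name}: {d.effective} ({origin}; {d.rationale})")
        return lines

log = DecisionLog()
log.record("match:post_code", "PostCode", "column names are similar")
before = log.effective("match:post_code")
log.steer("match:post_code", "street")
after = log.effective("match:post_code")
```

In this sketch the automatic choice and its rationale are never discarded when a user intervenes, so the report can show both what the system proposed and what was actually used.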

In addition, in the evolution to the first public release of Data Preparer, significant effort has gone into accommodating real-world needs that are not typically present in a lab environment. These include supporting various operating systems, browsers and databases, and improving robustness, error reporting and performance. So, although the original architecture and much of its functionality remain in place, most of the details have changed significantly.

[1] N. Konstantinou, M. Koehler, E. Abel, C. Civili, B. Neumayr, E. Sallinger, A. Fernandes, G. Gottlob, J. Keane, L. Libkin, N. Paton: The VADA Architecture for Cost-Effective Data Wrangling. Int. Conf. on Management of Data (SIGMOD’17), ACM, Chicago, IL, USA, May 2017