BALab seminars — Data Virtual Machines: Simplifying Data Sharing, Exploration & Querying in Big Data Environments

Abstract

Today’s analytics environments are characterized by a high degree of heterogeneity in data systems, formats and types of analysis. Many occasions call for the rapid, ad hoc, on-demand construction of a data model that represents (parts of) an organization’s data infrastructure, including ML tasks. This data model is then handed to data scientists to work with (to express reports, build ML models, explore, etc.). We present a novel graph-based conceptual model, the Data Virtual Machine (DVM), which represents the data (persistent, transient, derived) of an organization. A DVM can be built quickly and in an agile manner, and it offers schema flexibility. It is amenable to visual interfaces for schema and query management. Dataframing, a frequent preprocessing task, is usually carried out by experienced data engineers in Python or R: a procedural approach with all the known drawbacks. Dataframes over DVMs are instead expressed declaratively, and visually, via a simple and intuitive tool. This way, non-IT experts can be involved in dataframing. In addition, query evaluation takes place within an algebraic framework, with all the known benefits. In other words, a DVM enables the delegation of data engineering tasks to less technical users. Finally, a DVM offers a formalism that facilitates data sharing, data portability and a single view of any entity, because a DVM’s node is an attribute and an entity at the same time. In this respect, DVMs can serve well as a data virtualization technique, an emerging trend in the industry. We argue that DVMs can have a significant practical impact in today’s big data environments.
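
To make the idea of a node being "an attribute and an entity at the same time" more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `DVM`, its methods `add_pairs`, `keys` and `dataframe`, and the sample data are all hypothetical, and the actual DVM model and its algebra are not specified in this abstract. The sketch only assumes that nodes are attribute names, that edges map values of one node to values of another, and that a dataframe is declared by naming a root node plus a list of (target node, aggregate) columns.

```python
from collections import defaultdict


class DVM:
    """Toy graph (hypothetical): nodes are attribute names, edges map values to values."""

    def __init__(self):
        # edges[(src, dst)] -> {src_value: [dst_values, ...]}
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_pairs(self, src, dst, pairs):
        """Register (src_value, dst_value) pairs in both directions."""
        for s, d in pairs:
            self.edges[(src, dst)][s].append(d)
            self.edges[(dst, src)][d].append(s)

    def keys(self, node):
        """All values seen for a node, in first-seen order."""
        seen = {}
        for (src, _dst), mapping in self.edges.items():
            if src == node:
                for value in mapping:
                    seen.setdefault(value, None)
        return list(seen)

    def dataframe(self, root, columns):
        """Declarative dataframe: one row per root value, one aggregated column per target."""
        rows = []
        for key in self.keys(root):
            row = {root: key}
            for target, aggregate in columns:
                values = self.edges.get((root, target), {}).get(key, [])
                row[target] = aggregate(values)
            rows.append(row)
        return rows


def first(values):
    """Aggregate that keeps the first value, or None if there is none."""
    return values[0] if values else None


dvm = DVM()
dvm.add_pairs("custID", "name", [(1, "Ann"), (2, "Bob")])
dvm.add_pairs("custID", "amount", [(1, 10.0), (1, 5.0), (2, 7.5)])

# "custID" acts as the entity (row key); "name" and "amount" act as attributes.
by_customer = dvm.dataframe("custID", [("name", first), ("amount", sum)])

# The same graph, with "name" now acting as the entity and "custID" as an attribute.
by_name = dvm.dataframe("name", [("custID", first)])

print(by_customer)
print(by_name)
```

The caller states what the dataframe should contain (a root and its columns) rather than how to loop and join, which is the declarative style the abstract contrasts with procedural Python or R scripts. Because every node can serve as the root, any attribute can be viewed as an entity without changing the underlying graph.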