BALab seminars — Big Data Mediation: Α Conceptual Data Model and an Algebraic Framework for Efficient Dataframe Processing

Abstract

Most analytics projects focus on the management of the 3Vs of big data and use specific stacks to support this variety. However, they constrain themselves to ''local'' data, data that exists within or ''close'' to the organization. And yet, as it has been recently pointed out, ''the value of data explodes when it can be linked with other data.'' In this paper we present our vision for a global marketplace of analytics---either in the form of per-entity metrics or per-entity data, provided by globally accessible data management tasks---where a data scientist can pick and combine data at will in her data mining algorithms, possibly combining with her own data. The main idea is to use the dataframe, a popular data structure in R and Python. Currently, the columns of a dataframe contain computations or data found within the data infrastructure of the organization. We propose to extend the concept of a column. A column is now a collection of key-value pairs, produced anywhere by a remotely accessed program (e.g., an SQL query, a MapReduce job, even a continuous query.) The key is used for the outer join with the existing dataframe, the value is the content of the column. This whole process should be orchestrated by a set of well-defined, standardized APIs. We argue that the proposed architecture presents numerous challenges and could be beneficial for big data interoperability. In addition, it can be used to build mediation systems involving local or global columns. Columns correspond to attributes of entities, where the primary key of the entity is the key of the involved columns.