Technologies for main memory data analysis

Presenter: Marios Fragkoulis
Date: 02 March 2017


The absence of suitable analytical tools hinders knowledge extraction in cases of software applications that do not need the support of a database system. Some examples are applications whose data have a complex structure and are often stored in files, eg scientific applications in areas such as biology, and applications that do not maintain permanent data, such as data visualization applications and diagnostic tools. Databases offer widely used and recognized query interfaces, but applications that do not need the services of a database should not resort to this solution only to satisfy the need to analyze their data. The thesis studies the methods and technologies for supporting queries on main memory data and how the widespread architecture of software systems currently affects technologies. Based on the findings from the literature we develop a method and a technology to perform interactive queries on data that reside in main memory. After an overview of the programming languages that fit the data analysis we choose SQL, the standard data manipulation language for decades. The method we develop represents programming data structures in relational terms as requires SQL. Our method replaces the associations between structures with relationships between relational representations. The result is a virtual relational schema of the programming data model, which we call relational representation. The implementation, which we carried out in C/C++, includes a domain specific language for describing relational representations, a compiler that generates the source code of the relational interface to the programming data structures given a relational specification, and the implementation of SQLite’s virtual table API. The overall evaluation of our approach involves its integration in three C++ software applications, in the Linux kernel, and in Valgrind, where we also perform a user study with students. We find a) that our approach exhibits greater expressiveness than C++ queries, b) real problems in the Linux kernel, c) opportunities for space and performance optimizations in applications instrumented by Valgrind, and d) that it took users less time to draft queries with SQL than with Python.