Yearly Report 2020

Members

Faculty Members

  • Damianos Chatziantoniou
  • Dimitris Mitropoulos
  • Panos (Panagiotis) Louridas
  • Diomidis Spinellis

Senior Researchers

  • Vaggelis Atlidakis
  • Makrina Viola Kosti
  • Vasiliki Efstathiou
  • Stefanos Georgiou
  • Thodoris Sotiropoulos
  • Marios Fragkoulis

Associate Researchers

  • Charalambos-Ioannis Mitropoulos
  • Zoe Kotti
  • Konstantinos Kravvaritis
  • Stefanos Chaliasos
  • Antonios Gkortzis
  • Tushar Sharma
  • Konstantina Dritsa

Researchers

  • Georgios Liargkovas
  • Rafaila Galanopoulou
  • George Theodorou
  • Christos Pappas
  • Angeliki Papadopoulou
  • George Metaxopoulos
  • Theodosis Tsaklanos
  • Michael Loukeris
  • Marios Papachristou
  • Christos Chatzilenas
  • Aris Pattakos
  • Ioannis Batas
  • Efstathia Chioteli
  • Nikolas Doureliadis
  • Vitalis Salis

Overview in numbers

New Publications Number
Monographs and Edited Volumes 1
Journal Articles 5
Book Chapters 0
Conference Publications 8
Technical Reports 0
White Papers 0
Magazine Articles 0
Working Papers 0
Datasets 0
Total New Publications 14
Projects
New Projects 0
Ongoing Projects 2
Completed Projects 0
Members
Faculty Members 4
Senior Researchers 6
Associate Researchers 7
Researchers 15
Total Members 32
New Members 7
PhDs
Ongoing PhDs 6
Completed PhDs 1
New Seminars
New Seminars 21

New Publications

Monographs and Edited Volumes

    • Panos Louridas. Algorithms. The MIT Press, Cambridge, MA, 2020. ISBN 978-0-262-53902-9.

Journal Articles

    • C. Ebert, P. Louridas, T. M. Fernández-Caramés, and P. Fraga-Lamas. Blockchain technologies in practice. IEEE Software, 37(4):17–25, 2020.
    • Tushar Sharma, Paramvir Singh, and Diomidis Spinellis. An empirical investigation on the relationship between design and architecture smells. Empirical Software Engineering, 25:4020–4078, September 2020.
    • Dimitris Mitropoulos, Thodoris Sotiropoulos, Nikos Koutsovasilis, and Diomidis Spinellis. PDGuard: an architecture for the control and secure processing of personal data. In International Journal of Information Security, volume 19, 479–498. 2020.
    • Zoe Kotti, Konstantinos Kravvaritis, Konstantina Dritsa, and Diomidis Spinellis. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empirical Software Engineering, 25(5):3288–3322, July 2020.
    • Georgios Doukidis, Diomidis Spinellis, and Christof Ebert. Digital transformation: a primer for practitioners. IEEE Software, 37(5):13–21, 2020.

Conference Publications

    • Diomidis Spinellis, Zoe Kotti, and Audris Mockus. A dataset for GitHub repository deduplication. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, 523–527. New York, NY, USA, October 2020. Association for Computing Machinery.
    • Diomidis Spinellis, Zoe Kotti, Konstantinos Kravvaritis, Georgios Theodorou, and Panos Louridas. A dataset of enterprise-driven open source software. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, 533–537. New York, NY, USA, October 2020. Association for Computing Machinery.
    • Thodoris Sotiropoulos, Dimitris Mitropoulos, and Diomidis Spinellis. Practical fault detection in Puppet programs. In 42nd International Conference on Software Engineering, ICSE '20, 26–37. ACM, June 2020.
    • Thodoris Sotiropoulos, Stefanos Chaliasos, Dimitris Mitropoulos, and Diomidis Spinellis. A model for detecting faults in build specifications. In Proceedings of the ACM on Programming Languages, OOPSLA '20. ACM, November 2020.
    • Antoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli. The Software Heritage graph dataset: large-scale analysis of public software development history. In 17th International Conference on Mining Software Repositories, MSR '20, 1–5. New York, NY, USA, June 2020. Association for Computing Machinery.
    • Audris Mockus, Diomidis Spinellis, Zoe Kotti, and Gabriel John Dusing. A complete set of related git repositories identified via community detection approaches based on shared commits. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, 513–517. New York, NY, USA, October 2020. Association for Computing Machinery.
    • Konstantina Dritsa, Thodoris Sotiropoulos, Haris Skarpetis, and Panos Louridas. Search engine similarity analysis: a combined content and rankings approach. In 21st International Conference on Web Information Systems Engineering, WISE '20. Springer, October 2020.
    • Antonis Aggelakis, Prastudy Fauzi, Georgios Korfiatis, Panos Louridas, Foteinos Mergoupis-Anagnou, Janno Siim, and Michał Zając. A non-interactive shuffle argument with low trust assumptions. In Topics in Cryptology: CT-RSA 2020, San Francisco, CA, February 24–28. Cham, Switzerland, February 2020. Springer. Lecture Notes in Computer Science 12006.

New Members

    • Georgios Liargkovas
    • Rafaila Galanopoulou
    • Vaggelis Atlidakis
    • George Theodorou
    • Christos Pappas
    • Angeliki Papadopoulou
    • Makrina Viola Kosti

Ongoing PhDs

    • Zoe Kotti Topic: Data Analysis Applications in Software Engineering
    • Konstantinos Kravvaritis Topic: Data and Quality Metrics of System Configuration Code
    • Antonios Gkortzis Topic: Secure Systems on Cloud Computing Infrastructures
    • Stefanos Georgiou Topic: Energy and Run-Time Performance Practices in Software Engineering
    • Thodoris Sotiropoulos Topic: Abstractions for software testing
    • Konstantina Dritsa Topic: Data Science

Completed PhDs

    • Vaggelis Atlidakis Topic: Structure and Feedback in Cloud Service API Fuzzing

Seminars

      Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

      Date: 20 February 2020
      Presenter: Damianos Chatziantoniou
      Abstract

      In this talk we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and utilized in the production of a relational representation, DVMs are based on a bottom up approach, mapping the data infrastructure of an organization to a graph-based model. With the term ``data infrastructure'' we refer to not only data persistently stored in data management systems adhering to some data model, but also of generic data processing tasks that produce an output useful in decision making. For example, a python program that “does something” and computes for each customer her probability to churn is an essential component of the organization’s data landscape and has to be made available to the user, e.g. a data scientist, in an easy to understand and intuitive to use manner, the same way the age or gender of a customer are made. We define formally a DVM, queries over DVMs and an algebraic framework for query evaluation. We also claim that a conceptual layer, such as DVM, is a prerequisite for end-to-end processing. In addition, we present a prototype tool based on DVMs, called DataMingler, which enables end-to-end processing in analytics environments by data stakeholders. Specifically, DataMingler:

      • allows data engineers to easily, quickly map actual data and processes’ output onto it, regardless systems and models
      • makes it easy for data scientists to understand entities and attributes, transform data as they wish – possibly using different languages – and define inputs for ML algorithms/ ad hoc reports
      • enables data officers to govern data, see data provenance and comply with data regulations
      • gives a view of own data to data contributors (e.g. customers, suppliers) – understand, retrieve, manage their personal data


      A Spring Lockodown

      Date: 29 May 2020
      Presenter: Konstantinos Kravvaritis
      Abstract

      Spring boot is a Spring Framework extension that aims to provide more convenience utilities and accelarate the development process. In this presentation, we give a brief overview of Spring boot, the reasons that it was introduced and its advantages.


      Awesome uses of Raspberry Pi

      Date: 29 May 2020
      Presenter: Stefanos Georgiou
      Abstract

      Raspberry Pi is a small, compacted, and cheap computer to build easily reliable systems. In this presentation, we attach a DHT-11 sensor to obtain temperature and humidity measurements. After, we extend this computer system to act upon events and inform the user if something goes wrong.


      Ultra-large-scale compressed graphs: The case of Software Heritage

      Date: 29 May 2020
      Presenter: Zoe Kotti
      Abstract

      While publicly available original source code doubles every two years, and state-of-the-art "big data" approaches require the occupation of several machines, compressed graphs can dramatically reduce the hardware resources needed to mine such large corpora. In this presentation we explore the compressed graph of Software Heritage, a dataset aimed at collecting and storing all publicly available source code. In general, graphs are suitable data models for conducting version control system analyses, while compressed graphs allow impressive graph traversal performances.


      Enriching Greek Parliament Dataset

      Date: 05 June 2020
      Presenter: Konstantina Dritsa
      Abstract

      Enriching the Greek Parliament Proceedings Dataset (https://zenodo.org/record/2587904) with information on the gender and parliament role of the Parliament members.


      Epidose: Contact tracing for all

      Date: 05 June 2020
      Presenter: Diomidis Spinellis
      Abstract

      Epidose is an open source software reference implementation for an epidemic dosimeter. Just as a radiation dosimeter measures dose uptake of external ionizing radiation, the epidemic dosimeter tracks potential exposure to viruses or bacteria associated with an epidemic. The dosimeter measures a person's exposure to an epidemic, such as COVID-19, based on exposure to contacts that have been tested positive. The epidemic dosimeter is designed to be widely accessible and to safeguard privacy. Specifically, it is designed to run on the $10 open-hardware Raspberry Pi Zero-W computer, with a minimal user interface, comprising LED indicators regarding operation and exposure risk and a physical interlock switch to allow the release of contact data. The software is based on the DP3T contact tracing "unlinkable" design and corresponding reference implementation code.


      Natural Language Understanding for Software Engineering

      Date: 05 June 2020
      Presenter: Vasiliki Efstathiou
      Abstract

      Details will be provided during the seminar.


      Automatically reproducing and analyzing Debian Packages with sbuild.

      Date: 19 June 2020
      Presenter: Stefanos Chaliasos
      Abstract

      A challenging issue in analyzing C packages is that it is challenging to reproduce them. In this talk, we present how we can exploit sbuild for running static analysis tools that need to rebuild a package to analyze it. sbuild is a tool that automates the build process of Debian packages. Moreover, we will see how we can scrape the Ultimate Debian Database for selecting packages base on some criteria.


      Revising/teaching/coding-fun in quarantine

      Date: 19 June 2020
      Presenter: Antonis Gkortzis
      Abstract

      This short presentation is a compilation of the tasks and activities completed during the COVID-19 quarantine time. The first part of the quarantine was dedicated to finalizing and submitting the revised paper, titled Software Reuse Cuts Both Ways, to JSS. The second part was dedicated to the Real Estate Analytics EU-funded program. Specifically, we finalized a data warehouse schema, consisting of > 300 tables (facts and dimensions), and begun its population. During both quarantine parts, teaching was an always-on task and required a great effort, with the participation being the largest of the last four years.


      Use of Neural Networks to Improve Fuzzing

      Date: 19 June 2020
      Presenter: Charalambos Ioannis Mitropoulos
      Abstract

      We will make an introduction of our recent work in fuzzing Android Native Libraries, and we will take a look of how we can improve fuzzing with the use of Neural Networks, presenting three different approaches proposed in three different papers that we read in quarantine time.


      A Dataset for GitHub Repository Deduplication

      Date: 26 June 2020
      Presenter: Diomidis Spinellis
      Abstract

      GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.


      A Dataset of Enterprise-Driven Open Source Software

      Date: 26 June 2020
      Presenter: Diomidis Spinellis
      Abstract

      We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.


      Practical Fault Detection in Puppet Programs

      Date: 26 June 2020
      Presenter: Thodoris Sotiropoulos
      Abstract

      Puppet is a popular computer system configuration management tool. By providing abstractions that model system resources it allows administrators to set up computer systems in a reliable, predictable, and documented fashion. Its use suffers from two potential pitfalls. First, if ordering constraints are not correctly specified whenever a Puppet resource depends on another, the non-deterministic application of resources can lead to race conditions and consequent failures. Second, if a service is not tied to its resources (through the notification construct), the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure's availability and functionality.

      We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, a formal model for traces allows us to capture the interactions of Puppet resources with the file system. By analyzing these interactions we identify (1) resources that are related to each other (e.g., operate on the same file), and (2) resources that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace's analysis against the program's dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault.

      We have evaluated our method on a large set of popular Puppet modules, and discovered 92 previously unknown issues in 33 modules. Performance benchmarking shows that our approach can analyze in seconds real-world configurations with a magnitude measured in thousands of lines and millions of system calls.


      Stochastic Opinion Dynamics for User Interest Prediction in Online Social Networks

      Date: 03 July 2020
      Presenter: Marios Papachristou
      Abstract

      In this seminar, we are going to talk about how one can infer the interests (e.g. hobbies) of users in online social networks using information from highly influential users of the network. More specifically, we experimentally observe that the majority of the network users (>70%) is dominated by a sublinear fraction of highly-influential nodes (core nodes). This structural property of networks is also known as the "core-periphery" structure, a phenomenon long-studied in economics and sociology.

      Using the influencers' initial opinions as steady-state trend-setters, we develop a generative model through which we explain how the users' interests (opinions) evolve over time, where each peripheral user looks at her k-nearest neighbors. Our model has strong theoretical and experimental guarantees and is able to surpass node embedding methods and related opinion dynamics methods and is able to scale to networks with millions of nodes.

      Duration: 30-40min.

      Joint work with D. Fotakis (NTUA).


      Pawk: A parallel programming implementation of Awk

      Date: 11 September 2020
      Presenter: Georgios Theodorou
      Abstract

      Pawk is an extension to GoAwk, having been designed with efficiency in mind. It manages to achieve considerably higher performance to both the standard use Awk as well as GoAwk. The two reasons behind the significant speed boost offered by Pawk are the use of multi-threading programming, as well as the choice of Golang as the language of implementation. Since Pawk makes use of parallel programming, it is logical to be restricted only to operations that can be executed in parallel. However, we believe that Pawk can come handy in a plethora of cases, especially when dealing with multi-GB files.


      Securing the Operations and Services of GRNET

      Date: 25 September 2020
      Presenter: Dimitris Mitropoulos
      Abstract

      GRNET CERT (Computer Emergency Response Team) provides incident response and security services to both the National Infrastructures for Research and Technology (GRNET) and to all Greek academic and research institutions. To do so, it employs Open-source Software (OSS) and approaches proposed by the academic community. In this talk we will discuss how GRNET CERT uses OSS to provide early warnings and alerts to its members and relevant organizations regarding risks and incidents. Furthermore, we will discuss how the team utilizes program analysis methods to assist the security audits it performs.


      Search Engine Similarity Analysis - A Combined Content and Rankings Approach

      Date: 16 October 2020
      Presenter: Konstantina Dritsa
      Abstract

      How different are search engines? The search engine wars are a favorite topic of on-line analysts, as two of the biggest companies in the world, Google and Microsoft, battle for prevalence of the web search space.Differences in search engine popularity can be explained by their effectiveness or other factors, such as familiarity with the most popular first engine, peer imitation, or force of habit. In this work we present a thorough analysis of the affinity of the two major search engines, Google and Bing, along with DuckDuckGo, which goes to great lengths to emphasize its privacy-friendly credentials. To do so, we collected search results using a comprehensive set of 300 unique queries for two time periods in 2016 and 2019, and developed a new similarity metric that leverages both the content and the ranking of search responses. We evaluated the characteristics of the metric against other metrics and approaches that have been proposed in the literature, and used it to (1) investigate the similarities of search engine results, (2) the evolution of their affinity over time, (3) what aspects of the results influence similarity, and (4) how the metric differs over different kinds of search services. We found that Google stands apart, but Bing and DuckDuckGo are largely indistinguishable from each other.


      A Model for Detecting Faults in Build Specifications

      Date: 23 October 2020
      Presenter: Thodoris Sotiropoulos
      Abstract

      Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing the build operations that were affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task. This is mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures, and non-deterministic and incorrect outputs.

      To reason about arbitrary build executions, we present BuildFS, a generally-applicable model that takes into account the specification (as declared in build scripts) and the actual behavior (low-level file system operation) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing approach, which relies on the proposed model, analyzes the execution of single full build, translates it into BuildFS, and uncovers faults by checking for corresponding violations.

      We evaluate the effectiveness, efficiency, and applicability of our approach by examining 612 Make and Gradle projects. Notably, thanks to our treatment of build executions, our method is the first to handle JVM-oriented build systems. The results indicate that our approach is (1) able to uncover several important issues (247 issues found in 47 open-source projects have been confirmed and fixed by the upstream developers), and (2) much faster than a state-of-the-art tool for Make builds (the median and average speedup is 39X and 74X respectively).


      GraphQL, GraphQL-Mesh, and the semantic web

      Date: 13 November 2020
      Presenter: Uri Goldshtein
      Abstract

      GraphQL is an open-source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data. I'll talk a bit about GraphQL in general, then a bit about the ideas behind GraphQL-Mesh, and then talk about the future ideas like GraphQL and semantic web and ideas like that for which there is a lot of room for exploration.

      Uri is the founder of The Guild - A group of open source developers, mostly focused around GraphQL. He was part of the writers of the GraphQL Subscriptions spec. The Guild works with companies around the world, helping them with their API technologies and improving the open source libraries while doing it.


      Detecting Locations in JavaScript Programs Affected by Breaking Library Changes

      Date: 27 November 2020
      Presenter: Anders Møller, Benjamin Barslev Nielsen, Martin Toldam Torp
      Abstract

      JavaScript libraries are widely used and evolve rapidly. When adapting client code to non-backwards compatible changes in libraries, a major challenge is how to locate affected API uses in client code, which is currently a difficult manual task. In this paper we address this challenge by introducing a simple pattern language for expressing API access points and a pattern-matching tool based on lightweight static analysis.

      Experimental evaluation on 15 popular npm packages shows that typical breaking changes are easy to express as patterns. Running the static analysis on 265 clients of these packages shows that it is accurate and efficient: it reveals usages of breaking APIs with only 14% false positives and no false negatives, and takes less than a second per client on average. In addition, the analysis is able to report its confidence, which makes it easier to identify the false positives. These results suggest that the approach, despite its simplicity, can reduce the manual effort of the client developers.

      This is a paper presented at this year's OOPSLA (Object-Oriented Programming, Systems, Languages & Applications) conference and recommended for the seminar by Theodoris Sotiropoulos. We will watch the talk's video and then discuss the method, the results, and the paper.


      Analytical database execution engines, vectorization, and Amazon Redshift applications

      Date: 18 December 2020
      Presenter: Orestis Polychroniou
      Abstract

      Query execution engines for analytics are continuously adapting to the underlying hardware in order to maximize performance. Wider SIMD registers and more complex SIMD instruction sets are emerging in mainstream CPUs and new processor designs such as the many-core Intel Xeon Phi CPUs that rely on SIMD vectorization to achieve high performance per core while packing a greater number of smaller cores per chip. In the database literature, using SIMD to optimize stand-alone operators with key–rid pairs is common, yet the state-of-the-art query engines rely on compilation of tightly coupled operators where hand-optimized individual operators become impractical. In this article, we extend a state-of-the-art analytical query engine design by combining code generation and operator pipelining with SIMD vectorization, and show that the SIMD speedup is diminished when execution is dominated by random memory accesses. To better utilize the hardware features, we introduce VIP, an analytical query engine designed and built bottom-up from pre- compiled column-oriented data-parallel sub-operators and implemented entirely in SIMD. In our evaluation using synthetic and TPC-H queries on a many-core CPU we show that VIP outperforms hand-optimized query-specific code without incurring the runtime compilation overhead, and highlight the efficiency of VIP at utilizing the hardware features of many-core CPUs.

      Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. Since launching in February 2013, it has been Amazon Web Service's (AWS) fastest growing service, with many thousands of customers and many petabytes of data under management. Amazon Redshift's pace of adoption has been a surprise to many participants in the data warehousing community. While Amazon Redshift was priced disruptively at launch, available for as little as $1000/TB/year, there are many open-source data warehousing technologies and many commercial data warehousing engines that provide free editions for development or under some usage limit. While Amazon Redshift provides a modern MPP, columnar, scale-out architecture, so too do many other data warehousing engines. And, while Amazon Redshift is available in the AWS cloud, one can build data warehouses using EC2 instances and the database engine of one's choice with either local or network-attached storage.

      Orestis Polychroniou is a senior software engineer at Amazon Web Services working on the core query execution performance of Amazon Redshift. Before joining Amazon he did his PhD with Kenneth A. Ross at Columbia University on modern databases with his PhD thesis focusing on analytical query execution optimizations optimized for all layers of modern hardware. His research has ~500 citations.


Note: Data before 2017 may refer to grandparented work conducted by BALab's members at its progenitor laboratory, ISTLab.