Yearly Report 2020

Overview in numbers

New Publications Number
Monographs and Edited Volumes 1
Journal Articles 4
Book Chapters 0
Conference Publications 7
Technical Reports 0
White Papers 0
Magazine Articles 0
Working Papers 0
Datasets 0
Total New Publications 12
New Projects 0
Ongoing Projects 2
Completed Projects 0
Faculty Members 3
Senior Researchers 5
Associate Researchers 7
Researchers 18
Total Members 33
New Members 8
Ongoing PhDs 6
Completed PhDs 1
New Seminars 19

New Publications

Monographs and Edited Volumes

    • Panos Louridas. Algorithms. The MIT Press, Cambridge, MA, 2020. ISBN 978-0-262-53902-9.

Journal Articles

    • C. Ebert, P. Louridas, T. M. Fernández-Caramés, and P. Fraga-Lamas. Blockchain technologies in practice. IEEE Software, 37(4):17–25, 2020.
    • Dimitris Mitropoulos, Thodoris Sotiropoulos, Nikos Koutsovasilis, and Diomidis Spinellis. PDGuard: an architecture for the control and secure processing of personal data. International Journal of Information Security, 19:479–498, 2020.
    • Zoe Kotti, Konstantinos Kravvaritis, Konstantina Dritsa, and Diomidis Spinellis. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empirical Software Engineering, 25(5):3288–3322, July 2020.
    • Georgios Doukidis, Diomidis Spinellis, and Christof Ebert. Digital transformation: a primer for practitioners. IEEE Software, 37(5):13–21, 2020. doi:10.1109/MS.2020.2999969.

Conference Publications

    • Diomidis Spinellis, Zoe Kotti, and Audris Mockus. A dataset for GitHub repository deduplication. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, 523–527. New York, NY, USA, October 2020. Association for Computing Machinery.
    • Diomidis Spinellis, Zoe Kotti, Konstantinos Kravvaritis, Georgios Theodorou, and Panos Louridas. A dataset of enterprise-driven open source software. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, 533–537. New York, NY, USA, October 2020. Association for Computing Machinery.
    • Thodoris Sotiropoulos, Dimitris Mitropoulos, and Diomidis Spinellis. Practical fault detection in Puppet programs. In 42nd International Conference on Software Engineering, ICSE '20, 26–37. ACM, June 2020.
    • Thodoris Sotiropoulos, Stefanos Chaliasos, Dimitris Mitropoulos, and Diomidis Spinellis. A model for detecting faults in build specifications. In Proceedings of the ACM on Programming Languages, OOPSLA '20. ACM, November 2020.
    • Audris Mockus, Diomidis Spinellis, Zoe Kotti, and Gabriel John Dusing. A complete set of related git repositories identified via community detection approaches based on shared commits. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, 513–517. New York, NY, USA, October 2020. Association for Computing Machinery.
    • Konstantina Dritsa, Thodoris Sotiropoulos, Haris Skarpetis, and Panos Louridas. Search engine similarity analysis: a combined content and rankings approach. In 21st International Conference on Web Information Systems Engineering, WISE '20. Springer, October 2020.
    • Antonis Aggelakis, Prastudy Fauzi, Georgios Korfiatis, Panos Louridas, Foteinos Mergoupis-Anagnou, Janno Siim, and Michał Zając. A non-interactive shuffle argument with low trust assumptions. In Topics in Cryptology: CT-RSA 2020, San Francisco, CA, February 24–28. Cham, Switzerland, February 2020. Springer. Lecture Notes in Computer Science 12006.

New Members

    • Theodoros Plessas
    • Vaggelis Atlidakis
    • George Theodorou
    • Christos Pappas
    • George Liargkovas
    • Rafaila Galanopoulou
    • Angeliki Papadopoulou
    • Makrina Viola Kosti

Ongoing PhDs

    • Zoe Kotti Topic: Data Analysis Applications in Software Engineering
    • Konstantinos Kravvaritis Topic: Data and Quality Metrics of System Configuration Code
    • Antonios Gkortzis Topic: Secure Systems on Cloud Computing Infrastructures
    • Stefanos Georgiou Topic: Energy Efficiency in Cloud Computing
    • Thodoris Sotiropoulos Topic: Techniques for Improving the Reliability of the Event-Driven Programs
    • Konstantina Dritsa Topic: Data Science

Completed PhDs

    • Vaggelis Atlidakis Topic: Structure and Feedback in Cloud Service API Fuzzing

New Seminars


      Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

      Date: 20 February 2020
      Presenter: Damianos Chatziantoniou

      In this talk we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and utilized in the production of a relational representation, DVMs are based on a bottom-up approach, mapping the data infrastructure of an organization to a graph-based model. With the term "data infrastructure" we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce an output useful in decision making. For example, a Python program that “does something” and computes for each customer her probability to churn is an essential component of the organization’s data landscape and has to be made available to the user, e.g. a data scientist, in an easy-to-understand and intuitive manner, the same way the age or gender of a customer are made. We formally define a DVM, queries over DVMs, and an algebraic framework for query evaluation. We also claim that a conceptual layer, such as DVM, is a prerequisite for end-to-end processing. In addition, we present a prototype tool based on DVMs, called DataMingler, which enables end-to-end processing in analytics environments by data stakeholders. Specifically, DataMingler:

      • allows data engineers to easily and quickly map actual data and processes’ output onto it, regardless of systems and models
      • makes it easy for data scientists to understand entities and attributes, transform data as they wish – possibly using different languages – and define inputs for ML algorithms/ ad hoc reports
      • enables data officers to govern data, see data provenance and comply with data regulations
      • gives data contributors (e.g. customers, suppliers) a view of their own data, so that they can understand, retrieve, and manage their personal data

      A Spring Lockdown

      Date: 29 May 2020
      Presenter: Konstantinos Kravvaritis

      Spring Boot is a Spring Framework extension that aims to provide more convenience utilities and accelerate the development process. In this presentation, we give a brief overview of Spring Boot, the reasons it was introduced, and its advantages.

      Awesome uses of Raspberry Pi

      Date: 29 May 2020
      Presenter: Stefanos Georgiou

      Raspberry Pi is a small, compact, and cheap computer that makes it easy to build reliable systems. In this presentation, we attach a DHT-11 sensor to obtain temperature and humidity measurements. Afterwards, we extend this computer system to act upon events and inform the user if something goes wrong.

      Ultra-large-scale compressed graphs: The case of Software Heritage

      Date: 29 May 2020
      Presenter: Zoe Kotti

      While publicly available original source code doubles every two years, and state-of-the-art "big data" approaches require the occupation of several machines, compressed graphs can dramatically reduce the hardware resources needed to mine such large corpora. In this presentation we explore the compressed graph of Software Heritage, a dataset aimed at collecting and storing all publicly available source code. In general, graphs are suitable data models for conducting version control system analyses, while compressed graphs allow impressive graph traversal performance.

      Enriching Greek Parliament Dataset

      Date: 05 June 2020
      Presenter: Konstantina Dritsa

      Abstract: Enriching the Greek Parliament Proceedings Dataset with information on the gender and parliamentary role of the Parliament members.

      Epidose: Contact tracing for all

      Date: 05 June 2020
      Presenter: Diomidis Spinellis

      Epidose is an open source software reference implementation for an epidemic dosimeter. Just as a radiation dosimeter measures dose uptake of external ionizing radiation, the epidemic dosimeter tracks potential exposure to viruses or bacteria associated with an epidemic. The dosimeter measures a person's exposure to an epidemic, such as COVID-19, based on exposure to contacts that have been tested positive. The epidemic dosimeter is designed to be widely accessible and to safeguard privacy. Specifically, it is designed to run on the $10 open-hardware Raspberry Pi Zero-W computer, with a minimal user interface, comprising LED indicators regarding operation and exposure risk and a physical interlock switch to allow the release of contact data. The software is based on the DP3T contact tracing "unlinkable" design and corresponding reference implementation code.

      Natural Language Understanding for Software Engineering

      Date: 05 June 2020
      Presenter: Vasiliki Efstathiou

      Abstract: Details will be provided during the seminar.

      Automatically reproducing and analyzing Debian Packages with sbuild

      Date: 19 June 2020
      Presenter: Stefanos Chaliasos

      Abstract: A major obstacle in analyzing C packages is that they are hard to reproduce. In this talk, we present how we can exploit sbuild for running static analysis tools that need to rebuild a package to analyze it. sbuild is a tool that automates the build process of Debian packages. Moreover, we will see how we can scrape the Ultimate Debian Database for selecting packages based on some criteria.

      Revising/teaching/coding-fun in quarantine

      Date: 19 June 2020
      Presenter: Antonis Gkortzis

      Abstract: This short presentation is a compilation of the tasks and activities completed during the COVID-19 quarantine. The first part of the quarantine was dedicated to finalizing and submitting the revised paper, titled Software Reuse Cuts Both Ways, to JSS. The second part was dedicated to the Real Estate Analytics EU-funded program. Specifically, we finalized a data warehouse schema, consisting of more than 300 tables (facts and dimensions), and began its population. During both parts of the quarantine, teaching was an always-on task that required great effort, with participation being the highest of the last four years.

      Use of Neural Networks to Improve Fuzzing

      Date: 19 June 2020
      Presenter: Charalambos Ioannis Mitropoulos

      Abstract: We will introduce our recent work on fuzzing Android native libraries, and we will look at how fuzzing can be improved with neural networks, presenting three different approaches proposed in three papers that we read during the quarantine.

      A Dataset for GitHub Repository Deduplication

      Date: 26 June 2020
      Presenter: Diomidis Spinellis

      GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
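
      The grouping of related projects as connected components can be illustrated with a small union-find sketch. This is not the dataset's actual pipeline: the function names, the edge list, and the project names below are all hypothetical, and the denoising and ranking steps described above are left out.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's set, compressing the path on the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def group_projects(edges):
    """Group projects into connected components.

    edges: iterable of (project, parent_project) pairs (hypothetical input).
    Returns a sorted list of sorted member lists, one per component.
    """
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # merge the two components
    groups = defaultdict(list)
    for node in parent:
        groups[find(parent, node)].append(node)
    return sorted(sorted(members) for members in groups.values())


# Made-up example edges: each copy points to the project it was derived from.
example_edges = [
    ("alice/linux", "torvalds/linux"),
    ("bob/linux", "alice/linux"),
    ("carol/vim", "vim/vim"),
]
print(group_projects(example_edges))
```

      With these made-up edges, the three linux repositories end up in one group and the two vim repositories in another.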

      A Dataset of Enterprise-Driven Open Source Software

      Date: 26 June 2020
      Presenter: Diomidis Spinellis

      We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

      Practical Fault Detection in Puppet Programs

      Date: 26 June 2020
      Presenter: Thodoris Sotiropoulos

      Puppet is a popular computer system configuration management tool. By providing abstractions that model system resources it allows administrators to set up computer systems in a reliable, predictable, and documented fashion. Its use suffers from two potential pitfalls. First, if ordering constraints are not correctly specified whenever a Puppet resource depends on another, the non-deterministic application of resources can lead to race conditions and consequent failures. Second, if a service is not tied to its resources (through the notification construct), the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure's availability and functionality.

      We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, a formal model for traces allows us to capture the interactions of Puppet resources with the file system. By analyzing these interactions we identify (1) resources that are related to each other (e.g., operate on the same file), and (2) resources that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace's analysis against the program's dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault.

      We have evaluated our method on a large set of popular Puppet modules, and discovered 92 previously unknown issues in 33 modules. Performance benchmarking shows that our approach can analyze in seconds real-world configurations with a magnitude measured in thousands of lines and millions of system calls.
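
      The check of trace-derived relationships against the dependency graph can be sketched roughly as follows. This is a simplified illustration, not the tool's implementation: it assumes the trace has already been reduced to a hypothetical mapping from each resource to the files it touched, ignores the distinction between reads and writes, and does not cover notifications.

```python
from itertools import combinations


def reachable(graph, src, dst):
    """Return True if dst can be reached from src by following ordering edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def missing_orderings(file_accesses, graph):
    """Report resource pairs that touch the same file but are left unordered.

    file_accesses: resource -> set of paths observed in its trace (hypothetical).
    graph: resource -> list of resources declared to run after it.
    """
    faults = []
    for a, b in combinations(sorted(file_accesses), 2):
        shared = file_accesses[a] & file_accesses[b]
        if shared and not (reachable(graph, a, b) or reachable(graph, b, a)):
            faults.append((a, b, sorted(shared)))
    return faults


# Made-up example: the package and the file resource both touch nginx.conf,
# but the program declares no ordering between them.
accesses = {
    "Package[nginx]": {"/etc/nginx/nginx.conf"},
    "File[nginx.conf]": {"/etc/nginx/nginx.conf"},
    "Service[nginx]": {"/run/nginx.pid"},
}
declared = {"Package[nginx]": ["Service[nginx]"]}
print(missing_orderings(accesses, declared))
```

      Declaring an ordering between the two resources in the hypothetical graph makes the reported mismatch disappear.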

      Stochastic Opinion Dynamics for User Interest Prediction in Online Social Networks

      Date: 03 July 2020
      Presenter: Marios Papachristou

      In this seminar, we are going to talk about how one can infer the interests (e.g. hobbies) of users in online social networks using information from highly influential users of the network. More specifically, we experimentally observe that the majority of the network users (>70%) is dominated by a sublinear fraction of highly-influential nodes (core nodes). This structural property of networks is also known as the "core-periphery" structure, a phenomenon long-studied in economics and sociology.

      Using the influencers' initial opinions as steady-state trend-setters, we develop a generative model that explains how the users' interests (opinions) evolve over time, where each peripheral user looks at her k-nearest neighbors. Our model has strong theoretical and experimental guarantees, surpasses node embedding methods and related opinion dynamics methods, and scales to networks with millions of nodes.

      Duration: 30-40min.

      Joint work with D. Fotakis (NTUA).

      Pawk: A parallel programming implementation of Awk

      Date: 11 September 2020
      Presenter: Georgios Theodorou

      Pawk is an extension of GoAwk designed with efficiency in mind. It achieves considerably higher performance than both standard Awk and GoAwk. The two reasons behind the significant speed boost offered by Pawk are the use of multi-threaded programming and the choice of Golang as the implementation language. Since Pawk relies on parallel programming, it is restricted to operations that can be executed in parallel. However, we believe that Pawk can come in handy in a plethora of cases, especially when dealing with multi-GB files.
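
      The kind of operation that can be executed in parallel can be illustrated with a field sum, roughly what awk '{ s += $2 } END { print s }' computes. The sketch below is not Pawk's code (Pawk is written in Go) and its function names are hypothetical; it only shows the split, process, and combine pattern. In CPython, threads do not speed up CPU-bound work, so this illustrates the structure rather than the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_field(lines, field):
    """Sum the given 1-based whitespace-separated field over some lines."""
    total = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) >= field:  # lines lacking the field contribute nothing
            total += float(parts[field - 1])
    return total


def parallel_sum(lines, field, workers=4):
    """Split lines into roughly equal chunks, sum each chunk, and combine."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum_field(chunk, field), chunks)
        return sum(partials)


# Made-up input records.
records = ["a 1", "b 2", "c 3", "d 4", "e 5"]
print(parallel_sum(records, 2))
```

      The sum works because partial results from independent chunks can be combined in any order; operations that depend on the state left by previous records do not split this way.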

      Securing the Operations and Services of GRNET

      Date: 25 September 2020
      Presenter: Dimitris Mitropoulos

      GRNET CERT (Computer Emergency Response Team) provides incident response and security services to both the National Infrastructures for Research and Technology (GRNET) and to all Greek academic and research institutions. To do so, it employs Open-source Software (OSS) and approaches proposed by the academic community. In this talk we will discuss how GRNET CERT uses OSS to provide early warnings and alerts to its members and relevant organizations regarding risks and incidents. Furthermore, we will discuss how the team utilizes program analysis methods to assist the security audits it performs.

      Search Engine Similarity Analysis - A Combined Content and Rankings Approach

      Date: 16 October 2020
      Presenter: Konstantina Dritsa

      Abstract: How different are search engines? The search engine wars are a favorite topic of on-line analysts, as two of the biggest companies in the world, Google and Microsoft, battle for prevalence in the web search space. Differences in search engine popularity can be explained by their effectiveness or other factors, such as familiarity with the most popular first engine, peer imitation, or force of habit. In this work we present a thorough analysis of the affinity of the two major search engines, Google and Bing, along with DuckDuckGo, which goes to great lengths to emphasize its privacy-friendly credentials. To do so, we collected search results using a comprehensive set of 300 unique queries for two time periods in 2016 and 2019, and developed a new similarity metric that leverages both the content and the ranking of search responses. We evaluated the characteristics of the metric against other metrics and approaches that have been proposed in the literature, and used it to investigate (1) the similarities of search engine results, (2) the evolution of their affinity over time, (3) what aspects of the results influence similarity, and (4) how the metric differs over different kinds of search services. We found that Google stands apart, but Bing and DuckDuckGo are largely indistinguishable from each other.

      A Model for Detecting Faults in Build Specifications

      Date: 23 October 2020
      Presenter: Thodoris Sotiropoulos

      Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing the build operations that were affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task. This is mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures, and non-deterministic and incorrect outputs.

      To reason about arbitrary build executions, we present BuildFS, a generally-applicable model that takes into account the specification (as declared in build scripts) and the actual behavior (low-level file system operations) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing approach, which relies on the proposed model, analyzes the execution of a single full build, translates it into BuildFS, and uncovers faults by checking for corresponding violations.

      We evaluate the effectiveness, efficiency, and applicability of our approach by examining 612 Make and Gradle projects. Notably, thanks to our treatment of build executions, our method is the first to handle JVM-oriented build systems. The results indicate that our approach is (1) able to uncover several important issues (247 issues found in 47 open-source projects have been confirmed and fixed by the upstream developers), and (2) much faster than a state-of-the-art tool for Make builds (the median and average speedup is 39X and 74X respectively).
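
      The idea of checking observed file system operations against a build operation's declared specification can be sketched as follows. This is a loose illustration and not the BuildFS model: the data layout, the function name, and the two violation labels are hypothetical simplifications.

```python
def find_violations(declared, observed):
    """Compare declared task specifications with observed file operations.

    declared: task -> {"inputs": set of paths, "outputs": set of paths}
    observed: list of (task, op, path) tuples, with op "read" or "write"
    Both structures are hypothetical stand-ins for a build script and a trace.
    """
    violations = []
    for task, op, path in observed:
        spec = declared.get(task, {"inputs": set(), "outputs": set()})
        if op == "read" and path not in spec["inputs"] | spec["outputs"]:
            # A change to this file would not trigger a rebuild of the task.
            violations.append((task, "undeclared input", path))
        elif op == "write" and path not in spec["outputs"]:
            # Another task may consume this file without being ordered after it.
            violations.append((task, "undeclared output", path))
    return violations


# Made-up example: the compile task reads config.h without declaring it.
spec = {"compile": {"inputs": {"main.c"}, "outputs": {"main.o"}}}
trace = [
    ("compile", "read", "main.c"),
    ("compile", "read", "config.h"),
    ("compile", "write", "main.o"),
]
print(find_violations(spec, trace))
```

      In this made-up trace, the read of config.h is the only operation outside the declared specification, so it is the only violation reported.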

      GraphQL, GraphQL-Mesh, and the semantic web

      Date: 13 November 2020
      Presenter: Uri Goldshtein

      GraphQL is an open-source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data. I'll talk a bit about GraphQL in general, then a bit about the ideas behind GraphQL-Mesh, and finally about future directions, such as combining GraphQL with the semantic web, where there is a lot of room for exploration.

      Uri is the founder of The Guild, a group of open source developers mostly focused on GraphQL. He was one of the authors of the GraphQL Subscriptions spec. The Guild works with companies around the world, helping them with their API technologies and improving the open source libraries along the way.

Note: Some of the above data refer to grandfathered work conducted by BALab's members at its progenitor laboratory, ISTLab.