Yearly Report 2020

Overview in numbers

New Publications                  Number
  Monographs and Edited Volumes        1
  Journal Articles                     3
  Book Chapters                        0
  Conference Publications              5
  Technical Reports                    0
  White Papers                         0
  Magazine Articles                    0
  Working Papers                       0
  Datasets                             0
  Total New Publications               9
Projects
  New Projects                         0
  Ongoing Projects                     1
  Completed Projects                   0
Members
  Faculty Members                      3
  Senior Researchers                   4
  Associate Researchers                7
  Researchers                         17
  Total Members                       31
  New Members                          6
PhDs
  Ongoing PhDs                         6
  Completed PhDs                       0
New Seminars
  New Seminars                        15

New Publications

Monographs and Edited Volumes

    • Panos Louridas. Algorithms. The MIT Press, Cambridge, MA, 2020. ISBN 978-0-262-53902-9.

Journal Articles

    • C. Ebert, P. Louridas, T. M. Fernández-Caramés, and P. Fraga-Lamas. Blockchain technologies in practice. IEEE Software, 37(4):17–25, 2020.
    • Zoe Kotti, Konstantinos Kravvaritis, Konstantina Dritsa, and Diomidis Spinellis. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empirical Software Engineering, pages 1–35, July 2020.
    • Georgios Doukidis, Diomidis Spinellis, and Christof Ebert. Digital transformation: a primer for practitioners. IEEE Software, 37(5):13–21, 2020. doi:10.1109/MS.2020.2999969.

Conference Publications

    • Diomidis Spinellis, Zoe Kotti, and Audris Mockus. A dataset for GitHub repository deduplication. In 17th International Conference on Mining Software Repositories, MSR '20. New York, NY, USA, October 2020. Association for Computing Machinery. To appear.
    • Diomidis Spinellis, Zoe Kotti, Konstantinos Kravvaritis, Georgios Theodorou, and Panos Louridas. A dataset of enterprise-driven open source software. In 17th International Conference on Mining Software Repositories, MSR '20. New York, NY, USA, October 2020. Association for Computing Machinery. To appear.
    • Thodoris Sotiropoulos, Dimitris Mitropoulos, and Diomidis Spinellis. Practical fault detection in Puppet programs. In 42nd International Conference on Software Engineering, ICSE '20. 2020. To appear.
    • Audris Mockus, Diomidis Spinellis, Zoe Kotti, and Gabriel John Dusing. A complete set of related git repositories identified via community detection approaches based on shared commits. In 17th International Conference on Mining Software Repositories, MSR '20. New York, NY, USA, October 2020. Association for Computing Machinery. To appear.
    • Antonis Aggelakis, Prastudy Fauzi, Georgios Korfiatis, Panos Louridas, Foteinos Mergoupis-Anagnou, Janno Siim, and Michał Zając. A non-interactive shuffle argument with low trust assumptions. In Topics in Cryptology: CT-RSA 2020, San Francisco, CA, February 24–28. Cham, Switzerland, February 2020. Springer. Lecture Notes in Computer Science 12006.

New Members

    • George Theodorou
    • Christos Pappas
    • George Liargkovas
    • Rafaila Galanopoulou
    • Angeliki Papadopoulou
    • Makrina Viola Kosti

Ongoing PhDs

    • Zoe Kotti Topic: Data Analysis Applications in Software Engineering
    • Konstantinos Kravvaritis Topic: Data and Quality Metrics of System Configuration Code
    • Antonios Gkortzis Topic: Secure Systems on Cloud Computing Infrastructures
    • Stefanos Georgiou Topic: Energy Efficiency in Cloud Computing
    • Thodoris Sotiropoulos Topic: Techniques for Improving the Reliability of Event-Driven Programs
    • Konstantina Dritsa Topic: Data Science

Seminars

      Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

      Date: 20 February 2020
      Presenter: Damianos Chatziantoniou
      Abstract

      In this talk we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and utilized in the production of a relational representation, DVMs are based on a bottom-up approach, mapping the data infrastructure of an organization to a graph-based model. With the term “data infrastructure” we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce an output useful in decision making. For example, a Python program that “does something” and computes for each customer her probability to churn is an essential component of the organization’s data landscape and has to be made available to the user, e.g. a data scientist, in an easy-to-understand and intuitive-to-use manner, the same way the age or gender of a customer is. We formally define a DVM, queries over DVMs, and an algebraic framework for query evaluation. We also claim that a conceptual layer, such as DVM, is a prerequisite for end-to-end processing. In addition, we present a prototype tool based on DVMs, called DataMingler, which enables end-to-end processing in analytics environments by data stakeholders. Specifically, DataMingler:

      • allows data engineers to map actual data and processes’ output onto it easily and quickly, regardless of the underlying systems and models
      • makes it easy for data scientists to understand entities and attributes, transform data as they wish (possibly using different languages), and define inputs for ML algorithms or ad hoc reports
      • enables data officers to govern data, see data provenance, and comply with data regulations
      • gives data contributors (e.g. customers, suppliers) a view of their own data, so that they can understand, retrieve, and manage their personal data


      A Spring Lockdown

      Date: 29 May 2020
      Presenter: Konstantinos Kravvaritis
      Abstract

      Spring Boot is a Spring Framework extension that aims to provide more convenience utilities and accelerate the development process. In this presentation, we give a brief overview of Spring Boot, the reasons it was introduced, and its advantages.


      Awesome uses of Raspberry Pi

      Date: 29 May 2020
      Presenter: Stefanos Georgiou
      Abstract

      Raspberry Pi is a small, compact, and cheap computer for easily building reliable systems. In this presentation, we attach a DHT-11 sensor to obtain temperature and humidity measurements. Afterwards, we extend this computer system to act upon events and inform the user if something goes wrong.


      Ultra-large-scale compressed graphs: The case of Software Heritage

      Date: 29 May 2020
      Presenter: Zoe Kotti
      Abstract

      While publicly available original source code doubles every two years, and state-of-the-art "big data" approaches occupy several machines, compressed graphs can dramatically reduce the hardware resources needed to mine such large corpora. In this presentation we explore the compressed graph of Software Heritage, a dataset aimed at collecting and storing all publicly available source code. In general, graphs are suitable data models for conducting version control system analyses, while compressed graphs allow impressive graph traversal performance.


      Enriching Greek Parliament Dataset

      Date: 05 June 2020
      Presenter: Konstantina Dritsa
      Abstract

      Enriching the Greek Parliament Proceedings Dataset (https://zenodo.org/record/2587904) with information on the gender and parliamentary role of the Parliament members.


      Epidose: Contact tracing for all

      Date: 05 June 2020
      Presenter: Diomidis Spinellis
      Abstract

      Epidose is an open source software reference implementation for an epidemic dosimeter. Just as a radiation dosimeter measures dose uptake of external ionizing radiation, the epidemic dosimeter tracks potential exposure to viruses or bacteria associated with an epidemic. The dosimeter measures a person's exposure to an epidemic, such as COVID-19, based on exposure to contacts who have tested positive. The epidemic dosimeter is designed to be widely accessible and to safeguard privacy. Specifically, it is designed to run on the $10 open-hardware Raspberry Pi Zero-W computer, with a minimal user interface, comprising LED indicators regarding operation and exposure risk and a physical interlock switch to allow the release of contact data. The software is based on the DP3T contact tracing "unlinkable" design and corresponding reference implementation code.


      Natural Language Understanding for Software Engineering

      Date: 05 June 2020
      Presenter: Vasiliki Efstathiou
      Abstract

      Details will be provided during the seminar.


      Automatically reproducing and analyzing Debian Packages with sbuild

      Date: 19 June 2020
      Presenter: Stefanos Chaliasos
      Abstract

      A major issue in analyzing C packages is that they are challenging to reproduce. In this talk, we present how we can exploit sbuild for running static analysis tools that need to rebuild a package to analyze it. sbuild is a tool that automates the build process of Debian packages. Moreover, we will see how we can scrape the Ultimate Debian Database to select packages based on some criteria.


      Revising/teaching/coding-fun in quarantine

      Date: 19 June 2020
      Presenter: Antonis Gkortzis
      Abstract

      This short presentation is a compilation of the tasks and activities completed during the COVID-19 quarantine time. The first part of the quarantine was dedicated to finalizing and submitting the revised paper, titled Software Reuse Cuts Both Ways, to JSS. The second part was dedicated to the Real Estate Analytics EU-funded program. Specifically, we finalized a data warehouse schema, consisting of more than 300 tables (facts and dimensions), and began populating it. Throughout both parts, teaching was an ongoing task that required great effort, with participation being the highest of the last four years.


      Use of Neural Networks to Improve Fuzzing

      Date: 19 June 2020
      Presenter: Charalambos Ioannis Mitropoulos
      Abstract

      We will introduce our recent work on fuzzing Android native libraries, and we will look at how fuzzing can be improved with neural networks, presenting three different approaches proposed in three papers that we read during the quarantine.


      A Dataset for GitHub Repository Deduplication

      Date: 26 June 2020
      Presenter: Diomidis Spinellis
      Abstract

      GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
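
      The grouping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: it assumes the graph has already been denoised, computes connected components with union-find, and replaces the six-metric ranking with a single hypothetical `score` per project. The names `group_related`, `ultimate_parents`, and `score` are illustrative only.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's set, halving paths along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_related(edges):
    """Group projects into connected components of an undirected edge list."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = defaultdict(set)
    for node in parent:
        groups[find(parent, node)].add(node)
    return list(groups.values())


def ultimate_parents(edges, score):
    """Map every non-parent project to the top-scoring project of its component.

    `score` is a hypothetical stand-in for a ranking over several metrics;
    ties are broken by project name to keep the result deterministic.
    """
    result = {}
    for component in group_related(edges):
        best = max(component, key=lambda project: (score.get(project, 0), project))
        for project in component:
            if project != best:
                result[project] = best
    return result
```

      Given the edges `("a/x", "b/x")`, `("c/x", "b/x")`, and `("d/y", "e/y")` with `b/x` and `e/y` scoring highest in their components, the function links `a/x` and `c/x` to `b/x`, and `d/y` to `e/y`.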


      A Dataset of Enterprise-Driven Open Source Software

      Date: 26 June 2020
      Presenter: Diomidis Spinellis
      Abstract

      We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.


      Practical Fault Detection in Puppet Programs

      Date: 26 June 2020
      Presenter: Thodoris Sotiropoulos
      Abstract

      Puppet is a popular computer system configuration management tool. By providing abstractions that model system resources it allows administrators to set up computer systems in a reliable, predictable, and documented fashion. Its use suffers from two potential pitfalls. First, if ordering constraints are not correctly specified whenever a Puppet resource depends on another, the non-deterministic application of resources can lead to race conditions and consequent failures. Second, if a service is not tied to its resources (through the notification construct), the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure's availability and functionality.

      We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, a formal model for traces allows us to capture the interactions of Puppet resources with the file system. By analyzing these interactions we identify (1) resources that are related to each other (e.g., operate on the same file), and (2) resources that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace's analysis against the program's dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault.

      We have evaluated our method on a large set of popular Puppet modules, and discovered 92 previously unknown issues in 33 modules. Performance benchmarking shows that our approach can analyze in seconds real-world configurations with a magnitude measured in thousands of lines and millions of system calls.
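
      The check of trace-derived relationships against the dependency graph can be illustrated with a small sketch. It is a simplification under stated assumptions, not the tool's implementation: any two resources that touch the same path are treated as related, notifications are ignored, and the input formats and the name `missing_orderings` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def reachable(graph, start):
    """Return all nodes reachable from start by following declared edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def missing_orderings(declared_edges, file_accesses):
    """Report resource pairs that share a file but have no declared ordering.

    declared_edges: (before, after) pairs taken from the program (assumed format).
    file_accesses:  (resource, path) pairs taken from the trace (assumed format).
    """
    graph = defaultdict(set)
    for before, after in declared_edges:
        graph[before].add(after)

    by_file = defaultdict(set)
    for resource, path in file_accesses:
        by_file[path].add(resource)

    faults = set()
    for path, resources in by_file.items():
        for a, b in combinations(sorted(resources), 2):
            # A pair is suspicious when neither resource is ordered before the other.
            if b not in reachable(graph, a) and a not in reachable(graph, b):
                faults.add((a, b, path))
    return sorted(faults)
```

      In this sketch, a pair is reported only when no ordering path exists in either direction, so transitively ordered resources are not flagged.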


      Stochastic Opinion Dynamics for User Interest Prediction in Online Social Networks

      Date: 03 July 2020
      Presenter: Marios Papachristou
      Abstract

      In this seminar, we are going to talk about how one can infer the interests (e.g. hobbies) of users in online social networks using information from highly influential users of the network. More specifically, we experimentally observe that the majority of the network users (>70%) is dominated by a sublinear fraction of highly-influential nodes (core nodes). This structural property of networks is also known as the "core-periphery" structure, a phenomenon long-studied in economics and sociology.

      Using the influencers' initial opinions as steady-state trend-setters, we develop a generative model through which we explain how the users' interests (opinions) evolve over time, where each peripheral user looks at her k-nearest neighbors. Our model has strong theoretical and experimental guarantees, surpasses node embedding methods and related opinion dynamics methods, and scales to networks with millions of nodes.

      Duration: 30-40 min.

      Joint work with D. Fotakis (NTUA).


      Pawk: A parallel programming implementation of Awk

      Date: 11 September 2020
      Presenter: Georgios Theodorou
      Abstract

      Pawk is an extension to GoAwk, designed with efficiency in mind. It achieves considerably higher performance than both standard Awk and GoAwk. The two reasons behind the significant speed boost offered by Pawk are the use of multi-threaded programming and the choice of Golang as the implementation language. Since Pawk makes use of parallel programming, it is naturally restricted to operations that can be executed in parallel. However, we believe that Pawk can come in handy in a plethora of cases, especially when dealing with multi-GB files.
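
      The restriction to parallelizable operations can be illustrated with a short sketch. It is written in Python rather than Go and does not reflect Pawk's actual design: it only shows the general split-and-combine pattern for an operation such as summing a field, where per-chunk results can be merged in any order. The function names are hypothetical, fields are assumed to be numeric, and CPython threads show the structure only, since the interpreter lock limits CPU-bound speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(lines, field):
    """Sum the given 1-based whitespace-separated field over a chunk of lines."""
    total = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) >= field:
            total += float(parts[field - 1])  # assumes the field is numeric
    return total


def parallel_field_sum(lines, field, workers=4):
    """Split the input into chunks, sum each chunk concurrently, then combine."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: chunk_sum(chunk, field), chunks)
        return sum(partials)
```

      The pattern works here because addition is associative; an operation whose result for one record depends on state built from all previous records could not be split this way.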


Note: Some of the above data refer to grandfathered work conducted by BALab's members at its progenitor laboratory, ISTLab.