| New Publications | Number |
|---|---|
| Monographs and Edited Volumes | 0 |
| PhD Theses | 0 |
| Journal Articles | 8 |
| Book Chapters | 0 |
| Conference Publications | 3 |
| Technical Reports | 0 |
| White Papers | 0 |
| Magazine Articles | 0 |
| Working Papers | 0 |
| Datasets | 0 |
| Total New Publications | 11 |
| Projects | |
| New Projects | 0 |
| Ongoing Projects | 2 |
| Completed Projects | 0 |
| Members | |
| Faculty Members | 5 |
| Senior Researchers | 8 |
| Associate Researchers | 7 |
| Researchers | 43 |
| Total Members | 63 |
| New Members | 11 |
| PhDs | |
| Ongoing PhDs | 2 |
| Completed PhDs | 1 |
| New Seminars | |
| New Seminars | 12 |
Date: 13 January 2025
Presenter: Christopher Barrie, NYU
Abstract
Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad hoc and task-specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
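The reliability-scoring idea behind the talk can be illustrated with a minimal sketch. This is not the authors' PSS (which adapts intra- and inter-coder reliability metrics); it computes simple mean pairwise agreement between repeated annotation runs, e.g. one run per prompt paraphrase, and all data below is hypothetical.

```python
from itertools import combinations

def pairwise_agreement(runs):
    """Mean pairwise agreement between annotation runs.

    runs: list of equal-length label lists, one per repeated
    classification of the same documents (e.g. per prompt paraphrase).
    """
    pairs = list(combinations(runs, 2))
    total = 0.0
    for a, b in pairs:
        # Fraction of documents on which the two runs agree.
        total += sum(x == y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)

# Three hypothetical annotation runs over the same five documents.
runs = [
    ["pos", "neg", "pos", "neg", "pos"],
    ["pos", "neg", "pos", "pos", "pos"],
    ["pos", "neg", "neg", "neg", "pos"],
]
score = pairwise_agreement(runs)  # (0.8 + 0.8 + 0.6) / 3 ≈ 0.733
```

A low score under semantically equivalent prompts is the kind of signal the PSS formalizes; chance-corrected metrics such as Krippendorff's alpha refine this by accounting for agreement expected at random.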
Bio: Christopher Barrie is Assistant Professor of Sociology at NYU. He is also Core Faculty at CSMaP and Research Fellow at the Department of Sociology, University of Oxford.
Date: 07 February 2025
Presenter: Maria Kechagia, UoA
Abstract
Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly, and prevents APR tool adoption in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch. We introduce a novel lightweight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools, to provide more information about those patches, and to facilitate patch assessment. The novelty of xTestCluster lies in using information from the execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed of patches that fail on the same generated test cases. The output from xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, and b) additional information (new test cases and their results) attached to each patch. After analyzing 902 plausible patches from 21 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers who have to review the multitude of patches generated by APR tools, and provides them with new test cases that expose the differences in behavior between generated patches. Moreover, xTestCluster can complement other patch assessment techniques that help detect patch misclassification.
URL: https://link.springer.com/article/10.1007/s10664-024-10503-2
Preprint: https://arxiv.org/pdf/2207.11082
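The core clustering step described in the abstract — grouping patches that fail on the same generated test cases — can be sketched in a few lines. The patch identifiers and test names below are hypothetical; this is an illustration of the idea, not xTestCluster's implementation.

```python
from collections import defaultdict

def cluster_by_failures(patch_failures):
    """Group patches that are indistinguishable under the generated
    tests: patches failing exactly the same tests share a cluster.

    patch_failures: dict mapping patch id -> set of failing test names.
    """
    clusters = defaultdict(list)
    for patch, failing in patch_failures.items():
        clusters[frozenset(failing)].append(patch)
    return list(clusters.values())

# Hypothetical failure sets for five plausible patches.
failures = {
    "p1": {"t_gen_3"},
    "p2": {"t_gen_3"},
    "p3": {"t_gen_1", "t_gen_3"},
    "p4": set(),          # passes every generated test
    "p5": {"t_gen_1", "t_gen_3"},
}
groups = cluster_by_failures(failures)
# Three clusters: {p1, p2}, {p3, p5}, {p4} — a reviewer can sample
# one patch per cluster instead of inspecting all five.
```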
Dr Maria Kechagia is an Assistant Professor in Software Engineering at the National and Kapodistrian University of Athens within the Department of Business Administration. From May 2019 to November 2024, she was a research fellow at University College London, in the UK. Previously, she was a postdoctoral researcher at the Delft University of Technology, in the Netherlands. She obtained a PhD degree from the Athens University of Economics and Business and an MSc degree from Imperial College London. Her research interests include software verification (static and dynamic analysis), automated program repair, software analytics, and software optimisation (energy efficiency and runtime performance). She has been a programme committee member of the research track of top software engineering venues including ICSE, FSE, ASE, ISSTA, MSR, ICSME, ESEM, and SANER, and a reviewer for top software engineering journals including TSE, TOSEM, EMSE, and JSS. She is a member of the editorial board of TSE.
Date: 17 February 2025
Presenter: Dr. Paul Ralph, Dalhousie University
Abstract
Scholarly peer review is “the lynchpin about which the whole business of science is pivoted” (Ziman 1968). Most researchers believe peer review is effective (Ware 2008), but empirical research consistently shows that reviewers cannot reliably distinguish methodologically sound from fundamentally flawed studies (Cole 1981; Peters & Ceci 1982; Lock 1991; Rothwell & Martyn 2000; Price 2014; Ralph 2016). Consequently, we created comprehensive evidence standards and tools to improve peer review in software engineering and related fields. Objective. The objective of this study is to investigate the impact of evidence standards on scholarly peer review. Method. A randomized controlled experiment was conducted at an A-ranked software engineering conference. The program committee was randomly divided into two groups: one using a typical conference review process; the other using a standardized process based on the ACM SIGSOFT Empirical Standards for Software Engineering Research (https://acmsigsoft.github.io/EmpiricalStandards/). Results. Evidence standards significantly improve inter-reviewer reliability without harming authors’ or reviewers’ attitudes toward the review process. Reviewers using evidence standards gave more praise and focused more on research methods than style. Discussion. Asking reviewers to write free-text comments about a paper and score it on a 6-point scale from strong reject to strong accept produces data statistically indistinguishable from random noise. This means that decisions are determined entirely by reviewer selection, not the merits of the research. Conventional review processes are therefore scientifically and morally indefensible. While not a silver bullet, evidence standards significantly improve reliability, and the data collected in this study facilitates further refinement of the standards and tooling toward still greater reliability.
Dr. D. Paul Ralph, PhD (British Columbia), B.Sc. / B.Comm (Memorial), is an award-winning scientist, author, consultant, and Professor of Software Engineering at Dalhousie University. His cutting-edge research at the intersection of software engineering, human-computer interaction, and project management explores the relationship between software teams’ social dynamics and success. It has been used by many leading technology companies including Adobe, Amazon, AT&T, Canon, BEA Systems, Broadcom, IBM, Google, HP, Microsoft, Netflix, PayPal, Samsung, Salesforce, Yahoo!, and Walmart. He has published more than 80 peer-reviewed articles in premier venues including IEEE Transactions on Software Engineering and the ACM/IEEE International Conference on Software Engineering. Dr. Ralph is editor-in-chief of the SIGSOFT Empirical Standards for Software Engineering Research.
Date: 30 April 2025
Presenter: Nikos Alexopoulos
Abstract
Android users face increasingly sophisticated threats, ranging from malware and state-sponsored surveillance, to supply chain attacks and a large attack surface, mostly consisting of proprietary components. Android’s semantic gap, i.e. the disconnect between application behaviors and kernel-level events (system calls), is a major limiting factor towards developing approaches capable of detecting threats in the wild. This talk will present recent research on overcoming this limitation, introducing SysDroid, a simple and lightweight approach to reconstruct Android behaviors from Linux kernel traces.
SysDroid builds on two key insights: (a) I/O events can be captured in the kernel and attributed to applications by following IPC edges, and (b) a mapping between I/O events and interesting high-level behaviors can be established a priori by associating I/O events to high-level Android Framework API calls. The approach is effective in capturing application behaviors and can be used as the basis for further analysis.
Date: 09 July 2025
Presenter: Georgios Liargkovas, Columbia University
Abstract
For decades, OS tuning has relied on static heuristics that cannot adapt to dynamic, complex workloads. While machine learning offered a path forward, traditional models like Bayesian Optimization and Reinforcement Learning introduced their own challenges: a "semantic gap" preventing true contextual understanding, brittle reward engineering, and inefficient exploration unfit for live systems. This talk argues that Large Language Models (LLMs) represent the next leap forward. We present preliminary, promising results from an LLM-powered autonomous agent that leverages reasoning and pre-trained knowledge to overcome these limitations.
We will conclude by discussing future research directions for these emerging autonomous systems.
Georgios Liargkovas is a PhD student at Columbia University advised by Kostis Kaffes. His research focuses on OS scheduling and AI/ML for OS Optimization. He holds a BS in Management Science and Technology from Athens University of Economics and Business, where he conducted empirical software engineering research at BALab advised by Diomidis Spinellis.
Date: 24 September 2025
Presenter: Thodoris Sotiropoulos, ETH
Abstract
I will present the research directions our team pursued during the academic year 2024--2025 in the areas of software reliability, analysis, and security. For software reliability, we developed new methods to validate (i.e., find bugs in) critical software infrastructure, focusing on (1) static analyzers, which are widely used throughout the software development pipeline, and (2) Infrastructure as Code (IaC) programs, which are routinely used to automate the provisioning and management of entire computing infrastructures and servers.
For software analysis and security, we investigated the security challenges of applications that combine high-level languages (e.g., Python, JavaScript) with low-level components (e.g., C, Rust). We introduced techniques to automatically identify and reason about the bridges between these languages. This enables powerful cross-language analyses such as vulnerability detection and reachability analysis in hybrid programs. Finally, we investigated an emerging domain: the effect of compiler optimizations on Zero-Knowledge Virtual Machines (zkVMs). zkVMs are becoming foundational in privacy-preserving and verifiable computation. Therefore, understanding the limitations of existing compiler infrastructures on zkVM performance opens new research directions, including the development of zkVM-specific passes, backends, and superoptimizers.
Date: 06 October 2025
Presenter: Marek Horváth, Technical University of Košice, Slovakia
Abstract
The seminar will introduce a research direction focused on authorship attribution in software engineering, exploring how the combination of source code stylometry and behavioral biometrics can be used to distinguish individual programmers. The talk will summarize the current state of a doctoral project in this domain, including applied methods and early findings. It will also briefly present related research activities conducted at the Technical University of Košice, with a special emphasis on applications in programming education and academic integrity.
Marek Horváth is a PhD student at the Technical University of Košice, Slovakia. His research focuses on authorship identification in software engineering using static code analysis and behavioral biometrics. He also works on educational applications of these methods to support students and instructors in programming courses.
Date: 20 October 2025
Presenter: Konstantinos Karakatsanis
Abstract
Dependency bloat is a persistent challenge in Python projects, which increases maintenance costs and security risks. While numerous tools exist for detecting unused dependencies in Python, removing these dependencies across the source code and configuration files of a project requires manual effort and expertise. To tackle this challenge, we introduce PYTRIM, an end-to-end system to automate this process. PYTRIM eliminates unused imports and package declarations across a variety of file types, including Python source and configuration files such as requirements.txt and setup.py. PYTRIM’s modular design makes it agnostic to the source of dependency bloat information, enabling integration with any detection tool. Beyond its contribution to automation, PYTRIM also incorporates a novel dynamic analysis component that improves dependency detection recall. Our evaluation of PYTRIM’s end-to-end effectiveness on a ground-truth dataset of 37 merged pull requests from prior work shows that PYTRIM achieves 98.3% accuracy in replicating human-made changes. To show its practical impact, we run PYTRIM on 971 open-source packages, identifying and trimming bloated dependencies in 39 of them. For each case, we submit a corresponding pull request, 6 of which have already been accepted and merged. PYTRIM is available as an open-source project, encouraging community contributions and further development.
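The detection side of this problem can be sketched with the standard-library `ast` module. This is a deliberately simplified illustration, not PYTRIM's implementation: it reports top-level imported names never referenced as identifiers, and ignores re-exports, `__all__`, and dynamic uses such as `getattr` or string references.

```python
import ast

def unused_imports(source):
    """Report imported names never referenced in the module body."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a" unless aliased.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)

code = "import os\nimport sys\nfrom json import dumps\nprint(dumps(sys.argv))\n"
print(unused_imports(code))  # ['os']
```

An end-to-end tool like PYTRIM must then go further, rewriting not only the source but also declarations in files such as requirements.txt and setup.py.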
Date: 03 November 2025
Presenter: Georgios Alexopoulos, UoA
Abstract
A great number of software packages combine code in high-level languages, such as Python, with binary extensions compiled from low-level languages such as C, C++ or Rust to either boost efficiency or enable specific functionalities. In this context, high-level function calls can trigger native (binary) code execution. This setup introduces challenges for call graph generation. Accurate call graphs are essential for various applications, including vulnerability management and software maintenance, as they help track execution paths, assess security risks, and identify unused or redundant code.
This work tackles the problem of cross-language call graph construction in Python. Instead of relying on static analysis, which struggles with identifying Python-native interactions, we propose a dynamic analysis technique which does not require inputs to execute code. Our approach is based on two key insights: (1) when a binary extension is imported from Python code, all its objects (e.g., functions) are loaded into memory, and (2) the layout of callable Python objects contains pointers to the native functions they invoke.
By analyzing these memory layouts for every loaded object, we identify corresponding graph edges, which link Python functions to the native functions they eventually invoke. This is an essential element for constructing call graphs across language boundaries. We implement this approach in PyXray, a tool that efficiently analyzes massive Python packages such as NumPy and PyTorch in minutes, while significantly outperforming existing static analysis methods in terms of precision and recall.
PyXray enables two key applications: (1) cross-language vulnerability management, by identifying whether a Python package potentially calls a vulnerable native function and (2) cross-language bloat analysis, by quantifying unnecessary code across Python and native components.
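The second insight — that a callable Python object's in-memory layout contains a pointer to the native function it invokes — can be demonstrated in miniature with `ctypes`. This is not PyXray's implementation; it is a sketch that assumes CPython's default (non-free-threaded) build, where a builtin such as `len` is a `PyCFunctionObject` whose `m_ml` field points to a `PyMethodDef` holding the C function's name and address.

```python
import ctypes

# CPython's PyMethodDef and the leading fields of PyCFunctionObject.
class PyMethodDef(ctypes.Structure):
    _fields_ = [
        ("ml_name", ctypes.c_char_p),
        ("ml_meth", ctypes.c_void_p),
        ("ml_flags", ctypes.c_int),
        ("ml_doc", ctypes.c_char_p),
    ]

class PyCFunctionObject(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),          # PyObject_HEAD
        ("ob_type", ctypes.c_void_p),
        ("m_ml", ctypes.POINTER(PyMethodDef)),
    ]

def native_target(builtin_func):
    """Return (name, address) of the C function behind a builtin.

    In CPython, id() is the object's memory address, so we can
    overlay the struct directly on the live object.
    """
    obj = PyCFunctionObject.from_address(id(builtin_func))
    ml = obj.m_ml.contents
    return ml.ml_name.decode(), ml.ml_meth

name, addr = native_target(len)  # name == "len"; addr is the C entry point
```

Resolving `addr` against the loaded shared objects' symbol tables is what turns such pointers into cross-language call-graph edges.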
Date: 24 November 2025
Presenter: Ioannis Karyotakis and Evangelos Talos
Abstract
Understanding the evolving patterns of developer coding activity within the programming industry can help promote both individual well-being and organizational productivity. We examine the evolution of commit activity among developers over the past decade through an analysis of commit data from 4,549 GitHub repositories. Our findings show a subtle but consistent increase in the proportion of nighttime and weekend commits, particularly during early morning hours, indicating a shift toward more flexible and asynchronous work habits. In contrast, commit patterns across weekdays have remained stable, with no statistically significant differences between individual workdays from 2015 to 2024. These trends suggest a gradual departure from the conventional 9-to-5, Monday-to-Friday structure, with developers increasingly distributing their work across broader time frames. Our findings have practical implications for developers, who can use them to advocate for flexible work policies; for managers, who can better align schedules with real-world behaviors; and for researchers, who can further explore how temporal work patterns influence productivity and well-being.
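The basic measurement behind such a study can be sketched from commit timestamps alone. The boundaries below (early morning as 00:00-05:59, weekend as Saturday/Sunday) and the four timestamps are illustrative assumptions, not the study's definitions or data.

```python
from datetime import datetime

def activity_shares(commit_times):
    """Fraction of commits in the early morning (00:00-05:59)
    and on weekends, given datetimes in the author's local time."""
    times = list(commit_times)
    early = sum(t.hour < 6 for t in times) / len(times)
    weekend = sum(t.weekday() >= 5 for t in times) / len(times)
    return early, weekend

# Four hypothetical commit timestamps.
commits = [
    datetime(2024, 3, 4, 10, 30),   # Monday morning
    datetime(2024, 3, 5, 2, 15),    # Tuesday, early morning
    datetime(2024, 3, 9, 14, 0),    # Saturday afternoon
    datetime(2024, 3, 6, 23, 45),   # Wednesday evening
]
early, weekend = activity_shares(commits)  # 0.25, 0.25
```

Tracking these shares per year, as the study does across a decade of repositories, is what reveals the gradual drift away from the 9-to-5, Monday-to-Friday pattern.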
Date: 25 November 2025
Presenter: Thodoris Sotiropoulos
Abstract
Programming language implementations can be viewed as complex software systems. In this talk, we explore how core object-oriented (OO) principles, especially polymorphism, help us build modular, extensible, and maintainable representations of programs. Using a small example language, we examine traditional approaches such as the "Visitor" design pattern and contrast them with modern Java features: records, sealed types, and pattern matching. We show how these mechanisms allow us to express diverse operations on program representations (evaluation, semantic analysis, etc.) elegantly and declaratively, while avoiding boilerplate and ensuring runtime safety through exhaustiveness checks.
Date: 05 December 2025
Presenter: Zoe Kotti
Abstract
This presentation examines the impact and evolution of software engineering research through four interconnected studies. First, an investigation of data papers in the Mining Software Repositories conference confirms their significant value as research artifacts while identifying opportunities for improved documentation and broader topic coverage. Second, the practical impact of software engineering research is assessed through a patent analysis and author survey, demonstrating that researchers successfully equip practitioners with tools and methods, though adoption is often hindered by funding and cost-benefit challenges. Third, a comprehensive tertiary study analyzes machine learning applications in software engineering, revealing widespread adoption across tasks but significant gaps in empirical validation and industrial transfer. Finally, the work explores Large Language Models in code completion, analyzing code perplexity to understand model confidence across different programming languages. Together, these contributions offer a holistic view of how academic research translates into practice and how emerging technologies are shaping the future of software engineering.