Date: 10 January 2018
Presenter: Alexios Zavras
Software nowadays is an amalgamation of numerous components; typical numbers show that every software product is mostly comprised of re-usable code like libraries, with only up to 20% of the code being specific to the product. Keeping track of all these components and their metadata, such as origin and licenses, is a significant problem that has to be solved by each and every software producer. The talk will discuss the need to keep accurate information on the components and present the attempts to solve the issue that are currently being designed and tried out in the industry. Questions and discussion are welcome and greatly appreciated!
Alexios Zavras (zvr) is the Senior Open Source Compliance Engineer of Intel Corp. He has been involved with Free and Open Source Software since 1983, and is an evangelist for all things Open. He has a PhD in Computer Science after having studied Electrical Engineering and Computer Science in Greece and the United States.
Date: 31 January 2018
Presenter: Theodore Stassinopoulos
There are many tools that perform static analysis on source code and extract information about the existence of code smells. I would like to present our project, which is in progress and focuses on Java source code. The main topics will include a summary of the metrics and code smells that our project is able to extract, the main logical steps performed during execution, and the goals this project tries to achieve.
Date: 21 February 2018
Presenter: Damianos Chatziantoniou
Most analytics projects focus on the management of the 3Vs of big data and use specific stacks to support this variety. However, they constrain themselves to "local" data, data that exists within or "close" to the organization. And yet, as has recently been pointed out, "the value of data explodes when it can be linked with other data." In this paper we present our vision for a global marketplace of analytics (either in the form of per-entity metrics or per-entity data, provided by globally accessible data management tasks) where a data scientist can pick and combine data at will in her data mining algorithms, possibly combining with her own data. The main idea is to use the dataframe, a popular data structure in R and Python. Currently, the columns of a dataframe contain computations or data found within the data infrastructure of the organization. We propose to extend the concept of a column. A column is now a collection of key-value pairs, produced anywhere by a remotely accessed program (e.g., an SQL query, a MapReduce job, or even a continuous query). The key is used for the outer join with the existing dataframe; the value is the content of the column. This whole process should be orchestrated by a set of well-defined, standardized APIs. We argue that the proposed architecture presents numerous challenges and could be beneficial for big data interoperability. In addition, it can be used to build mediation systems involving local or global columns. Columns correspond to attributes of entities, where the primary key of the entity is the key of the involved columns.
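The extended notion of a column can be illustrated with a minimal Python sketch. It is not the architecture proposed in the talk: the frame is modelled as a plain dictionary from entity key to row, the join is assumed to be a left outer join, and `fetch_avg_rating` is a hypothetical stand-in for a remotely executed program that returns key-value pairs.

```python
from typing import Dict, Hashable, Optional

Row = Dict[str, object]
Frame = Dict[Hashable, Row]


def add_column(frame: Frame, name: str, column: Dict[Hashable, object]) -> Frame:
    """Attach a key-value column to a frame with a left outer join.

    Every existing row is kept; entities missing from the column get None,
    and keys that appear only in the column are ignored.
    """
    return {key: {**row, name: column.get(key)} for key, row in frame.items()}


def fetch_avg_rating() -> Dict[Hashable, Optional[float]]:
    """Hypothetical remote program (e.g. an SQL query run elsewhere)."""
    return {"c1": 4.5, "c3": 3.0, "c9": 5.0}


# Local frame: the primary key of the entity indexes the rows.
customers: Frame = {"c1": {"age": 34}, "c2": {"age": 51}, "c3": {"age": 27}}
enriched = add_column(customers, "avg_rating", fetch_avg_rating())
```

Here `enriched["c2"]["avg_rating"]` is `None` because the remote column has no value for that entity, while the remote-only key `c9` does not create a new row.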
Date: 14 March 2018
Presenter: Stefanos Georgiou
Continuous Integration is a cutting-edge approach to building, testing, and integrating software practitioners' work frequently. Using such a service makes error detection quicker and reduces integration problems significantly. In this tutorial, we present Travis CI, a distributed hosting, building, and testing tool for software projects on GitHub. In addition, we show how a developer can connect a GitHub Pages project with Travis to automate GitHub's push procedure.
Date: 29 March 2018
Presenter: Diomidis Spinellis
The documented Unix facilities data set provides the details regarding the evolution of 15596 unique facilities through 93 versions of Unix over a period of 48 years. It is based on the manual transcription of early scanned documents, on the curation of text obtained through optical character recognition, and on the automatic extraction of data from code available on the Unix History Repository. The data are categorized into user commands, system calls, C library functions, devices and special files, file formats and conventions, games et al., miscellanea, system maintenance procedures and commands, and system kernel interfaces. A timeline view allows the visualization of the evolution across releases. The data can be used for empirical research regarding API evolution, system design, as well as technology adoption and trends.
Date: 18 April 2018
Presenter: Stefanos Georgiou
Date: 24 April 2018
Presenter: Tushar Sharma
Context: Databases are an integral element of enterprise applications. Like code, database schemas are prone to smells, that is, violations of best practices. Objective: We aim to explore database schema quality, associated characteristics and their relationships with other software artifacts. Method: We present a catalog of 13 database schema smells and elicit developers' perspective through a survey. We extract embedded SQL statements and identify database schema smells by employing the DbDeo tool which we developed. We analyze 2925 production-quality systems (357 industrial and 2568 well-engineered open-source projects) and empirically study quality characteristics of their database schemas. In total, we analyze 629 million lines of code containing more than 393 thousand SQL statements. Results: We find that the index abuse smell occurs most frequently in database code, that the use of an ORM framework does not make the application immune to database smells, and that some database smells, such as adjacency list, are more prone to occur in industrial projects compared to open-source projects. Our co-occurrence analysis shows that whenever the clone table smell in industrial projects and the values in attribute definition smell in open-source projects get spotted, it is very likely to find other database smells in the project. Conclusion: The awareness and knowledge of database smells are crucial for developing high-quality software systems and can be enhanced by the adoption of better tools helping developers to identify database smells early.
Date: 24 April 2018
Presenter: Vasiliki Efstathiou
The software development process produces vast amounts of textual data expressed in natural language. Outcomes from the natural language processing community have been adapted in software engineering research for leveraging this rich textual information; these include methods and readily available tools, often furnished with pre-trained models. State-of-the-art pre-trained models, however, capture general, common-sense knowledge, with limited value when it comes to handling data specific to a specialized domain. There is currently a lack of domain-specific pre-trained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model that imply transferability of these results to diverse, targeted information retrieval tasks in software engineering and motivate further reuse of the model.
Date: 07 May 2018
Presenter: Rizos Sakellariou
Traditionally, the objective of parallel computing has been to minimize execution time. As the complexity and the costs associated with modern execution platforms and infrastructures grow, parallel execution time cannot be viewed as a single objective to achieve at any cost. Instead, with such execution platforms consuming large amounts of energy, one needs to assess improvements in execution time against other types of cost. Cloud computing platforms, which are often used to execute parallel applications, typically follow a resource-on-demand paradigm, where users can pay for what resources they need. However, the underlying infrastructures suffer from increasing complexity which is partly masked by having users pay, sometimes for more than they need.
In this respect, the talk will motivate the need to address efficiently the issues related to the concurrent use of multiple (and often heterogeneous) resources offered by Cloud providers by capturing these issues as some form of a multi-objective optimization problem, which requires a good understanding and appreciation of different trade-offs. The talk will make this argument by presenting experience and research on planning the (parallel) execution of scientific workflow applications on the Cloud in a way that tries to strike a balance between different trade-offs such as execution time, energy consumption and cost. Algorithms and techniques, experimental results and ongoing research will be presented.
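One common way to reason about such trade-offs is to keep only the execution plans that are not dominated in every objective (the Pareto front). The following Python sketch illustrates that idea only; it is not the algorithms of the talk, and the plan names and numbers are hypothetical values chosen for illustration.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    name: str
    time_s: float      # execution time
    energy_kj: float   # energy consumption
    cost_usd: float    # monetary cost


def dominates(a: Plan, b: Plan) -> bool:
    """True if `a` is no worse than `b` in all objectives and better in one."""
    pairs = list(zip(a[1:], b[1:]))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_front(plans: List[Plan]) -> List[Plan]:
    """Keep the plans that no other plan dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Hypothetical candidate plans for one workflow (illustrative numbers only).
plans = [
    Plan("8 fast VMs", 100, 900, 4.0),
    Plan("4 fast VMs", 180, 700, 2.5),
    Plan("4 slow VMs", 300, 500, 1.5),
    Plan("8 slow VMs", 200, 950, 3.0),
]
front = pareto_front(plans)
```

In this toy data set the "8 slow VMs" plan is dropped because "4 fast VMs" is better on all three objectives; the remaining three plans each represent a different balance between time, energy and cost.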
Rizos Sakellariou obtained his PhD from the University of Manchester in 1997 for a thesis on compile-time parallel loop partitioning and scheduling. Since then, he has held posts with Rice University and the University of Cyprus and for the last 18 years with the University of Manchester, where he is leading a laboratory carrying out research in High-Performance, Parallel and Distributed systems, which over the last ten years has hosted more than 30 doctoral students, researchers and long-term visitors. He has carried out research on a number of topics related to parallel and distributed computing (including Grid and Cloud computing), with an emphasis on problems stemming from efficient/effective resource utilization and workload allocation issues. Further information about his research can be found on Google Scholar.
Date: 09 May 2018
Presenter: Alexandros Lattas
Background: As evolving desktop applications continuously accrue new features and grow more complex with denser user interfaces and deeply-nested commands, it becomes inefficient to use simple heuristic processes for grouping GUI commands in multi-level menus. Existing search-based software engineering studies on user performance prediction and command grouping optimization lack evidence-based answers on choosing a systematic grouping method. Research Questions: We investigate the scope of command grouping optimization methods to reduce a user’s average task completion time and improve their relative performance, as well as the benefit of using detailed interaction logs compared to sampling. Method: We introduce seven grouping methods and compare their performance based on extensive telemetry data, collected from program runs of a CAD application. Results: We find that methods using global frequencies, user-specific frequencies, deterministic and stochastic optimization, and clustering perform the best. Conclusions: We reduce the average user task completion time by more than 17%, by running a Knapsack Problem algorithm on clustered users, training only on a small sample of the available data. We show that with most methods using just a 1% sample of the data is enough to obtain nearly the same results as those obtained from all the data. Additionally, we map the methods to specific problems and applications where they would perform better. Overall, we provide a guide on how practitioners can use search-based software engineering techniques when grouping commands in menus and interfaces, to maximize users’ task execution efficiency.
Date: 23 May 2018
Presenter: Vasiliki Efstathiou
Recent research provides evidence that effective communication in collaborative software development has significant impact on the software development lifecycle. Although related qualitative and quantitative studies point out textual characteristics of well-formed messages, the underlying semantics of the intertwined linguistic structures still remain largely misinterpreted or ignored. In particular, regarding the quality of code reviews, the importance of thorough feedback and explicit rationale is often mentioned but rarely linked with related linguistic features. As a first step towards addressing this shortcoming, we propose grounding these studies on theories of linguistics. We particularly focus on linguistic structures of coherent speech and explain how they can be exploited in practice. We reflect on related approaches and examine, through a preliminary study on four open source projects, possible links between existing findings and the directions we suggest for detecting textual features of useful code reviews.
Date: 23 May 2018
Presenter: Antonis Gkortzis
Examining the different characteristics of open-source software in relation to security vulnerabilities can provide the research community with findings that can lead to the development of more secure systems. We present a dataset where the reported vulnerabilities of 8694 open-source project versions can be correlated with the corresponding source code and a number of software metrics. The metrics were obtained by analyzing the projects' source code via well-established tools. Apart from commonly used metrics (e.g., lines of code), we also provide data related to modern development trends such as continuous integration and testing. We outline motivational examples based on the dataset we describe.
Date: 21 June 2018
Presenter: Panos Louridas
Data anonymisation is not easy: the Internet, after all, was not created for anonymous communications. One way to anonymise digital data is through re-encryption shuffles, i.e., shuffles of re-encrypted data. Re-encryption shuffles are not new, yet efficient open source solutions are hard to come by. This presentation will give the background of challenges faced in anonymisation and report progress in the implementation of a new re-encryption shuffle that uses modern cryptographic techniques.
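The basic idea of a re-encryption shuffle can be sketched in a few lines of Python. The sketch assumes textbook ElGamal over a tiny prime-order subgroup, which the abstract does not name and which may differ from the scheme used in the presented implementation; the parameters are toy values, far too small for real use, and the zero-knowledge proof that a verifiable shuffle would publish is omitted.

```python
import secrets

# Toy group parameters (assumption, insecure): P = 2*Q + 1 is a safe prime
# and G = 4 generates the subgroup of prime order Q.
P, Q, G = 467, 233, 4


def keygen():
    """Return (private key x, public key h = G^x mod P)."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def encrypt(h, m):
    """ElGamal encryption of an integer m in [1, P-1]."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), (m * pow(h, r, P)) % P


def reencrypt(h, ct):
    """Re-randomise a ciphertext without knowing the plaintext or the key."""
    a, b = ct
    s = secrets.randbelow(Q - 1) + 1
    return (a * pow(G, s, P)) % P, (b * pow(h, s, P)) % P


def decrypt(x, ct):
    """Recover m = b * a^(-x) mod P (inverse computed via Fermat's theorem)."""
    a, b = ct
    return (b * pow(a, P - 1 - x, P)) % P


def shuffle(h, cts):
    """Re-encrypt every ciphertext, then permute the list at random."""
    out = [reencrypt(h, ct) for ct in cts]
    secrets.SystemRandom().shuffle(out)
    return out


x, h = keygen()
messages = [11, 22, 33, 44]
board = [encrypt(h, m) for m in messages]
mixed = shuffle(h, board)
recovered = sorted(decrypt(x, ct) for ct in mixed)
```

After the shuffle no output ciphertext equals an input ciphertext, so an observer cannot link positions by comparing bytes, yet decryption still yields the same multiset of messages.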
Date: 10 July 2018
Presenter: Christos Chatzilenas
A decompiler is a computer program that takes as input an executable file and produces a high-level source code file which can be recompiled successfully. Even though a decompiler may not always perfectly reconstruct the original source code, it remains an important tool for reverse engineering of computer software. The process of decompilation is very useful for the recovery of lost source code, for analyzing and understanding software whose code is not available, and even for computer security in some cases. In this thesis, in order to address the decompilation problem, we transform it into a translation problem which can be solved using machine translation. Two approaches are studied, statistical and neural machine translation, using two open-source tools, Moses and OpenNMT, respectively. Maven repositories are retrieved from GitHub in order to form the dataset and an appropriate procedure is used to construct the parallel corpora. In this context, experiments in Moses are not successful, while the result of translation using neural machine translation is fairly good. The difference between the decompiler presented in this thesis and existing Java decompilers is the fact that it can translate isolated bytecode snippets. Furthermore, this approach can be extended to produce better results by recovering comments, variables, methods and class names. Finally, this study illustrates that the Java source code which is produced from the decompilation is often accurate and can provide a useful picture of the snippets' behavior.
Date: 10 July 2018
Presenter: Konstantina Dritsa
“It's not what you say, but how you say it”. How often have you heard that phrase? Have you ever wished that you could take an objective and comprehensive look into what is said and how it is said in politics? Within this project, we examined the records of the Hellenic Parliament sittings from 1989 up to 2017 in order to evaluate the speech quality and examine the palette of sentiments that characterize the communication among its members. The readability of the speeches is evaluated with the use of the “Simple Measure of Gobbledygook” (SMOG) formula, partially adjusted to the Greek language. The sentiment mining is achieved with the use of two Greek sentiment lexicons. Our findings indicate a significant drop in the average readability score of the parliament records from 2003 up to 2017. On the other hand, the sentiment analysis presents steady scores throughout the years. The communication among parliament members is characterized mainly by the feeling of surprise, followed closely by anger and disgust. At the same time, our results show a steady prevalence of positive words over negative ones. The results are presented in graphs, mainly in comparison between political parties as well as between time intervals.
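For reference, the standard SMOG grade can be computed as in the Python sketch below. It uses the published English-language constants and takes pre-computed counts as input; the Greek adjustment mentioned above is not specified in the abstract and is therefore not reproduced, and the function name is illustrative.

```python
import math


def smog_grade(polysyllables: int, sentences: int) -> float:
    """Standard SMOG grade for a text sample.

    polysyllables: number of words with three or more syllables
    sentences:     number of sentences in the sample
    The count is normalised to a 30-sentence sample before the square root.
    """
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    if polysyllables < 0:
        raise ValueError("polysyllables must be non-negative")
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Example: the same sentence count with more polysyllabic words
# yields a higher (harder to read) grade.
plain = smog_grade(polysyllables=15, sentences=30)
dense = smog_grade(polysyllables=60, sentences=30)
```

A higher grade means the text requires more years of education to understand, so a drop in readability corresponds to a rising SMOG grade under this standard definition.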
Date: 19 September 2018
Presenter: Antonis Spyropoulos
The Unix operating system is one of the most widespread operating systems. It has many distributions for a plethora of devices. For many years, the command line was the only way for users to interact with it. Shell commands are powerful, but executing them with options and arguments is difficult, because users cannot remember all of them.
The goal of this work is to implement a graphical user interface which will guide the user in creating valid commands or shell scripts. The interface presents to the user the available options, arguments and their meaning. This information is extracted from each command's source code and documentation.
The implementation can be split into two parts. The first one is the extraction of the required data for each command. The second one is the creation of a graphical user interface. The extraction tool is reliable for specific commands. However, there are some commands with special characteristics that cannot be extracted reliably. The graphical user interface works perfectly if it is fed with correct data.
Date: 23 October 2018
Presenter: Yves Le Traon and Mike Papadakis
Mutation testing realises the idea of using artificial defects to support testing activities. Mutation is typically used as a way to evaluate the adequacy of test suites, to guide the generation of test cases and to support experimentation. Mutation has reached a maturity phase and gradually gains popularity both in academia and in industry. This talk will survey the advances related to the fundamental problems of mutation testing, will set out the challenges and open problems for the future development of the method and will present ongoing industrial projects using mutation testing.
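The core idea of using artificial defects to judge a test suite can be shown with a small Python sketch. The mutants here are written by hand for illustration (real mutation tools generate them automatically), and all function names are hypothetical.

```python
def max_of(a, b):
    """Original program under test."""
    return a if a >= b else b


# Hand-written mutants, each with one small artificial defect.
def mutant_ge_to_le(a, b):
    return a if a <= b else b      # '>=' replaced by '<='


def mutant_ge_to_gt(a, b):
    return a if a > b else b       # '>=' replaced by '>' (equivalent mutant)


def mutant_always_a(a, b):
    return a                       # else-branch removed


MUTANTS = [mutant_ge_to_le, mutant_ge_to_gt, mutant_always_a]


def passes(func, suite):
    """True if func returns the expected value for every test case."""
    return all(func(*args) == expected for args, expected in suite)


def mutation_score(mutants, suite):
    """Fraction of mutants killed (failing at least one test), plus names."""
    killed = [m.__name__ for m in mutants if not passes(m, suite)]
    return len(killed) / len(mutants), killed


weak_suite = [((2, 1), 2)]
strong_suite = [((2, 1), 2), ((1, 2), 2)]

weak_score, weak_killed = mutation_score(MUTANTS, weak_suite)
strong_score, strong_killed = mutation_score(MUTANTS, strong_suite)
```

The weak suite kills one of three mutants and the stronger suite kills two. The remaining mutant behaves exactly like the original for all inputs, so no test can kill it; such equivalent mutants are one of the well-known difficulties of the method.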
Yves Le Traon is professor of Computer Science at the University of Luxembourg, in the domain of software engineering, with a focus on software testing and software security, and on applications in the domains of mobile computing and sensor-based systems. He is head of the SerVal group (SEcurity, Reasoning and VALidation) of the Interdisciplinary Centre for Security, Reliability and Trust (SnT). His research interests include (1) innovative testing and debugging techniques, (2) mobile Android security using static code analysis and machine learning techniques, and (3) model-driven analytics with applications in the domains of IoT, smart grid, Fintech, Industry 4.0, and data-intensive systems in general.
Mike Papadakis is a research scientist at Luxembourg University's Interdisciplinary Centre for Security, Reliability and Trust.
Date: 12 November 2018
Presenter: Thodoris Sotiropoulos
Date: 05 December 2018
Presenter: Davide Spadini
Automated testing has become an essential process for improving the quality of software systems. In fact, testing can help to point out defects and to ensure that production code is robust under many usage conditions. However, writing and maintaining high-quality test code is challenging and frequently considered of secondary importance. Managers, as well as developers, do not treat test code as equally important as production code, and this behaviour could lead to poor test code quality, and in the future to defect-prone production code. The goal of my research is to make developers aware of the effect of poor testing, as well as to help them write better test code. To this aim, I am working on two different perspectives: (1) studying best practices on software testing, identifying problems and challenges of current approaches, and (2) building new tools that better support the writing of test code and tackle the issues we discovered in previous studies.
Date: 17 December 2018
Presenter: Marios Fragkoulis
Data stream processing offers low latency processing of bounded and unbounded data sets with strict semantics over the time dimension of data and consistent fault-tolerant operation. The applicability of data stream processing and the maturity of current stream processing systems (SPS) have produced numerous large-scale deployments around the world for commercial and other important use cases, such as fraud detection, risk analysis, and disaster prediction.
When processing unbounded data for commercial or critical purposes, the capability to recover from failures quickly, if not transparently, is important. In this work in progress we study three different recovery configurations:
- restart recovery, which restarts a stream processing job from the latest checkpointed state in case of an operator failure
- standby task recovery, which substitutes a failed operator with a standby instance
- process pair, which runs two coordinated instances of a stream processing job in order to switch from one to the other in case of failure
Through empirical experiments we plan to map the tradeoffs between availability and resource utilization across the recovery configurations.
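The first configuration, restart recovery, can be sketched as a toy single-operator job in Python. This illustrates only the general idea and is not the system under study: the class and function names are hypothetical, the failure is injected by hand, and the input is assumed to be replayable from a stored offset.

```python
class CountingJob:
    """Toy streaming operator that counts words and checkpoints periodically."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.state = {}              # word -> count
        self.offset = 0              # index of the next event to process
        self.checkpoint = ({}, 0)    # (state snapshot, offset) of last checkpoint

    def process(self, events, fail_at=None):
        while self.offset < len(events):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("injected operator failure")
            word = events[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self.checkpoint = (dict(self.state), self.offset)

    def restart(self):
        """Discard in-flight state and resume from the latest checkpoint."""
        state, offset = self.checkpoint
        self.state, self.offset = dict(state), offset


def run_with_restart(events, checkpoint_every, fail_at):
    """Run the job, fail once at `fail_at`, recover, and finish the stream."""
    job = CountingJob(checkpoint_every)
    replayed = 0
    try:
        job.process(events, fail_at=fail_at)
    except RuntimeError:
        replayed = job.offset - job.checkpoint[1]  # events to process again
        job.restart()
        job.process(events)
    return job.state, replayed


events = ["a", "b", "a", "c", "b", "a", "a"]
final_state, replayed = run_with_restart(events, checkpoint_every=3, fail_at=5)
```

In this run the failure happens two events after the last checkpoint, so two events are reprocessed and the final counts match a failure-free run. More frequent checkpoints reduce the replayed work at the price of more snapshot overhead, which is one simple instance of the availability versus resource utilization tradeoff.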