BALab seminars — Java Decompiler using Machine Translation Techniques

Abstract

A decompiler is a computer program that takes as input an executable file and produces a high-level source code file which can be recompiled successfully. Even though a decompiler may not always reconstruct perfectly the original source code, it remains an important tool for reverse engineering of computer software. The process of decompilation is very useful for the recovery of lost source code, for analyzing and understanding software whose code is not available, even for computer security in some cases. In this thesis, in order to address the decompilation problem we transform it to a translation problem which can be solved using machine translation. Two approaches are studied, statistical and neural machine translation, using two open-source tools Moses and OpenNMT, respectively. Maven repositories are retrieved from GitHub in order to form the dataset and an appropriate procedure is used to construct the parallel corpora. In this context experiments in Moses are not successful while the result of translation using neural machine translation, is fairly good. The difference between the decompiler presented in this thesis and existing Java decompilers is the fact that it can translate isolated bytecode snippets. Furthermore, this approach can be extended to produce better results by recovering comments, variables, methods and class names. Finally, this study illustrates that the Java source code which is produced from the decompilation is often accurate and can provide a useful picture of the snippets' behavior.