No Arabic abstract
In this paper, our aim is to propose a model for code abstraction, based on abstract interpretation, allowing us to improve the precision of a recently proposed static analysis by abstract interpretation of dynamic languages. The problem we tackle here is that the analysis may add some spurious code to the string-to-execute abstract value and this code may need some abstract representations in order to make it analyzable. This is precisely what we propose here, where we drive the code abstraction by the analysis we have to perform.
Recently, the automated translation of source code from one programming language to another by using automatic approaches inspired by Neural Machine Translation (NMT) methods for natural languages has come under study. However, such approaches suffer from the same problem as previous NMT approaches on natural languages, viz. the lack of an ability to estimate and evaluate the quality of the translations; and consequently ascribe some measure of interpretability to the models choices. In this paper, we attempt to estimate the quality of source code translations built on top of the TransCoder model. We consider the code translation task as an analog of machine translation (MT) for natural languages, with some added caveats. We present our main motivation from a user study built around code translation; and present a technique that correlates the confidences generated by that model to lint errors in the translated code. We conclude with some observations on these correlations, and some ideas for future work.
Program representation learning is a fundamental task in software engineering applications. With the availability of big code and the development of deep learning techniques, various program representation learning models have been proposed to understand the semantic properties of programs and applied on different software engineering tasks. However, no previous study has comprehensively assessed the generalizability of these deep models on different tasks, so that the pros and cons of the models are unclear. In this experience paper, we try to bridge this gap by systemically evaluating the performance of eight program representation learning models on three common tasks, where six models are based on abstract syntax trees and two models are based on plain text of source code. We kindly explain the criteria for selecting the models and tasks, as well as the method for enabling end-to-end learning in each task. The results of performance evaluation show that they perform diversely in each task and the performance of the AST-based models is generally unstable over different tasks. In order to further explain the results, we apply a prediction attribution technique to find what elements are captured by the models and responsible for the predictions in each task. Based on the findings, we discuss some general principles for better capturing the information in the source code, and hope to inspire researchers to improve program representation learning methods for software engineering tasks.
Context: Decentralized applications on blockchain platforms are realized through smart contracts. However, participants who lack programming knowledge often have difficulties reading the smart contract source codes, which leads to potential security risks and barriers to participation. Objective: Our objective is to translate the smart contract source codes into natural language descriptions to help people better understand, operate, and learn smart contracts. Method: This paper proposes an automated translation tool for Solidity smart contracts, termed SolcTrans, based on an abstract syntax tree and formal grammar. We have investigated 3,000 smart contracts and determined the part of speeches of corresponding blockchain terms. Among them, we further filtered out contract snippets without detailed comments and left 811 snippets to evaluate the translation quality of SolcTrans. Results: Experimental results show that even with a small corpus, SolcTrans can achieve similar performance to the state-of-the-art code comments generation models for other programming languages. In addition, SolcTrans has consistent performance when dealing with code snippets with different lengths and gas consumption. Conclusion: SolcTrans can correctly interpret Solidity codes and automatically convert them into comprehensible English text. We will release our tool and dataset for supporting reproduction and further studies in related fields.
In recent years there has been a considerable effort in optimising formal methods for application to code. This has been driven by tools such as CPAChecker, DIVINE, and CBMC. At the same time tools such as Uppaal have been massively expanding the realm of more traditional model checking technologies to include strategy synthesis algorithms - an aspect becoming more and more needed as software becomes increasingly parallel. Instead of reimplementing the advances made by Uppaal in this area, we suggest in this paper to develop a bridge between the source code and the engine of Uppaal. Our approach uses the widespread intermediate language LLVM and makes recent advances of the Uppaal ecosystem readily available to analysis of source code.
Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.