Predictive Mutation Analysis via Natural Language Channel in Source Code

112 0 0.0 ( 0 )

Download Cite

Added by Jinhan Kim

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jinhan Kim - Juyoung Jeon - Shin Hong

Software Engineering

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Mutation analysis can provide valuable insights into both System Under Test (SUT) and its test suite. However, it is not scalable due to the cost of building and testing a large number of mutants. Predictive Mutation Testing (PMT) has been proposed to reduce the cost of mutation testing, but it can only provide statistical inference about whether a mutant will be killed or not by the entire test suite. We propose Seshat, a Predictive Mutation Analysis (PMA) technique that can accurately predict the entire kill matrix, not just the mutation score of the given test suite. Seshat exploits the natural language channel in code, and learns the relationship between the syntactic and semantic concepts of each test case and the mutants it can kill, from a given kill matrix. The learnt model can later be used to predict the kill matrices for subseque

rate research

In-IDE Code Generation from Natural Language: Promise and Challenges

129 - Frank F. Xu , Bogdan Vasilescu , Graham Neubig 2021

A great part of software development involves conceptualizing or communicating the underlying procedures and logic that needs to be expressed in programs. One major difficulty of programming is turning concept into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have primarily been evaluated purely based on retrieval accuracy or overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow is surprisingly unattested. We perform the first comprehensive investigation of the promise and challenges of using such technology inside the IDE, asking at the current state of technology does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges? We first develop a plugin for the IDE that implements a hybrid of code generation and code retrieval functionality, and orchestrate virtual environments to enable collection of many user events. We ask developers with various backgrounds to complete 14 Python programming tasks ranging from basic file manipulation to machine learning or data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regards to increased productivity, code quality, or program correctness are inconclusive. Analysis identifies several pain points that could improve the effectiveness of future machine learning based code generation/retrieval developer assistants, and demonstrates when developers prefer code generation over code retrieval and vice versa. We release all data and software to pave the road for future empirical studies and development of better models.

Software Engineering

Associating Natural Language Comment and Source Code Entities

442 - Sheena Panthaplackel , Milos Gligoric , Raymond J. Mooney 2019

Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision.

Computation and Language Machine Learning Software Engineering

Analysis of Source Code Using UPPAAL

269 - Mitja Kulczynski 2021

In recent years there has been a considerable effort in optimising formal methods for application to code. This has been driven by tools such as CPAChecker, DIVINE, and CBMC. At the same time tools such as Uppaal have been massively expanding the realm of more traditional model checking technologies to include strategy synthesis algorithms - an aspect becoming more and more needed as software becomes increasingly parallel. Instead of reimplementing the advances made by Uppaal in this area, we suggest in this paper to develop a bridge between the source code and the engine of Uppaal. Our approach uses the widespread intermediate language LLVM and makes recent advances of the Uppaal ecosystem readily available to analysis of source code.

Software Engineering

A Topic Guided Pointer-Generator Model for Generating Natural Language Code Summaries

111 - Xin Wang , Xin Peng , Jun Sun 2021

Code summarization is the task of generating natural language description of source code, which is important for program understanding and maintenance. Existing approaches treat the task as a machine translation problem (e.g., from Java to English) and applied Neural Machine Translation models to solve the problem. These approaches only consider a given code unit (e.g., a method) without its broader context. The lacking of context may hinder the NMT model from gathering sufficient information for code summarization. Furthermore, existing approaches use a fixed vocabulary and do not fully consider the words in code, while many words in the code summary may come from the code. In this work, we present a neural network model named ToPNN for code summarization, which uses the topics in a broader context (e.g., class) to guide the neural networks that combine the generation of new words and the copy of existing words in code. Based on the model we present an approach for generating natural language code summaries at the method level (i.e., method comments). We evaluate our approach using a dataset with 4,203,565 commented Java methods. The results show significant improvement over state-of-the-art approaches and confirm the positive effect of class topics and the copy mechanism.

Software Engineering

Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

62 - Egor Bogomolov (JetBrains Research , Higher School of Economics , n Saint Petersburg 2020

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.

Software Engineering