NaturalCC: A Toolkit to Naturalize the Source Code Corpus

271 0 0.0 ( 0 )

Download Cite

Added by Yao Wan

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Yao Wan - Yang He - Jian-Guo Zhang

Software Engineering

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and facilitate the research on big code analysis. Using NaturalCC, researchers both from natural language or programming language communities can quickly and easily reproduce the state-of-the-art baselines and implement their approach. NaturalCC is built upon Fairseq and PyTorch, providing (1) an efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each models performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at https://www.youtube.com/watch?v=q4W5VSI-u3E&t=25s.

rate research

Learning to map source code to software vulnerability using code-as-a-graph

133 - Sahil Suneja , Yunhui Zheng , Yufan Zhuang 2020

We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective. Specifically, whether signatures of vulnerabilities in source code can be learned from its graph representation, in terms of relationships between nodes and edges. We create a pipeline we call AI4VA, which first encodes a sample source code into a Code Property Graph. The extracted graph is then vectorized in a manner which preserves its semantic information. A Gated Graph Neural Network is then trained using several such graphs to automatically extract templates differentiating the graph of a vulnerable sample from a healthy one. Our model outperforms static analyzers, classic machine learning, as well as CNN and RNN-based deep learning models on two of the three datasets we experiment with. We thus show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches. (Submitted Oct 2019, Paper #28, ICST)

Software Engineering Cryptography and Security Machine Learning

DR-Tools: a suite of lightweight open-source tools to measure and visualize Java source code

87 - Guilherme Lacerda 2020

In Software Engineering, some of the most critical activities are maintenance and evolution. However, to perform both with quality, minimizing impacts and risks, developers need to analyze and identify where the main problems come from previously. In this paper, we introduce DR-Tools Suite, a set of lightweight open-source tools that analyze and calculate source code metrics, allowing developers to visualize the results in different formats and graphs. Also, we define a set of heuristics to help the code analysis. We conducted two case studies (one academic and one industrial) to collect feedback on the tools suite, on how we will evolve the tools, as well as insights to develop new tools that support developers in their daily work.

Software Engineering

The standard coder: a machine learning approach to measuring the effort required to produce source code change

53 - Ian Wright , Albert Ziegler 2019

We apply machine learning to version control data to measure the quantity of effort required to produce source code changes. We construct a model of a `standard coder trained from examples of code changes produced by actual software developers together with the labor time they supplied. The effort of a code change is then defined as the labor hours supplied by the standard coder to produce that change. We therefore reduce heterogeneous, structured code changes to a scalar measure of effort derived from large quantities of empirical data on the coding behavior of software developers. The standard coder replaces traditional metrics, such as lines-of-code or function point analysis, and yields new insights into what code changes require more or less effort.

Software Engineering

Learning How to Mutate Source Code from Bug-Fixes

146 - Michele Tufano , Cody Watson , Gabriele Bavota 2018

Mutation testing has been widely accepted as an approach to guide test case generation or to assess the effectiveness of test suites. Empirical studies have shown that mutants are representative of real faults; yet they also indicated a clear need for better, possibly customized, mutation operators and strategies. While methods to devise domain-specific or general-purpose mutation operators from real faults exist, they are effort- and error-prone, and do not help the tester to decide whether and how to mutate a given source code element. We propose a novel approach to automatically learn mutants from faults in real programs. First, our approach processes bug fixing changes using fine-grained differencing, code abstraction, and change clustering. Then, it learns mutation models using a deep learning strategy. We have trained and evaluated our technique on a set of ~787k bug fixes mined from GitHub. Our empirical evaluation showed that our models are able to predict mutants that resemble the actual fixed bugs in between 9% and 45% of the cases, and over 98% of the automatically generated mutants are lexically and syntactically correct.

Software Engineering

A Large-Scale Study on Source Code Reviewer Recommendation

94 - Jakub Lipcak , Bruno Rossi 2018

Context: Software code reviews are an important part of the development process, leading to better software quality and reduced overall costs. However, finding appropriate code reviewers is a complex and time-consuming task. Goals: In this paper, we propose a large-scale study to compare performance of two main source code reviewer recommendation algorithms (RevFinder and a Naive Bayes-based approach) in identifying the best code reviewers for opened pull requests. Method: We mined data from Github and Gerrit repositories, building a large dataset of 51 projects, with more than 293K pull requests analyzed, 180K owners and 157K reviewers. Results: Based on the large analysis, we can state that i) no model can be generalized as best for all projects, ii) the usage of a different repository (Gerrit, GitHub) can have impact on the the recommendation results, iii) exploiting sub-projects information available in Gerrit can improve the recommendation results.

Software Engineering