Improving the Robustness to Data Inconsistency between Training and Testing for Code Completion by Hierarchical Language Model

81 0 0.0 ( 0 )

Download Cite

Added by Yixiao Yang

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Yixiao Yang

Software Engineering

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In the field of software engineering, applying language models to the token sequence of source code is the state-of-art approach to build a code recommendation system. The syntax tree of source code has hierarchical structures. Ignoring the characteristics of tree structures decreases the model performance. Current LSTM model handles sequential data. The performance of LSTM model will decrease sharply if the noise unseen data is distributed everywhere in the test suite. As code has free naming conventions, it is common for a model trained on one project to encounter many unknown words on another project. If we set many unseen words as UNK just like the solution in natural language processing, the number of UNK will be much greater than the sum of the most frequently appeared words. In an extreme case, just predicting UNK at everywhere may achieve very high prediction accuracy. Thus, such solution cannot reflect the true performance of a model when encountering noise unseen data. In this paper, we only mark a small number of rare words as UNK and show the prediction performance of models under in-project and cross-project evaluation. We propose a novel Hierarchical Language Model (HLM) to improve the robustness of LSTM model to gain the capacity about dealing with the inconsistency of data distribution between training and testing. The newly proposed HLM takes the hierarchical structure of code tree into consideration to predict code. HLM uses BiLSTM to generate embedding for sub-trees according to hierarchies and collects the embedding of sub-trees in context to predict next code. The experiments on inner-project and cross-project data sets indicate that the newly proposed Hierarchical Language Model (HLM) performs better than the state-of-art LSTM model in dealing with the data inconsistency between training and testing and achieves averagely 11.2% improvement in prediction accuracy.

rate research

Deep Just-In-Time Inconsistency Detection Between Comments and Source Code

77 - Sheena Panthaplackel , Junyi Jessy Li , Milos Gligoric 2020

Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a code base. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system which can both detect and resolve inconsistent comments based on code changes.

Software Engineering Artificial Intelligence Computation and Language

Improving Dynamic Code Analysis by Code Abstraction

327 - Isabella Mastroeni , Vincenzo Arceri (Department of Environmental Sciences 2021

In this paper, our aim is to propose a model for code abstraction, based on abstract interpretation, allowing us to improve the precision of a recently proposed static analysis by abstract interpretation of dynamic languages. The problem we tackle here is that the analysis may add some spurious code to the string-to-execute abstract value and this code may need some abstract representations in order to make it analyzable. This is precisely what we propose here, where we drive the code abstraction by the analysis we have to perform.

Software Engineering Programming Languages

Corrigendum and Supplement to Improve Language Modelling for Code Completion through Learning General Token Repetition of Source Code (with Optimized Memory)

54 - Yixiao Yang 2020

This paper is written because I receive several inquiry emails saying it is hard to achieve good results when applying token repetition learning techniques. If REP (proposed by me) or Pointer-Mixture (proposed by Jian Li) is directly applied to source code to decide all token repetitions, the model performance will decrease sharply. As we use pre-order traversal to traverse the Abstract Syntax Tree (AST) to generate token sequence, tokens corresponding to AST grammar are ignored when learning token repetition. For non-grammar tokens, there are many kinds: strings, chars, numbers and identifiers. For each kind of tokens, we try to learn its repetition pattern and find that only identifiers have the property of token repetition. For identifiers, there are also many kinds such as variables, package names, method names, simple types, qualified types or qualified names. Actually, some kinds of identifiers such as package names, method names, qualified names or qualified types are unlikely to be repeated. Thus, we ignore these kinds of identifiers that are unlikely to be repeated when learning token repetition. This step is crucial and this important implementation trick is not clearly presented in the paper because we think it is trivial and too many details may bother readers. We offer the GitHub address of our model in our conference paper and readers can check the description and implementation in that repository. Thus, in this paper, we supplement the important implementation optimization details for the already published papers.

Software Engineering

Configuration Testing: Testing Configuration Values as Code and with Code

73 - Tianyin Xu , Owolabi Legunsen 2019

This paper proposes configuration testing--evaluating configuration values (to be deployed) by exercising the code that uses the values and assessing the corresponding program behavior. We advocate that configuration values should be systematically tested like software code and that configuration testing should be a key reliability engineering practice for preventing misconfigurations from production deployment. The essential advantage of configuration testing is to put the configuration values (to be deployed) in the context of the target software program under test. In this way, the dynamic effects of configuration values and the impact of configuration changes can be observed during testing. Configuration testing overcomes the fundamental limitations of de facto approaches to combatting misconfigurations, namely configuration validation and software testing--the former is disconnected from code logic and semantics, while the latter can hardly cover all possible configuration values and their combinations. Our preliminary results show the effectiveness of configuration testing in capturing real-world misconfigurations. We present the principles of writing new configuration tests and the promises of retrofitting existing software tests to be configuration tests. We discuss new adequacy and quality metrics for configuration testing. We also explore regression testing techniques to enable incremental configuration testing during continuous integration and deployment in modern software systems.

Software Engineering

GraphCodeBERT: Pre-training Code Representations with Data Flow

266 - Daya Guo , Shuo Ren , Shuai Lu 2020

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of where-the-value-comes-from between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

Software Engineering Computation and Language