أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Qihao Zhu

Lyra: A Benchmark for Turducken-Style Code Generation

143 - Qingyuan Liang , Zeyu Sun , Qihao Zhu 2021

Code generation is crucial to reduce manual software development efforts. Recently, neural techniques have been used to generate source code automatically. While promising, these approaches are evaluated on tasks for generating code in single program ming languages. However, in actual development, one programming language is often embedded in another. For example, SQL statements are often embedded as strings in base programming languages such as Python and Java, and JavaScript programs are often embedded in sever-side programming languages, such as PHP, Java, and Python. We call this a turducken-style programming. In this paper, we define a new code generation task: given a natural language comment, this task aims to generate a program in a base language with an embedded language. To our knowledge, this is the first turducken-style code generation task. For this task, we present Lyra: a dataset in Python with embedded SQL. This dataset contains 2,000 carefully annotated database manipulation programs from real usage projects. Each program is paired with both a Chinese comment and an English comment. In our experiment, we adopted Transformer, a state-of-the-art technique, as the baseline. In the best setting, Transformer achieves 0.5% and 1.5% AST exact matching accuracy using Chinese and English comments, respectively. Therefore, we believe that Lyra provides a new challenge for code generation.

هندسة البرمجيات الذكاء الاصطناعي

Code Generation Based on Deep Learning: a Brief Review

190 - Qihao Zhu , Wenjie Zhang 2021

Automatic software development has been a research hot spot in the field of software engineering (SE) in the past decade. In particular, deep learning (DL) has been applied and achieved a lot of progress in various SE tasks. Among all applications, a utomatic code generation by machines as a general concept, including code completion and code synthesis, is a common expectation in the field of SE, which may greatly reduce the development burden of the software developers and improves the efficiency and quality of the software development process to a certain extent. Code completion is an important part of modern integrated development environments (IDEs). Code completion technology effectively helps programmers complete code class names, method names, and key-words, etc., which improves the efficiency of program development and reduces spelling errors in the coding process. Such tools use static analysis on the code and provide candidates for completion arranged in alphabetical order. Code synthesis is implemented from two aspects, one based on input-output samples and the other based on functionality description. In this study, we introduce existing techniques of these two aspects and the corresponding DL techniques, and present some possible future research directions.

هندسة البرمجيات الذكاء الاصطناعي

OCoR: An Overlapping-Aware Code Retriever

294 - Qihao Zhu , Zeyu Sun , Xiran Liang 2020

Code retrieval helps developers reuse the code snippet in the open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code among a set of code. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., message and msg), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related. To address these problems, we propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps: the first embeds identifiers by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degrees of overlaps between each natural language word and each identifier. The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of different components in OCoR.

الحساب واللغة التعلم الآلي هندسة البرمجيات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد