HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks

Published by Dakuo Wang

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Xuye Liu, Dakuo Wang, April Wang

Software Engineering, Machine Learning

Abstract

Jupyter notebooks allow data scientists to write machine learning code together with its documentation in cells. In this paper, we propose a new task of code documentation generation (CDG) for computational notebooks. Previous CDG tasks focus on generating documentation for single code snippets; in a computational notebook, by contrast, the documentation in one markdown cell often corresponds to multiple code cells, and these code cells have an inherent structure. We propose a new model (HAConvGNN) that uses a hierarchical attention mechanism to consider both the relevant code cells and the relevant code tokens when generating the documentation. Tested on a new corpus constructed from well-documented Kaggle notebooks, our model outperforms the baseline models.
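The two-level idea in the abstract (attend to tokens within each code cell, then attend across cells) can be illustrated with a small sketch. This is not the HAConvGNN implementation: the dot-product scoring, the order of the two levels, the toy vectors, and the function name `hierarchical_attention` are all assumptions made for illustration, and the graph-convolution part of the model is omitted entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Combine vectors using the given attention weights."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def hierarchical_attention(query, cells):
    """Hypothetical two-level attention.

    query: decoder state vector.
    cells: list of code cells, each a list of token vectors.
    Returns (context_vector, cell_weights, token_weights).
    """
    cell_summaries = []
    token_weights = []
    for tokens in cells:
        # Low level: attend over the tokens inside one code cell.
        weights = softmax([dot(query, token) for token in tokens])
        token_weights.append(weights)
        cell_summaries.append(weighted_sum(weights, tokens))
    # High level: attend over the per-cell summaries.
    cell_weights = softmax([dot(query, summary) for summary in cell_summaries])
    context = weighted_sum(cell_weights, cell_summaries)
    return context, cell_weights, token_weights


# Toy input: two code cells with two 2-dimensional token vectors each.
query = [1.0, 0.0]
cells = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
context, cell_weights, token_weights = hierarchical_attention(query, cells)
```

In this toy input the first cell contains a token aligned with the query, so it receives the larger cell-level weight.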

Chen Lyu, Ruyun Wang, Hongyu Zhang (2021)

The problem of code generation from textual program descriptions has long been viewed as a grand challenge in software engineering. In recent years, many deep-learning-based approaches have been proposed that generate a sequence of code from a textual program description. However, the existing approaches ignore the global relationships among API methods, which are important for understanding the usage of APIs. In this paper, we propose to model the dependencies among API methods as an API dependency graph (ADG) and incorporate the graph embedding into a sequence-to-sequence (Seq2Seq) model. In addition to the existing encoder-decoder structure, a new module named "embedder" is introduced. In this way, the decoder can utilize both global structural dependencies and the textual program description to predict the target code. We conduct extensive code generation experiments on three public datasets and in two programming languages (Python and Java). Our proposed approach, called ADG-Seq2Seq, yields significant improvements over existing state-of-the-art methods and maintains its performance as the length of the target code increases. Extensive ablation tests show that the proposed ADG embedding is effective and outperforms the baselines.

Software Engineering, Artificial Intelligence

Testing with Jupyter notebooks: NoteBook VALidation (nbval) plug-in for pytest

Hans Fangohr, Vidar Fauske, Thomas Kluyver (2020)

The notebook validation tool nbval loads and executes Python code from a Jupyter notebook file. While computing outputs from the cells in the notebook, it compares these outputs with the outputs saved in the notebook file, treating each cell as a test. Deviations are reported as test failures, with various configuration options available to control the behaviour. Application use cases include the validation of notebook-based documentation, tutorials and textbooks, as well as the use of notebooks as additional unit, integration and system tests for the libraries that are used in the notebook. Nbval is implemented as a plugin for the pytest testing software.
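The cell-as-test idea described above can be sketched with the standard library alone. This is not nbval's code: the function name `validate_notebook` is hypothetical, only `stdout` stream outputs are compared, and a real tool would run the cells in a Jupyter kernel rather than with `exec`. The sketch assumes the notebook is stored as nbformat version 4 JSON.

```python
import contextlib
import io
import json


def validate_notebook(notebook_json):
    """Run each code cell and compare its printed output with the saved one.

    Returns a list of (cell_index, passed) pairs, one per code cell.
    """
    notebook = json.loads(notebook_json)
    namespace = {}  # shared across cells, like a notebook session
    results = []
    for index, cell in enumerate(notebook["cells"]):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        if isinstance(source, list):
            source = "".join(source)
        # Collect the stdout text that was saved with the cell.
        expected = ""
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream" and output.get("name") == "stdout":
                text = output["text"]
                expected += "".join(text) if isinstance(text, list) else text
        # Execute the cell and capture what it prints now.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)
        results.append((index, buffer.getvalue() == expected))
    return results


# A tiny in-memory notebook: the second code cell has a stale saved output.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code", "source": ["x = 2\n", "print(x * 3)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["6\n"]}]},
        {"cell_type": "code", "source": ["print(x + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["4\n"]}]},
    ]
})
results = validate_notebook(example)
```

Here the first code cell reproduces its saved output and passes, while the second prints `3` against a saved `4` and is reported as a failure.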

Software Engineering

Themisto: Towards Automated Documentation Generation in Computational Notebooks

April Yi Wang, Dakuo Wang, Jaimie Drozdal (2021)

Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations, which leads to challenges in sharing their notebooks with others and future selves. Inspired by human documentation practices from analyzing 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system to explore the Human-AI Collaboration opportunity in the code documentation scenario. Themisto facilitates the creation of different types of documentation via three approaches: a deep-learning-based approach to generate documentation for source code (fully automated), a query-based approach to retrieve the online API documentation for source code (fully automated), and a user prompt approach to motivate users to write more documentation (semi-automated). We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants' satisfaction with their computational notebook.

Human-Computer Interaction

deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search

Chen Zeng, Yue Yu, Shanshan Li (2021)

With the rapid increase in the number of public code repositories, developers have a strong desire to retrieve precise code snippets using natural language. Although existing deep-learning-based approaches (e.g., DeepCS and MMAN) provide end-to-end solutions (i.e., they accept natural language queries and show related code fragments retrieved directly from a code corpus), the accuracy of code search in large-scale repositories is still limited by the code representation (e.g., AST) and modeling (e.g., directly fusing the features in the attention stage). In this paper, we propose a novel learnable deep Graph for Code Search (called deGraphCS), which transfers source code into variable-based flow graphs based on the intermediate representation technique. These graphs can model code semantics more precisely than processing the code directly as text or using the syntactic tree representation. Furthermore, we propose a well-designed graph optimization mechanism to refine the code representation, and apply an improved gated graph neural network to model variable-based flow graphs. To evaluate the effectiveness of deGraphCS, we collect a large-scale dataset from GitHub containing 41,152 code snippets written in the C language, and reproduce several typical deep code search methods for comparison. In addition, we design a qualitative user study to verify the practical value of our approach. The experimental results show that deGraphCS achieves state-of-the-art performance and accurately retrieves code snippets that satisfy the needs of users.

Software Engineering, Artificial Intelligence

CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network

Yanlin Wang, Ensheng Shi, Lun Du (2021)

Source code summaries are short natural language descriptions of code snippets that help developers better understand and maintain source code. There has been a surge of work on automatic code summarization to reduce the burden of writing summaries manually. However, most contemporary approaches mainly leverage the information within the boundary of the method being summarized (i.e., local context), and ignore the broader context that could assist with code summarization. This paper explores two global contexts, namely intra-class and inter-class contexts, and proposes the model CoCoSUM: Contextual Code Summarization with Multi-Relational Graph Neural Networks. CoCoSUM first incorporates class names as the intra-class context to generate the class semantic embeddings. Then, relevant Unified Modeling Language (UML) class diagrams are extracted as inter-class context and are encoded into the class relational embeddings using a novel Multi-Relational Graph Neural Network (MRGNN). Class semantic embeddings and class relational embeddings, together with the outputs from code token encoder and AST encoder, are passed to a decoder armed with a two-level attention mechanism to generate high-quality, context-aware code summaries. We conduct extensive experiments to evaluate our approach and compare it with other automatic code summarization models. The experimental results show that CoCoSUM is effective and outperforms state-of-the-art methods. Our source code and experimental data are available in the supplementary materials and will be made publicly available.

Software Engineering