Learning to map source code to software vulnerability using code-as-a-graph

134 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Sahil Suneja

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Sahil Suneja - Yunhui Zheng - Yufan Zhuang

هندسة البرمجيات التشفير والأمن التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective. Specifically, whether signatures of vulnerabilities in source code can be learned from its graph representation, in terms of relationships between nodes and edges. We create a pipeline we call AI4VA, which first encodes a sample source code into a Code Property Graph. The extracted graph is then vectorized in a manner which preserves its semantic information. A Gated Graph Neural Network is then trained using several such graphs to automatically extract templates differentiating the graph of a vulnerable sample from a healthy one. Our model outperforms static analyzers, classic machine learning, as well as CNN and RNN-based deep learning models on two of the three datasets we experiment with. We thus show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches. (Submitted Oct 2019, Paper #28, ICST)

قيم البحث

105 - Yufan Zhuang , Sahil Suneja , Veronika Thost 2021

Identifying vulnerable code is a precautionary measure to counter software security breaches. Tedious expert effort has been spent to build static analyzers, yet insecure patterns are barely fully enumerated. This work explores a deep learning approa ch to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program, in order to improve prediction performance. Compared with a generic GNN, our enhancements include a synthesis of multiple representations learned from the several parsed graphs of a program, and a new training loss metric that leverages the fine granularity of labeling. Our model outperforms multiple text, image and graph-based approaches, across two real-world datasets.

الذكاء الاصطناعي هندسة البرمجيات

Learning How to Mutate Source Code from Bug-Fixes

146 - Michele Tufano , Cody Watson , Gabriele Bavota 2018

Mutation testing has been widely accepted as an approach to guide test case generation or to assess the effectiveness of test suites. Empirical studies have shown that mutants are representative of real faults; yet they also indicated a clear need fo r better, possibly customized, mutation operators and strategies. While methods to devise domain-specific or general-purpose mutation operators from real faults exist, they are effort- and error-prone, and do not help the tester to decide whether and how to mutate a given source code element. We propose a novel approach to automatically learn mutants from faults in real programs. First, our approach processes bug fixing changes using fine-grained differencing, code abstraction, and change clustering. Then, it learns mutation models using a deep learning strategy. We have trained and evaluated our technique on a set of ~787k bug fixes mined from GitHub. Our empirical evaluation showed that our models are able to predict mutants that resemble the actual fixed bugs in between 9% and 45% of the cases, and over 98% of the automatically generated mutants are lexically and syntactically correct.

هندسة البرمجيات

Discovery of Layered Software Architecture from Source Code Using Ego Networks

213 - Sanjay Thakare , Arvind W Kiwelekar 2021

Software architecture refers to the high-level abstraction of a system including the configuration of the involved elements and the interactions and relationships that exist between them. Source codes can be easily built by referring to the software architectures. However, the reverse process i.e. derivation of the software architecture from the source code is a challenging task. Further, such an architecture consists of multiple layers, and distributing the existing elements into these layers should be done accurately and efficiently. In this paper, a novel approach is presented for the recovery of layered architectures from Java-based software systems using the concept of ego networks. Ego networks have traditionally been used for social network analysis, but in this paper, they are modified in a particular way and tuned to suit the mentioned task. Specifically, a dependency network is extracted from the source code to create an ego network. The ego network is processed to create and optimize ego layers in a particular structure. These ego layers when integrated and optimized together give the final layered architecture. The proposed approach is evaluated in two ways: on stat

هندسة البرمجيات الشبكات الاجتماعية والمعلومات

NaturalCC: A Toolkit to Naturalize the Source Code Corpus

270 - Yao Wan , Yang He , Jian-Guo Zhang 2020

We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and facilitate the research on big code analysis. Using NaturalCC, researchers both from natural language or programming la nguage communities can quickly and easily reproduce the state-of-the-art baselines and implement their approach. NaturalCC is built upon Fairseq and PyTorch, providing (1) an efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each models performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at https://www.youtube.com/watch?v=q4W5VSI-u3E&t=25s.

هندسة البرمجيات

Analysis of Source Code Using UPPAAL

269 - Mitja Kulczynski 2021

In recent years there has been a considerable effort in optimising formal methods for application to code. This has been driven by tools such as CPAChecker, DIVINE, and CBMC. At the same time tools such as Uppaal have been massively expanding the rea lm of more traditional model checking technologies to include strategy synthesis algorithms - an aspect becoming more and more needed as software becomes increasingly parallel. Instead of reimplementing the advances made by Uppaal in this area, we suggest in this paper to develop a bridge between the source code and the engine of Uppaal. Our approach uses the widespread intermediate language LLVM and makes recent advances of the Uppaal ecosystem readily available to analysis of source code.

هندسة البرمجيات