Embedding Java Classes with code2vec: Improvements from Variable Obfuscation

53 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Rhys Compton

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Rhys Compton - Eibe Frank - Panos Patros

التعلم الآلي لغات البرمجة هندسة البرمجيات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform predictions at the class level (e.g., for the identification of malicious Java classes). Both shortcomings are addressed in the research presented in this paper. We investigate the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names and consider a simple approach to creating class-level embeddings by aggregating sets of method embeddings. Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics. The datasets, models, and code are shared for further ML research on source code.

قيم البحث

اقرأ أيضاً

Boosting Java Performance using GPGPUs

147 - James Clarkson , Christos Kotselidis , Gavin Brown 2015

Heterogeneous programming has started becoming the norm in order to achieve better performance by running portions of code on the most appropriate hardware resource. Currently, significant engineering efforts are undertaken in order to enable existin g programming languages to perform heterogeneous execution mainly on GPUs. In this paper we describe Jacc, an experimental framework which allows developers to program GPGPUs directly from Java. By using the Jacc framework, developers have the ability to add GPGPU support into their applications with minimal code refactoring. To simplify the development of GPGPU applications we allow developers to model heterogeneous code using two key abstractions: textit{tasks}, which encapsulate all the information needed to execute code on a GPGPU; and textit{task graphs}, which capture the inter-task control-flow of the application. Using this information the Jacc runtime is able to automatically handle data movement and synchronization between the host and the GPGPU; eliminating the need for explicitly managing disparate memory spaces. In order to generate highly parallel GPGPU code, Jacc provides developers with the ability to decorate key aspects of their code using annotations. The compiler, in turn, exploits this information in order to automatically generate code without requiring additional code refactoring. Finally, we demonstrate the advantages of Jacc, both in terms of programmability and performance, by evaluating it against existing Java frameworks. Experimental results show an average performance speedup of 32x and a 4.4x code decrease across eight evaluated benchmarks on a NVIDIA Tesla K20m GPU.

النظم الموزعة والتوازية والحوسبة العنقودية لغات البرمجة

Java & Lambda: a Featherweight Story

71 - Lorenzo Bettini , Viviana Bono , Mariangiola Dezani-Ciancaglini 2018

We present FJ&$lambda$, a new core calculus that extends Featherweight Java (FJ) with interfaces, supporting multiple inheritance in a restricted form, $lambda$-expressions, and intersection types. Our main goal is to formalise how lambdas and inters ection types are grafted on Java 8, by studying their properties in a formal setting. We show how intersection types play a significant role in several cases, in particular in the typecast of a $lambda$-expression and in the typing of conditional expressions. We also embody interface emph{default methods} in FJ&$lambda$, since they increase the dynamism of $lambda$-expressions, by allowing these methods to be called on $lambda$-expressions. The crucial point in Java 8 and in our calculus is that $lambda$-expressions can have various types according to the context requirements (target types): indeed, Java code does not compile when $lambda$-expressions come without target types. In particular, in the operational semantics we must record target types by decorating $lambda$-expressions, otherwise they would be lost in the runtime expressions. We prove the subject reduction property and progress for the resulting calculus, and we give a type inference algorithm that returns the type of a given program if it is well typed. The design of FJ&$lambda$ has been driven by the aim of making it a subset of Java 8, while preserving the elegance and compactness of FJ. Indeed, FJ&$lambda$ programs are typed and behave the same as Java programs.

المنطق في علوم الحاسوب لغات البرمجة

Learning Semantic Vector Representations of Source Code via a Siamese Neural Network

118 - David Wehr , Halley Fede , Eleanor Pence 2019

The abundance of open-source code, coupled with the success of recent advances in deep learning for natural language processing, has given rise to a promising new application of machine learning to source code. In this work, we explore the use of a S iamese recurrent neural network model on Python source code to create vectors which capture the semantics of code. We evaluate the quality of embeddings by identifying which problem from a programming competition the code solves. Our model significantly outperforms a bag-of-tokens embedding, providing promising results for improving code embeddings that can be used in future software engineering tasks.

التعلم الآلي لغات البرمجة هندسة البرمجيات

Inferring Javascript types using Graph Neural Networks

117 - Jessica Schrouff , Kai Wohlfahrt , Bruno Marnette 2019

The recent use of `Big Code with state-of-the-art deep learning methods offers promising avenues to ease program source code writing and correction. As a first step towards automatic code repair, we implemented a graph neural network model that predi cts token types for Javascript programs. The predictions achieve an accuracy above $90%$, which improves on previous similar work.

التعلم الآلي لغات البرمجة هندسة البرمجيات

Generating Adversarial Computer Programs using Optimized Obfuscations

113 - Shashank Srikant , Sijia Liu , Tamara Mitrovska 2021

Machine learning (ML) models that learn and predict properties of computer programs are increasingly being adopted and deployed. These models have demonstrated success in applications such as auto-completing code, summarizing large programs, and dete cting bugs and malware in programs. In this work, we investigate principled ways to adversarially perturb a computer program to fool such learned models, and thus determine their adversarial robustness. We use program obfuscations, which have conventionally been used to avoid attempts at reverse engineering programs, as adversarial perturbations. These perturbations modify programs in ways that do not alter their functionality but can be crafted to deceive an ML model when making a decision. We provide a general formulation for an adversarial program that allows applying multiple obfuscation transformations to a program in any language. We develop first-order optimization algorithms to efficiently determine two key aspects -- which parts of the program to transform, and what transformations to use. We show that it is important to optimize both these aspects to generate the best adversarially perturbed program. Due to the discrete nature of this problem, we also propose using randomized smoothing to improve the attack loss landscape to ease optimization. We evaluate our work on Python and Java programs on the problem of program summarization. We show that our best attack proposal achieves a $52%$ improvement over a state-of-the-art attack generation approach for programs trained on a seq2seq model. We further show that our formulation is better at training models that are robust to adversarial attacks.

التعلم الآلي لغات البرمجة هندسة البرمجيات