Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

63 0 0.0 ( 0 )

Download Cite

Added by Egor Bogomolov

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Egor Bogomolov (JetBrains Research - Higher School of Economics - n Saint Petersburg

Software Engineering

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.

rate research

Learning to map source code to software vulnerability using code-as-a-graph

133 - Sahil Suneja , Yunhui Zheng , Yufan Zhuang 2020

We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective. Specifically, whether signatures of vulnerabilities in source code can be learned from its graph representation, in terms of relationships between nodes and edges. We create a pipeline we call AI4VA, which first encodes a sample source code into a Code Property Graph. The extracted graph is then vectorized in a manner which preserves its semantic information. A Gated Graph Neural Network is then trained using several such graphs to automatically extract templates differentiating the graph of a vulnerable sample from a healthy one. Our model outperforms static analyzers, classic machine learning, as well as CNN and RNN-based deep learning models on two of the three datasets we experiment with. We thus show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches. (Submitted Oct 2019, Paper #28, ICST)

Software Engineering Cryptography and Security Machine Learning

Predictive Mutation Analysis via Natural Language Channel in Source Code

111 - Jinhan Kim , Juyoung Jeon , Shin Hong 2021

Mutation analysis can provide valuable insights into both System Under Test (SUT) and its test suite. However, it is not scalable due to the cost of building and testing a large number of mutants. Predictive Mutation Testing (PMT) has been proposed to reduce the cost of mutation testing, but it can only provide statistical inference about whether a mutant will be killed or not by the entire test suite. We propose Seshat, a Predictive Mutation Analysis (PMA) technique that can accurately predict the entire kill matrix, not just the mutation score of the given test suite. Seshat exploits the natural language channel in code, and learns the relationship between the syntactic and semantic concepts of each test case and the mutants it can kill, from a given kill matrix. The learnt model can later be used to predict the kill matrices for subseque

Software Engineering

Relationships between Software Architecture and Source Code in Practice: An Exploratory Survey and Interview

95 - Fangchao Tian , Peng Liang , Muhammad Ali Babar 2021

Context: Software Architecture (SA) and Source Code (SC) are two intertwined artefacts that represent the interdependent design decisions made at different levels of abstractions - High-Level (HL) and Low-Level (LL). An understanding of the relationships between SA and SC is expected to bridge the gap between SA and SC for supporting maintenance and evolution of software systems. Objective: We aimed at exploring practitioners understanding about the relationships between SA and SC. Method: We used a mixed-method that combines an online survey with 87 respondents and an interview with 8 participants to collect the views of practitioners from 37 countries about the relationships between SA and SC. Results: Our results reveal that: practitioners mainly discuss five features of relationships between SA and SC; a few practitioners have adopted dedicated approaches and tools in the literature for identifying and analyzing the relationships between SA and SC despite recognizing the importance of such information for improving a systems quality attributes, especially maintainability and reliability. It is felt that cost and effort are the major impediments that prevent practitioners from identifying, analyzing, and using the relationships between SA and SC. Conclusions: The results have empirically identified five features of relationships between SA and SC reported in the literature from the perspective of practitioners and a systematic framework to manage the five features of relationships should be developed with dedicated approaches and tools considering the cost and benefit of maintaining the relationships.

Software Engineering

Discovery of Layered Software Architecture from Source Code Using Ego Networks

213 - Sanjay Thakare , Arvind W Kiwelekar 2021

Software architecture refers to the high-level abstraction of a system including the configuration of the involved elements and the interactions and relationships that exist between them. Source codes can be easily built by referring to the software architectures. However, the reverse process i.e. derivation of the software architecture from the source code is a challenging task. Further, such an architecture consists of multiple layers, and distributing the existing elements into these layers should be done accurately and efficiently. In this paper, a novel approach is presented for the recovery of layered architectures from Java-based software systems using the concept of ego networks. Ego networks have traditionally been used for social network analysis, but in this paper, they are modified in a particular way and tuned to suit the mentioned task. Specifically, a dependency network is extracted from the source code to create an ego network. The ego network is processed to create and optimize ego layers in a particular structure. These ego layers when integrated and optimized together give the final layered architecture. The proposed approach is evaluated in two ways: on stat

Software Engineering Social and Information Networks

The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development

94 - Thibault Allanc{c}on , Stefanon Zacchiroli (UP 2021

We introduce the Software Heritage filesystem (SwhFS), a user-space filesystem that integrates large-scale open source software archival with development workflows. SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history.Using SwhFS, developers can quickly checkout any of the 2 billion commits archived by Software Heritage, even after they disappear from their previous known location and without incurring the performance cost of repository cloning. SwhFS works across unrelated repositories and different VCS technologies. Other source code artifacts archived by Software Heritage-individual source code files and trees, releases, and branches-can also be accessed using common programming tools and custom scripts, as if they were locally available.A screencast of SwhFS is available online at dx.doi.org/10.5281/zenodo.4531411.

Software Engineering