CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

59 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Leon Moonen

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Guru Prasad Bhandari - Amara Naseer - Leon Moonen (Simula Researchn Laboratory

هندسة البرمجيات الذكاء الاصطناعي التشفير والأمن

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Data-driven research on the automated discovery and repair of security vulnerabilities in source code requires comprehensive datasets of real-life vulnerable code and their fixes. To assist in such research, we propose a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the public National Vulnerability Database (NVD). We implement our approach in a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The CVEfixes collection tool automatically fetches all available CVE records from the NVD, gathers the vulnerable code and corresponding fixes from associated open-source repositories, and organizes the collected information in a relational database. Moreover, the dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. The collection can easily be repeated to keep up-to-date with newly discovered or patched vulnerabilities. The initial release of CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754 open-source projects that were addressed in a total of 5495 vulnerability fixing commits. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.

قيم البحث

146 - Michele Tufano , Cody Watson , Gabriele Bavota 2018

Mutation testing has been widely accepted as an approach to guide test case generation or to assess the effectiveness of test suites. Empirical studies have shown that mutants are representative of real faults; yet they also indicated a clear need fo r better, possibly customized, mutation operators and strategies. While methods to devise domain-specific or general-purpose mutation operators from real faults exist, they are effort- and error-prone, and do not help the tester to decide whether and how to mutate a given source code element. We propose a novel approach to automatically learn mutants from faults in real programs. First, our approach processes bug fixing changes using fine-grained differencing, code abstraction, and change clustering. Then, it learns mutation models using a deep learning strategy. We have trained and evaluated our technique on a set of ~787k bug fixes mined from GitHub. Our empirical evaluation showed that our models are able to predict mutants that resemble the actual fixed bugs in between 9% and 45% of the cases, and over 98% of the automatically generated mutants are lexically and syntactically correct.

هندسة البرمجيات

Exploring Factors and Measures to Select Open Source Software

96 - Xiaozhou Li , Sergio Moreschini , Zheying Zhang 2021

[Context] Open Source Software (OSS) is nowadays used and integrated in most of the commercial products. However, the selection of OSS projects for integration is not a simple process, mainly due to a of lack of clear selection models and lack of inf ormation from the OSS portals. [Objective] We investigated the current factors and measures that practitioners are currently considering when selecting OSS, the source of information and portals that can be used to assess the factors, and the possibility to automatically get this information with APIs. [Method] We elicited the factors and the measures adopted to assess and compare OSS performing a survey among 23 experienced developers who often integrate OSS in the software they develop. Moreover, we investigated the APIs of the portals adopted to assess OSS extracting information for the most starred 100K projects in GitHub. [Result] We identified a set consisting of 8 main factors and 74 sub-factors, together with 170 related metrics that companies can use to select OSS to be integrated in their software projects. Unexpectedly, only a small part of the factors can be evaluated automatically, and out of 170 metrics, only 40 are available, of which only 22 returned information for all the 100K projects. [Conclusion.] OSS selection can be partially automated, by extracting the information needed for the selection from portal APIs. OSS producers can benefit from our results by checking if they are providing all the information commonly required by potential adopters. Developers can benefit from our results, using the list of factors we selected as a checklist during the selection of OSS, or using the APIs we developed to automatically extract the data from OSS projects.

هندسة البرمجيات

Issue Link Label Recovery and Prediction for Open Source Software

76 - Alexander Nicholson , Jin L.C. Guo 2021

Modern open source software development heavily relies on the issue tracking systems to manage their feature requests, bug reports, tasks, and other similar artifacts. Together, those issues form a complex network with links to each other. The hetero geneous character of issues inherently results in varied link types and therefore poses a great challenge for users to create and maintain the label of the link manually. The goal of most existing automated issue link construction techniques ceases with only examining the existence of links between issues. In this work, we focus on the next important question of whether we can assess the type of issue link automatically through a data-driven method. We analyze the links between issues and their labels used the issue tracking system for 66 open source projects. Using three projects, we demonstrate promising results when using supervised machine learning classification for the task of link label recovery with careful model selection and tuning, achieving F1 scores of between 0.56-0.70 for the three studied projects. Further, the performance of our method for future link label prediction is convincing when there is sufficient historical data. Our work signifies the first step in systematically manage and maintain issue links faced in practice.

هندسة البرمجيات

Dynamic Scheduling and Workforce Assignment in Open Source Software Development

310 - Hui Xi , Dong Xu , Young-Jun Son 2020

A novel modeling framework is proposed for dynamic scheduling of projects and workforce assignment in open source software development (OSSD). The goal is to help project managers in OSSD distribute workforce to multiple projects to achieve high effi ciency in software development (e.g. high workforce utilization and short development time) while ensuring the quality of deliverables (e.g. code modularity and software security). The proposed framework consists of two models: 1) a system dynamic model coupled with a meta-heuristic to obtain an optimal schedule of software development projects considering their attributes (e.g. priority, effort, duration) and 2) an agent based model to represent the development community as a social network, where development managers form an optimal team for each project and balance the workload among multiple scheduled projects based on the optimal schedule obtained from the system dynamic model. To illustrate the proposed framework, a software enhancement request process in Kuali foundation is used as a case study. Survey data collected from the Kuali development managers, project managers and actual historical enhancement requests have been used to construct the proposed models. Extensive experiments are conducted to demonstrate the impact of varying parameters on the considered efficiency and quality.

هندسة البرمجيات أنظمة وتحكم أنظمة وتحكم

Characterizing the Roles of Contributors in Open-source Scientific Software Projects

75 - Reed Milewicz , Gustavo Pinto , Paige Rodeghero 2020

The development of scientific software is, more than ever, critical to the practice of science, and this is accompanied by a trend towards more open and collaborative efforts. Unfortunately, there has been little investigation into who is driving the evolution of such scientific software or how the collaboration happens. In this paper, we address this problem. We present an extensive analysis of seven open-source scientific software projects in order to develop an empirically-informed model of the development process. This analysis was complemented by a survey of 72 scientific software developers. In the majority of the projects, we found senior research staff (e.g. professors) to be responsible for half or more of commits (an average commit share of 72%) and heavily involved in architectural concerns (seniors were more likely to interact with files related to the build system, project meta-data, and developer documentation). Juniors (e.g.graduate students) also contribute substantially -- in one studied project, juniors made almost 100% of its commits. Still, graduate students had the longest contribution periods among juniors (with 1.72 years of commit activity compared to 0.98 years for postdocs and 4 months for undergraduates). Moreover, we also found that third-party contributors are scarce, contributing for just one day for the project. The results from this study aim to help scientists to better understand their own projects, communities, and the contributors behavior, while paving the road for future software engineering research

هندسة البرمجيات