Concolic execution and fuzzing are two complementary coverage-based testing techniques. How to achieve the best of both remains an open challenge. To address this research problem, we propose and evaluate Legion. Legion re-engineers the Monte Carlo tree search (MCTS) framework from the AI literature to treat automated test generation as a problem of sequential decision-making under uncertainty. Its best-first search strategy provides a principled way to learn the most promising program states to investigate at each search iteration, based on observed rewards from previous iterations. Legion incorporates a form of directed fuzzing that we call approximate path-preserving fuzzing (APPFuzzing) to investigate program states selected by MCTS. APPFuzzing serves as the Monte Carlo simulation technique and is implemented by extending prior work on constrained sampling. We evaluate Legion against competitors on 2531 benchmarks from the coverage category of Test-Comp 2020 and measure its sensitivity to hyperparameters, demonstrating its effectiveness on a wide variety of input programs.
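To make the search loop concrete, the following is a minimal, hypothetical Python sketch of the MCTS scheme described above: selection by an upper-confidence score, expansion of successor program states, a simulation step standing in for APPFuzzing, and back-propagation of the coverage reward. The `Node` class, the `expand` and `simulate` callbacks, and the exploration constant are illustrative assumptions, not Legion's actual implementation.

```python
import math
import random

# Minimal, hypothetical sketch of an MCTS loop over program states. Node,
# EXPLORATION, expand, and simulate are illustrative assumptions, not
# Legion's actual data structures or APPFuzzing implementation.

EXPLORATION = math.sqrt(2)

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # a program state / path prefix to investigate
        self.parent = parent
        self.children = []
        self.visits = 0
        self.reward = 0.0         # e.g., new coverage observed from this state

def uct_score(node):
    """Upper-confidence bound used for best-first selection."""
    if node.visits == 0:
        return float("inf")
    exploit = node.reward / node.visits
    explore = EXPLORATION * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts_iteration(root, expand, simulate):
    # 1. Selection: descend by UCT score until reaching a leaf.
    node = root
    while node.children:
        node = max(node.children, key=uct_score)
    # 2. Expansion: add successor program states (a symbolic-execution step).
    node.children = [Node(s, parent=node) for s in expand(node.state)]
    target = random.choice(node.children) if node.children else node
    # 3. Simulation: stand-in for approximate path-preserving fuzzing.
    reward = simulate(target.state)
    # 4. Back-propagation of the observed reward up to the root.
    while target is not None:
        target.visits += 1
        target.reward += reward
        target = target.parent
```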
Software testing is one of the most popular validation techniques in the software industry. Surprisingly, we can only find a few approaches to testing in the context of logic programming. In this paper, we introduce a systematic approach for dynamic testing that combines both concrete and symbolic execution. Our approach is fully automatic and guarantees full path coverage when it terminates. We prove some basic properties of our technique and illustrate its practical usefulness through a prototype implementation.
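The combination of concrete and symbolic execution can be illustrated with a small, hypothetical concolic loop. The sketch below is written in imperative Python with the z3 solver rather than in the logic-programming setting of the paper; the toy program, `branch_conditions`, and `concolic_loop` are stand-ins chosen for illustration only.

```python
from z3 import Int, Not, Solver, sat

X = Int("x")   # symbolic counterpart of the concrete input

def branch_conditions(x):
    """Toy program under test: records the symbolic condition of each branch taken."""
    conds = []
    conds.append((X > 10) if x > 10 else Not(X > 10))
    conds.append((X % 2 == 0) if x % 2 == 0 else Not(X % 2 == 0))
    return conds

def concolic_loop(seed=0, max_inputs=10):
    inputs, seen, worklist = [seed], set(), [seed]
    while worklist and len(inputs) < max_inputs:
        x = worklist.pop()
        path = branch_conditions(x)          # concrete run + symbolic trace
        key = tuple(str(c) for c in path)
        if key in seen:
            continue
        seen.add(key)
        # Flip each branch condition in turn and ask the solver for an input
        # that follows the common prefix but takes the alternative branch.
        for i, cond in enumerate(path):
            solver = Solver()
            solver.add(*path[:i], Not(cond))
            if solver.check() == sat:
                new_x = solver.model().eval(X, model_completion=True).as_long()
                inputs.append(new_x)
                worklist.append(new_x)
    return inputs

print(concolic_loop())
```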
We distinguish two general modes of testing for Deep Neural Networks (DNNs): offline testing, where DNNs are tested as individual units based on test datasets obtained without involving the DNNs under test, and online testing, where DNNs are embedded into a specific application environment and tested in a closed-loop mode in interaction with that environment. Typically, DNNs are subjected to both types of testing during their development life cycle: offline testing is applied immediately after DNN training, and online testing follows once a DNN is deployed within a specific application environment. In this paper, we study the relationship between offline and online testing. Our goal is to determine how offline and online testing differ or complement one another, and whether offline testing results can be used to help reduce the cost of online testing. Though these questions are generally relevant to all autonomous systems, we study them in the context of automated driving systems where, as study subjects, we use DNNs that automate end-to-end steering control of self-driving vehicles. Our results show that offline testing is less effective than online testing, as many safety violations identified by online testing could not be identified by offline testing, while large prediction errors generated by offline testing always led to severe safety violations detectable by online testing. Further, we cannot exploit offline testing results to reduce the cost of online testing in practice, since we are unable to identify specific situations where offline testing could be as accurate as online testing in identifying safety requirement violations.
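The two modes can be contrasted with a short, purely illustrative sketch: offline testing scores a DNN by prediction error on a fixed dataset, while online testing closes the loop with an environment and counts safety violations. The `model`, `test_images`, `true_angles`, and `simulator` interfaces below are hypothetical placeholders, not the study's actual subjects or simulator.

```python
import numpy as np

# Illustrative-only contrast between the two testing modes; the model and
# simulator interfaces are hypothetical placeholders.

def offline_test(model, test_images, true_angles, threshold=0.3):
    """Offline: unit-test the DNN on a fixed dataset via prediction error."""
    preds = np.array([model.predict(img) for img in test_images])
    errors = np.abs(preds - np.array(true_angles))
    return errors.mean(), int((errors > threshold).sum())   # MAE, #large errors

def online_test(model, simulator, episodes=10, max_steps=1000):
    """Online: embed the DNN in a closed loop and count safety violations."""
    violations = 0
    for _ in range(episodes):
        obs = simulator.reset()
        for _ in range(max_steps):
            steering = model.predict(obs)             # DNN output feeds back ...
            obs, off_road = simulator.step(steering)  # ... into the environment
            if off_road:                              # e.g., the vehicle leaves the lane
                violations += 1
                break
    return violations
```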
The boom of deep learning (DL) technology has led to massive numbers of DL models being built and shared, which facilitates their acquisition and reuse. For a given task, multiple DL models with the same functionality are often available and can be considered candidates for achieving that task. Testers are expected to compare these models and select the most suitable ones w.r.t. the whole testing context. Because labeling effort is limited, testers aim to select an efficient subset of samples that yields as precise a ranking of these models as possible. To tackle this problem, we propose Sample Discrimination based Selection (SDS), which selects efficient samples that can discriminate between multiple models, i.e., samples whose prediction behaviors (right/wrong) help indicate the trend of model performance. To evaluate SDS, we conduct an extensive empirical study with three widely used image datasets and 80 real-world DL models. The experimental results show that, compared with state-of-the-art baseline methods, SDS is an effective and efficient sample selection method for ranking multiple DL models.
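The intuition behind discriminative samples can be sketched as follows: prefer samples on which the candidate models disagree, then rank the models by accuracy on the labeled subset. This is a deliberate simplification; in practice the correctness matrix is unknown before labeling, and SDS estimates sample discrimination differently. All names below are illustrative.

```python
import numpy as np

# Hedged sketch of the idea behind sample-discrimination-based selection; the
# scoring rule is an illustrative simplification, not the paper's SDS procedure.

def select_discriminative_samples(correct_matrix, budget):
    """correct_matrix[i, j] = 1 if model j predicts sample i correctly."""
    p = correct_matrix.mean(axis=1)      # fraction of models correct per sample
    discrimination = p * (1.0 - p)       # highest when models disagree (~50/50)
    return np.argsort(-discrimination)[:budget]

def rank_models(correct_matrix, selected):
    """Estimate a ranking from accuracy on the selected (labeled) samples."""
    subset_acc = correct_matrix[selected].mean(axis=0)
    return np.argsort(-subset_acc)       # model indices, best first

# Example: 6 samples, 3 candidate models.
correct = np.array([[1, 1, 1],
                    [1, 0, 1],
                    [0, 1, 0],
                    [0, 0, 0],
                    [1, 0, 0],
                    [1, 1, 0]])
chosen = select_discriminative_samples(correct, budget=3)
print(rank_models(correct, chosen))
```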
Concolic testing is a popular dynamic validation technique that can be used for both model checking and automatic test case generation. We have recently introduced concolic testing in the context of logic programming. In contrast to previous approaches, the key ingredient in this setting is a technique to generate appropriate run-time goals by considering all possible ways an atom can unify with the heads of some program clauses. This is called selective unification. In this paper, we show that the existing algorithm is not complete and explore several alternatives in order to obtain a sound and complete algorithm for selective unification.
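For readers unfamiliar with unification, the following is a plain textbook unifier in Python (occurs check omitted), used only to illustrate matching an atom against each clause head; it is not the paper's selective-unification algorithm. Terms are nested tuples and variables are capitalized strings, representation choices made here purely for illustration.

```python
# Plain textbook unification sketch (no occurs check); not the paper's
# selective-unification algorithm.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst=None):
    """Return a substitution unifying t1 and t2, or None if they do not unify."""
    subst = dict(subst or {})
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        subst[t1] = t2
        return subst
    if is_var(t2):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None

# All ways the atom p(X, b) can unify with the heads of a two-clause program.
heads = [("p", "a", "Y"), ("p", "Z", "b")]
atom = ("p", "X", "b")
print([unify(atom, h) for h in heads])
```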
This paper proposes configuration testing: evaluating configuration values (to be deployed) by exercising the code that uses them and assessing the corresponding program behavior. We advocate that configuration values should be systematically tested like software code and that configuration testing should be a key reliability engineering practice for preventing misconfigurations from reaching production deployment. The essential advantage of configuration testing is to put the configuration values (to be deployed) in the context of the target software program under test. In this way, the dynamic effects of configuration values and the impact of configuration changes can be observed during testing. Configuration testing overcomes the fundamental limitations of the de facto approaches to combating misconfigurations, namely configuration validation and software testing: the former is disconnected from code logic and semantics, while the latter can hardly cover all possible configuration values and their combinations. Our preliminary results show the effectiveness of configuration testing in capturing real-world misconfigurations. We present the principles of writing new configuration tests and the promise of retrofitting existing software tests into configuration tests. We discuss new adequacy and quality metrics for configuration testing. We also explore regression testing techniques to enable incremental configuration testing during continuous integration and deployment in modern software systems.
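A configuration test, in this sense, injects a to-be-deployed value into the code that consumes it and asserts on the resulting behavior. The sketch below is a hypothetical Python example: `load_config`, `ThreadPool`, and the assertion are illustrative stand-ins rather than an API from the paper.

```python
# Hypothetical configuration test: the deployed value is exercised by the code
# that uses it, so its dynamic effect is observed instead of being checked by
# a syntactic validation rule. All names here are illustrative stand-ins.

class ThreadPool:
    """Toy component whose behavior depends on a configuration value."""
    def __init__(self, num_workers):
        if num_workers < 1:
            raise ValueError("num_workers must be >= 1")
        self.num_workers = num_workers

    def run(self, jobs):
        # Degenerate "scheduling": chunk the jobs across workers.
        return [jobs[i::self.num_workers] for i in range(self.num_workers)]

def load_config():
    # Stand-in for reading the configuration values to be deployed.
    return {"num_workers": 8}

def test_thread_pool_config():
    value = load_config()["num_workers"]
    pool = ThreadPool(value)                       # exercise the code that uses the value
    schedules = pool.run(list(range(100)))
    assert sum(len(s) for s in schedules) == 100   # observed program behavior

test_thread_pool_config()
```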