No Arabic abstract
In recent years, we observe an increasing amount of software with machine learning components being deployed. This poses the question of quality assurance for such components: how can we validate whether specified requirements are fulfilled by a machine learned software? Current testing and verification approaches either focus on a single requirement (e.g., fairness) or specialize on a single type of machine learning model (e.g., neural networks). In this paper, we propose property-driven testing of machine learning models. Our approach MLCheck encompasses (1) a language for property specification, and (2) a technique for systematic test case generation. The specification language is comparable to property-based testing languages. Test case generation employs advanced verification technology for a systematic, property-dependent construction of test suites, without additional user-supplied generator functions. We evaluate MLCheck using requirements and data sets from three different application areas (software discrimination, learning on knowledge graphs and security). Our evaluation shows that despite its generality MLCheck can even outperform specialised testing approaches while having a comparable runtime.
Web testing has long been recognized as a notoriously difficult task. Even nowadays, web testing still heavily relies on manual efforts while automated web testing is far from achieving human-level performance. Key challenges in web testing include dynamic content update and deep bugs hiding under complicated user interactions and specific input values, which can only be triggered by certain action sequences in the huge search space. In this paper, we propose WebExplor, an automatic end-to-end web testing framework, to achieve an adaptive exploration of web applications. WebExplor adopts curiosity-driven reinforcement learning to generate high-quality action sequences (test cases) satisfying temporal logical relations. Besides, WebExplor incrementally builds an automaton during the online testing process, which provides high-level guidance to further improve the testing efficiency. We have conducted comprehensive evaluations of WebExplor on six real-world projects, a commercial SaaS web application, and performed an in-the-wild study of the top 50 web applications in the world. The results demonstrate that in most cases WebExplor can achieve a significantly higher failure detection rate, code coverage, and efficiency than existing state-of-the-art web testing techniques. WebExplor also detected 12 previously unknown failures in the commercial web application, which have been confirmed and fixed by the developers. Furthermore, our in-the-wild study further uncovered 3,466 exceptions and errors.
Mutation testing is a well-established technique for assessing a test suites quality by injecting artificial faults into production code. In recent years, mutation testing has been extended to machine learning (ML) systems, and deep learning (DL) in particular; researchers have proposed approaches, tools, and statistically sound heuristics to determine whether mutants in DL systems are killed or not. However, as we will argue in this work, questions can be raised to what extent currently used mutation testing techniques in DL are actually in line with the classical interpretation of mutation testing. We observe that ML model development resembles a test-driven development (TDD) process, in which a training algorithm (`programmer) generates a model (program) that fits the data points (test data) to labels (implicit assertions), up to a certain threshold. However, considering proposed mutation testing techniques for ML systems under this TDD metaphor, in current approaches, the distinction between production and test code is blurry, and the realism of mutation operators can be challenged. We also consider the fundamental hypotheses underlying classical mutation testing: the competent programmer hypothesis and coupling effect hypothesis. As we will illustrate, these hypotheses do not trivially translate to ML system development, and more conscious and explicit scoping and concept mapping will be needed to truly draw parallels. Based on our observations, we propose several action points for better alignment of mutation testing techniques for ML with paradigms and vocabularies of classical mutation testing.
ReTest is a novel testing tool for Java applications with a graphical user interface (GUI), combining monkey testing and difference testing. Since this combination sidesteps the oracle problem, it enables the generation of GUI-based regression tests. ReTest makes use of evolutionary computing (EC), particularly a genetic algorithm (GA), to optimize these tests towards code coverage. While this is indeed a desirable goal in terms of software testing and potentially finds many bugs, it lacks one major ingredient: human behavior. Consequently, human testers often find the results less reasonable and difficult to interpret. This thesis proposes a new approach to improve the initial population of the GA with the aid of machine learning (ML), forming an ML-technique enhanced-EC (MLEC) algorithm. In order to do so, existing tests are exploited to extract information on how human testers use the given GUI. The obtained data is then utilized to train an artificial neural network (ANN), which ranks the available GUI actions respectively their underlying GUI components at runtime---reducing the gap between manually created and automatically generated regression tests. Although the approach is implemented on top of ReTest, it can be easily used to guide any form of monkey testing. The results show that with only little training data, the ANN is able to reach an accuracy of 82% and the resulting tests represent an improvement without reducing the overall code coverage and performance significantly.
The boom of DL technology leads to massive DL models built and shared, which facilitates the acquisition and reuse of DL models. For a given task, we encounter multiple DL models available with the same functionality, which are considered as candidates to achieve this task. Testers are expected to compare multiple DL models and select the more suitable ones w.r.t. the whole testing context. Due to the limitation of labeling effort, testers aim to select an efficient subset of samples to make an as precise rank estimation as possible for these models. To tackle this problem, we propose Sample Discrimination based Selection (SDS) to select efficient samples that could discriminate multiple models, i.e., the prediction behaviors (right/wrong) of these samples would be helpful to indicate the trend of model performance. To evaluate SDS, we conduct an extensive empirical study with three widely-used image datasets and 80 real world DL models. The experimental results show that, compared with state-of-the-art baseline methods, SDS is an effective and efficient sample selection method to rank multiple DL models.
Ethereum Virtual Machine (EVM) is the run-time environment for smart contracts and its vulnerabilities may lead to serious problems to the Ethereum ecology. With lots of techniques being developed for the validation of smart contracts, the security problems of EVM have not been well-studied. In this paper, we propose EVMFuzz, aiming to detect vulnerabilities of EVMs with differential fuzz testing. The core idea of EVMFuzz is to continuously generate seed contracts for different EVMs execution, so as to find as many inconsistencies among execution results as possible, eventually discover vulnerabilities with output cross-referencing. First, we present the evaluation metric for the internal inconsistency indicator, such as the opcode sequence executed and gas used. Then, we construct seed contracts via a set of predefined mutators and employ dynamic priority scheduling algorithm to guide seed contracts selection and maximize the inconsistency. Finally, we leverage different EVMs as crossreferencing oracles to avoid manual checking of the execution output. For evaluation, we conducted large-scale mutation on 36,295 real-world smart contracts and generated 253,153 smart contracts. Among them, 66.2% showed differential performance, including 1,596 variant contracts triggered inconsistent output among EVMs. Accompanied by manual root cause analysis, we found 5 previously unknown security bugs in four widely used EVMs, and all had been included in Common Vulnerabilities and Exposures (CVE) database.