An Empirical Analysis of UI-based Flaky Tests

244 0 0.0 ( 0 )

Download Cite

Added by Alan Romano

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Alan Romano

Software Engineering

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Flaky tests have gained attention from the research community in recent years and with good reason. These tests lead to wasted time and resources, and they reduce the reliability of the test suites and build systems they affect. However, most of the existing work on flaky tests focus exclusively on traditional unit tests. This work ignores UI tests that have larger input spaces and more diverse running conditions than traditional unit tests. In addition, UI tests tend to be more complex and resource-heavy, making them unsuited for detection techniques involving rerunning test suites multiple times. In this paper, we perform a study on flaky UI tests. We analyze 235 flaky UI test samples found in 62 projects from both web and Android environments. We identify the common underlying root causes of flakiness in the UI tests, the strategies used to manifest the flaky behavior, and the fixing strategies used to remedy flaky UI tests. The findings made in this work can provide a foundation for the development of detection and prevention techniques for flakiness arising in UI tests.

rate research

What is the Vocabulary of Flaky Tests? An Extended Replication

269 - B. H. P. Camara , M. A. G. Silva , A. T. Endo 2021

Software systems have been continuously evolved and delivered with high quality due to the widespread adoption of automated tests. A recurring issue hurting this scenario is the presence of flaky tests, a test case that may pass or fail non-deterministically. A promising, but yet lacking more empirical evidence, approach is to collect static data of automated tests and use them to predict their flakiness. In this paper, we conducted an empirical study to assess the use of code identifiers to predict test flakiness. To do so, we first replicate most parts of the previous study of Pinto~et~al.~(MSR~2020). This replication was extended by using a different ML Python platform (Scikit-learn) and adding different learning algorithms in the analyses. Then, we validated the performance of trained models using datasets with other flaky tests and from different projects. We successfully replicated the results of Pinto~et~al.~(2020), with minor differences using Scikit-learn; different algorithms had performance similar to the ones used previously. Concerning the validation, we noticed that the recall of the trained models was smaller, and classifiers presented a varying range of decreases. This was observed in both intra-project and inter-projects test flakiness prediction.

Software Engineering Artificial Intelligence

On the use of test smells for prediction of flaky tests

264 - B. H. P. Camara , M. A. G. Silva , A. T. Endo 2021

Regression testing is an important phase to deliver software with quality. However, flaky tests hamper the evaluation of test results and can increase costs. This is because a flaky test may pass or fail non-deterministically and to identify properly the flakiness of a test requires rerunning the test suite multiple times. To cope with this challenge, approaches have been proposed based on prediction models and machine learning. Existing approaches based on the use of the test case vocabulary may be context-sensitive and prone to overfitting, presenting low performance when executed in a cross-project scenario. To overcome these limitations, we investigate the use of test smells as predictors of flaky tests. We conducted an empirical study to understand if test smells have good performance as a classifier to predict the flakiness in the cross-project context, and analyzed the information gain of each test smell. We also compared the test smell-based approach with the vocabulary-based one. As a result, we obtained a classifier that had a reasonable performance (Random Forest, 0.83) to predict the flakiness in the testing phase. This classifier presented better performance than vocabulary-based model for cross-project prediction. The Assertion Roulette and Sleepy Test test smell types are the ones associated with the best information gain values.

Software Engineering Machine Learning

An Empirical Analysis of Task Allocation in Scrum-based Agile Programming

569 - Jun Lin , Han Yu , Zhiqi Shen 2014

Agile Software Development (ASD) methodology has become widely used in the industry. Understanding the challenges facing software engineering students is important to designing effective training methods to equip students with proper skills required for effectively using the ASD techniques. Existing empirical research mostly focused on eXtreme Programming (XP) based ASD methodologies. There is a lack of empirical studies about Scrum-based ASD programming which has become the most popular agile methodology among industry practitioners. In this paper, we present empirical findings regarding the aspects of task allocation decision-making, collaboration, and team morale related to the Scrum ASD process which have not yet been well studied by existing research. We draw our findings from a 12 week long course work project in 2014 involving 125 undergraduate software engineering students from a renowned university working in 21 Scrum teams. Instead of the traditional survey or interview based methods, which suffer from limitations in scale and level of details, we obtain fine grained data through logging students activities in our online agile project management (APM) platform - HASE. During this study, the platform logged over 10,000 ASD activities. Deviating from existing preconceptions, our results suggest negative correlations between collaboration and team performance as well as team morale.

Software Engineering

An Empirical Analysis of the Influence of Fault Space on Search-Based Automated Program Repair

84 - Ming Wen , Junjie Chen , Rongxin Wu 2017

Automated program repair (APR) has attracted great research attention, and various techniques have been proposed. Search-based APR is one of the most important categories among these techniques. Existing researches focus on the design of effective mutation operators and searching algorithms to better find the correct patch. Despite various efforts, the effectiveness of these techniques are still limited by the search space explosion problem. One of the key factors attribute to this problem is the quality of fault spaces as reported by existing studies. This motivates us to study the importance of the fault space to the success of finding a correct patch. Our empirical study aims to answer three questions. Does the fault space significantly correlate with the performance of search-based APR? If so, are there any indicative measurements to approximate the accuracy of the fault space before applying expensive APR techniques? Are there any automatic methods that can improve the accuracy of the fault space? We observe that the accuracy of the fault space affects the effectiveness and efficiency of search-based APR techniques, e.g., the failure rate of GenProg could be as high as $60%$ when the real fix location is ranked lower than 10 even though the correct patch is in the search space. Besides, GenProg is able to find more correct patches and with fewer trials when given a fault space with a higher accuracy. We also find that the negative mutation coverage, which is designed in this study to measure the capability of a test suite to kill the mutants created on the statements executed by failing tests, is the most indicative measurement to estimate the efficiency of search-based APR. Finally, we confirm that automated generated test cases can help improve the accuracy of fault spaces, and further improve the performance of search-based APR techniques.

Software Engineering

Wireframe-Based UI Design Search Through Image Autoencoder

127 - Jieshan Chen , Chunyang Chen , Zhenchang Xing 2021

UI design is an integral part of software development. For many developers who do not have much UI design experience, exposing them to a large database of real-application UI designs can help them quickly build up a realistic understanding of the design space for a software feature and get design inspirations from existing applications. However, existing keyword-based, image-similarity-based, and component-matching-based methods cannot reliably find relevant high-fidelity UI designs in a large database alike to the UI wireframe that the developers sketch, in face of the great variations in UI designs. In this article, we propose a deep-learning-based UI design search engine to fill in the gap. The key innovation of our search engine is to train a wireframe image autoencoder using a large database of real-application UI designs, without the need for labeling relevant UI designs. We implement our approach for Android UI design search, and conduct extensive experiments with artificially created relevant UI designs and human evaluation of UI design search results. Our experiments confirm the superior performance of our search engine over existing image-similarity or component-matching-based methods and demonstrate the usefulness of our search engine in real-world UI design tasks.

Software Engineering