
Employing Partial Least Squares Regression with Discriminant Analysis for Bug Prediction

Published by Róbert Rajkó
Publication date: 2020
Paper language: English





Forecasting the defect proneness of source code has long been a major research concern. An estimate of which parts of a software system are most likely to contain bugs can help focus testing efforts, reduce costs, and improve product quality. Many prediction models and approaches have been introduced over the past decades that try to forecast buggy code elements based on static source code metrics, change and history metrics, or both. However, there is still no universal best solution to this problem, as the most suitable features and models vary from dataset to dataset and depend on the context in which they are used. Therefore, novel approaches and further studies on this topic are highly necessary. In this paper, we employ a chemometric approach - Partial Least Squares with Discriminant Analysis (PLS-DA) - for predicting bug-prone classes in Java programs using static source code metrics. To the best of our knowledge, PLS-DA has never before been used as a statistical approach in the software maintenance domain for predicting software errors. In addition, we applied rigorous statistical treatments, including bootstrap resampling and randomization (permutation) tests, in evaluating and reporting the software engineering results. We show that our PLS-DA based prediction model achieves performance superior to state-of-the-art approaches (i.e., an F-measure of 0.44-0.47 at the 90% confidence level) when no data re-sampling is applied, and comparable performance when up-sampling is applied, on the largest open bug dataset, while training the model is significantly faster, so finding optimal parameters is much easier. In terms of completeness, which measures the share of all bugs contained in the Java classes predicted to be defective, PLS-DA outperforms every other algorithm: it found 69.3% and 79.4% of the total bugs with no re-sampling and with up-sampling, respectively.
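The abstract does not include an implementation, but the core idea is straightforward to sketch. scikit-learn has no dedicated PLS-DA estimator, so a common workaround, assumed here rather than taken from the paper, is to fit PLSRegression against a 0/1 bug label and threshold its continuous output. The dataset below is synthetic, and the component count and 0.5 threshold are illustrative choices:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-in for a bug dataset: rows are Java classes, columns are static
# source code metrics (e.g. size, complexity, coupling); y = 1 means buggy.
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# PLS-DA = PLS regression on the 0/1 class label; scale=True standardizes
# the metrics internally. n_components (latent variables) is the tuning knob.
pls = PLSRegression(n_components=5, scale=True).fit(X_tr, y_tr)

# Threshold the continuous PLS response at 0.5 to get class predictions.
y_hat = (pls.predict(X_te).ravel() >= 0.5).astype(int)
print("F-measure:", f1_score(y_te, y_hat))
```

Fitting reduces to a short sequence of deflation steps over the metric matrix, which is consistent with the abstract's observation that training is fast and that tuning the single component count is cheap.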




Read also

A partial least squares regression is proposed for estimating the function-on-function regression model where a functional response and multiple functional predictors consist of random curves with quadratic and interaction effects. The direct estimation of a function-on-function regression model is usually an ill-posed problem. To overcome this difficulty, in practice, the functional data that belong to the infinite-dimensional space are generally projected into a finite-dimensional space of basis functions. The function-on-function regression model is converted to a multivariate regression model of the basis expansion coefficients. In the estimation phase of the proposed method, the functional variables are approximated by a finite-dimensional basis function expansion method. We show that the partial least squares regression constructed via a functional response, multiple functional predictors, and quadratic/interaction terms of the functional predictors is equivalent to the partial least squares regression constructed using basis expansions of functional variables. From the partial least squares regression of the basis expansions of functional variables, we provide an explicit formula for the partial least squares estimate of the coefficient function of the function-on-function regression model. Because the true forms of the models are generally unspecified, we propose a forward procedure for model selection. The finite sample performance of the proposed method is examined using several Monte Carlo experiments and two empirical data analyses, and the results were found to compare favorably with an existing method.
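As a rough illustration of the basis-expansion idea (omitting the quadratic and interaction terms for brevity), the sketch below projects toy predictor and response curves onto a Fourier basis, runs ordinary multivariate PLS between the coefficient matrices, and maps the predicted coefficients back to curves. The basis choice, its size, and the component count are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 101)                     # common evaluation grid

def fourier_basis(t, n_pairs):
    """Constant plus n_pairs sine/cosine pairs on [0, 1]."""
    cols = [np.ones_like(t)]
    for j in range(1, n_pairs + 1):
        cols += [np.sin(2 * np.pi * j * t), np.cos(2 * np.pi * j * t)]
    return np.column_stack(cols)               # (len(t), 2*n_pairs + 1)

B = fourier_basis(t, 3)                        # 7 basis functions

# Toy functional sample: 60 predictor curves X_i(t) and response curves
# Y_i(t), generated directly from basis coefficients.
C_x = rng.normal(size=(60, B.shape[1]))
C_y = 0.5 * C_x @ rng.normal(size=(B.shape[1],) * 2) + 0.1 * rng.normal(size=C_x.shape)
X_curves, Y_curves = C_x @ B.T, C_y @ B.T

# Step 1: recover basis coefficients from the discretized curves.
Cx_hat = np.linalg.lstsq(B, X_curves.T, rcond=None)[0].T
Cy_hat = np.linalg.lstsq(B, Y_curves.T, rcond=None)[0].T

# Step 2: multivariate PLS between the coefficient matrices, then map the
# predicted response coefficients back to curves on the grid.
pls = PLSRegression(n_components=4).fit(Cx_hat, Cy_hat)
Y_pred = pls.predict(Cx_hat) @ B.T             # predicted response curves
```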
We present a new functional Bayes classifier that uses principal component (PC) or partial least squares (PLS) scores from the common covariance function, that is, the covariance function marginalized over groups. When the groups have different covariance functions, the PC or PLS scores need not be independent or even uncorrelated. We use copulas to model the dependence. Our method is semiparametric; the marginal densities are estimated nonparametrically by kernel smoothing and the copula is modeled parametrically. We focus on Gaussian and t-copulas, but other copulas could be used. The strong performance of our methodology is demonstrated through simulation, real data examples, and asymptotic properties.
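A minimal sketch of that construction, with details assumed rather than taken from the paper: PC scores from the pooled (group-centered) data, kernel-smoothed marginals, a Gaussian copula fit per group via normal scores, and classification by comparing group log-densities under equal priors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X0 = rng.multivariate_normal([0, 0, 0], np.eye(3), size=200)
X1 = rng.multivariate_normal([1, 0, 0],
                             [[1, .6, 0], [.6, 1, 0], [0, 0, 1]], size=200)

# "Common covariance" scores: PCA directions from the pooled, group-centered
# data, so both groups are projected onto the same components.
pooled = np.vstack([X0 - X0.mean(0), X1 - X1.mean(0)])
_, _, Vt = np.linalg.svd(pooled, full_matrices=False)
S0, S1 = X0 @ Vt.T, X1 @ Vt.T

def fit_group(S):
    # Nonparametric marginals (kernel smoothing) + parametric Gaussian copula.
    kdes = [stats.gaussian_kde(S[:, j]) for j in range(S.shape[1])]
    Z = stats.norm.ppf(stats.rankdata(S, axis=0) / (len(S) + 1))
    return kdes, np.corrcoef(Z, rowvar=False)  # copula correlation matrix

def log_density(x, kdes, R):
    # log f(x) = sum of marginal log-pdfs + Gaussian copula log-density.
    logs = sum(np.log(k(x[j])[0]) for j, k in enumerate(kdes))
    u = np.clip([k.integrate_box_1d(-np.inf, x[j]) for j, k in enumerate(kdes)],
                1e-6, 1 - 1e-6)
    z = stats.norm.ppf(u)
    cop = -0.5 * (np.log(np.linalg.det(R))
                  + z @ (np.linalg.inv(R) - np.eye(len(z))) @ z)
    return logs + cop

g0, g1 = fit_group(S0), fit_group(S1)
x_new = X1[0] @ Vt.T                           # score vector to classify
# Equal priors assumed; the Bayes rule compares group log-densities.
label = int(log_density(x_new, *g1) > log_density(x_new, *g0))
```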
The problem of fitting experimental data to a given model function $f(t; p_1, p_2, \dots, p_N)$ is conventionally solved numerically by methods such as that of Levenberg-Marquardt, which are based on approximating the Chi-squared measure of discrepancy by a quadratic function. Such nonlinear iterative methods are usually necessary unless the function $f$ to be fitted is itself a linear function of the parameters $p_n$, in which case an elementary linear Least Squares regression is immediately available. When linearity is present in some, but not all, of the parameters, we show how to streamline the optimization method by reducing the nonlinear activity to the nonlinear parameters only. Numerical examples are given to demonstrate the effectiveness of this approach. The main idea is to replace entries corresponding to the linear terms in the numerical difference quotients with an optimal value easily obtained by linear regression. More generally, the idea applies to minimization problems which are quadratic in some of the parameters. We show that the covariance matrix of $\chi^2$ remains the same even though the derivatives are calculated in a different way. For this reason, the standard non-linear optimization methods can be fully applied.
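The idea is easy to demonstrate on a model such as $f(t) = a e^{-bt} + c$, where $a$ and $c$ are linear and only $b$ is nonlinear; the model and optimizer below are illustrative choices, not taken from the paper. For each trial $b$, the best $(a, c)$ comes from an ordinary linear least squares solve, so the outer optimizer searches a one-dimensional space instead of a three-dimensional one:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
t = np.linspace(0, 5, 50)
y = 2.0 * np.exp(-1.3 * t) + 0.5 + rng.normal(scale=0.05, size=t.size)

def linear_fit(b):
    # For fixed b, the model is linear in (a, c): solve by least squares.
    A = np.column_stack([np.exp(-b * t), np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A

def chi2(b):
    coef, A = linear_fit(b)
    r = y - A @ coef
    return r @ r                               # residual sum of squares

# Outer optimization over the single nonlinear parameter b.
res = minimize_scalar(chi2, bounds=(0.01, 10), method="bounded")
(a_hat, c_hat), _ = linear_fit(res.x)
print(f"a={a_hat:.3f}, b={res.x:.3f}, c={c_hat:.3f}")
```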
Given a linear regression setting, Iterative Least Trimmed Squares (ILTS) involves alternating between (a) selecting the subset of samples with lowest current loss, and (b) re-fitting the linear model only on that subset. Both steps are very fast and simple. In this paper we analyze ILTS in the setting of mixed linear regression with corruptions (MLR-C). We first establish deterministic conditions (on the features etc.) under which the ILTS iterate converges linearly to the closest mixture component. We also provide a global algorithm that uses ILTS as a subroutine, to fully solve mixed linear regressions with corruptions. We then evaluate it for the widely studied setting of isotropic Gaussian features, and establish that we match or better existing results in terms of sample complexity. Finally, we provide an ODE analysis for a gradient-descent variant of ILTS that has optimal time complexity. Our results provide initial theoretical evidence that iteratively fitting to the best subset of samples -- a potentially widely applicable idea -- can provably provide state of the art performance in bad training data settings.
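The two alternating steps translate almost line-for-line into code. The sketch below is a plain-vanilla ILTS loop on synthetic corrupted data; the trimming fraction and iteration count are illustrative, and none of the paper's convergence machinery is reproduced:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 300, 5
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + rng.normal(scale=0.1, size=n)
y[:60] += rng.normal(scale=10, size=60)        # corrupt 20% of the labels

def ilts(X, y, keep_frac=0.7, n_iter=20):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]         # start from plain OLS
    k = int(keep_frac * len(y))
    for _ in range(n_iter):
        resid = (y - X @ beta) ** 2
        idx = np.argsort(resid)[:k]                     # (a) lowest-loss subset
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]  # (b) refit
    return beta

beta_hat = ilts(X, y)
print("parameter error:", np.linalg.norm(beta_hat - beta_true))
```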
Ben Boukai, Yue Zhang (2018)
We consider a resampling scheme for parameter estimates in nonlinear regression models. We provide an estimation procedure which recycles, via random weighting, the relevant parameter estimates to construct consistent estimates of the sampling distribution of the various estimates. We establish the asymptotic normality of the resampled estimates and demonstrate the applicability of the recycling approach in a small simulation study and via an example.
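A rough sketch of the random-weighting scheme, with the model and weight distribution assumed rather than taken from the paper: each replicate re-solves a weighted nonlinear least squares problem with i.i.d. Exp(1) weights, and the spread of the replicated estimates approximates the sampling distribution:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)
t = np.linspace(0, 4, 80)

def model(t, a, b):
    return a * np.exp(-b * t)

y = model(t, 2.0, 0.8) + rng.normal(scale=0.1, size=t.size)

reps = []
for _ in range(500):
    w = rng.exponential(size=t.size)           # i.i.d. Exp(1) random weights
    # curve_fit minimizes sum(((y - f) / sigma)^2), so sigma = 1/sqrt(w)
    # turns each replicate into a w-weighted least squares fit.
    p, _ = curve_fit(model, t, y, p0=[1.0, 1.0], sigma=1.0 / np.sqrt(w))
    reps.append(p)

reps = np.array(reps)
print("std of (a, b) over replicates:", reps.std(axis=0))
```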