ترغب بنشر مسار تعليمي؟ اضغط هنا

Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection

378   0   0.0 ( 0 )
 نشر من قبل Paul Kabaila
 تاريخ النشر 2007
  مجال البحث الاحصاء الرياضي
والبحث باللغة English




اسأل ChatGPT حول البحث

We consider a linear regression model, with the parameter of interest a specified linear combination of the regression parameter vector. We suppose that, as a first step, a data-based model selection (e.g. by preliminary hypothesis tests or minimizing AIC) is used to select a model. It is common statistical practice to then construct a confidence interval for the parameter of interest based on the assumption that the selected model had been given to us a priori. This assumption is false and it can lead to a confidence interval with poor coverage properties. We provide an easily-computed finite sample upper bound (calculated by repeated numerical evaluation of a double integral) to the minimum coverage probability of this confidence interval. This bound applies for model selection by any of the following methods: minimum AIC, minimum BIC, maximum adjusted R-squared, minimum Mallows Cp and t-tests. The importance of this upper bound is that it delineates general categories of design matrices and model selection procedures for which this confidence interval has poor coverage properties. This upper bound is shown to be a finite sample analogue of an earlier large sample upper bound due to Kabaila and Leeb.



قيم البحث

اقرأ أيضاً

We compare the following two sources of poor coverage of post-model-selection confidence intervals: the preliminary data-based model selection sometimes chooses the wrong model and the data used to choose the model is re-used for the construction of the confidence interval.
We derive a computationally convenient formula for the large sample coverage probability of a confidence interval for a scalar parameter of interest following a preliminary hypothesis test that a specified vector parameter takes a given value in a ge neral regression model. Previously, this large sample coverage probability could only be estimated by simulation. Our formula only requires the evaluation, by numerical integration, of either a double or triple integral, irrespective of the dimension of this specified vector parameter. We illustrate the application of this formula to a confidence interval for the log odds ratio of myocardial infarction when the exposure is recent oral contraceptive use, following a preliminary test that two specified interactions in a logistic regression model are zero. For this real-life data, we compare this large sample coverage probability with the actual coverage probability of this confidence interval, obtained by simulation.
The asymptotic behaviour of the commonly used bootstrap percentile confidence interval is investigated when the parameters are subject to linear inequality constraints. We concentrate on the important one- and two-sample problems with data generated from general parametric distributions in the natural exponential family. The focus of this paper is on quantifying the coverage probabilities of the parametric bootstrap percentile confidence intervals, in particular their limiting behaviour near boundaries. We propose a local asymptotic framework to study this subtle coverage behaviour. Under this framework, we discover that when the true parameters are on, or close to, the restriction boundary, the asymptotic coverage probabilities can always exceed the nominal level in the one-sample case; however, they can be, remarkably, both under and over the nominal level in the two-sample case. Using illustrative examples, we show that the results provide theoretical justification and guidance on applying the bootstrap percentile method to constrained inference problems.
Recently, Kabaila and Wijethunga assessed the performance of a confidence interval centred on a bootstrap smoothed estimator, with width proportional to an estimator of Efrons delta method approximation to the standard deviation of this estimator. Th ey used a testbed situation consisting of two nested linear regression models, with error variance assumed known, and model selection using a preliminary hypothesis test. This assessment was in terms of coverage and scaled expected length, where the scaling is with respect to the expected length of the usual confidence interval with the same minimum coverage probability. They found that this confidence interval has scaled expected length that (a) has a maximum value that may be much greater than 1 and (b) is greater than a number slightly less than 1 when the simpler model is correct. We therefore ask the following question. For a confidence interval, centred on the bootstrap smoothed estimator, does there exist a formula for its data-based width such that, in this testbed situation, it has the desired minimum coverage and scaled expected length that (a) has a maximum value that is not too much larger than 1 and (b) is substantially less than 1 when the simpler model is correct? Using a recent decision-theoretic performance bound due to Kabaila and Kong, it is shown that the answer to this question is `no for a wide range of scenarios.
We consider regression in which one predicts a response $Y$ with a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields and we argue that statistical inference can benefit fr om an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of arguments for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows to graphically characterize stable versus unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error given that the resulting regression generalizes to unseen new environments.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا