
Second-generation p-values: improved rigor, reproducibility, & transparency in statistical analyses

Published by: Jeffrey Blume
Publication date: 2017
Research field: Mathematical Statistics
Language: English





Verifying that a statistically significant result is scientifically meaningful is not only good scientific practice, it is a natural way to control the Type I error rate. Here we introduce a novel extension of the p-value, the second-generation p-value, that formally accounts for scientific relevance and leverages this natural Type I error control. The approach relies on a pre-specified interval null hypothesis that represents the collection of effect sizes that are scientifically uninteresting or practically null. The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses. As such, second-generation p-values indicate when the data are compatible with null hypotheses, or with alternative hypotheses, or when the data are inconclusive. Moreover, second-generation p-values provide a proper scientific adjustment for multiple comparisons and reduce false discovery rates. This is an advance for data-rich environments, where traditional p-value adjustments are needlessly punitive. Second-generation p-values promote transparency, rigor, and reproducibility of scientific results by specifying a priori which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with alternative or null hypotheses.
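The "proportion of data-supported hypotheses that are also null hypotheses" can be illustrated with a minimal sketch. The abstract does not give a formula, so the following rests on assumptions: the data-supported hypotheses are represented by an interval estimate (for example a 95% confidence interval), the interval null is a closed interval of positive width, and a correction factor caps the value at 1/2 when the interval estimate is more than twice as wide as the null interval, as the full paper is understood to describe. The function name `second_generation_p_value` and its parameters are hypothetical, chosen for illustration only.

```python
def second_generation_p_value(est_lo, est_hi, null_lo, null_hi):
    """Sketch of a second-generation p-value for one interval estimate.

    est_lo, est_hi   : bounds of the interval estimate (assumed to stand in
                       for the set of data-supported hypotheses).
    null_lo, null_hi : bounds of the pre-specified interval null hypothesis.
    """
    if est_lo > est_hi or null_lo >= null_hi:
        raise ValueError("intervals must be ordered and the null must have positive width")

    est_len = est_hi - est_lo
    null_len = null_hi - null_lo

    # Degenerate case (assumption): a point estimate is either inside the null or not.
    if est_len == 0:
        return 1.0 if null_lo <= est_lo <= null_hi else 0.0

    # Length of the overlap between the interval estimate and the interval null.
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))

    # Correction for very wide (imprecise) interval estimates: when the estimate
    # is more than twice as wide as the null, the result is bounded by 1/2.
    correction = max(est_len / (2.0 * null_len), 1.0)

    return (overlap / est_len) * correction


if __name__ == "__main__":
    null = (-0.1, 0.1)  # hypothetical "practically null" effect sizes
    print(second_generation_p_value(0.5, 1.5, *null))    # no overlap with the null
    print(second_generation_p_value(-0.05, 0.05, *null)) # estimate inside the null
    print(second_generation_p_value(0.0, 0.2, *null))    # partial overlap
    print(second_generation_p_value(-1.0, 1.0, *null))   # very wide estimate
```

Under these assumptions, a value of 0 means the interval estimate lies entirely outside the null interval (data compatible only with alternative hypotheses), a value of 1 means it lies entirely inside (data compatible only with null hypotheses), and values in between indicate inconclusive data.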




Read also

Many statistical methods have been proposed for variable selection in the past century, but few balance inference and prediction tasks well. Here we report on a novel variable selection approach called Penalized regression with Second-Generation P-Values (ProSGPV). It captures the true model at the best rate achieved by current standards, is easy to implement in practice, and often yields the smallest parameter estimation error. The idea is to use an $\ell_0$ penalization scheme with second-generation p-values (SGPV), instead of traditional ones, to determine which variables remain in a model. The approach yields tangible advantages for balancing support recovery, parameter estimation, and prediction tasks. The ProSGPV algorithm can maintain its good performance even when there is strong collinearity among features or when a high dimensional feature space with p > n is considered. We present extensive simulations and a real-world application comparing the ProSGPV approach with smoothly clipped absolute deviation (SCAD), adaptive lasso (AL), and mini-max concave penalty with penalized linear unbiased selection (MC+). While the last three algorithms are among the current standards for variable selection, ProSGPV has superior inference performance and comparable prediction performance in certain scenarios. Supplementary materials are available online.
Computational methods have reshaped the landscape of modern biology. While the biomedical community is increasingly dependent on computational tools, the mechanisms ensuring open data, open software, and reproducibility are variably enforced by academic institutions, funders, and publishers. Publications may present academic software for which essential materials are or become unavailable, such as source code and documentation. Publications that lack such information compromise the role of peer review in evaluating technical strength and scientific contribution. Incomplete ancillary information for an academic software package may bias or limit any subsequent work produced with the tool. We provide eight recommendations across four different domains to improve reproducibility, transparency, and rigor in computational biology - precisely on the main values which should be emphasized in life science curricula. Our recommendations for improving software availability, usability, and archival stability aim to foster a sustainable data science ecosystem in biomedicine and life science research.
There has recently been considerable interest in addressing the problem of unifying distributed statistical analyses into a single coherent inference. This problem naturally arises in a number of situations, including in big-data settings, when working under privacy constraints, and in Bayesian model choice. The majority of existing approaches have relied upon convenient approximations of the distributed analyses. Although typically being computationally efficient, and readily scaling with respect to the number of analyses being unified, approximate approaches can have significant shortcomings -- the quality of the inference can degrade rapidly with the number of analyses being unified, and can be substantially biased even when unifying a small number of analyses that do not concur. In contrast, the recent Fusion approach of Dai et al. (2019) is a rejection sampling scheme which is readily parallelisable and is exact (avoiding any form of approximation other than Monte Carlo error), albeit limited in applicability to unifying a small number of low-dimensional analyses. In this paper we introduce a practical Bayesian Fusion approach. We extend the theory underpinning the Fusion methodology and, by embedding it within a sequential Monte Carlo algorithm, we are able to recover the correct target distribution. By means of extensive guidance on the implementation of the approach, we demonstrate theoretically and empirically that Bayesian Fusion is robust to increasing numbers of analyses, and coherently unifying analyses which do not concur. This is achieved while being computationally competitive with approximate schemes.
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach to select gene signatures from multiple datasets whose effects are consistently non-zero and account for between-study heterogeneity. We build our model upon some rank-based quantities, facilitating integration over different genomic datasets. A high dimensional penalized Generalized Linear Mixed Model (pGLMM) is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies.
The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only $\Theta(n)$ feature subsets given $n$ observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task produces similar variable importance estimates when different machine learning algorithms are applied.