Stable variable selection for right censored data: comparison of methods

Added by Marie Walschaerts

Publication date 2012

Fields Mathematical Statistics

Language English

Authors Marie Walschaerts

Applications

Instability in model selection is a major concern for data sets containing a large number of covariates. This paper deals with variable selection methodology for high-dimensional problems in which the response variable can be right censored. We focus on new stable variable selection methods based on the bootstrap for two methodologies: the Cox proportional hazards model and survival trees. For the Cox model, we investigate bootstrapping applied to two variable selection techniques: the stepwise algorithm based on the AIC criterion and the L1-penalization of the Lasso. For survival trees, we review two methodologies: bootstrap node-level stabilization and random survival forests. We apply these approaches to two real data sets. We compare the methods on the prediction error rate based on the Harrell concordance index and on the relevance of the interpretation of the corresponding selected models. The aim is to find a compromise between good prediction performance and ease of interpretation for clinicians. Results suggest that, when the number of individuals is small, bootstrapping adapted to L1-penalization in the Cox model or bootstrap node-level stabilization in survival trees gives a good alternative to the random survival forest methodology, which is known to give the smallest prediction error rate but is difficult for non-statisticians to interpret. From a clinical perspective, the complementarity between the methods based on the Cox model and those based on survival trees would make it possible to build reliable models that are easy for the clinician to interpret.
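
The abstract names two ingredients without giving formulas: bootstrap-based selection stability and a prediction error rate derived from the Harrell concordance index. The following is a minimal sketch of both ideas, not the paper's implementation. The function names, the `select` callback, and the tie handling (pairs with tied times are skipped, tied risk scores count as 0.5) are assumptions made for illustration; the abstract does not specify the selection rule, the number of bootstrap samples, or any inclusion threshold.

```python
import random
from collections import Counter


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (simplified).

    A pair (i, j) is comparable when subject i had an observed event
    and a strictly shorter time than subject j. Pairs with tied times
    are skipped here, which is a simplifying assumption.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher predicted risk failed first
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_selection_frequencies(rows, select, n_boot=200, seed=0):
    """Fraction of bootstrap samples in which each variable is selected.

    `select` is a hypothetical callback: it receives a resampled list of
    rows and returns the names of the variables it keeps (for example,
    the nonzero coefficients of a Lasso-penalized Cox fit).
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(rows)
    for _ in range(n_boot):
        # Resample n rows with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        counts.update(set(select(sample)))
    return {name: count / n_boot for name, count in counts.items()}
```

Under these assumptions, a prediction error rate can be taken as one minus the value returned by `harrell_c_index`, and variables with a high frequency in `bootstrap_selection_frequencies` are the candidates for a stable model.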

Bayesian Variable Selection for Multivariate Zero-Inflated Models: Application to Microbiome Count Data

102 - Kyu Ha Lee , Brent A. Coull , Anna-Barbara Moscicki 2017

Microorganisms play critical roles in human health and disease. It is well known that microbes live in diverse communities in which they interact synergistically or antagonistically. Thus, for estimating microbial associations with clinical covariates, multivariate statistical models are preferred. Multivariate models allow one to estimate and exploit complex interdependencies among multiple taxa, yielding more powerful tests of exposure or treatment effects than application of taxon-specific univariate analyses. In addition, the analysis of microbial count data requires special attention because data commonly exhibit zero inflation. To meet these needs, we developed a Bayesian variable selection model for multivariate count data with excess zeros that incorporates information on the covariance structure of the outcomes (counts for multiple taxa), while estimating associations with the mean levels of these outcomes. Although there has been a great deal of effort in zero-inflated models for longitudinal data, little attention has been given to high-dimensional multivariate zero-inflated data modeled via a general correlation structure. Through simulation, we compared performance of the proposed method to that of existing univariate approaches, for both the binary and count parts of the model. When outcomes were correlated, the proposed variable selection method maintained type I error while boosting the ability to identify true associations in the binary component of the model. For the count part of the model, in some scenarios the univariate method had higher power than the multivariate approach. This higher power came at the cost of a highly inflated false discovery rate not observed with the proposed multivariate method. We applied the approach to oral microbiome data from the Pediatric HIV/AIDS Cohort Oral Health Study and identified five species (of 44) associated with HIV infection.

Applications

A kernel log-rank test of independence for right-censored data

65 - Tamara Fernandez , Arthur Gretton , David Rindt 2019

We introduce a general non-parametric independence test between right-censored survival times and covariates, which may be multivariate. Our test statistic has a dual interpretation, first in terms of the supremum of a potentially infinite collection of weight-indexed log-rank tests, with weight functions belonging to a reproducing kernel Hilbert space (RKHS) of functions; and second, as the norm of the difference of embeddings of certain finite measures into the RKHS, similar to the Hilbert-Schmidt Independence Criterion (HSIC) test-statistic. We study the asymptotic properties of the test, finding sufficient conditions to ensure our test correctly rejects the null hypothesis under any alternative. The test statistic can be computed straightforwardly, and the rejection threshold is obtained via an asymptotically consistent Wild Bootstrap procedure. Extensive simulations demonstrate that our testing procedure generally performs better than competing approaches in detecting complex non-linear dependence.

Methodology Machine Learning

Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets

119 - Xinzhi Han , Sen Lei 2018

With the rapid advance of the Internet, search engines (e.g., Google, Bing, Yahoo!) are used by billions of users each day. The main function of a search engine is to locate the most relevant webpages corresponding to what the user requests. This report focuses on the core problem of information retrieval: how to learn the relevance between a document (very often a webpage) and a query given by the user. Our analysis consists of two parts: 1) we use standard statistical methods to select important features among 137 candidates given by information retrieval researchers from Microsoft. We find that not all the features are useful, and give interpretations of the top-selected features; 2) we give baselines on prediction over the real-world dataset MSLR-WEB by using various learning algorithms. We find that boosting-tree and random forest models in general achieve the best prediction performance. This agrees with the mainstream opinion in the information retrieval community that tree-based algorithms outperform the other candidates for this problem.

Applications Information Retrieval

Nonlinear Mixed-effects Scalar-on-function Models and Variable Selection for Kinematic Upper Limb Movement Data

65 - Yafeng Cheng , Jian Qing Shi , Janet Eyre 2016

This paper arises from collaborative research, the aim of which was to model clinical assessments of upper limb function after stroke using 3D kinematic data. We present a new nonlinear mixed-effects scalar-on-function regression model with a Gaussian process prior, focusing on variable selection from a large number of candidates including both scalar and function variables. A novel variable selection algorithm has been developed, namely functional least angle regression (fLARS). As they are essential for this algorithm, we studied the representation of functional variables with different methods and the correlation between a scalar and a group of mixed scalar and functional variables. We also propose two new stopping rules for practical usage. This algorithm is able to do variable selection when the number of variables is larger than the sample size. It is efficient and accurate for both variable selection and parameter estimation. Our comprehensive simulation study showed that the method is superior to other existing variable selection methods. When the algorithm was applied to the analysis of the 3D kinetic movement data, the use of the nonlinear random-effects model and the function variables significantly improved the prediction accuracy for the clinical assessment.

Applications

High-Dimensional Variable Selection and Prediction under Competing Risks with Application to SEER-Medicare Linked Data

140 - Jiayi Hou , Anthony Paravati , Ronghui Xu 2017

Competing risk analysis considers event times due to multiple causes, or of more than one event type. Commonly used regression models for such data include 1) the cause-specific hazards model, which focuses on modeling one type of event while acknowledging other event types simultaneously; and 2) the subdistribution hazards model, which links the covariate effects directly to the cumulative incidence function. Their use, and in particular their statistical properties in the presence of high-dimensional predictors, are largely unexplored. Motivated by an analysis using the linked SEER-Medicare database for the purposes of predicting cancer versus non-cancer mortality for patients with prostate cancer, we study the accuracy of prediction and variable selection of existing statistical learning methods under both models using extensive simulation experiments, including different approaches to choosing penalty parameters in each method. We then apply the optimal approaches to the analysis of the SEER-Medicare data.

Applications
