أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Hemant Ishwaran

93 - Fei Tang , Hemant Ishwaran 2017

Random forest (RF) missing data algorithms are an attractive approach for dealing with missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms but relatively little guidance about their efficacy, which motivated us to study their performance. Using a large, diverse collection of data sets, performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting---the latter class representing a generalization of a new promising imputation algorithm called missForest. Performance of algorithms was assessed by ability to impute data accurately. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.

التعلم الالي

A Machine Learning Alternative to P-values

62 - Min Lu , Hemant Ishwaran 2017

This paper presents an alternative approach to p-values in regression settings. This approach, whose origins can be traced to machine learning, is based on the leave-one-out bootstrap for prediction error. In machine learning this is called the out-o f-bag (OOB) error. To obtain the OOB error for a model, one draws a bootstrap sample and fits the model to the in-sample data. The out-of-sample prediction error for the model is obtained by calculating the prediction error for the model using the out-of-sample data. Repeating and averaging yields the OOB error, which represents a robust cross-validated estimate of the accuracy of the underlying model. By a simple modification to the bootstrap data involving noising up a variable, the OOB method yields a variable importance (VIMP) index, which directly measures how much a specific variable contributes to the prediction precision of a model. VIMP provides a scientifically interpretable measure of the effect size of a variable, we call the predictive effect size, that holds whether the researchers model is correct or not, unlike the p-value whose calculation is based on the assumed correctness of the model. We also discuss a marginal VIMP index, also easily calculated, which measures the marginal effect of a variable, or what we call the discovery effect. The OOB procedure can be applied to both parametric and nonparametric regression models and requires only that the researcher can repeatedly fit their model to bootstrap and modified bootstrap data. We illustrate this approach on a survival data set involving patients with systolic heart failure and to a simulated survival data set where the model is incorrectly specified to illustrate its robustness to model misspecification.

التعلم الالي التعلم الآلي

Consistency of Random Survival Forests

416 - Hemant Ishwaran , Udaya B. Kogalur 2008

We prove uniform consistency of Random Survival Forests (RSF), a newly introduced forest ensemble learner for analysis of right-censored survival data. Consistency is proven under general splitting rules, bootstrapping, and random selection of variab les--that is, under true implementation of the methodology. A key assumption made is that all variables are factors. Although this assumes that the feature space has finite cardinality, in practice the space can be a extremely large--indeed, current computational procedures do not properly deal with this setting. An indirect consequence of this work is the introduction of new computational methodology for dealing with factors with unlimited number of labels.

نظرية الإحصاء نظرية الإحصاء

Random survival forests

164 - Hemant Ishwaran , Udaya B. Kogalur , Eugene H. Blackstone 2008

We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A co nservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R-software package, randomSurvivalForest.

تطبيقات الإحصاء

Orthogonalized smoothing for rescaled spike and slab models

95 - Hemant Ishwaran , Ariadni Papana 2008

Rescaled spike and slab models are a new Bayesian variable selection method for linear regression models. In high dimensional orthogonal settings such models have been shown to possess optimal model selection properties. We review background theory a nd discuss applications of rescaled spike and slab models to prediction problems involving orthogonal polynomials. We first consider global smoothing and discuss potential weaknesses. Some of these deficiencies are remedied by using local regression. The local regression approach relies on an intimate connection between local weighted regression and weighted generalized ridge regression. An important implication is that one can trace the effective degrees of freedom of a curve as a way to visualize and classify curvature. Several motivating examples are presented.

تطبيقات الإحصاء المنهجية

Variable importance in binary regression trees and forests

60 - Hemant Ishwaran 2007

We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends fro m single trees to ensembles of trees and applies to methods like random forests. This is useful because while importance values from random forests are used to screen variables, for example they are used to filter high throughput genomic data in Bioinformatics, very little theory exists about their properties.

التعلم الالي

Spike and slab variable selection: Frequentist and Bayesian strategies

213 - Hemant Ishwaran , J. Sunil Rao 2005

Variable selection in the linear regression model takes many apparent faces from both frequentist and Bayesian standpoints. In this paper we introduce a variable selection method referred to as a rescaled spike and slab model. We study the importance of prior hierarchical specifications and draw connections to frequentist generalized ridge regression estimation. Specifically, we study the usefulness of continuous bimodal priors to model hypervariance parameters, and the effect scaling has on the posterior mean through its relationship to penalization. Several model selection strategies, some frequentist and some Bayesian in nature, are developed and studied theoretically. We demonstrate the importance of selective shrinkage for effective variable selection in terms of risk misclassification, and show this is achieved using the posterior from a rescaled spike and slab model. We also show how to verify a procedures ability to reduce model uncertainty in finite samples using a specialized forward selection strategy. Using this tool, we illustrate the effectiveness of rescaled spike and slab models in reducing model uncertainty.

نظرية الإحصاء نظرية الإحصاء

Discussion of Least angle regression by Efron et al

95 - Hemant Ishwaran 2004

Discussion of ``Least angle regression by Efron et al. [math.ST/0406456]

نظرية الإحصاء نظرية الإحصاء

Random probability measures via Polya sequences: revisiting the Blackwell-MacQueen urn scheme

84 - Hemant Ishwaran , Mahmoud Zarepour 2003

Sufficient conditions are developed for a class of generalized Polya urn schemes ensuring exchangeability. The extended class includes the Blackwell-MacQueen Polya urn and the urn schemes for the two-parameter Poisson-Dirichlet process and finite dimensional Dirichlet priors among others.

الاحتمالات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد