Motivation: Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. The problem of assigning tumours to a known class is a particularly important example that has received considerable attention in the last ten years. Many recently proposed classification methods require some form of dimension reduction of the problem. These methods provide a single model as output and, in most cases, rely on the likelihood function to achieve variable selection. Results: We propose a prediction-based objective function that can be tailored to the requirements of practitioners and used to assess and interpret a given problem. Direct optimization of such a function can be very difficult because the problem is potentially discontinuous and nonconvex. We therefore propose a general variable selection procedure that resembles importance sampling in its exploration of the feature space. Our proposal compares favorably with competing alternatives on two cancer data sets, yielding smaller models with better, or at least comparable, classification errors. Furthermore, by providing a set of selected models instead of a single one, we can construct a network of possible models for a target prediction accuracy level.
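To make the search strategy concrete, the following Python sketch explores the feature space in an importance-sampling-like fashion: small feature subsets are drawn from a sampling distribution that is reweighted according to how well each subset predicts. The classifier, the cross-validated accuracy objective, and the weight-update rule are illustrative assumptions, not the authors' exact procedure; what the sketch preserves is the output of a ranked set of candidate models rather than a single one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sample_and_score(X, y, n_iter=200, subset_size=5, seed=None):
    """Importance-sampling-style search: repeatedly draw small feature
    subsets, score each by cross-validated accuracy, and reweight every
    feature by the performance of the subsets that contained it."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    weights = np.ones(p) / p             # initial sampling distribution
    models = []                          # (subset, score) pairs
    for _ in range(n_iter):
        subset = rng.choice(p, size=subset_size, replace=False, p=weights)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, subset], y, cv=5).mean()
        models.append((tuple(sorted(subset)), score))
        weights[subset] += score         # upweight features in good subsets
        weights /= weights.sum()
    # a set of candidate models, best first, instead of a single model
    return sorted(models, key=lambda m: -m[1])
```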
Isotonic regression, a nonparametric approach for fitting monotonic models to data, has been widely studied from both theoretical and practical perspectives. However, this approach encounters computational and statistical overfitting issues in higher dimensions. To address both concerns, we present an algorithm, which we term Isotonic Recursive Partitioning (IRP), that performs isotonic regression by recursively partitioning the covariate space through the solution of progressively smaller best-cut subproblems. This creates a regularized sequence of isotonic models of increasing complexity that converges to the global isotonic regression solution. Because of the complexity control they offer, the models along the sequence are often more accurate than the unregularized isotonic regression model. We quantify this complexity control by estimating the degrees of freedom along the path. The success of the regularized models in prediction and IRP's favorable computational properties are demonstrated through a series of simulated and real data experiments. We discuss the application of IRP to the search for gene-gene interactions and epistasis, and demonstrate it on data from genome-wide association studies of three common diseases.
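The recursive-partitioning idea can be illustrated in one dimension. The sketch below is a toy analogue, not the paper's multivariate algorithm: it repeatedly splits an ordered sample at the cut that most reduces squared error, accepting only cuts whose block means are nondecreasing, so that stopping at a small depth yields a coarser, regularized fit; the exact isotonic solution is recovered here via scikit-learn's IsotonicRegression. Only local monotonic feasibility is checked, which suffices for this 1-D illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def irp_1d(y, depth):
    """Toy 1-D analogue of Isotonic Recursive Partitioning: recursively
    split the ordered observations at the cut that most reduces the
    within-block sum of squares, keeping only isotonic-feasible cuts."""
    fit = np.full(len(y), y.mean())
    def split(lo, hi, d):
        if d == 0 or hi - lo < 2:
            return
        seg = y[lo:hi]
        # best single cut: minimize within-block sum of squares
        sse = [np.var(seg[:k]) * k + np.var(seg[k:]) * (hi - lo - k)
               for k in range(1, hi - lo)]
        k = int(np.argmin(sse)) + 1
        left, right = seg[:k].mean(), seg[k:].mean()
        if left <= right:                      # isotonic-feasible cut only
            fit[lo:lo + k], fit[lo + k:hi] = left, right
            split(lo, lo + k, d - 1)
            split(lo + k, hi, d - 1)
    split(0, len(y), depth)
    return fit

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = x + rng.normal(scale=0.3, size=200)
coarse = irp_1d(y, depth=3)                       # regularized model on the path
exact = IsotonicRegression().fit_transform(x, y)  # global isotonic solution
```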
Penalized (or regularized) regression, as represented by the Lasso and its variants, has become a standard technique for analyzing high-dimensional data in which the number of variables substantially exceeds the sample size. The performance of penalized regression relies crucially on the choice of the tuning parameter, which determines the amount of regularization and hence the sparsity level of the fitted model. The optimal choice of tuning parameter depends on both the structure of the design matrix and the unknown random error distribution (variance, tail behavior, etc.). This article reviews the current literature on tuning parameter selection for high-dimensional regression from both theoretical and practical perspectives. We discuss various strategies for choosing the tuning parameter to achieve prediction accuracy or support recovery. We also review several recently proposed methods for tuning-free high-dimensional regression.
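As a small illustration of the two goals the review contrasts, the toy Python example below (with illustrative dimensions and a hand-rolled BIC, not any specific method from the review) tunes the Lasso once by cross-validation, which targets prediction accuracy, and once by a BIC-type criterion over the same regularization path, which typically selects a sparser support.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lasso_path

X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=1.0, random_state=0)

# Cross-validation chooses the tuning parameter for prediction accuracy.
cv_alpha = LassoCV(cv=5).fit(X, y).alpha_

# A BIC-type criterion over the same path trades fit against the number
# of nonzero coefficients and typically picks a sparser model, which is
# often preferable when the goal is support recovery.
alphas, coefs, _ = lasso_path(X, y)
n = len(y)
rss = ((y[:, None] - X @ coefs) ** 2).sum(axis=0)   # RSS per tuning value
df = (coefs != 0).sum(axis=0)                       # nonzeros as crude df
bic = n * np.log(rss / n) + np.log(n) * df
bic_alpha = alphas[np.argmin(bic)]
print(f"CV alpha: {cv_alpha:.4f}, BIC alpha: {bic_alpha:.4f}")
```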
We consider regression in which one predicts a response $Y$ with a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors whose functional dependence on the response is fixed or changing, respectively. We introduce stabilized regression, which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to graphically characterize stable versus unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket, a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error, given that the resulting regression generalizes to unseen new environments.
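A rough sketch of the stability idea follows; the paper's definition of stability and its statistical test are more refined, and the exhaustive subset screen, pooled least-squares fit, and residual ANOVA below are assumptions made for illustration. A predictor set is treated as stable when the residuals of a pooled regression show no detectable shift across environments.

```python
import numpy as np
from itertools import combinations
from scipy.stats import f_oneway
from sklearn.linear_model import LinearRegression

def stable_subsets(X, y, env, max_size=2, alpha=0.05):
    """Screen predictor subsets for stability across environments: fit a
    pooled regression on each subset, then test (one-way ANOVA, a crude
    stand-in) whether the residual means shift across environments."""
    keep = []
    for size in range(1, max_size + 1):
        for S in combinations(range(X.shape[1]), size):
            cols = list(S)
            res = y - LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            groups = [res[env == e] for e in np.unique(env)]
            if f_oneway(*groups).pvalue > alpha:   # no detectable shift
                keep.append(S)                     # candidate stable set
    return keep
```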
Recently, a so-called E-MS algorithm was developed for model selection in the presence of missing data. Specifically, it performs the Expectation step (E step) and Model Selection step (MS step) alternately to find the minimum point of the observed generalized information criterion (GIC). In practice, it can be numerically infeasible to perform the MS step in high-dimensional settings. In this paper, we propose a simpler and more feasible generalized EMS (GEMS) algorithm, which merely requires a decrease in the observed GIC in the MS step and includes the original E-MS algorithm as a special case. We obtain several numerical convergence results for the GEMS algorithm under mild conditions. We apply the proposed GEMS algorithm to Gaussian graphical model selection and variable selection in generalized linear models and compare it with existing competitors via numerical experiments. We illustrate its application with three real data sets.
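The iteration itself is easy to state. The skeleton below is a schematic rendering of the loop described in the abstract, with the model class, imputation, GIC, and improvement rule left as user-supplied callables (all hypothetical placeholders); its only substantive requirement, mirroring GEMS, is that the MS step decreases the observed GIC rather than minimizing it globally.

```python
def gems(X_obs, init_model, impute, gic, improve, max_iter=50, tol=1e-6):
    """Skeleton of a GEMS-style iteration: alternate an E step (complete
    the data under the current model) with an MS step that merely
    *decreases* the observed GIC, rather than globally minimizing it as
    the original E-MS algorithm would.

    impute(X_obs, model)   -> completed data under the current model
    gic(X_obs, model)      -> observed generalized information criterion
    improve(X_full, model) -> any model with lower (or equal) GIC
    """
    model = init_model
    crit = gic(X_obs, model)
    for _ in range(max_iter):
        X_full = impute(X_obs, model)          # E step
        model = improve(X_full, model)         # MS step: GIC decrease only
        new_crit = gic(X_obs, model)
        if crit - new_crit < tol:              # monotone decrease => stop
            break
        crit = new_crit
    return model
```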
Penalized variable selection for high-dimensional longitudinal data has received much attention because accounting for the correlation among repeated measurements provides additional, essential information that improves identification and prediction performance. Despite this success, the potential of penalization methods to accommodate structured sparsity in longitudinal studies is far from fully understood. In this article, we develop a sparse group penalization method to conduct a bi-level gene-environment (G$\times$E) interaction study with a repeatedly measured phenotype. Within the quadratic inference function (QIF) framework, the proposed method achieves simultaneous identification of main and interaction effects at both the group and individual levels. Simulation studies show that the proposed method outperforms major competitors. In a case study of asthma data from the Childhood Asthma Management Program (CAMP), we conduct a G$\times$E study using high-dimensional SNP data as the genetic factors and the longitudinal trait forced expiratory volume in one second (FEV1) as the phenotype. Our method leads to improved prediction and identification of main and interaction effects with important implications.
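To show mechanically what a bi-level (sparse group) penalty does, here is a proximal-gradient sketch for a plain least-squares loss; it is a simplified stand-in for the paper's QIF-based estimator, and the penalty weights and step size are illustrative. Grouping a SNP's main effect together with its interaction terms lets the group penalty remove the SNP entirely (group level), while the $\ell_1$ penalty can additionally remove individual interactions (within-group level).

```python
import numpy as np

def sparse_group_lasso(X, y, groups, lam1=0.1, lam2=0.1, n_iter=500):
    """Proximal-gradient sketch of a bi-level (sparse group) penalty for a
    linear model.  `groups` lists index sets, e.g. [[0, 1, 2], [3, 4, 5]],
    each collecting a SNP's main effect plus its G-E interaction terms."""
    n, p = X.shape
    lr = 1.0 / np.linalg.norm(X, 2) ** 2       # conservative step size
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n        # least-squares gradient
        z = beta - lr * grad
        # l1 prox: zeros out individual (within-group) coefficients
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam1, 0)
        # group prox: zeros out a whole SNP group at once
        for g in groups:
            norm = np.linalg.norm(z[g])
            z[g] = 0 if norm <= lr * lam2 else z[g] * (1 - lr * lam2 / norm)
        beta = z
    return beta
```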