High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We describe an algorithm for selecting informative features in high-dimensional data, where each observation comes from one of $K$ different distributions. Our algorithm can be applied in a completely nonparametric setup without any distributional assumptions on the data, and it aims to output those features that contribute the most to the overall distributional variation. At the heart of our method is the recursive application of distribution-free graph-based tests on subsets of the feature set, located at different depths of a hierarchical clustering tree constructed from the data. Our algorithm recovers all truly contributing features with high probability, while ensuring optimal control of false discovery. Finally, we show the superior performance of our method over existing ones on synthetic data, and also demonstrate its utility on a real-life dataset from the domain of climate change.
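As a rough illustration of this style of procedure (not the authors' algorithm, and without its false-discovery control machinery), the sketch below recursively applies a distribution-free minimum-spanning-tree edge-count test to feature subsets defined by a hierarchical clustering of the features; the clustering linkage, permutation count, threshold `alpha`, and all function names are assumptions made for this example.

```python
# Illustrative sketch only: an MST edge-count (Friedman-Rafsky style) permutation test
# applied recursively down a hierarchical clustering tree of the features.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def edge_count_pvalue(X_sub, labels, n_perm=200, rng=None):
    """Permutation p-value of a graph-based K-sample test on a feature subset."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    mst = minimum_spanning_tree(squareform(pdist(X_sub))).toarray()
    edges = np.argwhere(mst > 0)                      # MST edges between observations

    def between_group_edges(lab):
        return np.sum(lab[edges[:, 0]] != lab[edges[:, 1]])

    observed = between_group_edges(labels)
    null = np.array([between_group_edges(rng.permutation(labels)) for _ in range(n_perm)])
    # unusually few between-group edges indicates a distributional difference
    return (np.sum(null <= observed) + 1) / (n_perm + 1)


def recursive_feature_selection(X, labels, alpha=0.05):
    """Recurse down a clustering tree of the features, keeping subsets that test significant."""
    tree = to_tree(linkage(X.T, method="average"))    # cluster the features, not the samples
    selected = []

    def visit(node):
        feats = node.pre_order()                      # feature indices under this node
        if edge_count_pvalue(X[:, feats], labels) > alpha:
            return                                    # no evidence of signal in this subset
        if node.is_leaf():
            selected.append(node.id)
        else:
            visit(node.left)
            visit(node.right)

    visit(tree)
    return sorted(selected)
```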
We consider the problem of variable selection in high-dimensional settings with missing observations among the covariates. To address this relatively understudied problem, we propose a new synergistic procedure -- adaptive Bayesian SLOPE -- which effectively combines the SLOPE method (sorted $l_1$ regularization) with the Spike-and-Slab LASSO method. We position our approach within a Bayesian framework which allows for simultaneous variable selection and parameter estimation, despite the missing values. As with the Spike-and-Slab LASSO, the coefficients are regarded as arising from a hierarchical model consisting of two groups: (1) the spike for the inactive coefficients and (2) the slab for the active ones. However, instead of assigning independent spike priors to each covariate, here we deploy a joint SLOPE spike prior which takes into account the ordering of coefficient magnitudes in order to control for false discoveries. Through extensive simulations, we demonstrate satisfactory performance in terms of power, FDR and estimation bias under a wide range of scenarios. Finally, we analyze a real dataset consisting of patients from Paris hospitals who underwent severe trauma, where we show excellent performance in predicting platelet levels. Our methodology has been implemented in C++ and wrapped in the R package ABSLOPE for public use.
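For readers unfamiliar with the SLOPE ingredient, the snippet below computes the sorted-$l_1$ penalty with a Benjamini-Hochberg-type regularising sequence; it is only a sketch of that single building block (the spike-and-slab mixture, the missing-data handling, and the ABSLOPE implementation itself are not reproduced here), and the choice $q = 0.1$ and the function names are illustrative assumptions.

```python
# Sketch of the sorted-l1 (SLOPE) penalty with a BH-type lambda sequence.
import numpy as np
from scipy.stats import norm


def bh_lambdas(p, q=0.1):
    """Decreasing sequence lambda_i = Phi^{-1}(1 - q*i/(2p)), i = 1, ..., p."""
    i = np.arange(1, p + 1)
    return norm.ppf(1.0 - q * i / (2.0 * p))


def slope_penalty(beta, lambdas):
    """Sorted-l1 norm: sum_i lambda_i * |beta|_(i), with |beta|_(1) >= ... >= |beta|_(p)."""
    return float(np.dot(lambdas, np.sort(np.abs(beta))[::-1]))


beta = np.array([3.0, -0.2, 0.0, 1.5])
print(slope_penalty(beta, bh_lambdas(len(beta))))   # larger coefficients get larger penalties
```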
When fitting statistical models, some predictors are often found to be correlated with each other and to function together. Many group variable selection methods have been developed to select the groups of predictors that are closely related to a continuous or categorical response. These existing methods usually assume that the group structure is well known, for example, groups of variables with similar practical meaning, or dummy variables created from a categorical variable. In practice, however, the exact group structure is rarely known, especially when the variable dimension is large, and as a result the group variable selection results may be unreliable. To address this challenge, we propose a two-stage approach that combines a variable clustering stage and a group variable selection stage. The variable clustering stage uses information from the data to find a group structure, which improves the performance of the existing group variable selection methods. For ultrahigh-dimensional data, where the number of predictors is much larger than the number of observations, we incorporate a variable screening method in the first stage and show the advantages of such an approach. In this article, we compare and discuss the performance of four existing group variable selection methods under different simulation models, with and without the variable clustering stage. The two-stage method shows better performance in terms of prediction accuracy, as well as in the accuracy of selecting active predictors. An athletes dataset is also used to demonstrate the advantages of the proposed method.
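A minimal sketch of such a two-stage pipeline, under our own illustrative assumptions (correlation-based average-linkage clustering in stage one and a basic proximal-gradient group lasso in stage two; the paper instead compares four established group selection methods), is given below.

```python
# Illustrative two-stage sketch: data-driven variable clustering, then group selection.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_variables(X, n_groups):
    """Stage 1: group the predictors by hierarchical clustering of 1 - |correlation|."""
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
    return fcluster(Z, t=n_groups, criterion="maxclust")   # group label for each column


def group_lasso(X, y, groups, lam=0.1, n_iter=500):
    """Stage 2: proximal gradient for 0.5*||y - X b||^2 + lam * sum_g ||b_g||_2."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)                # 1 / Lipschitz constant
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))            # gradient step
        for g in np.unique(groups):
            idx = groups == g
            norm_g = np.linalg.norm(z[idx])
            shrink = max(0.0, 1.0 - step * lam / norm_g) if norm_g > 0 else 0.0
            beta[idx] = shrink * z[idx]                     # blockwise soft-thresholding
    return beta                                             # all-zero blocks are deselected
```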
This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.
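For reference, here is a single-machine sketch of the greedy mRMR rule itself (the distributed Hadoop/Spark implementation that is the subject of the paper is not reproduced); the use of scikit-learn mutual-information estimators and a discrete class label are assumptions of this example.

```python
# Greedy mRMR: pick features maximizing relevance minus average redundancy.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def mrmr(X, y, n_select):
    n_samples, p = X.shape
    relevance = mutual_info_classif(X, y)          # MI of each feature with the class label
    selected, remaining = [], list(range(p))
    mi_cache = {}                                  # feature-feature MI, computed on demand

    def pairwise_mi(j, k):
        key = (min(j, k), max(j, k))
        if key not in mi_cache:
            mi_cache[key] = mutual_info_regression(X[:, [j]], X[:, k])[0]
        return mi_cache[key]

    for _ in range(n_select):
        best, best_score = None, -np.inf
        for j in remaining:
            redundancy = np.mean([pairwise_mi(j, k) for k in selected]) if selected else 0.0
            score = relevance[j] - redundancy      # "difference" (MID) form of the criterion
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected
```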
This paper proposes a two-fold factor model for high-dimensional functional time series (HDFTS), which enables the modeling and forecasting of multi-population mortality under the functional data framework. The proposed model first decomposes the HDFTS into functional time series with lower dimensions (the common feature) and a system of basis functions specific to different cross-sections (the heterogeneity). The lower-dimensional common functional time series are then further reduced into low-dimensional scalar factor matrices. The dimensionally reduced factor matrices can reasonably convey the useful information in the original HDFTS, and all the temporal dynamics contained in the original HDFTS are extracted to facilitate forecasting. The proposed model can be regarded as a general case of several existing functional factor models. Through a Monte Carlo simulation, we demonstrate the performance of the proposed method in model fitting. In an empirical study of Japanese subnational age-specific mortality rates, we show that the proposed model produces more accurate point and interval forecasts of multi-population mortality than the existing functional factor models. The financial impact of the improvements in forecasts is demonstrated through comparisons of life annuity pricing practices.
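As a rough, simplified illustration of the general idea of two successive reductions (our own schematic notation, not the authors' two-fold model, which is more general), one can picture each population's curves expressed through a small set of population-specific basis functions, with the resulting score series in turn driven by a low-dimensional scalar factor process:

$$
X_t^{(i)}(u) \;\approx\; \sum_{j=1}^{r} \phi_j^{(i)}(u)\,\beta_{t,j}^{(i)},
\qquad
\big(\beta_{t,j}^{(i)}\big)_{i,j} \;\approx\; \mathbf{A}\,\mathbf{F}_t,
$$

where $\phi_j^{(i)}$ are basis functions specific to cross-section $i$, $\beta_{t,j}^{(i)}$ are their score series, and $\mathbf{F}_t$ is a low-dimensional scalar factor matrix; forecasting $\mathbf{F}_t$ then propagates to forecasts of every population's curves.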
We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which: (1) accounts for an asymmetric distribution of the observation data; (2) provides a framework for grouped and generalised multiple-degree-of-freedom structures, allowing a more flexible approach to modelling groups of marginal tail dependence in the observation data; and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete-data setting to efficiently handle the presence of missing values in the observation vector. We discuss various special cases of the algorithm that arise from simplified assumptions on the process generating the data. The applicability of the new framework is illustrated on a dataset consisting of the cryptocurrencies with the highest market capitalisation.
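For context, the baseline Student-t probabilistic PCA model that such extensions build on can be written in its standard scale-mixture form (this is the classical formulation, not the asymmetric, grouped-degree-of-freedom generalisation proposed above):

$$
\mathbf{x}_n = \boldsymbol{\mu} + \mathbf{W}\mathbf{z}_n + \boldsymbol{\varepsilon}_n,
\qquad
\mathbf{z}_n \mid u_n \sim \mathcal{N}\big(\mathbf{0},\, u_n^{-1}\mathbf{I}\big),
\qquad
\boldsymbol{\varepsilon}_n \mid u_n \sim \mathcal{N}\big(\mathbf{0},\, u_n^{-1}\sigma^2\mathbf{I}\big),
\qquad
u_n \sim \mathrm{Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right),
$$

so that marginally $\mathbf{x}_n$ follows a multivariate Student-t distribution with $\nu$ degrees of freedom and scale matrix $\mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}$. The proposal above relaxes the single shared $\nu$ (grouped and generalised degrees of freedom), accommodates asymmetry, and separates the tail behaviour of the factors from that of the errors.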