No Arabic abstract
This paper proposes a canonical-correlation-based filter method for feature selection. The sum of squared canonical correlation coefficients is adopted as the feature ranking criterion. The proposed method boosts the computational speed of the ranking criterion in greedy search. The supporting theorems developed for the feature selection method are fundamental to the understanding of the canonical correlation analysis. In empirical studies, a synthetic dataset is used to demonstrate the speed advantage of the proposed method, and eight real datasets are applied to show the effectiveness of the proposed feature ranking criterion in both classification and regression. The results show that the proposed method is considerably faster than the definition-based method, and the proposed ranking criterion is competitive compared with the seven mutual-information-based criteria.
For feature selection and related problems, we introduce the notion of classification game, a cooperative game, with features as players and hinge loss based characteristic function and relate a features contribution to Shapley value based error apportioning (SVEA) of total training error. Our major contribution is ($star$) to show that for any dataset the threshold 0 on SVEA value identifies feature subset whose joint interactions for label prediction is significant or those features that span a subspace where the data is predominantly lying. In addition, our scheme ($star$) identifies the features on which Bayes classifier doesnt depend but any surrogate loss function based finite sample classifier does; this contributes to the excess $0$-$1$ risk of such a classifier, ($star$) estimates unknown true hinge risk of a feature, and ($star$) relate the stability property of an allocation and negative valued SVEA by designing the analogue of core of classification game. Due to Shapley values computationally expensive nature, we build on a known Monte Carlo based approximation algorithm that computes characteristic function (Linear Programs) only when needed. We address the potential sample bias problem in feature selection by providing interval estimates for SVEA values obtained from multiple sub-samples. We illustrate all the above aspects on various synthetic and real datasets and show that our scheme achieves better results than existing recursive feature elimination technique and ReliefF in most cases. Our theoretically grounded classification game in terms of well defined characteristic function offers interpretability (which we formalize in terms of final task) and explainability of our framework, including identification of important features.
High-dimensional variable selection is an important issue in many scientific fields, such as genomics. In this paper, we develop a sure independence feature screening pro- cedure based on kernel canonical correlation analysis (KCCA-SIS, for short). KCCA- SIS is easy to be implemented and applied. Compared to the sure independence screen- ing procedure based on the Pearson correlation (SIS, for short) developed by Fan and Lv [2008], KCCA-SIS can handle nonlinear dependencies among variables. Compared to the sure independence screening procedure based on the distance correlation (DC- SIS, for short) proposed by Li et al. [2012], KCCA-SIS is scale free, distribution free and has better approximation results based on the universal characteristic of Gaussian Kernel (Micchelli et al. [2006]). KCCA-SIS is more general than SIS and DC-SIS in the sense that SIS and DC-SIS correspond to certain choice of kernels. Compared to supremum of Hilbert Schmidt independence criterion-Sure independence screening (sup-HSIC-SIS, for short) developed by Balasubramanian et al. [2013], KCCA-SIS is scale free removing the marginal variation of features and response variables. No model assumption is needed between response and predictors to apply KCCA-SIS and it can be used in ultrahigh dimensional data analysis. Similar to DC-SIS and sup- HSIC-SIS, KCCA-SIS can also be used directly to screen grouped predictors and for multivariate response variables. We show that KCCA-SIS has the sure screening prop- erty, and has better performance through simulation studies. We applied KCCA-SIS to study Autism genes in a spatiotemporal gene expression dataset for human brain development, and obtained better results based on gene ontology enrichment analysis comparing to the other existing methods.
Online feature selection has been an active research area in recent years. We propose a novel diverse online feature selection method based on Determinantal Point Processes (DPP). Our model aims to provide diverse features which can be composed in either a supervised or unsupervised framework. The framework aims to promote diversity based on the kernel produced on a feature level, through at most three stages: feature sampling, local criteria and global criteria for feature selection. In the feature sampling, we sample incoming stream of features using conditional DPP. The local criteria is used to assess and select streamed features (i.e. only when they arrive), we use unsupervised scale invariant methods to remove redundant features and optionally supervised methods to introduce label information to assess relevant features. Lastly, the global criteria uses regularization methods to select a global optimal subset of features. This three stage procedure continues until there are no more features arriving or some predefined stopping condition is met. We demonstrate based on experiments conducted on that this approach yields better compactness, is comparable and in some instances outperforms other state-of-the-art online feature selection methods.
Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A popular model for the joint analysis of multi-type datasets decomposes each data matrix into a low-rank common-variation matrix generated by latent factors shared across all datasets, a low-rank distinctive-variation matrix corresponding to each dataset, and an additive noise matrix. We propose decomposition-based generalized canonical correlation analysis (D-GCCA), a novel decomposition method that appropriately defines those matrices on the L2 space of random variables, whereas most existing methods are developed on its approximation, the Euclidean dot product space. Moreover to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods inadequately consider such orthogonality and can thus suffer from substantial loss of undetected common variation. Our D-GCCA takes one step further than GCCA by separating common and distinctive variations among canonical variables, and enjoys an appealing interpretation from the perspective of principal component analysis. Consistent estimators of our common-variation and distinctive-variation matrices are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale datasets. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
The problem of inferring the direct causal parents of a response variable among a large set of explanatory variables is of high practical importance in many disciplines. Recent work in the field of causal discovery exploits invariance properties of models across different experimental conditions for detecting direct causal links. However, these approaches generally do not scale well with the number of explanatory variables, are difficult to extend to nonlinear relationships, and require data across different experiments. Inspired by {em Debiased} machine learning methods, we study a one-vs.-the-rest feature selection approach to discover the direct causal parent of the response. We propose an algorithm that works for purely observational data, while also offering theoretical guarantees, including the case of partially nonlinear relationships. Requiring only one estimation for each variable, we can apply our approach even to large graphs, demonstrating significant improvements compared to established approaches.