No Arabic abstract
In the research field of big data, one of important issues is how to recover the sequentially changing sets of true features when the data sets arrive sequentially. The paper presents a general framework for online updating variable selection and parameter estimation in generalized linear models with streaming datasets. This is a type of online updating penalty likelihoods with differentiable or non-differentiable penalty function. The online updating coordinate descent algorithm is proposed to solve the online updating optimization problem. Moreover, a tuning parameter selection is suggested in an online updating way. The selection and estimation consistencies, and the oracle property are established, theoretically. Our methods are further examined and illustrated by various numerical examples from both simulation experiments and a real data analysis.
Mixtures-of-Experts (MoE) are conditional mixture models that have shown their performance in modeling heterogeneity in data in many statistical learning approaches for prediction, including regression and classification, as well as for clustering. Their estimation in high-dimensional problems is still however challenging. We consider the problem of parameter estimation and feature selection in MoE models with different generalized linear experts models, and propose a regularized maximum likelihood estimation that efficiently encourages sparse solutions for heterogeneous data with high-dimensional predictors. The developed proximal-Newton EM algorithm includes proximal Newton-type procedures to update the model parameter by monotonically maximizing the objective function and allows to perform efficient estimation and feature selection. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data, compared to the main state-of-the art competitors.
Under the environment of big data streams, it is a common situation where the variable set of a model may change according to the condition of data streams. In this paper, we propose a homogenization strategy to represent the heterogenous models that are gradually updated in the process of data streams. With the homogenized representations, we can easily construct various online updating statistics such as parameter estimation, residual sum of squares and $F$-statistic for the heterogenous updating regression models. The main difference from the classical scenarios is that the artificial covariates in the homogenized models are not identically distributed as the natural covariates in the original models, consequently, the related theoretical properties are distinct from the classical ones. The asymptotical properties of the online updating statistics are established, which show that the new method can achieve estimation efficiency and oracle property, without any constraint on the number of data batches. The behavior of the method is further illustrated by various numerical examples from simulation experiments.
A common problem in machine learning is determining if a variable significantly contributes to a models prediction performance. This problem is aggravated for datasets, such as gene expression datasets, that suffer the worst case of dimensionality: a low number of observations along with a high number of possible explanatory variables. In such scenarios, traditional methods for testing variable statistical significance or constructing variable confidence intervals do not apply. To address these problems, we developed a novel permutation framework for testing the significance of variables in supervised models. Our permutation framework has three main advantages. First, it is non-parametric and does not rely on distributional assumptions or asymptotic results. Second, it not only ranks model variables in terms of relative importance, but also tests for statistical significance of each variable. Third, it can test for the significance of the interaction between model variables. We applied this permutation framework to multi-class classification of the Iris flower dataset and of brain regions in RNA expression data, and using this framework showed variable-level statistical significance and interactions.
In this paper, we introduce a new methodology for Bayesian variable selection in linear regression that is independent of the traditional indicator method. A diagonal matrix $mathbf{G}$ is introduced to the prior of the coefficient vector $boldsymbol{beta}$, with each of the $g_j$s, bounded between $0$ and $1$, on the diagonal serves as a stabilizer of the corresponding $beta_j$. Mathematically, a promising variable has a $g_j$ value that is close to $0$, whereas the value of $g_j$ corresponding to an unpromising variable is close to $1$. This property is proven in this paper under orthogonality together with other asymptotic properties. Computationally, the sample path of each $g_j$ is obtained through Metropolis-within-Gibbs sampling method. Also, in this paper we give two simulations to verify the capability of this methodology in variable selection.
Field observations form the basis of many scientific studies, especially in ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. The removal of systematic variability introduced by the observation process, if possible, can greatly increase the value of this data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This leads to biased estimates when applied to generalized linear models (GLM). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show it reduces systematic detection variability in moth surveys.