A robust estimator for a wide family of mixtures of linear regressions is presented. Robustness comes from jointly adopting the Cluster Weighted Model and an estimator based on trimming and restrictions. The selected model provides the conditional distribution of the response for each group, as in mixtures of regressions, and additionally supplies local distributions for the explanatory variables. A novel version of the restrictions has been devised under this model to control separately the two sources of variability identified in it. This proposal avoids singularities in the log-likelihood, caused by approximate local collinearity in the explanatory variables or by local exact fit in the regressions, and reduces the occurrence of spurious local maximizers. Because of the interaction between the model and the estimator, the procedure naturally resists the harmful influence of bad leverage points when estimating the mixture of regressions, which is still an open issue in the literature. The methodology defines a well-posed statistical problem whose estimator exists and is consistent for the solution of the corresponding population problem under very general conditions. A feasible EM algorithm is also provided to compute the estimator. Many simulated examples and two real datasets illustrate the ability of the procedure, on the one hand, to detect anomalous data and, on the other hand, to identify the true cluster regressions without the influence of contamination.
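For orientation, the model and the two separate restrictions described above can be written out as follows. This is only a sketch in assumed notation: the symbols G (number of groups), \pi_g (weights), b_g and b_g^0 (regression coefficients and intercept), \sigma_g^2 (residual variances), \mu_g and \Sigma_g (local covariate parameters), \alpha (trimming level), and the constants c_\varepsilon and c_X do not appear in the abstract, and Gaussian components are assumed.

```latex
% Cluster Weighted Model: each group has a regression part and a covariate part
p(x, y) = \sum_{g=1}^{G} \pi_g \,
  \phi\!\left(y;\; b_g^{\top} x + b_g^{0},\; \sigma_g^{2}\right)
  \phi_d\!\left(x;\; \mu_g,\; \Sigma_g\right)

% Trimmed log-likelihood: z_i \in \{0, 1\} flags the observations that are kept,
% with \sum_{i=1}^{n} z_i = \lfloor n(1 - \alpha) \rfloor (assumed rounding)
\sum_{i=1}^{n} z_i \log \left[ \sum_{g=1}^{G} \pi_g \,
  \phi\!\left(y_i;\; b_g^{\top} x_i + b_g^{0},\; \sigma_g^{2}\right)
  \phi_d\!\left(x_i;\; \mu_g,\; \Sigma_g\right) \right]

% Two restrictions, one per source of variability
\frac{\max_{g} \sigma_g^{2}}{\min_{g} \sigma_g^{2}} \le c_\varepsilon ,
\qquad
\frac{\max_{g,l} \lambda_l(\Sigma_g)}{\min_{g,l} \lambda_l(\Sigma_g)} \le c_X
```

Bounding the first ratio rules out a local exact fit (some \sigma_g^2 tending to zero), and bounding the second rules out local collinearity (some eigenvalue \lambda_l(\Sigma_g) tending to zero), which are the two sources of singularity the abstract names.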
This work relates to the framework of model-based clustering for spatial functional data, where the data are surfaces. We first introduce a Bayesian spatial spline regression model with mixed effects (BSSR) for modeling spatial functional data. The BSSR model is based on Nodal basis functions for spatial regression and accommodates both the common mean behavior of the data, through a fixed-effects part, and inter-individual variability, through a random-effects part. Then, to model populations of spatial functional data arising from heterogeneous groups, we integrate the BSSR model into a mixture framework. The resulting model is a Bayesian mixture of spatial spline regressions with mixed effects (BMSSR), used for density estimation and model-based surface clustering. Through their Bayesian formulation, the models allow prior knowledge about the data structure to be integrated, and they constitute a good alternative to the recent mixture of spatial spline regressions model estimated in a maximum likelihood framework via the expectation-maximization (EM) algorithm. Bayesian inference is performed by Markov chain Monte Carlo (MCMC) sampling. We derive two Gibbs samplers to infer the BSSR and BMSSR models and apply them to simulated surfaces and to a real handwritten digit recognition problem using the MNIST data set. The results highlight the potential benefit of the proposed Bayesian approaches for modeling surfaces, possibly dispersed into clusters.
Compared with the conditional mean, a simple point estimator, the conditional density function is more informative for describing distributions with multi-modality, asymmetry, or heteroskedasticity. In this paper, we propose a novel parametric conditional density estimation method by showing the connection between the general density and the likelihood function of inhomogeneous Poisson process models. The maximum likelihood estimates can be obtained via weighted logistic regressions, and the computational cost can be significantly reduced by combining a block-wise alternating maximization scheme with local case-control sampling. We also provide simulation studies for illustration.
Empirical researchers often trim observations with a small denominator A when they estimate moments of the form E[B/A]. Large trimming is a common practice to reduce variance, but it incurs a large trimming bias. This paper provides a novel method of correcting large trimming bias. If a researcher is willing to assume that the joint distribution of A and B is smooth, then a large trimming bias can be estimated well. With the bias correction, we also develop a valid and robust inference result for E[B/A].
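The trimming practice described above can be illustrated with a minimal sketch. It assumes one common form of the trimmed estimator, the sample average of B/A with every term where |A| <= h set to zero. The function names and the threshold h are hypothetical, and the bias correction proposed in the paper is not reproduced here.

```python
from typing import Sequence


def trimmed_ratio_mean(a: Sequence[float], b: Sequence[float], h: float) -> float:
    """Average b_i / a_i over all n observations, zeroing terms with |a_i| <= h.

    Assumed form of the trimmed estimator; h = 0 drops only exact zeros.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    if h < 0:
        raise ValueError("the trimming threshold h must be non-negative")
    total = 0.0
    for ai, bi in zip(a, b):
        if abs(ai) > h:  # keep only observations with a large enough denominator
            total += bi / ai
    return total / len(a)


def trimmed_fraction(a: Sequence[float], h: float) -> float:
    """Share of observations removed by the threshold h."""
    return sum(1 for ai in a if abs(ai) <= h) / len(a)


# Toy data: one tiny denominator dominates the untrimmed average.
a_values = [0.01, 1.0, 2.0, 4.0]
b_values = [1.0, 1.0, 1.0, 1.0]
untrimmed = trimmed_ratio_mean(a_values, b_values, 0.0)
trimmed = trimmed_ratio_mean(a_values, b_values, 0.1)
print(untrimmed, trimmed, trimmed_fraction(a_values, 0.1))
```

Raising h stabilizes the average by discarding the near-zero denominators, but the discarded terms are exactly the source of the trimming bias that the paper sets out to correct.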
Using a finite mixture of normal distributions in model-based clustering makes it possible to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and is generally achieved either by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework, we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior whose hyperparameters are carefully selected to reflect the cluster structure aimed at. In addition, this prior allows the model to be estimated using standard MCMC sampling methods. In combination with a post-processing approach that resolves the label switching issue and results in an identified model, our approach makes it possible to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semi-parametric way using finite mixtures of normals, and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark data sets.
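The abstract does not spell out how the number of clusters is read off the MCMC output. A device commonly used with sparse finite mixtures, assumed here rather than taken from the text, is to fit deliberately too many components and count the non-empty ones in each draw, reporting the most frequent count. The function name and the input layout (one list of component labels per MCMC draw) are hypothetical.

```python
from collections import Counter
from typing import Sequence


def estimate_num_clusters(allocation_draws: Sequence[Sequence[int]]) -> int:
    """Return the most frequent number of non-empty components across draws.

    Each element of allocation_draws holds the component label assigned to
    every observation in one MCMC iteration (assumed layout).
    """
    if len(allocation_draws) == 0:
        raise ValueError("at least one MCMC draw is required")
    # A component is non-empty in a draw if at least one observation carries its label.
    nonempty_counts = Counter(len(set(draw)) for draw in allocation_draws)
    # most_common(1) returns the modal count; ties go to the count seen first.
    return nonempty_counts.most_common(1)[0][0]


# Toy example with four draws over five observations.
draws = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 1, 1],
    [3, 3, 1, 1, 1],
    [0, 2, 2, 2, 2],
]
print(estimate_num_clusters(draws))
```

Counting labels rather than comparing them makes the estimate insensitive to label switching, which is why this step can run before the post-processing that identifies the model.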
Parameter estimation for mixture regression models using the expectation-maximization (EM) algorithm is highly sensitive to outliers. Here we propose a fast and efficient robust mixture regression algorithm, called the Component-wise Adaptive Trimming (CAT) method. We consider simultaneous outlier detection and robust parameter estimation to minimize the effect of outlier contamination. Robust mixture regression has many important applications, including human cancer genomics data, where the population often displays strong heterogeneity compounded by unwanted technological perturbations. Existing robust mixture regression methods suffer from outliers because they either conduct parameter estimation in the presence of outliers or rely on prior knowledge of the level of outlier contamination. CAT is implemented in the framework of classification expectation maximization, under which a natural definition of outliers can be derived. It applies a least trimmed squares (LTS) approach within each exclusive mixing component, so that the robustness problem is reduced from the mixture case to the simple linear regression case. The high breakdown point of the LTS approach allows us to avoid pre-specifying the trimming parameter. Compared with multiple existing algorithms, CAT is the most competitive, handling and adaptively trimming outliers as well as heavy-tailed noise across different scenarios of simulated data and real genomic data. CAT has been implemented in the R package RobMixReg, available on CRAN.
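The least trimmed squares step applied within a single component can be sketched as follows. This is a generic illustration using concentration steps on a simple linear regression, not the CAT algorithm itself: the subset size h is fixed by hand here (the adaptive choice described above is not reproduced), the start is a plain least squares fit on all points rather than multiple random starts, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def ols_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lts_c_steps(
    x: Sequence[float], y: Sequence[float], h: int, max_iter: int = 50
) -> Tuple[float, float, List[int]]:
    """Approximate LTS: repeatedly refit on the h points with smallest squared residuals."""
    n = len(x)
    if not 2 <= h <= n:
        raise ValueError("h must satisfy 2 <= h <= n")
    subset = list(range(n))  # start from all points (a simplification)
    for _ in range(max_iter):
        intercept, slope = ols_line([x[i] for i in subset], [y[i] for i in subset])
        sq_res = [(y[i] - intercept - slope * x[i]) ** 2 for i in range(n)]
        new_subset = sorted(sorted(range(n), key=lambda i: sq_res[i])[:h])
        if new_subset == subset:  # the retained subset has stabilized
            break
        subset = new_subset
    # Final fit on the retained subset, so the returned line matches it.
    intercept, slope = ols_line([x[i] for i in subset], [y[i] for i in subset])
    return intercept, slope, subset


# Toy component: points on y = 1 + 2x, with two gross outliers at x = 8 and x = 9.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * xi for xi in xs]
ys[8], ys[9] = 100.0, -50.0
fit_intercept, fit_slope, kept = lts_c_steps(xs, ys, h=8)
print(fit_intercept, fit_slope, kept)
```

Points left out of the returned subset are the candidates for outliers within that component, which mirrors the simultaneous detection and estimation described above.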