Unsupervised learning of regression mixture models with unknown number of components

Published by Faicel Chamroukhi

Publication date: 2014

Research field: Mathematical Statistics, Informatics Engineering

Paper language: English

Author: Faicel Chamroukhi

Regression mixture models are widely studied in statistics, machine learning and data analysis. Fitting regression mixtures is challenging and is usually performed by maximum likelihood using the expectation-maximization (EM) algorithm. However, it is well known that initialization is crucial for EM: if the initialization is inappropriate, the EM algorithm may lead to unsatisfactory results. The EM algorithm also requires the number of clusters to be given a priori; selecting the number of mixture components then requires model selection criteria to choose one from a set of pre-estimated candidate models. We propose a new fully unsupervised algorithm to learn regression mixture models with an unknown number of components. The developed unsupervised learning approach consists of a penalized maximum likelihood estimation carried out by a robust expectation-maximization (EM) algorithm for fitting polynomial, spline and B-spline regression mixtures. The proposed learning approach is fully unsupervised: 1) it simultaneously infers the model parameters and the optimal number of regression mixture components from the data as the learning proceeds, rather than in a two-stage scheme as in standard model-based clustering, which applies model selection criteria afterward, and 2) it does not require accurate initialization, unlike the standard EM for regression mixtures. The developed approach is applied to curve clustering problems. Numerical experiments on simulated data show that the proposed robust EM algorithm performs well and provides accurate results in terms of robustness with regard to initialization and retrieving the optimal partition with the actual number of clusters. An application to real data in the framework of functional data clustering confirms the benefit of the proposed approach for practical applications.
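To make the idea of inferring the number of components during learning more concrete, here is a minimal sketch, not the paper's algorithm. It rests on several assumptions that are not stated in the abstract: components are straight lines rather than polynomial or spline regressions, each observation is a single (x, y) point rather than a whole curve, the penalty is an entropy-style term on the mixing proportions borrowed from robust EM for Gaussian mixtures, the penalty weight `beta` is a fixed constant, and components whose proportion falls below 1/n are discarded. All function and parameter names are hypothetical.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def weighted_line_fit(xs, ys, w):
    """Weighted least squares for y = a + b*x; returns (a, b, noise variance)."""
    sw = max(sum(w), 1e-12)  # guard against an empty component
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    b = sxy / sxx if sxx > 1e-12 else 0.0
    a = my - b * mx
    var = sum(wi * (y - a - b * x) ** 2 for wi, x, y in zip(w, xs, ys)) / sw
    return a, b, max(var, 1e-6)  # variance floor keeps the density finite


def penalized_em(xs, ys, k_init=6, beta=0.5, n_iter=100, seed=0):
    """Hypothetical sketch: EM for a mixture of lines that prunes weak components."""
    rng = random.Random(seed)
    n = len(xs)
    mean_y = sum(ys) / n
    var0 = max(sum((y - mean_y) ** 2 for y in ys) / n, 1e-6)
    # Deliberately crude start: random lines, equal proportions, overall variance.
    comps = [(rng.uniform(min(ys), max(ys)), rng.uniform(-1.0, 1.0), var0)
             for _ in range(k_init)]
    pis = [1.0 / k_init] * k_init
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x, y in zip(xs, ys):
            dens = [p * normal_pdf(y, a + b * x, v) for p, (a, b, v) in zip(pis, comps)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0
                        else [1.0 / len(dens)] * len(dens))
        # Penalized update of the proportions (assumed entropy-style penalty):
        # above-average components are reinforced, below-average ones shrink.
        k = len(pis)
        pi_em = [sum(r[j] for r in resp) / n for j in range(k)]
        avg_log = sum(p * math.log(p) for p in pis)
        pis_new = [pe + beta * p * (math.log(p) - avg_log) for pe, p in zip(pi_em, pis)]
        # Discard components whose proportion drops below 1/n; keep at least one.
        keep = [j for j in range(k) if pis_new[j] >= 1.0 / n]
        if not keep:
            keep = [max(range(k), key=lambda j: pis_new[j])]
        norm = sum(pis_new[j] for j in keep)
        pis = [pis_new[j] / norm for j in keep]
        # Renormalize responsibilities over the surviving components.
        resp = [[r[j] for j in keep] for r in resp]
        resp = [[v / (sum(r) or 1.0) for v in r] for r in resp]
        # M-step: weighted least squares for each surviving line.
        comps = [weighted_line_fit(xs, ys, [r[j] for r in resp])
                 for j in range(len(keep))]
    return pis, comps


# Toy data drawn from two lines; the algorithm starts with six components.
data_rng = random.Random(1)
xs = [data_rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [(2.0 * x if i % 2 == 0 else 5.0 - 3.0 * x) + data_rng.gauss(0.0, 0.1)
      for i, x in enumerate(xs)]
pis, comps = penalized_em(xs, ys)
print(len(pis), [round(p, 2) for p in pis])
```

The number of surviving components depends on `beta` and on the random start, so the printed count is not guaranteed to match the two generating lines. The sketch only illustrates how parameter estimation and component selection can happen inside a single EM loop instead of fitting several candidate models and comparing them afterward.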

R. Di Mari, R. Rocci, 2018

We consider an equivariant approach imposing data-driven bounds for the variances to avoid singular and spurious solutions in maximum likelihood (ML) estimation of clusterwise linear regression models. We investigate its use in the choice of the number of components and we propose a computational shortcut, which significantly reduces the computational time needed to tune the bounds on the data. In the simulation study and the two real-data applications, we show that the proposed methods guarantee a reliable assessment of the number of components compared to standard unconstrained methods, together with accurate model parameter estimation and cluster recovery.

Computation

A class of regression models for parallel and series systems with a random number of components

Alice L. Morais, Silvia L. P. Ferrari, 2014

In this paper we extend the Weibull power series (WPS) class of distributions and name the new class the extended Weibull power series (EWPS) class of distributions. The EWPS distributions are related to series and parallel systems with a random number of components, whereas the WPS distributions (Morais and Barreto-Souza, 2011) are related to series systems only. Unlike the WPS distributions, for which the Weibull is a limiting special case, the Weibull law is a particular case of the EWPS distributions. We prove that the distributions in this class are identifiable under a simple assumption. We also prove stochastic and hazard rate order results and highlight that the shapes of the EWPS distributions are markedly more flexible than the shapes of the WPS distributions. We define a regression model for the EWPS response random variable to model a scale parameter and its quantiles. We present the maximum likelihood estimator and prove its consistency and normal asymptotic distribution. Although the construction of this class was motivated by series and parallel systems, the EWPS distributions are suitable for modeling a wide range of positive data sets. To illustrate potential uses of this model, we apply it to a real data set on the tensile strength of coconut fibers and present a simple device for diagnostic purposes.

Methodology

A Mixture of Linear-Linear Regression Models for Linear-Circular Regression

Ali Esmaieeli Sikaroudi, Chiwoo Park, 2016

We introduce a new approach to a linear-circular regression problem that relates multiple linear predictors to a circular response. We follow a modeling approach based on the wrapped normal distribution, which describes angular variables and angular distributions, and advance it for linear-circular regression analysis. Some previous works model a circular variable as projection of a bivariate Gaussian random vector on the unit square, and the statistical inference of the resulting model involves complicated sampling steps. The proposed model treats circular responses as the result of the modulo operation on unobserved linear responses. The resulting model is a mixture of multiple linear-linear regression models. We present two EM algorithms for maximum likelihood estimation of the mixture model, one for a parametric model and another for a non-parametric model. The estimation algorithms provide a good trade-off between computation and estimation accuracy, which is shown numerically using five examples. The proposed approach was applied to a problem of estimating wind directions that typically exhibit complex patterns with large variation and circularity.

Methodology

Mixture composite regression models with multi-type feature selection

Tsz Chai Fung, George Tzougas, Mario Wuthrich, 2021

The aim of this paper is to present a mixture composite regression model for claim severity modelling. Claim severity modelling poses several challenges such as multimodality, heavy-tailedness and systematic effects in data. We tackle this modelling problem by studying a mixture composite regression model for simultaneous modelling of attritional and large claims, and for considering systematic effects in both the mixture components and the mixing probabilities. For model fitting, we present a group-fused regularization approach that allows us to select the explanatory variables which significantly impact the mixing probabilities and the different mixture components, respectively. We develop an asymptotic theory for this regularized estimation approach, and fitting is performed using a novel Generalized Expectation-Maximization algorithm. We exemplify our approach on a real motor insurance data set.

Methodology, Econometrics, Statistics Applications

Bayesian inference for continuous-time hidden Markov models with an unknown number of states

Yu Luo, David A. Stephens, 2021

We consider the modeling of data generated by a latent continuous-time Markov jump process with a state space of finite but unknown dimension. Typically in such models, the number of states has to be pre-specified, and Bayesian inference for a fixed number of states has not been studied until recently. In addition, although approaches to address the problem for discrete-time models have been developed, no method has been successfully implemented for the continuous-time case. We focus on reversible jump Markov chain Monte Carlo, which allows trans-dimensional moves among different numbers of states in order to perform Bayesian inference for the unknown number of states. Specifically, we propose an efficient split-combine move which can facilitate the exploration of the parameter space, and demonstrate that it can be implemented effectively at scale. Subsequently, we extend this algorithm to the context of model-based clustering, allowing the numbers of states and clusters both to be determined during the analysis. The model formulation, inference methodology, and associated algorithm are illustrated by simulation studies. Finally, we apply this method to real data from a Canadian healthcare system in Quebec.

Methodology, Computation