We propose a parsimonious extension of the classical latent class model for clustering categorical data that relaxes the class-conditional independence assumption. Under this new mixture model, named the Conditional Modes Model, variables are grouped into conditionally independent blocks. Each block follows a parsimonious multinomial distribution in which the few free parameters correspond to the most likely modality crossings, while the remaining probability mass is spread uniformly over the other modality crossings. The proposed model thus brings out the intra-class dependency between variables and summarizes each class by a few characteristic modality crossings. Model selection is performed via a Metropolis-within-Gibbs sampler to overcome the computational intractability of the block structure search. As this approach involves computing the integrated complete-data likelihood, we propose a new method (exact for the continuous parameters and approximate for the discrete ones) which avoids the biases of the BIC criterion pointed out by our experiments. Finally, the parameters are estimated only for the best model, via an EM algorithm. The characteristics of the new model are illustrated on simulated data and on two biological data sets. These results strengthen the idea that this simple model reduces the biases induced by the conditional independence assumption and yields meaningful parameters. Both applications were performed with the R package CoModes.
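The classical latent class model that this abstract extends assumes conditional independence of the variables within each class. As a point of reference, here is a minimal EM sketch for that baseline model on illustrative binary data (all data and names are made up for illustration; this is not the CoModes implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative binary data from two latent classes with class-conditional
# independence: each class has its own Bernoulli probability per variable.
true_p = np.array([[0.9, 0.9, 0.1, 0.1],
                   [0.1, 0.1, 0.9, 0.9]])
z = rng.integers(0, 2, 500)                      # hidden class labels
X = (rng.random((500, 4)) < true_p[z]).astype(float)

# EM for the classical latent class model.
K = 2
n, d = X.shape
pi = np.full(K, 1.0 / K)                         # class proportions
p = rng.uniform(0.3, 0.7, (K, d))                # item probabilities per class
for _ in range(100):
    # E-step: posterior class memberships under within-class independence
    log_r = np.log(pi) + X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update proportions and per-class Bernoulli parameters
    nk = r.sum(axis=0)
    pi = nk / n
    p = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)

# Recovered partition (up to label switching)
labels = r.argmax(axis=1)
acc = max((labels == z).mean(), (labels != z).mean())
```

With well-separated classes such as these, the recovered partition closely matches the simulated labels; the biases the abstract refers to appear when the within-class independence assumed in the E-step is violated.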
In this paper, we present a Weibull link (skewed) model for categorical response data arising from binomial as well as multinomial models. We show that, for such categorical data, the most commonly used models (logit, probit and complementary log-log) can be obtained as limiting cases. We further compare the proposed model with some other asymmetric models. Bayesian as well as frequentist estimation procedures for binomial and multinomial responses are presented in detail. Two data sets are analyzed to demonstrate the efficiency of the proposed model.
Determining the number G of components in a finite mixture distribution is an important and difficult inference issue. It is an important question because statistical inference about the resulting model is highly sensitive to the value of G; selecting an erroneous value may produce a poor density estimate. It is a difficult question from a theoretical perspective, as it relates to unidentifiability issues of the mixture model. It is also a highly relevant question from a practical viewpoint, since the meaning of the number of components G is strongly tied to the modelling purpose of the mixture distribution. In this chapter we distinguish between selecting G as a density estimation problem in Section 2 and selecting G in a model-based clustering framework in Section 3. Both sections discuss frequentist as well as Bayesian approaches. We present some of the Bayesian solutions to the different interpretations of picking the right number of components in a mixture, before concluding on the ill-posed nature of the question.
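A standard frequentist answer to choosing G is to fit mixtures over a range of G and compare an information criterion such as BIC. A minimal numpy sketch for a one-dimensional Gaussian mixture with a hand-rolled EM (the data, initialization, and iteration count are illustrative assumptions, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: two well-separated Gaussian components.
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])


def loglik(x, w, mu, var):
    """Total log-likelihood of a 1-D Gaussian mixture."""
    log_dens = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
                - (x[:, None] - mu) ** 2 / (2 * var))
    return np.logaddexp.reduce(log_dens, axis=1).sum()


def em_gmm(x, G, n_iter=200):
    """Plain EM for a 1-D Gaussian mixture with G components."""
    n = len(x)
    w = np.full(G, 1.0 / G)                          # mixing proportions
    mu = np.quantile(x, np.linspace(0.1, 0.9, G))    # spread-out initial means
    var = np.full(G, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, g]
        log_dens = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
                    - (x[:, None] - mu) ** 2 / (2 * var))
        r = np.exp(log_dens - np.logaddexp.reduce(log_dens, axis=1)[:, None])
        # M-step: weighted updates of proportions, means, variances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-6)
    return w, mu, var


def bic(x, G):
    # Free parameters: (G - 1) weights + G means + G variances.
    w, mu, var = em_gmm(x, G)
    return -2 * loglik(x, w, mu, var) + (3 * G - 1) * np.log(len(x))


scores = {G: bic(x, G) for G in range(1, 5)}
best_G = min(scores, key=scores.get)
```

On data like these, BIC selects G = 2; the chapter's point is that such criteria answer the density estimation question, which need not coincide with the clustering interpretation of G.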
Beta regression has been extensively used by statisticians and practitioners to model bounded continuous data, and it has no strong competitor sharing its main features. The class of normalized inverse-Gaussian (N-IG) processes was introduced in the literature and explored in the Bayesian context as a powerful alternative to the Dirichlet process. Until now, however, the univariate N-IG distribution has received no attention in classical inference. In this paper, we propose bessel regression, based on the univariate N-IG distribution, as a robust alternative to the beta model. This robustness is illustrated through simulated and real data applications. Parameter estimation is carried out via an Expectation-Maximization algorithm, and the paper discusses how to perform inference. A useful and practical discrimination procedure is proposed for model selection between bessel and beta regression. Monte Carlo simulation results verify the finite-sample behavior of the EM-based estimators and of the discrimination procedure. Further, the performance of the two regressions is evaluated under misspecification, a critical point demonstrating the robustness of the proposed model. Finally, three empirical illustrations confront results from bessel and beta regression.
Diffusion tensor imaging (DTI) is a popular magnetic resonance imaging technique used to characterize microstructural changes in the brain. DTI studies quantify the diffusion of water molecules in a voxel using an estimated 3×3 symmetric positive definite diffusion tensor matrix. Statistical analysis of DTI data is challenging because the data are positive definite matrices. Matrix-variate information is often summarized by a univariate quantity, such as the fractional anisotropy (FA), leading to a loss of information. Furthermore, DTI analyses often ignore the spatial association of neighboring voxels, which can lead to imprecise estimates. Although the spatial modeling literature is abundant, modeling spatially dependent positive definite matrices is challenging. To mitigate these issues, we propose a matrix-variate Bayesian semiparametric mixture model, where the positive definite matrices are distributed as a mixture of inverse Wishart distributions with the spatial dependence captured by a Markov model for the mixture component labels. Conjugacy and the double Metropolis-Hastings algorithm result in fast and elegant Bayesian computing. Our simulation study shows that the proposed method is more powerful than non-spatial methods. We also apply the proposed method to investigate the effect of cocaine use on brain structure. The contribution of our work is to provide a novel statistical inference tool for DTI analysis by extending spatial statistics to matrix-variate data.
An extension of the latent class model is presented for clustering categorical data by relaxing the classical class-conditional independence assumption. The model consists of grouping the variables into inter-independent and intra-dependent blocks, in order to capture the main intra-class correlations. The dependency between variables grouped within the same block of a class is taken into account by mixing two extreme distributions, namely the independence distribution and the maximum dependency distribution. When the variables are dependent given the class, this approach is expected to reduce the biases of the latent class model; indeed, it produces a meaningful dependency model with only a few additional parameters. The parameters are estimated by maximum likelihood via an EM algorithm. Moreover, a Gibbs sampler is used for model selection, to overcome the computational intractability of the combinatorial problems involved in the block structure search. Two applications to medical and biological data sets show the relevance of this new model. The results strengthen the view that this model is meaningful and that it reduces the biases induced by the conditional independence assumption of the latent class model.