Model-based clustering for conditionally correlated categorical data

162 0 0.0 ( 0 )

Download Cite

Added by Matthieu Marbac

Publication date 2014

fields Mathematical Statistics

and research's language is English

Authors Matthieu Marbac - Christophe Biernacki - Vincent Vandewalle

Computation

visit our facebook page

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

An extension of the latent class model is presented for clustering categorical data by relaxing the classical class conditional independence assumption of variables. This model consists in grouping the variables into inter-independent and intra-dependent blocks, in order to consider the main intra-class correlations. The dependency between variables grouped inside the same block of a class is taken into account by mixing two extreme distributions, which are respectively the independence and the maximum dependency. When the variables are dependent given the class, this approach is expected to reduce the biases of the latent class model. Indeed, it produces a meaningful dependency model with only a few additional parameters. The parameters are estimated, by maximum likelihood, by means of an EM algorithm. Moreover, a Gibbs sampler is used for model selection in order to overcome the computational intractability of the combinatorial problems involved by the block structure search. Two applications on medical and biological data sets show the relevance of this new model. The results strengthen the view that this model is meaningful and that it reduces the biases induced by the conditional independence assumption of the latent class model.

rate research

Ultimate Polya Gamma Samplers -- Efficient MCMC for possibly imbalanced binary and categorical data

117 - Sylvia Fruhwirth-Schnatter , Gregor Zens , Helga Wagner 2020

Modeling binary and categorical data is one of the most commonly encountered tasks of applied statisticians and econometricians. While Bayesian methods in this context have been available for decades now, they often require a high level of familiarity with Bayesian statistics or suffer from issues such as low sampling efficiency. To contribute to the accessibility of Bayesian models for binary and categorical data, we introduce novel latent variable representations based on Polya Gamma random variables for a range of commonly encountered discrete choice models. From these latent variable representations, new Gibbs sampling algorithms for binary, binomial and multinomial logistic regression models are derived. All models allow for a conditionally Gaussian likelihood representation, rendering extensions to more complex modeling frameworks such as state space models straight-forward. However, sampling efficiency may still be an issue in these data augmentation based estimation frameworks. To counteract this, MCMC boosting strategies are developed and discussed in detail. The merits of our approach are illustrated through extensive simulations and a real data application.

Computation Methodology

Parameterized Complexity of Feature Selection for Categorical Data Clustering

92 - Sayan Bandyapadhyay , Fedor V. Fomin , Petr A. Golovach 2021

We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $Sigma$), we want to select $m-ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|Sigma|)cdot m^{g(k,|Sigma|)}cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.

Data Structures and Algorithms Discrete Mathematics

Clustering categorical data via ensembling dissimilarity matrices

244 - Saeid Amiri , Bertrand Clarke , Jennifer Clarke 2015

We present a technique for clustering categorical data by generating many dissimilarity matrices and averaging over them. We begin by demonstrating our technique on low dimensional categorical data and comparing it to several other techniques that have been proposed. Then we give conditions under which our method should yield good results in general. Our method extends to high dimensional categorical data of equal lengths by ensembling over many choices of explanatory variables. In this context we compare our method with two other methods. Finally, we extend our method to high dimensional categorical data vectors of unequal length by using alignment techniques to equalize the lengths. We give examples to show that our method continues to provide good results, in particular, better in the context of genome sequences than clusterings suggested by phylogenetic trees.

Machine Learning

Bayesian Effect Fusion for Categorical Predictors

70 - Daniela Pauger , Helga Wagner 2017

In this paper, we propose a Bayesian approach to obtain a sparse representation of the effect of a categorical predictor in regression type models. As the effect of a categorical predictor is captured by a group of level effects, sparsity cannot only be achieved by excluding single irrelevant level effects but also by excluding the whole group of effects associated to a predictor or by fusing levels which have essentially the same effect on the response. To achieve this goal, we propose a prior which allows for almost perfect as well as almost zero dependence between level effects a priori. We show how this prior can be obtained by specifying spike and slab prior distributions on all effect differences associated to one categorical predictor and how restricted fusion can be implemented. An efficient MCMC method for posterior computation is developed. The performance of the proposed method is investigated on simulated data. Finally, we illustrate its application on real data from EU-SILC.

Computation

Bayesian model-based clustering for multiple network data

98 - Anastasia Mantziou , Simon Lunagomez , Robin Mitra 2021

There is increasing appetite for analysing multiple network data. This is different to analysing traditional data sets, where now each observation in the data comprises a network. Recent technological advancements have allowed the collection of this type of data in a range of different applications. This has inspired researchers to develop statistical models that most accurately describe the probabilistic mechanism that generates a network population and use this to make inferences about the underlying structure of the network data. Only a few studies developed to date consider the heterogeneity that can exist in a network population. We propose a Mixture of Measurement Error Models for identifying clusters of networks in a network population, with respect to similarities detected in the connectivity patterns among the networks nodes. Extensive simulation studies show our model performs well in both clustering multiple network data and inferring the model parameters. We further apply our model on two real world multiple network data sets resulting from the fields of Computing (Human Tracking Systems) and Neuroscience.

Applications