Model-based clustering for random hypergraphs

Added by Tin Lok James Ng

Publication date 2018

Field Mathematical Statistics

Language English

Authors Tin Lok James Ng - Thomas Brendan Murphy

Methodology

Abstract

A probabilistic model for random hypergraphs is introduced to represent unary, binary and higher-order interactions among objects in real-world problems. This model is an extension of the Latent Class Analysis model, which captures clustering structures among objects. An EM (expectation-maximization) algorithm with MM (minorization-maximization) steps is developed to perform parameter estimation, while a cross-validated likelihood approach is employed to perform model selection. The developed model is applied to three real-world data sets.
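As a point of reference for the Latent Class Analysis model that the abstract extends, the following is a minimal sketch of plain EM for binary data. It is not the hypergraph model of the paper and contains no MM steps; the function name `lca_em`, its parameters, and the toy data in the usage note are assumptions made for illustration only.

```python
import math
import random


def lca_em(data, n_classes, n_iter=100, seed=0):
    """Fit a basic latent class model to binary rows with plain EM.

    Illustrative sketch only: standard LCA, not the hypergraph extension.
    Returns (class weights, item probabilities, responsibilities).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # theta[k][j] = P(x_j = 1 | class k); random start away from 0 and 1
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)]
             for _ in range(n_classes)]
    eps = 1e-9  # keeps probabilities strictly inside (0, 1)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every row
        resp = []
        for x in data:
            logp = []
            for k in range(n_classes):
                lp = math.log(weights[k])
                for j in range(d):
                    p = theta[k][j]
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            m = max(logp)  # subtract the max for numerical stability
            unnorm = [math.exp(v - m) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: closed-form updates of weights and item probabilities
        for k in range(n_classes):
            nk = sum(r[k] for r in resp)
            weights[k] = max(nk / n, eps)
            for j in range(d):
                num = sum(r[k] * x[j] for r, x in zip(resp, data))
                theta[k][j] = min(max(num / max(nk, eps), eps), 1.0 - eps)
    return weights, theta, resp
```

For example, calling `lca_em(rows, 2)` on rows drawn from two clearly different binary patterns should assign the two patterns to different classes; which class receives which label depends on the random start.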

Effect fusion using model-based clustering

Gertraud Malsiner-Walli, Daniela Pauger, Helga Wagner, 2017

In social and economic studies many of the collected variables are measured on a nominal scale, often with a large number of categories. The definition of categories is usually ambiguous, and different classification schemes using either a finer or a coarser grid are possible. Categorisation has an impact when such a variable is included as a covariate in a regression model: too fine a grid will result in imprecise estimates of the corresponding effects, whereas with too coarse a grid important effects will be missed, resulting in biased effect estimates and poor predictive performance. To achieve automatic grouping of levels with essentially the same effect, we adopt a Bayesian approach and specify the prior on the level effects as a location mixture of spiky normal components. Fusion of level effects is induced by a prior on the mixture weights which encourages empty components. Model-based clustering of the effects during MCMC sampling makes it possible to simultaneously detect categories which have essentially the same effect size and identify variables with no effect at all. The properties of this approach are investigated in simulation studies. Finally, the method is applied to analyse effects of high-dimensional categorical predictors on income in Austria.

Methodology

Model-based clustering of Gaussian copulas for mixed data

Matthieu Marbac, Christophe Biernacki, 2014

Clustering mixed data is a challenging problem. In a probabilistic framework, the main difficulty is a shortage of conventional distributions for such data. In this paper, we propose to cluster mixed data with a Gaussian copula mixture model, since copulas, and in particular Gaussian ones, are powerful tools for easily modelling the distribution of multivariate variables. Indeed, considering a mix of continuous, integer and ordinal variables (thus all having a cumulative distribution function), this copula mixture model defines intra-component dependencies similar to those of a Gaussian mixture, and so with the classical meaning of correlation. Simultaneously, it preserves standard margins associated with continuous, integer and ordered features, namely the Gaussian, the Poisson and the ordered multinomial distributions. As an interesting by-product, the proposed mixture model generalizes many well-known ones and also provides visualization tools based on the parameters. At a practical level, Bayesian inference is retained and is carried out with a Metropolis-within-Gibbs sampler. Experiments on simulated and real data sets finally illustrate the expected advantages of the proposed model for mixed data: flexible and meaningful parametrization combined with visualization features.

Methodology

Model-based clustering based on sparse finite Gaussian mixtures

Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, Bettina Grün, 2016

In the framework of Bayesian model-based clustering based on a finite mixture of Gaussian distributions, we present a joint approach to estimate the number of mixture components and identify cluster-relevant variables simultaneously as well as to obtain an identified model. Our approach consists in specifying sparse hierarchical priors on the mixture weights and component means. In a deliberately overfitting mixture model the sparse prior on the weights empties superfluous components during MCMC. A straightforward estimator for the true number of components is given by the most frequent number of non-empty components visited during MCMC sampling. Specifying a shrinkage prior, namely the normal gamma prior, on the component means leads to improved parameter estimates as well as identification of cluster-relevant variables. After estimating the mixture model using MCMC methods based on data augmentation and Gibbs sampling, an identified model is obtained by relabeling the MCMC output in the point process representation of the draws. This is performed using $K$-centroids cluster analysis based on the Mahalanobis distance. We evaluate our proposed strategy in a simulation setup with artificial data and by applying it to benchmark data sets.
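The estimator for the number of components described in this abstract (the most frequent number of non-empty components across MCMC draws) can be sketched as follows. The input format, one list of component labels per MCMC draw, and the function name `estimate_num_components` are assumptions for illustration; the sampler itself is not shown.

```python
from collections import Counter


def estimate_num_components(allocation_draws):
    """Return the most frequent number of non-empty components.

    allocation_draws: iterable of draws, each a sequence holding the
    component label assigned to every observation in that draw
    (hypothetical input format).
    """
    # A component is non-empty in a draw if at least one observation
    # carries its label, so count the distinct labels per draw.
    non_empty_counts = Counter(len(set(draw)) for draw in allocation_draws)
    # Take the mode of those counts over all draws.
    return non_empty_counts.most_common(1)[0][0]
```

With draws such as `[[0, 0, 1, 1], [0, 1, 1, 2], [0, 0, 1, 1]]`, two of the three draws use two non-empty components, so the function returns 2.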

Methodology

Objective Bayesian meta-analysis based on generalized multivariate random effects model

Olha Bodnar, Taras Bodnar, 2021

Objective Bayesian inference procedures are derived for the parameters of the multivariate random effects model generalized to elliptically contoured distributions. The posterior for the overall mean vector and the between-study covariance matrix is deduced by assigning two noninformative priors to the model parameter, namely the Berger and Bernardo reference prior and the Jeffreys prior, whose analytical expressions are obtained under weak distributional assumptions. It is shown that the only condition needed for the posterior to be proper is that the sample size is larger than the dimension of the data-generating model, independently of the class of elliptically contoured distributions used in the definition of the generalized multivariate random effects model. The theoretical findings of the paper are applied to real data consisting of ten studies about the effectiveness of hypertension treatment for reducing blood pressure where the treatment effects on both the systolic blood pressure and diastolic blood pressure are investigated.

Methodology Statistics Theory

The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

Fionn Murtagh, 2008

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss applications to very high frequency time series segmentation and modeling.
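One simple way to quantify a degree of ultrametricity is to count how many point triplets satisfy the ultrametric triangle property, in which the two largest pairwise distances are equal. The sketch below is a simplified check under that assumption and is not necessarily the coefficient used in the paper; the function name `ultrametricity_fraction` and the relative tolerance `tol` are hypothetical.

```python
import itertools
import math


def ultrametricity_fraction(points, tol=0.05):
    """Fraction of triplets whose two largest distances nearly coincide.

    In an ultrametric space every triangle is isosceles with a small
    base, so the two largest sides are equal. Simplified illustration.
    """
    hits = 0
    total = 0
    for a, b, c in itertools.combinations(points, 3):
        # Euclidean side lengths, sorted in increasing order
        d1, d2, d3 = sorted((math.dist(a, b),
                             math.dist(a, c),
                             math.dist(b, c)))
        total += 1
        # Accept the triangle if the two largest sides agree within tol
        if d3 - d2 <= tol * d3:
            hits += 1
    return hits / total if total else 0.0
```

The three unit vectors in three dimensions form an equilateral triangle, so the fraction is 1.0, whereas three collinear points at positions 0, 1 and 3 give 0.0.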

Methodology General Mathematics