Cluster Analysis via Random Partition Distributions

Added by David B. Dahl

Publication date 2021

Field: Mathematical Statistics

Language: English

Authors: David B. Dahl, Jacob Andros, J. Brandon Carter

Methodology

Abstract

Hierarchical and k-medoids clustering are deterministic clustering algorithms based on pairwise distances. Using these same pairwise distances, we propose a novel stochastic clustering method based on random partition distributions. We call our method CaviarPD, for cluster analysis via random partition distributions. CaviarPD first samples clusterings from a random partition distribution and then finds the best cluster estimate based on these samples using algorithms to minimize an expected loss. We compare CaviarPD with hierarchical and k-medoids clustering through eight case studies. Cluster estimates based on our method are competitive with those of hierarchical and k-medoids clustering. They also do not require the subjective choice of the linkage method necessary for hierarchical clustering. Furthermore, our distribution-based procedure provides an intuitive graphical representation to assess clustering uncertainty.
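The two-stage idea described above (sample clusterings from a distribution driven by pairwise distances, then pick the estimate that minimizes an expected loss) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sequential-allocation sampler, the `temperature` and `mass` parameters, and the use of Binder loss over the sampled partitions are simple stand-ins, not the distribution or search algorithm used by the authors.

```python
import math
import random


def sample_partition(dist, temperature, mass, rng):
    """Draw one partition by sequential allocation in a random order.

    Hypothetical sampler: an item joins an existing cluster with weight
    equal to its summed similarity exp(-temperature * d) to the members,
    or opens a new cluster with weight `mass`.
    """
    n = len(dist)
    order = list(range(n))
    rng.shuffle(order)
    labels = [-1] * n
    clusters = []  # each entry lists the item indices in that cluster
    for i in order:
        weights = [
            sum(math.exp(-temperature * dist[i][j]) for j in members)
            for members in clusters
        ]
        weights.append(mass)  # last slot: open a new cluster
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(clusters):
            clusters.append([])
        clusters[k].append(i)
        labels[i] = k
    return labels


def coclustering_matrix(samples):
    """Proportion of sampled partitions in which items i and j share a cluster."""
    n = len(samples[0])
    psm = [[0.0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    psm[i][j] += 1.0 / len(samples)
    return psm


def binder_loss(labels, psm):
    """Monte Carlo estimate of the expected (equal-cost) Binder loss."""
    n = len(labels)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            together = 1.0 if labels[i] == labels[j] else 0.0
            loss += abs(together - psm[i][j])
    return loss


def cluster_estimate(dist, n_samples=200, temperature=3.0, mass=1.0, seed=0):
    """Sample partitions, then return the sample with the smallest expected loss."""
    rng = random.Random(seed)
    samples = [
        sample_partition(dist, temperature, mass, rng) for _ in range(n_samples)
    ]
    psm = coclustering_matrix(samples)
    best = min(samples, key=lambda labels: binder_loss(labels, psm))
    return best, psm


# Toy data: two well-separated groups of points on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in points] for a in points]
best, psm = cluster_estimate(dist, mass=0.05)
print(best)
```

The co-clustering matrix `psm` returned alongside the estimate is the kind of quantity that could be drawn as a heat map to inspect clustering uncertainty. Restricting the search to the sampled partitions keeps the sketch short; a dedicated search algorithm could find estimates with lower expected loss.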

Random Partition Models for Microclustering Tasks

Brenda Betancourt, Giacomo Zanella, Rebecca C. Steorts (2020)

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.
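The microclustering property mentioned above is commonly formalized as shown here. This is a sketch of the usual definition from the microclustering literature; the symbols $M_n$ and $n$ are introduced for illustration and do not appear in the abstract, and `\xrightarrow` assumes the `amsmath` package.

```latex
% M_n: size of the largest cluster in a random partition of n data points
\frac{M_n}{n} \xrightarrow{\;p\;} 0 \quad \text{as } n \to \infty
```

Under this reading, the largest cluster occupies a vanishing fraction of the data as $n$ grows, in contrast to models in which cluster sizes grow linearly with $n$.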

Methodology, Statistics Theory

On computing distributions of products of random variables via Gaussian multiresolution analysis

Gregory Beylkin, Lucas Monzon, Ignas Satkauskas (2016)

We introduce a new approximate multiresolution analysis (MRA) using a single Gaussian as the scaling function, which we call Gaussian MRA (GMRA). As an initial application, we employ this new tool to accurately and efficiently compute the probability density function (PDF) of the product of independent random variables. In contrast with Monte Carlo (MC) type methods (the only other universal approach known to address this problem), our method not only achieves accuracies beyond the reach of MC but also produces a PDF expressed as a Gaussian mixture, thus allowing for further efficient computations. We also show that an exact MRA corresponding to our GMRA can be constructed for a matching user-selected accuracy.

Numerical Analysis

Flexible Principal Component Analysis for Exponential Family Distributions

Tonglin Zhang, Baijian Yang, Qianqian Song (2021)

Traditional principal component analysis (PCA) is well known in high-dimensional data analysis, but it requires the data to be expressed as a matrix of continuous observations. To overcome these limitations, a new method called flexible PCA (FPCA) for exponential family distributions is proposed. The goal is to ensure that it can be applied to arbitrarily shaped regions for either count or continuous observations. The methodology of FPCA is developed under the framework of generalized linear models. It provides statistical models for FPCA that are not limited to matrix expressions of the data. A maximum likelihood approach is proposed to derive the decomposition when the number of principal components (PCs) is known. This naturally induces a penalized likelihood approach to determine the number of PCs when it is unknown. After being modified for missing data problems, the proposed method is compared with previous PCA methods for missing data. The simulation study shows that the performance of FPCA is always better than that of its competitors. The application uses the proposed method to reduce the dimensionality of arbitrarily shaped sub-regions of images and the global spread patterns of COVID-19 under normal and Poisson distributions, respectively.

Methodology

Identifying latent groups in spatial panel data using a Markov random field constrained product partition model

Tianyu Pan, Guanyu Hu, Weining Shen (2020)

Understanding the heterogeneity over spatial locations is an important problem that has been widely studied in many applications such as economics and environmental science. In this paper, we focus on regression models for spatial panel data analysis, where repeated measurements are collected over time at various spatial locations. We propose a novel class of nonparametric priors that combines a Markov random field (MRF) with the product partition model (PPM), and show that the resulting prior, called MRF-PPM, is capable of identifying the latent group structure among the spatial locations while efficiently utilizing the spatial dependence information. We derive a closed-form conditional distribution for the proposed prior and introduce a new way to compute the marginal likelihood that renders efficient Bayesian inference. We further study the theoretical properties of the proposed MRF-PPM prior and show a clustering consistency result for the posterior distribution. We demonstrate the excellent empirical performance of our method via extensive simulation studies and applications to US precipitation data and California median household income data.

Methodology

Protein Structure Parameterization via Mobius Distributions on the Torus

Mohammad Arashi, Najmeh Nakhaei Rad, Andriette Bekker (2020)

Proteins constitute a large group of macromolecules with a multitude of functions for all living organisms. Proteins achieve this by adopting distinct three-dimensional structures encoded by the sequence of their constituent amino acids in one or more polypeptides. In this paper, the statistical modelling of the protein backbone torsion angles is considered. Two new distributions are proposed for toroidal data by applying the Mobius transformation to the bivariate von Mises distribution. Marginal and conditional distributions in addition to sine-skew

Methodology, Biomolecules, Quantitative Methods