Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Random Partition Models for Microclustering Tasks

180 0 0.0 ( 0 )

Download Cite

Added by Brenda Betancourt

Publication date 2020

fields Mathematical Statistics

and research's language is English

Authors Brenda Betancourt - Giacomo Zanella - Rebecca C. Steorts

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.

rate research

Flexible Models for Microclustering with Application to Entity Resolution

237 - Giacomo Zanella , Brenda Betancourt , Hanna Wallach 2016

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.

Methodology Statistics Theory Applications

Exponential-Family Random Graph Models for Valued Networks

467 - Pavel N. Krivitsky 2011

Exponential-family random graph models (ERGMs) provide a principled and flexible way to model and simulate features common in social networks, such as propensities for homophily, mutuality, and friend-of-a-friend triad closure, through choice of model terms (sufficient statistics). However, those ERGMs modeling the more complex features have, to date, been limited to binary data: presence or absence of ties. Thus, analysis of valued networks, such as those where counts, measurements, or ranks are observed, has necessitated dichotomizing them, losing information and introducing biases. In this work, we generalize ERGMs to valued networks. Focusing on modeling counts, we formulate an ERGM for networks whose ties are counts and discuss issues that arise when moving beyond the binary case. We introduce model terms that generalize and model common social network features for such data and apply these methods to a network dataset whose values are counts of interactions.

Methodology Statistics Theory Statistics Theory

Tests for separability in nonparametric covariance operators of random surfaces

676 - John A. D. Aston , Davide Pigoli , Shahin Tavakoli 2015

The assumption of separability of the covariance operator for a random image or hypersurface can be of substantial use in applications, especially in situations where the accurate estimation of the full covariance structure is unfeasible, either for computational reasons, or due to a small sample size. However, inferential tools to verify this assumption are somewhat lacking in high-dimensional or functional {data analysis} settings, where this assumption is most relevant. We propose here to test separability by focusing on $K$-dimensional projections of the difference between the covariance operator and a nonparametric separable approximation. The subspace we project onto is one generated by the eigenfunctions of the covariance operator estimated under the separability hypothesis, negating the need to ever estimate the full non-separable covariance. We show that the rescaled difference of the sample covariance operator with its separable approximation is asymptotically Gaussian. As a by-product of this result, we derive asymptotically pivotal tests under Gaussian assumptions, and propose bootstrap methods for approximating the distribution of the test statistics. We probe the finite sample performance through simulations studies, and present an application to log-spectrogram images from a phonetic linguistics dataset.

Methodology Statistics Theory Statistics Theory

Stochastic Search for Semiparametric Linear Regression Models

451 - Lutz Duembgen , Dominic Schuhmacher , Richard Samworth 2011

This paper introduces and analyzes a stochastic search method for parameter estimation in linear regression models in the spirit of Beran and Millar (1987). The idea is to generate a random finite subset of a parameter space which will automatically contain points which are very close to an unknown true parameter. The motivation for this procedure comes from recent work of Duembgen, Samworth and Schuhmacher (2011) on regression models with log-concave error distributions.

Methodology Statistics Theory Statistics Theory

Optimal Design for Probit Choice Models with Dependent Utilities

71 - Ulrike Gra{ss}hoff , Heiko Gro{ss}mann , Heinz Holling 2020

In this paper we derive locally D-optimal designs for discrete choice experiments based on multinomial probit models. These models include several discrete explanatory variables as well as a quantitative one. The commonly used multinomial logit model assumes independent utilities for different choice options. Thus, D-optimal optimal designs for such multinomial logit models may comprise choice sets, e.g., consisting of alternatives which are identical in all discrete attributes but different in the quantitative variable. Obviously such designs are not appropriate for many empirical choice experiments. It will be shown that locally D-optimal designs for multinomial probit models supposing independent utilities consist of counterintuitive choice sets as well. However, locally D-optimal designs for multinomial probit models allowing for dependent utilities turn out to be reasonable for analyzing decisions using discrete choice studies.

Methodology Statistics Theory Statistics Theory

comments

Fetching comments

Higher Institute for Demographic Studies and Researches

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Random Partition Models for Microclustering Tasks

Ask ChatGPT about the research

No Arabic abstract

Read More