
Information Theoretic Limits of Exact Recovery in Sub-hypergraph Models for Community Detection

Posted by Jiajun Liang
Publication date: 2021
Research language: English





In this paper, we study the information theoretic bounds for exact recovery in sub-hypergraph models for community detection. We define a general model called the $m$-uniform sub-hypergraph stochastic block model ($m$-ShSBM). Under the $m$-ShSBM, we use Fano's inequality to identify the region of model parameters where any algorithm fails to exactly recover the planted communities with large probability. We also identify the region where a Maximum Likelihood Estimation (MLE) algorithm succeeds in exactly recovering the communities with high probability. Our bounds are tight and apply to community detection problems in various models, such as the planted hypergraph stochastic block model, the planted densest sub-hypergraph model, and the planted multipartite hypergraph model.
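For reference, the converse direction of such results typically rests on Fano's inequality; in its standard form (the exact variant applied to the $m$-ShSBM may differ), for a community assignment $\theta$ drawn uniformly from a finite set $\Theta$ and any estimator $\hat{\theta}$ computed from the observed sub-hypergraph $Y$,

$$ \mathbb{P}\big(\hat{\theta} \neq \theta\big) \;\geq\; 1 - \frac{I(\theta; Y) + \log 2}{\log |\Theta|}, $$

so exact recovery fails with non-vanishing probability whenever the mutual information $I(\theta; Y)$ is small compared to $\log |\Theta|$, the log-count of candidate community assignments.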




Read also

Zhongyang Li (2020)
We study the community detection problem on a Gaussian mixture model, in which vertices are divided into $k \geq 2$ distinct communities. The major difference in our model is that the intensities of the Gaussian perturbations are different for different entries in the observation matrix, and we do not assume that every community has the same number of vertices. We explicitly find the threshold for the exact recovery of the maximum likelihood estimation. Applications include community detection on hypergraphs.
Zhongyang Li (2019)
We study the vertex classification problem on a graph whose vertices are in $k$ ($k \geq 2$) different communities, edges are only allowed between distinct communities, and the numbers of vertices in different communities are not necessarily equal. The observation is a weighted adjacency matrix, perturbed by a scalar multiple of a Gaussian Orthogonal Ensemble (GOE) or Gaussian Unitary Ensemble (GUE) matrix. For the exact recovery of the maximum likelihood estimation (MLE) with various weighted adjacency matrices, we prove sharp thresholds of the intensity $\sigma$ of the Gaussian perturbation. These weighted adjacency matrices may be considered as natural models for the electric network. Surprisingly, these thresholds of $\sigma$ do not depend on whether the sample space for MLE is restricted to classifications in which the number of vertices in each group equals the true value. In contrast to $\mathbb{Z}_2$-synchronization, a new complex version of the semi-definite programming (SDP) is designed to efficiently implement community detection when the number of communities $k$ is greater than 2, and a common region (independent of $k$) for $\sigma$ such that the SDP exactly recovers the true classification is obtained.
We study a semidefinite programming (SDP) relaxation of the maximum likelihood estimation for exactly recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i, j$ are both in the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$. We identify a sufficient condition and a necessary condition for the success of SDP for the general model. For both the Bernoulli case ($P=\mathrm{Bern}(p)$ and $Q=\mathrm{Bern}(q)$ with $p>q$) and the Gaussian case ($P=\mathcal{N}(\mu,1)$ and $Q=\mathcal{N}(0,1)$ with $\mu>0$), which correspond to the problem of planted dense subgraph recovery and submatrix localization respectively, the general results lead to the following findings: (1) If $K=\omega(n/\log n)$, SDP attains the information-theoretic recovery limits with sharp constants; (2) If $K=\Theta(n/\log n)$, SDP is order-wise optimal, but strictly suboptimal by a constant factor; (3) If $K=o(n/\log n)$ and $K \to \infty$, SDP is order-wise suboptimal. The same critical scaling for $K$ is found to hold, up to constant factors, for the performance of SDP on the stochastic block model of $n$ vertices partitioned into multiple communities of equal size $K$. A key ingredient in the proof of the necessary condition is a construction of a primal feasible solution based on random perturbation of the true cluster matrix.
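A minimal sketch of how such an SDP relaxation can be set up in practice, assuming the common formulation with a positive semidefinite, entrywise nonnegative matrix whose entries sum to $K^2$ (the precise constraint set analyzed in the paper may differ); cvxpy is used purely for illustration:

import cvxpy as cp
import numpy as np

def sdp_hidden_community(A, K):
    # Sketch of a standard SDP relaxation for a planted community of size K.
    # The constraint set below is an assumption, not necessarily the paper's.
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [
        Z >> 0,               # positive semidefinite
        Z >= 0,               # entrywise nonnegative
        cp.diag(Z) <= 1,      # diagonal bounded by one
        cp.sum(Z) == K ** 2,  # total mass matches a size-K community
    ]
    cp.Problem(cp.Maximize(cp.trace(A @ Z)), constraints).solve()
    # Read off a community estimate from the leading eigenvector of Z.
    _, vecs = np.linalg.eigh(Z.value)
    return set(np.argsort(np.abs(vecs[:, -1]))[-K:].tolist())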
We study the problem of recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i, j$ both belong to the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$ depending on $n$. If $P=\mathrm{Bern}(p)$ and $Q=\mathrm{Bern}(q)$ with $p>q$, it reduces to the problem of finding a densely-connected $K$-subgraph planted in a large Erdős-Rényi graph; if $P=\mathcal{N}(\mu,1)$ and $Q=\mathcal{N}(0,1)$ with $\mu>0$, it corresponds to the problem of locating a $K \times K$ principal submatrix of elevated means in a large Gaussian random matrix. We focus on two types of asymptotic recovery guarantees as $n \to \infty$: (1) weak recovery: expected number of classification errors is $o(K)$; (2) exact recovery: probability of classifying all indices correctly converges to one. Under mild assumptions on $P$ and $Q$, and allowing the community size to scale sublinearly with $n$, we derive a set of sufficient conditions and a set of necessary conditions for recovery, which are asymptotically tight with sharp constants. The results hold in particular for the Gaussian case, and for the case of bounded log likelihood ratio, including the Bernoulli case whenever $\frac{p}{q}$ and $\frac{1-p}{1-q}$ are bounded away from zero and infinity. An important algorithmic implication is that, whenever exact recovery is information theoretically possible, any algorithm that provides weak recovery when the community size is concentrated near $K$ can be upgraded to achieve exact recovery in linear additional time by a simple voting procedure.
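A rough illustration of such a voting step is sketched below; weak_set and llr are hypothetical inputs standing in for a weak-recovery routine and the entrywise log-likelihood ratio $\log \frac{dP}{dQ}$, neither of which is specified by the abstract:

import numpy as np

def vote_clean_up(A, K, weak_set, llr):
    # Score every index by its total log-likelihood ratio toward the
    # weakly recovered set, then keep the K highest-scoring indices.
    n = A.shape[0]
    L = llr(A).astype(float)
    np.fill_diagonal(L, 0.0)        # A_ii is not part of the model
    mask = np.zeros(n, dtype=bool)
    mask[list(weak_set)] = True
    scores = L[:, mask].sum(axis=1)
    return set(np.argsort(scores)[-K:].tolist())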
We show that a simple community detection algorithm originating from the stochastic blockmodel literature achieves consistency, and even optimality, for a broad and flexible class of sparse latent space models. The class of models includes latent eigenmodels (arXiv:0711.1146). The community detection algorithm is based on spectral clustering followed by local refinement via normalized edge counting.
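A minimal sketch of that two-stage pipeline, assuming an adjacency-matrix input and a single refinement pass; the eigenvector selection and the k-means step here are illustrative choices, not necessarily the ones analyzed:

import numpy as np
from sklearn.cluster import KMeans

def spectral_then_refine(A, k, refine_passes=1):
    # Stage 1: embed vertices with the k leading eigenvectors of A, then k-means.
    vals, vecs = np.linalg.eigh(A)
    X = vecs[:, np.argsort(np.abs(vals))[-k:]]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    # Stage 2: reassign each vertex to the community with the largest
    # normalized edge count (edges into the block divided by block size).
    for _ in range(refine_passes):
        sizes = np.bincount(labels, minlength=k)
        counts = np.stack([A[:, labels == c].sum(axis=1) for c in range(k)], axis=1)
        labels = np.argmax(counts / np.maximum(sizes, 1), axis=1)
    return labels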
