No Arabic abstract
Community detection is considered for a stochastic block model graph of n vertices, with K vertices in the planted community, edge probability p for pairs of vertices both in the community, and edge probability q for other pairs of vertices. The main focus of the paper is on weak recovery of the community based on the graph G, with o(K) misclassified vertices on average, in the sublinear regime $n^{1-o(1)} leq K leq o(n).$ A critical parameter is the effective signal-to-noise ratio $lambda=K^2(p-q)^2/((n-K)q)$, with $lambda=1$ corresponding to the Kesten-Stigum threshold. We show that a belief propagation algorithm achieves weak recovery if $lambda>1/e$, beyond the Kesten-Stigum threshold by a factor of $1/e.$ The belief propagation algorithm only needs to run for $log^ast n+O(1) $ iterations, with the total time complexity $O(|E| log^*n)$, where $log^*n$ is the iterated logarithm of $n.$ Conversely, if $lambda leq 1/e$, no local algorithm can asymptotically outperform trivial random guessing. Furthermore, a linear message-passing algorithm that corresponds to applying power iteration to the non-backtracking matrix of the graph is shown to attain weak recovery if and only if $lambda>1$. In addition, the belief propagation algorithm can be combined with a linear-time voting procedure to achieve the information limit of exact recovery (correctly classify all vertices with high probability) for all $K ge frac{n}{log n} left( rho_{rm BP} +o(1) right),$ where $rho_{rm BP}$ is a function of $p/q$.
We study the problem of recovering a hidden community of cardinality $K$ from an $n times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} sim P$ if $i, j$ both belong to the community and $A_{ij} sim Q$ otherwise, for two known probability distributions $P$ and $Q$ depending on $n$. If $P={rm Bern}(p)$ and $Q={rm Bern}(q)$ with $p>q$, it reduces to the problem of finding a densely-connected $K$-subgraph planted in a large Erdos-Renyi graph; if $P=mathcal{N}(mu,1)$ and $Q=mathcal{N}(0,1)$ with $mu>0$, it corresponds to the problem of locating a $K times K$ principal submatrix of elevated means in a large Gaussian random matrix. We focus on two types of asymptotic recovery guarantees as $n to infty$: (1) weak recovery: expected number of classification errors is $o(K)$; (2) exact recovery: probability of classifying all indices correctly converges to one. Under mild assumptions on $P$ and $Q$, and allowing the community size to scale sublinearly with $n$, we derive a set of sufficient conditions and a set of necessary conditions for recovery, which are asymptotically tight with sharp constants. The results hold in particular for the Gaussian case, and for the case of bounded log likelihood ratio, including the Bernoulli case whenever $frac{p}{q}$ and $frac{1-p}{1-q}$ are bounded away from zero and infinity. An important algorithmic implication is that, whenever exact recovery is information theoretically possible, any algorithm that provides weak recovery when the community size is concentrated near $K$ can be upgraded to achieve exact recovery in linear additional time by a simple voting procedure.
We study a semidefinite programming (SDP) relaxation of the maximum likelihood estimation for exactly recovering a hidden community of cardinality $K$ from an $n times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} sim P$ if $i, j$ are both in the community and $A_{ij} sim Q$ otherwise, for two known probability distributions $P$ and $Q$. We identify a sufficient condition and a necessary condition for the success of SDP for the general model. For both the Bernoulli case ($P={{rm Bern}}(p)$ and $Q={{rm Bern}}(q)$ with $p>q$) and the Gaussian case ($P=mathcal{N}(mu,1)$ and $Q=mathcal{N}(0,1)$ with $mu>0$), which correspond to the problem of planted dense subgraph recovery and submatrix localization respectively, the general results lead to the following findings: (1) If $K=omega( n /log n)$, SDP attains the information-theoretic recovery limits with sharp constants; (2) If $K=Theta(n/log n)$, SDP is order-wise optimal, but strictly suboptimal by a constant factor; (3) If $K=o(n/log n)$ and $K to infty$, SDP is order-wise suboptimal. The same critical scaling for $K$ is found to hold, up to constant factors, for the performance of SDP on the stochastic block model of $n$ vertices partitioned into multiple communities of equal size $K$. A key ingredient in the proof of the necessary condition is a construction of a primal feasible solution based on random perturbation of the true cluster matrix.
Let $(Z_n,ngeq 0)$ be a supercritical Galton-Watson process whose offspring distribution $mu$ has mean $lambda>1$ and is such that $int x(log(x))_+ dmu(x)<+infty$. According to the famous Kesten & Stigum theorem, $(Z_n/lambda^n)$ converges almost surely, as $nto+infty$. The limiting random variable has mean~1, and its distribution is characterised as the solution of a fixed point equation. par In this paper, we consider a family of Galton-Watson processes $(Z_n(lambda), ngeq 0)$ defined for~$lambda$ ranging in an interval $Isubset (1, infty)$, and where we interpret $lambda$ as the time (when $n$ is the generation). The number of children of an individual at time~$lambda$ is given by $X(lambda)$, where $(X(lambda))_{lambdain I}$ is a c`adl`ag integer-valued process which is assumed to be almost surely non-decreasing and such that $mathbb E(X(lambda))=lambda >1$ for all $lambdain I$. This allows us to define $Z_n(lambda)$ the number of elements in the $n$th generation at time $lambda$. Set $W_n(lambda)= Z_n(lambda)/lambda^n$ for all $ngeq 0$ and $lambdain I$. We prove that, under some moment conditions on the process~$X$, the sequence of processes $(W_n(lambda), lambdain I)_{ngeq 0}$ converges in probability as~$n$ tends to infinity in the space of c`adl`ag processes equipped with the Skorokhod topology to a process, which we characterise as the solution of a fixed point equation.
We consider learning of fundamental properties of communities in large noisy networks, in the prototypical situation where the nodes or users are split into two classes according to a binary property, e.g., according to their opinions or preferences on a topic. For learning these properties, we propose a nonparametric, unsupervised, and scalable graph scan procedure that is, in addition, robust against a class of powerful adversaries. In our setup, one of the communities can fall under the influence of a knowledgeable adversarial leader, who knows the full network structure, has unlimited computational resources and can completely foresee our planned actions on the network. We prove strong consistency of our results in this setup with minimal assumptions. In particular, the learning procedure estimates the baseline activity of normal users asymptotically correctly with probability 1; the only assumption being the existence of a single implicit community of asymptotically negligible logarithmic size. We provide experiments on real and synthetic data to illustrate the performance of our method, including examples with adversaries.
The community detection problem requires to cluster the nodes of a network into a small number of well-connected communities. There has been substantial recent progress in characterizing the fundamental statistical limits of community detection under simple stochastic block models. However, in real-world applications, the network structure is typically dynamic, with nodes that join over time. In this setting, we would like a detection algorithm to perform only a limited number of updates at each node arrival. While standard voting approaches satisfy this constraint, it is unclear whether they exploit the network information optimally. We introduce a simple model for networks growing over time which we refer to as streaming stochastic block model (StSBM). Within this model, we prove that voting algorithms have fundamental limitations. We also develop a streaming belief-propagation (StreamBP) approach, for which we prove optimality in certain regimes. We validate our theoretical findings on synthetic and real data.