Chi-squared Amplification: Identifying Hidden Hubs


Abstract in English

We consider the following general hidden hubs model: an $n times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs): entries in rows outside $S$ are generated from the probability distribution $p_0 sim N(0,sigma_0^2)$; for each row in $S$, some $k$ of its entries are generated from $p_1 sim N(0,sigma_1^2)$, $sigma_1>sigma_0$, and the rest of the entries from $p_0$. The problem is to identify the high-degree hubs efficiently. This model includes and significantly generalizes the planted Gaussian Submatrix Model, where the special entries are all in a $k times k$ submatrix. There are two well-known barriers: if $kgeq csqrt{nln n}$, just the row sums are sufficient to find $S$ in the general model. For the submatrix problem, this can be improved by a $sqrt{ln n}$ factor to $k ge csqrt{n}$ by spectral methods or combinatorial methods. In the variant with $p_0=pm 1$ (with probability $1/2$ each) and $p_1equiv 1$, neither barrier has been broken. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k ge n^{0.5-delta}$ for some $delta >0$, when $sigma_1^2>2sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances each at least as large as $sigma_1^2$. We also show a nearly matching lower bound: for $sigma_1^2 le 2sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,sigma_0^2)$ and a matrix with $k=n^{0.5-delta}$ hidden hubs for any $delta >0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $sigma_1^2=2sigma_0^2$, we show that the general hidden hubs problem can be solved for $kgeq csqrt n(ln n)^{1/4}$, improving on the naive row sum-based method.

Download