In this work we study the fundamental limits of approximate recovery in the context of group testing. One of the most well-known, theoretically optimal, and easiest-to-implement testing procedures is non-adaptive Bernoulli group testing, where all tests are conducted in parallel and each item is included in each test independently with some fixed probability. In this setting, there is an observed gap between the number of tests above which recovery is information-theoretically (IT) possible and the number of tests required by the best known efficient algorithms to succeed. Such gaps are often explained by a phase transition in the landscape of the solution space of the problem (an Overlap Gap Property phase transition). In this paper we seek to understand whether such a phenomenon takes place for Bernoulli group testing as well. Our main contributions are the following: (1) we provide first-moment evidence that, perhaps surprisingly, such a phase transition does not take place throughout the regime in which recovery is IT possible, suggesting that the model is in fact amenable to local search algorithms; (2) we prove the complete absence of bad local minima for a part of the hard regime, which implies an improvement over known theoretical results on the performance of efficient algorithms for approximate recovery without false negatives; and (3) we present extensive simulations that strongly suggest that a very simple local algorithm, Glauber dynamics, does indeed succeed, and can be used to efficiently implement the well-known (theoretically optimal) Smallest Satisfying Set (SSS) estimator.
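As a concrete illustration of this setup, the following minimal Python sketch (assuming numpy; the sizes, penalty, and inverse temperature are illustrative choices, not taken from the paper) simulates a non-adaptive Bernoulli test design and runs a Glauber-dynamics-style local search toward the Smallest Satisfying Set, i.e. the smallest set of items consistent with all test outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n items, k defectives, T tests, and each item is placed
# in each test independently with probability p.
n, k, T, p = 200, 10, 120, 0.1
defective = np.zeros(n, dtype=bool)
defective[rng.choice(n, size=k, replace=False)] = True
A = (rng.random((T, n)) < p).astype(int)        # A[t, i] = 1 iff item i is in test t
y = (A @ defective.astype(int)) > 0             # noiseless OR outcomes of the tests

A_pos, A_neg = A[y], A[~y]
# With no false negatives, an item appearing in any negative test cannot be defective.
possibly_defective = A_neg.sum(axis=0) == 0

def energy(x, lam=5.0):
    """Candidate-set size plus a penalty for every positive test left uncovered."""
    uncovered = int((A_pos @ x == 0).sum())
    return int(x.sum()) + lam * uncovered

# Glauber-dynamics-style local search for the Smallest Satisfying Set (SSS).
x = possibly_defective.astype(int)
beta = 2.0                                      # inverse temperature, heuristic choice
for _ in range(20000):
    i = rng.integers(n)
    if not possibly_defective[i]:
        continue
    x_new = x.copy()
    x_new[i] = 1 - x_new[i]
    dE = energy(x_new) - energy(x)
    if rng.random() < 1.0 / (1.0 + np.exp(beta * dE)):   # Glauber acceptance rule
        x = x_new

print("defectives recovered:", int(x[defective].sum()), "/", k,
      "| estimate size:", int(x.sum()))
```

For a large enough penalty, the minimizers of this energy over the "possibly defective" items are exactly the satisfying sets of smallest size, which is what makes a local Markov-chain search a natural heuristic here.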
We consider the phase retrieval problem of reconstructing an $n$-dimensional real or complex signal $\mathbf{X}^{\star}$ from $m$ (possibly noisy) observations $Y_\mu = |\sum_{i=1}^n \Phi_{\mu i} X^{\star}_i/\sqrt{n}|$, for a large class of correlated real and complex random sensing matrices $\mathbf{\Phi}$, in a high-dimensional setting where $m,n\to\infty$ while $\alpha = m/n = \Theta(1)$. First, we derive sharp asymptotics for the lowest estimation error achievable statistically, and we unveil the existence of sharp phase transitions for the weak- and full-recovery thresholds as a function of the singular values of the matrix $\mathbf{\Phi}$. This is achieved by providing a rigorous proof of a result first obtained by the replica method from statistical mechanics. In particular, the information-theoretic transition to perfect recovery for full-rank matrices appears at $\alpha=1$ (real case) and $\alpha=2$ (complex case). Secondly, we analyze the performance of the best-known polynomial-time algorithm for this problem, approximate message passing, establishing the existence of a statistical-to-algorithmic gap that depends, again, on the spectral properties of $\mathbf{\Phi}$. Our work provides an extensive classification of the statistical and algorithmic thresholds in high-dimensional phase retrieval for a broad class of random matrices.
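The observation model is simple to write down; below is a minimal numpy sketch for the real, noiseless case. It uses an i.i.d. Gaussian sensing matrix purely as a stand-in (the point of the paper is precisely to cover correlated ensembles), and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: alpha = m / n is the sampling ratio, held of order one.
n, alpha = 1000, 1.5
m = int(alpha * n)

x_star = rng.standard_normal(n)              # unknown real signal X*
# An i.i.d. Gaussian Phi is the simplest stand-in for the correlated ensembles
# treated in the paper.
Phi = rng.standard_normal((m, n))

Y = np.abs(Phi @ x_star) / np.sqrt(n)        # phaseless (noiseless) observations

# The sign of each measurement is lost, so X* is at best recoverable up to a
# global sign (a global phase in the complex case).
print("m/n =", m / n, "| same observations under sign flip:",
      np.allclose(Y, np.abs(Phi @ (-x_star)) / np.sqrt(n)))
```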
A trade-off between accuracy and fairness is almost taken as a given in the existing literature on fairness in machine learning. Yet, it is not preordained that accuracy should decrease with increased fairness. Novel to this work, we examine fair classification through the lens of mismatched hypothesis testing: trying to find a classifier that distinguishes between two ideal distributions when given two mismatched distributions that are biased. Using Chernoff information, a tool from information theory, we theoretically demonstrate that, contrary to popular belief, there always exist ideal distributions such that optimal fairness and accuracy (with respect to the ideal distributions) are achieved simultaneously: there is no trade-off. Moreover, the same classifier that exhibits no trade-off with respect to the ideal distributions exhibits a trade-off when accuracy is measured with respect to the given (possibly biased) dataset. To complement our main result, we formulate an optimization problem to find the ideal distributions and derive fundamental limits to explain why a trade-off exists on the given biased dataset. We also derive conditions under which active data collection can alleviate the fairness-accuracy trade-off in the real world. Our results lead us to contend that it is problematic to measure accuracy with respect to data that reflects bias, and instead, we should be considering accuracy with respect to ideal, unbiased data.
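For reference, the Chernoff information between two discrete distributions $P$ and $Q$ is $C(P,Q) = \max_{0\le\lambda\le 1} -\log \sum_x P(x)^{\lambda} Q(x)^{1-\lambda}$, the best achievable error exponent in binary hypothesis testing. The sketch below (a toy example with made-up distributions, not data from the paper) computes it by a grid search over $\lambda$.

```python
import numpy as np

def chernoff_information(p, q, grid=1001):
    """Chernoff information C(P, Q) = max_{0 <= lam <= 1} -log sum_x p(x)^lam q(x)^(1-lam)
    for two discrete distributions on the same finite alphabet, by grid search over lam."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    lams = np.linspace(0.0, 1.0, grid)
    # Evaluate the tilted sum for every lambda at once via broadcasting.
    mix = np.sum(p[None, :] ** lams[:, None] * q[None, :] ** (1.0 - lams[:, None]), axis=1)
    return float(np.max(-np.log(mix)))

# Toy example (illustrative only): two biased coins.
p = [0.9, 0.1]
q = [0.4, 0.6]
print("Chernoff information:", chernoff_information(p, q))
```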
Motivated by geometric problems in signal processing, computer vision, and structural biology, we study a class of orbit recovery problems where we observe very noisy copies of an unknown signal, each acted upon by a random element of some group (such as $\mathbb{Z}/p$ or $SO(3)$). The goal is to recover the orbit of the signal under the group action in the high-noise regime. This generalizes problems of interest such as multi-reference alignment (MRA) and the reconstruction problem in cryo-electron microscopy (cryo-EM). We obtain matching lower and upper bounds on the sample complexity of these problems in high generality, showing that the statistical difficulty is intricately determined by the invariant theory of the underlying symmetry group. In particular, we determine that for cryo-EM with noise variance $\sigma^2$ and uniform viewing directions, the number of samples required scales as $\sigma^6$. We match this bound with a novel algorithm for ab initio reconstruction in cryo-EM, based on invariant features of degree at most 3. We further discuss how to recover multiple molecular structures from heterogeneous cryo-EM samples.
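To make the role of low-degree invariants concrete, here is a small numpy sketch for the simplest instance, multi-reference alignment over $\mathbb{Z}/p$: the mean, power spectrum, and bispectrum are Fourier features of degree 1, 2, and 3 that are unchanged by cyclic shifts, so their sample averages over noisy shifted copies estimate the corresponding invariants of the signal, and estimating the degree-3 bispectrum to fixed accuracy is what drives a $\sigma^6$-type sample complexity. The signal length and the check below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Multi-reference alignment over Z/p with p = 16 (illustrative): the orbit of a
# generic signal under cyclic shifts is pinned down by invariants of degree <= 3.
p_len = 16
x = rng.standard_normal(p_len)

def invariants(y):
    """Shift-invariant Fourier features: mean (deg. 1), power spectrum (deg. 2),
    bispectrum (deg. 3)."""
    Y = np.fft.fft(y)
    idx = np.arange(len(y))
    mean = y.mean()
    power = np.abs(Y) ** 2
    bispectrum = Y[:, None] * Y[None, :] * np.conj(Y[(idx[:, None] + idx[None, :]) % len(y)])
    return mean, power, bispectrum

# The invariants are unchanged by a cyclic shift, so averaging them over noisy,
# randomly shifted copies concentrates around the invariants of x itself.
m1, p1, b1 = invariants(x)
m2, p2, b2 = invariants(np.roll(x, 5))
print("power spectrum shift-invariant:", np.allclose(p1, p2))
print("bispectrum shift-invariant:    ", np.allclose(b1, b2))
```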
We consider the problem of estimating a vector of discrete variables $(\theta_1,\cdots,\theta_n)$, based on noisy observations $Y_{uv}$ of the pairs $(\theta_u,\theta_v)$ on the edges of a graph $G=([n],E)$. This setting comprises a broad family of statistical estimation problems, including group synchronization on graphs, community detection, and low-rank matrix estimation. A large body of theoretical work has established sharp thresholds for weak and exact recovery, and sharp characterizations of the optimal reconstruction accuracy in such models, focusing however on the special case of Erdős--Rényi-type random graphs. The single most important finding of this line of work is the ubiquity of an information-computation gap. Namely, for many models of interest, a large gap is found between the optimal accuracy achievable by any statistical method, and the optimal accuracy achieved by known polynomial-time algorithms. Moreover, this gap is generally believed to be robust to small amounts of additional side information revealed about the $\theta_i$'s. How does the structure of the graph $G$ affect this picture? Is the information-computation gap a general phenomenon or does it only apply to specific families of graphs? We prove that the picture is dramatically different for graph sequences converging to amenable graphs (including, for instance, $d$-dimensional grids). We consider a model in which an arbitrarily small fraction of the vertex labels is revealed, and show that a linear-time local algorithm can achieve reconstruction accuracy that is arbitrarily close to the information-theoretic optimum. We contrast this to the case of random graphs. Indeed, focusing on group synchronization on random regular graphs, we prove that the information-computation gap still persists even when a small amount of side information is revealed.
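The following toy sketch mimics this setting on an amenable graph: $\mathbb{Z}/2$ synchronization on a two-dimensional grid, with each edge observation flipped with small probability and a small fraction of vertex labels revealed as side information. The parameters are illustrative, and the simple iterated-majority propagation below is a stand-in for a local algorithm, not the paper's procedure; it typically recovers far more labels than chance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Z/2 synchronization on an L x L grid (an amenable graph); illustrative parameters.
L, eps, frac_revealed = 60, 0.1, 0.01        # grid side, edge flip prob., side information
n = L * L
theta = rng.choice([-1, 1], size=n)

def vid(i, j):                               # vertex id of grid point (i, j)
    return i * L + j

# Noisy edge observations Y_uv = theta_u * theta_v, each flipped independently w.p. eps.
edges = []
for i in range(L):
    for j in range(L):
        if i + 1 < L: edges.append((vid(i, j), vid(i + 1, j)))
        if j + 1 < L: edges.append((vid(i, j), vid(i, j + 1)))
edges = np.array(edges)
flip = rng.random(len(edges)) < eps
Y = theta[edges[:, 0]] * theta[edges[:, 1]] * np.where(flip, -1, 1)

# Side information: a small fraction of vertex labels is revealed exactly.
revealed = rng.random(n) < frac_revealed
est = np.where(revealed, theta, 0)           # 0 = "undecided"

# Local update: each vertex takes a majority vote of its neighbours' current
# estimates, transported along the noisy edge observations.
for _ in range(50):
    msg = np.zeros(n)
    np.add.at(msg, edges[:, 0], Y * est[edges[:, 1]])
    np.add.at(msg, edges[:, 1], Y * est[edges[:, 0]])
    new = np.where(revealed, theta, np.sign(msg))
    est = np.where(new == 0, est, new)       # keep the previous value on ties

print("fraction of labels recovered:", round(float((est == theta).mean()), 3))
```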
In this paper, we propose a general framework for tensor singular value decomposition (tensor SVD), which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data. Comprehensive results are developed on both the statistical and computational limits for tensor SVD. This problem exhibits three different phases according to the signal-to-noise ratio (SNR). In particular, with strong SNR, we show that the classical higher-order orthogonal iteration achieves the minimax optimal rate of convergence in estimation; with weak SNR, the information-theoretic lower bound implies that consistent estimation is impossible in general; with moderate SNR, we show that non-convex maximum likelihood estimation provides a statistically optimal solution, but at an NP-hard computational cost; moreover, under the hardness hypothesis of hypergraphic planted clique detection, no polynomial-time algorithm can perform consistent estimation in general.
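For concreteness, the sketch below implements higher-order orthogonal iteration (HOOI) in numpy for an order-3 tensor with a planted low-Tucker-rank signal in Gaussian noise, initialized by truncated SVDs of the unfoldings (HOSVD). The dimensions, rank, and signal strength are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Planted low-Tucker-rank order-3 tensor observed in Gaussian noise (illustrative sizes).
p, r, snr = 40, 2, 30.0
U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
core = np.zeros((r, r, r))
core[np.arange(r), np.arange(r), np.arange(r)] = snr      # superdiagonal core, for simplicity
signal = np.einsum('abc,ia,jb,kc->ijk', core, U[0], U[1], U[2])
T = signal + rng.standard_normal((p, p, p))

def unfold(X, mode):
    """Mode-k matricization of an order-3 tensor."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def top_r(M, r):
    """Leading r left singular vectors of M."""
    return np.linalg.svd(M, full_matrices=False)[0][:, :r]

# HOSVD initialization: truncated SVD of each unfolding.
Uh = [top_r(unfold(T, k), r) for k in range(3)]

# Higher-order orthogonal iteration: refine each mode's factor against the other two.
for _ in range(30):
    for k in range(3):
        others = [Uh[m] for m in range(3) if m != k]
        proj = np.einsum('ijk,ja,kb->iab', np.moveaxis(T, k, 0), *others)
        Uh[k] = top_r(proj.reshape(p, -1), r)

# Alignment between estimated and true mode-k subspaces (1 = perfect, 0 = orthogonal).
align = [np.linalg.norm(Uh[k].T @ U[k]) ** 2 / r for k in range(3)]
print("subspace alignments:", [round(float(a), 3) for a in align])
```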