No Arabic abstract
We consider the linear regression problem of estimating a $p$-dimensional vector $beta$ from $n$ observations $Y = X beta + W$, where $beta_j stackrel{text{i.i.d.}}{sim} pi$ for a real-valued distribution $pi$ with zero mean and unit variance, $X_{ij} stackrel{text{i.i.d.}}{sim} mathcal{N}(0,1)$, and $W_istackrel{text{i.i.d.}}{sim} mathcal{N}(0, sigma^2)$. In the asymptotic regime where $n/p to delta$ and $ p/ sigma^2 to mathsf{snr}$ for two fixed constants $delta, mathsf{snr}in (0, infty)$ as $p to infty$, the limiting (normalized) minimum mean-squared error (MMSE) has been characterized by the MMSE of an associated single-letter (additive Gaussian scalar) channel. In this paper, we show that if the MMSE function of the single-letter channel converges to a step function, then the limiting MMSE of estimating $beta$ in the linear regression problem converges to a step function which jumps from $1$ to $0$ at a critical threshold. Moreover, we establish that the limiting mean-squared error of the (MSE-optimal) approximate message passing algorithm also converges to a step function with a larger threshold, providing evidence for the presence of a computational-statistical gap between the two thresholds.
We study the statistical problem of estimating a rank-one sparse tensor corrupted by additive Gaussian noise, a model also known as sparse tensor PCA. We show that for Bernoulli and Bernoulli-Rademacher distributed signals and emph{for all} sparsity levels which are sublinear in the dimension of the signal, the sparse tensor PCA model exhibits a phase transition called the emph{all-or-nothing phenomenon}. This is the property that for some signal-to-noise ratio (SNR) $mathrm{SNR_c}$ and any fixed $epsilon>0$, if the SNR of the model is below $left(1-epsilonright)mathrm{SNR_c}$, then it is impossible to achieve any arbitrarily small constant correlation with the hidden signal, while if the SNR is above $left(1+epsilon right)mathrm{SNR_c}$, then it is possible to achieve almost perfect correlation with the hidden signal. The all-or-nothing phenomenon was initially established in the context of sparse linear regression, and over the last year also in the context of sparse 2-tensor (matrix) PCA, Bernoulli group testing, and generalized linear models. Our results follow from a more general result showing that for any Gaussian additive model with a discrete uniform prior, the all-or-nothing phenomenon follows as a direct outcome of an appropriately defined near-orthogonality property of the support of the prior distribution.
We establish a phase transition known as the all-or-nothing phenomenon for noiseless discrete channels. This class of models includes the Bernoulli group testing model and the planted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenon for such models was only known in a limited range of parameters. Our work extends the results to all signals with arbitrary sublinear sparsity. Over the past several years, the all-or-nothing phenomenon has been established in various models as an outcome of two seemingly disjoint results: one positive result establishing the all half of all-or-nothing, and one impossibility result establishing the nothing half. Our main technique in the present work is to show that for noiseless discrete channels, the all half implies the nothing half, that is a proof of all can be turned into a proof of nothing. Since the all half can often be proven by straightforward means -- for instance, by the first-moment method -- our equivalence gives a powerful and general approach towards establishing the existence of this phenomenon in other contexts.
In high-dimensional linear regression, would increasing effect sizes always improve model selection, while maintaining all the other conditions unchanged (especially fixing the sparsity of regression coefficients)? In this paper, we answer this question in the textit{negative} in the regime of linear sparsity for the Lasso method, by introducing a new notion we term effect size heterogeneity. Roughly speaking, a regression coefficient vector has high effect size heterogeneity if its nonzero entries have significantly different magnitudes. From the viewpoint of this new measure, we prove that the false and true positive rates achieve the optimal trade-off uniformly along the Lasso path when this measure is maximal in a certain sense, and the worst trade-off is achieved when it is minimal in the sense that all nonzero effect sizes are roughly equal. Moreover, we demonstrate that the first false selection occurs much earlier when effect size heterogeneity is minimal than when it is maximal. The underlying cause of these two phenomena is, metaphorically speaking, the competition among variables with effect sizes of the same magnitude in entering the model. Taken together, our findings suggest that effect size heterogeneity shall serve as an important complementary measure to the sparsity of regression coefficients in the analysis of high-dimensional regression problems. Our proofs use techniques from approximate message passing theory as well as a novel technique for estimating the rank of the first false variable.
Comparing probability distributions is an indispensable and ubiquitous task in machine learning and statistics. The most common way to compare a pair of Borel probability measures is to compute a metric between them, and by far the most widely used notions of metric are the Wasserstein metric and the total variation metric. The next most common way is to compute a divergence between them, and in this case almost every known divergences such as those of Kullback--Leibler, Jensen--Shannon, Renyi, and many more, are special cases of the $f$-divergence. Nevertheless these metrics and divergences may only be computed, in fact, are only defined, when the pair of probability measures are on spaces of the same dimension. How would one quantify, say, a KL-divergence between the uniform distribution on the interval $[-1,1]$ and a Gaussian distribution on $mathbb{R}^3$? We will show that, in a completely natural manner, various common notions of metrics and divergences give rise to a distance between Borel probability measures defined on spaces of different dimensions, e.g., one on $mathbb{R}^m$ and another on $mathbb{R}^n$ where $m, n$ are distinct, so as to give a meaningful answer to the previous question.
We consider constraints on primordial black holes (PBHs) in the mass range $( 10^{-18}text{-}10^{15} ),M_{odot}$ if the dark matter (DM) comprises weakly interacting massive particles (WIMPs) which form halos around them and generate $gamma$-rays by annihilations. We first study the formation of the halos and find that their density profile prior to WIMP annihilations evolves to a characteristic power-law form. Because of the wide range of PBH masses considered, our analysis forges an interesting link between previous approaches to this problem. We then consider the effect of the WIMP annihilations on the halo profile and the associated generation of $gamma$-rays. The observed extragalactic $gamma$-ray background implies that the PBH DM fraction is $f^{}_{rm PBH} lesssim 2 times 10^{-9},( m_{chi} / {rm TeV} )^{1.1}$ in the mass range $2 times 10^{-12},M_{odot},( m_{chi} / {rm TeV} )^{-3.2} lesssim M lesssim 5 times 10^{12},M_{odot},( m_{chi} / {rm TeV} )^{1.1}$, where $m_{chi}$ and $M$ are the WIMP and PBH masses, respectively. This limit is independent of $M$ and therefore applies for any PBH mass function. For $M lesssim 2times 10^{-12},M_{odot},( m_{chi}/ {rm TeV} )^{-3.2}$, the constraint on $f^{}_{rm PBH}$ is a decreasing function of $M$ and PBHs could still make a significant DM contribution at very low masses. We also consider constraints on WIMPs if the DM is mostly PBHs. If the merging black holes recently discovered by LIGO/Virgo are of primordial origin, this would rule out the standard WIMP DM scenario. More generally, the WIMP DM fraction cannot exceed $10^{-4}$ for $M > 10^{-9},M_{odot}$ and $m_{chi} > 10,$GeV. There is a region of parameter space, with $M lesssim 10^{-11},M_{odot}$ and $m_{chi} lesssim 100,$GeV, in which WIMPs and PBHs can both provide some but not all of the DM, so that one requires a third DM candidate.