We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $\rho$ of the gradient calculations relative to the mini-batch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to the mini-batch size, namely when $b\rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm, and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $\rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b\rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.
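As a concrete illustration of the precision parameter above, here is a minimal numpy sketch of mini-batch SGD whose averaged gradients are reported only up to precision $\rho$; the model, loss, and all parameter values are hypothetical toy choices meant to make the regime governed by $b\rho$ tangible, not a construction from the paper:

```python
import numpy as np

def round_gradient(g, rho):
    # Report each gradient coordinate only up to precision rho,
    # modeling finite-precision gradient computation.
    return rho * np.round(g / rho)

def minibatch_sgd(grad_fn, w0, data, b, rho, lr=0.1, steps=200, seed=0):
    # Mini-batch SGD where the averaged gradient is rounded to
    # precision rho; the regimes above are governed by b * rho.
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        batch = data[rng.choice(len(data), size=b, replace=False)]
        g = np.mean([grad_fn(w, z) for z in batch], axis=0)
        w = w - lr * round_gradient(g, rho)
    return w

# Toy usage: squared loss around each sample, so SGD tracks the mean.
data = np.random.default_rng(1).normal(loc=2.0, size=(1000, 1))
grad = lambda w, z: 2.0 * (w - z)
print(minibatch_sgd(grad, np.zeros(1), data, b=8, rho=1e-4))  # close to 2.0
```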
We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small even when gradient descent can achieve arbitrarily high accuracy. Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.
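To make the object of comparison concrete, the sketch below forms the tangent kernel of a toy two-parameter model, i.e., the Gram matrix of parameter gradients at fixed weights (the linearization around those weights); the model and all values are hypothetical, not from the paper:

```python
import numpy as np

def model_grad(w, x):
    # Gradient of the toy model f(w, x) = w1 * tanh(w0 * x)
    # with respect to w = (w0, w1).
    w0, w1 = w
    h = np.tanh(w0 * x)
    return np.array([w1 * (1 - h ** 2) * x, h])

def tangent_kernel(w, xs):
    # Tangent kernel K[i, j] = <grad f(w, xs[i]), grad f(w, xs[j])>.
    G = np.stack([model_grad(w, x) for x in xs])  # (n_points, n_params)
    return G @ G.T

xs = np.linspace(-1.0, 1.0, 5)
w = np.array([0.7, -0.3])
K = tangent_kernel(w, xs)
print(np.linalg.eigvalsh(K))  # spectrum of the tangent kernel
```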
Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing studies of PCA focus on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embeddings of samples. This hinders the analysis of various spectral methods. In this paper, we first develop an $\ell_p$ perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon vanilla PCA in the presence of heteroscedastic noise. Through a novel $\ell_p$ analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in $\ell_p$ norm, which includes $\ell_2$ and $\ell_\infty$ as special cases. For sub-Gaussian mixture models, the choice of $p$ giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the $\ell_p$ theory leads to a simple spectral algorithm that achieves the information threshold for exact recovery. These results also provide optimal recovery guarantees for Gaussian mixture and stochastic block models as special cases.
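A minimal sketch of the hollowing step described above, assuming it amounts to zeroing out the diagonal of the Gram matrix (which carries the sample-specific noise variances) before the eigendecomposition; the mixture parameters below are hypothetical and chosen only so the effect is visible:

```python
import numpy as np

def hollowed_pca(X, k):
    # Hollowed PCA: zero the diagonal of the Gram matrix before the
    # eigendecomposition, then return the top-k score vectors.
    G = X @ X.T
    np.fill_diagonal(G, 0.0)
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order], vals[order]

# Toy usage: two clusters whose noise scales differ across samples
# (heteroscedastic noise), a hypothetical instance of the setting above.
rng = np.random.default_rng(0)
n, d = 200, 50
labels = rng.integers(0, 2, n)
means = 3.0 * (2.0 * labels[:, None] - 1.0) * np.ones((n, d)) / np.sqrt(d)
X = means + rng.normal(size=(n, d)) * (0.5 + labels[:, None])
scores, _ = hollowed_pca(X, k=1)
acc = np.mean((scores[:, 0] > 0) == labels)
print(max(acc, 1 - acc))  # clustering accuracy up to a global sign flip
```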
Emmanuel Abbe, Peter Ralli (2020)
The r-th power of a graph modifies a graph by connecting every vertex pair within distance r. This paper gives a generalization of the Alon-Boppana Theorem for the r-th power of graphs, including irregular graphs. This leads to a generalized notion of Ramanujan graphs, those for which the powered graph has a spectral gap matching the derived Alon-Boppana bound. In particular, we show that certain graphs that are not good expanders due to local irregularities, such as Erd\H{o}s-R\'enyi random graphs, become almost Ramanujan once powered. A different generalization of Ramanujan graphs can also be obtained from the nonbacktracking operator. We next argue that the powering operator gives a more robust notion than the latter: sparse Erd\H{o}s-R\'enyi random graphs with an adversary modifying a subgraph of $\log(n)^c$ vertices are still almost Ramanujan in the powered sense, but not in the nonbacktracking sense. As an application, this gives robust community testing for different block models.
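A minimal sketch of the powering operation itself, computed by thresholding the walk counts of $(I+A)^r$; the graph and parameters are toy choices for illustration:

```python
import numpy as np

def graph_power(A, r):
    # r-th power of a graph: connect two distinct vertices whenever
    # their distance in the original graph is at most r.
    n = A.shape[0]
    P = np.linalg.matrix_power(np.eye(n, dtype=int) + A, r)
    B = (P >= 1).astype(int)   # walk of length <= r exists
    np.fill_diagonal(B, 0)     # keep the result a simple graph
    return B

# Toy usage: the square of a path on 5 vertices adds distance-2 edges.
A = np.diag(np.ones(4, dtype=int), 1)
A = A + A.T
print(graph_power(A, 2))
```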
Reed-Muller (RM) codes are among the oldest, simplest and perhaps most ubiquitous family of codes. They are used in many areas of coding theory in both electrical engineering and computer science. Yet, many of their important properties are still under investigation. This paper covers some of the recent developments regarding the weight enumerator and the capacity-achieving properties of RM codes, as well as some of the algorithmic developments. In particular, the paper discusses the recent connections established between RM codes, thresholds of Boolean functions, polarization theory, hypercontractivity, and the techniques of approximating low weight codewords using lower degree polynomials. It then overviews some of the algorithms with performance guarantees, as well as some of the algorithms with state-of-the-art performances in practical regimes. Finally, the paper concludes with a few open problems.
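For readers unfamiliar with the construction, here is a short sketch of the standard RM generator matrix, whose rows are the evaluations of low-degree monomials over the Boolean cube (this is the textbook definition, not an algorithm from the survey):

```python
import numpy as np
from itertools import combinations

def rm_generator(r, m):
    # Generator matrix of RM(r, m): rows are evaluations over {0,1}^m
    # of all monomials in m variables of degree at most r.
    pts = np.array([[(i >> j) & 1 for j in range(m)] for i in range(2 ** m)])
    rows = []
    for deg in range(r + 1):
        for S in combinations(range(m), deg):
            cols = list(S)
            rows.append(np.prod(pts[:, cols], axis=1) if cols
                        else np.ones(2 ** m, dtype=int))
    return np.array(rows)

G = rm_generator(1, 3)      # RM(1,3) is the [8, 4, 4] code
msg = np.array([1, 0, 1, 1])
print(msg @ G % 2)          # encode a 4-bit message
```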
Spectral algorithms, such as principal component analysis and spectral clustering, typically require careful data transformations to be effective: upon observing a matrix $A$, one may look at the spectrum of $\psi(A)$ for a properly chosen $\psi$. The issue is that the spectrum of $A$ might be contaminated by non-informational top eigenvalues, e.g., due to scale variations in the data, and the application of $\psi$ aims to remove these. Designing a good functional $\psi$ (and establishing what `good' means) is often challenging and model dependent. This paper proposes a simple and generic construction for sparse graphs, $$\psi(A) = \mathbb{1}\big((I+A)^r \ge 1\big),$$ where $A$ denotes the adjacency matrix and $r$ is an integer (less than the graph diameter). This produces a graph connecting vertices from the original graph that are within distance $r$, and is referred to as graph powering. It is shown that graph powering regularizes the graph and decontaminates its spectrum in the following sense: (i) If the graph is drawn from the sparse Erd\H{o}s-R\'enyi ensemble, which has no spectral gap, it is shown that graph powering produces a `maximal' spectral gap, with the latter justified by establishing an Alon-Boppana result for powered graphs; (ii) If the graph is drawn from the sparse SBM, graph powering is shown to achieve the fundamental limit for weak recovery (the KS threshold) similarly to \cite{massoulie-STOC}, settling an open problem therein. Further, graph powering is shown to be significantly more robust to tangles and cliques than previous spectral algorithms based on self-avoiding or nonbacktracking walk counts \cite{massoulie-STOC, Mossel_SBM2, bordenave, colin3}. This is illustrated on a geometric block model that is dense in cliques.
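The construction can be checked numerically; the sketch below compares the top of the spectrum of $A$ with that of $\psi(A)$ on a sparse Erd\H{o}s-R\'enyi graph. The graph size, average degree, and $r$ are arbitrary toy choices, for illustration only:

```python
import numpy as np

def powered_spectrum(A, r, k=4):
    # Top-k eigenvalues of the powering psi(A) = 1((I + A)^r >= 1).
    n = A.shape[0]
    P = np.linalg.matrix_power(np.eye(n, dtype=int) + A, r)
    B = (P >= 1).astype(float)
    np.fill_diagonal(B, 0.0)
    return np.sort(np.linalg.eigvalsh(B))[::-1][:k]

# Sparse Erdos-Renyi graph: degree fluctuations contaminate the top of
# the spectrum of A; powering smooths them out (illustration only).
rng = np.random.default_rng(0)
n, avg_deg = 600, 3
A = (rng.random((n, n)) < avg_deg / n).astype(int)
A = np.triu(A, 1); A = A + A.T
print(np.sort(np.linalg.eigvalsh(A.astype(float)))[::-1][:4])  # before
print(powered_spectrum(A, r=3))                                # after
```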
Recovering low-rank structures via eigenvector perturbation analysis is a common problem in statistical machine learning, arising in factor analysis, community detection, ranking, and matrix completion, among others. While a large variety of bounds are available for average errors between empirical and population statistics of eigenvectors, few results are tight for entrywise analyses, which are critical for a number of problems such as community detection. This paper investigates entrywise behaviors of eigenvectors for a large class of random matrices whose expectations are low-rank, which helps settle the conjecture in Abbe et al. (2014b) that the spectral algorithm achieves exact recovery in the stochastic block model without any trimming or cleaning steps. The key is a first-order approximation of eigenvectors under the $\ell_\infty$ norm: $$u_k \approx \frac{A u_k^*}{\lambda_k^*},$$ where $\{u_k\}$ and $\{u_k^*\}$ are eigenvectors of a random matrix $A$ and of its expectation $\mathbb{E} A$, respectively. The fact that the approximation is both tight and linear in $A$ facilitates sharp comparisons between $u_k$ and $u_k^*$. In particular, it allows for comparing the signs of $u_k$ and $u_k^*$ even if $\| u_k - u_k^* \|_{\infty}$ is large. The results are further extended to perturbations of eigenspaces, yielding new $\ell_\infty$-type bounds for synchronization ($\mathbb{Z}_2$-spiked Wigner model) and noisy matrix completion.
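The first-order approximation can be sanity-checked numerically; the sketch below uses a rank-one spiked Wigner-type matrix as a hypothetical toy stand-in for the models treated in the paper:

```python
import numpy as np

# Check u_1 ~ A u_1* / lambda_1* on a rank-one spiked random matrix.
rng = np.random.default_rng(0)
n = 500
u_star = np.ones(n) / np.sqrt(n)     # population eigenvector
lam_star = 20.0                      # population eigenvalue
W = rng.normal(size=(n, n))
noise = (W + W.T) / np.sqrt(2 * n)   # Wigner noise, spectral norm ~ 2
A = lam_star * np.outer(u_star, u_star) + noise

vals, vecs = np.linalg.eigh(A)
u = vecs[:, -1] * np.sign(vecs[:, -1] @ u_star)  # top empirical eigenvector
u_approx = A @ u_star / lam_star                 # linear approximation

print(np.max(np.abs(u - u_approx)))  # entrywise error of the approximation
print(np.max(np.abs(u - u_star)))    # typically larger than the line above
```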
We develop an information-theoretic view of the stochastic block model, a popular statistical model for the large-scale structure of complex networks. A graph $G$ from such a model is generated by first assigning vertex labels at random from a finite alphabet, and then connecting vertices with edge probabilities depending on the labels of the endpoints. In the case of the symmetric two-group model, we establish an explicit `single-letter' characterization of the per-vertex mutual information between the vertex labels and the graph. The explicit expression of the mutual information is intimately related to estimation-theoretic quantities, and, in particular, reveals a phase transition at the critical point for community detection. Below the critical point the per-vertex mutual information is asymptotically the same as if edges were independent. Correspondingly, no algorithm can estimate the partition better than random guessing. Conversely, above the threshold, the per-vertex mutual information is strictly smaller than the independent-edges upper bound. In this regime there exists a procedure that estimates the vertex labels better than random guessing.
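For reference, here is a minimal sampler for the symmetric two-group model discussed above, together with the location of the critical (Kesten-Stigum) point under this parametrization; the specific values of $a$ and $b$ are arbitrary:

```python
import numpy as np

def sbm_two_groups(n, a, b, rng):
    # Symmetric two-group SBM: labels uniform in {-1, +1}; vertices in
    # the same group connect with probability a/n, across groups b/n.
    x = rng.choice([-1, 1], size=n)
    probs = np.where(np.outer(x, x) > 0, a / n, b / n)
    A = np.triu(rng.random((n, n)) < probs, 1).astype(int)
    return x, A + A.T

# The phase transition above sits at the Kesten-Stigum point
# (a - b)^2 = 2 (a + b) for this parametrization.
a, b = 10.0, 2.0
print((a - b) ** 2 > 2 * (a + b))   # True: detection is possible here
x, A = sbm_two_groups(1000, a, b, np.random.default_rng(0))
```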
We consider the problem of clustering a graph $G$ into two communities by observing a subset of the vertex correlations. Specifically, we consider the inverse problem with observed variables $Y = B_G x \oplus Z$, where $B_G$ is the incidence matrix of a graph $G$, $x$ is the vector of unknown vertex variables (with a uniform prior) and $Z$ is a noise vector with Bernoulli$(\varepsilon)$ i.i.d. entries. All variables and operations are Boolean. This model is motivated by coding, synchronization, and community detection problems. In particular, it corresponds to a stochastic block model or a correlation clustering problem with two communities and censored edges. Without noise, exact recovery (up to global flip) of $x$ is possible if and only if the graph $G$ is connected, with a sharp threshold at the edge probability $\log(n)/n$ for Erd\H{o}s-R\'enyi random graphs. The first goal of this paper is to determine how the edge probability $p$ needs to scale to allow exact recovery in the presence of noise. Defining the degree (oversampling) rate of the graph by $\alpha = np/\log(n)$, it is shown that exact recovery is possible if and only if $\alpha > 2/(1-2\varepsilon)^2 + o(1/(1-2\varepsilon)^2)$. In other words, $2/(1-2\varepsilon)^2$ is the information-theoretic threshold for exact recovery at low SNR. In addition, an efficient recovery algorithm based on semidefinite programming is proposed and shown to succeed in the threshold regime up to twice the optimal rate. For a deterministic graph $G$, defining the degree rate as $\alpha = d/\log(n)$, where $d$ is the minimum degree of the graph, it is shown that the proposed method achieves the rate $\alpha > 4((1+\lambda)/(1-\lambda)^2)/(1-2\varepsilon)^2 + o(1/(1-2\varepsilon)^2)$, where $1-\lambda$ is the spectral gap of the graph $G$.
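A minimal end-to-end sketch of the model $Y = B_G x \oplus Z$ together with a simple spectral relaxation standing in for the SDP (the relaxation here is illustrative, not the paper's algorithm; graph size and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 200, 0.1, 0.1
x = rng.integers(0, 2, n)                            # hidden Boolean labels
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if rng.random() < p]
Z = (rng.random(len(edges)) < eps).astype(int)       # Bernoulli(eps) noise
Y = np.array([x[u] ^ x[v] for u, v in edges]) ^ Z    # Y = B_G x XOR Z

# Simple spectral relaxation of the SDP: place the +/-1 parities in a
# signed adjacency matrix and read off the signs of its top eigenvector.
S = np.zeros((n, n))
for (u, v), y in zip(edges, Y):
    S[u, v] = S[v, u] = 1.0 - 2.0 * y   # +1 "same side", -1 "different"
x_hat = (np.linalg.eigh(S)[1][:, -1] < 0).astype(int)
acc = np.mean(x_hat == x)
print(max(acc, 1 - acc))                # accuracy up to a global flip
```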
Polar codes are introduced for discrete memoryless broadcast channels. For $m$-user deterministic broadcast channels, polarization is applied to map uniformly random message bits from $m$ independent messages to one codeword while satisfying broadcast constraints. The polarization-based codes achieve rates on the boundary of the private-message capacity region. For two-user noisy broadcast channels, polar implementations are presented for two information-theoretic schemes: i) Cover's superposition codes; ii) Marton's codes. Due to the structure of polarization, constraints on the auxiliary and channel-input distributions are identified to ensure proper alignment of polarization indices in the multi-user setting. The codes achieve rates on the capacity boundary for several classes of broadcast channels (e.g., binary-input stochastically degraded channels). The complexity of encoding and decoding is $O(n \log n)$, where $n$ is the block length. In addition, the polar code sequences achieve a stretched-exponential decay $O(2^{-n^{\beta}})$, with $0 < \beta < 0.5$, of the average block error probability.
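A minimal sketch of the basic $O(n \log n)$ polarization transform $x = u F^{\otimes \log_2 n}$ with kernel $F = [[1,0],[1,1]]$ over $F_2$, omitting the bit-reversal permutation and the frozen-bit selection that an actual polar code (let alone the broadcast constructions above) would add:

```python
import numpy as np

def polar_transform(u):
    # Computes x = u F^{(tensor log2 n)} over F_2 with F = [[1,0],[1,1]],
    # via the recursive butterfly; n must be a power of two, and the
    # cost is O(n log n).
    u = np.asarray(u) % 2
    n = len(u)
    if n == 1:
        return u
    half = n // 2
    left = polar_transform(u[:half] ^ u[half:])  # (u1 XOR u2) branch
    right = polar_transform(u[half:])            # u2 branch
    return np.concatenate([left, right])

print(polar_transform([1, 0, 1, 1, 0, 0, 1, 0]))  # encode 8 bits
```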