
Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in $\mathbb{R}^d$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). Correspondingly, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm -- which is equivalent to a weighted $\ell_2$ norm -- is replaced by a weighted functional $\ell_p$ norm, which we refer to as the $\mathcal{F}_p$ norm. Unfortunately, the tractability of these approaches is unclear. The kernel trick is not available, and minimizing these norms requires solving an infinite-dimensional convex problem. We study random features approximations to these norms and show that, for $p>1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with $\mathcal{F}_p$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models.
A number of machine learning tasks entail a high degree of invariance: the data distribution does not change if we act on the data with a certain group of transformations. For instance, labels of images are invariant under translations of the images. Certain neural network architectures -- for instance, convolutional networks -- are believed to owe their success to the fact that they exploit such invariance properties. With the objective of quantifying the gain achieved by invariant architectures, we introduce two classes of models: invariant random features and invariant kernel methods. The latter includes, as a special case, the neural tangent kernel for convolutional networks with global average pooling. We consider uniform covariates distributions on the sphere and hypercube and a general invariant target function. We characterize the test error of invariant methods in a high-dimensional regime in which the sample size and number of hidden units scale as polynomials in the dimension, for a class of groups that we call 'degeneracy $\alpha$', with $\alpha \leq 1$. We show that exploiting invariance in the architecture saves a $d^\alpha$ factor ($d$ stands for the dimension) in sample size and number of hidden units to achieve the same test error as for unstructured architectures. Finally, we show that output symmetrization of an unstructured kernel estimator does not give a significant statistical improvement; on the other hand, data augmentation with an unstructured kernel estimator is equivalent to an invariant kernel estimator and enjoys the same improvement in statistical efficiency.
Consider the classical supervised learning problem: we are given data $(y_i,{\boldsymbol x}_i)$, $i\le n$, with $y_i$ a response and ${\boldsymbol x}_i\in {\mathcal X}$ a covariates vector, and try to learn a model $f:{\mathcal X}\to{\mathbb R}$ to predict future responses. Random features methods map the covariates vector ${\boldsymbol x}_i$ to a point ${\boldsymbol \phi}({\boldsymbol x}_i)$ in a higher dimensional space ${\mathbb R}^N$, via a random featurization map ${\boldsymbol \phi}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: $(1)$ What is the generalization error of KRR? $(2)$ How large should $N$ be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N\le n^{1-\delta}$ for some $\delta>0$. We characterize this gap. For $N\ge n^{1+\delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.
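For concreteness, the random features ridge regression estimator described in this abstract can be written in the standard form below. This is only a sketch under assumed notation: the regularization parameter $\lambda$, the coefficient vector $\hat{\boldsymbol a}$, the feature matrix ${\boldsymbol \Phi}$ and the response vector ${\boldsymbol y}$ are symbols introduced here for illustration and do not appear in the abstract.

```latex
% Ridge regression in the feature space R^N (notation assumed, see lead-in)
\hat{\boldsymbol a}(\lambda)
  = \arg\min_{{\boldsymbol a}\in{\mathbb R}^N}
    \Big\{ \sum_{i\le n} \big( y_i - \langle {\boldsymbol a}, {\boldsymbol \phi}({\boldsymbol x}_i) \rangle \big)^2
           + \lambda \|{\boldsymbol a}\|_2^2 \Big\},
\qquad
\hat f({\boldsymbol x}) = \langle \hat{\boldsymbol a}(\lambda), {\boldsymbol \phi}({\boldsymbol x}) \rangle .

% Closed form for \lambda > 0, with \Phi \in R^{n \times N} having rows \phi(x_i)^T
% and y = (y_1, \dots, y_n)^T
\hat{\boldsymbol a}(\lambda)
  = \big( {\boldsymbol \Phi}^{\top} {\boldsymbol \Phi} + \lambda {\boldsymbol I}_N \big)^{-1}
    {\boldsymbol \Phi}^{\top} {\boldsymbol y} .
```

In this form, the question of how large $N$ must be amounts to asking when $\hat f$ has the same test error as the KRR estimator for the kernel induced by ${\boldsymbol \phi}$.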
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NNs provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If feature vectors are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the feature vectors display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present a model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.
We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and the $y_i$'s are the corresponding class labels. We use two-layer neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.
We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layer neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N\le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + \delta} \le n \le d^{\ell +1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.
We consider learning two-layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in $\mathbb{R}^D$ (where $D$ is the number of parameters associated to each neuron). This evolution can be defined through a partial differential equation or, equivalently, as the gradient flow in the Wasserstein space of probability distributions. Earlier work shows that (under some regularity assumptions), the mean field description is accurate as soon as the number of hidden units is much larger than the dimension $D$. In this paper we establish stronger and more general approximation guarantees. First of all, we show that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimensions. Next, we generalize this analysis to the case of unbounded activation functions, which was not covered by earlier bounds. We extend our results to noisy stochastic gradient descent. Finally, we show that kernel ridge regression can be recovered as a special limit of the mean field analysis.
A number of statistical estimation problems can be addressed by semidefinite programs (SDP). While SDPs are solvable in polynomial time using interior point methods, in practice generic SDP solvers do not scale well to high-dimensional problems. In order to cope with this problem, Burer and Monteiro proposed a non-convex rank-constrained formulation, which has good performance in practice but is still poorly understood theoretically. In this paper we study the rank-constrained version of SDPs arising in MaxCut and in synchronization problems. We establish a Grothendieck-type inequality that proves that all the local maxima and dangerous saddle points are within a small multiplicative gap from the global maximum. We use this structural information to prove that SDPs can be solved within a known accuracy, by applying the Riemannian trust-region method to this non-convex problem, while constraining the rank to be of order one. For the MaxCut problem, our inequality implies that any local maximizer of the rank-constrained SDP provides a $(1 - 1/(k-1)) \times 0.878$ approximation of the MaxCut, when the rank is fixed to $k$. We then apply our results to data matrices generated according to the Gaussian ${\mathbb Z}_2$ synchronization problem, and the two-groups stochastic block model with large bounded degree. We prove that the error achieved by local maximizers undergoes a phase transition at the same threshold as for information-theoretically optimal methods.
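To illustrate the rank-constrained MaxCut formulation discussed in this abstract, the following is a minimal sketch in plain Python. Each vertex $i$ carries a unit vector in ${\mathbb R}^k$, and the relaxed cut value $\sum_{(i,j)} w_{ij}(1-\langle \sigma_i,\sigma_j\rangle)/2$ is increased by a simple block-coordinate update. Note that this swaps in coordinate ascent for the Riemannian trust-region method analyzed in the paper, and that every function name, the 5-cycle test graph, and the random-hyperplane rounding step are illustrative assumptions rather than the authors' code.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def relaxed_cut_value(edges, sigma):
    """Rank-k relaxation of the cut: sum of w * (1 - <sigma_i, sigma_j>) / 2."""
    return sum(
        w * (1.0 - sum(a * b for a, b in zip(sigma[i], sigma[j]))) / 2.0
        for i, j, w in edges
    )


def solve_rank_k(n, edges, k, sweeps=200, seed=0):
    """Block-coordinate ascent over unit vectors sigma_i in R^k (illustrative only)."""
    rng = random.Random(seed)
    sigma = [normalize([rng.gauss(0.0, 1.0) for _ in range(k)]) for _ in range(n)]
    neighbors = [[] for _ in range(n)]
    for i, j, w in edges:
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    for _ in range(sweeps):
        for i in range(n):
            # g = weighted sum of neighbor vectors; with the others fixed, the best
            # unit vector for vertex i points in the direction opposite to g.
            g = [0.0] * k
            for j, w in neighbors[i]:
                for c in range(k):
                    g[c] += w * sigma[j][c]
            if any(abs(x) > 1e-12 for x in g):
                sigma[i] = normalize([-x for x in g])
    return sigma


def round_hyperplane(sigma, seed=0):
    """Round vectors to +/-1 labels by the sign of a random Gaussian projection."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(len(sigma[0]))]
    return [1 if sum(a * b for a, b in zip(s, r)) >= 0 else -1 for s in sigma]


def cut_value(edges, labels):
    """Total weight of edges whose endpoints receive different labels."""
    return sum(w for i, j, w in edges if labels[i] != labels[j])


# Hypothetical example: unweighted 5-cycle, rank k = 3.
edges = [(i, (i + 1) % 5, 1.0) for i in range(5)]
sigma = solve_rank_k(5, edges, k=3)
labels = round_hyperplane(sigma)
print(relaxed_cut_value(edges, sigma), cut_value(edges, labels))
```

Each coordinate update cannot decrease the relaxed value, so the iterates approach a stationary point of the non-convex problem; the abstract's guarantee concerns local maximizers, which this simple scheme is not proven to reach.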
The important problem of reconstructing a diffusion network and its transmission probabilities from data has attracted considerable attention in the past several years. A number of recent papers introduced efficient algorithms for the estimation of spreading parameters, based on the maximization of the likelihood of observed cascades, assuming that the full information for all the nodes in the network is available. In this work, we focus on a more realistic and restricted scenario, in which only partial information on the cascades is available: either the set of activation times for a limited number of nodes, or the states of nodes for a subset of observation times. To tackle this problem, we first introduce a framework based on the maximization of the likelihood of the incomplete diffusion trace. However, we argue that the computation of this incomplete likelihood is a computationally hard problem, and show that a fast and robust reconstruction of transmission probabilities in sparse networks can be achieved with a new algorithm based on recently introduced dynamic message-passing equations for the spreading processes. The suggested approach can be easily generalized to a large class of discrete and continuous dynamic models, as well as to the cases of dynamically-changing networks and noisy information.
In this paper we consider regular low-density parity-check codes over a binary-symmetric channel in the decoding regime. We prove that up to a certain noise threshold the bit-error probability of the bit-sampling decoder converges in mean to zero over the code ensemble and the channel realizations. To arrive at this result we show that the bit-error probability of the sampling decoder is equal to the derivative of a Bethe free entropy. The method that we developed is new and is based on convexity of the free entropy and loop calculus. Convexity is needed to exchange limit and derivative and the loop series enables us to express the difference between the bit-error probability and the Bethe free entropy. We control the loop series using combinatorial techniques and a first moment method. We stress that our method is versatile and we believe that it can be generalized for LDPC codes with general degree distributions and for asymmetric channels.