We consider the problem of learning an unknown function $f_\star$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i, {\boldsymbol x}_i)\}_{i \le n}$, where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i = f_\star({\boldsymbol x}_i) + \varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layer neural networks around a random initialization: the random features model of Rahimi-Recht (RF), and the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n = \infty$ while $d$ and $N$ are large but finite, and the sample-size-limited regime, in which $N = \infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N \le d^{\ell + 1 - \delta}$ for some small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, while NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples satisfies $d^{\ell + \delta} \le n \le d^{\ell + 1 - \delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression, and the optimal prediction error is achieved with vanishing ridge regularization.
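For reference, one standard parametrization of the two linearized classes (normalization conventions vary across works; the scaling below is one common choice, assumed here purely for illustration) is
\[
f_{\mathrm{RF}}({\boldsymbol x}; {\boldsymbol a}) \;=\; \sum_{i=1}^{N} a_i \, \sigma\big(\langle {\boldsymbol w}_i, {\boldsymbol x}\rangle\big),
\qquad
f_{\mathrm{NT}}({\boldsymbol x}; {\boldsymbol a}_1, \dots, {\boldsymbol a}_N) \;=\; \sum_{i=1}^{N} \langle {\boldsymbol a}_i, {\boldsymbol x}\rangle \, \sigma'\big(\langle {\boldsymbol w}_i, {\boldsymbol x}\rangle\big),
\]
where $\sigma$ is the activation function, the first-layer weights ${\boldsymbol w}_i$ are drawn at random (e.g. uniformly on the sphere) and kept fixed, and only the second-layer coefficients $a_i$ (resp. ${\boldsymbol a}_i$) are fit to the data, typically by ridge regression.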
We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Featu
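As an illustration of model (1), a quadratic target can be written (the specific parametrization below is an assumed example, not the paper's notation) as
\[
y_i \;=\; f_*({\boldsymbol x}_i) \;=\; b_0 + \langle {\boldsymbol b}, {\boldsymbol x}_i \rangle + \langle {\boldsymbol x}_i, {\boldsymbol B}\, {\boldsymbol x}_i \rangle,
\qquad {\boldsymbol x}_i \sim \mathsf{N}(0, {\boldsymbol I}_d),
\]
for some unknown $b_0 \in \mathbb{R}$, ${\boldsymbol b} \in \mathbb{R}^d$ and symmetric ${\boldsymbol B} \in \mathbb{R}^{d \times d}$.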
We consider learning two-layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in $\mathbb{R}^D$
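Schematically, and under assumptions that vary across works (the expressions below are a generic sketch of a mean-field limit, not the paper's exact statement), the $N$ neurons are replaced by a distribution $\rho_t$ over single-neuron parameters ${\boldsymbol \theta} \in \mathbb{R}^D$, with
\[
f({\boldsymbol x}; \rho) \;=\; \int \sigma_*({\boldsymbol x}; {\boldsymbol \theta})\, \rho(\mathrm{d}{\boldsymbol \theta}),
\qquad
\partial_t \rho_t \;=\; \nabla_{\boldsymbol \theta} \cdot \big( \rho_t\, \nabla_{\boldsymbol \theta} \Psi({\boldsymbol \theta}; \rho_t) \big),
\]
where $\sigma_*({\boldsymbol x}; {\boldsymbol \theta})$ denotes the contribution of a single neuron and $\Psi({\boldsymbol \theta}; \rho)$ is the effective potential induced by the population risk.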
In this paper, we consider high-dimensional stationary processes where a new observation is generated from a compressed version of past observations. The specific evolution is modeled by an encoder-decoder structure. We estimate the evolution with an
We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-la
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparamete