Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Published by Zeyuan Allen-Zhu
Publication date: 2018
Research field: Informatics Engineering
Paper language: English





The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time, using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (which can be viewed as a second-order variant of NTK) and connect it to the SGD theory of escaping saddle points.
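To make the distinction concrete, here is a minimal numerical sketch (not the paper's construction; the network size, tanh activation, and perturbation scale are all illustrative assumptions) comparing a first-order, NTK-style linearization of a small two-layer network around its random initialization with a second-order, quadratic approximation of the kind the abstract refers to:

```python
# A minimal sketch (not the paper's construction): compare a first-order,
# NTK-style linearization of a small two-layer network around its random
# initialization with a second-order (quadratic) approximation.
# Sizes, the tanh activation, and the perturbation scale are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 200                                     # input dim, hidden width
x = rng.normal(size=d) / np.sqrt(d)                # one fixed input
W0 = rng.normal(size=(m, d))                       # random initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed output layer

def f(W):
    """Two-layer network with a smooth (tanh) activation."""
    return a @ np.tanh(W @ x)

# Perturb the hidden weights along a fixed random direction, as SGD would.
D = rng.normal(size=(m, d))
D /= np.linalg.norm(D)

def directional_derivatives(eps=1e-3):
    """Numerical 1st and 2nd directional derivatives of f along D at W0."""
    g = (f(W0 + eps * D) - f(W0 - eps * D)) / (2 * eps)
    h = (f(W0 + eps * D) - 2 * f(W0) + f(W0 - eps * D)) / eps**2
    return g, h

g, h = directional_derivatives()
for t in [0.1, 0.5, 1.0, 2.0]:
    exact = f(W0 + t * D)
    lin = f(W0) + t * g                 # NTK-style linearization
    quad = lin + 0.5 * t**2 * h         # quadratic (second-order) approximation
    print(f"t={t:3.1f}  |exact-linear|={abs(exact - lin):.2e}  "
          f"|exact-quadratic|={abs(exact - quad):.2e}")
```

For small perturbations both approximations track the network closely; the quadratic term keeps the error smaller as the weights move further from initialization, which is the regime where a purely linear (NTK) view becomes loose.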




Read also

We establish connections between the problem of learning a two-layer neural network and tensor decomposition. We consider a model with feature vectors $\boldsymbol{x} \in \mathbb{R}^d$, $r$ hidden units with weights $\{\boldsymbol{w}_i\}_{1 \le i \le r}$, and output $y \in \mathbb{R}$, i.e., $y = \sum_{i=1}^r \sigma(\boldsymbol{w}_i^{\mathsf{T}} \boldsymbol{x})$, with activation functions given by low-degree polynomials. In particular, if $\sigma(x) = a_0 + a_1 x + a_3 x^3$, we prove that no polynomial-time learning algorithm can outperform the trivial predictor that assigns to each example the response variable $\mathbb{E}(y)$, when $d^{3/2} \ll r \ll d^2$. Our conclusion holds for a `natural' data distribution, namely standard Gaussian feature vectors $\boldsymbol{x}$ and output distributed according to a two-layer neural network with random isotropic weights, and under a certain complexity-theoretic assumption on tensor decomposition. Roughly speaking, we assume that no polynomial-time algorithm can substantially outperform current methods for tensor decomposition based on the sum-of-squares hierarchy. We also prove generalizations of this statement for higher-degree polynomial activations and non-random weight vectors. Remarkably, several existing algorithms for learning two-layer networks with rigorous guarantees are based on tensor decomposition. Our results support the idea that this is indeed the core computational difficulty in learning such networks, under the stated generative model for the data. As a side result, we show that under this model learning the network requires accurate learning of its weights, a property that does not hold in a more general setting.
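As a quick illustration of the generative model above, the following sketch samples data from it. The coefficients $a_0, a_1, a_3$ and the sizes $d, r, n$ are arbitrary choices for illustration, with $r$ placed in the hard regime $d^{3/2} \ll r \ll d^2$:

```python
# A minimal sketch of the generative model described above; the coefficients
# a0, a1, a3 and the sizes d, r, n are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 50, 400, 1000            # here d^{3/2} ~ 354 < r = 400 < d^2 = 2500
a0, a1, a3 = 0.1, 1.0, 0.5         # sigma(z) = a0 + a1*z + a3*z^3

def sigma(z):
    return a0 + a1 * z + a3 * z**3

W = rng.normal(size=(r, d)) / np.sqrt(d)   # random isotropic hidden weights
X = rng.normal(size=(n, d))                # standard Gaussian feature vectors
Y = sigma(X @ W.T).sum(axis=1)             # y = sum_i sigma(w_i^T x)

# The hardness result says no polynomial-time learner can substantially beat
# the trivial predictor E[y] in this regime; its mean squared error is Var(y).
print("Var(y) (trivial predictor's MSE):", Y.var())
```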
Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational PAC-learning language, what concept class can they learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens without affecting each other? Existing generalization bounds for RNNs scale exponentially with the input length, significantly limiting their practical implications. In this paper, we show that, using vanilla stochastic gradient descent (SGD), RNNs can actually learn some notable concept class efficiently, meaning that both time and sample complexity scale polynomially in the input length (or almost polynomially, depending on the concept). This concept class at least includes functions where each output token is generated from the inputs of earlier tokens using a smooth two-layer neural network.
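The concept class mentioned at the end can be made concrete with a small sketch. The instance below is not the paper's exact definition: the token dimension, hidden width, tanh activation, and the use of a running average of earlier tokens are all illustrative assumptions. It generates each output token from the earlier input tokens through one smooth two-layer network:

```python
# One simple instance of the concept class above (sizes, tanh, and the use of
# a running average of earlier tokens are illustrative assumptions): each
# output token y_t is produced from the inputs x_1, ..., x_t by a single
# smooth two-layer network.
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 8, 16, 5                          # token dim, hidden width, length
W = rng.normal(size=(k, d)) / np.sqrt(d)    # first-layer weights
v = rng.normal(size=k) / np.sqrt(k)         # second-layer weights

def target_outputs(x_tokens):
    """y_t = v^T tanh(W * mean(x_1..x_t)) for t = 1..T."""
    return np.array([
        v @ np.tanh(W @ x_tokens[:t].mean(axis=0))
        for t in range(1, len(x_tokens) + 1)
    ])

x_seq = rng.normal(size=(T, d))             # a random input sequence
print(target_outputs(x_seq))                # the labels an RNN would fit
```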
How can neural networks such as ResNet efficiently learn CIFAR-10 with test accuracy of more than 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justification for this gap? Recently, there has been an influential line of work relating neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept classes that are also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class BETTER than kernels? We answer this positively in the distribution-free setting. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove there are simple functions in this class such that, with the same number of training examples, the test error obtained by neural networks can be MUCH SMALLER than that of ANY kernel method, including neural tangent kernels (NTK). The main intuition is that multi-layer neural networks can implicitly perform hierarchical learning using their different layers, which reduces the sample complexity compared to one-shot learning algorithms such as kernel methods. In a follow-up work [2], this theory of hierarchical learning is further strengthened to incorporate the backward feature correction process when training deep networks. In the end, we also prove a computational complexity advantage of ResNet with respect to other learning methods, including linear regression over arbitrary feature mappings.
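For intuition, the sketch below writes down one member of the concept class described above: a narrow three-layer residual network with a smooth activation. The widths, the tanh nonlinearity, and the placement of the skip connection are assumptions chosen only to illustrate the hierarchical structure the abstract refers to:

```python
# A sketch of one target function in the concept class above: a narrow
# three-layer residual network with a smooth activation. Widths, tanh, and
# the placement of the skip connection are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 8                                # input dim, hidden width
W1 = rng.normal(size=(k, d)) / np.sqrt(d)
W2 = rng.normal(size=(k, k)) / np.sqrt(k)
v = rng.normal(size=k) / np.sqrt(k)

def residual_target(x):
    h1 = np.tanh(W1 @ x)        # first layer: low-level features of x
    h2 = h1 + np.tanh(W2 @ h1)  # second layer, with a residual connection
    return v @ h2               # linear readout

print(residual_target(rng.normal(size=d)))
```

The point of the hierarchy is that a network can first learn the features computed by the inner layer and then compose them, whereas a kernel method must fit the composed function in one shot.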
In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize when trained using stochastic gradient descent with a sufficiently small learning rate and suitable initialization. In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is a one-hidden-layer overparameterized network. Our main contributions are two-fold: (1) On the one hand, we provide theoretical and empirical evidence that for a class of NFs containing most of the existing NF models, overparametrization hurts training. (2) On the other hand, we prove that unconstrained NFs, a recently introduced model, can efficiently learn any reasonable data distribution under minimal assumptions when the underlying network is overparametrized.
Reinforcement learning systems require good representations to work well. For decades, practical success in reinforcement learning was limited to small domains. Deep reinforcement learning systems, on the other hand, are scalable, are not dependent on domain-specific prior knowledge, and have been successfully used to play Atari, in 3D navigation from pixels, and to control high-degree-of-freedom robots. Unfortunately, the performance of deep reinforcement learning systems is sensitive to hyper-parameter settings and architecture choices. Even well-tuned systems exhibit significant instability both within a trial and across experiment replications. In practice, significant expertise and trial and error are usually required to achieve good performance. One potential source of the problem is known as catastrophic interference: when later training decreases performance by overriding previous learning. Interestingly, the powerful generalization that makes Neural Networks (NN) so effective in batch supervised learning might explain the challenges when applying them to reinforcement learning tasks. In this paper, we explore how online NN training and interference interact in reinforcement learning. We find that simply re-mapping the input observations to a high-dimensional space improves learning speed and parameter sensitivity. We also show this preprocessing reduces interference in prediction tasks. More practically, we provide a simple approach to NN training that is easy to implement and requires little additional computation. We demonstrate that our approach improves performance in both prediction and control with an extensive batch of experiments in classic control domains.
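The "re-mapping to a high-dimensional space" idea can be illustrated with a short sketch. Random Fourier features are used here purely as one familiar way to expand a low-dimensional observation; the paper's actual preprocessing may differ, and all sizes are assumptions:

```python
# A hedged sketch of the preprocessing idea above: expand a low-dimensional
# observation into a high-dimensional representation before it reaches the NN.
# Random Fourier features are used only as a familiar example of such a
# mapping; the paper's actual choice of mapping may differ.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 4, 256                  # e.g. a classic-control observation

P = rng.normal(size=(feat_dim, obs_dim))    # fixed random projection
b = rng.uniform(0.0, 2.0 * np.pi, size=feat_dim)

def remap(obs):
    """High-dimensional features fed to the value/policy network."""
    return np.sqrt(2.0 / feat_dim) * np.cos(P @ obs + b)

obs = rng.normal(size=obs_dim)              # a stand-in observation
print(remap(obs).shape)                     # (256,)
```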
