Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

107 0 0.0 ( 0 )

Download Cite

Added by Eren Can K{\\i}z{\\i}lda\\u{g}

Publication date 2021

fields Mathematical Statistics Informatics Engineering

and research's language is English

Authors David Gamarnik - Eren C. K{i}z{i}ldau{g} -

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

No English abstract

rate research

Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

170 - Guodong Zhang , James Martens , Roger Grosse 2019

Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for emph{nonlinear} networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of networks output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions, and we give a bound on the rate of this convergence.

Machine Learning Machine Learning

DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks

111 - Shiyun Xu , Zhiqi Bu 2020

Recent years have witnessed strong empirical performance of over-parameterized neural networks on various tasks and many advances in the theory, e.g. the universal approximation and provable convergence to global minimum. In this paper, we incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in the high dimensional linear problem. By doing so, we can exploit a wide class of networks to approximate the nuisance functions and to estimate the parameters of interest consistently. Therefore, we may offer the best of two worlds: the universal approximation ability from neural networks and the interpretability from classic ordinary linear model, leading to both valid inference and accurate prediction. We show the theoretical foundations that make this possible and demonstrate with numerical experiments. Furthermore, we propose a framework, DebiNet, in which we plug-in arbitrary feature selection methods to our semi-parametric neural network. DebiNet can debias the regularized estimators (e.g. Lasso) and perform well, in terms of the post-selection inference and the generalization error.

Machine Learning Machine Learning

One-pass Stochastic Gradient Descent in Overparametrized Two-layer Neural Networks

196 - Jiaming Xu , Hanjing Zhu 2021

There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous works assume that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparamterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmids inequality.

Machine Learning Machine Learning Optimization and Control

Quantitative Propagation of Chaos for SGD in Wide Neural Networks

171 - Valentin De Bortoli , Alain Durmus , Xavier Fontaine 2020

In this paper, we investigate the limiting behavior of a continuous-time counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to two-layer overparameterized neural networks, as the number or neurons (ie, the size of the hidden layer) $N to +infty$. Following a probabilistic approach, we show propagation of chaos for the particle system defined by this continuous-time dynamics under different scenarios, indicating that the statistical interaction between the particles asymptotically vanishes. In particular, we establish quantitative convergence with respect to $N$ of any particle to a solution of a mean-field McKean-Vlasov equation in the metric space endowed with the Wasserstein distance. In comparison to previous works on the subject, we consider settings in which the sequence of stepsizes in SGD can potentially depend on the number of neurons and the iterations. We then identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand. We perform various experiments on real datasets to validate our theoretical results, assessing the existence of these two regimes on classification problems and illustrating our convergence results.

Machine Learning Machine Learning Probability

Linear approximability of two-layer neural networks: A comprehensive analysis based on spectral decay

82 - Jihao Long , Lei Wu 2021

In this paper, we present a spectral-based approach to study the linear approximation of two-layer neural networks. We first consider the case of single neuron and show that the linear approximability, quantified by the Kolmogorov width, is controlled by the eigenvalue decay of an associate kernel. Then, we show that similar results also hold for two-layer neural networks. This spectral-based approach allows us to obtain upper bounds, lower bounds, and explicit hard examples in a united manner. In particular, these bounds imply that for networks activated by smooth functions, restricting the norms of inner-layer weights may significantly impair the expressiveness. By contrast, for non-smooth activation functions, such as ReLU, the network expressiveness is independent of the inner-layer weight norms. In addition, we prove that for a family of non-smooth activation functions, including ReLU, approximating any single neuron with random features suffers from the emph{curse of dimensionality}. This provides an explicit separation of expressiveness between neural networks and random feature models.

Machine Learning Machine Learning

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

Ask ChatGPT about the research

No Arabic abstract

No English abstract

Read More

suggested questions