Quantitative Propagation of Chaos for SGD in Wide Neural Networks

Published by Valentin De Bortoli

Publication date: 2020

Research field: Mathematical Statistics, Informatics Engineering

Paper language: English

Authors: Valentin De Bortoli, Alain Durmus, Xavier Fontaine

In this paper, we investigate the limiting behavior of a continuous-time counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to two-layer overparameterized neural networks, as the number of neurons (i.e., the size of the hidden layer) $N \to +\infty$. Following a probabilistic approach, we show propagation of chaos for the particle system defined by this continuous-time dynamics under different scenarios, indicating that the statistical interaction between the particles asymptotically vanishes. In particular, we establish quantitative convergence with respect to $N$ of any particle to a solution of a mean-field McKean-Vlasov equation in the metric space endowed with the Wasserstein distance. In comparison to previous works on the subject, we consider settings in which the sequence of stepsizes in SGD can potentially depend on the number of neurons and the iterations. We then identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand. We perform various experiments on real datasets to validate our theoretical results, assessing the existence of these two regimes on classification problems and illustrating our convergence results.
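As a rough illustration of the kind of particle system the abstract refers to, the following sketch trains a two-layer network with mean-field ($1/N$) output scaling by plain discrete-time SGD, treating each neuron as a particle. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy one-dimensional regression target, the tanh activation, the squared loss, and the choice of a stepsize proportional to $N$ (`gamma * n_neurons`), which keeps the per-particle update of order one.

```python
import math
import random


def network_output(particles, x):
    """Mean-field output f(x) = (1/N) * sum_i a_i * tanh(w_i * x)."""
    return sum(a * math.tanh(w * x) for a, w in particles) / len(particles)


def sgd_step(particles, x, y, step):
    """One SGD step on the squared loss 0.5 * (f(x) - y)^2 for one sample."""
    n = len(particles)
    err = network_output(particles, x) - y
    updated = []
    for a, w in particles:
        t = math.tanh(w * x)
        grad_a = err * t / n                      # d loss / d a_i
        grad_w = err * a * (1.0 - t * t) * x / n  # d loss / d w_i
        updated.append((a - step * grad_a, w - step * grad_w))
    return updated


def target(x):
    """Hypothetical toy regression target (not from the paper)."""
    return math.sin(2.0 * x)


def mean_squared_error(particles, n_grid=41):
    """Average squared error on a uniform grid over [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n_grid - 1) for i in range(n_grid)]
    return sum((network_output(particles, x) - target(x)) ** 2 for x in xs) / n_grid


def train(n_neurons, n_steps=2000, gamma=0.1, seed=0):
    """Run SGD and return (initial particles, trained particles)."""
    rng = random.Random(seed)
    particles = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_neurons)]
    initial = list(particles)
    # Assumption: stepsize proportional to N, so each particle moves at an
    # O(1) rate even though its gradient carries a 1/N factor.
    step = gamma * n_neurons
    for _ in range(n_steps):
        x = rng.uniform(-1.0, 1.0)
        particles = sgd_step(particles, x, target(x), step)
    return initial, particles
```

Calling `train(50)` and comparing `mean_squared_error` on the two returned particle lists gives a quick check that the dynamics reduce the loss. Repeating this for increasing `n_neurons` is one informal way to look at how individual particles behave as the hidden layer grows.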

Jiri Hron, Yasaman Bahri, Roman Novak 2020

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations. We provide the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling.
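The rejection-sampling idea mentioned at the end of this abstract can be sketched generically: draw weights from the prior and accept each draw with probability equal to the likelihood divided by its maximum value. The sketch below is an assumption-laden toy, not the authors' procedure: the one-hidden-layer tanh network with $1/\sqrt{N}$ scaling, the standard normal prior, the Gaussian likelihood, and all function names are hypothetical.

```python
import math
import random


def bnn_forward(weights, x):
    """Tiny one-hidden-layer network with 1/sqrt(width) output scaling."""
    hidden_w, out_w = weights
    n = len(hidden_w)
    return sum(v * math.tanh(w * x) for w, v in zip(hidden_w, out_w)) / math.sqrt(n)


def sample_prior(rng, width):
    """Draw all weights i.i.d. from a standard normal prior (an assumption)."""
    hidden_w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    out_w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return hidden_w, out_w


def likelihood_ratio(weights, data, noise_std):
    """Gaussian likelihood divided by its maximum, so the value is in (0, 1]."""
    sq_err = sum((bnn_forward(weights, x) - y) ** 2 for x, y in data)
    return math.exp(-sq_err / (2.0 * noise_std ** 2))


def rejection_sample_posterior(data, width=5, noise_std=0.5,
                               n_samples=100, seed=0, max_tries=10 ** 6):
    """Exact posterior samples: propose from the prior, accept w.p. the ratio."""
    rng = random.Random(seed)
    accepted = []
    tries = 0
    while len(accepted) < n_samples and tries < max_tries:
        tries += 1
        weights = sample_prior(rng, width)
        if rng.random() < likelihood_ratio(weights, data, noise_std):
            accepted.append(weights)
    return accepted
```

Because the acceptance probability shrinks quickly as the dataset grows or the noise level decreases, this approach is practical only for very small problems, which is consistent with the abstract's restriction to a small dataset.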

Machine Learning

Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

David Gamarnik, Eren C. Kızıldağ 2021

Machine Learning, Probability

Normalization effects on shallow neural networks and related asymptotic expansions

Jiahui Yu, Konstantinos Spiliopoulos 2020

We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalization. We develop an asymptotic expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$, there is no bias-variance trade-off, in that both bias and variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the implied normalization by the scaling parameter approaches the mean-field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy monotonically improve as the neural network's normalization gets closer to the mean-field normalization.
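To make the family of normalizations between $1/\sqrt{N}$ and $1/N$ concrete, the sketch below scales the output of a randomly initialized shallow network by $N^{-\beta}$ and estimates the variance of that output over random initializations. The exponent name `beta`, the tanh activation, and the standard normal initialization are assumptions chosen for illustration. The sketch looks only at the network at initialization, so it does not reproduce the abstract's results about trained networks.

```python
import math
import random


def scaled_output(a, w, x, beta):
    """Output N^(-beta) * sum_i a_i * tanh(w_i * x) of a shallow network."""
    n = len(a)
    return sum(ai * math.tanh(wi * x) for ai, wi in zip(a, w)) / n ** beta


def output_variance_at_init(n_neurons, beta, x=1.0, n_draws=500, seed=0):
    """Sample variance of the output over random standard normal initializations."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_draws):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        w = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        outputs.append(scaled_output(a, w, x, beta))
    mean = sum(outputs) / n_draws
    return sum((o - mean) ** 2 for o in outputs) / (n_draws - 1)
```

With `beta=0.5` the variance at initialization stays of order one as `n_neurons` grows, while with `beta=1.0` it shrinks like $1/N$. Intermediate values of `beta` interpolate between the two.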

Machine Learning, Probability

Neural Gaussian Mirror for Controlled Feature Selection in Neural Networks

Xin Xing, Yu Gui, Chenguang Dai 2020

Deep neural networks (DNNs) have become increasingly popular and achieved outstanding performance in predictive tasks. However, the DNN framework itself cannot inform the user which features are more or less relevant for making the prediction, which limits its applicability in many scientific fields. We introduce neural Gaussian mirrors (NGMs), in which mirrored features are created, via a structured perturbation based on a kernel-based conditional dependence measure, to help evaluate feature importance. We design two modifications of the DNN architecture for incorporating mirrored features and providing mirror statistics to measure feature importance. As shown in simulated and real data examples, the proposed method controls the feature selection error rate at a predefined level and maintains a high selection power even in the presence of highly correlated features.

Machine Learning

Data Cleansing for Models Trained with SGD

Satoshi Hara, Atsushi Nitanda, Takanori Maehara 2019

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can suggest influential instances without using any domain knowledge. With the proposed method, users only need to inspect the instances suggested by the algorithm, implying that users do not need extensive knowledge for this procedure, which enables even non-experts to conduct data cleansing and improve the model. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach specifically designed for the models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating intermediate models computed in each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.

Machine Learning