
Sketchy Empirical Natural Gradient Methods for Deep Learning

Published by: Minghan Yang
Publication date: 2020
Research field: Mathematical Statistics
Paper language: English





In this paper, we develop an efficient sketchy empirical natural gradient method (SENG) for large-scale deep learning problems. The empirical Fisher information matrix is usually low-rank since only a small amount of data can be sampled at each iteration. Although the corresponding natural gradient direction lies in a small subspace, both the computational cost and the memory requirement remain intractable due to the high dimensionality. We design randomized techniques for different neural network structures to resolve these challenges. For layers with a reasonable dimension, sketching can be performed on a regularized least squares subproblem. For the remaining layers, since the gradient is the vectorization of a product of two matrices, we apply sketching to low-rank approximations of these matrices to compute the most expensive parts. A distributed version of SENG is also developed for extremely large-scale applications. Global convergence to stationary points is established under mild assumptions, and fast linear convergence is analyzed in the neural tangent kernel (NTK) regime. Extensive experiments on convolutional neural networks show the competitiveness of SENG compared with state-of-the-art methods. On ResNet50 with ImageNet-1k, SENG achieves 75.9% Top-1 testing accuracy within 41 epochs. Experiments on distributed large-batch training show that the scaling efficiency is reasonable.
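To make the low-rank structure concrete, the sketch below shows one way a damped empirical-Fisher natural gradient step can be computed by solving only a small m-by-m system in the subspace spanned by the per-sample gradients, with an optional Gaussian sketch to cheapen the Gram matrix. This is a minimal illustration under assumed names (U, g, lam, sketch_dim) and an assumed Gaussian sketch; it is not the paper's exact per-layer construction.

```python
import numpy as np

def empirical_ng_direction(U, g, lam=0.1, sketch_dim=None, rng=None):
    """Damped empirical natural gradient d = (F + lam*I)^{-1} g,
    where F = (1/m) U @ U.T and the columns of U are per-sample gradients.

    Illustrative only: the Gaussian sketch and the names U, g, lam, and
    sketch_dim are assumptions, not SENG's per-layer scheme.
    """
    n, m = U.shape                      # n = #parameters (large), m = batch size (small)
    rng = np.random.default_rng() if rng is None else rng

    if sketch_dim is None or sketch_dim >= n:
        gram = U.T @ U                  # exact m x m Gram matrix
    else:
        # Randomized sketching: compress the n rows of U to sketch_dim rows
        # before forming the Gram matrix, so this step no longer scales with n.
        S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
        SU = S @ U
        gram = SU.T @ SU                # approximation of U.T @ U

    # Woodbury identity: (lam*I + (1/m) U U^T)^{-1} g
    #   = (1/lam) * (g - U @ (m*lam*I_m + U^T U)^{-1} @ (U^T g)),
    # so only an m x m linear system is solved instead of an n x n one.
    coef = np.linalg.solve(m * lam * np.eye(m) + gram, U.T @ g)
    return (g - U @ coef) / lam
```

When g is the average of the columns of U (the mini-batch gradient), the returned direction lies in the span of the sampled per-sample gradients, which is the small subspace the abstract refers to.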




Read also

181 - Nihat Ay 2020
We study the natural gradient method for learning in deep Bayesian networks, including neural networks. There are two natural geometries associated with such learning systems consisting of visible and hidden units. One geometry is related to the full system, the other one to the visible sub-system. These two geometries imply different natural gradients. In a first step, we demonstrate a great simplification of the natural gradient with respect to the first geometry, due to locality properties of the Fisher information matrix. This simplification does not directly translate to a corresponding simplification with respect to the second geometry. We develop the theory for studying the relation between the t
375 - Minghan Yang, Dong Xu, Qiwen Cui 2021
In this paper, a novel second-order method called NG+ is proposed. Following the rule ``the shape of the gradient equals the shape of the parameter'', we define a generalized Fisher information matrix (GFIM) using products of gradients in matrix form rather than the traditional vectorization. The generalized natural gradient direction is then simply the inverse of the GFIM multiplied by the gradient in matrix form. Moreover, the GFIM and its inverse are kept the same for multiple steps, so the computational cost can be controlled and is comparable with first-order methods. Global convergence is established under mild conditions, and a regret bound is also given for the online learning setting. Numerical results on image classification with ResNet50, quantum chemistry modeling with SchNet, neural machine translation with Transformer, and a recommendation system with DLRM illustrate that NG+ is competitive with state-of-the-art methods.
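As a rough illustration of the matrix-shaped update described above (a sketch only: the abstract does not spell out the exact GFIM, so the averaged product and the damping lam below are assumptions):

```python
import numpy as np

def ngplus_like_direction(Gs, lam=0.1):
    """Hypothetical NG+-style direction for one weight matrix W (p x q).

    Gs: per-sample gradients, each with the same p x q shape as W.
    The GFIM is assumed here to be the averaged matrix product
    sum_i G_i @ G_i.T / m (p x p), i.e. no vectorization of W.
    """
    p, _ = Gs[0].shape
    m = len(Gs)
    G_bar = sum(Gs) / m                        # gradient kept in matrix form
    gfim = sum(Gi @ Gi.T for Gi in Gs) / m     # p x p generalized FIM
    # Direction = (GFIM + lam*I)^{-1} @ G_bar, same shape as W; the factor
    # can be reused for several steps to amortize its cost.
    return np.linalg.solve(gfim + lam * np.eye(p), G_bar)
```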
104 - Wuchen Li, Guido Montufar 2018
We study a natural Wasserstein gradient flow on manifolds of probability distributions with discrete sample spaces. We derive the Riemannian structure for the probability simplex from the dynamical formulation of the Wasserstein distance on a weighted graph. We pull back the geometric structure to the parameter space of any given probability model, which allows us to define a natural gradient flow there. In contrast to the natural Fisher-Rao gradient, the natural Wasserstein gradient incorporates a ground metric on the sample space. We illustrate the analysis on elementary exponential family examples and demonstrate an application of the Wasserstein natural gradient to maximum likelihood estimation.
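In symbols (a hedged paraphrase; F, G_W, and G_FR below are generic notation rather than the paper's own), pulling the Wasserstein metric back to parameter space turns the flow into a preconditioned gradient flow,

    \dot{\theta}_t = -\, G_W(\theta_t)^{-1} \, \nabla_\theta F(\theta_t),

where G_W(\theta) is the pulled-back Wasserstein metric tensor of the model \theta \mapsto p_\theta (and hence depends on the ground metric of the sample space); replacing it with the Fisher-Rao metric G_FR(\theta) recovers the usual natural gradient flow.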
We consider a general class of mean field control problems described by stochastic delayed differential equations of McKean-Vlasov type. Two numerical algorithms based on deep learning techniques are provided: one directly parameterizes the optimal control using neural networks, while the other numerically solves the McKean-Vlasov forward anticipated backward stochastic differential equation (MV-FABSDE) system. In addition, we establish a necessary and sufficient stochastic maximum principle for this class of mean field control problems with delay, based on differential calculus on functions of measures, as well as existence and uniqueness results for the associated MV-FABSDE system.
297 - Daniel Levy, John C. Duchi 2019
We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. In particular, we show that when the constraint set is quadratically convex, diagonally pre-conditioned stochastic gradient methods are minimax optimal. We further provide a converse showing that when the constraints are not quadratically convex (for example, any $\ell_p$-ball for $p < 2$), these methods are far from optimal. Based on this, we give concrete recommendations for when one should use adaptive, mirror, or stochastic gradient methods.