ترغب بنشر مسار تعليمي؟ اضغط هنا

Almost Sure Convergence of Dropout Algorithms for Neural Networks

69   0   0.0 ( 0 )
 نشر من قبل Jaron Sanders
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) that, over the years, have spawned from Dropout (Hinton et al., 2012). Modeling that neurons in the brain may not fire, dropout algorithms consist in practice of multiplying the weight matrices of a NN component-wise by independently drawn random matrices with ${0,1}$-valued entries during each iteration of the Feedforward-Backpropagation algorithm. This paper presents a probability theoretical proof that for any NN topology and differentiable polynomially bounded activation functions, if we project the NNs weights into a compact set and use a dropout algorithm, then the weights converge to a unique stationary set of a projected system of Ordinary Differential Equations (ODEs). We also establish an upper bound on the rate of convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms for arborescences (a class of trees) of arbitrary depth and with linear activation functions.



قيم البحث

اقرأ أيضاً

Recently, Keating, Linden, and Wells cite{KLW} showed that the density of states measure of a nearest-neighbor quantum spin glass model is approximately Gaussian when the number of particles is large. The density of states measure is the ensemble ave rage of the empirical spectral measure of a random matrix; in this paper, we use concentration of measure and entropy techniques together with the result of cite{KLW} to show that in fact, the empirical spectral measure of such a random matrix is almost surely approximately Gaussian itself, with no ensemble averaging. We also extend this result to a spherical quantum spin glass model and to the more general coupling geometries investigated by ErdH{o}s and Schroder.
We analyze the convergence rate of gradient flows on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks (NNs) - which can also be viewed as doing matrix factorization using a particular regula rizer. Dropout algorithms such as these are thus regularization techniques that use 0,1-valued random variables to filter weights during training in order to avoid coadaptation of features. By leveraging a recent result on nonconvex optimization and conducting a careful analysis of the set of minimizers as well as the Hessian of the loss function, we are able to obtain (i) a local convergence proof of the gradient flow and (ii) a bound on the convergence rate that depends on the data, the dropout probability, and the width of the NN. Finally, we compare this theoretical bound to numerical simulations, which are in qualitative agreement with the convergence bound and match it when starting sufficiently close to a minimizer.
This paper is concerned with convergence of stochastic gradient algorithms with momentum terms in the nonconvex setting. A class of stochastic momentum methods, including stochastic gradient descent, heavy ball, and Nesterovs accelerated gradient, is analyzed in a general framework under mild assumptions. Based on the convergence result of expected gradients, we prove the almost sure convergence by a detailed discussion of the effects of momentum and the number of upcrossings. It is worth noting that there are not additional restrictions imposed on the objective function and stepsize. Another improvement over previous results is that the existing Lipschitz condition of the gradient is relaxed into the condition of Holder continuity. As a byproduct, we apply a localization procedure to extend our results to stochastic stepsizes.
The family of pairwise independently determined (PID) systems, i.e. those for which the independent joining is the only self joining with independent 2-marginals, is a class of systems for which the long standing open question by Rokhlin, of whether mixing implies mixing of all orders, has a positive answer. We show that in the class of weakly mixing PID one finds a positive answer for another long-standing open problem, whether the multiple ergodic averages begin{equation*} frac 1 Nsum_{n=0}^{N-1}f_1(T^nx)cdots f_d(T^{dn}x), quad Nto infty, end{equation*} almost surely converge.
We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight mat rices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can be re-interpreted as a Riemannian gradient flow on the manifold of rank-$r$ matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank $k$ matrices for some $kleq r$.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا