
Mean-field Langevin System, Optimal Control and Deep Neural Networks

Published by: Zhenjie Ren
Publication date: 2019
Research field: Informatics engineering
Paper language: English





In this paper, we study a regularised relaxed optimal control problem and, in particular, we are concerned with the case where the control variable is of large dimension. We introduce a system of mean-field Langevin equations, the invariant measure of which is shown to be the optimal control of the initial problem under mild conditions. Therefore, this system of processes can be viewed as a continuous-time numerical algorithm for computing the optimal control. As an application, this result endorses the solvability of the stochastic gradient descent algorithm for a wide class of deep neural networks.
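
The abstract above casts noisy gradient training as a mean-field Langevin system whose invariant measure is the optimal relaxed control. The following is only a hedged, minimal particle sketch of that idea: each particle is one hidden unit of a one-hidden-layer network, the drift of each particle depends on all particles through the mean-field prediction, and the added Gaussian noise plays the role of the entropic regularisation. The toy regression task, network width, step size and noise level are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (illustrative assumption): learn y = sin(x).
Xd = rng.uniform(-3.0, 3.0, size=200)
yd = np.sin(Xd)

N = 256                             # number of particles (hidden units)
theta = rng.normal(size=(N, 2))     # particle i carries (w_i, a_i)
sigma = 0.05                        # noise level ~ entropic regularisation
lam = 1e-3                          # L2 penalty on the control
lr = 0.05                           # Euler-Maruyama step size

for step in range(2000):
    w, a = theta[:, 0], theta[:, 1]
    h = np.tanh(np.outer(Xd, w))                  # hidden activations, (n, N)
    resid = h @ a / N - yd                        # mean-field prediction error
    # Per-particle gradients of the regularised loss (rescaled by N,
    # i.e. derivatives with respect to the empirical measure).
    grad_a = resid @ h / len(Xd) + lam * a
    grad_w = (resid * Xd) @ ((1.0 - h**2) * a) / len(Xd) + lam * w
    grad = np.stack([grad_w, grad_a], axis=1)
    # One Langevin step: gradient drift plus Gaussian noise.
    theta = theta - lr * grad + sigma * np.sqrt(2.0 * lr) * rng.normal(size=theta.shape)

print("final training MSE:",
      np.mean((np.tanh(np.outer(Xd, theta[:, 0])) @ theta[:, 1] / N - yd) ** 2))
```

As the number of particles and steps grows, the empirical measure of the particles is the finite-size analogue of the invariant measure that the paper identifies with the optimal control.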


Read also

We study the long time behavior of an underdamped mean-field Langevin (MFL) equation, and provide a general convergence result as well as an exponential convergence rate under different conditions. The results on the MFL equation can be applied to study the convergence of the Hamiltonian gradient descent algorithm for overparametrized optimization. We then provide a numerical example of using the algorithm to train a generative adversarial network (GAN).
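
As a companion to the abstract above, here is a hedged sketch of an Euler-Maruyama discretisation of underdamped Langevin dynamics on a toy one-dimensional potential; in the paper's setting the drift would instead be the mean-field gradient of the training objective, and the friction, noise and step-size values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(x):
    # Illustrative double-well potential; in the paper's setting this would be
    # the mean-field gradient of the (overparametrized) training objective.
    return x**3 - x

gamma, sigma, dt = 1.0, 0.5, 1e-2      # friction, noise level, time step
x = rng.normal(size=1000)              # positions of the particles
v = np.zeros_like(x)                   # momenta

for _ in range(5000):
    # Euler-Maruyama step for the underdamped Langevin dynamics
    #   dX_t = V_t dt,
    #   dV_t = (-grad U(X_t) - gamma V_t) dt + sigma sqrt(2 gamma) dW_t.
    v += (-grad_U(x) - gamma * v) * dt \
         + sigma * np.sqrt(2.0 * gamma * dt) * rng.normal(size=v.shape)
    x += v * dt

print("empirical mean and variance of X:", x.mean(), x.var())
```
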
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs), which can be used to analyze neural network training. In this framework, a DNN is represented by probability measures and functions over its features (that is, the function values of the hidden units over the training data) in the continuous limit, instead of by the neural network parameters as most existing studies have done. This new representation overcomes the degenerate situation in which all the hidden units of a middle layer essentially reduce to a single meaningful hidden unit, and it further leads to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via suitable re-parameterization. Moreover, we construct a non-linear dynamics called the neural feature flow, which captures the evolution of an over-parameterized DNN trained by Gradient Descent. We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures. Furthermore, we show that, for Res-Net, when the neural feature flow process converges, it reaches a global minimal solution under suitable conditions. Our analysis leads to the first global convergence proof for over-parameterized neural network training with more than $3$ layers in the mean-field regime.
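
To make the notion of "features" concrete, the short sketch below trains a small over-parameterised two-hidden-layer network with plain gradient descent and records the hidden-unit values on the training data at a few steps; these feature matrices are the finite-width analogue of the objects whose evolution the neural feature flow describes. The widths, learning rate and regression task are assumptions for illustration only, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, m = 64, 5, 512                      # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])

W1 = rng.normal(size=(d, m)) / np.sqrt(d)
W2 = rng.normal(size=(m, m)) / np.sqrt(m)
w3 = rng.normal(size=m) / np.sqrt(m)
lr = 0.5

feature_snapshots = []                    # (H1, H2) pairs recorded over training
for step in range(200):
    H1 = np.tanh(X @ W1)                  # first-layer features, shape (n, m)
    H2 = np.tanh(H1 @ W2)                 # second-layer features, shape (n, m)
    pred = H2 @ w3 / m
    resid = pred - y
    # Plain backpropagation for the squared loss.
    g_pred = resid / n
    g_w3 = H2.T @ g_pred / m
    g_H2 = np.outer(g_pred, w3 / m)
    g_W2 = H1.T @ (g_H2 * (1 - H2**2))
    g_H1 = (g_H2 * (1 - H2**2)) @ W2.T
    g_W1 = X.T @ (g_H1 * (1 - H1**2))
    W1 -= lr * g_W1
    W2 -= lr * g_W2
    w3 -= lr * g_w3
    if step % 50 == 0:
        feature_snapshots.append((H1.copy(), H2.copy()))
```
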
In this paper, we propose a new first-order gradient-based algorithm to train deep neural networks. We first introduce the sign operation of stochastic gradients (as in sign-based methods, e.g., SIGN-SGD) into ADAM, yielding an algorithm we call signADAM. Moreover, in order to make the rate of fitting each feature closer, we define a confidence function to distinguish different components of gradients and apply it to our algorithm. It generates sparser gradients than existing algorithms do. We call this new algorithm signADAM++. Both our algorithms are easy to implement and can speed up the training of various deep neural networks. The motivation of signADAM++ is to preferentially learn features from the most distinct samples by updating with large, useful gradient components while discarding the useless information in stochastic gradients. We also establish theoretical convergence guarantees for our algorithms. Empirical results on various datasets and models show that our algorithms yield much better performance than many state-of-the-art algorithms, including SIGN-SGD, SIGNUM and ADAM. We also analyze the performance from multiple perspectives, including the loss landscape, and develop an adaptive method to further improve generalization. The source code is available at https://github.com/DongWanginxdu/signADAM-Learn-by-Confidence.
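
The abstract gives the ingredients of signADAM but not its exact update rule, so the following is only a hedged sketch: an Adam-style step applied to the sign of the stochastic gradient, with a placeholder confidence gate standing in for the confidence function of signADAM++ (the real function is defined in the paper, not here).

```python
import numpy as np

def signadam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                  conf_threshold=None):
    """One Adam-style update applied to the sign of the stochastic gradient.

    conf_threshold is a hypothetical stand-in for the confidence function of
    signADAM++: components whose magnitude falls below it are zeroed out,
    which sparsifies the update.
    """
    g = np.sign(grad)
    if conf_threshold is not None:
        g = np.where(np.abs(grad) >= conf_threshold, g, 0.0)  # illustrative gate
    m = b1 * m + (1 - b1) * g              # first-moment estimate of the sign
    v = b2 * v + (1 - b2) * g**2           # second-moment estimate
    m_hat = m / (1 - b1**t)                # bias correction, as in ADAM
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Illustrative usage on a random parameter vector.
p, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
for t in range(1, 101):
    grad = np.random.randn(10)             # stand-in for a stochastic gradient
    p, m, v = signadam_step(p, grad, m, v, t, conf_threshold=0.5)
```
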
Mean-field spin glasses are families of random energy functions (Hamiltonians) on high-dimensional product spaces. In this paper we consider the case of Ising mixed $p$-spin models, namely Hamiltonians $H_N:\Sigma_N\to \mathbb{R}$ on the Hamming hypercube $\Sigma_N = \{\pm 1\}^N$, which are defined by the property that $\{H_N(\boldsymbol{\sigma})\}_{\boldsymbol{\sigma}\in \Sigma_N}$ is a centered Gaussian process with covariance $\mathbb{E}\{H_N(\boldsymbol{\sigma}_1)H_N(\boldsymbol{\sigma}_2)\}$ depending only on the scalar product $\langle \boldsymbol{\sigma}_1,\boldsymbol{\sigma}_2\rangle$. The asymptotic value of the optimum $\max_{\boldsymbol{\sigma}\in \Sigma_N}H_N(\boldsymbol{\sigma})$ was characterized in terms of a variational principle known as the Parisi formula, first proved by Talagrand and, in a more general setting, by Panchenko. The structure of superlevel sets is extremely rich and has been studied by a number of authors. Here we ask whether a near-optimal configuration $\boldsymbol{\sigma}$ can be computed in polynomial time. We develop a message passing algorithm whose per-iteration complexity is of the same order as the complexity of evaluating the gradient of $H_N$, and characterize the typical energy value it achieves. When the $p$-spin model $H_N$ satisfies a certain no-overlap-gap assumption, for any $\varepsilon>0$, the algorithm outputs $\boldsymbol{\sigma}\in\Sigma_N$ such that $H_N(\boldsymbol{\sigma})\ge (1-\varepsilon)\max_{\boldsymbol{\sigma}} H_N(\boldsymbol{\sigma})$, with high probability. The number of iterations is bounded in $N$ and depends only on $\varepsilon$. More generally, regardless of whether the no-overlap-gap assumption holds, the energy achieved is given by an extended variational principle, which generalizes the Parisi formula.
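
For readers unfamiliar with the objects involved, the snippet below builds the simplest mixed $p$-spin example (a pure 2-spin Gaussian Hamiltonian) and runs a naive greedy spin-flip search. This is explicitly not the paper's message-passing algorithm; it only illustrates the Hamiltonian $H_N$ and what optimizing over the hypercube means. The system size and number of sweeps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200

# Gaussian couplings of a pure 2-spin model, the simplest mixed p-spin case;
# higher-order mixtures would add independent Gaussian tensors of order p.
G = rng.normal(size=(N, N))

def H(s):
    # H_N(sigma) = sigma^T G sigma / sqrt(N): a centered Gaussian process whose
    # covariance depends only on the overlap <sigma_1, sigma_2>.
    return s @ G @ s / np.sqrt(N)

# Naive greedy single-spin-flip search (NOT the paper's message-passing
# algorithm): flip a coordinate whenever doing so increases the energy.
s = np.sign(rng.normal(size=N))
best = H(s)
for sweep in range(10):
    for i in range(N):
        s[i] = -s[i]
        e = H(s)
        if e > best:
            best = e
        else:
            s[i] = -s[i]          # revert the flip
print("greedy energy per spin:", best / N)
```
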
In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model. Although distributed learning techniques have been investigated extensively in deep learning, they are not directly applicable to stochastic AUC maximization with deep neural networks due to its striking differences from standard loss minimization problems (e.g., cross-entropy). To address this challenge, we propose and analyze a communication-efficient distributed optimization algorithm based on a non-convex concave reformulation of the AUC maximization, in which the communication of both the primal variable and the dual variable between each worker and the parameter server occurs only after multiple steps of gradient-based updates at each worker. Compared with the naive parallel version of an existing algorithm that computes stochastic gradients at individual machines and averages them for updating the model parameters, our algorithm requires far fewer communication rounds and still achieves a linear speedup in theory. To the best of our knowledge, this is the first work that solves the non-convex concave min-max problem for AUC maximization with deep neural networks in a communication-efficient distributed manner while still maintaining the linear speedup property in theory. Our experiments on several benchmark datasets show the effectiveness of our algorithm and also confirm our theory.
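
A hedged sketch of the communication pattern described above: each worker runs several local gradient-descent-ascent steps on its own saddle-point objective before the primal and dual variables are averaged at the server. The toy quadratic saddle stands in for the non-convex concave AUC reformulation, and all sizes and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

workers, local_steps, rounds = 4, 8, 50    # one communication per round
d = 10
lr_w, lr_a = 0.05, 0.05

# Toy local saddle objectives min_w max_a  w^T A_k a + 0.05||w||^2 - 0.05||a||^2,
# one per worker, standing in for the non-convex concave AUC reformulation.
A = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(workers)]

w_global, a_global = np.zeros(d), np.zeros(d)

for r in range(rounds):
    w_locals, a_locals = [], []
    for k in range(workers):
        w, a = w_global.copy(), a_global.copy()
        for _ in range(local_steps):       # several local primal-dual steps
            gw = A[k] @ a + 0.1 * w        # gradient w.r.t. the primal variable
            ga = A[k].T @ w - 0.1 * a      # gradient w.r.t. the dual variable
            w -= lr_w * gw                 # descent on the primal
            a += lr_a * ga                 # ascent on the dual
        w_locals.append(w)
        a_locals.append(a)
    # Communication round: average primal and dual variables at the server.
    w_global = np.mean(w_locals, axis=0)
    a_global = np.mean(a_locals, axis=0)

print("||w|| after training:", np.linalg.norm(w_global))
```
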
