We study the long-time behavior of an underdamped mean-field Langevin (MFL) equation, and establish a general convergence result as well as an exponential convergence rate under different conditions. The results on the MFL equation can be applied to study the convergence of the Hamiltonian gradient descent algorithm for overparametrized optimization. We then provide a numerical example in which the algorithm is used to train a generative adversarial network (GAN).
In this paper, we study a regularised relaxed optimal control problem and, in particular, we are concerned with the case where the control variable is of large dimension. We introduce a system of mean-field Langevin equations, the invariant measure of which is shown to be the optimal control of the initial problem under mild conditions. Therefore, this system of processes can be viewed as a continuous-time numerical algorithm for computing the optimal control. As an application, this result endorses the solvability of the stochastic gradient descent algorithm for a wide class of deep neural networks.
In this paper we introduce and analyse Langevin samplers that consist of perturbations of the standard underdamped Langevin dynamics. The perturbed dynamics is such that its invariant measure is the same as that of the unperturbed dynamics. We show that appropriate choices of the perturbations can lead to samplers that have improved properties, at least in terms of reducing the asymptotic variance. We present a detailed analysis of the new Langevin sampler for Gaussian target distributions. Our theoretical results are supported by numerical experiments with non-Gaussian target measures.
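To make the idea of a measure-preserving perturbation concrete, one family with this property can be written as follows. The notation is an assumption made for illustration (potential $V$, symmetric mass matrix $M$, friction $\Gamma$, antisymmetric matrices $J_1, J_2$, strengths $\mu, \nu$); the abstract does not confirm that this is the exact form analysed.

```latex
\begin{aligned}
dq_t &= M^{-1} p_t\,dt - \mu J_1 \nabla V(q_t)\,dt,\\
dp_t &= -\nabla V(q_t)\,dt - \nu J_2 M^{-1} p_t\,dt - \Gamma M^{-1} p_t\,dt + \sqrt{2\Gamma}\,dW_t,
\end{aligned}
\qquad J_1^{\top} = -J_1,\quad J_2^{\top} = -J_2 .
```

For $\mu = \nu = 0$ this reduces to the standard underdamped Langevin dynamics with invariant density proportional to $\exp(-V(q) - \tfrac12 p^{\top} M^{-1} p)$. The extra drift leaves this density invariant because it is divergence-free ($\mathrm{tr}(JS) = 0$ for antisymmetric $J$ and symmetric $S$) and orthogonal to the gradient of the log-density ($x^{\top} J x = 0$ for antisymmetric $J$).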
The Underdamped Langevin Monte Carlo (ULMC) is a popular Markov chain Monte Carlo sampling method. It requires the computation of the full gradient of the log-density at each iteration, an expensive operation if the dimension of the problem is high. We propose a sampling method called Random Coordinate ULMC (RC-ULMC), which selects a single coordinate at each iteration to be updated and leaves the other coordinates untouched. We investigate the computational complexity of RC-ULMC and compare it with the classical ULMC for strongly log-concave probability distributions. We show that RC-ULMC is always cheaper than the classical ULMC, with a significant cost reduction when the problem is highly skewed and high dimensional. Our complexity bound for RC-ULMC is also tight in terms of dimension dependence.
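The single-coordinate update described above can be sketched as follows. This is a minimal illustration, not the scheme analysed in the paper: it assumes a plain Euler-Maruyama discretization, uniform coordinate selection, and a hypothetical quadratic potential $f(x) = \tfrac12 \sum_k s_k x_k^2$; the names `partial_grad`, `rc_ulmc_step`, and `run_chain` are invented for this example.

```python
import math
import random


def partial_grad(x, i, scales):
    """i-th partial derivative of f(x) = 0.5 * sum_k scales[k] * x[k]**2."""
    return scales[i] * x[i]


def rc_ulmc_step(x, v, scales, h, gamma, rng):
    """Update one uniformly chosen coordinate of (x, v) in place; return its index."""
    i = rng.randrange(len(x))
    g = partial_grad(x, i, scales)  # only one partial derivative is evaluated
    noise = math.sqrt(2.0 * gamma * h) * rng.gauss(0.0, 1.0)
    v[i] += -h * (gamma * v[i] + g) + noise  # Euler-Maruyama step for the velocity
    x[i] += h * v[i]  # position step uses the updated velocity
    return i


def run_chain(scales, n_steps, h=0.05, gamma=2.0, seed=0):
    """Run the sketch from x = (1, ..., 1), v = 0 and return the final state."""
    rng = random.Random(seed)
    d = len(scales)
    x, v = [1.0] * d, [0.0] * d
    for _ in range(n_steps):
        rc_ulmc_step(x, v, scales, h, gamma, rng)
    return x, v
```

Each step costs one partial derivative instead of a full gradient, which is the source of the per-iteration saving; the sketch makes no claim about discretization bias or about the complexity bounds stated in the abstract.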
Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning, where the problem is non-convex and the gradient noise might exhibit heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuous-time} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy tails of such perturbations can result in a bias even when the step-size is small, in the sense that \emph{the optima of the stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we call \emph{fractional} ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
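The link between ULD and SGDm can be seen in a small sketch of an Euler-Maruyama discretization. It assumes the standard ULD with Gaussian noise, $dv_t = -(\gamma v_t + \nabla f(x_t))\,dt + \sqrt{2\gamma/\beta}\,dB_t$, $dx_t = v_t\,dt$, in one dimension; it does not implement the heavy-tailed perturbations or FULD studied in the paper, and the function name `uld_euler` and its parameters are hypothetical.

```python
import math
import random


def uld_euler(grad, x0, eta, gamma, beta, n_steps, seed=0):
    """Euler-Maruyama for dv = -(gamma*v + grad(x)) dt + sqrt(2*gamma/beta) dB, dx = v dt."""
    rng = random.Random(seed)
    x, v = x0, 0.0
    # beta = inf switches the noise off and leaves a deterministic momentum method
    sigma = 0.0 if math.isinf(beta) else math.sqrt(2.0 * gamma * eta / beta)
    for _ in range(n_steps):
        # (1 - gamma*eta) plays the role of the momentum coefficient in SGDm
        v = (1.0 - gamma * eta) * v - eta * grad(x) + sigma * rng.gauss(0.0, 1.0)
        x = x + eta * v
    return x
```

For example, `uld_euler(lambda x: x, 1.0, 0.1, 1.0, math.inf, 500)` runs the noiseless iteration on $f(x) = x^2/2$, where the velocity update has the same form as a momentum step with coefficient $1 - \gamma\eta$.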
Mean-field spin glasses are families of random energy functions (Hamiltonians) on high-dimensional product spaces. In this paper we consider the case of Ising mixed $p$-spin models, namely Hamiltonians $H_N:\Sigma_N\to {\mathbb R}$ on the Hamming hypercube $\Sigma_N = \{\pm 1\}^N$, which are defined by the property that $\{H_N({\boldsymbol \sigma})\}_{{\boldsymbol \sigma}\in \Sigma_N}$ is a centered Gaussian process with covariance ${\mathbb E}\{H_N({\boldsymbol \sigma}_1)H_N({\boldsymbol \sigma}_2)\}$ depending only on the scalar product $\langle {\boldsymbol \sigma}_1,{\boldsymbol \sigma}_2\rangle$. The asymptotic value of the optimum $\max_{{\boldsymbol \sigma}\in \Sigma_N}H_N({\boldsymbol \sigma})$ was characterized in terms of a variational principle known as the Parisi formula, first proved by Talagrand and, in a more general setting, by Panchenko. The structure of superlevel sets is extremely rich and has been studied by a number of authors. Here we ask whether a near-optimal configuration ${\boldsymbol \sigma}$ can be computed in polynomial time. We develop a message passing algorithm whose per-iteration complexity is of the same order as the complexity of evaluating the gradient of $H_N$, and characterize the typical energy value it achieves. When the $p$-spin model $H_N$ satisfies a certain no-overlap gap assumption, for any $\varepsilon>0$, the algorithm outputs ${\boldsymbol \sigma}\in\Sigma_N$ such that $H_N({\boldsymbol \sigma})\ge (1-\varepsilon)\max_{{\boldsymbol \sigma}} H_N({\boldsymbol \sigma})$, with high probability. The number of iterations is bounded in $N$ and depends only on $\varepsilon$. More generally, regardless of whether the no-overlap gap assumption holds, the energy achieved is given by an extended variational principle, which generalizes the Parisi formula.
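As a concrete instance of such a Hamiltonian, the sketch below builds a pure 2-spin model with i.i.d. standard Gaussian couplings, $H_N(\sigma) = N^{-1/2}\sum_{i,j} g_{ij}\sigma_i\sigma_j$, whose covariance is $\langle\sigma_1,\sigma_2\rangle^2/N$ and so depends only on the scalar product. The choice $p = 2$, the normalization, and the function names are assumptions for illustration. The exhaustive search is only a baseline for tiny $N$; it is not the message passing algorithm of the paper.

```python
import itertools
import math
import random


def sample_couplings(n, rng):
    """i.i.d. standard Gaussian couplings g[i][j] for a pure 2-spin model."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def energy(g, sigma):
    """H_N(sigma) = N^{-1/2} * sum_{i,j} g[i][j] * sigma[i] * sigma[j]."""
    n = len(sigma)
    total = sum(g[i][j] * sigma[i] * sigma[j] for i in range(n) for j in range(n))
    return total / math.sqrt(n)


def brute_force_max(g):
    """Exhaustive search over {-1,+1}^N; exponential in N, for tiny N only."""
    n = len(g)
    return max(energy(g, s) for s in itertools.product((-1, 1), repeat=n))
```

The exhaustive search visits $2^N$ configurations, which is the cost that motivates asking for a polynomial-time algorithm in the first place.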