Generalization performance of stochastic optimization occupies a central place in learning theory. In this paper, we investigate the excess risk of, and work towards improved learning rates for, two popular approaches to stochastic optimization: empirical risk minimization (ERM) and stochastic gradient descent (SGD). Although there is a plentiful generalization analysis of ERM and SGD for supervised learning, the current theoretical understanding either relies on strong assumptions in convex learning, e.g., strong convexity, or yields slow rates and remains less studied in nonconvex learning. Motivated by these problems, we aim to provide improved rates under milder assumptions in convex learning and to derive faster rates in nonconvex learning. Notably, our analysis spans two popular theoretical viewpoints: \emph{stability} and \emph{uniform convergence}. Specifically, in the stability regime, we present high-probability learning rates of order $\mathcal{O}(1/n)$ w.r.t. the sample size $n$ for ERM and SGD under milder assumptions in convex learning, and similar high-probability rates of order $\mathcal{O}(1/n)$ in nonconvex learning, rather than rates in expectation. Furthermore, this type of learning rate is improved to the faster order $\mathcal{O}(1/n^2)$ in the uniform convergence regime. To the best of our knowledge, the learning rates presented in this paper for ERM and SGD are all state-of-the-art.
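Since both procedures analyzed here are standard, a minimal contrast between them may help fix ideas; the sketch below uses an illustrative least-squares objective, synthetic data, and a $1/t$ step size, which are assumptions of the example rather than the paper's setting.

```python
import numpy as np

# Minimal sketch contrasting ERM with one-pass SGD (illustrative assumptions:
# least-squares loss, synthetic Gaussian data, 1/t step size).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# ERM: exact minimizer of the empirical least-squares risk.
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-pass SGD: one stochastic gradient step per sample.
w_sgd = np.zeros(d)
for t, i in enumerate(rng.permutation(n), start=1):
    grad = (X[i] @ w_sgd - y[i]) * X[i]   # gradient of 0.5 * (x_i^T w - y_i)^2
    w_sgd -= grad / t                      # decaying step size 1/t

print("ERM empirical risk:", np.mean((X @ w_erm - y) ** 2))
print("SGD empirical risk:", np.mean((X @ w_sgd - y) ** 2))
```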
Algorithm-dependent generalization error bounds are central to statistical learning theory. A learning algorithm may use a large hypothesis space, but the limited number of iterations controls its model capacity and generalization error. The impact of stochastic gradient methods on the generalization error of non-convex learning problems not only has important theoretical consequences, but is also critical to the generalization error of deep learning. In this paper, we study the generalization error of Stochastic Gradient Langevin Dynamics (SGLD) with non-convex objectives. Two theories are developed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian results respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_k}\right)$, where $L$ is the uniform Lipschitz parameter, $\beta$ is the inverse temperature, and $T_k$ is the aggregated step size. For the PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step is shown to carry an exponentially decaying factor by imposing $\ell^2$ regularization, and the uniform Lipschitz constant is also replaced by the actual norms of the gradients along the trajectory. Our bounds have no implicit dependence on dimensions, norms, or other capacity measures of the parameters, which elegantly characterizes the phenomenon of Fast Training Guarantees Generalization in non-convex settings. This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and it has important implications for statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.
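For concreteness, a minimal discrete-time SGLD step is sketched below; the toy objective, the decaying step sizes, and the inverse temperature are illustrative assumptions, chosen only to show where $\beta$ and the aggregated step size $T_k$ enter.

```python
import numpy as np

def sgld_step(theta, grad, eta, beta, rng):
    """One SGLD update: gradient step plus Gaussian noise of scale sqrt(2*eta/beta)."""
    noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * eta / beta)
    return theta - eta * grad(theta) + noise

rng = np.random.default_rng(0)
grad_f = lambda th: -np.sin(th) + th       # gradient of a toy non-convex objective
theta = rng.normal(size=10)
beta = 100.0                               # inverse temperature
T_k = 0.0                                  # aggregated step sizes, as in the bound
for k in range(1, 1001):
    eta = 0.1 / k                          # decaying step size
    theta = sgld_step(theta, grad_f, eta, beta, rng)
    T_k += eta
```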
Gradient clipping is commonly used in training deep neural networks, partly due to its practicability in relieving the exploding gradient problem. Recently, \citet{zhang2019gradient} showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD by introducing a new assumption called $(L_0, L_1)$-smoothness, which characterizes the violent fluctuations of gradients typically encountered in deep neural networks. However, their iteration complexities depend rather pessimistically on the problem-dependent parameters, and theoretical justification of clipping combined with other crucial techniques, e.g., momentum acceleration, is still lacking. In this paper, we bridge the gap by presenting a general framework for studying clipping algorithms, which also takes momentum methods into consideration. We provide convergence analysis of the framework in both deterministic and stochastic settings, and demonstrate the tightness of our results by comparing them with existing lower bounds. Our results imply that the efficiency of clipping methods does not degenerate even in highly non-smooth regions of the landscape. Experiments confirm the superiority of clipping-based methods in deep learning tasks.
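A minimal sketch of one clipped momentum step in the spirit of such a framework is given below; the heavy-ball style buffer, the clipping threshold, and the choice of clipping the update rather than the raw gradient are assumptions of this illustration, not necessarily the exact scheme analyzed in the paper.

```python
import numpy as np

def clipped_momentum_step(theta, m, grad, lr, mu, gamma):
    """Momentum update whose resulting step is norm-clipped to threshold gamma."""
    m = mu * m + (1.0 - mu) * grad          # exponential-average momentum buffer
    step = lr * m
    norm = np.linalg.norm(step)
    if norm > gamma:                        # clip the update if it is too large
        step *= gamma / norm
    return theta - step, m

# Usage sketch: theta, m = clipped_momentum_step(theta, m, g, lr=0.1, mu=0.9, gamma=1.0)
```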
We study stochastic convex optimization with heavy-tailed data under the constraint of differential privacy. Most prior work on this problem is restricted to the case where the loss function is Lipschitz. Instead, as introduced by Wang, Xiao, Devadas, and Xu, we study general convex loss functions with the assumption that the distribution of gradients has bounded $k$-th moments. We provide improved upper bounds on the excess population risk under approximate differential privacy of $\tilde{O}\left(\sqrt{\frac{d}{n}}+\left(\frac{d}{\epsilon n}\right)^{\frac{k-1}{k}}\right)$ and $\tilde{O}\left(\frac{d}{n}+\left(\frac{d}{\epsilon n}\right)^{\frac{2k-2}{k}}\right)$ for convex and strongly convex loss functions, respectively. We also prove nearly-matching lower bounds under the constraint of pure differential privacy, giving strong evidence that our bounds are tight.
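As a generic illustration only (the estimator for heavy-tailed gradients analyzed in the paper is more refined), the standard template for a private gradient step, per-sample clipping followed by Gaussian noise, looks as follows; the clipping threshold and noise multiplier are placeholders, not the calibration behind the bounds above.

```python
import numpy as np

def private_gradient_step(theta, per_sample_grads, lr, clip, sigma, rng):
    """Clip each per-sample gradient to norm `clip`, average, then add Gaussian noise."""
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12)) for g in per_sample_grads]
    noisy_avg = np.mean(clipped, axis=0) + rng.normal(
        scale=sigma * clip / len(per_sample_grads), size=theta.shape
    )
    return theta - lr * noisy_avg
```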
Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, which involve second derivatives and/or second-order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad) that, along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
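For reference, here is a minimal, non-scalable sketch of the full-matrix Adagrad preconditioning that such an implementation approximates and distributes; the eigendecomposition-based inverse square root and the damping term are illustrative and only feasible for small parameter dimensions.

```python
import numpy as np

def full_matrix_adagrad_step(theta, grad, G, lr, eps=1e-6):
    """Accumulate the gradient outer-product statistic G and precondition by G^{-1/2}."""
    G = G + np.outer(grad, grad)                       # second-order gradient statistic
    vals, vecs = np.linalg.eigh(G + eps * np.eye(len(theta)))
    precond = vecs @ np.diag(vals ** -0.5) @ vecs.T    # G^{-1/2} via eigendecomposition
    return theta - lr * precond @ grad, G
```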
We study constrained nonconvex optimization problems in machine learning, signal processing, and stochastic control. It is well known that these problems can be rewritten as a minimax problem in Lagrangian form. However, due to the lack of convexity, their landscape is not well understood, and how to find the stable equilibria of the Lagrangian function is still unknown. To bridge the gap, we study the landscape of the Lagrangian function. Further, we define a special class of Lagrangian functions that enjoy two properties: (1) equilibria are either stable or unstable (formal definition in Section 2); (2) stable equilibria correspond to the global optima of the original problem. We show that a generalized eigenvalue (GEV) problem, which includes canonical correlation analysis and other problems, belongs to this class. Specifically, we characterize its stable and unstable equilibria by leveraging an invariant group and a symmetry property (more details in Section 3). Motivated by these neat geometric structures, we propose a simple, efficient, stochastic primal-dual algorithm for solving the online GEV problem. Theoretically, we provide sufficient conditions based on which we establish an asymptotic convergence rate and obtain the first sample complexity result for the online GEV problem via diffusion approximations, which are widely used in applied probability and stochastic control. Numerical results are provided to support our theory.
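For a concrete instance, the standard single-vector GEV formulation and its Lagrangian are written out below; this is the textbook formulation, and the paper's online, possibly multi-dimensional setting in Section 3 may differ in details.

```latex
% Single-vector generalized eigenvalue problem and its Lagrangian (illustrative).
\[
\max_{x \in \mathbb{R}^d} \; x^{\top} A x
\quad \text{s.t.} \quad x^{\top} B x = 1, \qquad B \succ 0,
\qquad
\mathcal{L}(x,\lambda) \;=\; x^{\top} A x - \lambda\,\bigl(x^{\top} B x - 1\bigr).
\]
```

Setting $\nabla_x \mathcal{L} = 0$ gives $A x = \lambda B x$, so equilibria of the Lagrangian are exactly the generalized eigenpairs; a stochastic primal-dual method can then alternate updates on $x$ and $\lambda$ using sampled estimates of $A$ and $B$.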