Stochastic Bias-Reduced Gradient Methods


Published by Yair Carmon

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Hilal Asi, Yair Carmon, Arun Jambulapati

Optimization and Control; Data Structures and Algorithms; Machine Learning


We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ of any Lipschitz strongly-convex function. In particular, we use a multilevel Monte-Carlo approach due to Blanchet and Glynn to turn any optimal stochastic gradient method into an estimator of $x_\star$ with bias $\delta$, variance $O(\log(1/\delta))$, and an expected sampling cost of $O(\log(1/\delta))$ stochastic gradient evaluations. As an immediate consequence, we obtain cheap and nearly unbiased gradient estimators for the Moreau-Yoshida envelope of any Lipschitz convex function, allowing us to perform dimension-free randomized smoothing. We demonstrate the potential of our estimator through four applications. First, we develop a method for minimizing the maximum of $N$ functions, improving on recent results and matching a lower bound up to logarithmic factors. Second and third, we recover state-of-the-art rates for projection-efficient and gradient-efficient optimization using simple algorithms with a transparent analysis. Finally, we show that an improved version of our estimator would yield a nearly linear-time, optimal-utility, differentially-private non-smooth stochastic optimization method.
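The multilevel Monte-Carlo construction mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The assumptions are: plain SGD with step size 1/(mu*t) stands in for an optimal stochastic gradient method, level j is the SGD iterate after 2^j steps, all levels share one trajectory, and `j_max`, `sgd_iterates`, `mlmc_estimate`, and `noisy_grad` are hypothetical names. A level J is drawn with probability 2^{-J} and the estimate is x_0 + 2^J (x_J - x_{J-1}). The sum telescopes, so the expectation equals that of the level-`j_max` iterate, while the expected number of gradient calls grows only linearly in `j_max`.

```python
import random


def sgd_iterates(grad, x0, mu, checkpoints, rng):
    """Run SGD with step size 1/(mu*t) and return the iterate at each checkpoint."""
    x, out = x0, {}
    for t in range(1, max(checkpoints) + 1):
        x -= grad(x, rng) / (mu * t)
        if t in checkpoints:
            out[t] = x
    return out


def mlmc_estimate(grad, x0, mu, j_max, rng):
    """One draw of a truncated multilevel Monte-Carlo estimate of the minimizer."""
    # Sample J with P(J = j) = 2^{-j} for j = 1, 2, ...
    j = 1
    while rng.random() < 0.5:
        j += 1
    if j > j_max:
        # Truncation: fall back to the cheapest level (a single SGD step).
        return sgd_iterates(grad, x0, mu, {1}, rng)[1]
    # Level j uses 2^j steps; sharing one trajectory across levels is an assumption.
    pts = sgd_iterates(grad, x0, mu, {1, 2 ** (j - 1), 2 ** j}, rng)
    return pts[1] + 2 ** j * (pts[2 ** j] - pts[2 ** (j - 1)])


def noisy_grad(x, rng):
    """Stochastic gradient of the toy objective 0.5 * (x - 3)^2 (hypothetical example)."""
    return (x - 3.0) + rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [mlmc_estimate(noisy_grad, 0.0, 1.0, 6, rng) for _ in range(20000)]
    print(sum(draws) / len(draws))  # should land near the minimizer 3.0
```

On this toy quadratic the SGD iterate is a running mean of noisy observations, so the average of many draws is easy to compare against the known minimizer 3.0.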


98 - Kun Huang, Shi Pu 2021

We consider the distributed optimization problem where $n$ agents, each possessing a local cost function, collaboratively minimize the average of the $n$ cost functions over a connected network. Assuming stochastic gradient information is available, we study a distributed stochastic gradient algorithm, called exact diffusion with adaptive stepsizes (EDAS), adapted from the Exact Diffusion method and NIDS, and perform a non-asymptotic convergence analysis. We not only show that EDAS asymptotically achieves the same network-independent convergence rate as centralized stochastic gradient descent (SGD) for minimizing strongly convex and smooth objective functions, but also characterize the transient time needed for the algorithm to approach the asymptotic convergence rate, which behaves as $K_T=\mathcal{O}\left(\frac{n}{1-\lambda_2}\right)$, where $1-\lambda_2$ stands for the spectral gap of the mixing matrix. To the best of our knowledge, EDAS achieves the shortest transient time when the average of the $n$ cost functions is strongly convex and each cost function is smooth. Numerical simulations further corroborate and strengthen the obtained theoretical results.
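To make the setting concrete, here is a minimal Python sketch of an exact-diffusion-style recursion on a ring network. It is not EDAS itself. The assumptions are: a constant step size, exact (noise-free) gradients, scalar quadratic local costs 0.5 * (x - b_i)^2, and the commonly stated adapt, correct, combine form of exact diffusion with the lazy mixing matrix (I + W) / 2. The names `ring_weights` and `exact_diffusion` are hypothetical.

```python
def ring_weights(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 agents."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] = 1.0 / 3.0
    return w


def exact_diffusion(targets, weights, step, n_iters):
    """Each agent i holds the local cost 0.5 * (x - targets[i])^2."""
    n = len(targets)
    # Lazy mixing matrix (I + W) / 2.
    w_bar = [[(weights[i][j] + (1.0 if i == j else 0.0)) / 2.0 for j in range(n)]
             for i in range(n)]
    x = [0.0] * n
    psi_prev = list(x)  # initialize psi^0 = x^0
    for _ in range(n_iters):
        # Adapt: local gradient step.
        psi = [x[i] - step * (x[i] - targets[i]) for i in range(n)]
        # Correct: remove the bias that plain diffusion would keep.
        phi = [psi[i] + x[i] - psi_prev[i] for i in range(n)]
        # Combine: average with neighbors.
        x = [sum(w_bar[i][j] * phi[j] for j in range(n)) for i in range(n)]
        psi_prev = psi
    return x


if __name__ == "__main__":
    # All agents should approach the minimizer of the average cost, mean(targets) = 4.
    print(exact_diffusion([1.0, 2.0, 3.0, 4.0, 10.0], ring_weights(5), 0.1, 500))
```

With exact gradients and a constant step, every agent in this toy problem reaches the minimizer of the average cost, which is what distinguishes the corrected recursion from plain diffusion. The stochastic, adaptive-stepsize analysis in the abstract is not reproduced here.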

Optimization and Control; Distributed, Parallel, and Cluster Computing; Machine Learning

Unifying Width-Reduced Methods for Quasi-Self-Concordant Optimization

96 - Deeksha Adil, Brian Bullins, Sushant Sachdeva 2021

We provide several algorithms for constrained optimization of a large class of convex problems, including softmax, $\ell_p$ regression, and logistic regression. Central to our approach is the notion of width reduction, a technique which has proven immensely useful in the context of maximum flow [Christiano et al., STOC'11] and, more recently, $\ell_p$ regression [Adil et al., SODA'19], in terms of improving the iteration complexity from $O(m^{1/2})$ to $\tilde{O}(m^{1/3})$, where $m$ is the number of rows of the design matrix, and where each iteration amounts to a linear system solve. However, a considerable drawback is that these methods require both problem-specific potentials and individually tailored analyses. As our main contribution, we initiate a new direction of study by presenting the first unified approach to achieving $m^{1/3}$-type rates. Notably, our method goes beyond these previously considered problems to more broadly capture quasi-self-concordant losses, a class which has recently generated much interest and includes the well-studied problem of logistic regression, among others. In order to do so, we develop a unified width reduction method for carefully handling these losses based on a more general set of potentials. Additionally, we directly achieve $m^{1/3}$-type rates in the constrained setting without the need for any explicit acceleration schemes, thus naturally complementing recent work based on a ball-oracle approach [Carmon et al., NeurIPS'20].

Optimization and Control; Data Structures and Algorithms

Randomized Stochastic Variance-Reduced Methods for Multi-Task Stochastic Bilevel Optimization

386 - Zhishuai Guo, Quanqi Hu, Lijun Zhang 2021

In this paper, we consider non-convex stochastic bilevel optimization (SBO) problems that have many applications in machine learning. Although numerous studies have proposed stochastic algorithms for solving these problems, they are limited in two respects: (i) their sample complexities are high and do not match the state-of-the-art result for non-convex stochastic optimization; (ii) their algorithms are tailored to problems with only one lower-level problem. When there are many lower-level problems, it could be prohibitive to process all these lower-level problems at each iteration. To address these limitations, this paper proposes fast randomized stochastic algorithms for non-convex SBO problems. First, we present a stochastic method for non-convex SBO with only one lower problem and establish its sample complexity of $O(1/\epsilon^3)$ for finding an $\epsilon$-stationary point under Lipschitz continuous conditions of stochastic oracles, matching the lower bound for stochastic smooth non-convex optimization. Second, we present a randomized stochastic method for non-convex SBO with $m>1$ lower-level problems (multi-task SBO) by processing a constant number of lower problems at each iteration, and establish its sample complexity no worse than $O(m/\epsilon^3)$, which could be a better complexity than that of simply processing all $m$ lower problems at each iteration. Lastly, we establish even faster convergence results for gradient-dominant functions. To the best of our knowledge, this is the first work considering multi-task SBO and developing state-of-the-art sample complexity results.

Optimization and Control; Machine Learning

Improved Sample Complexity for Stochastic Compositional Variance Reduced Gradient

249 - Tianyi Lin, Chenyou Fan, Mengdi Wang 2018

Convex composition optimization is an emerging topic that covers a wide range of applications arising from stochastic optimal control, reinforcement learning and multi-stage stochastic programming. Existing algorithms suffer from unsatisfactory sample complexity and practical issues since they ignore the convexity structure in the algorithmic design. In this paper, we develop a new stochastic compositional variance-reduced gradient algorithm with the sample complexity of $O((m+n)\log(1/\epsilon)+1/\epsilon^3)$ where $m+n$ is the total number of samples. Our algorithm is near-optimal as the dependence on $m+n$ is optimal up to a logarithmic factor. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the new algorithm.

Optimization and Control; Machine Learning

Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization

166 - Yangyang Xu 2020

Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations of the improved SGMs. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm can achieve the optimal complexity result $O(\varepsilon^{-3})$ to produce a stochastic $\varepsilon$-stationary solution, if a mean-squared smoothness condition holds and $\Theta(\varepsilon^{-1})$ samples are available for the initial update. Different from existing optimal methods, PStorm can still achieve a near-optimal complexity result $\tilde{O}(\varepsilon^{-3})$ by using only one or $O(1)$ samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or $O(1)$ new observations. In addition, for large-scale machine learning problems, PStorm can generalize better by small-batch training than other optimal methods that require large-batch training and the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network and a sparse convolutional neural network.
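The momentum-based variance reduction idea can be sketched in a few lines of Python. This is a simplified illustration in the spirit of a STORM-type gradient estimator combined with a proximal step, not the PStorm algorithm as specified in the paper. The constant step size `eta`, the constant momentum parameter `beta`, the single-sample initialization, and the names `prox_storm` and `soft_threshold` are all assumptions. The key point is that the same sample is used to evaluate the gradient at both the current and the previous iterate, so the correction term cancels much of the noise.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |x| for a scalar x."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_storm(grad, sample, prox, x0, eta, beta, n_iters):
    """Proximal steps driven by a momentum-based variance-reduced gradient estimate.

    grad(x, xi) is a stochastic gradient of the smooth part at x for sample xi,
    sample() draws a fresh sample, and prox(v, eta) is the proximal map of the
    nonsmooth part scaled by eta.
    """
    x_prev = x0
    d = grad(x0, sample())  # initial estimate from one sample (an assumption)
    x = prox(x0 - eta * d, eta)
    for _ in range(n_iters - 1):
        xi = sample()
        # Same sample xi at both points: fresh gradient plus a damped correction.
        d = grad(x, xi) + (1.0 - beta) * (d - grad(x_prev, xi))
        x_prev, x = x, prox(x - eta * d, eta)
    return x


if __name__ == "__main__":
    # Toy problem: minimize 0.5 * (x - 3)^2 + |x|, whose minimizer is x = 2.
    # The sample is fixed at zero here so the run is noise-free and easy to check.
    x_hat = prox_storm(
        grad=lambda x, xi: (x - 3.0) + xi,
        sample=lambda: 0.0,
        prox=lambda v, eta: soft_threshold(v, eta * 1.0),
        x0=0.0, eta=0.5, beta=0.1, n_iters=60,
    )
    print(x_hat)
```

In the noise-free case the estimate `d` equals the exact gradient at every step, so the loop reduces to ordinary proximal gradient descent. Replacing `sample` with a random draw gives the stochastic variant.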

Optimization and Control; Machine Learning; Numerical Analysis