ترغب بنشر مسار تعليمي؟ اضغط هنا

Multi-Level Stochastic Gradient Methods for Nested Composition Optimization

101   0   0.0 ( 0 )
 نشر من قبل Ethan Fang
 تاريخ النشر 2018
  مجال البحث
والبحث باللغة English




اسأل ChatGPT حول البحث

Stochastic gradient methods are scalable for solving large-scale optimization problems that involve empirical expectations of loss functions. Existing results mainly apply to optimization problems where the objectives are one- or two-level expectations. In this paper, we consider the multi-level compositional optimization problem that involves compositions of multi-level component functions and nested expectations over a random path. It finds applications in risk-averse optimization and sequential planning. We propose a class of multi-level stochastic gradient methods that are motivated from the method of multi-timescale stochastic approximation. First we propose a basic $T$-level stochastic compositional gradient algorithm, establish its almost sure convergence and obtain an $n$-iteration error bound $O (n^{-1/2^T})$. Then we develop accelerated multi-level stochastic gradient methods by using an extrapolation-interpolation scheme to take advantage of the smoothness of individual component functions. When all component functions are smooth, we show that the convergence rate improves to $O(n^{-4/(7+T)})$ for general objectives and $O (n^{-4/(3+T)})$ for strongly convex objectives. We also provide almost sure convergence and rate of convergence results for nonconvex problems. The proposed methods and theoretical results are validated using numerical experiments.



قيم البحث

اقرأ أيضاً

In this paper, we consider non-convex stochastic bilevel optimization (SBO) problems that have many applications in machine learning. Although numerous studies have proposed stochastic algorithms for solving these problems, they are limited in two pe rspectives: (i) their sample complexities are high, which do not match the state-of-the-art result for non-convex stochastic optimization; (ii) their algorithms are tailored to problems with only one lower-level problem. When there are many lower-level problems, it could be prohibitive to process all these lower-level problems at each iteration. To address these limitations, this paper proposes fast randomized stochastic algorithms for non-convex SBO problems. First, we present a stochastic method for non-convex SBO with only one lower problem and establish its sample complexity of $O(1/epsilon^3)$ for finding an $epsilon$-stationary point under Lipschitz continuous conditions of stochastic oracles, matching the lower bound for stochastic smooth non-convex optimization. Second, we present a randomized stochastic method for non-convex SBO with $m>1$ lower level problems (multi-task SBO) by processing a constant number of lower problems at each iteration, and establish its sample complexity no worse than $O(m/epsilon^3)$, which could be a better complexity than that of simply processing all $m$ lower problems at each iteration. Lastly, we establish even faster convergence results for gradient-dominant functions. To the best of our knowledge, this is the first work considering multi-task SBO and developing state-of-the-art sample complexity results.
139 - Yan Yan , Yi Xu , Qihang Lin 2020
Epoch gradient descent method (a.k.a. Epoch-GD) proposed by Hazan and Kale (2011) was deemed a breakthrough for stochastic strongly convex minimization, which achieves the optimal convergence rate of $O(1/T)$ with $T$ iterative updates for the {it ob jective gap}. However, its extension to solving stochastic min-max problems with strong convexity and strong concavity still remains open, and it is still unclear whether a fast rate of $O(1/T)$ for the {it duality gap} is achievable for stochastic min-max optimization under strong convexity and strong concavity. Although some recent studies have proposed stochastic algorithms with fast convergence rates for min-max problems, they require additional assumptions about the problem, e.g., smoothness, bi-linear structure, etc. In this paper, we bridge this gap by providing a sharp analysis of epoch-wise stochastic gradient descent ascent method (referred to as Epoch-GDA) for solving strongly convex strongly concave (SCSC) min-max problems, without imposing any additional assumption about smoothness or the functions structure. To the best of our knowledge, our result is the first one that shows Epoch-GDA can achieve the optimal rate of $O(1/T)$ for the duality gap of general SCSC min-max problems. We emphasize that such generalization of Epoch-GD for strongly convex minimization problems to Epoch-GDA for SCSC min-max problems is non-trivial and requires novel technical analysis. Moreover, we notice that the key lemma can also be used for proving the convergence of Epoch-GDA for weakly-convex strongly-concave min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.
We study distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework which allows to choose the stepsize and the momentum parameters of these algorithms in a way to optimize performance by systematically trading off the bias, variance, robustness to gradient noise and dependence to network effects. When gradients do not contain noise, we also prove that distributed accelerated methods can emph{achieve acceleration}, requiring $mathcal{O}(kappa log(1/varepsilon))$ gradient evaluations and $mathcal{O}(kappa log(1/varepsilon))$ communications to converge to the same fixed point with the non-accelerated variant where $kappa$ is the condition number and $varepsilon$ is the target accuracy. To our knowledge, this is the first acceleration result where the iteration complexity scales with the square root of the condition number in the context of emph{primal} distributed inexact first-order methods. For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact $mathcal{O}(-k/sqrt{kappa})$ linear decay in the bias term as well as optimal $mathcal{O}(sigma^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in practical algorithms that are robust to gradient noise and that can outperform existing methods.
Consider the stochastic composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method, which updates based on queries to the sampling oracle using two different timescales. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We further demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.
We consider the nonsmooth convex composition optimization problem where the objective is a composition of two finite-sum functions and analyze stochastic compositional variance reduced gradient (SCVRG) methods for them. SCVRG and its variants have re cently drawn much attention given their edge over stochastic compositional gradient descent (SCGD); but the theoretical analysis exclusively assumes strong convexity of the objective, which excludes several important examples such as Lasso, logistic regression, principle component analysis and deep neural nets. In contrast, we prove non-asymptotic incremental first-order oracle (IFO) complexity of SCVRG or its novel variants for nonsmooth convex composition optimization and show that they are provably faster than SCGD and gradient descent. More specifically, our method achieves the total IFO complexity of $Oleft((m+n)logleft(1/epsilonright)+1/epsilon^3right)$ which improves that of $Oleft(1/epsilon^{3.5}right)$ and $Oleft((m+n)/sqrt{epsilon}right)$ obtained by SCGD and accelerated gradient descent (AGD) respectively. Experimental results confirm that our methods outperform several existing methods, e.g., SCGD and AGD, on sparse mean-variance optimization problem.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا