Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions

115 0 0.0 ( 0 )

Download Cite

Added by Han Liu

Publication date 2014

fields Mathematical Statistics

and research's language is English

Authors Mengdi Wang - Ethan X. Fang - Han Liu

Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function involving expected values or a composition of two expected-value functions, i.e., problems of the form $min_x mathbf{E}_v [f_vbig(mathbf{E}_w [g_w(x)]big)]$. In order to solve this stochastic composition problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochast

rate research

Accelerated Stochastic Gradient Descent for Minimizing Finite Sums

286 - Atsushi Nitanda 2015

We propose an optimization method for minimizing the finite sums of smooth convex functions. Our method incorporates an accelerated gradient descent (AGD) and a stochastic variance reduction gradient (SVRG) in a mini-batch setting. Unlike SVRG, our method can be directly applied to non-strongly and strongly convex problems. We show that our method achieves a lower overall complexity than the recently proposed methods that supports non-strongly convex problems. Moreover, this method has a fast rate of convergence for strongly convex problems. Our experiments show the effectiveness of our method.

Machine Learning Machine Learning

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

221 - Atsushi Nitanda , Taiji Suzuki 2018

We consider stochastic gradient descent and its averaging variant for binary classification problems in a reproducing kernel Hilbert space. In the traditional analysis using a consistency property of loss functions, it is known that the expected classification error converges more slowly than the expected risk even when assuming a low-noise condition on the conditional label probabilities. Consequently, the resulting rate is sublinear. Therefore, it is important to consider whether much faster convergence of the expected classification error can be achieved. In recent research, an exponential convergence rate for stochastic gradient descent was shown under a strong low-noise condition but provided theoretical analysis was limited to the squared loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show an exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions. As for the averaged stochastic gradient descent, we show that the same convergence rate holds from the early phase of training. In experiments, we verify our analyses on the $L_2$-regularized logistic regression.

Machine Learning Machine Learning Optimization and Control

Stochastic Particle Gradient Descent for Infinite Ensembles

101 - Atsushi Nitanda , Taiji Suzuki 2017

The superior performance of ensemble methods with infinite models are well known. Most of these methods are based on optimization problems in infinite-dimensional spaces with some regularization, for instance, boosting methods and convex neural networks use $L^1$-regularization with the non-negative constraint. However, due to the difficulty of handling $L^1$-regularization, these problems require early stopping or a rough approximation to solve it inexactly. In this paper, we propose a new ensemble learning method that performs in a space of probability measures, that is, our method can handle the $L^1$-constraint and the non-negative constraint in a rigorous way. Such an optimization is realized by proposing a general purpose stochastic optimization method for learning probability measures via parameterization using transport maps on base models. As a result of running the method, a transport map to output an infinite ensemble is obtained, which forms a residual-type network. From the perspective of functional gradient methods, we give a convergence rate as fast as that of a stochastic optimization method for finite dimensional nonconvex problems. Moreover, we show an interior optimality property of a local optimality condition used in our analysis.

Machine Learning Machine Learning Optimization and Control

Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime

222 - Atsushi Nitanda , Taiji Suzuki 2020

We analyze the convergence of the averaged stochastic gradient descent for overparameterized two-layer neural networks for regression problems. It was recently found that a neural tangent kernel (NTK) plays an important role in showing the global convergence of gradient-based methods under the NTK regime, where the learning dynamics for overparameterized neural networks can be almost characterized by that for the associated reproducing kernel Hilbert space (RKHS). However, there is still room for a convergence rate analysis in the NTK regime. In this study, we show that the averaged stochastic gradient descent can achieve the minimax optimal convergence rate, with the global convergence guarantee, by exploiting the complexities of the target function and the RKHS associated with the NTK. Moreover, we show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate through a smooth approximation of a ReLU network under certain conditions.

Machine Learning Machine Learning

Learning to Learn without Gradient Descent by Gradient Descent

276 - Yutian Chen , Matthew W. Hoffman , Sergio Gomez Colmenarejo 2016

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade-off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.

Machine Learning Machine Learning