We introduce a quasi-Newton method with block updates called Block BFGS. We show that this method, performed with inexact Armijo-Wolfe line searches, converges globally and superlinearly under the same convexity assumptions as BFGS. We also show that Block BFGS is globally convergent to a stationary point when applied to non-convex functions with bounded Hessian, and discuss other modifications for non-convex minimization. Numerical experiments comparing Block BFGS, BFGS and gradient descent are presented.
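The Armijo-Wolfe conditions mentioned above can be illustrated with a small check. This is only a sketch of the standard (weak) conditions, not the Block BFGS method itself; the function name `satisfies_armijo_wolfe`, the constants `c1` and `c2`, and the test quadratic are assumptions chosen for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def satisfies_armijo_wolfe(f, grad, x, d, t, c1=1e-4, c2=0.9):
    """Return True if step size t along direction d satisfies both the
    Armijo (sufficient decrease) and weak Wolfe (curvature) conditions.

    c1 and c2 are conventional textbook values (0 < c1 < c2 < 1),
    assumed here and not taken from the abstract.
    """
    x_new = [xi + t * di for xi, di in zip(x, d)]
    slope0 = dot(grad(x), d)  # directional derivative at x
    armijo = f(x_new) <= f(x) + c1 * t * slope0
    curvature = dot(grad(x_new), d) >= c2 * slope0
    return armijo and curvature


# Hypothetical test problem: an ill-conditioned convex quadratic.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [x[0], 10.0 * x[1]]


x = [1.0, 1.0]
d = [-gi for gi in grad(x)]  # steepest-descent direction

accepted = satisfies_armijo_wolfe(f, grad, x, d, 0.1)
too_short = satisfies_armijo_wolfe(f, grad, x, d, 1e-6)
too_long = satisfies_armijo_wolfe(f, grad, x, d, 1.0)
```

A step that is too short fails the curvature condition, and a step that is too long fails the sufficient-decrease condition; an inexact line search only needs to find some step between these extremes.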
In this paper, we present a new stochastic algorithm, namely the stochastic block mirror descent (SBMD) method, for solving large-scale nonsmooth and stochastic optimization problems. The basic idea of this algorithm is to incorporate block-coordinate decomposition and an incremental block averaging scheme into the classic (stochastic) mirror-descent method, in order to significantly reduce the cost per iteration of the latter algorithm. We establish the rate of convergence of the SBMD method along with its associated large-deviation results for solving general nonsmooth and stochastic optimization problems. We also introduce different variants of this method and establish their rates of convergence for solving strongly convex, smooth, and composite optimization problems, as well as certain nonconvex optimization problems. To the best of our knowledge, all these developments related to the SBMD methods are new in the stochastic optimization literature. Moreover, some of our results also seem to be new for block coordinate descent methods for deterministic optimization.
The popular BFGS quasi-Newton minimization algorithm under reasonable conditions converges globally on smooth convex functions. This result was proved by Powell in 1976: we consider its implications for functions that are not smooth. In particular, an analogous convergence result holds for functions, like the Euclidean norm, that are nonsmooth at the minimizer.
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components - progressive batching, a stochastic line search, and stable quasi-Newton updating - and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.
We consider the minimization of an $L_0$-Lipschitz continuous and expectation-valued function, denoted by $f$ and defined as $f(x) \triangleq \mathbb{E}[\tilde{f}(x,\omega)]$, over a Cartesian product of closed and convex sets with a view towards obtaining both asymptotics as well as rate and complexity guarantees for computing an approximate stationary point (in a Clarke sense). We adopt a smoothing-based approach reliant on minimizing $f_{\eta}$ where $f_{\eta}(x) \triangleq \mathbb{E}_{u}[f(x+\eta u)]$, $u$ is a random variable defined on a unit sphere, and $\eta > 0$. In fact, it is observed that a stationary point of the $\eta$-smoothed problem is a $2\eta$-stationary point for the original problem in the Clarke sense. In such a setting, we derive a suitable residual function that provides a metric for stationarity for the smoothed problem. By leveraging a zeroth-order framework reliant on utilizing sampled function evaluations implemented in a block-structured regime, we make two sets of contributions for the sequence generated by the proposed scheme. (i) The residual function of the smoothed problem tends to zero almost surely along the generated sequence; (ii) To compute an $x$ that ensures that the expected norm of the residual of the $\eta$-smoothed problem is within $\epsilon$ requires no greater than $\mathcal{O}\left(\tfrac{1}{\eta \epsilon^2}\right)$ projection steps and $\mathcal{O}\left(\tfrac{1}{\eta^2 \epsilon^4}\right)$ function evaluations. These statements appear to be novel and there appear to be few results to contend with general nonsmooth, nonconvex, and stochastic regimes via zeroth-order approaches.
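The definition of the smoothed function $f_{\eta}$ can be made concrete with a Monte Carlo sketch that uses only function evaluations. This illustrates the smoothing definition alone, not the proposed block-structured scheme; the helper names, the sample count, the fixed seed, and the choice of the Euclidean norm as a 1-Lipschitz nonsmooth test function are all assumptions.

```python
import math
import random


def sample_unit_sphere(n, rng):
    """Draw a point uniformly from the unit sphere in R^n by
    normalizing a standard Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm > 0.0:
            return [gi / norm for gi in g]


def smoothed_value(f, x, eta, num_samples=1000, seed=0):
    """Monte Carlo estimate of f_eta(x) = E_u[f(x + eta * u)],
    with u uniform on the unit sphere (zeroth-order: f values only)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = sample_unit_sphere(len(x), rng)
        total += f([xi + eta * ui for xi, ui in zip(x, u)])
    return total / num_samples


def euclidean_norm(x):
    """A 1-Lipschitz function that is nonsmooth at the origin."""
    return math.sqrt(sum(xi * xi for xi in x))


eta = 0.1
at_kink = smoothed_value(euclidean_norm, [0.0, 0.0, 0.0], eta)
x = [1.0, -2.0, 0.5]
gap = abs(smoothed_value(euclidean_norm, x, eta) - euclidean_norm(x))
```

At the origin every sample evaluates to $\eta$, so the estimate equals $\eta$ up to rounding. Elsewhere, Lipschitz continuity bounds the gap between the smoothed and original values by $L_0 \eta$, which is why a small $\eta$ keeps the smoothed problem close to the original one.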
The method of block coordinate gradient descent (BCD) has been a powerful method for large-scale optimization. This paper considers the BCD method that successively updates a series of blocks selected according to a Markov chain. This kind of block selection is neither i.i.d. random nor cyclic. On the other hand, it is a natural choice for some applications in distributed optimization and Markov decision processes, where i.i.d. random and cyclic selections are either infeasible or very expensive. By applying mixing-time properties of a Markov chain, we prove convergence of Markov chain BCD for minimizing Lipschitz differentiable functions, which can be nonconvex. When the functions are convex or strongly convex, we establish sublinear and linear convergence rates, respectively. We also present a method of Markov chain inertial BCD. Finally, we discuss potential applications.
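The Markov chain block selection described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's algorithm or analysis: blocks are single coordinates, the chain is a lazy random walk on a ring of coordinates, the objective is a hypothetical strongly convex quadratic $\tfrac{1}{2} x^\top Q x - b^\top x$, and the step size is $1/Q_{ii}$.

```python
import math
import random


def grad_coord(Q, b, x, i):
    """i-th partial derivative of 0.5 * x^T Q x - b^T x."""
    return sum(Q[i][j] * x[j] for j in range(len(x))) - b[i]


def markov_chain_bcd(Q, b, x0, num_steps=5000, seed=0):
    """Coordinate gradient descent where the updated coordinate follows
    a lazy random walk on a ring (stay, or move to a neighbor), so the
    selection is neither i.i.d. nor cyclic."""
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    i = 0  # initial state of the chain (assumed)
    for _ in range(num_steps):
        # Gradient step on coordinate i with step size 1 / Q[i][i].
        x[i] -= grad_coord(Q, b, x, i) / Q[i][i]
        # Markov transition: depends only on the current coordinate.
        i = (i + rng.choice((-1, 0, 1))) % n
    return x


# Hypothetical test problem: diagonally dominant tridiagonal Q.
n = 5
Q = [[2.5 if r == c else (-1.0 if abs(r - c) == 1 else 0.0)
      for c in range(n)] for r in range(n)]
b = [1.0, 0.0, -1.0, 2.0, 0.5]

x_final = markov_chain_bcd(Q, b, [0.0] * n)
grad_norm = math.sqrt(sum(grad_coord(Q, b, x_final, i) ** 2
                          for i in range(n)))
```

The walk only ever moves to a neighboring coordinate, which mimics a token passed between adjacent nodes in a distributed setting, where sampling an arbitrary block at each step would require extra communication.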