No Arabic abstract
The stochastic Frank-Wolfe method has recently attracted much general interest in the context of optimization for statistical and machine learning due to its ability to work with a more general feasible region. However, there has been a complexity gap in the guaranteed convergence rate for stochastic Frank-Wolfe compared to its deterministic counterpart. In this work, we present a new generalized stochastic Frank-Wolfe method which closes this gap for the class of structured optimization problems encountered in statistical and machine learning characterized by empirical loss minimization with a certain type of ``linear prediction property (formally defined in the paper), which is typically present loss minimization problems in practice. Our method also introduces the notion of a ``substitute gradient that is a not-necessarily-unbiased sample of the gradient. We show that our new method is equivalent to a particular randomized coordinate mirror descent algorithm applied to the dual problem, which in turn provides a new interpretation of randomized dual coordinate descent in the primal space. Also, in the special case of a strongly convex regularizer our generalized stochastic Frank-Wolfe method (as well as the randomized dual coordinate descent method) exhibits linear convergence. Furthermore, we present computational experiments that indicate that our method outperforms other stochastic Frank-Wolfe methods consistent with the theory developed herein.
We study constrained stochastic programs where the decision vector at each time slot cannot be chosen freely but is tied to the realization of an underlying random state vector. The goal is to minimize a general objective function subject to linear constraints. A typical scenario where such programs appear is opportunistic scheduling over a network of time-varying channels, where the random state vector is the channel state observed, and the control vector is the transmission decision which depends on the current channel state. We consider a primal-dual type Frank-Wolfe algorithm that has a low complexity update during each slot and that learns to make efficient decisions without prior knowledge of the probability distribution of the random state vector. We establish convergence time guarantees for the case of both convex and non-convex objective functions. We also emphasize application of the algorithm to non-convex opportunistic scheduling and distributed non-convex stochastic optimization over a connected graph.
The Frank-Wolfe method and its extensions are well-suited for delivering solutions with desirable structural properties, such as sparsity or low-rank structure. We introduce a new variant of the Frank-Wolfe method that combines Frank-Wolfe steps and steepest descent steps, as well as a novel modification of the Frank-Wolfe gap to measure convergence in the non-convex case. We further extend this method to incorporate in-face directions for preserving structured solutions as well as block coordinate steps, and we demonstrate computational guarantees in terms of the modified Frank-Wolfe gap for all of these variants. We are particularly motivated by the application of this methodology to the training of neural networks with sparse properties, and we apply our block coordinate method to the problem of $ell_1$ regularized neural network training. We present the results of several numerical experiments on both artificial and real datasets demonstrating significant improvements of our method in training sparse neural networks.
We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient) algorithm for constrained smooth finite-sum minimization with a generalized linear prediction/structure. This class of problems includes empirical risk minimization with sparse, low-rank, or other structured constraints. The proposed method is simple to implement, does not require step-size tuning, and has a constant per-iteration cost that is independent of the dataset size. Furthermore, as a byproduct of the method we obtain a stochastic estimator of the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the setting, the proposed method matches or improves on the best computational guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several datasets highlight different regimes in which the proposed method exhibits a faster empirical convergence than related methods. Finally, we provide an implementation of all considered methods in an open-source package.
In this paper we propose several adaptive gradient methods for stochastic optimization. Unlike AdaGrad-type of methods, our algorithms are based on Armijo-type line search and they simultaneously adapt to the unknown Lipschitz constant of the gradient and variance of the stochastic approximation for the gradient. We consider an accelerated and non-accelerated gradient descent for convex problems and gradient descent for non-convex problems. In the experiments we demonstrate superiority of our methods to existing adaptive methods, e.g. AdaGrad and Adam.
Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure. As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and non-smooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning (MAML), and instrumental variable regression (IV).