No Arabic abstract
In this paper, we focus on a theory-practice gap for Adam and its variants (AMSgrad, AdamNC, etc.). In practice, these algorithms are used with a constant first-order moment parameter $beta_{1}$ (typically between $0.9$ and $0.99$). In theory, regret guarantees for online convex optimization require a rapidly decaying $beta_{1}to0$ schedule. We show that this is an artifact of the standard analysis and propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $beta_{1}$, without further assumptions. We also demonstrate the flexibility of our analysis on a wide range of different algorithms and settings.
Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of $O(1/sqrt{n})$ after $n$ iterations, and of $O(1/n)$ for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale $ell_1$-logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems.
We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper-bound which is explicit in the constants of the problem, parameters of the optimizer and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $alpha=1/sqrt{N}$ and a momentum parameter on squared gradients $beta_2=1-1/N$ achieves the same rate of convergence $O(ln(N)/sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy ball momentum among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-beta_1)^{-3})$ to $O((1-beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor $1 - beta_1$.
We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and deep learning problems.
Mixed linear regression (MLR) model is among the most exemplary statistical tools for modeling non-linear distributions using a mixture of linear models. When the additive noise in MLR model is Gaussian, Expectation-Maximization (EM) algorithm is a widely-used algorithm for maximum likelihood estimation of MLR parameters. However, when noise is non-Gaussian, the steps of EM algorithm may not have closed-form update rules, which makes EM algorithm impractical. In this work, we study the maximum likelihood estimation of the parameters of MLR model when the additive noise has non-Gaussian distribution. In particular, we consider the case that noise has Laplacian distribution and we first show that unlike the the Gaussian case, the resulting sub-problems of EM algorithm in this case does not have closed-form update rule, thus preventing us from using EM in this case. To overcome this issue, we propose a new algorithm based on combining the alternating direction method of multipliers (ADMM) with EM algorithm idea. Our numerical experiments show that our method outperforms the EM algorithm in statistical accuracy and computational time in non-Gaussian noise case.
We study an online linear programming (OLP) problem under a random input model in which the columns of the constraint matrix along with the corresponding coefficients in the objective function are generated i.i.d. from an unknown distribution and revealed sequentially over time. Virtually all pre-existing online algorithms were based on learning the dual optimal solutions/prices of the linear programs (LP), and their analyses were focused on the aggregate objective value and solving the packing LP where all coefficients in the constraint matrix and objective are nonnegative. However, two major open questions were: (i) Does the set of LP optimal dual prices learned in the pre-existing algorithms converge to those of the offline LP, and (ii) Could the results be extended to general LP problems where the coefficients can be either positive or negative. We resolve these two questions by establishing convergence results for the dual prices under moderate regularity conditions for general LP problems. Specifically, we identify an equivalent form of the dual problem which relates the dual LP with a sample average approximation to a stochastic program. Furthermore, we propose a new type of OLP algorithm, Action-History-Dependent Learning Algorithm, which improves the previous algorithm performances by taking into account the past input data as well as decisions/actions already made. We derive an $O(log n log log n)$ regret bound (under a locally strong convexity and smoothness condition) for the proposed algorithm, against the $O(sqrt{n})$ bound for typical dual-price learning algorithms, where $n$ is the number of decision variables. Numerical experiments demonstrate the effectiveness of the proposed algorithm and the action-history-dependent design.