No Arabic abstract
The extrapolation strategy raised by Nesterov, which can accelerate the convergence rate of gradient descent methods by orders of magnitude when dealing with smooth convex objective, has led to tremendous success in training machine learning tasks. In this article, the convergence of individual iterates of projected subgradient (PSG) methods for nonsmooth convex optimization problems is theoretically studied based on Nesterovs extrapolation, which we name individual convergence. We prove that Nesterovs extrapolation has the strength to make the individual convergence of PSG optimal for nonsmooth problems. In light of this consideration, a direct modification of the subgradient evaluation suffices to achieve optimal individual convergence for strongly convex problems, which can be regarded as making an interesting step toward the open question about stochastic gradient descent (SGD) posed by Shamir. Furthermore, we give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings. Compared with other state-of-the-art nonsmooth methods, the derived algorithms can serve as an alternative to the basic SGD especially in coping with machine learning problems, where an individual output is needed to guarantee the regularization structure while keeping an optimal rate of convergence. Typically, our method is applicable as an efficient tool for solving large-scale $l$1-regularized hinge-loss learning problems. Several comparison experiments demonstrate that our individual output not only achieves an optimal convergence rate but also guarantees better sparsity than the averaged solution.
We study Nesterovs accelerated gradient method with constant step-size and momentum parameters in the stochastic approximation setting (unbiased gradients with bounded variance) and the finite-sum setting (where randomness is due to sampling mini-batches). To build better insight into the behavior of Nesterovs method in stochastic settings, we focus throughout on objectives that are smooth, strongly-convex, and twice continuously differentiable. In the stochastic approximation setting, Nesterovs method converges to a neighborhood of the optimal point at the same accelerated rate as in the deterministic setting. Perhaps surprisingly, in the finite-sum setting, we prove that Nesterovs method may diverge with the usual choice of step-size and momentum, unless additional conditions on the problem related to conditioning and data coherence are satisfied. Our results shed light as to why Nesterovs method may fail to converge or achieve acceleration in the finite-sum setting.
Structure learning of directed acyclic graphs (DAGs) is a fundamental problem in many scientific endeavors. A new line of work, based on NOTEARS (Zheng et al., 2018), reformulates the structure learning problem as a continuous optimization one by leveraging an algebraic characterization of DAG constraint. The constrained problem is typically solved using the augmented Lagrangian method (ALM) which is often preferred to the quadratic penalty method (QPM) by virtue of its convergence result that does not require the penalty coefficient to go to infinity, hence avoiding ill-conditioning. In this work, we review the standard convergence result of the ALM and show that the required conditions are not satisfied in the recent continuous constrained formulation for learning DAGs. We demonstrate empirically that its behavior is akin to that of the QPM which is prone to ill-conditioning, thus motivating the use of second-order method in this setting. We also establish the convergence guarantee of QPM to a DAG solution, under mild conditions, based on a property of the DAG constraint term.
In this paper, we investigate the non-asymptotic stationary convergence behavior of Stochastic Mirror Descent (SMD) for nonconvex optimization. We focus on a general class of nonconvex nonsmooth stochastic optimization problems, in which the objective can be decomposed into a relatively weakly convex function (possibly non-Lipschitz) and a simple non-smooth convex regularizer. We prove that SMD, without the use of mini-batch, is guaranteed to converge to a stationary point in a convergence rate of $ mathcal{O}(1/sqrt{t}) $. The efficiency estimate matches with existing results for stochastic subgradient method, but is evaluated under a stronger stationarity measure. Our convergence analysis applies to both the original SMD and its proximal version, as well as the deterministic variants, for solving relatively weakly convex problems.
In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clarke Jacobians in the usual differentiation formulas is fully justified for a wide class of nonsmooth problems. Moreover this calculus is entirely compatible with algorithmic differentiation (e.g., backpropagation). We provide several applications such as training deep equilibrium networks, training neural nets with conic optimization layers, or hyperparameter-tuning for nonsmooth Lasso-type models. To show the sharpness of our assumptions, we present numerical experiments showcasing the extremely pathological gradient dynamics one can encounter when applying implicit algorithmic differentiation without any hypothesis.
In this work, we present a globalized stochastic semismooth Newton method for solving stochastic optimization problems involving smooth nonconvex and nonsmooth convex terms in the objective function. We assume that only noisy gradient and Hessian information of the smooth part of the objective function is available via calling stochastic first and second order oracles. The proposed method can be seen as a hybrid approach combining stochastic semismooth Newton steps and stochastic proximal gradient steps. Two inexact growth conditions are incorporated to monitor the convergence and the acceptance of the semismooth Newton steps and it is shown that the algorithm converges globally to stationary points in expectation. Moreover, under standard assumptions and utilizing random matrix concentration inequalities, we prove that the proposed approach locally turns into a pure stochastic semismooth Newton method and converges r-superlinearly with high probability. We present numerical results and comparisons on $ell_1$-regularized logistic regression and nonconvex binary classification that demonstrate the efficiency of our algorithm.