No Arabic abstract
In non-convex settings, it is established that the behavior of gradient-based algorithms is different in the vicinity of local structures of the objective function such as strict and non-strict saddle points, local and global minima and maxima. It is therefore crucial to describe the landscape of non-convex problems. That is, to describe as well as possible the set of points of each of the above categories. In this work, we study the landscape of the empirical risk associated with deep linear neural networks and the square loss. It is known that, under weak assumptions, this objective function has no spurious local minima and no local maxima. We go a step further and characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that have been proved or observed when optimizing a linear neural network. In passing, we also provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.
We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight matrices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can be re-interpreted as a Riemannian gradient flow on the manifold of rank-$r$ matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank $k$ matrices for some $kleq r$.
An almost non-abelian extension of the Rieffel deformation is presented in this work. The non-abelicity comes into play by the introduction of unitary groups which are dependent of the infinitesimal generators of $SU(n)$. This extension is applied to quantum mechanics and quantum field theory.
In this paper, we study the properties of robust nonparametric estimation using deep neural networks for regression models with heavy tailed error distributions. We establish the non-asymptotic error bounds for a class of robust nonparametric regression estimators using deep neural networks with ReLU activation under suitable smoothness conditions on the regression function and mild conditions on the error term. In particular, we only assume that the error distribution has a finite p-th moment with p greater than one. We also show that the deep robust regression estimators are able to circumvent the curse of dimensionality when the distribution of the predictor is supported on an approximate lower-dimensional set. An important feature of our error bound is that, for ReLU neural networks with network width and network size (number of parameters) no more than the order of the square of the dimensionality d of the predictor, our excess risk bounds depend sub-linearly on d. Our assumption relaxes the exact manifold support assumption, which could be restrictive and unrealistic in practice. We also relax several crucial assumptions on the data distribution, the target regression function and the neural networks required in the recent literature. Our simulation studies demonstrate that the robust methods can significantly outperform the least squares method when the errors have heavy-tailed distributions and illustrate that the choice of loss function is important in the context of deep nonparametric regression.
We develop a novel variant of the classical Frank-Wolfe algorithm, which we call spectral Frank-Wolfe, for convex optimization over a spectrahedron. The spectral Frank-Wolfe algorithm has a novel ingredient: it computes a few eigenvectors of the gradient and solves a small-scale SDP in each iteration. Such procedure overcomes slow convergence of the classical Frank-Wolfe algorithm due to ignoring eigenvalue coalescence. We demonstrate that strict complementarity of the optimization problem is key to proving linear convergence of various algorithms, such as the spectral Frank-Wolfe algorithm as well as the projected gradient method and its accelerated version.