
Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers

 Added by Holger Rauhut
 Publication date 2019
Language: English





We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight matrices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can be re-interpreted as a Riemannian gradient flow on the manifold of rank-$r$ matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank-$k$ matrices for some $k \leq r$.
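As a rough illustration of the setting, the sketch below runs plain gradient descent with a small step size, used here as a stand-in discretization of the gradient flow, on the factors of a three-layer linear network with the square loss $\frac{1}{2}\|W_3 W_2 W_1 X - Y\|_F^2$. The layer widths, data, initialization scale, step size, and all function names are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np

def product(weights, dim):
    """Return W_N ... W_1 for weights = [W_1, ..., W_N]; identity of size dim if empty."""
    prod = np.eye(dim)
    for w in weights:
        prod = w @ prod
    return prod

def loss(weights, x, y):
    """Square loss 0.5 * ||W_N ... W_1 x - y||_F^2."""
    w = product(weights, weights[0].shape[1])
    return 0.5 * np.linalg.norm(w @ x - y) ** 2

def gradients(weights, x, y):
    """Gradient of the loss with respect to every factor W_j (chain rule)."""
    d_in = weights[0].shape[1]
    outer = (product(weights, d_in) @ x - y) @ x.T  # gradient w.r.t. the end-to-end matrix
    grads = []
    for j, w_j in enumerate(weights):
        before = product(weights[:j], d_in)             # W_{j-1} ... W_1
        after = product(weights[j + 1:], w_j.shape[0])  # W_N ... W_{j+1}
        grads.append(after.T @ outer @ before.T)
    return grads

def gradient_descent(weights, x, y, step, iters):
    """Explicit Euler discretization of the gradient flow on the factors."""
    losses = [loss(weights, x, y)]
    for _ in range(iters):
        grads = gradients(weights, x, y)
        weights = [w - step * g for w, g in zip(weights, grads)]
        losses.append(loss(weights, x, y))
    return weights, losses

# Hypothetical toy problem: widths 4 -> 3 -> 3 -> 4, so the end-to-end
# matrix W_3 W_2 W_1 has rank at most r = 3 by construction.
rng = np.random.default_rng(0)
dims = [4, 3, 3, 4]
x = rng.standard_normal((4, 50)) / np.sqrt(50)
y = rng.standard_normal((4, 4)) @ x
weights0 = [0.5 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]

trained, losses = gradient_descent(weights0, x, y, step=0.005, iters=2000)
print("initial loss:", losses[0], "final loss:", losses[-1])
```

Because the hidden width is smaller than the input and output dimensions, the product of the trained factors stays on the set of matrices of rank at most 3, which is the kind of constraint the abstract's rank-$r$ manifold describes. The discrete iteration only approximates the continuous flow, so it should not be read as a verification of the convergence result.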





We study the convergence issue for the gradient algorithm (employing general step sizes) for optimization problems on general Riemannian manifolds (without curvature constraints). Under the assumption of the local convexity/quasi-convexity (resp. weak sharp minima), local/global convergence (resp. linear convergence) results are established. As an application, the linear convergence properties of the gradient algorithm employing the constant step sizes and the Armijo step sizes for finding the Riemannian $L^p$ ($p \in [1,+\infty)$) centers of mass are explored, respectively, which in particular extend and/or improve the corresponding results in \cite{Afsari2013}.
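To make the gradient algorithm with a constant step size concrete, here is a minimal sketch for the $L^2$ center of mass on the unit sphere, where the Riemannian gradient of $f(p) = \frac{1}{2n}\sum_i d(p, q_i)^2$ is $-\frac{1}{n}\sum_i \log_p(q_i)$. The choice of manifold (the 2-sphere), the case $p = 2$, the step size, the sample points, and the function names are all assumptions made for illustration; the abstract covers general manifolds and general $p$.

```python
import numpy as np

def sphere_log(p, q):
    """Logarithm map on the unit sphere: tangent vector at p pointing to q."""
    cos_t = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta / np.sin(theta) * (q - cos_t * p)

def sphere_exp(p, v):
    """Exponential map on the unit sphere: follow the geodesic from p along v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def riemannian_gradient(p, points):
    """Gradient of f(p) = (1 / 2n) * sum_i d(p, q_i)^2 at p."""
    return -np.mean([sphere_log(p, q) for q in points], axis=0)

def center_of_mass(points, step=0.5, iters=200):
    """Gradient algorithm with a constant step size, started at the first point."""
    p = points[0]
    for _ in range(iters):
        p = sphere_exp(p, -step * riemannian_gradient(p, points))
    return p

# Hypothetical data: three points at geodesic distance 0.3 from the north
# pole, spread evenly in azimuth, so the pole is their center of mass.
angle = 0.3
points = np.array([
    [np.sin(angle) * np.cos(a), np.sin(angle) * np.sin(a), np.cos(angle)]
    for a in (0.0, 2 * np.pi / 3, 4 * np.pi / 3)
])
mean = center_of_mass(points)
print(mean)
```

An Armijo variant would replace the fixed `step` with a backtracking search on $f$ along the geodesic; that is omitted here to keep the sketch short.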
In non-convex settings, it is established that the behavior of gradient-based algorithms is different in the vicinity of local structures of the objective function such as strict and non-strict saddle points, local and global minima and maxima. It is therefore crucial to describe the landscape of non-convex problems. That is, to describe as well as possible the set of points of each of the above categories. In this work, we study the landscape of the empirical risk associated with deep linear neural networks and the square loss. It is known that, under weak assumptions, this objective function has no spurious local minima and no local maxima. We go a step further and characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that have been proved or observed when optimizing a linear neural network. In passing, we also provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
Tian Ye, Simon S. Du (2021)
We study the asymmetric low-rank factorization problem: \[\min_{\mathbf{U} \in \mathbb{R}^{m \times d},\, \mathbf{V} \in \mathbb{R}^{n \times d}} \frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2\] where $\mathbf{\Sigma}$ is a given matrix of size $m \times n$ and rank $d$. This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and $\mathbf{V}$). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance $\mathbf{U}$ and $\mathbf{V}$. This paper presents the first proof that randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem at a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.
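The unmodified algorithm the abstract refers to can be sketched in a few lines: gradient descent on $\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2$ from a small random initialization, with no added noise and no balancing regularizer. The matrix sizes, singular values, step size, initialization scale, and the function name `factorize` are assumptions picked for this example, not values from the paper.

```python
import numpy as np

def factorize(sigma, d, step=0.05, iters=5000, init_scale=0.1, seed=0):
    """Plain gradient descent on 0.5 * ||U V^T - sigma||_F^2 from a random start."""
    rng = np.random.default_rng(seed)
    m, n = sigma.shape
    u = init_scale * rng.standard_normal((m, d))
    v = init_scale * rng.standard_normal((n, d))
    history = []
    for _ in range(iters):
        residual = u @ v.T - sigma
        history.append(0.5 * np.linalg.norm(residual) ** 2)
        # Simultaneous update: both gradients use the same residual.
        u, v = u - step * residual @ v, v - step * residual.T @ u
    return u, v, history

# Hypothetical rank-3 target with singular values 2, 1.5, and 1.
rng = np.random.default_rng(1)
m, n, d = 6, 5, 3
left = np.linalg.qr(rng.standard_normal((m, d)))[0]   # orthonormal columns
right = np.linalg.qr(rng.standard_normal((n, d)))[0]  # orthonormal columns
sigma = left @ np.diag([2.0, 1.5, 1.0]) @ right.T

u, v, history = factorize(sigma, d)
print("initial loss:", history[0], "final loss:", history[-1])
```

On a small, well-conditioned instance like this one the loss typically drops to numerical zero, consistent with the empirical observation the abstract describes; the example says nothing about the polynomial rate itself.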
Communication efficiency is a major bottleneck in the applications of distributed networks. To address this bottleneck, quantized distributed optimization has attracted a lot of attention. However, most existing quantized distributed optimization algorithms converge only sublinearly. To achieve linear convergence, this paper proposes a novel quantized distributed gradient tracking algorithm (Q-DGT) to minimize a finite sum of local objective functions over directed networks. Moreover, we explicitly derive the update rule for the number of quantization levels, and prove that Q-DGT can converge linearly even when each exchanged variable is quantized to a single bit. Numerical results also confirm the efficiency of the proposed algorithm.
Jiayi Guo, Adrian Lewis (2017)
The popular BFGS quasi-Newton minimization algorithm under reasonable conditions converges globally on smooth convex functions. This result was proved by Powell in 1976: we consider its implications for functions that are not smooth. In particular, an analogous convergence result holds for functions, like the Euclidean norm, that are nonsmooth at the minimizer.
