Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization

Published by Simon Du

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Tian Ye, Simon S. Du

We study the asymmetric low-rank factorization problem: \[\min_{\mathbf{U} \in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}} \frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{\Sigma}\|_F^2\] where $\mathbf{\Sigma}$ is a given matrix of size $m \times n$ and rank $d$. This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and $\mathbf{V}$). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance the $\mathbf{U}$ and $\mathbf{V}$. This paper presents the first proof that shows randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem with a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.
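
As a rough illustration of the setting described above, the following sketch runs plain gradient descent from a small random initialization on the objective $\frac{1}{2}\|UV^\top - \Sigma\|_F^2$, with no added noise and no balancing regularizer. The step size, initialization scale, iteration count, test matrix, and the helper names `factorize_gd` and `make_rank_d_matrix` are assumptions chosen for this example; they are not values or code taken from the paper.

```python
import numpy as np

def factorize_gd(Sigma, d, step, init_scale=1e-3, iters=3000, seed=0):
    """Plain gradient descent on f(U, V) = 0.5 * ||U V^T - Sigma||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = Sigma.shape
    # Small Gaussian random initialization (the scale is an illustrative choice).
    U = init_scale * rng.standard_normal((m, d))
    V = init_scale * rng.standard_normal((n, d))
    for _ in range(iters):
        R = U @ V.T - Sigma      # residual U V^T - Sigma
        grad_U = R @ V           # gradient of f with respect to U
        grad_V = R.T @ U         # gradient of f with respect to V
        # Simultaneous update of both factors.
        U, V = U - step * grad_U, V - step * grad_V
    return U, V

def make_rank_d_matrix(m, n, singular_values, seed=1):
    """Build an m x n matrix with the given nonzero singular values."""
    rng = np.random.default_rng(seed)
    d = len(singular_values)
    Q1, _ = np.linalg.qr(rng.standard_normal((m, d)))  # orthonormal columns
    Q2, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return Q1 @ np.diag(singular_values) @ Q2.T

Sigma = make_rank_d_matrix(8, 6, [2.0, 1.0])
# Step size tied to the largest singular value (a heuristic assumption).
U, V = factorize_gd(Sigma, d=2, step=0.1 / np.linalg.norm(Sigma, 2))
rel_err = np.linalg.norm(U @ V.T - Sigma) / np.linalg.norm(Sigma)
print(f"relative error: {rel_err:.2e}")
```

The example only shows the unmodified algorithm on one small, well-conditioned instance; it says nothing about the convergence rate established in the paper.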

Zhihui Zhu, Qiuwei Li, Xinshuo Yang (2018)

We study the convergence of a variant of distributed gradient descent (DGD) on a distributed low-rank matrix approximation problem wherein some optimization variables are used for consensus (as in classical DGD) and some optimization variables appear only locally at a single node in the network. We term the resulting algorithm DGD+LOCAL. Using algorithmic connections to gradient descent and geometric connections to the well-behaved landscape of the centralized low-rank matrix approximation problem, we identify sufficient conditions where DGD+LOCAL is guaranteed to converge with exact consensus to a global minimizer of the original centralized problem. For the distributed low-rank matrix approximation problem, these guarantees are stronger---in terms of consensus and optimality---than what appear in the literature for classical DGD and more general problems.

Optimization and Control, Machine Learning

Fast Global Convergence for Low-rank Matrix Recovery via Riemannian Gradient Descent with Random Initialization

Thomas Y. Hou, Zhenzhen Li, Ziyun Zhang (2020)

In this paper, we propose a new global analysis framework for a class of low-rank matrix recovery problems on the Riemannian manifold. We analyze the global behavior for the Riemannian optimization with random initialization. We use the Riemannian gradient descent algorithm to minimize a least squares loss function, and study the asymptotic behavior as well as the exact convergence rate. We reveal a previously unknown geometric property of the low-rank matrix manifold, which is the existence of spurious critical points for the simple least squares function on the manifold. We show that under some assumptions, the Riemannian gradient descent starting from a random initialization with high probability avoids these spurious critical points and only converges to the ground truth in nearly linear convergence rate, i.e. $\mathcal{O}(\text{log}(\frac{1}{\epsilon})+ \text{log}(n))$ iterations to reach an $\epsilon$-accurate solution. We use two applications as examples for our global analysis. The first one is a rank-1 matrix recovery problem. The second one is a generalization of the Gaussian phase retrieval problem. It only satisfies the weak isometry property, but has behavior similar to that of the first one except for an extra saddle set. Our convergence guarantee is nearly optimal and almost dimension-free, which fully explains the numerical observations. The global analysis can be potentially extended to other data problems with random measurement structures and empirical least squares loss functions.

Machine Learning, Information Theory

Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent

Qinqing Zheng, John Lafferty (2016)

We address the rectangular matrix completion problem by lifting the unknown matrix to a positive semidefinite matrix in higher dimension, and optimizing a nonconvex objective over the semidefinite factor using a simple gradient descent scheme. With $O(\mu r^2 \kappa^2 n \max(\mu, \log n))$ random observations of an $n_1 \times n_2$ $\mu$-incoherent matrix of rank $r$ and condition number $\kappa$, where $n = \max(n_1, n_2)$, the algorithm linearly converges to the global optimum with high probability.
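
The lifting mentioned above can be written out explicitly. As a sketch, assume the unknown matrix is factored as $M = XY^\top$ with $X \in \mathbb{R}^{n_1 \times r}$ and $Y \in \mathbb{R}^{n_2 \times r}$ (the symbols $M$, $X$, $Y$, and $Z$ are introduced here for illustration and do not appear in the abstract). Stacking the two factors gives a positive semidefinite matrix whose off-diagonal block is the rectangular target:

```latex
\[
Z = \begin{bmatrix} X \\ Y \end{bmatrix} \in \mathbb{R}^{(n_1+n_2)\times r},
\qquad
ZZ^\top = \begin{bmatrix} XX^\top & XY^\top \\ YX^\top & YY^\top \end{bmatrix} \succeq 0 .
\]
```

Under this reading, gradient descent runs over the single stacked factor $Z$ rather than over $X$ and $Y$ separately.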

Machine Learning

Sharp Global Guarantees for Nonconvex Low-Rank Matrix Recovery in the Overparameterized Regime

Richard Y. Zhang (2021)

We prove that it is possible for nonconvex low-rank matrix recovery to contain no spurious local minima when the rank of the unknown ground truth $r^{\star}<r$ is strictly less than the search rank $r$, and yet for the claim to be false when $r^{\star}=r$. Under the restricted isometry property (RIP), we prove, for the general overparameterized regime with $r^{\star}\le r$, that an RIP constant of $\delta<1/(1+\sqrt{r^{\star}/r})$ is sufficient for the inexistence of spurious local minima, and that $\delta<1/(1+1/\sqrt{r-r^{\star}+1})$ is necessary due to existence of counterexamples. Without an explicit control over $r^{\star}\le r$, an RIP constant of $\delta<1/2$ is both necessary and sufficient for the exact recovery of a rank-$r$ ground truth. But if the ground truth is known a priori to have $r^{\star}=1$, then the sharp RIP threshold for exact recovery is improved to $\delta<1/(1+1/\sqrt{r})$.
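
To make the two bounds above concrete, the short sketch below evaluates the sufficient threshold $1/(1+\sqrt{r^{\star}/r})$ and the necessary threshold $1/(1+1/\sqrt{r-r^{\star}+1})$ for a few rank pairs. The function name `rip_thresholds` and the sample values of $r^{\star}$ and $r$ are assumptions made for this example only.

```python
import math

def rip_thresholds(r_star: int, r: int) -> tuple[float, float]:
    """Return (sufficient, necessary) upper bounds on the RIP constant delta."""
    if not 1 <= r_star <= r:
        raise ValueError("expected 1 <= r_star <= r")
    # Sufficient for no spurious local minima: delta < 1 / (1 + sqrt(r*/r)).
    sufficient = 1.0 / (1.0 + math.sqrt(r_star / r))
    # Necessary (counterexamples exist otherwise): delta < 1 / (1 + 1/sqrt(r - r* + 1)).
    necessary = 1.0 / (1.0 + 1.0 / math.sqrt(r - r_star + 1))
    return sufficient, necessary

for r_star, r in [(3, 3), (1, 4), (2, 8)]:
    suff, nec = rip_thresholds(r_star, r)
    print(f"r*={r_star}, r={r}: sufficient delta < {suff:.4f}, necessary delta < {nec:.4f}")
```

The two formulas coincide when $r^{\star}=r$ (both give $1/2$) and when $r^{\star}=1$ (both give $1/(1+1/\sqrt{r})$), which matches the two sharp thresholds quoted in the abstract. For ranks strictly in between, the formulas leave a gap.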

Optimization and Control, Machine Learning

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

Yossi Arjevani, Ohad Shamir, Nathan Srebro (2018)

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent run on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.
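
The delayed update described above can be sketched for the noiseless case as $x_{t+1} = x_t - \eta \nabla f(x_{t-\tau})$ on a quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$. The matrix `A`, vector `b`, step size, delay, and the function name `delayed_gd` below are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def delayed_gd(A, b, tau, step, iters):
    """Minimize 0.5 * x^T A x - b^T x using gradients that are tau rounds stale."""
    # history[0] is the iterate from tau rounds ago; history[-1] is the current one.
    # All stored iterates start at zero, so early gradients use the initial point.
    history = [np.zeros_like(b) for _ in range(tau + 1)]
    for _ in range(iters):
        stale_grad = A @ history[0] - b           # gradient at the delayed iterate
        x_next = history[-1] - step * stale_grad  # step taken from the current iterate
        history = history[1:] + [x_next]          # slide the window forward
    return history[-1]

A = np.diag([1.0, 0.5])
b = np.array([1.0, -2.0])
x_star = np.linalg.solve(A, b)                    # exact minimizer

x_delayed = delayed_gd(A, b, tau=5, step=0.1, iters=500)
x_plain = delayed_gd(A, b, tau=0, step=0.1, iters=500)  # tau = 0 is ordinary GD
print(np.linalg.norm(x_delayed - x_star), np.linalg.norm(x_plain - x_star))
```

With this small step size both runs reach the minimizer. The sketch only shows the mechanics of the stale-gradient update and does not reproduce the error bounds from the paper.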

Optimization and Control, Machine Learning