
When Does Non-Orthogonal Tensor Decomposition Have No Spurious Local Minima?

Posted by Sina Baharlouei
Publication date: 2019
Research field: Informatics Engineering
Paper language: English





We study the optimization problem for decomposing $d$-dimensional fourth-order tensors with $k$ non-orthogonal components. We derive \textit{deterministic} conditions under which such a problem has no spurious local minima. In particular, we show that if $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} < \frac{5}{4}$ and the incoherence coefficient is of order $O(\frac{1}{\sqrt{d}})$, then all local minima are globally optimal. Using standard techniques, these conditions can easily be transformed into conditions that hold with high probability in high dimensions when the components are generated randomly. Finally, we prove that the tensor power method with deflation and restarts can efficiently extract all the components within a tolerance level $O(\kappa\sqrt{k\tau^3})$ that appears to be the noise floor of non-orthogonal tensor decomposition.
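As a concrete illustration of the procedure referenced in the last sentence, the sketch below runs the tensor power method with random restarts and deflation on a symmetric fourth-order tensor stored as a dense $d\times d\times d\times d$ array. The function names, the restart/iteration counts, and the dense-array representation are illustrative assumptions, not the paper's exact algorithm or parameters.

```python
import numpy as np

def power_step(T, u):
    """One fourth-order tensor power iteration step: v = T(I, u, u, u), normalized."""
    v = np.einsum('ijkl,j,k,l->i', T, u, u, u)
    return v / np.linalg.norm(v)

def extract_components(T, k, n_restarts=10, n_iters=100, seed=0):
    """Greedy power iteration with deflation: estimate k rank-1 components of a
    symmetric fourth-order tensor T ~ sum_i lam_i a_i^{(x4)}."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    weights, components = [], []
    R = T.copy()                      # residual tensor, deflated after each component
    for _ in range(k):
        best_u, best_val = None, -np.inf
        for _ in range(n_restarts):   # random restarts; keep the best fixed point found
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):
                u = power_step(R, u)
            val = np.einsum('ijkl,i,j,k,l->', R, u, u, u, u)
            if val > best_val:
                best_u, best_val = u, val
        components.append(best_u)
        weights.append(best_val)
        # deflate: subtract the estimated rank-1 term lam * u^{(x4)}
        R = R - best_val * np.einsum('i,j,k,l->ijkl', best_u, best_u, best_u, best_u)
    return np.array(weights), np.array(components)
```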



Read also

Stochastic gradient descent (SGD) is widely used in machine learning. Although commonly viewed as a fast but less accurate version of gradient descent (GD), it always finds better solutions than GD for modern neural networks. In order to understand this phenomenon, we take the alternative view that SGD works on a convolved (thus smoothed) version of the loss function. We show that, even if the function $f$ has many bad local minima or saddle points, as long as for every point $x$ the weighted average of the gradients of its neighborhood is one-point convex with respect to the desired solution $x^*$, SGD will get close to, and then stay around, $x^*$ with constant probability. More specifically, SGD will not get stuck at sharp local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information. The neighborhood size is controlled by the step size and the gradient noise. Our result identifies a set of functions on which SGD provably works, a set much larger than the set of convex functions. Empirically, we observe that the loss surface of neural networks enjoys nice one-point convexity properties locally; our theorem therefore helps explain why SGD works so well for neural networks.
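A minimal sketch of the "convolved loss" viewpoint described above: the gradient of the Gaussian-smoothed loss is estimated by averaging gradients at randomly perturbed points, and the one-point convexity condition with respect to a target $x^*$ is checked numerically. The smoothing width `sigma`, the sample count, and the function names are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

def smoothed_grad(grad_f, x, sigma, n_samples=100, rng=None):
    """Monte-Carlo estimate of the gradient of the Gaussian-convolved loss:
    E_{z ~ N(0, sigma^2 I)}[ grad_f(x + z) ]."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_samples):
        g += grad_f(x + sigma * rng.standard_normal(x.shape))
    return g / n_samples

def one_point_convex(grad_f, x, x_star, sigma):
    """Check one-point convexity at x w.r.t. the target x*: the smoothed
    negative gradient should point towards x*."""
    g = smoothed_grad(grad_f, x, sigma)
    return np.dot(-g, x_star - x) > 0
```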
In this work we quantitatively analyse the interplay between the loss landscape and the performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate the trivialization of the landscape at large signal-to-noise ratios. We evaluate in closed form the performance of a gradient flow algorithm using integro-differential PDEs, as developed in the physics of disordered systems for the Langevin dynamics. We analyse the performance of an approximate message passing algorithm estimating the maximum likelihood configuration via its state evolution. We conclude by comparing these results: while we observe a drastic slowdown of the gradient flow dynamics even in the region where the landscape is trivial, both algorithms are shown to perform well even in the part of the parameter region where spurious local minima are present.
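For orientation, the sketch below writes down one plausible form of the negative log-likelihood loss for a spiked matrix-tensor observation (a symmetric matrix plus a symmetric third-order tensor, each a noisy rank-1 spike of the same signal) and runs projected gradient descent on the sphere, a crude discrete stand-in for the gradient flow analysed above. The normalisations, noise variances `d2`/`d3`, and step size are illustrative assumptions and do not reproduce the paper's scalings or its Kac-Rice/state-evolution analysis.

```python
import numpy as np

def spiked_loss_grad(x, Y, T, d2, d3):
    """Gradient of an illustrative spiked matrix-tensor loss (Y and T assumed symmetric):
    L(x) = ||Y - x x^T||_F^2 / (4 d2) + ||T - x⊗x⊗x||^2 / (6 d3)."""
    g2 = -(Y - np.outer(x, x)) @ x / d2
    R3 = T - np.einsum('i,j,k->ijk', x, x, x)
    g3 = -np.einsum('ijk,j,k->i', R3, x, x) / d3
    return g2 + g3

def projected_gradient_descent(Y, T, d2, d3, lr=0.01, n_iters=500, seed=0):
    """Spherical gradient descent on the illustrative spiked matrix-tensor loss."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    x = rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)      # keep ||x||^2 = n, as in spiked models
    for _ in range(n_iters):
        g = spiked_loss_grad(x, Y, T, d2, d3)
        g -= (g @ x) * x / (x @ x)           # project the gradient onto the tangent space
        x -= lr * g
        x *= np.sqrt(n) / np.linalg.norm(x)  # retract back to the sphere of radius sqrt(n)
    return x
```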
This work is concerned with non-negative rank-1 robust principal component analysis (RPCA), where the goal is to exactly recover the dominant non-negative principal components of a data matrix when a number of measurements may be grossly corrupted with sparse and arbitrarily large noise. Most known techniques for solving RPCA rely on convex relaxation methods that lift the problem to a higher dimension, which significantly increases the number of variables. As an alternative, the well-known Burer-Monteiro approach can be used to cast the RPCA as a non-convex and non-smooth $\ell_1$ optimization problem with a significantly smaller number of variables. In this work, we show that the low-dimensional formulation of the symmetric and asymmetric positive rank-1 RPCA based on the Burer-Monteiro approach has a benign landscape, i.e., 1) it does not have any spurious local solution, 2) it has a unique global solution, and 3) its unique global solution coincides with the true components. An implication of this result is that simple local search algorithms are guaranteed to achieve a zero global optimality gap when directly applied to the low-dimensional formulation. Furthermore, we provide strong deterministic and probabilistic guarantees for the exact recovery of the true principal components. In particular, it is shown that a constant fraction of the measurements could be grossly corrupted and yet not create any spurious local solution.
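A minimal sketch of the low-dimensional Burer-Monteiro formulation described above, in the symmetric case: minimize the non-convex, non-smooth objective $\|uu^{T}-M\|_{1}$ over $u\ge 0$ by projected subgradient descent. The solver, step-size schedule, and initialization are illustrative assumptions; the abstract's landscape result concerns the formulation itself, not this particular algorithm.

```python
import numpy as np

def rank1_rpca_subgradient(M, n_iters=2000, lr0=0.1, seed=0):
    """Symmetric non-negative rank-1 RPCA in Burer-Monteiro form:
    minimize ||u u^T - M||_1 over u >= 0 (M assumed symmetric),
    via projected subgradient descent with a diminishing step size."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    u = np.abs(rng.standard_normal(n))            # non-negative initialization
    for t in range(1, n_iters + 1):
        R = np.outer(u, u) - M
        subgrad = 2 * np.sign(R) @ u              # a subgradient of ||uu^T - M||_1
        u = np.maximum(u - (lr0 / np.sqrt(t)) * subgrad, 0.0)  # project onto u >= 0
    return u
```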
Karim Halaseh, Tommi Muller, 2020
In this paper we study the problem of recovering a tensor network decomposition of a given tensor $\mathcal{T}$ in which the tensors at the vertices of the network are orthogonally decomposable. Specifically, we consider tensor networks in the form of tensor trains (a.k.a. matrix product states). When the tensor train has length 2, and the orthogonally decomposable tensors at the two vertices of the network are symmetric, we show how to recover the decomposition by considering random linear combinations of slices. Furthermore, if the tensors at the vertices are symmetric but not orthogonally decomposable, we show that a whitening procedure can transform the problem into an orthogonal one, thereby yielding a solution for the decomposition of the tensor. When the tensor network has length 3 or more and the tensors at the vertices are symmetric and orthogonally decomposable, we provide an algorithm for recovering them subject to some rank conditions. Finally, in the case of tensor trains of length two in which the tensors at the vertices are orthogonally decomposable but not necessarily symmetric, we show that the decomposition problem reduces to a novel matrix decomposition problem: that of an orthogonal matrix multiplied by diagonal matrices on either side. We provide two solutions for the full-rank tensor case, using Sinkhorn's theorem and the Procrustes algorithm respectively, and show that the Procrustes-based solution can be generalized to any rank. We conclude with a multitude of open problems in linear and multilinear algebra that arose in our study.
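The "random linear combinations of slices" idea is in the spirit of classical simultaneous-diagonalization (Jennrich-style) recovery. The sketch below shows that generic idea for a third-order tensor with linearly independent factors; it is an illustrative analogue under those assumptions, not the paper's algorithm for length-2 tensor trains.

```python
import numpy as np

def jennrich_factors(T, rng=None):
    """Jennrich-style recovery for T = sum_i lam_i a_i ⊗ b_i ⊗ c_i with linearly
    independent factor matrices: eigen-decompose a ratio of two random slice
    combinations; the eigenvectors estimate the a_i up to scale and permutation."""
    rng = rng or np.random.default_rng(0)
    d3 = T.shape[2]
    x, y = rng.standard_normal(d3), rng.standard_normal(d3)
    M1 = np.einsum('ijk,k->ij', T, x)   # random combination of frontal slices
    M2 = np.einsum('ijk,k->ij', T, y)
    # M1 = A D_x B^T and M2 = A D_y B^T share the same A, so the eigenvectors of
    # M1 pinv(M2) (with nonzero eigenvalues) recover the columns of A.
    _, vecs = np.linalg.eig(M1 @ np.linalg.pinv(M2))
    return vecs
```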
Nonconvex matrix recovery is known to contain no spurious local minima under a restricted isometry property (RIP) with a sufficiently small RIP constant $\delta$. If $\delta$ is too large, however, counterexamples containing spurious local minima are known to exist. In this paper, we introduce a proof technique capable of establishing sharp thresholds on $\delta$ that guarantee the nonexistence of spurious local minima. Using this technique, we prove that in the case of a rank-1 ground truth, an RIP constant of $\delta<1/2$ is both necessary and sufficient for exact recovery from any arbitrary initial point (such as a random point). We also prove a local recovery result: given an initial point $x_{0}$ satisfying $f(x_{0})\le(1-\delta)^{2}f(0)$, any descent algorithm that converges to second-order optimality guarantees exact recovery.
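A minimal sketch of the non-convex rank-1 formulation discussed above: $f(x)=\tfrac12\|\mathcal{A}(xx^{T})-b\|^{2}$ with a random Gaussian measurement operator, minimized by plain gradient descent from a random initial point. The operator construction, step size, and iteration count are illustrative assumptions; whether recovery succeeds depends on the RIP constant, as the abstract describes.

```python
import numpy as np

def make_operator(m, n, seed=0):
    """Random Gaussian linear measurement operator A: R^{n x n} -> R^m (flattened)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, n * n)) / np.sqrt(m)

def f_and_grad(x, A, b):
    """Nonconvex rank-1 objective f(x) = 0.5 ||A(x x^T) - b||^2 and its gradient."""
    n = x.shape[0]
    r = A @ np.outer(x, x).ravel() - b
    f = 0.5 * r @ r
    G = (A.T @ r).reshape(n, n)           # adjoint A^*(r), an n x n matrix
    grad = (G + G.T) @ x                  # d/dx 0.5||A(xx^T) - b||^2 = (G + G^T) x
    return f, grad

def recover(A, b, n, lr=0.01, n_iters=500, seed=1):
    """Plain gradient descent from a random initial point."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(n_iters):
        _, g = f_and_grad(x, A, b)
        x -= lr * g
    return x
```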
