Passed & Spurious: Descent Algorithms and Local Minima in Spiked Matrix-Tensor Models

151 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Stefano Sarao Mannelli

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية فيزياء

والبحث باللغة English

تأليف Stefano Sarao Mannelli - Florent Krzakala - Pierfrancesco Urbani

التعلم الآلي الأنظمة المضطربة والشبكات العصبية نظرية الإحصاء

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this work we analyse quantitatively the interplay between the loss landscape and performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate trivialization of the landscape at large signal-to-noise ratios. We evaluate in a closed form the performance of a gradient flow algorithm using integro-differential PDEs as developed in physics of disordered systems for the Langevin dynamics. We analyze the performance of an approximate message passing algorithm estimating the maximum likelihood configuration via its state evolution. We conclude by comparing the above results: while we observe a drastic slow down of the gradient flow dynamics even in the region where the landscape is trivial, both the analyzed algorithms are shown to perform well even in the part of the region of parameters where spurious local minima are present.

قيم البحث

276 - Stefano Sarao Mannelli , Giulio Biroli , Chiara Cammarota 2019

Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.

التعلم الآلي الأنظمة المضطربة والشبكات العصبية نظرية الإحصاء

Entropic gradient descent algorithms and wide flat minima

403 - Fabrizio Pittorino , Carlo Lucibello , Christoph Feinauer 2020

The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. First, we discuss Gaussian mixt ure classification models and show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions. These estimators can be found by applying maximum flatness algorithms either directly on the classifier (which is norm independent) or on the differentiable loss function used in learning. Next, we extend the analysis to the deep learning scenario by extensive numerical validations. Using two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include in the optimization objective a non-local flatness measure known as local entropy, we consistently improve the generalization error for common architectures (e.g. ResNet, EfficientNet). An easy to compute flatness measure shows a clear correlation with test accuracy.

التعلم الآلي الأنظمة المضطربة والشبكات العصبية التعلم الالي

When Does Non-Orthogonal Tensor Decomposition Have No Spurious Local Minima?

213 - Maziar Sanjabi , Sina Baharlouei , Meisam Razaviyayn 2019

We study the optimization problem for decomposing $d$ dimensional fourth-order Tensors with $k$ non-orthogonal components. We derive textit{deterministic} conditions under which such a problem does not have spurious local minima. In particular, we sh ow that if $kappa = frac{lambda_{max}}{lambda_{min}} < frac{5}{4}$, and incoherence coefficient is of the order $O(frac{1}{sqrt{d}})$, then all the local minima are globally optimal. Using standard techniques, these conditions could be easily transformed into conditions that would hold with high probability in high dimensions when the components are generated randomly. Finally, we prove that the tensor power method with deflation and restarts could efficiently extract all the components within a tolerance level $O(kappa sqrt{ktau^3})$ that seems to be the noise floor of non-orthogonal tensor decomposition.

التعلم الآلي

Wide flat minima and optimal generalization in classifying high-dimensional Gaussian mixtures

182 - Carlo Baldassi , Enrico M. Malatesta , Matteo Negri 2020

We analyze the connection between minimizers with good generalizing properties and high local entropy regions of a threshold-linear classifier in Gaussian mixtures with the mean squared error loss function. We show that there exist configurations tha t achieve the Bayes-optimal generalization error, even in the case of unbalanced clusters. We explore analytically the error-counting loss landscape in the vicinity of a Bayes-optimal solution, and show that the closer we get to such configurations, the higher the local entropy, implying that the Bayes-optimal solution lays inside a wide flat region. We also consider the algorithmically relevant case of targeting wide flat minima of the (differentiable) mean squared error loss. Our analytical and numerical results show not only that in the balanced case the dependence on the norm of the weights is mild, but also, in the unbalanced case, that the performances can be improved.

التعلم الآلي الأنظمة المضطربة والشبكات العصبية نظرية الإحصاء

Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery

191 - Richard Y. Zhang , Somayeh Sojoudi , Javad Lavaei 2019

Nonconvex matrix recovery is known to contain no spurious local minima under a restricted isometry property (RIP) with a sufficiently small RIP constant $delta$. If $delta$ is too large, however, then counterexamples containing spurious local minima are known to exist. In this paper, we introduce a proof technique that is capable of establishing sharp thresholds on $delta$ to guarantee the inexistence of spurious local minima. Using the technique, we prove that in the case of a rank-1 ground truth, an RIP constant of $delta<1/2$ is both necessary and sufficient for exact recovery from any arbitrary initial point (such as a random point). We also prove a local recovery result: given an initial point $x_{0}$ satisfying $f(x_{0})le(1-delta)^{2}f(0)$, any descent algorithm that converges to second-order optimality guarantees exact recovery.

التعلم الآلي التحسين والتحكم التعلم الالي