We study how permutation symmetries in overparameterized multi-layer neural networks generate "symmetry-induced" critical points. Assuming a network with $L$ layers of minimal widths $r_1^*, \ldots, r_{L-1}^*$ reaches a zero-loss minimum at $r_1^*! \cdots r_{L-1}^*!$ isolated points that are permutations of one another, we show that adding one extra neuron to each layer is sufficient to connect all these previously discrete minima into a single manifold. For a two-layer overparameterized network of width $r^* + h =: m$ we explicitly describe the manifold of global minima: it consists of $T(r^*, m)$ affine subspaces of dimension at least $h$ that are connected to one another. For a network of width $m$, we identify the number $G(r, m)$ of affine subspaces containing only symmetry-induced critical points that are related to the critical points of a smaller network of width $r < r^*$. Via a combinatorial analysis, we derive closed-form formulas for $T$ and $G$ and show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold in the mildly overparameterized regime (small $h$) and vice versa in the vastly overparameterized regime ($h \gg r^*$). Our results provide new insights into the minimization of the non-convex loss function of overparameterized neural networks.
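The two symmetries behind these counts can be checked numerically on a toy example. The sketch below is illustrative only and is not taken from the paper: the function names (`two_layer`, `permute`, `split_neuron`), the tanh activation, the width `r = 3`, and the input dimension `d = 2` are all assumptions chosen for the demonstration. It checks that relabeling the hidden neurons of a two-layer network leaves its function unchanged (giving $r!$ equivalent parameter points), and that duplicating one neuron in a network of width $r + 1$ while splitting its output weight as $\mu$ and $1 - \mu$ leaves the function unchanged for every $\mu$, which traces out a line of equivalent parameters.

```python
import itertools
import math
import random


def two_layer(x, W, a):
    """f(x) = sum_i a[i] * tanh(W[i] . x) for a network of width len(a)."""
    return sum(
        a_i * math.tanh(sum(w * xi for w, xi in zip(w_i, x)))
        for w_i, a_i in zip(W, a)
    )


def permute(W, a, perm):
    """Relabel the hidden neurons according to perm."""
    return [W[p] for p in perm], [a[p] for p in perm]


def split_neuron(W, a, j, mu):
    """Embed into width + 1: copy neuron j and split its output weight
    into mu * a[j] (original slot) and (1 - mu) * a[j] (new slot)."""
    W_new = W + [W[j]]
    a_new = a[:j] + [mu * a[j]] + a[j + 1:] + [(1 - mu) * a[j]]
    return W_new, a_new


rng = random.Random(0)
r, d = 3, 2  # hypothetical width and input dimension
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]
a = [rng.gauss(0, 1) for _ in range(r)]
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]
base = [two_layer(x, W, a) for x in xs]

# All r! relabelings of the hidden layer compute the same function.
perms = list(itertools.permutations(range(r)))
perm_ok = all(
    math.isclose(two_layer(x, *permute(W, a, p)), y, abs_tol=1e-12)
    for p in perms
    for x, y in zip(xs, base)
)

# Every value of mu gives the same function in the wider network.
split_ok = all(
    math.isclose(two_layer(x, *split_neuron(W, a, 0, mu)), y, abs_tol=1e-12)
    for mu in (-1.0, 0.0, 0.3, 1.0, 2.5)
    for x, y in zip(xs, base)
)

print(len(perms), perm_ok, split_ok)
```

The check only verifies that the network function, and therefore any loss computed from it, is constant along these transformations; it does not compute $T$ or $G$, whose closed forms are not stated in this abstract.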
Understanding the structure of the loss landscape of deep neural networks (DNNs) is important. In this work, we prove an embedding principle that the loss landscape of a DNN contains all the critical points of all the narrower DNNs. More precise
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparamete
The theoretical analysis of deep neural networks (DNNs) is arguably among the most challenging research directions in machine learning (ML) right now, as it requires scientists to lay novel statistical learning foundations to explain their behavi
When equipped with efficient optimization algorithms, over-parameterized neural networks have demonstrated a high level of performance even though the loss function is non-convex and non-smooth. While many works have been focusing on understanding
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. S