Towards Understanding the Condensation of Two-layer Neural Networks at Initial Training


Abstract in English

Studying the implicit regularization effect of the nonlinear training dynamics of neural networks (NNs) is important for understanding why over-parameterized neural networks often generalize well on real datasets. Empirically, for two-layer NNs, existing works have shown that, with a small initialization, the input weights of hidden neurons condense on isolated orientations (the input weight of a hidden neuron consists of the weights from the input layer to that neuron and its bias term). The condensation dynamics imply that, during training, NNs can learn features from the training data with a network configuration effectively equivalent to a much smaller network. In this work, we show that the multiplicity of the root of the activation function at the origin (referred to as "multiplicity") is a key factor for understanding condensation at the initial stage of training. Our experiments on multilayer networks suggest that the maximal number of condensed orientations is twice the multiplicity of the activation function used. Our theoretical analysis of two-layer networks confirms the experiments in two cases: one is for activation functions of multiplicity one, which include many common activation functions, and the other is for one-dimensional input. This work takes a step towards understanding how small initialization implicitly leads NNs to condensation at the initial training stage, which lays a foundation for the future study of the nonlinear dynamics of NNs and their implicit regularization effect at later stages of training.
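
To make the notion of multiplicity concrete, the following minimal sketch reads it off from the Taylor coefficients of an activation function at the origin. It assumes a derivative-based formalization that the abstract does not spell out: the multiplicity is taken to be the smallest order p >= 1 whose derivative at 0 is nonzero. The helper name `multiplicity` and the hand-entered coefficient lists are illustrative only, and the value 2p printed at the end is the bound suggested by the experiments described above, not something this code derives.

```python
def multiplicity(taylor_coeffs, tol=1e-12):
    """Return the smallest order p >= 1 with a nonzero Taylor coefficient at 0.

    taylor_coeffs[k] is the coefficient of x**k. The constant term (k = 0)
    is ignored under the assumed derivative-based definition.
    """
    for p in range(1, len(taylor_coeffs)):
        if abs(taylor_coeffs[p]) > tol:
            return p
    raise ValueError("no nonzero coefficient of order >= 1 among the supplied terms")


# Leading Taylor coefficients at 0, entered by hand (assumed inputs).
activations = {
    "tanh(x)": [0.0, 1.0, 0.0, -1.0 / 3.0],         # x - x^3/3 + ...
    "x*tanh(x)": [0.0, 0.0, 1.0, 0.0, -1.0 / 3.0],  # x^2 - x^4/3 + ...
    "sigmoid(x)": [0.5, 0.25, 0.0, -1.0 / 48.0],    # 1/2 + x/4 - x^3/48 + ...
}

for name, coeffs in activations.items():
    p = multiplicity(coeffs)
    # 2 * p is the maximal number of condensed orientations suggested by the experiments.
    print(f"{name}: multiplicity {p}, at most {2 * p} condensed orientations")
```

Under this assumed definition, tanh and sigmoid both have multiplicity one, which is consistent with the statement that the multiplicity-one case covers many common activation functions.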
