Paper Title
Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width
Paper Authors
Paper Abstract
Substantial work indicates that the dynamics of neural networks (NNs) are closely related to the initialization of their parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we take a step toward drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradient flow for three-layer ReLU NNs and obtain two key independent quantities that distinguish different dynamical regimes for common initialization methods. With carefully designed experiments at a large computational cost, on both synthetic and real datasets, we find that the dynamics of each layer can likewise be divided into a linear regime and a condensed regime, separated by a critical regime. The criterion is the relative change of the input weights (the input weight of a hidden neuron consists of the weights from the input layer to that neuron together with its bias term) during training as the width approaches infinity, which tends to $0$, $+\infty$, and $O(1)$ in the three regimes, respectively. In addition, we demonstrate that different layers can lie in different dynamical regimes during a single training process within a deep NN. In the condensed regime, we further observe the condensation of weights in isolated orientations with low complexity. Through experiments in the three-layer setting, our phase diagram suggests a complicated set of dynamical regimes for deep NNs, consisting of the three possible regimes together with their mixtures, and provides guidance for studying deep NNs in different initialization regimes, revealing the possibility of completely different dynamics emerging within a deep NN across its different layers.
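For concreteness, here is a minimal sketch of the relative-change criterion, written in the style of the two-layer phase diagram; the notation below is our assumption and is not quoted verbatim from the paper. For layer $l$ with input weights $\boldsymbol{\theta}_l$ (incoming weights concatenated with the bias terms), the relative change at training time $t$ is

$$\mathrm{RD}\big(\boldsymbol{\theta}_l(t)\big) = \frac{\lVert \boldsymbol{\theta}_l(t) - \boldsymbol{\theta}_l(0) \rVert_2}{\lVert \boldsymbol{\theta}_l(0) \rVert_2},$$

and, as the width $m \to \infty$, $\mathrm{RD} \to 0$ identifies the linear regime, $\mathrm{RD} \to +\infty$ the condensed regime, and $\mathrm{RD} = O(1)$ the critical regime separating them.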
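A hedged sketch of how such a per-layer diagnostic could be measured in practice (PyTorch; the network widths, the synthetic 1-D data, and all hyperparameters are illustrative assumptions, not the authors' setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
width = 1024  # rerun with several widths to see the trend of each RD

# Three-layer ReLU network (two hidden layers), widths are illustrative.
model = nn.Sequential(
    nn.Linear(1, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

def input_weights(layer):
    # "Input weight" of a layer: incoming weights concatenated with biases,
    # matching the abstract's definition for a hidden neuron.
    return torch.cat([layer.weight.detach().flatten(), layer.bias.detach()])

linears = [m for m in model if isinstance(m, nn.Linear)]
theta0 = [input_weights(l) for l in linears]  # snapshot at initialization

x = torch.linspace(-1, 1, 64).unsqueeze(1)  # synthetic 1-D regression data
y = torch.sin(3 * x)

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

# Relative change per layer; comparing its trend as `width` grows is what
# separates the linear (RD -> 0), critical (RD = O(1)), and condensed
# (RD -> +inf) regimes described in the abstract.
for i, l in enumerate(linears, 1):
    rd = (input_weights(l) - theta0[i - 1]).norm() / theta0[i - 1].norm()
    print(f"layer {i}: relative change = {rd:.3e}")
```

Because each layer gets its own RD value, a run like this can also show the abstract's point that different layers of the same network may sit in different regimes.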