Paper Title

A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks

Authors

Xin Liu, Wei Tao, Zhisong Pan

Abstract

Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and acceleration, since the optimization landscape of a neural network is non-convex. Recently, some works have made progress toward understanding the convergence of momentum methods in the over-parameterized regime, where the number of parameters exceeds the number of training instances. Nonetheless, current results mainly focus on two-layer neural networks and are far from explaining the remarkable success of momentum methods in training deep neural networks. Motivated by this, we investigate the convergence of NAG with constant learning rate and momentum parameter in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets. Based on the over-parameterized regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under random Gaussian initialization. Our results show that NAG can converge to the global minimum at a $(1 - \mathcal{O}(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and $\kappa > 1$ is a constant depending on the condition number of the feature matrix. Compared with the $(1 - \mathcal{O}(1/\kappa))^t$ rate of gradient descent~(GD), NAG achieves an acceleration over GD. To the best of our knowledge, this is the first theoretical guarantee for the convergence of NAG to the global minimum in training deep neural networks. Furthermore, we extend our analysis to deep linear ResNets and derive a similar convergence result.
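
The setting analyzed in the abstract is NAG with a constant learning rate and a constant momentum parameter applied to a deep fully-connected linear network trained with squared loss from random Gaussian initialization. As a rough illustration of that setting only (not the paper's exact construction, scaling, or proof technique), the sketch below runs one standard two-sequence formulation of NAG on such a network; the depth, width, learning rate, and momentum values are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's exact setup): NAG with a
# constant learning rate and momentum parameter on a deep fully-connected
# linear network f(x) = W_L ... W_1 x, trained with squared loss.
rng = np.random.default_rng(0)
d, m, L, n = 5, 64, 3, 20            # input dim, hidden width, depth, #samples (illustrative)
X = rng.standard_normal((d, n))
Y = rng.standard_normal((1, n))

# Random Gaussian initialization, scaled so the layer products stay well-behaved.
dims = [d] + [m] * (L - 1) + [1]
W = [rng.standard_normal((dims[l + 1], dims[l])) / np.sqrt(dims[l]) for l in range(L)]

def forward(Ws, X):
    """Return all intermediate activations of the linear network (W_L ... W_1) X."""
    acts = [X]
    for Wl in Ws:
        acts.append(Wl @ acts[-1])
    return acts

def gradients(Ws, X, Y):
    """Gradients of 0.5 * ||W_L ... W_1 X - Y||_F^2 with respect to each layer."""
    acts = forward(Ws, X)
    delta = acts[-1] - Y                        # residual at the output
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads[l] = delta @ acts[l].T            # gradient for layer l
        delta = Ws[l].T @ delta                 # backpropagate the residual
    return grads, 0.5 * np.linalg.norm(acts[-1] - Y) ** 2

eta, beta, T = 1e-3, 0.9, 2000                  # constant step size and momentum (illustrative)
U = [Wl.copy() for Wl in W]                     # lookahead (extrapolated) iterates

for t in range(T):
    grads, loss = gradients(U, X, Y)            # gradient and loss at the lookahead point
    W_new = [Ul - eta * gl for Ul, gl in zip(U, grads)]
    U = [Wn + beta * (Wn - Wo) for Wn, Wo in zip(W_new, W)]
    W = W_new
    if t % 500 == 0:
        print(f"iter {t:5d}  loss {loss:.6f}")
```

The loop uses the common two-sequence form of NAG: the gradient step is taken at the extrapolated point, and the next extrapolated point is built from the last two iterates with the constant momentum parameter.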
