Paper Title
A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta
Paper Authors
Paper Abstract
Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta, and batch sizes. Our key idea is to consider the dynamics of the second moments of model parameters for a special family of "Spectrally Expressible" approximations. This allows us to obtain an explicit expression for the generating function of the sequence of loss values. By analyzing this generating function, we find, in particular, that 1) the SGD dynamics exhibits several convergent and divergent regimes depending on the spectral distributions of the problem; 2) the convergent regimes admit explicit stability conditions, and explicit loss asymptotics in the case of power-law spectral distributions; 3) the optimal convergence rate can be achieved at negative momenta. We verify our theoretical predictions by extensive experiments with MNIST, CIFAR10, and synthetic problems, and find good quantitative agreement.
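To make the setting concrete, the algorithm the abstract studies is heavy-ball mini-batch SGD on a linear model with constant learning rate, momentum, and batch size. The following is a minimal illustrative sketch of that algorithm class only, not of the paper's generating-function framework; the problem sizes, hyperparameters, and function names are assumptions chosen for demonstration, and the (negative) momentum value is merely an example of the regime the paper discusses.

```python
import numpy as np

# Illustrative sketch (not the paper's analysis): heavy-ball mini-batch SGD
# on a noiseless linear least-squares problem, with constant learning rate,
# constant batch size, and a constant momentum that may be negative.
# All names and hyperparameter values below are assumptions for demonstration.

rng = np.random.default_rng(0)
n, d = 512, 16
X = rng.standard_normal((n, d))       # design matrix
w_true = rng.standard_normal(d)       # ground-truth weights
y = X @ w_true                        # noiseless targets

def sgd_momentum(lr=0.05, momentum=-0.1, batch=32, steps=500):
    """Heavy-ball update: v <- momentum*v - lr*grad; w <- w + v."""
    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)          # sample a mini-batch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        v = momentum * v - lr * grad                  # momentum can be < 0
        w = w + v
    return 0.5 * np.mean((X @ w - y) ** 2)            # final full-batch loss

final_loss = sgd_momentum()
```

For small enough learning rates this iteration contracts toward the interpolating solution; the paper's contribution is to characterize exactly when such contraction holds, and at what rate, as a function of the spectral distribution.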