Paper Title

Adaptive Gradient Methods at the Edge of Stability

Paper Authors

Cohen, Jeremy M., Ghorbani, Behrooz, Krishnan, Shankar, Agarwal, Naman, Medapati, Sourabh, Badura, Michal, Suo, Daniel, Cardoze, David, Nado, Zachary, Dahl, George E., Gilmer, Justin

Paper Abstract

Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the ``Adaptive Edge of Stability'' (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
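
For readers who want to probe the quantity discussed in the abstract in their own training runs, the sketch below (an illustration, not the authors' exact measurement protocol) estimates the largest eigenvalue of Adam's preconditioned Hessian by power iteration on Hessian-vector products in PyTorch. The names `loss_fn`, `params`, and `adam_v` are hypothetical stand-ins for a full-batch loss closure, the model parameters, and Adam's (flattened) second-moment estimate; whether bias correction or momentum enters the preconditioner is an assumption to be matched against the paper's definitions.

```python
# A minimal sketch (assumed interfaces, not the paper's code) of estimating the
# top eigenvalue of the preconditioned Hessian P^{-1} H for Adam, where
# P = diag(sqrt(v) + eps) is the usual Adam preconditioner.
import torch


def max_preconditioned_hessian_eigenvalue(loss_fn, params, adam_v,
                                          eps=1e-8, iters=50):
    """Power iteration on the symmetric form P^{-1/2} H P^{-1/2}.

    This form has the same eigenvalues as P^{-1} H, so its top eigenvalue can
    be compared against a stability threshold such as 38/eta.
    """
    # A differentiable gradient (create_graph=True) enables Hessian-vector products.
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    p_inv_sqrt = (adam_v.sqrt() + eps).rsqrt()  # diagonal of P^{-1/2}

    u = torch.randn_like(flat_grad)             # random starting direction
    u = u / u.norm()
    lam = 0.0
    for _ in range(iters):
        # H (P^{-1/2} u) via a second backward pass through flat_grad.
        hvps = torch.autograd.grad(flat_grad @ (p_inv_sqrt * u),
                                   params, retain_graph=True)
        hu = p_inv_sqrt * torch.cat([h.reshape(-1) for h in hvps])
        lam = torch.dot(u, hu).item()           # Rayleigh quotient estimate
        u = hu / hu.norm()
    return lam
```

Tracking this estimate over training and comparing it to $38/\eta$ is one way to check whether a run has entered the AEoS regime described above.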
