Paper title
CADA: Communication-Adaptive Distributed Adam
Paper authors
Paper abstract
Stochastic gradient descent (SGD) has become the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method, hence the name CADA. The key components of CADA are a set of new rules, tailored for adaptive stochastic gradients, that can be implemented to save communication uploads. The new algorithms adaptively reuse stale Adam gradients, thus saving communication, while retaining convergence rates comparable to those of the original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.
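The abstract does not spell out CADA's rules, so the following is only a minimal sketch of the general idea of reusing stale gradients to skip uploads, not the paper's algorithm. Everything in it is an illustrative assumption: the scalar toy objective, the skipping condition (upload only when the fresh gradient differs from the last uploaded one by more than a multiple of the recent squared parameter movement), and the names `run`, `adam_step`, `c`, and `window`.

```python
import random


def adam_step(theta, grad, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam update to a scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return theta - lr * m_hat / (v_hat ** 0.5 + eps)


def run(rounds=400, c=4.0, window=5, seed=0):
    """Simulate a server and three workers; return (final theta, upload count)."""
    rng = random.Random(seed)
    # Hypothetical toy problem: worker m minimizes 0.5 * (theta - centers[m])**2,
    # so the global minimizer is the mean of the centers (2.0 here).
    centers = [1.0, 2.0, 3.0]
    theta = 0.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    last_sent = [None] * len(centers)  # server's (possibly stale) copy per worker
    recent_sq_steps = []               # squared sizes of recent parameter updates
    uploads = 0

    for _ in range(rounds):
        # Assumed skipping rule: tolerate staleness in proportion to how much
        # the parameter has moved recently.
        if recent_sq_steps:
            threshold = c * sum(recent_sq_steps) / len(recent_sq_steps)
        else:
            threshold = 0.0
        for m, center in enumerate(centers):
            # Stochastic gradient with small bounded noise.
            fresh = (theta - center) + rng.uniform(-0.01, 0.01)
            if last_sent[m] is None or (fresh - last_sent[m]) ** 2 > threshold:
                last_sent[m] = fresh  # upload: server copy is refreshed
                uploads += 1
            # Otherwise the worker stays silent and the server reuses its copy.

        aggregated = sum(last_sent) / len(last_sent)
        new_theta = adam_step(theta, aggregated, state)
        recent_sq_steps = (recent_sq_steps + [(new_theta - theta) ** 2])[-window:]
        theta = new_theta

    return theta, uploads


if __name__ == "__main__":
    theta, uploads = run()
    print(f"theta = {theta:.3f}, uploads = {uploads} of {3 * 400} possible")
```

In this sketch the server always updates with whatever gradients it currently holds, so a skipped upload costs no communication but makes the update slightly stale. The constant `c` trades communication for staleness: larger values skip more uploads.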