Paper Title
Parallel and distributed asynchronous adaptive stochastic gradient methods
Paper Authors
Paper Abstract
Stochastic gradient methods (SGMs) are the predominant approaches to train deep learning models. The adaptive versions (e.g., Adam and AMSGrad) have been extensively used in practice, partly because they achieve faster convergence than the non-adaptive versions while incurring little overhead. On the other hand, asynchronous (async) parallel computing has exhibited significantly higher speed-up than its synchronous (sync) counterpart. Async-parallel non-adaptive SGMs have been well studied in the literature, from the perspectives of both theory and practical performance. Adaptive SGMs can also be implemented in an async-parallel way without much difficulty. However, to the best of our knowledge, no theoretical result on async-parallel adaptive SGMs has been established. The difficulty in analyzing adaptive SGMs with async updates originates from the second moment term. In this paper, we propose an async-parallel adaptive SGM based on AMSGrad. We show that the proposed method inherits the convergence guarantee of AMSGrad for both convex and non-convex problems, provided that the staleness (also called delay) caused by asynchrony is bounded. Our convergence rate results indicate a nearly linear parallelization speed-up if $\tau = o(K^{\frac{1}{4}})$, where $\tau$ is the staleness and $K$ is the number of iterations. The proposed method is tested on both convex and non-convex machine learning problems, and the numerical results demonstrate its clear advantages over the sync counterpart and the async-parallel non-adaptive SGM.
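The abstract describes the method only at a high level. As a rough, non-authoritative illustration of the kind of update being analyzed, below is a minimal NumPy sketch of an AMSGrad step applied to a stale (delayed) gradient; the function name amsgrad_step, the hyperparameter values, and the toy least-squares loop simulating a fixed delay are illustrative assumptions, not the paper's actual algorithm or notation.

```python
# Minimal sketch (not the authors' code): one AMSGrad update using a gradient
# evaluated at an older iterate, mimicking bounded staleness in async training.
import numpy as np

def amsgrad_step(x, m, v, v_hat, stale_grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update with a possibly delayed gradient `stale_grad`."""
    m = beta1 * m + (1 - beta1) * stale_grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * stale_grad ** 2     # second moment estimate
    v_hat = np.maximum(v_hat, v)                      # AMSGrad's non-decreasing max
    x = x - lr * m / (np.sqrt(v_hat) + eps)           # adaptive step
    return x, m, v, v_hat

# Toy usage (assumed example): least-squares loss, gradients read from an
# iterate that is `delay` steps old, i.e., staleness tau = delay.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
x = np.zeros(5)
m, v, v_hat = np.zeros(5), np.zeros(5), np.zeros(5)
delay, history = 2, []
for k in range(200):
    history.append(x.copy())
    x_stale = history[max(0, k - delay)]              # read a stale iterate
    g = A.T @ (A @ x_stale - b) / len(b)              # gradient at the stale point
    x, m, v, v_hat = amsgrad_step(x, m, v, v_hat, g)
```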