Paper Title


A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

Paper Authors

Mingrui Liu, Zhenxun Zhuang, Yunwei Lei, Chunyang Liao

Paper Abstract


In distributed training of deep neural networks, one usually runs Stochastic Gradient Descent (SGD) or its variants on each machine and communicates with other machines periodically. However, SGD might converge slowly when training some deep neural networks (e.g., RNNs, LSTMs) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with a nonconvex loss function, a non-Lipschitz continuous gradient, and skipped communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption on the loss landscape, which LSTMs were shown to satisfy in previous work, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with the other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity and $O\left(\frac{1}{\epsilon^3}\right)$ communication complexity for finding an $\epsilon$-stationary point in the homogeneous data setting, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup and reduced communication rounds. Our proof relies on novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and in various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice, validating our theory.
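The training scheme the abstract describes, each machine running clipped gradient updates locally and synchronizing only every few steps, can be illustrated with a minimal sketch. This is not the authors' reference implementation; the toy least-squares objective and every hyperparameter below (number of workers N, communication interval I, clipping threshold gamma, step size eta) are illustrative assumptions.

```python
# Sketch: N workers run clipped SGD locally and average their models every I steps.
import numpy as np

rng = np.random.default_rng(0)
N, I, T, eta, gamma, d = 4, 5, 200, 0.1, 1.0, 10   # workers, local steps, total steps, lr, clip norm, dim
A = rng.normal(size=(N, 50, d))                     # each worker's (homogeneous) data shard
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=(N, 50))

x = [np.zeros(d) for _ in range(N)]                 # per-worker model copies

def stoch_grad(i, xi):
    """Minibatch gradient of the least-squares loss on worker i."""
    idx = rng.choice(A.shape[1], size=8, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ xi - bi) / len(idx)

for t in range(1, T + 1):
    for i in range(N):
        g = stoch_grad(i, x[i])
        norm = np.linalg.norm(g)
        if norm > gamma:                            # gradient clipping: rescale to norm gamma
            g *= gamma / norm
        x[i] = x[i] - eta * g                       # local gradient-based update
    if t % I == 0:                                  # communicate only every I steps
        avg = sum(x) / N                            # average the models across workers
        x = [avg.copy() for _ in range(N)]

print("final objective:", 0.5 * np.mean([(A[i] @ x[i] - b[i]) ** 2 for i in range(N)]))
```

In a real distributed job the averaging step would be an all-reduce over the workers rather than the in-process loop used here; the point of the sketch is only the interplay of per-machine clipping with skipped communication rounds.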
