Paper Title


ByteComp: Revisiting Gradient Compression in Distributed Training

Authors

Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng

Abstract


Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose ByteComp to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models to timeline tensor computation, communication, and compression to enable ByteComp to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and optimally offloads compression to CPUs. Experimental evaluations show that ByteComp can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the computational time needed to select the compression strategy is measured in milliseconds, and the selected strategy is only a few percent from optimal.
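To make the strategy space described above concrete, the following Python sketch enumerates a toy per-tensor decision space in the spirit of the decision tree abstraction. It is an illustration only, not ByteComp's actual design or API: the specific choices (whether to compress a tensor, encoding on GPU vs. CPU, which collective to use) and all names are assumptions made for this example.

```python
# Illustrative sketch: a toy enumeration of per-tensor compression strategies.
# Option names (compress or not, gpu/cpu encoding, allreduce/allgather) are
# hypothetical placeholders, not the paper's actual abstraction or API.
from dataclasses import dataclass
from itertools import product
from typing import List

@dataclass(frozen=True)
class TensorStrategy:
    tensor_name: str
    compress: bool      # apply gradient compression to this tensor?
    encode_on: str      # run the encoder on "gpu" or "cpu"
    collective: str     # "allreduce" or "allgather" for the (compressed) payload

def enumerate_strategies(tensor_names: List[str]) -> List[List[TensorStrategy]]:
    """Enumerate every combination of per-tensor choices (the full, unpruned
    strategy space). A real decision algorithm would eliminate and prioritize
    candidates instead of exhaustively listing them."""
    per_tensor_options = []
    for name in tensor_names:
        options = [TensorStrategy(name, False, "gpu", "allreduce")]  # no compression
        for encode_on, collective in product(("gpu", "cpu"), ("allreduce", "allgather")):
            options.append(TensorStrategy(name, True, encode_on, collective))
        per_tensor_options.append(options)
    # The Cartesian product over tensors yields one global strategy per combination.
    return [list(combo) for combo in product(*per_tensor_options)]

if __name__ == "__main__":
    strategies = enumerate_strategies(["layer1.weight", "layer2.weight"])
    print(f"{len(strategies)} candidate strategies for 2 tensors")
```

Even with only five choices per tensor, the space grows exponentially with the number of tensors, which is why the abstract emphasizes a fast decision algorithm that prunes and prioritizes strategies rather than searching them exhaustively.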
