Paper Title


Adaptive Compression for Communication-Efficient Distributed Training

Paper Authors

Maksim Makarenko, Elnur Gasanov, Rustem Islamov, Abdurakhmon Sadiev, Peter Richtarik

Paper Abstract


We propose Adaptive Compressed Gradient Descent (AdaCGD) - a novel optimization algorithm for communication-efficient training of supervised machine learning models with adaptive compression level. Our approach is inspired by the recently proposed three point compressor (3PC) framework of Richtarik et al. (2022), which includes error feedback (EF21), lazily aggregated gradient (LAG), and their combination as special cases, and offers the current state-of-the-art rates for these methods under weak assumptions. While the above mechanisms offer a fixed compression level, or adapt between two extremes only, our proposal is to perform a much finer adaptation. In particular, we allow the user to choose any number of arbitrarily chosen contractive compression mechanisms, such as Top-K sparsification with a user-defined selection of sparsification levels K, or quantization with a user-defined selection of quantization levels, or their combination. AdaCGD chooses the appropriate compressor and compression level adaptively during the optimization process. Besides i) proposing a theoretically-grounded multi-adaptive communication compression mechanism, we further ii) extend the 3PC framework to bidirectional compression, i.e., we allow the server to compress as well, and iii) provide sharp convergence bounds in the strongly convex, convex and nonconvex settings. The convex regime results are new even for several key special cases of our general mechanism, including 3PC and EF21. In all regimes, our rates are superior compared to all existing adaptive compression methods.
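To make the adaptive-compression idea concrete, here is a minimal Python sketch of Top-K sparsification (one of the contractive compressors the abstract names) combined with a simple rule for choosing among several user-defined levels K. The function names `top_k` and `adaptive_step`, the `levels` pool, the `tol` threshold, and the selection criterion are all illustrative assumptions for this sketch, not the actual AdaCGD rule from the paper; only the EF21-style shifted update h_new = h_old + C(g_new - h_old) follows the error-feedback mechanism the abstract references.

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Top-K sparsification: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# User-chosen pool of sparsification levels, ordered by communication cost.
d = 1000
levels = [10, 50, 200, d]  # K = d sends the vector uncompressed

def adaptive_step(g_new: np.ndarray, h_old: np.ndarray, tol: float = 0.5):
    """Hypothetical selection rule (illustration only): pick the cheapest level K
    whose compressed correction approximates the shift g_new - h_old within a
    relative tolerance, then apply the EF21-style update h_new = h_old + C(g - h)."""
    diff = g_new - h_old
    for k in levels:
        c = top_k(diff, k)
        if np.linalg.norm(diff - c) <= tol * np.linalg.norm(diff):
            return h_old + c, k
    return g_new, d  # fall back to exact communication

rng = np.random.default_rng(0)
h = np.zeros(d)                      # compressed gradient estimate kept by the server
g = rng.standard_normal(d)           # fresh local gradient at the current iterate
h, k_used = adaptive_step(g, h)
print(f"selected K = {k_used}")
```

Top-K satisfies ||C(x) - x||^2 <= (1 - K/d) ||x||^2, i.e., it is a contractive compressor, which is the property this family of error-feedback methods relies on; larger K gives a better approximation at a higher communication cost, and an adaptive rule trades the two off per iteration.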
