Paper title

Shifted Compression Framework: Generalizations and Improvements

Authors

Egor Shulgin, Peter Richtárik

Abstract

Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the squared norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no a priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed SGD, which has adverse effects on the convergence speed. To circumvent this problem, several more elaborate and seemingly very different algorithms have been proposed recently. These methods are based on the idea of compressing the difference between the vector we would normally wish to compress and some auxiliary vector which changes throughout the iterative process. In this work we take a step back, and develop a unified framework for studying such methods, both conceptually and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. Finally, we perform several numerical experiments which illustrate and support our theoretical findings.
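
The core idea in the abstract, compressing the difference between a vector and an auxiliary shift rather than the vector itself, can be illustrated with a minimal sketch. It is not taken from the paper: the rand-k sparsifier is one standard example of an unbiased compressor whose variance scales with the squared norm of its input, and the names `rand_k`, `shifted_rand_k`, `mean_sq_error`, and the fixed shift `h` are hypothetical choices made for illustration only.

```python
import random


def rand_k(x, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, scale them by d/k.

    Its mean squared error equals (d/k - 1) * ||x||^2, so it shrinks only
    when the input itself is small.
    """
    d = len(x)
    kept = set(rng.sample(range(d), k))
    return [(d / k) * v if i in kept else 0.0 for i, v in enumerate(x)]


def shifted_rand_k(x, h, k, rng):
    """Compress the difference x - h, then add the shift h back.

    The result is still unbiased for x, but the error now scales with
    ||x - h||^2 instead of ||x||^2.
    """
    diff = [xi - hi for xi, hi in zip(x, h)]
    compressed = rand_k(diff, k, rng)
    return [hi + ci for hi, ci in zip(h, compressed)]


def mean_sq_error(compress, x, trials, rng):
    """Monte Carlo estimate of E||compress() - x||^2."""
    total = 0.0
    for _ in range(trials):
        y = compress(rng)
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / trials


rng = random.Random(0)
d, k = 20, 5
x = [1.0 + 0.01 * i for i in range(d)]  # a vector that is far from zero
h = [1.0] * d                           # hypothetical shift chosen close to x

err_plain = mean_sq_error(lambda r: rand_k(x, k, r), x, 2000, rng)
err_shift = mean_sq_error(lambda r: shifted_rand_k(x, h, k, r), x, 2000, rng)
print(f"plain rand-k error:   {err_plain:.3f}")
print(f"shifted rand-k error: {err_shift:.3f}")
```

In this toy setting the shift is fixed and deliberately placed near `x`; the methods the abstract refers to instead update the auxiliary vector throughout the iterations, which this sketch does not attempt to reproduce.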
