Paper Title
Adaptive Worker Grouping for Communication-Efficient and Straggler-Tolerant Distributed SGD
Paper Authors
Paper Abstract
Wall-clock convergence time and communication load are key performance metrics for distributed implementations of stochastic gradient descent (SGD) in parameter-server settings. Communication-adaptive distributed Adam (CADA) was recently proposed as a way to reduce the communication load through the adaptive selection of workers. In the presence of stragglers, however, CADA suffers from degraded wall-clock convergence time. This paper proposes a novel scheme, grouping-based CADA (G-CADA), that retains the advantage of CADA in reducing the communication load while increasing robustness to stragglers, at the cost of additional storage at the workers. G-CADA partitions the workers into groups, with the workers in each group assigned the same data shards. Groups are scheduled adaptively at each iteration, and the server waits only for the fastest worker in each selected group. We provide analysis and experimental results that demonstrate the significant gains of G-CADA over benchmark schemes in wall-clock time, communication load, and computation load.
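
The straggler-tolerance idea in the abstract (the server waits only for the fastest worker in each group) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: per-worker latencies are assumed to be i.i.d. exponential, every group is selected at every iteration (the adaptive group scheduling is omitted), and the effect of shard size on computation time is ignored. The function names `iteration_time_wait_all`, `iteration_time_grouped`, and `simulate` are hypothetical.

```python
import random


def iteration_time_wait_all(latencies):
    """Server waits for every worker, so the slowest worker sets the time."""
    return max(latencies)


def iteration_time_grouped(latencies, group_size):
    """Server waits only for the fastest worker in each group.

    Consecutive workers form a group (an assumption made for illustration);
    all workers in a group are taken to hold the same data shards.
    """
    groups = [
        latencies[i:i + group_size]
        for i in range(0, len(latencies), group_size)
    ]
    # Fastest worker per group, then the slowest of those across groups.
    return max(min(group) for group in groups)


def simulate(num_workers=12, group_size=3, iterations=1000, seed=0):
    """Average per-iteration waiting time without and with grouping."""
    rng = random.Random(seed)
    total_all = 0.0
    total_grouped = 0.0
    for _ in range(iterations):
        # Assumed latency model: i.i.d. exponential with unit mean.
        latencies = [rng.expovariate(1.0) for _ in range(num_workers)]
        total_all += iteration_time_wait_all(latencies)
        total_grouped += iteration_time_grouped(latencies, group_size)
    return total_all / iterations, total_grouped / iterations


if __name__ == "__main__":
    avg_all, avg_grouped = simulate()
    print(f"wait for all workers:      {avg_all:.3f}")
    print(f"fastest worker per group:  {avg_grouped:.3f}")
```

For the same latency draws, the grouped waiting time can never exceed the wait-for-all time, because the maximum of the per-group minima is bounded by the overall maximum. The sketch does not model the additional storage that replication requires at the workers, which is the cost the abstract attributes to G-CADA.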