Paper Title

Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

Paper Authors

Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin

Paper Abstract

This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, due mainly to communication inefficiency that they incur. DisCo generates optimized, joint computation operator and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to effectively estimate per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm is driven by the simulator, navigating efficiently in the large strategy space to identify good operator/tensor fusion strategies that minimize distributed training time. We compare DisCo with existing DL fusion schemes and show that it achieves good training speed-up close to the ideal, full computation-communication overlap case.
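
The abstract describes a simulator-driven backtracking search over a joint operator/tensor fusion space. The sketch below is a minimal, hypothetical illustration of that general idea, not DisCo's actual implementation: `Candidate`, `backtracking_fusion_search`, and `simulate_iteration_time` are illustrative names I am assuming for this example, and the paper's GNN-based per-iteration time estimator would be supplied as the simulator callback.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]  # illustrative fusion candidate, e.g. a pair of op or gradient-tensor names


def backtracking_fusion_search(
    candidates: List[Candidate],
    simulate_iteration_time: Callable[[Dict[Candidate, bool]], float],
    budget: int = 4096,                      # cap on simulator invocations
) -> Dict[Candidate, bool]:
    """Depth-first search over fuse / don't-fuse decisions, keeping the plan
    with the lowest simulated per-iteration training time."""
    baseline: Dict[Candidate, bool] = {c: False for c in candidates}
    best_plan, best_time = dict(baseline), simulate_iteration_time(baseline)
    calls = 1

    def recurse(idx: int, partial: Dict[Candidate, bool]) -> None:
        nonlocal best_plan, best_time, calls
        if idx == len(candidates) or calls >= budget:
            return
        for decision in (True, False):       # try fusing this candidate first
            partial[candidates[idx]] = decision
            # Score the partial plan with all undecided candidates left unfused.
            plan = {**baseline, **partial}
            est = simulate_iteration_time(plan)
            calls += 1
            if est < best_time:
                best_plan, best_time = plan, est
            # Heuristic prune: abandon branches whose estimate is already much
            # worse than the best plan found so far (may miss the true optimum).
            if est <= best_time * 1.05:
                recurse(idx + 1, partial)
            del partial[candidates[idx]]

    recurse(0, {})
    return best_plan


# Toy usage with a stand-in simulator that simply penalizes unfused candidates.
cands = [("conv1", "relu1"), ("grad_a", "grad_b")]
plan = backtracking_fusion_search(cands, lambda p: sum(1.0 for v in p.values() if not v))
```

This only shows how a cost simulator can drive backtracking over fusion decisions; DisCo's actual candidate enumeration, pruning rules, and cost model are described in the paper.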
