Paper Title
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
Paper Authors
Paper Abstract
We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training workloads. TopoOpt co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. We demonstrate the mutability of AllReduce traffic, and leverage this property to construct efficient network topologies for DNN training jobs. TopoOpt then uses an alternating optimization technique and a group theory-inspired algorithm called TotientPerms to find the best network topology and routing plan, together with a parallelization strategy. We build a fully functional 12-node direct-connect prototype with remote direct memory access (RDMA) forwarding at 100 Gbps. Large-scale simulations on real distributed training models show that, compared to similar-cost Fat-Tree interconnects, TopoOpt reduces DNN training time by up to 3.4x.
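
The TotientPerms construction named in the abstract admits a compact illustration. Below is a minimal sketch, assuming its core idea is rotation of the n nodes by strides coprime to n: each such stride produces a single Hamiltonian cycle over all n nodes, usable as one ring permutation for AllReduce and one degree of the direct-connect topology. The function name and interface here are illustrative, not the paper's implementation.

# Sketch of the coprime-stride idea behind TotientPerms (illustrative).
# A rotation by stride k with gcd(k, n) == 1 is a single n-cycle, so each
# such stride yields one Hamiltonian ring over the n nodes.
from math import gcd

def totient_perms(n, degree):
    """Return up to `degree` ring permutations of n nodes, one per
    stride k coprime to n (k = 1 is the canonical ring)."""
    rings = []
    for k in range(1, n):
        if gcd(k, n) != 1:
            continue  # stride shares a factor with n: splits into sub-cycles
        rings.append([(i * k) % n for i in range(n)])  # node visit order
        if len(rings) == degree:
            break
    return rings

# 12 nodes, as in the prototype; consecutive entries in each ring
# define the directed links of one candidate topology degree.
for ring in totient_perms(12, 4):
    print(ring)

For n = 12, the strides coprime to 12 are 1, 5, 7, and 11, so this yields four ring permutations; Euler's totient function φ(n) counts exactly these strides, which presumably motivates the algorithm's name.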