Paper Title

Breadth-First Pipeline Parallelism

Paper Authors

Lamy-Poirier, Joel

Paper Abstract

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
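The abstract contrasts the breadth-first schedule with conventional depth-first pipeline schedules. The toy Python sketch below only illustrates the ordering idea: with a looped pipeline placement where one GPU holds several stage chunks, a depth-first order pushes each micro-batch through all local chunks before starting the next, while a breadth-first order runs every micro-batch through one chunk before moving on. The function names and the simplified ordering are assumptions for illustration, not the paper's implementation, which also interleaves forward/backward passes and overlaps sharded data-parallel communication.

# Minimal illustrative sketch (assumed names; not the paper's code).
from itertools import product

def depth_first_order(num_chunks: int, num_microbatches: int):
    # Each micro-batch passes through all local stage chunks before the next micro-batch starts.
    return [(mb, chunk) for mb, chunk in product(range(num_microbatches), range(num_chunks))]

def breadth_first_order(num_chunks: int, num_microbatches: int):
    # All micro-batches pass through one stage chunk before the next chunk is touched,
    # which shortens the window in which each chunk's (sharded) weights must be resident.
    return [(mb, chunk) for chunk, mb in product(range(num_chunks), range(num_microbatches))]

if __name__ == "__main__":
    print("depth-first:  ", depth_first_order(num_chunks=2, num_microbatches=3))
    print("breadth-first:", breadth_first_order(num_chunks=2, num_microbatches=3))

Running the sketch with 2 local chunks and 3 micro-batches shows the two orders: depth-first visits (0,0),(0,1),(1,0),... while breadth-first visits (0,0),(1,0),(2,0),(0,1),..., which is the property the schedule exploits to keep GPU utilization high at a small per-GPU batch size.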
