Paper Title

Breadth-First Pipeline Parallelism

Paper Authors

Lamy-Poirier, Joel

Paper Abstract

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
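The abstract contrasts the breadth-first schedule with conventional depth-first pipeline schedules. The toy Python sketch below only illustrates the ordering idea: with a looped pipeline placement where one GPU holds several stage chunks, a depth-first order pushes each micro-batch through all local chunks before starting the next, while a breadth-first order runs every micro-batch through one chunk before moving on. The function names and the simplified ordering are assumptions for illustration, not the paper's implementation, which also interleaves forward/backward passes and overlaps sharded data-parallel communication.

# Minimal illustrative sketch (assumed names; not the paper's code).
from itertools import product

def depth_first_order(num_chunks: int, num_microbatches: int):
    # Each micro-batch passes through all local stage chunks before the next micro-batch starts.
    return [(mb, chunk) for mb, chunk in product(range(num_microbatches), range(num_chunks))]

def breadth_first_order(num_chunks: int, num_microbatches: int):
    # All micro-batches pass through one stage chunk before the next chunk is touched,
    # which shortens the window in which each chunk's (sharded) weights must be resident.
    return [(mb, chunk) for chunk, mb in product(range(num_chunks), range(num_microbatches))]

if __name__ == "__main__":
    print("depth-first:  ", depth_first_order(num_chunks=2, num_microbatches=3))
    print("breadth-first:", breadth_first_order(num_chunks=2, num_microbatches=3))

Running the sketch with 2 local chunks and 3 micro-batches shows the two orders: depth-first visits (0,0),(0,1),(1,0),... while breadth-first visits (0,0),(1,0),(2,0),(0,1),..., which is the property the schedule exploits to keep GPU utilization high at a small per-GPU batch size.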
