Paper Title

Whale: Efficient Giant Model Training over Heterogeneous GPUs

Paper Authors

Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, Wei Lin

Paper Abstract

The scaling up of deep neural networks has been demonstrated to be effective in improving model quality, but also encompasses several training challenges in terms of training efficiency, programmability, and resource adaptability. We present Whale, a general and efficient distributed training framework for giant models. To support various parallel strategies and their hybrids, Whale generalizes the programming interface by defining two new primitives in the form of model annotations, allowing for incorporating user hints. The Whale runtime utilizes those annotations and performs graph optimizations to transform a local deep learning DAG graph for distributed multi-GPU execution. Whale further introduces a novel hardware-aware parallel strategy, which improves the performance of model training on heterogeneous GPUs in a balanced manner. Deployed in a production cluster with 512 GPUs, Whale successfully trains an industry-scale multimodal model with over ten trillion model parameters, named M6, demonstrating great scalability and efficiency.
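
The abstract says the programming interface consists of two new primitives expressed as model annotations that carry user hints, without showing what such annotations look like. Below is a minimal, hypothetical Python sketch of scope-based parallelism annotations in that spirit; the names `replicate` and `split`, the context-manager usage, and the hint records are illustrative assumptions, not a reproduction of Whale's actual API.

```python
# Hypothetical sketch of annotation-style parallelism hints, illustrating the idea of
# "two new primitives in the form of model annotations" described in the abstract.
# The names `replicate`/`split` and this scope-based interface are assumptions for
# illustration only, not Whale's documented API.
from contextlib import contextmanager

_HINTS = []  # hints a runtime could later read to rewrite the local graph for multi-GPU execution

@contextmanager
def replicate(device_count):
    """Hint: ops built in this scope run data-parallel, replicated on `device_count` GPUs."""
    _HINTS.append({"strategy": "replicate", "devices": device_count})
    yield

@contextmanager
def split(device_count):
    """Hint: ops built in this scope are sharded (model-parallel) across `device_count` GPUs."""
    _HINTS.append({"strategy": "split", "devices": device_count})
    yield

# Usage: the model is still written as local, single-device code; the annotations only
# attach strategy hints that a distributed runtime would consume during graph transformation.
with replicate(4):
    pass  # e.g. build a dense feature extractor here (small layers, data parallelism)
with split(8):
    pass  # e.g. build a very large embedding or softmax layer here (tensor sharding)

print(_HINTS)  # [{'strategy': 'replicate', 'devices': 4}, {'strategy': 'split', 'devices': 8}]
```

The design point this sketch illustrates is that parallelism choices live in lightweight hints layered over otherwise unchanged model code, so different strategies and their hybrids can be expressed without rewriting the model itself.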
