Paper Title

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

Authors

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Abstract

Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks, including machine translation. However, it remains challenging to deploy such models in real-life scenarios due to their large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models and cut down memory consumption significantly. While we achieve up to a 26x speed-up in terms of throughput, we also reduce the model size to almost one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models with 27% less cost and significantly better quality compared to existing solutions. This enables a paradigm shift in deploying large-scale multilingual MoE transformer models, replacing the traditional practice of distilling teacher models into dozens of smaller models per language or task.
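
The claimed size reduction follows from the bit widths: storing expert weights as 4-bit integers instead of 32-bit floats shrinks them to roughly one eighth of their original footprint, plus a small overhead for scales. The snippet below is a minimal sketch of weight-only 4-bit quantization under the assumption of per-column symmetric scales; the function names, shapes, and packing note are illustrative assumptions, not the paper's actual framework or kernels.

import numpy as np

# Illustrative sketch only: weight-only 4-bit quantization of one expert's
# weight matrix with per-column symmetric scales (an assumption for
# demonstration, not the paper's implementation).

def quantize_4bit(w):
    """Map a float32 matrix to signed 4-bit integers in [-8, 7] plus per-column scales."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero columns
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_4bit(q, scale):
    """Recover an approximate float32 matrix for use in a matmul."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 4096)).astype(np.float32)  # hypothetical expert FFN weight shape
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
# Two 4-bit values can be packed into one byte, so storage drops to roughly
# 1/8 of the 32-bit original (the per-column scales add a small overhead).
print("max abs quantization error:", float(np.abs(w - w_hat).max()))

In practice, the quantized weights would stay in 4-bit form in memory and be dequantized on the fly inside the expert matmul; the sketch above only illustrates the round-trip and the resulting approximation error.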
