Paper Title

Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models

Authors

Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, Ji-Rong Wen

Abstract

Recently, Mixture-of-Experts (short as MoE) architecture has achieved remarkable success in increasing the model capacity of large-scale language models. However, MoE requires incorporating significantly more parameters than the base model being extended. In this paper, we propose building a parameter-efficient MoE architecture by sharing information among experts. We adopt the matrix product operator (MPO, a tensor decomposition from quantum many-body physics) to reconstruct the parameter matrix in the expert layer and increase model capacity for pre-trained language models by sharing parameters of the central tensor (containing the core information) among different experts while enabling the specificity through the auxiliary tensors (complementing the central tensor) of different experts. To address the unbalanced optimization issue, we further design the gradient mask strategy for the MPO-based MoE architecture. Extensive experiments based on T5 and GPT-2 show improved performance and efficiency of the pre-trained language model (27.2x reduction in total parameters for the superior model performance, compared with the Switch Transformers). Our code is publicly available at https://github.com/RUCAIBox/MPOE.
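
To make the shared-central-tensor idea concrete, below is a minimal PyTorch sketch of an MoE feed-forward layer whose expert weight matrices are MPO chains of three local tensors. This is a hypothetical illustration, not the released MPOE implementation: the class name MPOExpertLayer, the three-tensor factorization shapes, the ranks, and the top-1 softmax gating are all assumptions chosen for brevity, and the paper's gradient mask strategy is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPOExpertLayer(nn.Module):
    """Hypothetical sketch: an MoE linear layer whose expert weights are
    MPO chains of three local tensors. The large central tensor is shared
    across all experts; only the small auxiliary tensors are expert-specific.
    All shapes and ranks are illustrative, not the paper's settings."""

    def __init__(self, num_experts=4,
                 in_shape=(4, 16, 4),    # factorization of in_dim = 256
                 out_shape=(4, 32, 4),   # factorization of out_dim = 512
                 ranks=(8, 8)):
        super().__init__()
        i1, i2, i3 = in_shape
        j1, j2, j3 = out_shape
        r1, r2 = ranks
        self.in_dim, self.out_dim = i1 * i2 * i3, j1 * j2 * j3
        # Central tensor: one copy, shared by every expert.
        self.central = nn.Parameter(0.02 * torch.randn(r1, i2, j2, r2))
        # Auxiliary tensors: a small pair per expert, providing each
        # expert's specificity at little parameter cost.
        self.left = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(i1, j1, r1))
             for _ in range(num_experts)])
        self.right = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(r2, i3, j3))
             for _ in range(num_experts)])
        self.gate = nn.Linear(self.in_dim, num_experts)

    def expert_weight(self, k):
        # Contract the MPO chain back into a full (in_dim, out_dim) matrix.
        w = torch.einsum('abp,pcdq,qef->acebdf',
                         self.left[k], self.central, self.right[k])
        return w.reshape(self.in_dim, self.out_dim)

    def forward(self, x):                      # x: (num_tokens, in_dim)
        probs = F.softmax(self.gate(x), dim=-1)
        top = probs.argmax(dim=-1)             # top-1 routing per token
        out = x.new_zeros(x.shape[0], self.out_dim)
        for k in range(len(self.left)):
            sel = top == k
            if sel.any():
                out[sel] = (x[sel] @ self.expert_weight(k)) * probs[sel, k:k+1]
        return out

if __name__ == "__main__":
    layer = MPOExpertLayer()
    y = layer(torch.randn(10, 256))
    print(y.shape)  # torch.Size([10, 512])
```

Because the central tensor dominates the parameter count of each decomposed matrix, sharing it means that adding an expert costs only the two small auxiliary tensors, which is the source of the parameter savings reported in the abstract.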
