Paper Title
MoEC: Mixture of Expert Clusters
Paper Authors
Paper Abstract
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts and utilizes a gated routing network to activate experts conditionally. However, as the number of experts grows, MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation. Such problems are especially severe on tasks with limited data, hindering MoE models from improving performance by scaling up. In this work, we propose Mixture of Expert Clusters (MoEC), a general approach that enables expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. We further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments show that MoEC improves performance on machine translation and natural language understanding tasks, and raises the performance upper bound for scaling up experts under limited data. We also verify that MoEC plays a positive role in mitigating overfitting and sparse data allocation.
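The abstract only outlines the mechanism, so the following is a minimal PyTorch-style sketch of the idea rather than the paper's actual implementation: experts are grouped into clusters, a variance-based auxiliary term is computed from the routing probabilities, and a cluster-level expert dropout mask is applied during training. The class name MoECLayer, the specific auxiliary loss (intra-cluster variance minus inter-cluster variance), the dropout rate, and the top-1 dispatch are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of an MoE layer with expert clusters, a variance-based
# routing constraint, and cluster-level expert dropout. The exact formulation
# in the MoEC paper may differ; names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoECLayer(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=16, num_clusters=4,
                 cluster_dropout=0.25):
        super().__init__()
        assert num_experts % num_clusters == 0
        self.num_experts = num_experts
        self.num_clusters = num_clusters
        self.cluster_size = num_experts // num_clusters
        self.cluster_dropout = cluster_dropout
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model); routing probabilities over all experts.
        probs = F.softmax(self.router(x), dim=-1)                 # (tokens, E)

        # Group probabilities by cluster: (tokens, C, E/C).
        per_cluster = probs.view(x.size(0), self.num_clusters, self.cluster_size)

        # Variance-based constraint (one plausible reading of the abstract):
        # push tokens to commit to a single cluster (high variance of the
        # per-cluster probability mass) while keeping experts inside a
        # cluster balanced (low variance within each cluster).
        cluster_mass = per_cluster.sum(dim=-1)                    # (tokens, C)
        inter_cluster_var = cluster_mass.var(dim=-1).mean()
        intra_cluster_var = per_cluster.var(dim=-1).mean()
        aux_loss = intra_cluster_var - inter_cluster_var          # added to the task loss

        # Cluster-level expert dropout: during training, randomly mask experts
        # within the cluster structure and renormalize, so no single expert
        # overfits the tokens routed to its cluster.
        if self.training and self.cluster_dropout > 0:
            keep = (torch.rand(self.num_experts, device=x.device)
                    > self.cluster_dropout).float()
            probs = probs * keep
            probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-9)

        # Top-1 dispatch for simplicity; the paper may use a different top-k.
        top_p, top_idx = probs.max(dim=-1)                        # (tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out, aux_loss
```

In this sketch the auxiliary term would be weighted and added to the training objective, which is one straightforward way to "impose variance-based constraints on the routing stage" as the abstract describes; the actual constraint and dropout schedule used by MoEC should be taken from the paper itself.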