Paper Title
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Paper Authors
Paper Abstract
There has been a lot of interest in the scaling properties of Transformer models. However, not much has been done to investigate how different inductive biases and model architectures affect scaling properties. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.
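To make "scaling behaviour" concrete: scaling-law studies typically fit upstream loss as a power law in model size, L(N) ≈ a·N^(−b) + c, and compare the fitted curves across architectures. The sketch below is not from the paper; it fits such a curve to synthetic (parameter count, loss) points, purely to illustrate how two hypothetical architectures could exhibit different scaling exponents.

```python
# Minimal illustrative sketch (synthetic data, not the paper's results):
# fit upstream loss as a power law in parameter count N, L(N) = a * N^(-b) + c.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Loss decays as a power of model size, approaching an irreducible floor c.
    return a * np.power(n, -b) + c

# Hypothetical (parameter count, validation loss) points for two architectures.
sizes = np.array([1e7, 5e7, 2e8, 1e9])
loss_arch_a = np.array([4.1, 3.6, 3.2, 2.9])  # e.g. a vanilla Transformer
loss_arch_b = np.array([3.9, 3.7, 3.5, 3.4])  # e.g. an alternative architecture

for name, loss in [("arch A", loss_arch_a), ("arch B", loss_arch_b)]:
    (a, b, c), _ = curve_fit(power_law, sizes, loss, p0=(10.0, 0.1, 2.0), maxfev=10000)
    print(f"{name}: L(N) ~ {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
```

A larger fitted exponent b for one architecture would mean it benefits more from scale, so the two fitted curves can cross: the better model at small N need not be the better model at large N, which is the kind of effect the paper studies.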