Paper Title
Transformers generalize differently from information stored in context vs in weights
Paper Authors
Paper Abstract
Transformer models can use two fundamentally different kinds of information: information stored in weights during training, and information provided "in-context" at inference time. In this work, we show that transformers exhibit different inductive biases in how they represent and generalize from the information in these two sources. In particular, we characterize whether they generalize via parsimonious rules (rule-based generalization) or via direct comparison with observed examples (exemplar-based generalization). This is of important practical consequence, as it informs whether to encode information in weights or in context, depending on how we want models to use that information. In transformers trained on controlled stimuli, we find that generalization from weights is more rule-based whereas generalization from context is largely exemplar-based. In contrast, we find that in transformers pretrained on natural language, in-context learning is significantly rule-based, with larger models showing more rule-basedness. We hypothesise that rule-based generalization from in-context information might be an emergent consequence of large-scale training on language, which has sparse rule-like structure. Using controlled stimuli, we verify that transformers pretrained on data containing sparse rule-like structure exhibit more rule-based generalization.
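To make the rule-based vs. exemplar-based distinction concrete, here is a minimal sketch (not from the paper; the stimuli and feature layout are hypothetical) of the kind of probe that separates the two strategies: training items are labeled by a single sparse rule (one relevant feature), and the probe item is constructed so that the rule and overall similarity to the training exemplars point to different labels.

```python
# Hypothetical controlled-stimuli probe: distinguishes rule-based from
# exemplar-based generalization. Items are binary feature vectors; labels
# follow a single "rule" feature (feature 0).
import numpy as np

# Training stimuli: the label is determined entirely by feature 0 (sparse rule).
train_x = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
])
train_y = train_x[:, 0]  # rule: copy feature 0

# Probe: feature 0 says label 1, but most features match the label-0 exemplars.
probe = np.array([1, 0, 0, 0])

rule_prediction = probe[0]  # what a purely rule-based learner would output

# Exemplar-based prediction: nearest training item by feature overlap.
similarities = (train_x == probe).sum(axis=1)
exemplar_prediction = train_y[similarities.argmax()]

print("rule-based prediction:    ", rule_prediction)      # -> 1
print("exemplar-based prediction:", exemplar_prediction)  # -> 0
```

A model's answer on such conflict probes indicates which inductive bias dominates; the same probe can be run with the training items stored in weights (via training or fine-tuning) or presented in the prompt, which is the contrast the paper studies.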