Paper Title
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks
Paper Authors
Paper Abstract
Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We search for the layer and head configuration sufficient to solve these tasks, then probe for signs of systematic processing in latent representations and attention patterns. We show that two-layer transformers learn reliable solutions to multi-level problems, develop signs of task decomposition, and encode input items in a way that encourages the exploitation of shared computation across related tasks. These results provide key insights into how attention layers support structured computation both within a task and across multiple tasks.
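The abstract attributes length generalization to replacing the standard positional encoding with labels arbitrarily paired with items in the sequence. Below is a minimal sketch of one way such label-based positions could be implemented in PyTorch; the class name, the label pool size, and the choice to sort the sampled labels are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: instead of indexing a positional embedding table by
# absolute position 0..L-1, each sequence is paired with labels drawn from a
# pool larger than any training length, so longer test sequences still receive
# familiar labels. All names and hyperparameters here are assumptions.
import torch
import torch.nn as nn


class LabelPositionalEncoding(nn.Module):
    def __init__(self, label_pool_size: int = 128, d_model: int = 64):
        super().__init__()
        # Embedding table over the whole label pool, not over absolute positions.
        self.label_emb = nn.Embedding(label_pool_size, d_model)
        self.label_pool_size = label_pool_size

    def forward(self, item_emb: torch.Tensor) -> torch.Tensor:
        """item_emb: (batch, seq_len, d_model) item embeddings."""
        batch, seq_len, _ = item_emb.shape
        assert seq_len <= self.label_pool_size
        # For each sequence, sample seq_len distinct labels from the pool and
        # sort them so relative order is preserved (one plausible choice; the
        # abstract only says labels are "arbitrarily paired with items").
        labels = torch.stack([
            torch.sort(torch.randperm(self.label_pool_size)[:seq_len]).values
            for _ in range(batch)
        ])  # (batch, seq_len)
        return item_emb + self.label_emb(labels.to(item_emb.device))


# Usage: a batch of 4 sequences of length 10 with 64-dim item embeddings.
enc = LabelPositionalEncoding(label_pool_size=128, d_model=64)
x = torch.randn(4, 10, 64)
print(enc(x).shape)  # torch.Size([4, 10, 64])
```

Because the labels that appear at test time are drawn from the same pool seen during training, sequences longer than the training length are not forced onto unseen position indices, which is one plausible reading of how this scheme supports the reported generalization.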