Paper Title
Guiding Attention for Self-Supervised Learning with Transformers
Paper Authors
Paper Abstract
In this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities. We propose a computationally efficient auxiliary loss function to guide attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence of models as well as better performance on downstream tasks compared to the baselines, achieving state-of-the-art results in low-resource settings. Surprisingly, we also find that linguistic properties of attention heads are not necessarily correlated with language modeling performance.
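To make the idea of an attention-guidance auxiliary loss concrete, below is a minimal PyTorch sketch, not the authors' implementation. It constructs a few simple position-based target patterns (attend to self, previous token, next token, first token), assigns one pattern to each head, and penalizes the squared distance between each head's attention map and its target. The specific pattern set, the head-to-pattern assignment, and the use of MSE (rather than, say, a KL divergence) are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def fixed_attention_patterns(seq_len: int, device=None) -> torch.Tensor:
    """Build simple position-based attention patterns (illustrative targets).

    Returns a tensor of shape (num_patterns, seq_len, seq_len); each row of
    each pattern is a valid attention distribution over positions.
    """
    eye = torch.eye(seq_len, device=device)
    prev_tok = torch.roll(eye, shifts=-1, dims=1)   # attend to previous token
    prev_tok[0] = eye[0]                            # first token attends to itself
    next_tok = torch.roll(eye, shifts=1, dims=1)    # attend to next token
    next_tok[-1] = eye[-1]                          # last token attends to itself
    first_tok = torch.zeros(seq_len, seq_len, device=device)
    first_tok[:, 0] = 1.0                           # every token attends to position 0
    return torch.stack([eye, prev_tok, next_tok, first_tok])


def attention_guidance_loss(attn_probs: torch.Tensor,
                            patterns: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss pushing each head toward its assigned fixed pattern.

    attn_probs: (batch, num_heads, seq_len, seq_len) attention distributions
                from one Transformer layer.
    patterns:   (num_patterns, seq_len, seq_len) target distributions; heads
                are assigned to patterns round-robin here for simplicity.
    """
    batch, num_heads, seq_len, _ = attn_probs.shape
    # Assign pattern (k mod num_patterns) to head k -- an illustrative choice.
    head_targets = patterns[torch.arange(num_heads) % patterns.size(0)]
    head_targets = head_targets.unsqueeze(0).expand(batch, -1, -1, -1)
    return F.mse_loss(attn_probs, head_targets)


if __name__ == "__main__":
    batch, heads, seq_len = 2, 8, 16
    attn = torch.softmax(torch.randn(batch, heads, seq_len, seq_len), dim=-1)
    patterns = fixed_attention_patterns(seq_len)
    aux = attention_guidance_loss(attn, patterns)
    # In pre-training this term would be added to the main objective, e.g.
    # total_loss = mlm_loss + lambda_guidance * aux, with lambda_guidance a
    # hyperparameter controlling how strongly heads are pulled to the patterns.
    print(aux.item())
```

Because the loss compares attention maps against fixed, input-independent patterns, it adds only a small constant overhead per layer and can be combined with any pre-training objective, which is consistent with the abstract's claim that the method is objective-agnostic.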