Paper Title
RealFormer: Transformer Likes Residual Attention
Paper Authors
Paper Abstract
Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
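The "Residual Attention Layer" idea is a skip connection over the raw attention scores: each layer adds the previous layer's pre-softmax attention logits to its own before applying softmax, and passes the summed logits on to the next layer. The snippet below is a minimal single-head sketch of that score residual, not the authors' implementation; the function names, toy shapes, and the omission of projection matrices and multi-head splitting are simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """One attention head with a residual connection on the raw
    attention scores (pre-softmax logits), in the spirit of RealFormer.

    q, k, v: [seq_len, d] arrays; prev_scores: [seq_len, seq_len] logits
    carried over from the layer below (None for the first layer).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # standard scaled dot-product logits
    if prev_scores is not None:
        scores = scores + prev_scores  # residual edge over attention scores
    probs = softmax(scores, axis=-1)
    return probs @ v, scores           # pass raw scores up to the next layer

# Toy stack: each layer reuses the raw scores from the layer below.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))       # 8 tokens, hidden size 16
prev = None
for _ in range(4):
    x, prev = residual_attention(x, x, x, prev_scores=prev)
```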