Paper Title
RealFormer: Transformer Likes Residual Attention
Paper Authors
Paper Abstract
Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
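The "Residual Attention Layer" idea is a skip connection over the raw attention scores: each layer adds the previous layer's pre-softmax attention logits to its own before applying softmax, and passes the summed logits on to the next layer. The snippet below is a minimal single-head sketch of that score residual, not the authors' implementation; the function names, toy shapes, and the omission of projection matrices and multi-head splitting are simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """One attention head with a residual connection on the raw
    attention scores (pre-softmax logits), in the spirit of RealFormer.

    q, k, v: [seq_len, d] arrays; prev_scores: [seq_len, seq_len] logits
    carried over from the layer below (None for the first layer).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # standard scaled dot-product logits
    if prev_scores is not None:
        scores = scores + prev_scores  # residual edge over attention scores
    probs = softmax(scores, axis=-1)
    return probs @ v, scores           # pass raw scores up to the next layer

# Toy stack: each layer reuses the raw scores from the layer below.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))       # 8 tokens, hidden size 16
prev = None
for _ in range(4):
    x, prev = residual_attention(x, x, x, prev_scores=prev)
```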