Paper Title
Segatron: Segment-Aware Transformer for Language Modeling and Understanding
Paper Authors
Paper Abstract
Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, it distinguishes sequential tokens only by their token position indices. We hypothesize that better contextual representations can be generated from the Transformer with richer positional information. To verify this, we propose a segment-aware Transformer (Segatron) by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism to Transformer-XL, a popular Transformer-based language model with memory extension and relative position encoding. We find that our method can further improve both the Transformer-XL base and large models, achieving a perplexity of 17.1 on the WikiText-103 dataset. We further investigate the masked language modeling pre-training task with Segatron. Experimental results show that BERT pre-trained with Segatron (SegaBERT) can outperform BERT with a vanilla Transformer on various NLP tasks, and outperforms RoBERTa on zero-shot sentence representation learning.
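The core change described in the abstract is replacing the single token position index with a combination of paragraph, sentence, and token positions. The sketch below illustrates one plausible absolute-embedding variant in PyTorch, where the three indices are embedded separately and summed. The class name, size limits, and the per-sentence token-index reset in the usage example are illustrative assumptions, not the paper's exact implementation (Transformer-XL, for instance, uses relative position encoding instead).

```python
import torch
import torch.nn as nn

class SegmentAwarePositionEmbedding(nn.Module):
    """Minimal sketch of a combined paragraph/sentence/token position encoding.
    Hyperparameters and the summation scheme are illustrative assumptions."""

    def __init__(self, hidden_size, max_paragraphs=64, max_sentences=128, max_tokens=512):
        super().__init__()
        self.paragraph_emb = nn.Embedding(max_paragraphs, hidden_size)
        self.sentence_emb = nn.Embedding(max_sentences, hidden_size)
        self.token_emb = nn.Embedding(max_tokens, hidden_size)

    def forward(self, paragraph_ids, sentence_ids, token_ids):
        # Each id tensor has shape (batch, seq_len); the three embeddings are
        # summed to produce one position vector per input token.
        return (self.paragraph_emb(paragraph_ids)
                + self.sentence_emb(sentence_ids)
                + self.token_emb(token_ids))


# Toy usage: a 6-token sequence covering two sentences in one paragraph.
emb = SegmentAwarePositionEmbedding(hidden_size=768)
paragraph_ids = torch.tensor([[0, 0, 0, 0, 0, 0]])
sentence_ids  = torch.tensor([[0, 0, 0, 1, 1, 1]])
token_ids     = torch.tensor([[0, 1, 2, 0, 1, 2]])  # assumed: token index resets per sentence
print(emb(paragraph_ids, sentence_ids, token_ids).shape)  # torch.Size([1, 6, 768])
```

In this sketch, the summed position vector would be added to the word embeddings before the Transformer layers, so that tokens sharing a sentence or paragraph receive partially shared positional signals.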