Paper Title

BERT-JAM: Boosting BERT-Enhanced Neural Machine Translation with Joint Attention

Paper Authors

Zhebin Zhang, Sai Wu, Dawei Jiang, Gang Chen

Paper Abstract

BERT-enhanced neural machine translation (NMT) aims at leveraging BERT-encoded representations for translation tasks. A recently proposed approach uses attention mechanisms to fuse Transformer's encoder and decoder layers with BERT's last-layer representation and shows enhanced performance. However, this method does not allow for a flexible distribution of attention between the BERT representation and the encoder/decoder representations. In this work, we propose a novel BERT-enhanced NMT model called BERT-JAM which improves upon existing models from two aspects: 1) BERT-JAM uses joint-attention modules to allow the encoder/decoder layers to dynamically allocate attention between different representations, and 2) BERT-JAM allows the encoder/decoder layers to make use of BERT's intermediate representations by composing them using a gated linear unit (GLU). We train BERT-JAM with a novel three-phase optimization strategy that progressively unfreezes different components of BERT-JAM. Our experiments show that BERT-JAM achieves SOTA BLEU scores on multiple translation tasks.
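To make the two mechanisms described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of an encoder-side block that (a) composes a stack of BERT layer outputs with a GLU gate and (b) lets a single attention softmax distribute its weight jointly over the encoder's own states and the composed BERT representation. The class and method names (JointAttentionBlock, compose_bert_layers), the tensor shapes, and the choice to concatenate keys/values along the sequence dimension are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
# Illustrative sketch only: one plausible reading of "joint attention" plus
# GLU-based composition of BERT's intermediate layers. Names and shapes are
# hypothetical, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointAttentionBlock(nn.Module):
    """Hypothetical encoder-side block: a single softmax jointly distributes
    attention over the layer's own states and the fused BERT representation."""

    def __init__(self, d_model: int, n_heads: int, n_bert_layers: int):
        super().__init__()
        # Project the concatenation of all BERT layer outputs to 2*d_model,
        # then a GLU gate halves it back to d_model (assumed composition).
        self.bert_fuse = nn.Linear(n_bert_layers * d_model, 2 * d_model)
        self.joint_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def compose_bert_layers(self, bert_layers: torch.Tensor) -> torch.Tensor:
        # bert_layers: (n_bert_layers, batch, src_len, d_model)
        stacked = torch.cat(list(bert_layers), dim=-1)   # (B, S, L*d)
        return F.glu(self.bert_fuse(stacked), dim=-1)    # (B, S, d)

    def forward(self, enc_states: torch.Tensor, bert_layers: torch.Tensor) -> torch.Tensor:
        # enc_states: (batch, src_len, d_model)
        bert_repr = self.compose_bert_layers(bert_layers)
        # Concatenating keys/values lets one softmax split its attention mass
        # between the encoder's own states and the BERT representation.
        kv = torch.cat([enc_states, bert_repr], dim=1)   # (B, 2*S, d)
        attn_out, _ = self.joint_attn(enc_states, kv, kv)
        return self.norm(enc_states + attn_out)


if __name__ == "__main__":
    B, S, D, H, L = 2, 7, 512, 8, 13   # toy sizes, not the paper's settings
    block = JointAttentionBlock(D, H, L)
    enc = torch.randn(B, S, D)
    bert = torch.randn(L, B, S, D)
    print(block(enc, bert).shape)      # torch.Size([2, 7, 512])
```

Under this sketch, the three-phase optimization strategy mentioned in the abstract would amount to toggling requires_grad on different parameter groups between training phases (for example, first the fusion and joint-attention modules, then the rest of the NMT model, then BERT itself); the actual phase boundaries and schedule used by the authors may differ.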
