Paper title
EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation
Authors
Abstract
Complete Multi-lingual Neural Machine Translation (C-MNMT) outperforms conventional MNMT by constructing a multi-way aligned corpus, i.e., by aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, because exactly identical sentences across different language pairs are scarce, the benefit of the multi-way aligned corpus is limited by its scale. To address this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach for constructing a large-scale, high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing bilingual examples from different language pairs whose source or target sentences are highly similar, and then generate the final aligned examples from these candidates with a well-trained generation model. With this two-step pipeline, EAG can construct a large-scale multi-way aligned corpus whose diversity is almost identical to that of the original bilingual corpus. Experiments on two publicly available datasets, WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with gains of +1.1 and +1.4 BLEU points on the two datasets, respectively.
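The extraction step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold EAG uses, so the token-level ratio from Python's `difflib.SequenceMatcher`, the threshold of 0.8, the function names `similarity` and `extract_candidates`, and the toy sentences are all assumptions made for illustration. The generation step, which would turn near-matches into final aligned examples, is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity ratio in [0, 1].

    Assumption: a stand-in metric, not necessarily the one used by EAG.
    """
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def extract_candidates(pairs_xy, pairs_xz, threshold=0.8):
    """Pair (x, y) and (x', z) examples whose shared-language sides are similar.

    Returns tuples (x, y, x_prime, z, score). A score of 1.0 means the
    shared sides are identical, so the pair is already multi-way aligned;
    lower scores mark candidates that a generation model would refine.
    The nested loop is quadratic and only suitable for a toy example.
    """
    candidates = []
    for x, y in pairs_xy:
        for x_prime, z in pairs_xz:
            score = similarity(x, x_prime)
            if score >= threshold:
                candidates.append((x, y, x_prime, z, score))
    return candidates


# Hypothetical toy data: English-German and English-French examples.
en_de = [
    ("the cat sits on the mat", "die Katze sitzt auf der Matte"),
    ("i like green tea", "ich mag grünen Tee"),
]
en_fr = [
    ("the cat sits on the mat", "le chat est assis sur le tapis"),
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
    ("where is the station", "où est la gare"),
]

candidates = extract_candidates(en_de, en_fr, threshold=0.8)
for x, y, x_prime, z, score in candidates:
    print(f"{score:.2f} | {x} | {x_prime} | {y} | {z}")
```

In this toy run, the exact match yields a directly aligned English-German-French triple, while the near-match ("sits" versus "sat") is the kind of candidate that would be handed to the generation model.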