Paper Title

MTet: Multi-domain Translation for English and Vietnamese

Paper Authors

Ngo, Chinh, Trinh, Trieu H., Phan, Long, Tran, Hieu, Dang, Tai, Nguyen, Hieu, Nguyen, Minh, Luong, Minh-Thang

Abstract

We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combined with previous work on English-Vietnamese translation, it grows the existing parallel dataset to 6.2M sentence pairs. We also release EnViT5, the first pretrained model for English and Vietnamese. Using both resources, our model significantly outperforms previous state-of-the-art results by up to 2 points in translation BLEU score, while being 1.6 times smaller.
