Neural Machine Translation for Low-Resourced Indian Languages

Paper Title

Neural Machine Translation for Low-Resourced Indian Languages

Authors

Himanshu Choudhary, Shivansh Rao, Rajesh Rohilla

Abstract

A large number of significant resources are available online in English and are frequently translated into native languages to ease information sharing among local people who are not very familiar with English. However, manual translation is a tedious, costly, and time-consuming process. To this end, machine translation is an effective approach for converting text into a different language without any human involvement. Neural machine translation (NMT) is one of the most proficient translation techniques among existing machine translation systems. In this paper, we apply NMT to two language pairs involving highly morphologically rich Indian languages: English-Tamil and English-Malayalam. We propose a novel NMT model that uses multi-head self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to build an efficient translation system that overcomes the out-of-vocabulary (OOV) problem for low-resourced, morphologically rich Indian languages for which little translated text is available online. We also collected corpora from different sources, addressed the issues with these publicly available data, and refined them for further use. We use the BLEU score to evaluate system performance. Experimental results and a survey confirm that our proposed translator (BLEU scores of 24.34 and 9.78) outperforms Google Translate (BLEU scores of 9.40 and 5.94) on English-Tamil and English-Malayalam, respectively.
