Paper Title
BERT Based Multilingual Machine Comprehension in English and Hindi
Paper Authors
Paper Abstract
Multilingual Machine Comprehension (MMC) is a Question-Answering (QA) sub-task that involves quoting the answer to a question from a given snippet, where the question and the snippet can be in different languages. The recently released multilingual variant of BERT (m-BERT), pre-trained on 104 languages, has performed well in both zero-shot and fine-tuned settings for multilingual tasks; however, it has not yet been used for English-Hindi MMC. We therefore present, in this article, our experiments with m-BERT for MMC in zero-shot, mono-lingual (e.g. Hindi Question-Hindi Snippet) and cross-lingual (e.g. English Question-Hindi Snippet) fine-tuned setups. These model variants are evaluated on all possible multilingual settings, and the results are compared against the current state-of-the-art sequential QA system for these languages. Experiments show that m-BERT, with fine-tuning, improves performance on all evaluation settings across both datasets used by the prior model, thereby establishing m-BERT based MMC as the new state-of-the-art for English and Hindi. We also publish our results on an extended version of the recently released XQuAD dataset, which we propose to use as the evaluation benchmark for future research.
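
To illustrate the kind of m-BERT based extractive QA described in the abstract, the following minimal sketch runs a cross-lingual example (English question, Hindi snippet) through a multilingual BERT checkpoint with a span-prediction head, using the Hugging Face Transformers library. This is an assumption for illustration only, not the paper's actual implementation; the checkpoint name, the example texts, and the library choice are all hypothetical, and the QA head of the raw pre-trained checkpoint is untrained, so a model fine-tuned on a QA dataset (as in the paper's fine-tuned setups) would be needed to get sensible answers.

# Minimal sketch: extractive QA with multilingual BERT (m-BERT).
# Assumes the Hugging Face Transformers library; the checkpoint and the
# example question/snippet are illustrative, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")

question = "Where is the Taj Mahal located?"          # English question
snippet = "ताज महल भारत के आगरा शहर में स्थित है।"        # Hindi snippet (cross-lingual setting)

# Encode the question-snippet pair as a single sequence, as in BERT-style QA.
inputs = tokenizer(question, snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely answer span from the start/end logits and decode it.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)

In the mono-lingual settings the question and snippet would simply be in the same language; the model and decoding logic stay unchanged, which is what makes a single m-BERT model usable across all the evaluation settings mentioned above.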