Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Paper Title

Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Authors

Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah, Jacopo Staiano

Abstract

Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report new state-of-the-art results on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
