Paper Title
Practical Comparable Data Collection for Low-Resource Languages via Images
Paper Authors
Paper Abstract
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators. Our method uses a carefully selected set of images as a pivot between the source and target languages: captions for each image are collected in both languages independently. Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks: machine translation and dictionary extraction. All code and data are available at https://github.com/madaan/PML4DC-Comparable-Data-Collection.
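The image-pivot idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pair_captions`, the image IDs, and the example captions are all hypothetical, and the sketch only assumes that each monolingual annotator pool produces captions keyed by a shared image identifier. Captions written independently for the same image are paired to form candidate comparable sentence pairs.

```python
from itertools import product


def pair_captions(source_captions, target_captions):
    """Pair captions that describe the same image.

    Both arguments map an image ID (hypothetical naming) to a list of
    captions written independently by monolingual annotators.
    Returns a list of (image_id, source_caption, target_caption) tuples.
    """
    pairs = []
    # Only images captioned in both languages can act as a pivot.
    shared_ids = sorted(source_captions.keys() & target_captions.keys())
    for image_id in shared_ids:
        # Every source caption is paired with every target caption
        # of the same image.
        for src, tgt in product(source_captions[image_id],
                                target_captions[image_id]):
            pairs.append((image_id, src, tgt))
    return pairs


# Made-up illustrative data, not taken from the released dataset.
english = {
    "img_001": ["A dog runs on the beach."],
    "img_002": ["Two children play cricket."],
}
hindi = {
    "img_001": ["एक कुत्ता समुद्र तट पर दौड़ रहा है।"],
    "img_003": ["एक आदमी चाय पी रहा है।"],
}

pairs = pair_captions(english, hindi)
for image_id, src, tgt in pairs:
    print(image_id, src, tgt, sep="\t")
```

Only `img_001` has captions in both languages here, so it is the only image that yields a pair. Because the captions are written independently, such pairs are comparable rather than strictly parallel, which is consistent with the abstract's report that most, but not all, pairs are acceptable translations.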