Paper Title
X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models
Paper Authors
Paper Abstract
Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as "Punta Cana is located in _." However, while knowledge is both written and queried in many languages, studies on LMs' factual representation ability have almost invariably been performed on English. To assess factual knowledge retrieval in LMs in different languages, we create a multilingual benchmark of cloze-style probes for 23 typologically diverse languages. To properly handle language variations, we expand probing methods from single- to multi-word entities, and develop several decoding algorithms to generate multi-token predictions. Extensive experimental results provide insights about how well (or poorly) current state-of-the-art LMs perform at this task in languages with more or fewer available resources. We further propose a code-switching-based method to improve the ability of multilingual LMs to access knowledge, and verify its effectiveness on several benchmark languages. Benchmark data and code have been released at https://x-factr.github.io.
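The abstract mentions decoding algorithms for generating multi-token predictions when a blank spans a multi-word entity. One family of such algorithms fills masked slots iteratively, committing the most confident token first and re-scoring the rest. The sketch below illustrates that general idea only; the `predict` function is a hypothetical stand-in for a masked LM, and the example answers are hard-coded for illustration, not produced by any actual model from the paper.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"

def iterative_fill(tokens: List[str],
                   predict: Callable[[List[str], int], Tuple[str, float]]
                   ) -> List[str]:
    """Fill every [MASK] slot one token at a time, always committing the
    highest-confidence prediction first (confidence-ordered decoding)."""
    tokens = list(tokens)
    while MASK in tokens:
        # Score each remaining masked position with the (mock) LM.
        candidates = [(i, *predict(tokens, i))
                      for i, t in enumerate(tokens) if t == MASK]
        # Commit the single most confident token, then loop to re-score,
        # so later predictions can condition on earlier commitments.
        i, word, _ = max(candidates, key=lambda c: c[2])
        tokens[i] = word
    return tokens

def mock_predict(tokens: List[str], i: int) -> Tuple[str, float]:
    # Hypothetical scorer: returns (token, confidence) per masked index.
    answers = {5: ("Dominican", 0.7), 6: ("Republic", 0.9)}
    return answers[i]

sentence = "Punta Cana is located in [MASK] [MASK] .".split()
print(" ".join(iterative_fill(sentence, mock_predict)))
# → Punta Cana is located in Dominican Republic .
```

Here the position with confidence 0.9 is filled first; a real implementation would re-query the LM after each commitment instead of using a fixed answer table.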