Paper Title

Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing

Authors

Zonghai Yao, Yi Cao, Zhichao Yang, Hong Yu

Abstract

Pretrained language models (PLMs) have motivated research into what kinds of knowledge these models learn. Fill-in-the-blank problems (e.g., cloze tests) are a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs' knowledge. However, existing research has shown that such prompt-based knowledge probing methods can probe only a lower bound of knowledge. Many factors, such as prompt-based probing biases, make the LAMA benchmark unreliable and unstable, and this problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and the prevalence of large-N-M relations keep the performance gap between LAMA and BioLAMA notable. To address these issues, we introduce context variance into prompt generation and propose a new rank-change-based evaluation metric. Departing from previous known-unknown evaluation criteria, we propose the concept of "Misunderstand" in LAMA for the first time. Through experiments on 12 PLMs, our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA friendlier to large-N-M relations and rare relations. We also conduct a set of control experiments to disentangle "understand" from just "read and copy".
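
To make the probing setup described in the abstract concrete, below is a minimal sketch of cloze-style ("fill-mask") probing with a Top-k accuracy check, using the HuggingFace transformers fill-mask pipeline. The model name and the (prompt, answer) pairs are illustrative assumptions, not the paper's actual prompts or code, and the rank-change-based UCM metric is not reproduced here since the abstract does not define it.

```python
# Minimal sketch of cloze-style knowledge probing with Top-k accuracy,
# in the spirit of (but not taken from) the BioLAMA setup.
# The model name and the (prompt, answer) pairs below are illustrative
# placeholders, not items from the BioLAMA benchmark.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Hypothetical biomedical cloze probes: each pairs a masked prompt
# with a single-token gold answer.
probes = [
    ("Insulin is produced in the [MASK].", "pancreas"),
    ("Penicillin is used to treat bacterial [MASK].", "infections"),
]

def top_k_accuracy(probes, k=5):
    """Fraction of probes whose gold answer appears among the top-k fills."""
    hits = 0
    for prompt, gold in probes:
        # Swap the placeholder for the model's own mask token.
        text = prompt.replace("[MASK]", fill_mask.tokenizer.mask_token)
        predictions = fill_mask(text, top_k=k)  # list of {token_str, score, ...}
        if any(p["token_str"].strip().lower() == gold.lower() for p in predictions):
            hits += 1
    return hits / len(probes)

print(f"Top-5 accuracy: {top_k_accuracy(probes):.2f}")
```

Note that a fill-mask pipeline ranks single vocabulary tokens, which is why the gold answers above are single words; multi-token biomedical entities require the span-level decoding that benchmarks like BioLAMA handle separately.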
