Paper Title

An Investigation of Language Model Interpretability via Sentence Editing

Authors

Samuel Stevens, Yu Su

Abstract

Pre-trained language models (PLMs) like BERT are used for almost all language-related tasks, but interpreting their behavior remains a significant challenge, and many important questions are still largely unanswered. In this work, we re-purpose a sentence editing dataset, from which faithful, high-quality human rationales can be automatically extracted and compared with extracted model rationales, as a new testbed for interpretability. This enables us to conduct a systematic investigation into an array of questions regarding PLMs' interpretability, including the role of the pre-training procedure, a comparison of rationale extraction methods, and the behavior of different layers in the PLM. The investigation generates new insights: for example, contrary to common understanding, we find that attention weights correlate well with human rationales and work better than gradient-based saliency for extracting model rationales. Both the dataset and code are available at https://github.com/samuelstevens/sentence-editing-interpretability to facilitate future interpretability research.
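For concreteness, the sketch below illustrates the two kinds of model rationales the abstract compares: token scores taken from attention weights and token scores taken from gradient-based saliency. This is a minimal illustration using the Hugging Face transformers library, not the authors' implementation; the model name (bert-base-uncased), the helper functions attention_scores and gradient_saliency_scores, and the choice of the [CLS] attention row and [CLS]-norm scalar are assumptions made for the example.

```python
# Hypothetical sketch (not the authors' code): two simple ways to turn a
# BERT-style PLM's internals into token-level rationale scores, mirroring
# the attention-based and gradient-based saliency methods compared above.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: any BERT-style encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

def attention_scores(sentence: str, layer: int = -1) -> torch.Tensor:
    """Score each token by the attention it receives from [CLS] in one layer,
    averaged over heads."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # attentions[layer]: (batch, heads, seq, seq) -> average heads, take [CLS] row.
    return outputs.attentions[layer].mean(dim=1)[0, 0]  # shape: (seq_len,)

def gradient_saliency_scores(sentence: str) -> torch.Tensor:
    """Score each token by the gradient norm of a scalar output w.r.t. its
    input embedding (a simple saliency proxy; real setups use a task logit)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
    embeds.requires_grad_(True)
    outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
    outputs.last_hidden_state[0, 0].norm().backward()  # scalar from [CLS]
    return embeds.grad[0].norm(dim=-1)  # shape: (seq_len,)

if __name__ == "__main__":
    sentence = "The quick brown fox jumps over the lazy dog."
    tokens = tokenizer.tokenize(sentence, add_special_tokens=True)
    print(list(zip(tokens, attention_scores(sentence).tolist())))
    print(list(zip(tokens, gradient_saliency_scores(sentence).tolist())))
```

Either score vector can then be compared against binary human rationales over the same tokens (e.g., with a rank correlation), which is the kind of comparison the paper's testbed enables.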
