Paper Title

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Paper Authors

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

Paper Abstract

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.
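
To make the IOI task concrete: given a sentence like "When Mary and John went to the store, John gave a drink to", the model should predict the indirect object " Mary" rather than the repeated subject " John". The following is a minimal sketch (not from the paper's codebase) that checks this behavior in GPT-2 small via the Hugging Face transformers library, using the logit difference between the indirect-object name and the subject name, which is the kind of metric the paper uses to quantify task performance.

```python
# Minimal illustration of the IOI task on GPT-2 small.
# Assumes the Hugging Face `transformers` and `torch` packages are installed;
# this is a hypothetical sketch, not the authors' experimental code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # "gpt2" is GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Compare the logit of the indirect object (IO) with that of the subject (S).
# A positive difference means the model favors the correct completion " Mary".
io_id = tokenizer.encode(" Mary")[0]
s_id = tokenizer.encode(" John")[0]
print("logit diff (IO - S):", (logits[io_id] - logits[s_id]).item())
```

A positive logit difference on prompts of this form is the behavioral signature the paper's circuit of 26 attention heads is meant to explain; causal interventions (e.g., patching activations between contrastive prompts) are then used to attribute that difference to specific heads.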
