Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

Paper Title

Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

Authors

Julius Adebayo, Michael Muelly, Hal Abelson, Been Kim

Abstract

We investigate whether three types of post hoc model explanations (feature attribution, concept activation, and training point ranking) are effective for detecting a model's reliance on spurious signals in the training data. Specifically, we consider the scenario where the spurious signal to be detected is unknown, at test-time, to the user of the explanation method. We design an empirical methodology that uses semi-synthetic datasets along with pre-specified spurious artifacts to obtain models that verifiably rely on these spurious training signals. We then provide a suite of metrics that assess an explanation method's reliability for spurious signal detection under various conditions. We find that the post hoc explanation methods tested are ineffective when the spurious artifact is unknown at test-time, especially for non-visible artifacts like a background blur. Further, we find that feature attribution methods are susceptible to erroneously indicating dependence on spurious signals even when the model being explained does not rely on spurious artifacts. This finding casts doubt on the utility of these approaches, in the hands of a practitioner, for detecting a model's reliance on spurious signals.
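To make the idea of a semi-synthetic dataset with a pre-specified spurious artifact concrete, here is a minimal sketch. It is not the authors' pipeline: the toy images, the corner-patch artifact, and the helper name `add_spurious_patch` are all assumptions chosen for illustration. The artifact is stamped only on images of one class, so a model trained on the result could learn to rely on the patch instead of the image content.

```python
import random


def add_spurious_patch(images, labels, target_class, patch_size=3, patch_value=1.0):
    """Return copies of the images with a bright square stamped on target_class only.

    Hypothetical helper: the patch stands in for a pre-specified spurious artifact
    that is perfectly correlated with one label.
    """
    stamped = []
    for image, label in zip(images, labels):
        copy = [row[:] for row in image]  # copy rows so the clean dataset is untouched
        if label == target_class:
            for r in range(patch_size):
                for c in range(patch_size):
                    copy[r][c] = patch_value
        stamped.append(copy)
    return stamped


# Toy grayscale "images": 20 images of 8x8 pixels with values in [0, 0.5).
rng = random.Random(0)
images = [[[rng.random() * 0.5 for _ in range(8)] for _ in range(8)] for _ in range(20)]
labels = [i % 2 for i in range(20)]  # alternating binary labels

spurious_images = add_spurious_patch(images, labels, target_class=1)
```

Keeping the clean `images` alongside `spurious_images` is what lets one check, in a setup like this, whether an explanation flags the artifact only for a model that was actually trained on the contaminated data.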
