Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations

Paper Title

Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations

Authors

Cardozo, Shea; Montero, Gabriel Islas; Kazhdan, Dmitry; Dimanov, Botty; Wijaya, Maleakhi; Jamnik, Mateja; Lio, Pietro

Abstract

Recent work has suggested post-hoc explainers might be ineffective for detecting spurious correlations in Deep Neural Networks (DNNs). However, we show there are serious weaknesses with the existing evaluation frameworks for this setting. Previously proposed metrics are extremely difficult to interpret and are not directly comparable between explainer methods. To alleviate these constraints, we propose a new evaluation methodology, Explainer Divergence Scores (EDS), grounded in an information theory approach to evaluate explainers. EDS is easy to interpret and naturally comparable across explainers. We use our methodology to compare the detection performance of three different explainers (feature attribution methods, influential examples, and concept extraction) on two different image datasets. We discover post-hoc explainers often contain substantial information about a DNN's dependence on spurious artifacts, but in ways often imperceptible to human users. This suggests the need for new techniques that can use this information to better detect a DNN's reliance on spurious correlations.
