Paper Title

Rethinking Stability for Attribution-based Explanations

Paper Authors

Chirag Agarwal, Nari Johnson, Martin Pawelczyk, Satyapriya Krishna, Eshika Saxena, Marinka Zitnik, Himabindu Lakkaraju

Paper Abstract

As attribution-based explanation methods are increasingly used to establish model trustworthiness in high-stakes situations, it is critical to ensure that these explanations are stable, e.g., robust to infinitesimal perturbations to an input. However, previous works have shown that state-of-the-art explanation methods generate unstable explanations. Here, we introduce metrics to quantify the stability of an explanation and show that several popular explanation methods are unstable. In particular, we propose new Relative Stability metrics that measure the change in output explanation with respect to change in input, model representation, or output of the underlying predictor. Finally, our experimental evaluation with three real-world datasets demonstrates interesting insights for seven explanation methods and different stability metrics.
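For intuition, the sketch below illustrates the ratio-of-changes idea behind a relative stability metric as described in the abstract: the relative change in an attribution vector divided by the relative change in the input that induced it. The function name, norm choice, and normalization here are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def relative_input_stability(x, x_perturbed, exp_x, exp_x_perturbed, p=2, eps_min=1e-6):
    """Illustrative relative-stability-style score (a sketch, not the paper's exact metric):
    relative change in the explanation divided by relative change in the input.
    Larger values indicate a less stable explanation for this perturbation."""
    # Relative (percent) change in the attribution vector caused by the perturbation.
    exp_change = np.linalg.norm((exp_x - exp_x_perturbed) / (exp_x + eps_min), ord=p)
    # Relative change in the input; floored by eps_min to avoid division by zero.
    input_change = max(np.linalg.norm((x - x_perturbed) / (x + eps_min), ord=p), eps_min)
    return exp_change / input_change
```

In practice one would evaluate this ratio over a neighborhood of perturbed inputs and report the worst case, and, following the abstract, analogous ratios can be formed with the change in the model's intermediate representation or its predicted output in the denominator.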
