Title


eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation

Authors

Lu Yu, Wei Xiang, Juan Fang, Yi-Ping Phoebe Chen, Lianhua Chi

Abstract


Recently, vision transformer models have become prominent models for a range of vision tasks. These models, however, are usually opaque with weak feature interpretability. Moreover, there is no method currently built for an intrinsically interpretable transformer, which is able to explain its reasoning process and provide a faithful explanation. To close these crucial gaps, we propose a novel vision transformer dubbed the eXplainable Vision Transformer (eX-ViT), an intrinsically interpretable transformer model that is able to jointly discover robust interpretable features and perform the prediction. Specifically, eX-ViT is composed of the Explainable Multi-Head Attention (E-MHA) module, the Attribute-guided Explainer (AttE) module and the self-supervised attribute-guided loss. The E-MHA tailors explainable attention weights that are able to learn semantically interpretable representations from local patches in terms of model decisions with noise robustness. Meanwhile, AttE is proposed to encode discriminative attribute features for the target object through diverse attribute discovery, which constitutes faithful evidence for the model's predictions. In addition, a self-supervised attribute-guided loss is developed for our eX-ViT, which aims at learning enhanced representations through the attribute discriminability mechanism and attribute diversity mechanism, to localize diverse and discriminative attributes and generate more robust explanations. As a result, we can uncover faithful and robust interpretations with diverse attributes through the proposed eX-ViT.
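For context, the multi-head self-attention that E-MHA builds on can be sketched as below. This is a minimal NumPy sketch of *standard* multi-head attention over patch embeddings, not the paper's E-MHA: the abstract does not give E-MHA's formulation, so the explainability constraints and noise-robustness mechanism are deliberately omitted, and all function and variable names here are illustrative assumptions. The per-head attention maps it returns are the kind of patch-level weights that an interpretable variant would constrain and expose as explanations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, num_heads):
    """Plain multi-head self-attention over patch embeddings.

    X: (num_patches, dim); Wq, Wk, Wv: (dim, dim) projection matrices.
    Returns the attended features and the per-head (num_patches x
    num_patches) attention maps. E-MHA in eX-ViT additionally tailors
    these attention weights for interpretability, which this sketch omits.
    """
    n, d = X.shape
    dh = d // num_heads  # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs, attn_maps = [], []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        # Scaled dot-product attention for head h.
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))
        attn_maps.append(A)
        outputs.append(A @ V[:, s])
    return np.concatenate(outputs, axis=-1), attn_maps
```

Each row of an attention map is a probability distribution over patches, which is why such maps are a natural substrate for patch-level explanations.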
