Paper Title
Understanding and Enhancing Robustness of Concept-based Models
Paper Authors
Paper Abstract
The rising use of deep neural networks for decision making in critical applications such as medical diagnosis and financial analysis has raised concerns about their reliability and trustworthiness. As automated systems become more mainstream, it is important that their decisions be transparent, reliable, and understandable to humans to foster trust and confidence. To this end, concept-based models such as Concept Bottleneck Models (CBMs) and Self-Explaining Neural Networks (SENN) have been proposed, which constrain the latent space of a model to represent high-level concepts that are easily understood by domain experts. Although concept-based models are a promising approach to increasing both explainability and reliability, it remains to be shown whether they are robust and output consistent concepts under systematic perturbations of their inputs. To better understand the behavior of concept-based models on curated malicious samples, in this paper we study their robustness to adversarial perturbations: imperceptible changes to the input data crafted by an attacker to fool a well-trained concept-based model. Specifically, we first propose and analyze different malicious attacks to evaluate the security vulnerability of concept-based models. We then propose a potential general defense mechanism based on adversarial training to increase the robustness of these systems against the proposed attacks. Extensive experiments on one synthetic and two real-world datasets demonstrate the effectiveness of the proposed attacks and the defense approach.
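The abstract does not spell out the paper's specific attacks or defense, so the sketch below is only a generic illustration of the two ingredients it mentions: an imperceptible adversarial perturbation against a concept predictor (here a one-step FGSM attack) and an adversarial-training update that retrains on such perturbed inputs. The names `fgsm_concept_attack`, `adversarial_training_step`, `concept_model`, and `epsilon` are assumptions for this sketch, not the authors' actual method.

```python
import torch
import torch.nn.functional as F

def fgsm_concept_attack(concept_model, x, concepts, epsilon=0.03):
    """One-step FGSM perturbation that pushes the model's predicted
    concepts away from the ground-truth concept labels.

    Assumes `concept_model` maps an input batch to concept logits and
    `concepts` holds the binary concept labels (same shape, float).
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.binary_cross_entropy_with_logits(concept_model(x_adv), concepts)
    loss.backward()
    with torch.no_grad():
        # Signed-gradient step, clipped to keep the input in [0, 1].
        x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_step(concept_model, optimizer, x, concepts, epsilon=0.03):
    """Minimal adversarial-training update: craft perturbed inputs on the
    fly and fit the concept predictor on them instead of the clean batch."""
    x_adv = fgsm_concept_attack(concept_model, x, concepts, epsilon)
    optimizer.zero_grad()
    loss = F.binary_cross_entropy_with_logits(concept_model(x_adv), concepts)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full CBM, an analogous attack could instead target the downstream label predicted from the concepts; the structure of the loop stays the same, only the loss term changes.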