Paper Title
Defense against Backdoor Attacks via Identifying and Purifying Bad Neurons
Paper Authors
Paper Abstract
The opacity of neural networks leads to their vulnerability to backdoor attacks, in which the hidden attention of infected neurons is triggered to override normal predictions with attacker-chosen ones. In this paper, we propose a novel backdoor defense method that marks and purifies the infected neurons in backdoored neural networks. Specifically, we first define a new metric, called benign salience. By incorporating the first-order gradient to retain the connections between neurons, benign salience identifies infected neurons with higher accuracy than the metrics commonly used in backdoor defense. We then propose a new Adaptive Regularization (AR) mechanism to assist in purifying the identified infected neurons via fine-tuning. Owing to its ability to adapt to parameters of different magnitudes, AR provides faster and more stable convergence than the regularization mechanisms commonly used for neuron purifying. Extensive experimental results demonstrate that our method erases the backdoor from neural networks with negligible performance degradation.
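The abstract outlines two components: a benign-salience score for flagging infected neurons and an Adaptive Regularization (AR) mechanism for purifying them during fine-tuning. Below is a minimal PyTorch-style sketch of how such a pipeline might look. The function names, the first-order Taylor approximation used for benign salience, and the magnitude-normalized penalty standing in for AR are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def benign_salience(model, clean_loader, loss_fn, device="cpu"):
    """Score each parameter by an accumulated first-order Taylor term
    |w * dL/dw| on clean data (a hypothetical stand-in for the paper's
    benign-salience metric). Low scores mark weights that contribute
    little to benign behavior and are therefore suspect."""
    model.to(device).train()
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in clean_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # First-order estimate of how much removing this weight
                # would change the clean-data loss.
                scores[n] += (p.detach() * p.grad.detach()).abs()
    return scores

def purify_step(model, scores, clean_batch, loss_fn, optimizer,
                quantile=0.05, reg_strength=1e-3):
    """One fine-tuning step with an adaptive penalty that pushes the
    lowest-salience ('suspect') weights toward zero. Normalizing the
    penalty by each weight's own magnitude is an assumed approximation
    of the AR mechanism's adaptation to parameter scale."""
    x, y = clean_batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    for n, p in model.named_parameters():
        thresh = torch.quantile(scores[n].flatten().float(), quantile)
        suspect = (scores[n] <= thresh).float()
        # Dividing by (|w| + eps) makes small and large suspect weights
        # shrink at comparable relative rates.
        loss = loss + reg_strength * (
            suspect * p.abs() / (p.detach().abs() + 1e-8)).sum()
    loss.backward()
    optimizer.step()
```

In this sketch, a defender with a small clean dataset would first call benign_salience once to rank neurons, then run purify_step over several fine-tuning epochs; the quantile and reg_strength values shown are placeholders rather than values reported by the paper.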