Paper Title


An Adaptive Black-box Defense against Trojan Attacks (TrojDef)

Paper Authors

Guanxiong Liu, Abdallah Khreishah, Fatima Sharadgah, Issa Khalil

Paper Abstract


A Trojan backdoor is a poisoning attack against Neural Network (NN) classifiers in which adversaries try to exploit the (highly desirable) model-reuse property to implant Trojans into model parameters for backdoor breaches through a poisoned training process. Most of the proposed defenses against Trojan attacks assume a white-box setup, in which the defender either has access to the inner state of the NN or is able to run back-propagation through it. In this work, we propose a more practical black-box defense, dubbed TrojDef, which can only run a forward pass of the NN. TrojDef tries to identify and filter out Trojan inputs (i.e., inputs augmented with the Trojan trigger) by monitoring the changes in the prediction confidence when the input is repeatedly perturbed by random noise. We derive a function based on the prediction outputs, called the prediction confidence bound, to decide whether the input example is Trojan or not. The intuition is that Trojan inputs are more stable, as the misclassification depends only on the trigger, while benign inputs suffer when augmented with noise due to the perturbation of the classification features. Through mathematical analysis, we show that if the attacker is perfect in injecting the backdoor, the Trojan-infected model will be trained to learn the appropriate prediction confidence bound, which is used to distinguish Trojan and benign inputs under arbitrary perturbations. However, because the attacker might not be perfect in injecting the backdoor, we introduce a nonlinear transform to the prediction confidence bound to improve the detection accuracy in practical settings. Extensive empirical evaluations show that TrojDef significantly outperforms the state-of-the-art defenses and is highly stable under different settings, even when the classifier architecture, the training process, or the hyper-parameters change.
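The core idea of the defense can be illustrated with a minimal sketch: repeatedly perturb the input with random noise, query the black-box classifier (forward pass only), and measure how stable the prediction confidence remains. The function name, parameters, and the simple mean-confidence threshold below are illustrative assumptions, not the paper's actual prediction confidence bound or its nonlinear transform.

```python
import numpy as np

def trojdef_style_detect(predict, x, n_trials=100, noise_std=0.1, bound=0.5):
    """Hedged sketch of perturbation-based Trojan-input detection.

    `predict` is a black-box forward pass returning a probability
    vector; we only query it and never back-propagate through it.
    Parameter names and the threshold `bound` are illustrative.
    """
    # Class predicted on the clean input.
    base_label = int(np.argmax(predict(x)))

    # Repeatedly perturb the input with random Gaussian noise and
    # record the confidence assigned to the original prediction.
    confs = []
    for _ in range(n_trials):
        noisy = x + np.random.normal(0.0, noise_std, size=x.shape)
        confs.append(float(predict(noisy)[base_label]))

    # Intuition from the abstract: Trojan inputs stay confidently
    # (mis)classified because the trigger survives the noise, while
    # benign inputs lose confidence as their features are perturbed.
    stability = float(np.mean(confs))
    return stability >= bound  # True => flag as likely Trojan input
```

In practice, the paper derives the decision function (the prediction confidence bound) from the prediction outputs and applies a nonlinear transform to it; the fixed threshold here stands in for that machinery only to show the query-perturb-compare loop.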
