Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck

Paper Title

Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck

Authors

Junho Kim, Byung-Kwan Lee, Yong Man Ro

Abstract

Adversarial examples, generated by carefully crafted perturbations, have attracted considerable attention in the research community. Recent works have argued that the existence of robust and non-robust features is a primary cause of adversarial examples, and have investigated their internal interactions in the feature space. In this paper, we propose a way of explicitly distilling the feature representation into robust and non-robust features using the Information Bottleneck. Specifically, we inject noise variation into each feature unit and evaluate the information flow in the feature representation to dichotomize feature units as either robust or non-robust, based on the noise variation magnitude. Through comprehensive experiments, we demonstrate that the distilled features are highly correlated with adversarial prediction and that they carry human-perceptible semantic information by themselves. Furthermore, we present an attack mechanism that intensifies the gradient of the non-robust features directly related to the model prediction, and we validate its effectiveness in breaking model robustness.
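
The dichotomizing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the per-unit noise magnitudes are obtained or which side of the split counts as robust, so the hard-coded `sigmas`, the `threshold`, and the rule "larger tolerated noise means robust" are all assumptions made for illustration, as are the helper names `inject_noise`, `dichotomize`, and `mask_features`.

```python
import random


def inject_noise(features, sigmas, rng):
    """Add zero-mean Gaussian noise to each feature unit, one scale per unit."""
    return [f + rng.gauss(0.0, s) for f, s in zip(features, sigmas)]


def dichotomize(sigmas, threshold):
    """Split feature-unit indices by noise magnitude.

    ASSUMPTION (not stated in the abstract): units whose noise scale is at
    or above `threshold` are labeled robust, the remaining units non-robust.
    """
    robust = [i for i, s in enumerate(sigmas) if s >= threshold]
    non_robust = [i for i, s in enumerate(sigmas) if s < threshold]
    return robust, non_robust


def mask_features(features, keep):
    """Keep only the units listed in `keep`; zero out all others."""
    keep = set(keep)
    return [f if i in keep else 0.0 for i, f in enumerate(features)]


# Hypothetical feature vector and per-unit noise scales (placeholder values).
features = [0.8, -1.2, 0.3, 2.0]
sigmas = [0.05, 1.5, 0.9, 0.01]

robust_idx, non_robust_idx = dichotomize(sigmas, threshold=0.5)
robust_part = mask_features(features, robust_idx)
non_robust_part = mask_features(features, non_robust_idx)

# Noise injection with a seeded generator so the run is repeatable.
noisy = inject_noise(features, sigmas, random.Random(0))
```

In this sketch the two masked vectors sum back to the original feature vector, so each part can be passed on separately to see what it contributes on its own.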
