Paper Title
Searching for the Essence of Adversarial Perturbations
Paper Authors
Paper Abstract
Neural networks have demonstrated state-of-the-art performance in various machine learning fields. However, the introduction of malicious perturbations in input data, known as adversarial examples, has been shown to deceive neural network predictions. This poses potential risks for real-world applications such as autonomous driving and text identification. In order to mitigate these risks, a comprehensive understanding of the mechanisms underlying adversarial examples is essential. In this study, we demonstrate that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's incorrect prediction, in contrast to the widely held belief that human-unidentifiable characteristics play a critical role in fooling a network. This concept of human-recognizable characteristics enables us to explain key features of adversarial perturbations, including their existence, their transferability among different neural networks, and the increased interpretability brought by adversarial training. We also uncover two unique properties of adversarial perturbations that deceive neural networks: masking and generation. Additionally, a special class, the complementary class, is identified when neural networks classify input images. The presence of human-recognizable information in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to the development of techniques for detecting and defending against adversarial attacks.
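For context, adversarial perturbations of the kind discussed in this abstract are commonly constructed with gradient-based attacks such as the fast gradient sign method (FGSM). The sketch below is a minimal illustrative example only, not the method proposed or analyzed in the paper; the `model`, `image`, and `label` variables are hypothetical placeholders.

```python
# Minimal FGSM-style sketch of how an additive adversarial perturbation
# can be generated for an image classifier (illustrative, assumes PyTorch).
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, image, label, epsilon=8 / 255):
    """Return a small additive perturbation that increases the model's loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the sign direction of the gradient, bounded by epsilon.
    return (epsilon * image.grad.sign()).detach()

# Example usage (model, image, label are placeholders):
# adversarial_image = torch.clamp(image + fgsm_perturbation(model, image, label), 0.0, 1.0)
```

Visualizing the returned perturbation (rather than only the perturbed image) is the kind of analysis the abstract alludes to when it argues that perturbations carry human-recognizable information.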