Paper Title

MORA: Improving Ensemble Robustness Evaluation with Model-Reweighing Attack

Paper Authors

Yunrui Yu, Xitong Gao, Cheng-Zhong Xu

Paper Abstract

Adversarial attacks can deceive neural networks by adding tiny perturbations to their input data. Ensemble defenses, which are trained to minimize attack transferability among sub-models, offer a promising research direction to improve robustness against such attacks while maintaining high accuracy on natural inputs. We discover, however, that recent state-of-the-art (SOTA) adversarial attack strategies cannot reliably evaluate ensemble defenses, sizably overestimating their robustness. This paper identifies two factors that contribute to this behavior. First, these defenses form ensembles that are notably difficult for existing gradient-based methods to attack, due to gradient obfuscation. Second, ensemble defenses diversify sub-model gradients, presenting the challenge of defeating all sub-models simultaneously; simply summing their contributions may counteract the overall attack objective. Yet we observe that the ensemble may still be fooled despite most sub-models being correct. We therefore introduce MORA, a model-reweighing attack that steers adversarial example synthesis by reweighing the importance of sub-model gradients. MORA finds that recent ensemble defenses all exhibit varying degrees of overestimated robustness. Compared against recent SOTA white-box attacks, it converges orders of magnitude faster while achieving higher attack success rates across all ensemble models examined, under three different ensemble modes (i.e., ensembling by softmax, voting, or logits). In particular, most ensemble defenses exhibit near or exactly 0% robustness against MORA with $\ell^\infty$ perturbation within 0.02 on CIFAR-10, and 0.01 on CIFAR-100. We make MORA open source with reproducible results and pre-trained models, and provide a leaderboard of ensemble defenses under various attack strategies.
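The core idea described in the abstract, reweighing per-sub-model gradient contributions instead of summing them equally, can be sketched as a PGD-style loop. The code below is a minimal illustrative sketch and not MORA's actual algorithm: the function name `reweighed_ensemble_attack`, the hyperparameter defaults, and the reweighing rule shown (upweighting sub-models that still classify the input correctly, with a small floor so no sub-model is ignored) are all assumptions made for illustration.

import torch
import torch.nn.functional as F

def reweighed_ensemble_attack(sub_models, x, y, eps=0.02, alpha=0.005, steps=50):
    """PGD-style l_inf attack with reweighed sub-model gradient contributions.

    Illustrative sketch only: the reweighing rule below is an assumption,
    not the rule used by MORA.
    """
    # Random start inside the l_inf ball around the clean input.
    x_adv = (x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        losses, weights = [], []
        for m in sub_models:
            logits = m(x_adv)
            # Per-example cross-entropy loss for this sub-model.
            losses.append(F.cross_entropy(logits, y, reduction="none"))
            # Assumed reweighing rule: focus on sub-models that still predict correctly,
            # with a small floor so no sub-model's gradient is ignored entirely.
            correct = (logits.argmax(dim=1) == y).float()
            weights.append(correct + 1e-2)
        losses = torch.stack(losses)             # shape: [num_models, batch]
        weights = torch.stack(weights).detach()  # shape: [num_models, batch]
        weights = weights / weights.sum(dim=0, keepdim=True)

        # Weighted sum of per-sub-model losses instead of a plain sum.
        loss = (weights * losses).sum()
        grad = torch.autograd.grad(loss, x_adv)[0]

        # Signed gradient ascent step, projected back into the l_inf ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)

    return x_adv.detach()

In this sketch, sub-models that the current adversarial example already fools receive only the small floor weight, so the attack direction is dominated by the sub-models still resisting; with uniform weights the snippet reduces to ordinary PGD on the averaged loss.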
