Deeper Insights into the Robustness of ViTs towards Common Corruptions

Paper Title

Deeper Insights into the Robustness of ViTs towards Common Corruptions

Authors

Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu-Gang Jiang

Abstract

With Vision Transformers (ViTs) making great advances in a variety of computer vision tasks, recent literature has proposed many variants of vanilla ViTs to achieve better efficiency and efficacy. However, it remains unclear how their unique architectures impact robustness towards common corruptions. In this paper, we make the first attempt to probe into the robustness gap among ViT variants and explore the underlying designs that are essential for robustness. Through extensive and rigorous benchmarking, we demonstrate that simple architecture designs such as overlapping patch embedding and a convolutional feed-forward network (FFN) can promote the robustness of ViTs. Moreover, since training ViTs relies heavily on data augmentation, it is worth investigating whether previous CNN-based augmentation strategies targeted at robustness remain useful. We explore different data augmentations on ViTs and verify that adversarial noise training is powerful while Fourier-domain augmentation is inferior. Based on these findings, we introduce a novel conditional method of generating dynamic augmentation parameters conditioned on input images, offering state-of-the-art robustness towards common corruptions.
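
To make the term "overlapping patch embedding" concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It only illustrates the patch-slicing step: with a stride smaller than the patch size, neighbouring patches share pixels, whereas vanilla ViT patches do not. The helper names (patch_grid, extract_patches), the toy 8x8 image, and the kernel/stride/padding values are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
def patch_grid(size, kernel, stride, padding=0):
    """Patch positions along one axis (standard convolution output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def extract_patches(image, kernel, stride):
    """Slice a 2D list-of-lists image into kernel x kernel patches (no padding)."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - kernel + 1, stride):
        for left in range(0, w - kernel + 1, stride):
            patches.append([row[left:left + kernel] for row in image[top:top + kernel]])
    return patches


def shared_pixels(patch_a, patch_b):
    """Count pixel values present in both patches (values are unique per position here)."""
    flat_a = {v for row in patch_a for v in row}
    flat_b = {v for row in patch_b for v in row}
    return len(flat_a & flat_b)


# Toy 8x8 "image" whose pixel value encodes its position (hypothetical data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]

# Non-overlapping, vanilla-ViT style: stride equals patch size.
plain = extract_patches(image, kernel=4, stride=4)

# Overlapping: stride smaller than patch size, so neighbours share pixels.
overlapping = extract_patches(image, kernel=4, stride=2)

print(len(plain), shared_pixels(plain[0], plain[1]))
print(len(overlapping), shared_pixels(overlapping[0], overlapping[1]))

# Token-grid sizes for a 224-pixel side under two assumed configurations.
print(patch_grid(224, kernel=16, stride=16))
print(patch_grid(224, kernel=7, stride=4, padding=3))
```

In a real model each patch would then be linearly projected to a token, typically by a strided convolution. The sketch stops before that step because the projection is independent of whether the patches overlap.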
