Paper Title

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

Authors

Tan Yu, Ping Li

Abstract

The formidable accomplishments of Transformers in natural language processing have motivated researchers in the computer vision community to build Vision Transformers. Compared with Convolutional Neural Networks (CNNs), a Vision Transformer has a larger receptive field, which is capable of characterizing long-range dependencies. Nevertheless, the large receptive field of a Vision Transformer comes with a huge computational cost. To boost efficiency, window-based Vision Transformers emerged. They crop an image into several local windows, and self-attention is conducted within each window. To bring back the global receptive field, window-based Vision Transformers devote considerable effort to achieving cross-window communication through several sophisticated operations. In this work, we examine the necessity of the key design element of the Swin Transformer: the shifted window partitioning. We discover that a simple depthwise convolution is sufficient for achieving effective cross-window communication. Specifically, in the presence of the depthwise convolution, the shifted window configuration in the Swin Transformer does not lead to an additional performance improvement. Thus, we degenerate the Swin Transformer to a plain Window-based (Win) Transformer by discarding the sophisticated shifted window partitioning. The proposed Win Transformer is conceptually simpler and easier to implement than the Swin Transformer. Meanwhile, our Win Transformer consistently achieves superior performance to the Swin Transformer on multiple computer vision tasks, including image recognition, semantic segmentation, and object detection.
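To make the two ingredients of the abstract concrete, here is a minimal NumPy sketch: features are partitioned into non-overlapping windows, self-attention runs independently inside each window, and a 3x3 depthwise convolution mixes information across window boundaries. All function names are illustrative, the attention is a naive single-head version without learned projections, and the kernel is supplied directly rather than learned; this is a sketch of the mechanism, not the paper's implementation.

```python
import numpy as np

def window_partition(x, ws):
    """Crop an (H, W, C) feature map into (num_windows, ws*ws, C) local windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_self_attention(windows):
    """Naive single-head self-attention, computed independently per window."""
    out = np.empty_like(windows)
    for i, w in enumerate(windows):                       # w: (ws*ws, C)
        scores = w @ w.T / np.sqrt(w.shape[-1])           # scaled dot-product
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)          # softmax over keys
        out[i] = attn @ w
    return out

def depthwise_conv3x3(x, kernels):
    """3x3 depthwise conv (one 3x3 filter per channel), zero padding.

    Because the 3x3 neighborhood straddles window borders, this single cheap
    op is what carries information across windows in the Win Transformer view.
    """
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W, :] * kernels[i, j]
    return out

# Demo: an 8x8 map with 4 channels, split into four 4x4 windows.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
wins = window_partition(x, ws=4)          # (4, 16, 4)
attn_out = window_self_attention(wins)    # same shape; no cross-window mixing
identity = np.zeros((3, 3, 4))
identity[1, 1] = 1.0                      # identity depthwise kernel
```

Note that the attention step alone never lets a pixel in one window see another window; only the depthwise convolution (or, in Swin, the shifted re-partitioning) crosses the border, which is exactly the substitution the paper argues for.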
