Exploring Plain Vision Transformer Backbones for Object Detection

Paper title

Exploring Plain Vision Transformer Backbones for Object Detection

Authors

Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
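
Point (ii) of the abstract can be illustrated with a small sketch of non-shifted window attention plus a few cross-window blocks. This is only an illustration under stated assumptions, not the ViTDet implementation: the function names, the grid and window sizes, and the choice to place the global blocks at the end of equal-sized groups of blocks are all hypothetical and are not given in the abstract.

```python
def window_id(row, col, grid_w, window):
    """Index of the fixed, non-overlapping window that contains token (row, col)."""
    windows_per_row = -(-grid_w // window)  # ceiling division
    return (row // window) * windows_per_row + (col // window)


def can_attend(a, b, grid_w, window, global_block):
    """True if token a may attend to token b in the current block.

    In a window block, attention stays inside one window (no shifting).
    In a cross-window (global) block, every token can see every other token.
    """
    if global_block:
        return True
    return window_id(*a, grid_w, window) == window_id(*b, grid_w, window)


def propagation_blocks(depth, num_global):
    """Block indices that use global attention.

    Assumption: the blocks are split into `num_global` equal groups and the
    last block of each group propagates information across windows.
    Requires depth to be divisible by num_global.
    """
    step = depth // num_global
    return [step * (i + 1) - 1 for i in range(num_global)]


# Hypothetical setup: an 8x8 token grid, 4x4 windows, 12 blocks, 4 global blocks.
global_blocks = propagation_blocks(12, 4)
same_window = can_attend((0, 0), (3, 3), grid_w=8, window=4, global_block=False)
other_window = can_attend((0, 0), (0, 4), grid_w=8, window=4, global_block=False)
```

In this sketch most blocks restrict attention to fixed windows, which keeps the cost low on high-resolution inputs, while the few global blocks are the only place where information crosses window boundaries.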
