论文标题
有效的无解码器对象检测变压器
Efficient Decoder-free Object Detection with Transformers
论文作者
论文摘要
视觉变压器(VIT)正在改变对象检测方法的景观。 VIT在检测中的自然使用是用基于变压器的主链替换基于CNN的主链,该骨干骨头很直接有效,其价格为推理带来了相当大的计算负担。更微妙的用法是Detr家族,它消除了对物体检测中许多手工设计的组件的需求,但引入了一个解码器,要求超长时间的收敛时间。结果,基于变压器的对象检测不能在大规模应用中占上风。为了克服这些问题,我们提出了一种新型的无解码器基于完全变压器(DFFT)对象检测器,这是第一次在训练和推理阶段达到高效率。我们通过居中两个入口点来简化反对检测到仅编码单层锚点的密集预测问题:1)消除训练感知的解码器,并利用两个强的编码器来保留单层特征映射预测的准确性; 2)探索具有有限的计算资源的检测任务的低级语义特征。特别是,我们设计了一种新型的轻巧的面向检测的变压器主链,该主链有效地捕获了基于良好的消融研究的丰富语义的低级特征。 MS Coco基准测试的广泛实验表明,DFFT_SMALL的表现优于2.5%AP,计算成本降低28%,而$ 10 $ x的培训少于10美元以上。与基于尖端的锚定检测器视网膜相比,DFFT_SMALL获得了超过5.5%的AP增益,同时降低了70%的计算成本。
Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, with the price of bringing considerable computation burden for inference. More subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder demanding an extra-long time to converge. As a result, transformer-based object detection can not prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages, for the first time. We simplify objection detection into an encoder-only single-level anchor-based dense prediction problem by centering around two entry points: 1) Eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) Explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with 28% computation cost reduction and more than $10$x fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% computation cost.