Paper Title
An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers
Paper Authors
Paper Abstract
Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks. Follow-up research adapted similar methods, such as masked image modeling in vision transformers, and demonstrated improvements in the image classification task. Such simple self-supervised methods have not been exhaustively studied for object detection transformers (DETR, Deformable DETR), as their transformer encoder modules take input in the feature space extracted by a convolutional neural network (CNN) rather than in the image space, as in general vision transformers. However, CNN feature maps still preserve spatial relationships, and we exploit this property to design self-supervised learning approaches for training the encoder of object detection transformers in pretraining and multi-task learning settings. We explore common self-supervised methods based on image reconstruction, masked image modeling, and jigsaw puzzle solving. Preliminary experiments on the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both the pretraining and multi-task learning settings; nonetheless, a similar improvement is not observed in the case of multi-task learning with Deformable DETR. The code for our experiments with DETR and Deformable DETR is available at https://github.com/gokulkarthik/detr and https://github.com/gokulkarthik/Deformable-DETR respectively.
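The key property the abstract relies on is that a CNN feature map keeps the spatial layout of the image, so pixel-space pretext tasks can be transplanted into feature space. Below is a minimal NumPy sketch of what a masked-feature-modeling pretext task could look like at that stage; the function name, masking scheme, and shapes are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def mask_feature_map(feat, mask_ratio=0.5, seed=0):
    """Randomly zero out spatial locations of a CNN feature map.

    feat: array of shape (C, H, W), the backbone output.
    Returns (masked_feat, mask) where mask is an (H, W) boolean
    array with True at the masked locations. This mirrors masked
    image modeling, but applied in feature space rather than pixel
    space (hypothetical sketch, not the authors' implementation).
    """
    c, h, w = feat.shape
    rng = np.random.default_rng(seed)
    n_mask = int(mask_ratio * h * w)           # number of positions to hide
    chosen = rng.permutation(h * w)[:n_mask]   # random spatial indices
    mask = np.zeros(h * w, dtype=bool)
    mask[chosen] = True
    mask = mask.reshape(h, w)
    masked = feat.copy()
    masked[:, mask] = 0.0                      # hide all channels at masked cells
    return masked, mask

# In a pretraining loop, the transformer encoder would consume `masked`
# (flattened to a token sequence), and a light decoder head would be
# trained to reconstruct `feat` at the masked locations, e.g. with an
# L2 loss: ((pred - feat)[:, mask] ** 2).mean()
```

An image-reconstruction or jigsaw variant would follow the same pattern, changing only the corruption applied to the feature map and the target of the auxiliary head.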