Paper Title
Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection
Paper Authors
Paper Abstract
Modern object detectors have taken advantage of backbone networks pre-trained on large-scale datasets. Beyond the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path which is ``fully pre-trained'' so that detectors' generalization capacity is maximized. The essential differences between imTED and the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. Such designs not only significantly reduce the number of randomly initialized parameters but also unify detector training with representation learning by design. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by $\sim$2.4 AP. Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED.
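To make the core idea concrete, the minimal PyTorch sketch below illustrates point (1): RoI features are processed by migrated, pre-trained decoder blocks serving as the detector head, rather than by a randomly initialized FPN-plus-convolution head, so only the small classification and regression layers start from scratch. This is a rough illustration under assumed shapes, not the authors' implementation (see the repository above); the class names `TransformerBlock` and `ImTEDHead`, the dimensions, and the token pooling are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard pre-norm transformer block (stand-in for a pre-trained ViT/MAE block)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ImTEDHead(nn.Module):
    """Hypothetical detector head: RoI tokens go through *migrated* decoder blocks
    instead of a randomly initialized head; only cls/reg layers are new."""
    def __init__(self, decoder_blocks, dim=256, num_classes=80):
        super().__init__()
        self.decoder = nn.Sequential(*decoder_blocks)  # pre-trained weights migrated, not re-initialized
        self.cls = nn.Linear(dim, num_classes + 1)     # trained from scratch
        self.reg = nn.Linear(dim, 4)                   # trained from scratch

    def forward(self, roi_tokens):                     # (num_rois, tokens, dim)
        x = self.decoder(roi_tokens).mean(dim=1)       # pool decoded RoI tokens
        return self.cls(x), self.reg(x)

# Usage sketch: pretend these blocks carry pre-trained encoder-decoder weights.
encoder = nn.Sequential(*[TransformerBlock() for _ in range(4)])  # backbone (pre-trained encoder)
head = ImTEDHead([TransformerBlock() for _ in range(2)])          # decoder migrated into the head
roi_tokens = torch.randn(8, 49, 256)                              # 8 RoIs, 7x7 tokens each
scores, boxes = head(roi_tokens)
print(scores.shape, boxes.shape)                                  # torch.Size([8, 81]) torch.Size([8, 4])
```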