Paper title
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
Paper authors
Paper abstract
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources via a novel self-training framework, which allows the proposed detector to be trained on a large corpus of noisy, uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches while using fewer additional training images and zero manual annotations. Project page with code: https://fcjian.github.io/promptdet.
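The classification step in contribution (i) can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each class-agnostic proposal has already been pooled into a feature vector and each category name has already been encoded into a text embedding of the same dimension. The function names, the toy 3-dimensional vectors, and the `temperature` value are all hypothetical choices made for illustration. Each proposal is assigned the category whose text embedding has the highest cosine similarity with its feature.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_proposals(region_features, text_embeddings, class_names, temperature=0.01):
    """Label each proposal with the category whose text embedding is most similar.

    `temperature` is an assumed hyperparameter for this sketch, not a value
    taken from the abstract.
    """
    text_unit = [l2_normalize(t) for t in text_embeddings]
    results = []
    for feature in region_features:
        unit = l2_normalize(feature)
        # Cosine similarity between the proposal and every category embedding.
        sims = [sum(a * b for a, b in zip(unit, t)) for t in text_unit]
        probs = softmax([s / temperature for s in sims])
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs[best]))
    return results


# Toy stand-ins for text-encoder outputs and pooled proposal features.
class_names = ["cat", "dog"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

predictions = classify_proposals(region_features, text_embeddings, class_names)
for label, prob in predictions:
    print(label, round(prob, 3))
```

Because the classifier weights are just text embeddings, adding a novel category in this sketch only requires appending one more name and embedding; no detector weights change. That property is what makes the open-vocabulary setting possible.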