Paper Title

Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

Paper Authors

Mengmeng Xu, Yanghao Li, Cheng-Yang Fu, Bernard Ghanem, Tao Xiang, Juan-Manuel Perez-Rua

Paper Abstract

This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. Then, we directly tackle such biases at both the frame and object-set levels. Concretely, our method solves these issues by expanding limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that allows for object-proposal set context to be considered while incorporating query information. We name our module Conditioned Contextual Transformer or CocoFormer. Our experiments show the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. Thus, we are able to improve the frame-level detection performance from 26.28% to 31.26% in AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks, respectively, in the 2nd Ego4D challenge. In addition to this, we showcase the relevance of our proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA results. Our code is available at https://github.com/facebookresearch/vq2d_cvpr.
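
The abstract names two mechanisms: dynamically dropping object proposals during training, and a transformer module (CocoFormer) that scores proposals against a visual query while attending over the whole proposal set. The following is a minimal PyTorch sketch of those two ideas only; the class and function names, layer sizes, the additive query conditioning, and the mask-based dropout variant are illustrative assumptions, not the released facebookresearch/vq2d_cvpr implementation.

```python
# Illustrative sketch only; hyperparameters and conditioning scheme are assumed.
import torch
import torch.nn as nn


class SetContextQueryHead(nn.Module):
    """Score each object proposal against a visual query while letting
    proposals attend to one another, loosely following the set-context
    idea described for CocoFormer in the abstract."""

    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.set_encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.query_proj = nn.Linear(dim, dim)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, proposals: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, dim) pooled RoI features; query: (B, dim) exemplar embedding.
        q = self.query_proj(query).unsqueeze(1)   # (B, 1, dim)
        # Condition every proposal on the query, then mix proposal-set context
        # with self-attention so each score can depend on the other proposals.
        ctx = self.set_encoder(proposals + q)
        return self.scorer(ctx).squeeze(-1)       # (B, N) matching logits


def drop_proposals(proposals: torch.Tensor, keep_prob: float = 0.7) -> torch.Tensor:
    """Training-time proposal-set dropout: randomly suppress proposals so the
    model cannot exploit set-level statistics (a masking variant; the paper's
    exact scheme may differ)."""
    b, n, _ = proposals.shape
    keep = torch.rand(b, n, device=proposals.device) < keep_prob
    return proposals * keep.unsqueeze(-1)


if __name__ == "__main__":
    head = SetContextQueryHead()
    feats = drop_proposals(torch.randn(2, 50, 256))  # 50 proposals per frame
    scores = head(feats, torch.randn(2, 256))
    print(scores.shape)  # torch.Size([2, 50])
```

A plain per-proposal classifier would score each proposal independently of the rest; routing the conditioned features through a set encoder, as above, is one simple way to realize the "object-proposal set context" the abstract argues for.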
