Paper Title
Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding
Authors
Abstract
Recently, one-stage visual grounders have attracted considerable attention because they achieve accuracy comparable to two-stage grounders with significantly higher efficiency. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, need not be performed among all objects, since only some of them are related to the text query and may confuse the model. We call these objects suspected objects. Exploring their relationships in the one-stage paradigm is non-trivial for two reasons. First, no object proposals are available as a basis on which to select suspected objects and perform relationship modeling. Second, suspected objects are more confusing than other objects, as they may share similar semantics, be entangled with certain relationships, etc., and can therefore more easily mislead the model's prediction. To this end, we propose a Suspected Object Transformation mechanism (SOT), which can be seamlessly integrated into existing CNN-based and Transformer-based one-stage visual grounders to encourage selection of the target object among the suspected ones. Suspected objects are dynamically discovered from a learned activation map that adapts to the model's current discrimination ability during training. On top of the suspected objects, we further propose a Keyword-Aware Discrimination module (KAD) and an Exploration by Random Connection strategy (ERC), which work together to help the model rethink its initial prediction. On the one hand, KAD leverages the keywords that contribute most to discriminating among suspected objects. On the other hand, ERC allows the model to seek the correct object instead of being trapped in always exploiting its current false prediction. Extensive experiments demonstrate the effectiveness of the proposed method.