Paper Title
Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution
Paper Authors
Paper Abstract
Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual grounding datasets demonstrate that our method achieves state-of-the-art performance. In addition, the query-aware visual features are informative enough to achieve comparable performance to the latest methods when directly used for prediction without further multi-modal fusion.
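
To make the core idea concrete, below is a minimal PyTorch sketch of a query-conditioned convolution: a pooled query embedding is mapped to per-sample depthwise convolutional kernels that are then applied to the visual feature map. The module name, dimensions, and the linear kernel generator are illustrative assumptions; the abstract does not specify the exact QCM architecture.

# Hypothetical sketch of query-conditioned convolution; details (depthwise
# kernels, linear generator, dimensions) are assumptions, not the paper's QCM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryConditionedConv(nn.Module):
    """Generates depthwise conv kernels from a pooled query embedding and
    applies them to the visual feature map."""

    def __init__(self, channels: int, query_dim: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Maps the query embedding to one k x k kernel per visual channel
        # (depthwise, to keep the number of generated parameters small).
        self.kernel_gen = nn.Linear(query_dim, channels * kernel_size * kernel_size)

    def forward(self, visual: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map; query: (B, D) pooled text embedding.
        b, c, h, w = visual.shape
        k = self.kernel_size
        # Predict per-sample depthwise kernels conditioned on the query.
        kernels = self.kernel_gen(query).view(b * c, 1, k, k)
        # Fold the batch into the channel dimension so a single grouped conv
        # applies a different kernel to every (sample, channel) pair.
        out = F.conv2d(
            visual.view(1, b * c, h, w),
            kernels,
            padding=k // 2,
            groups=b * c,
        )
        return out.view(b, c, h, w)


if __name__ == "__main__":
    qcm = QueryConditionedConv(channels=256, query_dim=768)
    feats = torch.randn(2, 256, 20, 20)  # backbone visual features
    text = torch.randn(2, 768)           # e.g. pooled BERT query embedding
    print(qcm(feats, text).shape)        # torch.Size([2, 256, 20, 20])

Because the kernels are a function of the expression, the resulting features are already "query-aware" before any downstream fusion, which matches the abstract's claim that these features alone can support competitive predictions.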