Paper Title
OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval
Paper Authors
Paper Abstract
Retrieving tracked vehicles by natural language descriptions plays a critical role in smart city construction. The task aims to find, from a set of tracked vehicles in surveillance videos, the best match for a given text query. Existing works generally solve it with a dual-stream framework consisting of a text encoder, a visual encoder, and a cross-modal loss function. Although some progress has been made, these methods fail to fully exploit information at various levels of granularity. To tackle this issue, we propose a novel framework for the natural language-based vehicle retrieval task, OMG, which Observes Multiple Granularities in the visual representation, the textual representation, and the objective function. For the visual representation, target features, context features, and motion features are encoded separately. For the textual representation, one global embedding, three local embeddings, and a color-type prompt embedding are extracted to represent semantic features at various granularities. Finally, the overall framework is optimized with a cross-modal multi-granularity contrastive loss function. Experiments demonstrate the effectiveness of our method: OMG significantly outperforms all previous methods and ranks 9th in the 6th AI City Challenge Track 2. The code is available at https://github.com/dyhBUPT/OMG.
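To make the dual-stream, multi-granularity design described above concrete, below is a minimal PyTorch sketch of such a model paired with a symmetric InfoNCE-style contrastive loss. All module names, feature dimensions, and the pairing of visual and textual granularities are illustrative assumptions, not the authors' implementation; for the actual method, see the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(v, t, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired visual/text embeddings.

    Matched (visual, text) pairs along the diagonal are positives; all
    other pairs in the batch act as negatives.
    """
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))


class DualStreamSketch(nn.Module):
    """Hypothetical dual-stream model projecting multiple granularities
    into a shared embedding space (dimensions are assumptions)."""

    def __init__(self, vis_dim=2048, txt_dim=768, dim=256):
        super().__init__()
        # Visual granularities: target crop, surrounding context, motion map.
        self.target_proj = nn.Linear(vis_dim, dim)
        self.context_proj = nn.Linear(vis_dim, dim)
        self.motion_proj = nn.Linear(vis_dim, dim)
        # Textual granularities: global sentence embedding and a color-type
        # prompt embedding (the three local embeddings are omitted for brevity).
        self.global_proj = nn.Linear(txt_dim, dim)
        self.prompt_proj = nn.Linear(txt_dim, dim)

    def forward(self, target_f, context_f, motion_f, global_f, prompt_f):
        # Inputs are assumed to be precomputed backbone features.
        return {
            "target": self.target_proj(target_f),
            "context": self.context_proj(context_f),
            "motion": self.motion_proj(motion_f),
            "global": self.global_proj(global_f),
            "prompt": self.prompt_proj(prompt_f),
        }


def multi_granularity_loss(emb):
    """Sum a contrastive term over each (visual, textual) granularity pair.

    The pairing below is a guess at how granularities could be matched,
    not the paper's exact objective.
    """
    pairs = [("target", "global"), ("context", "global"),
             ("motion", "global"), ("target", "prompt")]
    return sum(info_nce(emb[v], emb[t]) for v, t in pairs)


# Toy usage with random stand-in features (batch of 4):
model = DualStreamSketch()
B = 4
emb = model(torch.randn(B, 2048), torch.randn(B, 2048), torch.randn(B, 2048),
            torch.randn(B, 768), torch.randn(B, 768))
loss = multi_granularity_loss(emb)
loss.backward()
```

In a real training setup, the features fed into each projection would come from dedicated backbones (e.g., a CNN over the vehicle crops and a language model over the query text), and the per-pair contrastive terms would be summed or weighted across all granularities at each optimization step.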