Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Paper Title

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Authors

Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, Tat-Seng Chua

Abstract

The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries, and they are usually ineffective for complex queries that carry far more complex semantics. Recently, the embedding-based paradigm has emerged as a popular approach. It aims to map queries and videos into a shared embedding space where semantically similar texts and videos are much closer to each other. Despite its simplicity, it forgoes the exploitation of the syntactic structure of text queries, making it suboptimal for modeling complex queries. To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method that jointly learns the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree to structurally describe the text query. We then design a tree-augmented query encoder to derive a structure-aware query representation and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and the videos are mapped into a joint embedding space for matching and ranking. With this approach, we gain a better understanding and modeling of complex queries, thereby achieving better video retrieval performance. Extensive experiments on large-scale video retrieval benchmark datasets demonstrate the effectiveness of our approach.
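The final matching-and-ranking step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`attentive_pool`, `rank_videos`), the toy vectors, and the hand-picked attention scores are all assumptions made for illustration. In the paper's setting the embeddings and attention scores would come from learned encoders, and the tree-augmented query encoder is not modeled here at all.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attentive_pool(frames, scores):
    """Softmax-weighted average of per-frame feature vectors.

    `scores` stands in for learned attention logits (hypothetical here).
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def rank_videos(query_vec, video_vecs):
    """Return (video_id, similarity) pairs, most similar first."""
    scored = [(vid, cosine(query_vec, vec)) for vid, vec in video_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: a 3-dimensional "joint embedding space" (assumed, not from the paper).
query = [1.0, 0.0, 1.0]
videos = {
    "video_a": attentive_pool([[1.0, 0.0, 0.8], [0.9, 0.1, 1.0]], [0.5, 0.5]),
    "video_b": attentive_pool([[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]], [0.2, 0.8]),
}
ranking = rank_videos(query, videos)
```

Here `ranking` lists the video whose pooled embedding points in nearly the same direction as the query first. Cosine similarity is used as a common choice for joint embedding spaces; the abstract does not state which similarity measure the method uses.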
