Paper Title
DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video
Authors
Abstract
This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm suited to temporal moment localization, which captures the relationships between humans, objects, and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected objects and human features conditioned on the language query. Moreover, a temporal sub-graph captures the activities within the video through time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show that our method outperforms state-of-the-art methods on these datasets, confirming the effectiveness of our approach.