Paper Title
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
Paper Authors
Paper Abstract
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic sub-graph, corresponding to whether the objects within them usually move in the world. The nodes in the dynamic graph are enriched with motion features capturing their interactions with other graph nodes. Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions are captured at varied granularity. To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task versus the state of the art.
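To make the 2.5D construction step more concrete, below is a minimal sketch (not the authors' implementation) of how per-frame 2D detections could be lifted into a pseudo-3D coordinate space using a monocular depth estimate, and then split into static and dynamic sub-graph nodes. All names here (`lift_to_2p5d`, `DYNAMIC_CLASSES`, the camera intrinsics `fx`, `fy`, `cx`, `cy`) are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch: back-project 2D object detections into a pseudo-3D
# ("2.5D") camera-frame space via a depth map, then split the resulting graph
# nodes into static vs. dynamic sub-graphs based on an assumed class list.
import numpy as np

# Assumed set of typically movable object categories (hypothetical).
DYNAMIC_CLASSES = {"person", "dog", "car", "ball"}


def lift_to_2p5d(boxes, labels, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project 2D box centers into pseudo-3D camera coordinates.

    boxes  : (N, 4) array of [x1, y1, x2, y2] pixel coordinates
    labels : list of N class names
    depth  : (H, W) per-pixel depth map from an off-the-shelf estimator
    Returns (static_nodes, dynamic_nodes), each a list of node dicts.
    """
    static_nodes, dynamic_nodes = [], []
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # 2D box center (pixels)
        z = float(depth[int(v), int(u)])              # depth at the box center
        # Pinhole back-projection: pixel (u, v) + depth z -> camera-frame (X, Y, Z)
        X, Y = (u - cx) * z / fx, (v - cy) * z / fy
        node = {"label": label, "xyz": (X, Y, z)}
        (dynamic_nodes if label in DYNAMIC_CLASSES else static_nodes).append(node)
    return static_nodes, dynamic_nodes


if __name__ == "__main__":
    depth = np.full((480, 640), 3.0)                  # dummy depth map, 3 m everywhere
    boxes = np.array([[100, 120, 200, 300], [400, 200, 500, 260]])
    labels = ["person", "sofa"]
    static_nodes, dynamic_nodes = lift_to_2p5d(boxes, labels, depth)
    print("static:", static_nodes)
    print("dynamic:", dynamic_nodes)
```

In a full pipeline, the per-frame pseudo-3D coordinates would additionally be registered into a shared spatio-temporal frame across the video, and the dynamic nodes would be enriched with motion features, as described in the abstract.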