P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision

Paper title

P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision

Authors

Zhao, He; Hadji, Isma; Dvornik, Nikita; Derpanis, Konstantinos G.; Wildes, Richard P.; Jepson, Allan D.

Abstract

In this paper, we study the problem of procedure planning in instructional videos. Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state. When learning procedure planning from instructional videos, most recent work leverages intermediate visual observations as supervision, which requires expensive annotation efforts to localize precisely all the instructional steps in training videos. In contrast, we remove the need for expensive temporal video annotations and propose a weakly supervised approach by learning from natural language instructions. Our model is based on a transformer equipped with a memory module, which maps the start and goal observations to a sequence of plausible actions. Furthermore, we augment our model with a probabilistic generative module to capture the uncertainty inherent to procedure planning, an aspect largely overlooked by previous work. We evaluate our model on three datasets and show our weakly supervised approach outperforms previous fully supervised state-of-the-art models on multiple metrics.
