Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Paper title

Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Authors

Mascaro, Esteve Valls; Ahn, Hyemin; Lee, Dongheui

Abstract

To anticipate how a human will act in the future, it is essential to understand the human's intention, since it guides the human towards a certain goal. In this paper, we propose a hierarchical architecture which assumes that a sequence of human actions (low-level) is driven by the human intention (high-level). Based on this, we address the Long-Term Action Anticipation (LTA) task in egocentric videos. Our framework first extracts two levels of human information from the N observed videos of human actions through a Hierarchical Multi-task MLP Mixer (H3M). Then, we condition the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE) that generates K stable predictions of the next Z=20 actions that the observed human might perform. By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term, thus improving the results over baseline methods in the EGO4D Challenge. This work ranked first in both the CVPR@2022 and ECCV@2022 EGO4D LTA Challenges by providing more plausible anticipated sequences, improving the anticipation of nouns and overall actions. Webpage: https://evm7.github.io/icvae-page/
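The inference-time data flow described in the abstract (observed features plus an intention, K latent samples, Z=20 predicted actions per sample) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the weights are random and untrained, and the vocabulary size, number of intention classes, latent and feature dimensions, the value of K, the mean-pooling of observed features, and the choice to decode all Z steps at once are all assumptions made here for illustration only.

```python
import numpy as np

Z = 20              # number of future actions to predict (from the abstract)
K = 5               # number of candidate sequences (value assumed here)
NUM_ACTIONS = 50    # hypothetical size of the action vocabulary
NUM_INTENTIONS = 8  # hypothetical number of intention classes
LATENT_DIM = 16     # hypothetical latent size
FEAT_DIM = 32       # hypothetical feature size per observed clip


def sample_futures(observed_feats, intention_id, rng, k=K):
    """Draw k candidate sequences of Z action ids, conditioned on an intention.

    observed_feats: array of shape (N, FEAT_DIM), one row per observed clip.
    intention_id:   integer index of the (predicted) high-level intention.
    Returns an int array of shape (k, Z).
    """
    # Random, untrained weights stand in for a learned decoder (assumption).
    w_rng = np.random.default_rng(0)
    cond_dim = FEAT_DIM + NUM_INTENTIONS
    w = w_rng.normal(size=(LATENT_DIM + cond_dim, Z * NUM_ACTIONS))

    # Condition vector: pooled low-level features plus a one-hot intention.
    intention = np.eye(NUM_INTENTIONS)[intention_id]
    cond = np.concatenate([observed_feats.mean(axis=0), intention])

    # One latent sample per candidate future, drawn from a standard normal prior.
    latents = rng.normal(size=(k, LATENT_DIM))
    inputs = np.concatenate([latents, np.tile(cond, (k, 1))], axis=1)

    # Decode all Z steps at once and pick the highest-scoring action per step.
    logits = (inputs @ w).reshape(k, Z, NUM_ACTIONS)
    return logits.argmax(axis=-1)


rng = np.random.default_rng(42)
feats = rng.normal(size=(4, FEAT_DIM))  # N = 4 observed clips
futures = sample_futures(feats, intention_id=3, rng=rng)
print(futures.shape)  # (5, 20)
```

Each row of `futures` is one candidate sequence of 20 action indices. In a trained model, the different latent samples would yield the K alternative futures for the same observation and intention.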
