Paper Title
Egocentric Video-Language Pretraining
Paper Authors
Paper Abstract
Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create EgoClip, a 1st-person video-text pretraining dataset comprising 3.8M clip-text pairs well-chosen from Ego4D, covering a large variety of human daily activities. (ii) We propose a novel pretraining objective, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples. (iii) We introduce EgoMCQ, a development benchmark that is close to EgoClip and hence can support effective validation and fast exploration of our design decisions in EgoClip and EgoNCE. Furthermore, we demonstrate strong performance on five egocentric downstream tasks across three datasets: video-text retrieval on EPIC-KITCHENS-100; action recognition on Charades-Ego; natural language query, moment query, and object state change classification on Ego4D challenge benchmarks. The dataset and code are available at https://github.com/showlab/EgoVLP.
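The abstract describes EgoNCE as a video-text contrastive objective with egocentric-aware positive and negative sample mining. The sketch below is a rough illustration of that general recipe, not the paper's exact loss: it is a standard InfoNCE-style clip-text objective where a hypothetical `pos_mask` argument marks which clip-text pairs in a batch are treated as positives (for example, clips whose narrations describe the same action). The function name and masking scheme are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F


def info_nce_with_mined_pairs(video_emb, text_emb, pos_mask, temperature=0.05):
    """Illustrative InfoNCE-style video-text contrastive loss with a positive mask,
    in the spirit of egocentric-aware sampling. NOT the paper's exact formulation:
    `pos_mask` is a hypothetical (B, B) boolean matrix marking which (video_i, text_j)
    pairs count as positives; its diagonal (matched clip-text pairs) should be True.

    video_emb: (B, D) video features
    text_emb:  (B, D) text features
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T / temperature                  # (B, B) similarity logits

    # Average the log-probabilities of each sample's positives over the batch,
    # symmetrically for video-to-text and text-to-video directions.
    log_prob_v2t = sim.log_softmax(dim=1)
    log_prob_t2v = sim.log_softmax(dim=0)
    pos = pos_mask.float()
    loss_v2t = -(log_prob_v2t * pos).sum(1) / pos.sum(1).clamp(min=1)
    loss_t2v = -(log_prob_t2v * pos).sum(0) / pos.sum(0).clamp(min=1)
    return 0.5 * (loss_v2t.mean() + loss_t2v.mean())


if __name__ == "__main__":
    B, D = 8, 256
    video_emb = torch.randn(B, D)
    text_emb = torch.randn(B, D)
    # Diagonal-only positives reduce this to plain InfoNCE.
    pos_mask = torch.eye(B, dtype=torch.bool)
    print(info_nce_with_mined_pairs(video_emb, text_emb, pos_mask).item())
```

With a richer `pos_mask` (more than just the diagonal) and hard negatives packed into the same batch, the same formula accommodates the kind of positive/negative mining the abstract attributes to EgoNCE.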