Paper Title

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Authors

Yuanze Lin, Chen Wei, Huiyu Wang, Alan Yuille, Cihang Xie

Abstract

Video-language pre-training is crucial for learning powerful multi-modal representations. However, it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy considers both the visual and textual modalities, providing better cross-modal alignment and saving more pre-training cost. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only "important" spatial regions and temporal frames for pre-training. Coupling all these designs lets our method attain competitive performance on text-to-video retrieval and video question answering while reducing pre-training cost by 1.9x or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
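The space-time token sparsification idea can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the use of NumPy, and the assumption that importance scores (e.g. attention from a context token) are given per frame and per patch are all hypothetical, chosen only to show the "keep top frames, then top patches per kept frame" pattern.

```python
import numpy as np

def sparsify_tokens(tokens, frame_scores, patch_scores,
                    keep_frames, keep_patches):
    """Hypothetical sketch of space-time token sparsification:
    keep the highest-scoring temporal frames, then the
    highest-scoring spatial patches within each kept frame.

    tokens:       (T, P, D) patch embeddings, T frames x P patches
    frame_scores: (T,)      per-frame importance (assumed given)
    patch_scores: (T, P)    per-patch importance (assumed given)
    """
    # Pick the `keep_frames` most important frames.
    top_t = np.argsort(frame_scores)[::-1][:keep_frames]
    top_t.sort()  # restore temporal order among kept frames
    kept = []
    for t in top_t:
        # Within each kept frame, keep the top `keep_patches` patches.
        top_p = np.argsort(patch_scores[t])[::-1][:keep_patches]
        top_p.sort()  # restore spatial order among kept patches
        kept.append(tokens[t, top_p])
    return np.stack(kept)  # shape (keep_frames, keep_patches, D)

# Toy usage: 8 frames of 16 patches, 4-dim embeddings,
# sparsified to 4 frames x 8 patches.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16, 4))
frame_scores = rng.random(8)
patch_scores = rng.random((8, 16))
sparse = sparsify_tokens(tokens, frame_scores, patch_scores,
                         keep_frames=4, keep_patches=8)
print(sparse.shape)  # (4, 8, 4)
```

Dropping half the frames and half the patches this way shrinks the encoder's token count fourfold in this toy setting, which is the kind of saving that makes the reduced pre-training cost plausible.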
