Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Paper title

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Authors

AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

Abstract

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.
