Paper Title
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Paper Authors
Paper Abstract
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
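The abstract mentions using DDIM inversion to provide structure guidance for sampling. The sketch below illustrates the general idea of DDIM inversion (deterministically mapping clean latents back to noise), not the paper's exact implementation; the handles `unet`, `text_emb`, and `alphas_cumprod` are hypothetical placeholders standing in for components of a pre-trained T2I diffusion model.

```python
# Minimal sketch of DDIM inversion, assuming:
#   unet(x, t, text_emb) -> predicted noise eps for latent x at timestep t
#   alphas_cumprod[t]    -> cumulative noise-schedule coefficient at timestep t
# Both are hypothetical stand-ins; this is not the paper's released code.

import torch

@torch.no_grad()
def ddim_invert(latents, text_emb, unet, alphas_cumprod, num_steps=50):
    """Walk clean source-video latents back toward noise with deterministic DDIM steps.
    The resulting noisy latents can later initialize sampling, so the generated video
    keeps the coarse structure of the source video."""
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = latents
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = unet(x, t, text_emb)                           # predicted noise at step t
        x0 = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        x = a_next.sqrt() * x0 + (1.0 - a_next).sqrt() * eps # deterministic step to t_next
    return x  # structure-preserving noise used as the starting point for DDIM sampling
```

In this reading, running regular DDIM sampling from the returned latents (instead of pure Gaussian noise) is what supplies the "structure guidance" referred to in the abstract.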