Paper Title
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Paper Authors
Paper Abstract
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
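The abstract mentions using DDIM inversion to provide structure guidance for sampling. The sketch below illustrates the general idea of DDIM inversion (deterministically mapping clean latents back to noise), not the paper's exact implementation; the handles `unet`, `text_emb`, and `alphas_cumprod` are hypothetical placeholders standing in for components of a pre-trained T2I diffusion model.

```python
# Minimal sketch of DDIM inversion, assuming:
#   unet(x, t, text_emb) -> predicted noise eps for latent x at timestep t
#   alphas_cumprod[t]    -> cumulative noise-schedule coefficient at timestep t
# Both are hypothetical stand-ins; this is not the paper's released code.

import torch

@torch.no_grad()
def ddim_invert(latents, text_emb, unet, alphas_cumprod, num_steps=50):
    """Walk clean source-video latents back toward noise with deterministic DDIM steps.
    The resulting noisy latents can later initialize sampling, so the generated video
    keeps the coarse structure of the source video."""
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = latents
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = unet(x, t, text_emb)                           # predicted noise at step t
        x0 = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        x = a_next.sqrt() * x0 + (1.0 - a_next).sqrt() * eps # deterministic step to t_next
    return x  # structure-preserving noise used as the starting point for DDIM sampling
```

In this reading, running regular DDIM sampling from the returned latents (instead of pure Gaussian noise) is what supplies the "structure guidance" referred to in the abstract.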