Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models

Paper title

Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models

Authors

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, Wenhu Chen

Abstract

Conditioned diffusion models have demonstrated state-of-the-art text-to-image synthesis capacity. Most recent work focuses on synthesizing independent images, whereas in real-world applications it is common and necessary to generate a series of coherent images for storytelling. In this work, we mainly focus on story visualization and continuation tasks and propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images. Moreover, AR-LDM can generalize to new characters through adaptation. To the best of our knowledge, this is the first work to successfully leverage diffusion models for coherent visual story synthesis. Quantitative results show that AR-LDM achieves SoTA FID scores on PororoSV, FlintstonesSV, and the newly introduced challenging dataset VIST, which contains natural images. Large-scale human evaluations show that AR-LDM has superior performance in terms of quality, relevance, and consistency.
