Paper Title
Shifted Diffusion for Text-to-image Generation
Paper Authors
Paper Abstract
We present Corgi, a novel method for text-to-image generation. Corgi is based on our proposed shifted diffusion model, which generates better image embeddings from input text. Unlike the baseline diffusion model used in DALL-E 2, our method seamlessly encodes prior knowledge from the pre-trained CLIP model into its diffusion process through a new initialization distribution and a new transition step. Compared to the strong DALL-E 2 baseline, our method generates image embeddings from text more efficiently and effectively, resulting in better text-to-image generation. Extensive large-scale experiments, evaluated with both quantitative measures and human evaluation, indicate that our method has stronger generation ability than existing ones. Furthermore, our model enables semi-supervised and language-free training for text-to-image generation, where only part or none of the images in the training dataset have an associated caption. Trained with only 1.7% of the images captioned, our semi-supervised model obtains FID results comparable to DALL-E 2 on zero-shot text-to-image generation evaluated on MS-COCO. Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks, outperforming the previous method, Lafite, by a large margin.
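To make the abstract's claim concrete, here is a minimal sketch of what a shifted transition step could look like; the Gaussian N(μ, Σ) (assumed here to be fitted to pre-trained CLIP image embeddings) and the particular form of the shift are illustrative assumptions, not the paper's exact formulation. A standard DDPM-style prior, as used in DALL-E 2, has transitions

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),

whose terminal distribution is N(0, I). A shifted variant adds a drift toward μ and rescales the noise:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1} + \big(1-\sqrt{1-\beta_t}\big)\mu,\ \beta_t\,\Sigma\big).

Substituting y_t = x_t - μ recovers a standard chain in y with noise covariance β_t Σ, so x_T converges to N(μ, Σ) rather than N(0, I); reverse sampling can then be initialized from N(μ, Σ), i.e., in the region where real CLIP image embeddings actually lie, which is one way the CLIP prior can be encoded into both the initialization distribution and the transition step.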