Paper Title
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
Paper Authors
Paper Abstract
Generating images from graph-structured inputs, such as scene graphs, is uniquely challenging due to the difficulty of aligning nodes and connections in graphs with objects and their relations in images. Most existing methods address this challenge by using scene layouts, which are image-like representations of scene graphs designed to capture the coarse structures of scene images. Because scene layouts are manually crafted, the alignment with images may not be fully optimized, causing suboptimal compliance between the generated images and the original scene graphs. To tackle this issue, we propose to learn scene graph embeddings by directly optimizing their alignment with images. Specifically, we pre-train an encoder to extract both global and local information from scene graphs that are predictive of the corresponding images, relying on two loss functions: masked autoencoding loss and contrastive loss. The former trains embeddings by reconstructing randomly masked image regions, while the latter trains embeddings to discriminate between compliant and non-compliant images according to the scene graph. Given these embeddings, we build a latent diffusion model to generate images from scene graphs. The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections. On the Visual Genome and COCO-Stuff datasets, we demonstrate that SGDiff outperforms state-of-the-art methods, as measured by both the Inception Score and Fréchet Inception Distance (FID) metrics. We will release our source code and trained models at https://github.com/YangLing0818/SGDiff.
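The abstract describes a two-term pre-training objective for the scene graph encoder: a masked autoencoding loss that reconstructs randomly masked image regions conditioned on the scene graph embedding, and a contrastive loss that matches each scene graph embedding with its compliant image while pushing away non-compliant ones. The following is a minimal sketch of how such a combined objective could be wired up; it is not the authors' implementation, and all module names, feature shapes, and hyperparameters (e.g. `SceneGraphEncoder`, `patch_dim`, the temperature) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two-term pre-training
# objective described in the abstract: the scene-graph embedding must
# (i) reconstruct randomly masked image patches (masked autoencoding)
# and (ii) pick out its own image among the batch (contrastive loss).

import torch
import torch.nn.functional as F
from torch import nn


class SceneGraphEncoder(nn.Module):
    """Placeholder encoder: pools pre-extracted node features into one
    global embedding per scene graph (a real model would use a graph
    network over nodes and edges)."""
    def __init__(self, node_dim=128, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(node_dim, embed_dim), nn.ReLU(),
                                  nn.Linear(embed_dim, embed_dim))

    def forward(self, node_feats):            # (B, N_nodes, node_dim)
        return self.proj(node_feats).mean(1)  # (B, embed_dim)


class MaskedImageDecoder(nn.Module):
    """Reconstructs image patch features from patch tokens plus the
    scene-graph embedding."""
    def __init__(self, embed_dim=256, patch_dim=768):
        super().__init__()
        self.decode = nn.Sequential(nn.Linear(embed_dim + patch_dim, 512),
                                    nn.ReLU(), nn.Linear(512, patch_dim))

    def forward(self, patch_tokens, graph_emb):  # (B, P, patch_dim), (B, embed_dim)
        g = graph_emb.unsqueeze(1).expand(-1, patch_tokens.size(1), -1)
        return self.decode(torch.cat([patch_tokens, g], dim=-1))


def pretrain_losses(graph_emb, image_emb, patch_tokens, patch_targets,
                    mask, decoder, temperature=0.07):
    """Masked autoencoding loss + symmetric InfoNCE contrastive loss."""
    # 1) Masked autoencoding: score reconstruction only on masked patches.
    recon = decoder(patch_tokens, graph_emb)
    mae = ((recon - patch_targets) ** 2).mean(-1)
    mae_loss = (mae * mask).sum() / mask.sum().clamp(min=1)

    # 2) Contrastive: matching (graph, image) pairs are positives,
    #    all other images in the batch serve as negatives.
    g = F.normalize(graph_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = g @ v.t() / temperature
    labels = torch.arange(g.size(0), device=g.device)
    nce_loss = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))
    return mae_loss + nce_loss


if __name__ == "__main__":
    B, N, P = 4, 6, 16
    enc, dec = SceneGraphEncoder(), MaskedImageDecoder()
    node_feats = torch.randn(B, N, 128)        # per-node scene-graph features
    patch_tokens = torch.randn(B, P, 768)      # encoded (partially masked) patches
    patch_targets = torch.randn(B, P, 768)     # ground-truth patch features
    image_emb = torch.randn(B, 256)            # from a separate image encoder
    mask = (torch.rand(B, P) < 0.5).float()    # 1 = patch was masked
    loss = pretrain_losses(enc(node_feats), image_emb, patch_tokens,
                           patch_targets, mask, dec)
    loss.backward()
    print(loss.item())
```

In SGDiff, the embeddings learned this way are then used to condition a latent diffusion model; the sketch above only covers the pre-training losses, under the stated assumptions.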