Paper Title

FlexIT: Towards Flexible Semantic Image Translation

Paper Authors

Guillaume Couairon, Asya Grechka, Jakob Verbeek, Holger Schwenk, Matthieu Cord

Paper Abstract

Deep generative models, like GANs, have considerably improved the state of the art in image synthesis, and are able to generate near photo-realistic images in structured domains such as human faces. Based on this success, recent work on image editing proceeds by projecting images to the GAN latent space and manipulating the latent vector. However, these approaches are limited in that only images from a narrow domain can be transformed, and with only a limited number of editing operations. We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing. Our method achieves flexible and natural editing, pushing the limits of semantic image translation. First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space. Via the latent space of an auto-encoder, we iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms. We propose an evaluation protocol for semantic image translation, and thoroughly evaluate our method on ImageNet. Code will be made publicly available.
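
As a rough illustration of the pipeline the abstract describes, the sketch below mixes the CLIP embeddings of the input image and the text instruction into a single target point, then optimizes an autoencoder latent code toward it. The `autoencoder` encode/decode interface, the mixing weight `lambda_text`, and the simple latent-space regularizer are illustrative assumptions, not the paper's actual components or hyperparameters.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def edit_image(autoencoder, image, instruction,
               lambda_text=0.5, steps=100, lr=0.05, reg_weight=0.1):
    """Edit `image` (a PIL image) toward the text `instruction`.

    `autoencoder` is assumed to expose encode(image) -> latent tensor
    and decode(latent) -> image tensor; this interface is hypothetical.
    """
    with torch.no_grad():
        # Step 1: combine image and text into one target point in CLIP space.
        img = preprocess(image).unsqueeze(0).to(device)
        img_emb = clip_model.encode_image(img)
        txt_emb = clip_model.encode_text(clip.tokenize([instruction]).to(device))
        target = (1 - lambda_text) * img_emb + lambda_text * txt_emb
        target = target / target.norm(dim=-1, keepdim=True)
        z0 = autoencoder.encode(image)  # latent code of the input image

    # Step 2: iteratively move the decoded image toward the target point.
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        out = autoencoder.decode(z)  # decoded image tensor, (1, 3, H, W)
        # Resizing/normalizing `out` to CLIP's expected input is omitted here.
        out_emb = clip_model.encode_image(out)
        out_emb = out_emb / out_emb.norm(dim=-1, keepdim=True)
        # Cosine distance to the target, plus a stand-in latent regularizer
        # (the paper's actual regularization terms are not reproduced here).
        loss = 1 - (out_emb * target).sum() + reg_weight * (z - z0).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return autoencoder.decode(z).detach()
```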
