Paper Title
ReCo: Region-Controlled Text-to-Image Generation
Paper Authors
Paper Abstract
Recently, large-scale text-to-image (T2I) models have shown impressive performance in generating high-fidelity images, but they offer limited controllability, e.g., precisely specifying the content of a specific region with a free-form text description. In this paper, we propose an effective technique for such regional control in T2I generation. We augment the input of T2I models with an extra set of position tokens, which represent quantized spatial coordinates. Each region is specified by four position tokens, representing the top-left and bottom-right corners, followed by an open-ended natural language description of that region. We then fine-tune a pre-trained T2I model with this new input interface. Our model, dubbed ReCo (Region-Controlled T2I), enables region control over arbitrary objects described by open-ended regional text rather than by object labels from a constrained category set. Empirically, ReCo achieves better image quality than a T2I model strengthened by positional words (FID: 8.82 -> 7.36, SceneFID: 15.54 -> 6.51 on COCO), and places objects more accurately, amounting to a 20.40% improvement in region classification accuracy on COCO. Furthermore, we demonstrate that ReCo gives better control over object count, spatial relationships, and region attributes such as color and size via free-form regional descriptions. Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate than the T2I model at generating images with the correct object count and spatial relationships.
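The abstract only describes the input interface at a high level. Below is a minimal sketch, not the authors' implementation, of how a ReCo-style input sequence could be assembled from bounding boxes and regional descriptions; the number of quantization bins (1000) and the `<bin_i>` token format are assumptions for illustration.

```python
# Sketch of a ReCo-style input sequence: normalized box coordinates are
# quantized into discrete position tokens, and each region is written as
# <x1><y1><x2><y2> followed by its free-form regional description.
# The 1000-bin quantization and "<bin_i>" token naming are assumptions.

def quantize(coord: float, num_bins: int = 1000) -> str:
    """Map a normalized coordinate in [0, 1] to a discrete position token."""
    idx = min(int(coord * num_bins), num_bins - 1)
    return f"<bin_{idx}>"

def build_reco_prompt(
    caption: str,
    regions: list[tuple[tuple[float, float, float, float], str]],
) -> str:
    """Compose the image-level caption with region-level (box, text) pairs."""
    parts = [caption]
    for (x1, y1, x2, y2), region_text in regions:
        # Four position tokens: top-left (x1, y1) and bottom-right (x2, y2).
        tokens = "".join(quantize(c) for c in (x1, y1, x2, y2))
        parts.append(f"{tokens} {region_text}")
    return " ".join(parts)

if __name__ == "__main__":
    prompt = build_reco_prompt(
        "a living room with a dog on the couch",
        [
            ((0.10, 0.45, 0.55, 0.90), "a sleeping golden retriever"),
            ((0.60, 0.20, 0.95, 0.70), "a tall green floor lamp"),
        ],
    )
    print(prompt)
```

In this sketch the position tokens extend the text vocabulary, so the fine-tuned T2I model consumes one flat token sequence that interleaves spatial coordinates with open-ended regional text.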