Paper Title
Text2Human: Text-Driven Controllable Human Image Generation
Paper Authors
Paper Abstract
Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should ideally be intuitively controllable for layman users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human image generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given texts describing the shapes of clothes, the human pose is first translated into a human parsing map. 2) The final human image is then generated by providing the system with additional attributes describing the textures of the clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level captures the structural representations of textures, while the codebook at the fine level focuses on texture details. To make use of the learned hierarchical codebook for synthesizing the desired images, a diffusion-based transformer sampler with a mixture of experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebook at finer levels. The predicted indices at the different levels are translated into human images by a decoder learned jointly with the hierarchical codebooks. The mixture-of-experts design allows the generated image to be conditioned on fine-grained text input, and the prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our framework generates more diverse and realistic human images than state-of-the-art methods.
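To make the idea of a coarse-to-fine codebook concrete, here is a minimal, hypothetical sketch of two-level residual vector quantization in numpy: a coarse codebook captures the rough structure of a feature vector, and a fine codebook quantizes the remaining detail. This is only an illustrative stand-in; the paper's actual hierarchical texture-aware codebook operates on multi-scale encoder feature maps and is learned end to end, and all names and sizes below are invented for the example.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (N, D) array, codebook: (K, D) array.
    Returns the chosen indices (N,) and the quantized vectors (N, D)."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)

# Hypothetical codebooks: a small coarse one for texture structure,
# a larger fine one for texture detail (sizes chosen arbitrarily).
coarse_codebook = rng.normal(size=(8, 4))
fine_codebook = rng.normal(size=(32, 4))

# Stand-in for features produced by an image encoder.
features = rng.normal(size=(16, 4))

# Coarse level: quantize the features themselves.
coarse_idx, coarse_q = quantize(features, coarse_codebook)

# Fine level: quantize the residual left over after the coarse step.
fine_idx, fine_q = quantize(features - coarse_q, fine_codebook)

# A decoder would consume both levels; here we just sum them.
reconstruction = coarse_q + fine_q
```

In the actual framework, the indices at each level are what the transformer sampler predicts (coarse indices first, then fine indices conditioned on them), and a learned decoder, not a simple sum, maps the quantized representations back to a human image.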