Paper Title

Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose

Paper Authors

Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, Yong Liu

Paper Abstract

Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt deforming autoencoders to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for the video-specific identity and the other for the varying pose. Inspired by this, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator enhances fine details and overall realism. It leverages the disentangled features to generate photo-realistic and pose-consistent face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of the reenacted images and the flexibility of transferring facial movements between identities.
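To make the disentangling idea concrete, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: one encoder pools several frames of the source video into a pose-invariant identity code, a second encoder extracts a per-frame pose code from the driving frame, and a conditional generator fuses the two into an output image. All module names, layer sizes, and the averaging/concatenation choices here are illustrative assumptions, not the authors' actual DAE-GAN implementation (which builds on deforming autoencoders and adversarial training).

```python
# Minimal sketch of identity/pose disentangling for face reenactment.
# Hypothetical architecture for illustration only; not the paper's code.
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Encodes several frames of one video into a single pose-invariant code."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1)).flatten(1)   # (B*T, 64)
        codes = self.fc(feats).view(b, t, -1)
        return codes.mean(dim=1)               # average over frames -> pose-invariant

class PoseEncoder(nn.Module):
    """Encodes a single driving frame into a pose code."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, frame):                  # frame: (B, 3, H, W)
        return self.fc(self.conv(frame).flatten(1))

class Generator(nn.Module):
    """Conditional generator: decodes fused identity + pose codes to an image."""
    def __init__(self, id_dim=128, pose_dim=64):
        super().__init__()
        self.fc = nn.Linear(id_dim + pose_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, id_code, pose_code):
        z = self.fc(torch.cat([id_code, pose_code], dim=1)).view(-1, 64, 8, 8)
        return self.deconv(z)                  # (B, 3, 32, 32) in this toy setting

# Reenactment: identity from the source video, pose from the driving frame.
src_frames = torch.randn(2, 4, 3, 32, 32)      # 4 frames of the source identity
drv_frame  = torch.randn(2, 3, 32, 32)         # one driving frame
id_enc, pose_enc, gen = IdentityEncoder(), PoseEncoder(), Generator()
out = gen(id_enc(src_frames), pose_enc(drv_frame))
print(out.shape)                               # torch.Size([2, 3, 32, 32])
```

In this sketch, averaging the per-frame identity codes is one simple way to exploit the prior that identity is shared across a video while pose varies; the paper instead realizes this with a multi-frame deforming autoencoder, and the real model operates at much higher resolution with adversarial losses.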
