Paper Title

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

Paper Authors

Park, Se Jin; Kim, Minsu; Hong, Joanna; Choi, Jeongsoo; Ro, Yong Man

Paper Abstract

The challenge of talking face generation from speech lies in aligning information from two different modalities, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips that vary at the phoneme level because they do not provide sufficient visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes Audio-Lip Memory, which brings in visual information of the mouth region corresponding to the input audio and enforces fine-grained audio-visual coherence. It stores lip motion features extracted from sequential ground-truth images in a value memory and aligns them with the corresponding audio features so that they can be retrieved from audio input at inference time. Using the retrieved lip motion features as visual hints, the model can then easily correlate audio with visual dynamics in the synthesis step. By analyzing the memory, we demonstrate that unique lip features are stored in each memory slot at the phoneme level, capturing subtle lip motion through memory addressing. In addition, we introduce a visual-visual synchronization loss, which further enhances lip-sync performance when used alongside the audio-visual synchronization loss in our model. Extensive experiments verify that our method generates high-quality video whose mouth shapes best align with the input audio, outperforming previous state-of-the-art methods.
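
The core mechanism described above is a key-value memory addressed by audio: lip motion features from ground-truth frames populate the value slots, and at inference an audio feature queries the memory to recall a lip-motion hint for the synthesis step. The sketch below illustrates this addressing-and-recall pattern in PyTorch; the slot count, feature dimension, temperature, and the KL-based address alignment loss are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioLipMemory(nn.Module):
    """Sketch of an audio-addressable lip-motion memory.

    Key slots are matched against audio queries; value slots hold
    lip motion features. All sizes here are illustrative assumptions.
    """

    def __init__(self, num_slots: int = 96, dim: int = 512, temperature: float = 16.0):
        super().__init__()
        self.key_memory = nn.Parameter(torch.randn(num_slots, dim))    # addressed by audio
        self.value_memory = nn.Parameter(torch.randn(num_slots, dim))  # stores lip motion
        self.temperature = temperature

    def address(self, query: torch.Tensor) -> torch.Tensor:
        # Soft attention over memory slots from scaled cosine similarity.
        # query: (B, dim) -> weights: (B, num_slots)
        sim = F.cosine_similarity(
            query.unsqueeze(1), self.key_memory.unsqueeze(0), dim=-1
        )
        return F.softmax(self.temperature * sim, dim=-1)

    def forward(self, audio_feat: torch.Tensor):
        # Recall a lip-motion hint as the address-weighted sum of values.
        addr = self.address(audio_feat)        # (B, num_slots)
        recalled = addr @ self.value_memory    # (B, dim)
        return recalled, addr


def address_alignment_loss(audio_addr: torch.Tensor, lip_addr: torch.Tensor) -> torch.Tensor:
    """Pull the audio-driven address toward the address produced by
    ground-truth lip features, so audio alone can retrieve the right
    slots at inference. KL divergence is one common choice here; the
    paper's exact alignment objective may differ."""
    return F.kl_div(audio_addr.clamp_min(1e-8).log(), lip_addr, reduction="batchmean")
```

The two synchronization losses mentioned at the end of the abstract can likewise be sketched in a SyncNet-style form, where the cosine similarity of a pair of embeddings is pushed toward 1 for in-sync pairs and 0 for off-sync pairs. Pairing audio with generated-lip features gives the audio-visual loss; pairing generated-lip with ground-truth-lip features gives the visual-visual loss. The binary cross-entropy formulation below is an assumption based on common practice, not the paper's stated objective.

```python
def sync_loss(feat_x: torch.Tensor, feat_y: torch.Tensor, in_sync: torch.Tensor) -> torch.Tensor:
    """SyncNet-style loss: cosine similarity mapped to [0, 1] and scored
    against an in-sync label (1.0 for matched pairs, 0.0 for shifted pairs).
    Use (audio, generated-lip) pairs for the audio-visual loss and
    (generated-lip, ground-truth-lip) pairs for the visual-visual loss."""
    sim = (F.cosine_similarity(feat_x, feat_y, dim=-1) + 1.0) / 2.0
    return F.binary_cross_entropy(sim, in_sync)
```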
