Paper Title
3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning
Paper Authors
Paper Abstract
3D human shape and pose estimation from monocular images has been an active area of research in computer vision, having a substantial impact on the development of new applications, from activity recognition to creating virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in several practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios can vary over a wide range of sizes, and a model trained at one resolution does not typically degrade gracefully across resolutions. Two common approaches to solving the problem of low-resolution input are applying super-resolution techniques to the input images, which may result in visual artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed network is able to learn the 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both of these new training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that the RSC-Net achieves consistently better results than state-of-the-art methods on challenging low-resolution images.
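To make the two scale-consistency losses mentioned in the abstract more concrete, below is a minimal PyTorch-style sketch. It is not the paper's exact formulation: the abstract does not give formulas, so the function names, the MSE form of the output-consistency term, and the InfoNCE-style contrastive term are all assumptions chosen to illustrate the general idea of enforcing agreement between predictions and features of the same image at different resolutions.

```python
# Hedged sketch of the two auxiliary losses described in the abstract.
# Assumptions (not from the paper): predictions are flat parameter vectors
# (e.g., SMPL-like pose/shape), features are per-image embeddings, and the
# positive pair is the same image at a high and a low resolution.
import torch
import torch.nn.functional as F


def scale_consistency_loss(pred_high: torch.Tensor, pred_low: torch.Tensor) -> torch.Tensor:
    """Self-supervision loss: predictions from a high-resolution image and its
    downsampled copy should agree (one plausible reading of 'scale-consistency
    of the output')."""
    return F.mse_loss(pred_low, pred_high.detach())


def cross_resolution_contrastive_loss(feat_high: torch.Tensor,
                                      feat_low: torch.Tensor,
                                      temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: the low-resolution feature of image i should be closer
    to the high-resolution feature of the same image than to the features of
    other images in the batch (one plausible reading of 'scale-consistency of
    the deep features')."""
    z_high = F.normalize(feat_high, dim=1)      # (B, D) unit-norm embeddings
    z_low = F.normalize(feat_low, dim=1)        # (B, D)
    logits = z_low @ z_high.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(z_low.size(0), device=z_low.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy example with random tensors standing in for network outputs.
    B, D, P = 8, 256, 85                        # batch, feature dim, parameter dim
    pred_hi, pred_lo = torch.randn(B, P), torch.randn(B, P)
    feat_hi, feat_lo = torch.randn(B, D), torch.randn(B, D)
    total = scale_consistency_loss(pred_hi, pred_lo) \
          + cross_resolution_contrastive_loss(feat_hi, feat_lo)
    print(f"combined auxiliary loss: {total.item():.4f}")
```

In practice such terms would be added to the supervised (or weakly-supervised) reconstruction loss, with the low-resolution inputs generated by downsampling the same training images so that paired predictions and features are available at multiple scales.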