Paper Title

Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Authors

Kai-En Lin, Lin Yen-Chen, Wei-Sheng Lai, Tsung-Yi Lin, Yi-Chang Shih, Ravi Ramamoorthi

Abstract

Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.
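
The abstract describes conditioning a volume-rendering MLP on both a global feature (from a vision transformer) and a local feature (from a 2D convolutional network). Below is a minimal sketch, not the authors' released code, of what that conditioning pattern could look like; the layer sizes, positional-encoding settings, and module names (`ConditionedNeRFMLP`, `positional_encoding`) are illustrative assumptions.

```python
# Sketch (assumption, not the paper's implementation): an MLP that predicts
# density and color for a 3D sample point, conditioned on a global image
# feature (e.g. a ViT embedding of the input view) and a local feature
# (e.g. a CNN feature sampled at the point's projection into the source view).
import math
import torch
import torch.nn as nn


def positional_encoding(x: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """Standard NeRF-style sinusoidal encoding of coordinates."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * math.pi
    angles = x[..., None] * freqs                      # (..., dim, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                   # (..., dim * 2 * num_freqs)


class ConditionedNeRFMLP(nn.Module):
    def __init__(self, global_dim: int = 256, local_dim: int = 128,
                 num_freqs: int = 6, hidden: int = 256):
        super().__init__()
        point_dim = 3 * 2 * num_freqs                  # encoded 3D position
        dir_dim = 3 * 2 * num_freqs                    # encoded view direction
        in_dim = point_dim + dir_dim + global_dim + local_dim
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 4),                      # (density, r, g, b)
        )

    def forward(self, points, view_dirs, global_feat, local_feat):
        # points, view_dirs: (N, 3); global_feat: (global_dim,), shared by all
        # samples; local_feat: (N, local_dim), gathered per projected sample.
        h = torch.cat([
            positional_encoding(points, self.num_freqs),
            positional_encoding(view_dirs, self.num_freqs),
            global_feat.expand(points.shape[0], -1),
            local_feat,
        ], dim=-1)
        out = self.mlp(h)
        sigma = torch.relu(out[..., :1])               # non-negative density
        rgb = torch.sigmoid(out[..., 1:])              # colors in [0, 1]
        return sigma, rgb
```

In this sketch, the global feature would come from encoding the whole input image with a vision transformer, while the local feature would be sampled from a 2D CNN feature map at each 3D point's projection into the source view; the predicted densities and colors would then be composited with the standard volume-rendering integral to form the novel view.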
