MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

Paper Title

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

Authors

Chenjie Cao, Xinlin Ren, Yanwei Fu

Abstract

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPNs) suffer from discouraged feature representations for reflection and texture-less areas, which limits the generalization of MVS. Even FPNs worked with pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. Thus we ask whether ViTs can facilitate feature learning in MVS? In this paper, we propose a pre-trained ViT enhanced MVS network called MVSFormer, which can learn more reliable feature representations benefited by informative priors from ViT. The finetuned MVSFormer with hierarchical ViTs of efficient attention mechanisms can achieve prominent improvement based on FPNs. Besides, the alternative MVSFormer with frozen ViT weights is further proposed. This largely alleviates the training cost with competitive performance strengthened by the attention map from the self-distillation pre-training. MVSFormer can be generalized to various input resolutions with efficient multi-scale training strengthened by gradient accumulation. Moreover, we discuss the merits and drawbacks of classification and regression-based MVS methods, and further propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. Particularly, MVSFormer ranks as Top-1 on both intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard.
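The abstract does not spell out how the temperature-based strategy unifies classification and regression. One plausible reading, offered only as a minimal sketch and not as the authors' formulation, is that per-pixel scores over depth hypotheses are turned into probabilities with a temperature-scaled softmax: a very low temperature approaches the argmax choice (classification-style), while a temperature of 1 gives the usual expected depth (regression-style). The function names, depth hypotheses, and scores below are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilised by subtracting the max score."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(logits, depth_hypotheses, temperature=1.0):
    """Regression-style depth: probability-weighted mean of the hypotheses."""
    probs = softmax(logits, temperature)
    return sum(p * d for p, d in zip(probs, depth_hypotheses))

def argmax_depth(logits, depth_hypotheses):
    """Classification-style depth: the hypothesis with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return depth_hypotheses[best]

# Hypothetical depth hypotheses and matching scores for a single pixel.
depths = [1.0, 2.0, 3.0, 4.0]
logits = [0.1, 2.0, 1.5, 0.2]

soft = expected_depth(logits, depths, temperature=1.0)    # blends neighbours
sharp = expected_depth(logits, depths, temperature=0.01)  # close to argmax
hard = argmax_depth(logits, depths)
print(soft, sharp, hard)
```

In this toy setting, lowering the temperature moves the expected depth from a blend of neighbouring hypotheses toward the single best-scoring hypothesis, so one parameter interpolates between the two behaviours.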
