Paper title
DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation
Authors
Abstract
This paper addresses the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that long-range correlation is essential for accurate depth estimation. We therefore propose to leverage the Transformer to model this global context with an effective attention mechanism. We also adopt an additional convolution branch to preserve local information, since the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches leave the two sets of features poorly connected. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module that enhances the Transformer features via element-wise interaction and models the affinity between the Transformer and CNN features in a set-to-set translation manner. Because global attention on high-resolution feature maps incurs an unbearable memory cost, we introduce a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. Notably, it achieves the most competitive result on the highly competitive KITTI depth estimation benchmark. Our code and models are available at https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox.
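
The abstract's two technical ideas, modeling the affinity between Transformer and CNN features and avoiding the memory cost of global attention, can be illustrated with a small NumPy sketch. This is not the authors' implementation: it uses plain scaled dot-product cross-attention as a stand-in for the set-to-set interaction, and the choice of Transformer tokens as queries, the token counts, the 88 x 280 feature-map size, and the 8 sampled points per query are all assumptions made for illustration. The names `cross_attention` and `attention_weight_count` are hypothetical.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Nq, Nk) affinity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ values, weights

def attention_weight_count(height, width, sampled_points=None):
    """Attention weights for one head over an H x W map.

    Dense (global) attention needs N * N weights with N = H * W.
    If every query only looks at a fixed number of sampled points,
    the count drops to N * sampled_points.
    """
    n = height * width
    return n * n if sampled_points is None else n * sampled_points

rng = np.random.default_rng(0)
# Hypothetical shapes: 6 Transformer tokens and 10 CNN tokens, 8 channels each.
transformer_tokens = rng.normal(size=(6, 8))
cnn_tokens = rng.normal(size=(10, 8))

# Assumption: Transformer tokens act as queries over the CNN tokens.
fused, weights = cross_attention(transformer_tokens, cnn_tokens, cnn_tokens)

# Hypothetical 88 x 280 feature map, with and without sparse sampling.
dense = attention_weight_count(88, 280)
sparse = attention_weight_count(88, 280, sampled_points=8)
print(fused.shape, dense, sparse)
```

Under these assumed sizes, the dense attention map holds N times as many weights as there are positions, while the sampled variant grows only linearly with N, which is the kind of saving a deformable scheme targets.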