Paper Title

ChiTransformer: Towards Reliable Stereo from Cues

Authors

Qing Su, Shihao Ji

Abstract


Current stereo matching techniques are challenged by restricted search space, occluded regions, and sheer image size. While single-image depth estimation is spared from these challenges and can achieve satisfactory results with the extracted monocular cues, the lack of stereoscopic relationships renders the monocular prediction less reliable on its own, especially in highly dynamic or cluttered environments. To address these issues in both scenarios, we present an optic-chiasm-inspired self-supervised binocular depth estimation method, wherein a vision transformer (ViT) with gated positional cross-attention (GPCA) layers is designed to enable feature-sensitive pattern retrieval between views while retaining the extensive context information aggregated through self-attentions. Monocular cues from a single view are thereafter conditionally rectified by a blending layer with the retrieved pattern pairs. This crossover design is biologically analogous to the optic-chiasma structure in the human visual system, hence the name ChiTransformer. Our experiments show that this architecture yields substantial improvements of 11% over state-of-the-art self-supervised stereo approaches and can be used on both rectilinear and non-rectilinear (e.g., fisheye) images. The project is available at https://github.com/ISL-CV/ChiTransformer.
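The gating idea in GPCA can be sketched as a single-head cross-attention whose scores blend a content term (queries from one view against keys from the other) with a positional prior, mixed by a learned sigmoid gate. The following is a minimal NumPy illustration, not the paper's implementation: the toy dimensions, the per-query gate logits, and the dense positional bias are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_positional_cross_attention(q, k, v, pos_bias, gate_logits):
    """Toy single-head GPCA sketch (hypothetical simplification):
    blend content cross-attention scores with a positional bias via a
    per-query sigmoid gate, then retrieve patterns from the other view."""
    d = q.shape[-1]
    content = q @ k.T / np.sqrt(d)            # (Nq, Nk) cross-view content scores
    g = 1.0 / (1.0 + np.exp(-gate_logits))    # sigmoid gate in [0, 1]
    scores = g * content + (1.0 - g) * pos_bias
    return softmax(scores) @ v                # (Nq, d) retrieved patterns

rng = np.random.default_rng(0)
Nq, Nk, d = 4, 6, 8                          # toy token counts and feature dim
q = rng.normal(size=(Nq, d))                 # queries from the reference view
k = rng.normal(size=(Nk, d))                 # keys from the other view
v = rng.normal(size=(Nk, d))                 # values from the other view
pos_bias = rng.normal(size=(Nq, Nk))         # stand-in for a positional prior
gate_logits = rng.normal(size=(Nq, 1))       # stand-in for learned gate logits
out = gated_positional_cross_attention(q, k, v, pos_bias, gate_logits)
print(out.shape)  # (4, 8)
```

In the paper's pipeline, outputs of this cross-view retrieval would then condition the rectification of the monocular cues; here the sketch only shows the gated score mixing itself.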
