CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Paper title

CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Authors

Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, Jérôme Revaud

Abstract

Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm. A pretext task is constructed by masking patches in an input image, and this masked content is then predicted by a neural network using visible patches as sole input. This pre-training leads to state-of-the-art performance when finetuned for high-level semantic tasks, e.g. image classification and object detection. In this paper we instead seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks, such as depth prediction or optical flow estimation. Inspired by MIM, we propose an unsupervised representation learning task trained from pairs of images showing the same scene from different viewpoints. More precisely, we propose the pretext task of cross-view completion where the first input image is partially masked, and this masked content has to be reconstructed from the visible content and the second image. In single-view MIM, the masked content often cannot be inferred precisely from the visible portion only, so the model learns to act as a prior influenced by high-level semantics. In contrast, this ambiguity can be resolved with cross-view completion from the second unmasked image, on the condition that the model is able to understand the spatial relationship between the two images. Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks such as depth estimation. In addition, our model can be directly applied to binocular downstream tasks like optical flow or relative camera pose estimation, for which we obtain competitive results without bells and whistles, i.e., using a generic architecture without any task-specific design.
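
The input construction for the cross-view completion pretext task described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the patch size of 16, the masking ratio of 0.75, and the mean-squared-error loss are all assumptions chosen for the example, and the encoder and decoder networks are left out entirely.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)


def make_cross_view_inputs(view1, view2, patch=16, mask_ratio=0.75, seed=0):
    """Build inputs and targets for one cross-view completion example.

    view1 is partially masked; view2 stays fully visible.
    patch and mask_ratio are hypothetical values picked for illustration.
    """
    rng = np.random.default_rng(seed)
    p1 = patchify(view1, patch)
    p2 = patchify(view2, patch)
    n = p1.shape[0]
    n_masked = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])
    return {
        "visible_patches": p1[visible_idx],   # what the model sees of view 1
        "visible_idx": visible_idx,
        "masked_idx": masked_idx,
        "reference_patches": p2,              # the unmasked second view
        "target_patches": p1[masked_idx],     # content to be reconstructed
    }


def reconstruction_loss(pred, target):
    """Mean squared error over the masked patches (an assumed loss choice)."""
    return float(np.mean((pred - target) ** 2))
```

A model trained on this task would take `visible_patches` and `reference_patches` as input and be penalized by `reconstruction_loss` against `target_patches`, so it can only do well by relating the two viewpoints.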
