CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization

Paper Title

CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization

Authors

Yujiao Shi, Xin Yu, Shan Wang, Hongdong Li

Abstract

This paper tackles the problem of Cross-view Video-based camera Localization (CVL). The task is to localize a query camera by leveraging information from its past observations, i.e., a continuous sequence of images observed at previous time stamps, and matching them to a large overhead-view satellite image. The critical challenge of this task is to learn a powerful global feature descriptor for the sequential ground-view images while considering its domain alignment with reference satellite images. For this purpose, we introduce CVLNet, which first projects the sequential ground-view images into an overhead view by exploring the ground-and-overhead geometric correspondences and then leverages the photo consistency among the projected images to form a global representation. In this way, the cross-view domain differences are bridged. Since the reference satellite images are usually pre-cropped and regularly sampled, there is always a misalignment between the query camera location and its matching satellite image center. Motivated by this, we propose estimating the query camera's relative displacement to a satellite image before similarity matching. In this displacement estimation process, we also consider the uncertainty of the camera location. For example, a camera is unlikely to be on top of trees. To evaluate the performance of the proposed method, we collect satellite images from Google Maps for the KITTI dataset and construct a new cross-view video-based localization benchmark dataset, KITTI-CVL. Extensive experiments have demonstrated the effectiveness of video-based localization over single image-based localization and the superiority of each proposed module over other alternatives.
