Paper Title
Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent Transformer
Paper Authors
Paper Abstract
Vision sensors are widely applied in vehicles, robots, and roadside infrastructure. However, due to limitations in hardware cost and system size, the camera Field-of-View (FoV) is often restricted and may not provide sufficient coverage. Nevertheless, from a spatiotemporal perspective, it is possible to obtain information beyond the camera's physical FoV from past video streams. In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view, thereby enhancing scene visibility, perception, and system safety. To achieve this, we introduce the FlowLens architecture, which explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation. FlowLens offers two key features: 1) it includes a newly designed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated over time; 2) it integrates a multi-branch Mix Fusion Feed Forward Network (MixF3N) to enhance the precise spatial flow of local features. To facilitate training and evaluation, we derive a dataset from KITTI360 with various FoV masks, covering both outer- and inner-FoV expansion scenarios. We also conduct quantitative assessments and qualitative comparisons of beyond-FoV semantics and beyond-FoV object detection across different models. We illustrate that employing FlowLens to reconstruct unseen scenes even enhances perception within the field of view by providing reliable semantic context. Extensive experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance. The source code and dataset are made publicly available at https://github.com/MasterHow/FlowLens.
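The abstract only sketches the 3D-Decoupled Cross Attention at a high level. As a rough illustration of the general idea of decoupling attention over a video clip, the toy NumPy sketch below attends along the temporal axis and the spatial axis in two separate passes instead of one joint pass over all time-space tokens. This is an assumption-laden simplification, not the paper's actual DDCA: the tensor layout `(T, H*W, d)`, the single-head formulation, and the temporal-then-spatial ordering are all illustrative choices, and the real module operates on cached past-clip features inside the Clip-Recurrent Hub.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    # q: (..., Lq, d), k/v: (..., Lk, d) -> (..., Lq, d)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(query, memory):
    """Toy "decoupled" cross attention over a video clip.

    query:  (T, H*W, d) current-clip features
    memory: (T, H*W, d) cached features from past clips (hypothetical layout)

    Attending over time and space separately costs O(T^2 + (H*W)^2)
    per token instead of O((T*H*W)^2) for full joint attention.
    """
    # Pass 1 - temporal: for each spatial location, attend across frames.
    q_t = np.swapaxes(query, 0, 1)          # (H*W, T, d)
    m_t = np.swapaxes(memory, 0, 1)         # (H*W, T, d)
    out = np.swapaxes(attention(q_t, m_t, m_t), 0, 1)  # (T, H*W, d)
    # Pass 2 - spatial: for each frame, attend across locations.
    return attention(out, memory, memory)   # (T, H*W, d)

# Tiny smoke test with random features.
T, HW, d = 4, 16, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((T, HW, d))
m = rng.standard_normal((T, HW, d))
out = decoupled_cross_attention(q, m)
print(out.shape)  # (4, 16, 8)
```

The efficiency argument is the usual one for factorized video attention: splitting the joint attention into per-axis passes keeps the cost linear in the number of decoupled axes, which matters for an online system that accumulates features over time.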