Paper Title
Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
Paper Authors
Paper Abstract
The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code: https://nv-tlabs.github.io/lift-splat-shoot .
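To make the "lift" and "splat" steps concrete, below is a minimal PyTorch sketch, not the authors' released implementation. It assumes the lift step forms a frustum by taking the outer product of per-pixel context features with a softmax distribution over discrete depth bins, and that the splat step sum-pools frustum points into a rasterized bird's-eye-view grid using precomputed cell indices. All function and parameter names (`lift_features`, `splat_to_bev`, `bev_idx`, etc.) are hypothetical.

```python
# Hypothetical sketch of "lift" (per-pixel depth distribution x features -> frustum)
# and "splat" (sum-pool frustum points into a BEV grid). Not the official code.
import torch

def lift_features(img_feats: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    """Lift image features into a frustum via an outer product with depth probabilities.

    img_feats:    (B, C, H, W) per-pixel context features
    depth_logits: (B, D, H, W) unnormalized scores over D discrete depth bins
    returns:      (B, D, C, H, W) frustum of features, one C-vector per (depth bin, pixel)
    """
    depth_probs = depth_logits.softmax(dim=1)                  # (B, D, H, W)
    return depth_probs.unsqueeze(2) * img_feats.unsqueeze(1)   # broadcast outer product

def splat_to_bev(frustum_feats: torch.Tensor, bev_idx: torch.Tensor, bev_hw: tuple) -> torch.Tensor:
    """Splat frustum features into a rasterized BEV grid by summing per cell.

    frustum_feats: (N, C) features for N frustum points flattened over all cameras
    bev_idx:       (N,)  flat BEV-cell index for each point, precomputed from
                   camera intrinsics/extrinsics
    bev_hw:        (H_bev, W_bev) size of the rasterized grid
    returns:       (C, H_bev, W_bev) bird's-eye-view feature grid
    """
    C = frustum_feats.shape[1]
    bev = frustum_feats.new_zeros(bev_hw[0] * bev_hw[1], C)
    bev.index_add_(0, bev_idx, frustum_feats)                   # sum-pool ("splat")
    return bev.t().reshape(C, bev_hw[0], bev_hw[1])
```

Because every camera's frustum is splatted into the same grid, features from all views accumulate in shared BEV cells, which is how the network can learn to fuse predictions across the rig.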