Paper Title
Bridged Transformer for Vision and Point Cloud 3D Object Detection
Paper Authors
Paper Abstract
3D object detection is a crucial research topic in computer vision, and conventional setups usually take 3D point clouds as input. Recently, there has been a trend toward leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images, which often have richer color and less noise. However, the heterogeneous geometry of the 2D and 3D representations prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective: it learns to identify 3D and 2D object bounding boxes from both points and image patches. A key element of BrT is its use of object queries to bridge the 3D and 2D spaces, which unifies the different sources of data representation within the Transformer. We adopt a form of feature aggregation realized by point-to-patch projections, which further strengthens the correlations between images and points. Moreover, BrT works seamlessly for fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on the SUN RGB-D and ScanNetV2 datasets.
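The abstract does not spell out how a point-to-patch projection is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a pinhole camera with points already expressed in the camera frame, and a regular grid of square image patches. The function names `point_to_patch` and `group_points_by_patch`, the intrinsics tuple `(fx, fy, cx, cy)`, and the row-major patch indexing are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def point_to_patch(
    point: Sequence[float],
    intrinsics: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    patch_size: int,
) -> Optional[int]:
    """Return the row-major index of the patch a 3D point projects into.

    Assumes a pinhole camera and a point given in camera coordinates
    (an illustrative assumption, not taken from the paper).
    Returns None when the point is behind the camera or off the patch grid.
    """
    x, y, z = point
    fx, fy, cx, cy = intrinsics
    width, height = image_size
    if z <= 0:  # point lies behind (or on) the camera plane
        return None
    # Perspective projection to pixel coordinates.
    u = fx * x / z + cx
    v = fy * y / z + cy
    cols = width // patch_size
    rows = height // patch_size
    col = int(u // patch_size)
    row = int(v // patch_size)
    if not (0 <= col < cols and 0 <= row < rows):
        return None
    return row * cols + col


def group_points_by_patch(
    points: Sequence[Sequence[float]],
    intrinsics: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    patch_size: int,
) -> Dict[int, List[int]]:
    """Map each patch index to the indices of the points that fall in it."""
    groups: Dict[int, List[int]] = {}
    for i, p in enumerate(points):
        patch = point_to_patch(p, intrinsics, image_size, patch_size)
        if patch is not None:
            groups.setdefault(patch, []).append(i)
    return groups


if __name__ == "__main__":
    K = (100.0, 100.0, 32.0, 32.0)  # hypothetical fx, fy, cx, cy
    pts = [(0.0, 0.0, 1.0), (-0.3, -0.3, 1.0), (0.0, 0.0, -1.0)]
    print(group_points_by_patch(pts, K, (64, 64), 16))
```

With a correspondence like this in hand, features of points and of the patches they land in could be combined (for example by averaging or concatenation); which aggregation BrT actually uses is not stated in the abstract.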