Paper Title

EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder

Authors

Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, Wanli Ouyang

Abstract

The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces \textbf{E}fficient \textbf{P}oint \textbf{C}loud \textbf{L}earning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are $\textbf{19.7}$ AP$_{50}$ on ScanNet V2 detection, $\textbf{4.4}$ mIoU on S3DIS segmentation and $\textbf{1.2}$ mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at \url{https://github.com/XiaoshuiHuang/EPCL}.
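Below is a minimal PyTorch sketch, for illustration only, of the pipeline the abstract describes: the input point cloud is grouped into local patches, each patch is embedded into a token by a learnable tokenizer, a learnable task token is prepended, and the token sequence is passed through a frozen transformer. All class and parameter names here (`PointCloudTokenizer`, `EPCLSketch`, the patch sizes and embedding width) are hypothetical, the patch grouping is naive random grouping rather than the FPS/kNN grouping a real implementation would use, and a randomly initialized `nn.TransformerEncoder` stands in for the pretrained CLIP transformer whose frozen weights EPCL actually reuses.

```python
# Illustrative sketch of the EPCL-style pipeline: tokenize a point cloud,
# prepend a task token, and encode with a frozen transformer backbone.
import torch
import torch.nn as nn


class PointCloudTokenizer(nn.Module):
    """Hypothetical tokenizer: split the point cloud into fixed-size local
    patches and map each patch to a token with a small PointNet-style MLP."""

    def __init__(self, num_patches=64, patch_size=32, embed_dim=512):
        super().__init__()
        self.num_patches = num_patches
        self.patch_size = patch_size
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, points):  # points: (B, N, 3), assumes N >= num_patches * patch_size
        B, N, _ = points.shape
        # Naive random grouping for brevity; a real tokenizer would use FPS + kNN.
        idx = torch.randperm(N)[: self.num_patches * self.patch_size]
        patches = points[:, idx, :].reshape(B, self.num_patches, self.patch_size, 3)
        patches = patches - patches.mean(dim=2, keepdim=True)  # center each patch
        tokens = self.mlp(patches).max(dim=2).values            # (B, num_patches, embed_dim)
        return tokens


class EPCLSketch(nn.Module):
    def __init__(self, embed_dim=512, num_layers=12, num_heads=8):
        super().__init__()
        self.tokenizer = PointCloudTokenizer(embed_dim=embed_dim)
        self.task_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Stand-in for the frozen CLIP transformer; EPCL would load the
        # pretrained CLIP weights here instead of random initialization.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False  # only the tokenizer and task token are trained

    def forward(self, points):
        tokens = self.tokenizer(points)                         # (B, P, D)
        task = self.task_token.expand(tokens.size(0), -1, -1)   # (B, 1, D)
        x = torch.cat([task, tokens], dim=1)                    # prepend task token
        return self.backbone(x)                                 # (B, P + 1, D)


if __name__ == "__main__":
    model = EPCLSketch()
    out = model(torch.randn(2, 2048, 3))
    print(out.shape)  # torch.Size([2, 65, 512])
```

Task-specific heads (detection, segmentation, or classification) would then consume the output tokens; only the tokenizer, task token, and head receive gradients, which is what makes the frozen-backbone setup efficient to train.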
