Paper Title
Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models
Paper Authors
Paper Abstract
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to high intra-category and low inter-category dissimilarity. In addition, the limited number of fine-grained 3D datasets poses a significant obstacle to addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Network (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving recognition accuracies of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we successfully integrated our proposed method with a robotic framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
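To make the hybrid multi-modal design more concrete, below is a minimal PyTorch sketch of one plausible late-fusion arrangement: a ViT branch processing the RGB view and a CNN branch processing the depth view, with the two feature vectors concatenated and fed to a linear classifier. The specific backbones (vit_b_16, resnet18), the fusion-by-concatenation strategy, and the class name HybridViTCNN are illustrative assumptions, not the authors' published implementation.

# Minimal late-fusion sketch of a hybrid multi-modal ViT + CNN classifier.
# NOTE: backbones, feature dimensions, and the fusion strategy are assumptions
# for illustration; they are not taken from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet18, vit_b_16

class HybridViTCNN(nn.Module):
    def __init__(self, num_classes: int = 20):
        super().__init__()
        # ViT branch for the RGB view (assumed); expose the 768-d CLS features.
        self.vit = vit_b_16(weights=None)
        self.vit.heads = nn.Identity()   # strip classification head -> 768-d output
        # CNN branch for the depth view (assumed); expose the 512-d pooled features.
        self.cnn = resnet18(weights=None)
        self.cnn.fc = nn.Identity()      # strip classification head -> 512-d output
        # Concatenation fusion followed by a single linear classifier.
        self.classifier = nn.Linear(768 + 512, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.vit(rgb)            # (B, 768)
        f_depth = self.cnn(depth)        # (B, 512)
        fused = torch.cat([f_rgb, f_depth], dim=1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = HybridViTCNN(num_classes=20)
    rgb = torch.randn(2, 3, 224, 224)    # RGB batch (ViT expects 224x224)
    depth = torch.randn(2, 3, 224, 224)  # depth map replicated to 3 channels
    print(model(rgb, depth).shape)       # torch.Size([2, 20])

In this sketch the depth map is replicated to three channels so a standard ImageNet-style CNN stem can consume it; other multi-modal designs (e.g., early fusion or attention-based fusion) are equally plausible readings of the abstract.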