Paper Title

Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents

Paper Authors

Yao-Hung Hubert Tsai, Hanlin Goh, Ali Farhadi, Jian Zhang

Paper Abstract

The perception system in personalized mobile agents requires developing indoor scene understanding models, which can understand 3D geometries, capture objectness, analyze human behaviors, etc. Nonetheless, this direction has not been well explored in comparison with models for outdoor environments (e.g., autonomous driving systems that include pedestrian prediction, car detection, traffic sign recognition, etc.). In this paper, we first discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments; and other challenges such as fusion between heterogeneous sources of information (e.g., RGB images and LiDAR point clouds), modeling relationships between a diverse set of outputs (e.g., 3D object locations, depth estimation, and human poses), and computational efficiency. Then, we describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges. MMISM considers RGB images as well as sparse LiDAR points as inputs, and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks. We show that MMISM performs on par with or even better than single-task models; e.g., we improve the baseline 3D object detection results by 11.7% on the benchmark ARKitScenes dataset.
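
The abstract specifies the model's interface (RGB + sparse LiDAR in; four task outputs) but not its internal architecture. Below is a minimal PyTorch sketch of one plausible shape for such a multi-modality, multi-task model, assuming the sparse LiDAR points have already been projected to a one-channel depth map and that each task uses a lightweight per-pixel head. All class names, layer sizes, and the concatenation-based fusion scheme here are illustrative placeholders, not the paper's actual design.

```python
import torch
import torch.nn as nn


class MMISMSketch(nn.Module):
    """Hypothetical skeleton: one encoder per modality, a simple fusion layer,
    and one head per output task. Not the paper's architecture."""

    def __init__(self, num_classes: int = 20, num_joints: int = 17):
        super().__init__()
        # Modality-specific encoders (stand-ins for real image/LiDAR backbones).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Assumes sparse LiDAR points were projected into a 1-channel depth map.
        self.lidar_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fuse the two feature maps by concatenation + 1x1 convolution.
        self.fuse = nn.Conv2d(256, 128, kernel_size=1)
        # One head per task; real heads would be task-specific architectures.
        self.det_head = nn.Conv2d(128, 7, kernel_size=1)            # 3D box params
        self.depth_head = nn.Conv2d(128, 1, kernel_size=1)          # depth completion
        self.pose_head = nn.Conv2d(128, num_joints, kernel_size=1)  # joint heatmaps
        self.seg_head = nn.Conv2d(128, num_classes, kernel_size=1)  # class logits

    def forward(self, rgb: torch.Tensor, lidar_depth: torch.Tensor) -> dict:
        fused = self.fuse(
            torch.cat([self.rgb_encoder(rgb), self.lidar_encoder(lidar_depth)], dim=1)
        )
        return {
            "boxes_3d": self.det_head(fused),
            "depth": self.depth_head(fused),
            "pose": self.pose_head(fused),
            "segmentation": self.seg_head(fused),
        }


model = MMISMSketch()
out = model(torch.randn(1, 3, 192, 256), torch.randn(1, 1, 192, 256))
print({k: tuple(v.shape) for k, v in out.items()})
```

The sketch only illustrates the shared-backbone/per-task-head pattern that lets the four tasks reuse fused RGB-LiDAR features; the abstract's claims about modeling relationships between outputs would require additional cross-task components not shown here.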
