Paper Title
Real-Time Hybrid Mapping of Populated Indoor Scenes using a Low-Cost Monocular UAV
Authors
Abstract
Unmanned aerial vehicles (UAVs) have been used for many applications in recent years, from urban search and rescue, to agricultural surveying, to autonomous underground mine exploration. However, deploying UAVs in tight, indoor spaces, especially close to humans, remains a challenge. One solution, when limited payload is required, is to use micro-UAVs, which pose less risk to humans and typically cost less to replace after a crash. However, micro-UAVs can only carry a limited sensor suite, e.g. a monocular camera instead of a stereo pair or LiDAR, complicating tasks like dense mapping and markerless multi-person 3D human pose estimation, which are needed to operate in tight environments around people. Monocular approaches to such tasks exist, and dense monocular mapping approaches have been successfully deployed for UAV applications. However, despite many recent works on both marker-based and markerless multi-UAV single-person motion capture, markerless single-camera multi-person 3D human pose estimation remains a much earlier-stage technology, and we are not aware of existing attempts to deploy it in an aerial context. In this paper, we present what is thus, to our knowledge, the first system to perform simultaneous mapping and multi-person 3D human pose estimation from a monocular camera mounted on a single UAV. In particular, we show how to loosely couple state-of-the-art monocular depth estimation and monocular 3D human pose estimation approaches to reconstruct a hybrid map of a populated indoor scene in real time. We validate our component-level design choices via extensive experiments on the large-scale ScanNet and GTA-IM datasets. To evaluate our system-level performance, we also construct a new Oxford Hybrid Mapping dataset of populated indoor scenes.