Paper Title
Pixel-wise Crowd Understanding via Synthetic Data
Paper Authors
Paper Abstract
Crowd analysis via computer vision techniques is an important topic in the field of video surveillance, with widespread applications including crowd monitoring, public safety, space design, and so on. Pixel-wise crowd understanding is the most fundamental task in crowd analysis because it yields finer results for video sequences or still images than other analysis tasks. Unfortunately, pixel-level understanding requires a large amount of labeled training data, and annotating such data is expensive, which is why current crowd datasets are small. As a result, most algorithms suffer from over-fitting to varying degrees. In this paper, taking crowd counting and segmentation as examples of pixel-wise crowd understanding, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a free data collector and labeler that generates synthetic, labeled crowd scenes in a computer game, Grand Theft Auto V. We then use it to construct a large-scale, diverse synthetic crowd dataset, named the "GCC Dataset". Secondly, we propose two simple methods to improve the performance of crowd understanding by exploiting the synthetic data. Specifically: 1) supervised crowd understanding: pre-train a crowd analysis model on the synthetic data, then fine-tune it on the real data and labels, which makes the model perform better in the real world; 2) crowd understanding via domain adaptation: translate the synthetic data into photo-realistic images, then train the model on the translated data and labels. As a result, the trained model works well in real crowd scenes.
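The abstract outlines two training strategies: (1) pre-train on synthetic data, then fine-tune on scarce real data; (2) translate synthetic data toward the real domain, then train on the translated data. The following is a minimal, runnable sketch of that two-stage flow using a toy one-parameter regression model; the `train` loop and `translate` function are illustrative stand-ins, not the paper's actual crowd-counting networks or its translation model.

```python
# Toy illustration of the abstract's two strategies. The "model" is a single
# weight w fitting y ~ w * x; it stands in for a crowd analysis network.

def train(model_w, xs, ys, lr=0.05, epochs=50):
    """Toy SGD loop standing in for crowd-model training."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = model_w * x
            model_w += lr * (y - pred) * x  # gradient step on squared error
    return model_w

# Synthetic (GCC-like) data: plentiful, labels come for free.
syn_x, syn_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
# Real data: scarce, because pixel-level annotation is expensive.
real_x, real_y = [1.5, 2.5], [3.1, 5.0]

# Strategy 1: supervised -- pre-train on synthetic, fine-tune on real.
w = train(0.0, syn_x, syn_y)     # pre-training on synthetic data
w = train(w, real_x, real_y)     # fine-tuning on real data and labels

# Strategy 2: domain adaptation -- translate synthetic data toward the
# real domain, then train on the translated data and the original labels.
def translate(x):
    """Stand-in for a synthetic-to-photo-realistic image translator."""
    return x * 1.02  # hypothetical, fixed domain-shift correction

w2 = train(0.0, [translate(x) for x in syn_x], syn_y)
```

In both strategies the expensive step (real-world labeling) is minimized: strategy 1 uses real labels only for fine-tuning, and strategy 2 uses none at all during supervised training.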