Paper Title
Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines
Paper Authors
Paper Abstract
The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one-second intervals. The image and audio datasets are first classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for this multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are treated as feature generators. We show that situations where a single modality may be confused by anomalous data points are corrected through this emergent higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone; both are correctly classified by our multi-modality approach.
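To make the pipeline described in the abstract concrete, below is a minimal Keras sketch of a late-fusion architecture of this kind: a fine-tuned VGG16 image branch and a deep neural network audio branch serve as feature generators, and a small tertiary network is trained on their concatenated features to predict one of the 8 scene classes. The layer sizes, the assumed audio feature dimensionality, and the input resolution are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal late-fusion sketch (assumptions: 224x224 frames, 128-dim audio
# features, dense layer sizes chosen for illustration only).
import tensorflow as tf
from tensorflow.keras import layers, models, applications

NUM_CLASSES = 8          # 8 environments, one frame/clip pair per second
AUDIO_FEATURE_DIM = 128  # assumed dimensionality of the extracted audio features

# --- Image branch: VGG16 pretrained on ImageNet, fine-tuned on scene frames ---
vgg = applications.VGG16(weights="imagenet", include_top=False, pooling="avg",
                         input_shape=(224, 224, 3))
image_input = vgg.input
image_features = layers.Dense(256, activation="relu")(vgg.output)

# --- Audio branch: a deep neural network over per-second audio features ---
audio_input = layers.Input(shape=(AUDIO_FEATURE_DIM,))
x = layers.Dense(256, activation="relu")(audio_input)
x = layers.Dense(128, activation="relu")(x)
audio_features = x

# --- Late fusion: concatenate the two feature vectors, then classify ---
fused = layers.Concatenate()([image_features, audio_features])
fused = layers.Dense(128, activation="relu")(fused)
output = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model(inputs=[image_input, audio_input], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In practice, the two primary branches would first be trained independently on the image and audio datasets (as in the abstract), after which their outputs are fused and only the tertiary layers are trained on the synchronised frame/clip pairs; this staged training is one natural reading of "late fusion", though the exact training schedule is not specified here.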