Detecting Sound Events Using Convolutional Macaron Net With Pseudo Strong Labels

Paper title

Detecting Sound Events Using Convolutional Macaron Net With Pseudo Strong Labels

Authors

Chan, Teck Kai; Chin, Cheng Siong

Abstract

In this paper, we propose addressing the lack of strongly labeled data by using pseudo strongly labeled data approximated with Convolutive Nonnegative Matrix Factorization. Using this set of data, we then train a novel architecture called the Convolutional Macaron Net (CMN), which combines a Convolutional Neural Network (CNN) with a Macaron Net (MN), in a semi-supervised manner. Instead of training only a single model or using the Mean-teacher approach, we train two different CMNs synchronously using a curriculum consistency cost and a curriculum interpolated consistency cost. In the inference stage, one of the models provides the frame-level prediction while the other provides the clip-level prediction. With our proposed framework, our system outperforms the baseline system of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge Task 4 by a margin of over 10%. Compared with the top submission of the DCASE 2019 challenge, our system's accuracy is higher by 1.8%. Compared with the top submission of DCASE 2020, our accuracy is marginally higher, by 0.3%, even with fewer Transformer encoding layers. Our system remains robust on the unseen YouTube evaluation dataset, with winning margins of 0.6% and 6.3% over the top submission of DCASE 2019 and the baseline system, respectively.
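
The abstract states that one model supplies frame-level predictions and the other clip-level predictions, but it does not say how the two are combined. The sketch below illustrates one common way this could be done, under stated assumptions: the clip-level output decides which classes are present, and the frame-level output is kept only for those classes. The function names `combine_predictions` and `to_events`, the 0.5 thresholds, and the gating rule itself are hypothetical and are not taken from the paper.

```python
import numpy as np

def combine_predictions(frame_probs, clip_probs, frame_thr=0.5, clip_thr=0.5):
    """Gate frame-level decisions with clip-level decisions (assumed rule).

    frame_probs: array of shape (n_frames, n_classes) from the frame-level model.
    clip_probs:  array of shape (n_classes,) from the clip-level model.
    Returns a boolean array of shape (n_frames, n_classes).
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    clip_probs = np.asarray(clip_probs, dtype=float)
    active_classes = clip_probs >= clip_thr      # classes judged present in the clip
    frame_active = frame_probs >= frame_thr      # per-frame activity for each class
    # A frame is marked active only if its class is also present at clip level.
    return frame_active & active_classes[np.newaxis, :]

def to_events(decisions, hop_seconds):
    """Convert boolean frame decisions into (class_index, onset_s, offset_s) tuples."""
    events = []
    for k in range(decisions.shape[1]):
        col = decisions[:, k].astype(int)
        # Pad with zeros so events touching the clip boundaries are closed properly.
        diff = np.diff(np.concatenate(([0], col, [0])))
        onsets = np.flatnonzero(diff == 1)
        offsets = np.flatnonzero(diff == -1)
        for on, off in zip(onsets, offsets):
            events.append((k, on * hop_seconds, off * hop_seconds))
    return events
```

For example, with three frames and two classes, `combine_predictions([[0.9, 0.8], [0.9, 0.1], [0.2, 0.9]], [0.7, 0.3])` keeps only class 0, because class 1 falls below the clip-level threshold even though some of its frames score highly.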
