Paper Title
Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification
Authors
Abstract
Environmental sound classification (ESC) is challenging because the spatial-temporal relations in sound signals are unstructured. Many recent studies focus on abstracting features with convolutional neural networks, while learning the semantically relevant frames of sound signals has been overlooked. To this end, we present an end-to-end framework, the feature pyramid attention network (FPAM), which focuses on abstracting semantically relevant features for ESC. We first extract feature maps from the preprocessed spectrogram of the sound waveform with a backbone network. Then, to build multi-scale hierarchical features, we construct a feature pyramid representation of the spectrogram by aggregating feature maps from multi-scale layers, and FPAM localizes the temporal frames and spatial locations of the semantically relevant frames. Specifically, the multi-scale features are first processed by a dimension alignment module. Next, a pyramid spatial attention module (PSA) is attached to localize the important frequency regions spatially, using a spatial attention module (SAM). Finally, the processed feature maps are refined by a pyramid channel attention (PCA) module to localize the important temporal frames. To support the effectiveness of FPAM, we visualize the attention maps on the spectrograms. The visualizations show that FPAM focuses more on semantically relevant regions while neglecting noise. The proposed method is validated on two widely used ESC datasets, ESC-50 and ESC-10. The experimental results show that FPAM performs comparably to state-of-the-art methods and substantially outperforms the baseline methods.
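The pipeline described in the abstract (align multi-scale feature maps, aggregate them, apply spatial attention, then channel attention) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes that every pyramid level already has the same number of channels, that alignment is nearest-neighbour upsampling to the finest level, and that both attention steps are parameter-free sigmoid gates rather than learned layers. All function names (`upsample_nearest`, `aggregate`, `spatial_attention`, `channel_attention`, `fpam_sketch`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(fmap, height, width):
    """Resize a [C][H][W] map to (height, width) by nearest-neighbour lookup."""
    src_h, src_w = len(fmap[0]), len(fmap[0][0])
    return [[[ch[i * src_h // height][j * src_w // width] for j in range(width)]
             for i in range(height)] for ch in fmap]


def aggregate(pyramid):
    """Align every level to the first (finest) level and sum element-wise.

    Assumption: all levels share the same channel count, so only the
    spatial size needs aligning.
    """
    height, width = len(pyramid[0][0]), len(pyramid[0][0][0])
    aligned = [upsample_nearest(level, height, width) for level in pyramid]
    channels = len(aligned[0])
    return [[[sum(level[c][i][j] for level in aligned) for j in range(width)]
             for i in range(height)] for c in range(channels)]


def spatial_attention(fmap):
    """Gate each (frequency, time) location by a sigmoid of its channel mean."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    mask = [[sigmoid(sum(fmap[c][i][j] for c in range(channels)) / channels)
             for j in range(width)] for i in range(height)]
    out = [[[fmap[c][i][j] * mask[i][j] for j in range(width)]
            for i in range(height)] for c in range(channels)]
    return out, mask


def channel_attention(fmap):
    """Gate each channel by a sigmoid of its global average."""
    weights = [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in fmap]
    out = [[[v * w for v in row] for row in ch] for ch, w in zip(fmap, weights)]
    return out, weights


def fpam_sketch(pyramid):
    """Aggregate the pyramid, then apply spatial and channel gating in turn."""
    fused = aggregate(pyramid)
    attended, mask = spatial_attention(fused)
    refined, weights = channel_attention(attended)
    return refined, mask, weights


# Toy input: two pyramid levels with 2 channels each (2x2 and 1x1).
fine = [[[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]
coarse = [[[2.0]], [[0.0]]]
refined, mask, weights = fpam_sketch([fine, coarse])
```

In this toy input the top-left location and the first channel carry the most energy, so they receive the largest spatial and channel gates. A learned model would replace the fixed sigmoid gates with trained convolutional or fully connected layers.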