Paper Title

CT-SAT: Contextual Transformer for Sequential Audio Tagging

Authors

Hou, Yuanbo, Liu, Zhaoyi, Kang, Bo, Wang, Yun, Botteldooren, Dick

Abstract

Sequential audio event tagging provides not only the types of audio events, but also the order in which they occur and the number of events in an audio clip. Most previous work on audio event sequence analysis relies on connectionist temporal classification (CTC). However, CTC's conditional independence assumption prevents it from effectively learning correlations between diverse audio events. This paper makes a first attempt to introduce the Transformer into sequential audio tagging, since Transformers perform well on sequence-related tasks. To better exploit the contextual information of audio event sequences, we draw on the idea of bidirectional recurrent neural networks and propose a contextual Transformer (cTransformer) with a bidirectional decoder that exploits both the forward and backward information of event sequences. Experiments on a real-life polyphonic audio dataset show that, compared to CTC-based methods, the cTransformer can effectively combine fine-grained acoustic representations from the encoder with coarse-grained audio event cues, exploiting contextual information to successfully recognize and predict audio event sequences.
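The bidirectional decoder described above runs two decoding passes over the event sequence, one left-to-right and one right-to-left. In Transformer terms, the difference between the two passes comes down to the self-attention mask each one uses. A minimal NumPy sketch of the two masks (the function name and this formulation are illustrative, not taken from the paper's code):

```python
import numpy as np

def causal_mask(n, backward=False):
    """Boolean self-attention mask for an n-step decoder pass.

    Forward (left-to-right) pass: position i may attend to positions <= i.
    Backward (right-to-left) pass: position i may attend to positions >= i.
    True marks an allowed attention connection.
    """
    m = np.tril(np.ones((n, n), dtype=bool))
    return m.T if backward else m

fwd = causal_mask(4)                  # lower-triangular: standard causal mask
bwd = causal_mask(4, backward=True)   # upper-triangular: reversed direction

# The backward mask is exactly the transpose of the forward mask,
# so both passes can share weights while seeing opposite contexts.
assert (bwd == fwd.T).all()
```

With masks like these, the same decoder stack can be applied twice to the event-token sequence, and the two passes together cover the full context of each event, which is the contextual information the cTransformer aims to exploit.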
