Paper Title
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics
Paper Authors
Paper Abstract
The analysis, processing, and extraction of meaningful information from the sounds around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to this domain: a cross-modal translation task that focuses on generating natural-language descriptions of the sound events occurring in an audio stream. In this work, we identify and address three main challenges in automated audio captioning: i) data scarcity, ii) imbalance or limitations in the audio captioning vocabulary, and iii) the lack of a performance evaluation metric that best captures both auditory and semantic characteristics. We find that commonly adopted loss functions can result in an unfair vocabulary imbalance during model training. We propose two audio captioning augmentation methods that enrich both the training dataset and the vocabulary. We further underline the need for in-domain pretraining by exploring the suitability of audio encoders previously trained on different audio tasks. Finally, we systematically examine five performance metrics borrowed from the image captioning domain and highlight their limitations for the audio domain.
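To make the vocabulary-imbalance point concrete, the toy sketch below (our illustration, not the paper's method; the corpus and the inverse-frequency weighting scheme are assumptions) shows how an unweighted per-token loss lets frequent filler words dominate the training signal, while rare sound-event words contribute little, and how reweighting can counter that:

```python
from collections import Counter

# Hypothetical toy caption corpus: filler words ("a", "dog") are frequent,
# rare sound-event words ("thunder") appear only once.
captions = [
    "a dog barks".split(),
    "a dog barks loudly".split(),
    "a door slams".split(),
    "distant thunder rumbles".split(),
]
counts = Counter(word for caption in captions for word in caption)
total = sum(counts.values())

# Under a plain (unweighted) token-level loss, every occurrence contributes
# equally, so a word's share of the aggregate loss tracks its frequency.
share = {w: counts[w] / total for w in counts}

# One common mitigation (an assumption here, not the paper's proposal):
# scale each token's loss by inverse frequency so rare words carry
# proportionally more weight during training.
weights = {w: total / (len(counts) * counts[w]) for w in counts}

print(f"loss share of 'a': {share['a']:.3f} vs 'thunder': {share['thunder']:.3f}")
print(f"weight of 'a': {weights['a']:.3f} vs 'thunder': {weights['thunder']:.3f}")
```

Running this, the frequent word "a" accounts for several times the loss share of "thunder", while the inverse-frequency weight for "thunder" is correspondingly larger, illustrating the imbalance the abstract describes.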