Paper Title

Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers

Authors

Josh Belanich, Krishna Somandepalli, Brian Eoff, Brendan Jou

Abstract

This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts, as is standard in sound event detection literature. Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics, and comprise our team's main submission to the MultiTask track. We then sought to characterize the headroom in the MultiTask track by applying a large pre-trained Conformer model that previously achieved state-of-the-art results on paralinguistic tasks like speech emotion recognition and mask detection. We additionally investigated the relationship between the sub-tasks of emotional expression, country of origin, and age prediction, and discovered that the best performing models are trained as single-task models, questioning whether the problem truly benefits from a multitask setting.
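The abstract's first approach, an image-classification backbone applied to mel-spectrograms with separate outputs for emotion, country of origin, and age, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: the sample rate, mel-filterbank settings, ResNet variant, and head dimensions (`n_emotions`, `n_countries`) are illustrative choices, and the class name `MultiTaskVocalBurstModel` is hypothetical.

```python
# Minimal sketch: ResNet over mel-spectrograms with three task heads
# (emotion intensities, country of origin, age). All hyperparameters
# below are assumptions for illustration, not values from the paper.
import torch
import torch.nn as nn
import torchaudio
import torchvision


class MultiTaskVocalBurstModel(nn.Module):
    def __init__(self, n_emotions=10, n_countries=4):
        super().__init__()
        # Log-mel front end (assumed settings).
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=160, n_mels=128)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Off-the-shelf image-classification backbone; final classifier removed.
        self.backbone = torchvision.models.resnet18(weights=None)
        feat_dim = self.backbone.fc.in_features
        self.backbone.fc = nn.Identity()
        # One output head per sub-task.
        self.emotion_head = nn.Linear(feat_dim, n_emotions)   # emotion intensities
        self.country_head = nn.Linear(feat_dim, n_countries)  # country logits
        self.age_head = nn.Linear(feat_dim, 1)                # age regression

    def forward(self, waveform):
        # waveform: (batch, num_samples)
        spec = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (B, 1, mel, frames)
        spec = spec.repeat(1, 3, 1, 1)  # ResNet expects 3 input channels
        feats = self.backbone(spec)
        return self.emotion_head(feats), self.country_head(feats), self.age_head(feats)


model = MultiTaskVocalBurstModel()
dummy = torch.randn(2, 16000)  # two 1-second clips at an assumed 16 kHz
emotion, country, age = model(dummy)
```

A single-task variant of the same sketch would simply drop two of the heads, which is the comparison the abstract raises when questioning whether the problem benefits from a multitask setting.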
