Paper Title

Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Paper Authors

Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, Stefan Wermter

Paper Abstract

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models and surpass the state-of-the-art methods when combining Prosody2Vec with HuBERT representations.
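To make the three-component design described in the abstract more concrete, below is a minimal, illustrative PyTorch sketch of how such an unsupervised reconstruction model could be wired together: discrete content units, a frozen speaker embedding, and a trainable prosody encoder conditioning a decoder that reconstructs the input speech features. The class names (Prosody2VecSketch, ProsodyEncoder), dimensions, and the L1 reconstruction loss are assumptions for illustration only and are not taken from the paper.

```python
# Illustrative sketch of a Prosody2Vec-style reconstruction model.
# All module names and hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn as nn


class ProsodyEncoder(nn.Module):
    """Trainable encoder mapping mel-spectrogram frames to an utterance-level prosody embedding."""
    def __init__(self, n_mels=80, hidden=256, prosody_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, prosody_dim)

    def forward(self, mels):                          # mels: (B, T, n_mels)
        out, _ = self.rnn(mels)
        return self.proj(out.mean(dim=1))             # (B, prosody_dim)


class Prosody2VecSketch(nn.Module):
    """Reconstruct speech features from content units, speaker embedding, and prosody."""
    def __init__(self, n_units=100, unit_dim=256, spk_dim=192,
                 prosody_dim=128, n_mels=80):
        super().__init__()
        self.unit_encoder = nn.Embedding(n_units, unit_dim)   # discrete semantic-content units
        self.prosody_encoder = ProsodyEncoder(n_mels, prosody_dim=prosody_dim)
        self.decoder = nn.GRU(unit_dim + spk_dim + prosody_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, units, spk_emb, mels):
        # units: (B, T) discrete unit ids for semantic content
        # spk_emb: (B, spk_dim) from a frozen pretrained speaker-verification model
        # mels: (B, T, n_mels) target speech features (also the prosody encoder input)
        content = self.unit_encoder(units)                     # (B, T, unit_dim)
        prosody = self.prosody_encoder(mels)                   # (B, prosody_dim)
        T = content.size(1)
        cond = torch.cat([content,
                          spk_emb.unsqueeze(1).expand(-1, T, -1),
                          prosody.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        hidden, _ = self.decoder(cond)
        return self.out(hidden)                                # reconstructed mels


# Unsupervised reconstruction objective: no emotion labels are required for pretraining.
model = Prosody2VecSketch()
units = torch.randint(0, 100, (2, 120))    # e.g. k-means cluster ids of self-supervised features
spk_emb = torch.randn(2, 192)              # e.g. an ECAPA-TDNN-style speaker embedding
mels = torch.randn(2, 120, 80)
loss = nn.functional.l1_loss(model(units, spk_emb, mels), mels)
loss.backward()
```

After such pretraining on unlabelled emotional speech, the prosody embedding alone could be fed to a downstream SER classifier or swapped between utterances for emotional voice conversion, which is the usage pattern the abstract describes.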
