MeWEHV: Mel and Wave Embeddings for Human Voice Tasks

Paper Title

MeWEHV: Mel and Wave Embeddings for Human Voice Tasks

Authors

Andrés Carofilis, Laura Fernández-Robles, Enrique Alegre, Eduardo Fidalgo

Abstract

A recent trend in speech processing is the use of embeddings created through machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated by a pre-trained raw audio waveform encoder model with deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate the performance of MeWEHV on three tasks: speaker, language, and accent identification. For the first one, we use the VoxCeleb1 dataset and present YouSpeakers204, a new and publicly available dataset for English speaker identification that contains 19607 audio clips from 204 persons speaking in six different accents, allowing other researchers to work with a very balanced dataset and to create new models that are robust to multiple accents. For evaluating the language identification task, we use the VoxForge and Common Language datasets. Finally, for accent identification, we use the Latin American Spanish Corpora (LASC) and Common Voice datasets. Our approach allows a significant increase in the performance of state-of-the-art models on all the tested datasets, with a low additional computational cost.
