Paper Title
Speaker Identification from Emotional and Noisy Speech Data Using Learned Voice Segregation and Speech VGG
Paper Authors
Paper Abstract
Speech signals are subjected to more acoustic interference and emotional factors than other signals. Noisy, emotion-riddled speech data poses a challenge for real-time speech processing applications, so it is essential to find an effective way to segregate the dominant signal from other external influences. An ideal system should accurately recognize the required auditory events in a complex scene captured under unfavorable conditions. This paper proposes a novel approach to speaker identification under adverse conditions such as emotion and interference, using a pre-trained deep neural network mask and Speech VGG. The proposed model outperforms the recent literature on English and Arabic emotional speech data, reporting average speaker identification rates of 85.2%, 87.0%, and 86.6% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Speech Under Simulated and Actual Stress (SUSAS) dataset, and the Emirati-accented Speech Dataset (ESD), respectively.
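The abstract describes a two-stage pipeline: a learned time-frequency mask first segregates the dominant voice from interference, and a VGG-style network (Speech VGG) then identifies the speaker from the cleaned signal. The PyTorch sketch below is only a minimal illustration of that idea under stated assumptions; MaskNet, segregate, and all layer sizes are hypothetical and not the paper's actual implementation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All class/function names and dimensions here are illustrative assumptions;
# the abstract does not specify the mask network layout, the Speech VGG
# variant, or the feature dimensions.
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Hypothetical DNN that predicts a time-frequency mask in [0, 1]
    to segregate the dominant voice from interference."""
    def __init__(self, n_freq=257):  # 257 = n_fft // 2 + 1 for n_fft = 512
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_freq), nn.Sigmoid(),
        )

    def forward(self, noisy_mag):  # (batch, frames, freq)
        return self.net(noisy_mag)

def segregate(noisy_wav, mask_net, n_fft=512, hop=128):
    """Apply the learned mask to the noisy magnitude spectrogram,
    reuse the noisy phase, and invert back to a waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wav, n_fft, hop, window=window,
                      return_complex=True)          # (batch, freq, frames)
    mag, phase = spec.abs(), spec.angle()
    mask = mask_net(mag.transpose(1, 2)).transpose(1, 2)
    clean_spec = torch.polar(mask * mag, phase)     # masked magnitude + noisy phase
    return torch.istft(clean_spec, n_fft, hop, window=window)

# Stage 2: a VGG-style classifier over log-mel features of the segregated
# speech would then predict the speaker identity, e.g. (hypothetical names):
#   speaker_logits = speech_vgg(logmel(segregate(wav, mask_net)))
```

In this reading, the segregation stage acts as a learned front end, so the speaker classifier only ever sees speech with interference suppressed; this is one plausible way the reported robustness to noise and emotion could be realized.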