Paper Title

Speech Enhancement based on Denoising Autoencoder with Multi-branched Encoders

Authors

Cheng Yu, Ryandhimas E. Zezario, Syu-Siang Wang, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang, Yu Tsao

Abstract

Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy condition during testing, i.e., the performance is generally sub-optimal when models are tested with unseen noise types that are not involved in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noises cannot optimally remove a specific noise type even though the noise type has been involved in the training data. These problems are common in real applications. In this paper, we propose a novel denoising autoencoder with a multi-branched encoder (termed DAEME) model to deal with these two problems. In the DAEME model, two stages are involved: training and testing. In the training stage, we build multiple component models to form a multi-branched encoder based on a dynamically-sized decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along the branch in the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the testing stage, noisy speech is first processed by each component model. The multiple outputs from these models are then integrated into the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics, automatic speech recognition results, and quality in subjective human listening tests.
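The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the linear maps stand in for the paper's DNN component models, and the feature dimension, frame count, and branch count are placeholder values (the actual DSDT structure, training procedure, and model architectures are defined in the paper itself).

```python
import numpy as np

rng = np.random.default_rng(0)


class ComponentEncoder:
    """One branch of the multi-branched encoder: maps noisy feature
    frames to an enhanced estimate. In DAEME, each branch is a model
    trained on the data of one DSDT leaf; a random linear map is used
    here purely as a stand-in."""

    def __init__(self, dim):
        self.W = rng.standard_normal((dim, dim)) * 0.1

    def forward(self, x):
        return x @ self.W


class Decoder:
    """Fusion stage trained on top of the branches: integrates the
    concatenated branch outputs into the final enhanced frames."""

    def __init__(self, n_branches, dim):
        self.W = rng.standard_normal((n_branches * dim, dim)) * 0.1

    def forward(self, stacked):
        return stacked @ self.W


def daeme_enhance(noisy, branches, decoder):
    # Testing stage: the noisy speech is first processed by every
    # component model, then the decoder integrates the multiple
    # outputs to determine the final enhanced speech.
    outs = [b.forward(noisy) for b in branches]
    return decoder.forward(np.concatenate(outs, axis=-1))


# Placeholder setup: 257-bin spectral features, 4 DSDT-leaf branches.
dim, n_branches = 257, 4
branches = [ComponentEncoder(dim) for _ in range(n_branches)]
decoder = Decoder(n_branches, dim)

noisy = rng.standard_normal((100, dim))  # 100 frames of noisy features
enhanced = daeme_enhance(noisy, branches, decoder)
```

The sketch keeps the key structural point of the abstract: the branches run in parallel at test time, and only the decoder sees their combined outputs, so specialization per noisy condition lives in the branches while the decoder learns the integration.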
