Correcting Mispronunciations in Speech using Spectrogram Inpainting

Paper Title

Correcting Mispronunciations in Speech using Spectrogram Inpainting

Authors

Talia Ben-Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet

Abstract

Learning a new language involves constantly comparing one's own speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Adult learners of a language adjust their speech to match a tutor's reference. This paper proposes a method to synthetically generate correct pronunciation feedback given an incorrect production. Furthermore, the aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. The masked waveform serves as input to a speech generator, implemented as a deep learning inpainting system with a U-net architecture and trained to output reconstructed speech. The training set is composed of unimpaired, proper speech examples, and the generator is trained to reconstruct the original proper speech. The system was evaluated on phoneme replacement in English minimal-pair words as well as on speech from children with pronunciation disorders. Results suggest that human listeners slightly prefer the generated speech over a smoothed replacement of the inaccurate phoneme with a production from a different speaker.
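The masking step described in the abstract can be illustrated with a minimal sketch. It only shows how the samples of one phoneme might be zeroed out before being passed to an inpainting model; it is not the authors' implementation. The function name `mask_segment`, the 16 kHz sample rate, and the assumption that the phoneme's start and end times are already known (for example, from some alignment step) are all hypothetical and do not come from the paper.

```python
def mask_segment(samples, start_s, end_s, sample_rate=16000):
    """Return a copy of `samples` with the span [start_s, end_s) set to zero.

    `samples` is a sequence of floats (a mono waveform). The times are in
    seconds. The 16 kHz default is an assumption, not a value from the paper.
    """
    if start_s < 0 or end_s < start_s:
        raise ValueError("expected 0 <= start_s <= end_s")

    # Convert times to sample indices and clamp them to the waveform length.
    start = min(int(round(start_s * sample_rate)), len(samples))
    end = min(int(round(end_s * sample_rate)), len(samples))

    masked = list(samples)                     # leave the input untouched
    masked[start:end] = [0.0] * (end - start)  # zero out the phoneme's samples
    return masked


# Hypothetical usage: one second of dummy audio, with the phoneme assumed
# to span 0.25 s to 0.50 s.
waveform = [1.0] * 16000
masked = mask_segment(waveform, 0.25, 0.50)
```

In a full system along the lines the abstract describes, the masked waveform would then be given to the trained generator, which fills in the zeroed region.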
