SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

Paper Title

SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

Authors

Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali, Hamdy Mubarak, Shazia Afzal

Abstract

The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. SpeechBlender uses a variety of masks to target different regions of a phonetic unit and uses mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the 'Cut/Paste' method. Our proposed technique achieves state-of-the-art results on Speechocean762 for ASR-dependent mispronunciation detection models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state of the art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observe a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.
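To make the masking and linear interpolation idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the mask shapes, the range of the mixing factor, or how segments are aligned. The helper names `ramp_mask` and `blend`, the linear ramp shape, and the requirement that both segments have equal length are all assumptions made for illustration.

```python
def ramp_mask(n, start, end, ramp=0):
    """Hypothetical mask: weight 1.0 on samples in [start, end), linear
    ramps of `ramp` samples on either side, and 0.0 everywhere else."""
    if not (0 <= start <= end <= n):
        raise ValueError("need 0 <= start <= end <= n")
    mask = []
    for i in range(n):
        if start <= i < end:
            w = 1.0
        elif ramp > 0 and start - ramp <= i < start:
            w = (i - (start - ramp) + 1) / (ramp + 1)   # rising edge
        elif ramp > 0 and end <= i < end + ramp:
            w = (end + ramp - i) / (ramp + 1)           # falling edge
        else:
            w = 0.0
        mask.append(w)
    return mask


def blend(target, donor, mask, lam):
    """Linearly interpolate two equal-length raw signals under a mask.

    Where mask == 1 the output is lam * target + (1 - lam) * donor;
    where mask == 0 the target sample is left untouched. Intermediate
    mask values fade between the two, which avoids the hard edges of a
    plain cut-and-paste splice.
    """
    if not (len(target) == len(donor) == len(mask)):
        raise ValueError("target, donor and mask must have equal length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [t + m * (1.0 - lam) * (d - t)
            for t, d, m in zip(target, donor, mask)]


# Usage: blend the middle of a well-pronounced phoneme segment with a
# donor segment (toy constant signals stand in for real audio samples).
good_phone = [1.0] * 8
donor_phone = [0.0] * 8
augmented = blend(good_phone, donor_phone, ramp_mask(8, 2, 6, ramp=1), lam=0.25)
```

In this sketch a smaller `lam` pushes the masked region closer to the donor signal, so it could plausibly be tied to a lower pronunciation score for the augmented sample; that mapping is likewise an assumption, not something stated in the abstract.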
