Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals

Paper Title

Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals

Authors

Rahil Parikh, Nadee Seneviratne, Ganesh Sivaraman, Shihab Shamma, Carol Espy-Wilson

Abstract

Multi-resolution spectro-temporal features of a speech signal represent how the brain perceives sounds by tuning cortical cells to different spectral and temporal modulations. These features produce a higher-dimensional representation of the speech signal. The purpose of this paper is to evaluate how well this auditory cortex representation of speech signals contributes to estimating the articulatory features of those signals. Because estimating articulatory features from the acoustic features of speech has long been a challenging problem of interest across speech communities, we investigate the possibility of using this multi-resolution representation as the acoustic features. We used the University of Wisconsin X-ray Microbeam (XRMB) database of clean speech signals to train a feed-forward deep neural network (DNN) to estimate the articulatory trajectories of six tract variables. The optimal set of multi-resolution spectro-temporal features for training the model was chosen by selecting appropriate scale and rate vector parameters. Experiments achieved a correlation of 0.675 with ground-truth tract variables. We compared the performance of this speech inversion system with prior experiments conducted using Mel Frequency Cepstral Coefficients (MFCCs).
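The abstract reports a single correlation figure against ground-truth tract variables but does not say how it is computed. The following is a minimal sketch, assuming the score is the Pearson product-moment correlation computed separately for each of the six tract-variable trajectories and then averaged. The function names `pearson` and `mean_tv_correlation` and the toy data are hypothetical illustrations, not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of centred cross-products and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_tv_correlation(predicted, reference):
    """Per-tract-variable correlations and their average.

    Both inputs are lists of frames; each frame is a list holding one
    value per tract variable (assumed layout: frames x tract variables).
    """
    # zip(*rows) transposes frames x variables into per-variable trajectories.
    per_tv = [
        pearson(pred_traj, ref_traj)
        for pred_traj, ref_traj in zip(zip(*predicted), zip(*reference))
    ]
    return per_tv, sum(per_tv) / len(per_tv)


# Toy data (hypothetical): 10 frames, 6 tract variables.
reference = [[float(t * (k + 1)) for k in range(6)] for t in range(10)]
# A prediction that is an exact linear function of the reference.
predicted = [[2.0 * v + 1.0 for v in frame] for frame in reference]

per_tv, average = mean_tv_correlation(predicted, reference)
print(len(per_tv), average)
```

Because Pearson correlation ignores scale and offset, the toy prediction above scores essentially perfectly even though its values differ from the reference. For that reason a correlation score is often read alongside an error measure such as RMSE.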

Paper Title