Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives

Paper Title

Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives

Authors

Samik Sadhu, Hynek Hermansky

Abstract

How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. First, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations and frame-wise phoneme labels. Second, we ask which speech modulations an Automatic Speech Recognition (ASR) system prefers for its operation: data-driven weights are learned over the modulation spectrum and optimized for an end-to-end ASR task. Both methods agree that speech information is mostly contained in slow modulations. Maximum mutual information occurs around 3-6 Hz, which is also the range of modulations most preferred by the ASR system. In addition, we show that incorporating this knowledge into ASR systems significantly reduces their dependency on the amount of training data.
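
The first perspective in the abstract relies on mutual information between a modulation feature and frame-wise phoneme labels. As a rough illustration only, the sketch below estimates that quantity for a single scalar feature with a simple histogram (plug-in) estimator. The estimator, the function name `mutual_information_bits`, and the bin count are assumptions made here for illustration; the abstract does not say how the authors computed mutual information.

```python
import numpy as np

def mutual_information_bits(feature, labels, n_bins=16):
    """Plug-in (histogram) estimate of I(feature; labels) in bits.

    feature: one scalar value per frame (e.g. energy at one modulation
             frequency); assumed to take at least two distinct values.
    labels:  one discrete phoneme label per frame.
    """
    feature = np.asarray(feature, dtype=float).ravel()
    labels = np.asarray(labels).ravel()

    # Quantize the continuous feature into equal-width bins (0 .. n_bins-1).
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    f_idx = np.digitize(feature, edges[1:-1])

    # Map arbitrary label values to integer indices 0 .. K-1.
    _, l_idx = np.unique(labels, return_inverse=True)

    # Joint histogram, normalized to a joint probability table.
    joint = np.zeros((n_bins, l_idx.max() + 1))
    np.add.at(joint, (f_idx, l_idx), 1.0)
    joint /= joint.sum()

    # Marginals and the sum over non-empty cells of p * log2(p / (p_f * p_l)).
    p_f = joint.sum(axis=1, keepdims=True)
    p_l = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_f @ p_l)[nz])))
```

Calling this function once per modulation frequency and plotting the results against frequency would give a curve of the kind the abstract describes. Histogram estimates are biased upward for small sample sizes, so the bin count should be chosen with the number of available frames in mind.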
