Paper Title

Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech

Authors

Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, Gopala K. Anumanchipalli

Abstract

Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks. To understand such utilities, various analyses have been done for speech SSL models to reveal which and how information is encoded in the learned representations. Although the scope of previous analyses is extensive in acoustic, phonetic, and semantic perspectives, the physical grounding by speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach where we measure articulatory score as an average correlation of linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further layer-wise analyses on two most successful models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from the recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes are sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
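
The probing setup described in the abstract can be pictured with a minimal sketch: fit a linear map from frame-level SSL features to EMA channels and report the correlation averaged over channels as the articulatory score. Everything concrete below is an assumption for illustration (ridge regression as the linear model, an 80/20 split, a 12-channel EMA layout, frame-aligned inputs); it is not the paper's exact protocol.

```python
# Sketch of a linear probe from SSL features to EMA trajectories.
# Hypothetical setup: ridge regression, 80/20 split, correlation averaged over channels.
import numpy as np
from sklearn.linear_model import Ridge

def articulatory_score(ssl_features, ema_traces, train_frac=0.8, alpha=1.0):
    """Fit a linear map from SSL features to EMA channels and return the
    per-channel Pearson correlation averaged over channels."""
    n = len(ssl_features)
    split = int(n * train_frac)
    X_train, X_test = ssl_features[:split], ssl_features[split:]
    Y_train, Y_test = ema_traces[:split], ema_traces[split:]

    probe = Ridge(alpha=alpha)  # linear probe; regularization strength is an assumption
    probe.fit(X_train, Y_train)
    Y_pred = probe.predict(X_test)

    # Pearson correlation per EMA channel, then averaged
    corrs = [np.corrcoef(Y_pred[:, k], Y_test[:, k])[0, 1]
             for k in range(Y_test.shape[1])]
    return float(np.mean(corrs))

# Usage (dummy data; features and EMA must be frame-aligned in practice):
# feats = np.random.randn(5000, 768)  # e.g. layer activations from Wav2Vec 2.0 / HuBERT
# ema = np.random.randn(5000, 12)     # 12 EMA channels (6 articulators x 2D) is an assumption
# print(articulatory_score(feats, ema))
```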
