Paper Title


Probing phoneme, language and speaker information in unsupervised speech representations

Authors

Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski

Abstract


Unsupervised models of representations based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried out on the discrete units from a downstream clustering model. However, although there is no effect of the number of target clusters on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
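The probing analyses described above can be pictured as training a simple classifier on frozen speech representations and checking how well a property (phone class, gender, or language) can be decoded from them. The sketch below is only illustrative and not the authors' code: the feature matrix is a synthetic placeholder standing in for utterance-level pooled CPC embeddings (the 256-dimension size and the binary language label are assumptions), and a linear probe from scikit-learn stands in for the quantitative probing tools.

```python
# Minimal sketch of a linear probe on frozen speech representations.
# The arrays below are synthetic placeholders; in practice, `features`
# would hold frame-level CPC embeddings mean-pooled per utterance, and
# `labels` a property of interest (phone class, gender, or language).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_utterances, feat_dim = 1000, 256                       # assumed embedding size
features = rng.normal(size=(n_utterances, feat_dim))     # placeholder for pooled CPC features
labels = rng.integers(0, 2, size=n_utterances)           # e.g. binary language label

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# A linear classifier trained on frozen features: high test accuracy
# suggests the probed property is linearly decodable from the representation.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```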
