Paper Title
Are discrete units necessary for Spoken Language Modeling?
Paper Authors
Paper Abstract
Recent work in spoken language modeling shows that a language can be learned from raw audio in an unsupervised way, without any text labels. The approach first transforms the audio into a sequence of discrete units (or pseudo-text) and then trains a language model directly on that pseudo-text. Is such a discrete bottleneck necessary, given that it may introduce irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling: it removes linguistically irrelevant information from the continuous features, which helps improve language modeling performance. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1 - Speech Only).
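The two-stage pipeline described in the abstract (continuous features mapped to discrete units, then a language model trained on the resulting pseudo-text) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not say how units are obtained, so nearest-centroid quantization against a hand-written `codebook` stands in for the discretization step, 2-dimensional vectors stand in for HuBERT features, and a smoothed bigram model stands in for the language model. The function names `quantize` and `bigram_log_prob` are hypothetical.

```python
import numpy as np


def quantize(features, codebook):
    """Map each frame (row of `features`) to the index of its nearest centroid."""
    # Squared Euclidean distance from every frame to every codebook vector,
    # computed by broadcasting to shape (n_frames, n_units, dim).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def bigram_log_prob(train_units, test_units, vocab_size, alpha=1.0):
    """Log-probability of the transitions in `test_units` under a bigram model
    estimated from `train_units` with add-alpha smoothing."""
    counts = np.full((vocab_size, vocab_size), alpha, dtype=float)
    for prev, nxt in zip(train_units[:-1], train_units[1:]):
        counts[prev, nxt] += 1.0
    probs = counts / counts.sum(axis=1, keepdims=True)
    return float(sum(np.log(probs[prev, nxt])
                     for prev, nxt in zip(test_units[:-1], test_units[1:])))


# Toy stand-in for continuous speech features: noisy copies of three centroids.
# Real systems would use far larger codebooks and high-dimensional features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
true_sequence = [0, 1, 2, 0, 1, 2]
rng = np.random.default_rng(0)
features = codebook[true_sequence] + rng.uniform(-0.1, 0.1, size=(6, 2))

# Stage 1: discretize. The small noise is discarded by the quantizer.
units = quantize(features, codebook)

# Stage 2: fit a language model on the unit sequence (the "pseudo-text").
seen_order = bigram_log_prob(units, [0, 1, 2], vocab_size=3)
reversed_order = bigram_log_prob(units, [2, 1, 0], vocab_size=3)
print(units.tolist(), seen_order, reversed_order)
```

In this toy setting the quantizer removes the frame-level noise before the language model sees the data, which loosely mirrors the abstract's claim that discretization strips information irrelevant to the linguistic content. It is not a reproduction of the paper's experiments.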