Title
Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring
Authors
Abstract
Recent work on unsupervised speech segmentation has used self-supervised models with phone and word segmentation modules that are trained jointly. This paper instead revisits an older approach to word segmentation: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). To do this, I propose a new unit discovery model, a new symbolic word segmentation model, and then chain the two models to segment speech. Both models use dynamic programming to minimize segment costs from a self-supervised network with an additional duration penalty that encourages longer units. Concretely, for acoustic unit discovery, duration-penalized dynamic programming (DPDP) is used with a contrastive predictive coding model as the scoring network. For word segmentation, DPDP is applied with an autoencoding recurrent neural network as the scoring network. The two models are chained in order to segment speech. This approach gives comparable word segmentation results to state-of-the-art joint self-supervised segmentation models on an English benchmark. On French, Mandarin, German and Wolof data, it outperforms previous systems on the ZeroSpeech benchmarks. Analysis shows that the chained DPDP system segments shorter filler words well, but longer words might require some external top-down signal.
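The core recipe described above, dynamic programming over candidate segments with a per-segment score plus a duration penalty, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `segment_cost` and `dur_penalty` are toy stand-ins for the self-supervised scoring networks (CPC or the autoencoding RNN), and the input is a plain list rather than acoustic features.

```python
def dpdp_segment(frames, segment_cost, dur_penalty, max_dur=8):
    """Duration-penalized dynamic programming (DPDP) sketch.

    Returns segment end boundaries minimizing the total segment cost
    plus a duration penalty that encourages longer segments.
    `segment_cost` and `dur_penalty` are illustrative stand-ins for
    the scoring networks used in the paper.
    """
    n = len(frames)
    best = [0.0] + [float("inf")] * n  # best[t]: min cost of frames[:t]
    back = [0] * (n + 1)               # back[t]: start of last segment
    for t in range(1, n + 1):
        for s in range(max(0, t - max_dur), t):
            cost = best[s] + segment_cost(frames[s:t]) + dur_penalty(t - s)
            if cost < best[t]:
                best[t], back[t] = cost, s
    # Backtrace the optimal boundaries.
    bounds, t = [], n
    while t > 0:
        bounds.append(t)
        t = back[t]
    return sorted(bounds)


# Toy example: homogeneous runs are cheap to keep as single segments.
frames = [1, 1, 1, 5, 5, 2, 2, 2, 2]
cost = lambda seg: sum(abs(x - sum(seg) / len(seg)) for x in seg)
penalty = lambda d: -0.5 * d  # reward (negative penalty) for longer units
print(dpdp_segment(frames, cost, penalty))  # → [3, 5, 9]
```

In the paper's chained system, this same dynamic program is run twice: once over acoustic frames with a CPC-based scoring network to discover phone-like units, and once over the resulting discrete unit sequence with an autoencoding recurrent network to place word boundaries.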