Paper Title

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Authors

Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang

Abstract

Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing. SSL models are normally pre-trained on a large variety of unlabelled data, and a large model size is preferred to increase the modeling capacity. However, this might limit their potential applications due to the expensive computation and memory costs introduced by the oversized model. Miniaturization of SSL models has become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR). First, in order to establish a strong baseline, a comprehensive study on different student model structures is conducted. On top of this, as a supplement to the regression loss widely adopted in previous works, a discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios. In addition, we design a simple and effective algorithm to distill the front-end input from waveform to Fbank features, resulting in a 17% parameter reduction and doubled inference speed, with only marginal performance degradation.
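
As a rough illustration of the combined objective described in the abstract, the minimal sketch below pairs a regression term on hidden states (L1 distance plus cosine similarity, as commonly used in prior distillation work) with a discriminative cross-entropy term over the teacher's discrete cluster IDs. The function name `distillation_loss`, the weights `alpha`/`beta`, and the tensor shapes are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, cluster_targets,
                      alpha=1.0, beta=1.0):
    """Illustrative combined regression + discriminative distillation loss.

    student_hidden, teacher_hidden: (batch, time, dim) hidden states.
    student_logits: (batch, time, n_clusters) scores over the teacher's
        k-means cluster IDs (this prediction head is an assumption here).
    cluster_targets: (batch, time) integer cluster IDs from the teacher.
    """
    # Regression term: L1 distance plus (1 - cosine similarity) between
    # student and teacher hidden states.
    reg = F.l1_loss(student_hidden, teacher_hidden) \
        + (1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean())

    # Discriminative term: cross-entropy against the teacher's discrete
    # cluster assignments, intended to help in low-resource scenarios.
    disc = F.cross_entropy(student_logits.flatten(0, 1), cluster_targets.flatten())

    return alpha * reg + beta * disc


# Toy usage with random tensors (shapes are placeholders).
B, T, D, K = 2, 50, 768, 500
loss = distillation_loss(
    torch.randn(B, T, D), torch.randn(B, T, D),
    torch.randn(B, T, K), torch.randint(0, K, (B, T)),
)
```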
