Contrastive Siamese Network for Semi-supervised Speech Recognition

Paper Title

Contrastive Siamese Network for Semi-supervised Speech Recognition

Authors

Soheil Khorram, Jaeyoung Kim, Anshuman Tripathi, Han Lu, Qian Zhang, Hasim Sak

Abstract

This paper introduces the contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. The c-siam network is the first that extracts high-level linguistic information from speech by matching the outputs of two identical transformer encoders. It contains an augmented branch and a target branch, which are trained by (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop-gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, and (4) introducing new temporal augmentation functions to prevent the shortcut learning problem. We use the Libri-Light 60k-hour unsupervised data and the LibriSpeech 100-hour/960-hour supervised data to compare c-siam with other best-performing systems. Our experiments show that c-siam provides a 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves competitive results compared to state-of-the-art networks with 600M parameters.
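
The matching objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the InfoNCE-style form, the cosine similarity, the `temperature` value, the use of all other frames of the same utterance as negatives, and the function names `cosine` and `contrastive_loss` are all assumptions made for illustration. The encoders, the masking, and the extra learnable transformation are omitted; the inputs stand in for their outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(predictions, targets, masked_frames, temperature=0.1):
    """InfoNCE-style loss averaged over the masked frames (assumed form).

    predictions: per-frame outputs of the augmented branch, after the
        extra learnable transformation.
    targets: per-frame outputs of the target branch. They are plain
        constants here, which plays the role of the stop-gradient.
    masked_frames: indices of frames masked in the augmented input.
    """
    total = 0.0
    for t in masked_frames:
        # The target at frame t is the positive; every other target
        # frame acts as a negative (an assumption of this sketch).
        logits = [cosine(predictions[t], tgt) / temperature for tgt in targets]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[t]
    return total / len(masked_frames)


# Toy data: three frames with one-hot target-branch outputs.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [list(frame) for frame in targets]   # predictions match frame by frame
shifted = [targets[1], targets[2], targets[0]]  # predictions match the wrong frame

loss_aligned = contrastive_loss(aligned, targets, masked_frames=[0, 1, 2])
loss_shifted = contrastive_loss(shifted, targets, masked_frames=[0, 1, 2])
```

In this toy setup the loss is small when each prediction matches the target of its own frame and large when it matches a different frame. In an automatic-differentiation framework, the target-branch outputs would be wrapped in a stop-gradient call such as `jax.lax.stop_gradient` or PyTorch's `Tensor.detach()` so that only the augmented branch receives gradients from this loss.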
