Paper Title
Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition
Paper Authors
Paper Abstract
Optimization of modern ASR architectures is among the highest-priority tasks, since it saves substantial computational resources for both model training and inference. This work proposes a new Uconv-Conformer architecture based on the standard Conformer model. It consistently reduces the input sequence length by a factor of 16, which speeds up the intermediate layers. To solve the convergence issues caused by such a significant reduction of the time dimension, we use upsampling blocks, as in the U-Net architecture, to ensure correct CTC loss calculation and to stabilize network training. The Uconv-Conformer architecture is not only faster in terms of training and inference speed but also achieves a better WER than the baseline Conformer. Our best Uconv-Conformer model shows 47.8% and 23.5% inference acceleration on the CPU and GPU, respectively. The relative WER reduction is 7.3% and 9.2% on LibriSpeech test_clean and test_other, respectively.