Title

Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems

Authors

Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury

Abstract

Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner, where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT-based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment, especially when speech is noisy, giving an absolute improvement as high as 8% over previous results.
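
The abstract describes two core ideas: pooling frame-level speech-encoder outputs into token-level embeddings via cross-modal attention, and pulling each pooled embedding towards its corresponding BERT embedding with a tokenwise contrastive loss. The PyTorch sketch below illustrates one plausible reading of that setup; the module name `TokenwiseAligner`, the choice of BERT embeddings as attention queries, and the InfoNCE-style loss formulation are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of tokenwise contrastive alignment between a speech
# encoder and BERT. Names, shapes, and design choices are assumptions for
# illustration, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenwiseAligner(nn.Module):
    def __init__(self, d_model=768, n_heads=8, temperature=0.1):
        super().__init__()
        # Cross-modal attention: BERT token embeddings act as queries over
        # the speech encoder's frame-level outputs (assumed design choice).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temperature = temperature

    def forward(self, speech_frames, bert_tokens):
        """
        speech_frames: (B, T, d) frame-level speech encoder outputs
        bert_tokens:   (B, N, d) contextual BERT embeddings (frozen teacher)
        """
        # Pool speech frames into one embedding per text token.
        speech_tokens, _ = self.cross_attn(
            query=bert_tokens, key=speech_frames, value=speech_frames
        )

        # Tokenwise contrastive (InfoNCE-style) loss: each pooled speech
        # embedding should match its own BERT token, not the other tokens
        # in the utterance.
        s = F.normalize(speech_tokens, dim=-1)   # (B, N, d)
        t = F.normalize(bert_tokens, dim=-1)     # (B, N, d)
        logits = torch.bmm(s, t.transpose(1, 2)) / self.temperature  # (B, N, N)
        targets = torch.arange(logits.size(1), device=logits.device)
        targets = targets.unsqueeze(0).expand(logits.size(0), -1)    # (B, N)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
```

Per the abstract, this alignment loss would apply only during pretraining; the aligned speech encoder is subsequently fine-tuned for intent recognition on the downstream SLU datasets.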
