Paper Title
Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR
Paper Authors
Paper Abstract
Leveraging context information is an intuitive idea for improving performance on conversational automatic speech recognition (ASR). Previous works usually adopt the recognized hypotheses of historical utterances as the preceding context, which may bias the current hypothesis due to inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representation extractor that learns contextual representations directly from preceding speech. Specifically, it consists of two modal-related encoders, which extract high-level latent features from speech and the corresponding text, and a cross-modal encoder that learns the correlation between speech and text. We randomly mask some input tokens and input sequences of each modality, and then perform token-missing or modal-missing prediction with a modal-level CTC loss on the cross-modal encoder. Thus, the model captures not only the bi-directional context dependencies within a specific modality but also the relationships between different modalities. During the training of the conversational ASR system, the extractor is frozen to extract the textual representation of preceding speech, and this representation is fed as context to the ASR decoder through an attention mechanism. The effectiveness of the proposed approach is validated on several Mandarin conversation corpora, and a character error rate (CER) reduction of up to 16% is achieved on the MagicData dataset.
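To make the extractor architecture concrete, the following is a minimal PyTorch sketch of the pre-training model outlined in the abstract: two modal-related encoders (speech and text), a cross-modal encoder over their concatenated outputs, and a CTC-style prediction head. All layer sizes, module choices (vanilla Transformer encoders, 80-dim filterbank input, vocabulary size), and the omission of the token-/modal-masking procedure are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalExtractor(nn.Module):
    """Hypothetical sketch of the audio-textual cross-modal representation
    extractor: two modal-related encoders plus a cross-modal encoder.
    Hyperparameters and module choices are placeholders for illustration."""

    def __init__(self, feat_dim=80, vocab_size=5000, d_model=256,
                 n_layers=4, n_heads=4):
        super().__init__()
        # Modal-related encoder for speech: project filterbank frames to
        # d_model, then encode with a Transformer.
        self.speech_proj = nn.Linear(feat_dim, d_model)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Modal-related encoder for the corresponding text.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Cross-modal encoder: learns the correlation between speech and
        # text by attending over the concatenated latent sequences.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Prediction head for the CTC-style loss computed on the
        # cross-modal encoder outputs (+1 output for the CTC blank label).
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)

    def forward(self, speech_feats, text_tokens):
        # speech_feats: (B, T_speech, feat_dim); text_tokens: (B, T_text)
        s = self.speech_encoder(self.speech_proj(speech_feats))
        t = self.text_encoder(self.text_embed(text_tokens))
        # Concatenate the two modalities along the time axis so the
        # cross-modal encoder can model inter-modal dependencies.
        joint = self.cross_encoder(torch.cat([s, t], dim=1))
        log_probs = self.ctc_head(joint).log_softmax(dim=-1)
        return joint, log_probs
```

In the downstream conversational ASR setup described in the abstract, an extractor of this kind would be frozen, its joint representation of the preceding speech would serve as the contextual memory, and the ASR decoder would attend over it (for example, through an additional cross-attention block) while decoding the current utterance.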