Title

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

Authors

Lai, Cheng-I, Chuang, Yung-Sung, Lee, Hung-Yi, Li, Shang-Wen, Glass, James

Abstract

Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantic understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available for training.
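The abstract describes a two-stage design: a pretrained E2E ASR front-end (or a wav2vec-style encoder) feeds a BERT-style model fine-tuned to predict both the intent and the slot values. The sketch below illustrates that interface shape only; all class names, the toy rule-based tagger, and the ATIS-style slot labels are illustrative assumptions, not the paper's actual code or models.

```python
# Illustrative sketch of the two-stage SLU pipeline described in the abstract.
# A real system would use a pretrained E2E ASR model (or wav2vec features)
# and a fine-tuned BERT-style NLU model; the stand-ins here are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SLUOutput:
    intent: str
    slots: List[Tuple[str, str]]  # (slot_name, slot_value) pairs


class PretrainedASR:
    """Stand-in for a pretrained E2E ASR front-end (or wav2vec encoder)."""

    def transcribe(self, audio: List[float]) -> str:
        # A real model would decode the waveform; we return a fixed
        # ATIS-style utterance just to illustrate the interface.
        return "show me flights from boston to denver"


class PretrainedNLU:
    """Stand-in for a BERT-style model fine-tuned for intent + slot filling."""

    def predict(self, text: str) -> SLUOutput:
        tokens = text.split()
        # Toy rule-based tagging in place of a learned slot filler.
        slots = []
        for i, tok in enumerate(tokens):
            if i > 0 and tokens[i - 1] == "from":
                slots.append(("fromloc.city_name", tok))
            elif i > 0 and tokens[i - 1] == "to":
                slots.append(("toloc.city_name", tok))
        return SLUOutput(intent="atis_flight", slots=slots)


def slu_pipeline(audio: List[float]) -> SLUOutput:
    """Speech in, structured semantics (intent + slots) out."""
    transcript = PretrainedASR().transcribe(audio)
    return PretrainedNLU().predict(transcript)
```

The key property the paper evaluates end to end is visible here: the output is structured semantics (intent plus slot values), not just a transcript, so ASR errors in the front-end propagate into the final semantic evaluation.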
