Paper Title

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Authors

Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim

Abstract

Speech is one of the most effective means of communication and is full of information that helps the transmission of the utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probabilities have frequently been discarded in natural language understanding. Thus, some recent spoken language understanding (SLU) modules have utilized end-to-end structures that preserve this uncertainty information. This further reduces the propagation of speech recognition errors and guarantees computational efficiency. We claim that in this process, speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal through performance on Fluent Speech Commands, an English SLU benchmark. Thereby, we experimentally verify our hypothesis that knowledge can be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.
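The core idea of the abstract — matching a speech module's representation to a text LM's top-layer representation — can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the zero-initialized student vector, and the plain MSE objective with hand-coded gradient steps are all illustrative assumptions standing in for a frozen text-LM teacher and a trainable speech-encoder student.

```python
import random

random.seed(0)

# Hypothetical shared embedding size (an assumption, not from the paper).
d = 8

# Stand-ins for utterance-level representations:
# teacher = frozen text-LM top-layer output, student = speech-encoder output.
teacher = [random.gauss(0.0, 1.0) for _ in range(d)]
student = [0.0] * d  # trainable speech-side representation

def mse(a, b):
    """Mean-squared error between two vectors -- the simplest
    representation-level distillation objective."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

initial_loss = mse(student, teacher)

# Toy gradient-descent steps pulling the student toward the teacher,
# mimicking how a distillation loss is minimized during training.
lr = 0.5
for _ in range(20):
    # d(MSE)/d(student_i) = 2 * (student_i - teacher_i) / d
    student = [s - lr * 2.0 * (s - t) / d for s, t in zip(student, teacher)]

final_loss = mse(student, teacher)
print(initial_loss, final_loss)  # the loss shrinks as the representations align
```

In the paper's setting the student would instead be an end-to-end SLU network trained by backpropagation, but the shape of the objective — drive the speech-derived representation toward the LM's semantic representation — is the same.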
