Paper Title

Relative Positional Encoding for Speech Recognition and Direct Translation

Authors

Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stueker, Jan Niehues, Alexander Waibel

Abstract

Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.
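The key addition the abstract describes is adding the relative distance between input states to the self-attention scores. A minimal single-head NumPy sketch of this idea, in the style of learned relative-position embeddings (Shaw et al.) rather than the paper's exact formulation, is below; the function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(q, k, v, rel_emb, max_dist):
    """Single-head self-attention with learned relative position embeddings.

    q, k, v:  (T, d) query/key/value states for one sequence.
    rel_emb:  (2*max_dist + 1, d) embeddings, one per clipped distance j - i.
    """
    T, d = q.shape
    content = q @ k.T                                    # (T, T) content-content scores
    # Relative distance j - i for every (query i, key j) pair, clipped so that
    # distances beyond +/- max_dist share one embedding.
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]
    idx = np.clip(dist, -max_dist, max_dist) + max_dist  # shift into [0, 2*max_dist]
    rel = rel_emb[idx]                                   # (T, T, d) per-pair embedding
    position = np.einsum('id,ijd->ij', q, rel)           # content-position scores
    attn = softmax((content + position) / np.sqrt(d))
    return attn @ v, attn
```

Because the position term depends only on the distance `j - i`, not on absolute indices, the same learned offsets apply at any point in the input, which is what lets the model cope with the variable-length acoustic sequences mentioned above.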
