Paper Title

Transformer-based Models of Text Normalization for Speech Applications

Paper Authors

Jae Hun Ro, Felix Stahlberg, Ke Wu, Shankar Kumar

Paper Abstract

Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen ninety five" in "born in 1995" or as "one thousand nine hundred ninety five" in "page 1995". We present an experimental comparison of various Transformer-based sequence-to-sequence (seq2seq) models of text normalization for speech and evaluate them on a variety of datasets of written text aligned to its normalized spoken form. These models include variants of the 2-stage RNN-based tagging/seq2seq architecture introduced by Zhang et al. (2019), where we replace the RNN with a Transformer in one or more stages, as well as vanilla Transformers that output string representations of edit sequences. Of our approaches, using Transformers for sentence context encoding within the 2-stage model proved most effective, with the fine-tuned BERT encoder yielding the best performance.
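The sketch below illustrates the 2-stage tagging/seq2seq idea described in the abstract: a first stage labels each token as either copied through verbatim or needing verbalization, and a second stage spells out the tagged spans using sentence context. It is a minimal, hypothetical illustration only; the neural components the paper studies (RNN, Transformer, or fine-tuned BERT encoders, and a seq2seq verbalizer) are replaced here by trivial rules, and names such as `tag_tokens` and `verbalize_span` are invented for this sketch rather than taken from the paper.

```python
# Toy sketch of a 2-stage text normalization pipeline (tagger + verbalizer).
# The neural models from the paper are stubbed out with simple rules.

from typing import List


def tag_tokens(tokens: List[str]) -> List[str]:
    """Stage 1: label each token <self> (copy through) or <verbalize>.
    In the paper this is a neural tagger over a sentence-context encoder."""
    return ["<verbalize>" if any(c.isdigit() for c in t) else "<self>"
            for t in tokens]


def verbalize_span(token: str, context: List[str]) -> str:
    """Stage 2: spell out a tagged token given its sentence context.
    In the paper this is a seq2seq model; here a toy rule distinguishes
    the 'born in 1995' reading from the 'page 1995' reading."""
    if token == "1995":
        if "born" in context:
            return "nineteen ninety five"
        return "one thousand nine hundred ninety five"
    return token


def normalize(sentence: str) -> str:
    tokens = sentence.split()
    tags = tag_tokens(tokens)
    out = [verbalize_span(tok, tokens) if tag == "<verbalize>" else tok
           for tok, tag in zip(tokens, tags)]
    return " ".join(out)


print(normalize("born in 1995"))  # born in nineteen ninety five
print(normalize("page 1995"))     # page one thousand nine hundred ninety five
```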
