Paper Title
Tight Integrated End-to-End Training for Cascaded Speech Translation
Paper Authors
Paper Abstract
A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative method that avoids error propagation; however, its performance often lags behind that of cascade systems. To use an intermediate representation while preserving end-to-end trainability, previous studies have proposed two-stage models that pass the hidden vectors of the recognizer into the decoder of the MT model while ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade into a single end-to-end trainable model by optimizing all parameters of the ASR and MT models jointly, without discarding any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors and thus enables backpropagation through the full model. Therefore, it provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models by up to 1.8% in BLEU and 2.0% in TER and is also superior to direct models.
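To illustrate the tight-integration idea described in the abstract, the following is a minimal, hypothetical PyTorch sketch (not the authors' code): instead of feeding an argmax one-hot transcript into the MT encoder, the ASR posteriors are sharpened and renormalized into a soft decision, and their expectation over the MT source embeddings is passed on, keeping the whole cascade differentiable. All module names, dimensions, and the `gamma` parameter are illustrative assumptions.

```python
# Minimal sketch of tight cascade integration via renormalized soft decisions.
# Assumption: simplified single-layer ASR/MT stand-ins; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TightCascade(nn.Module):
    def __init__(self, num_mel=80, vocab_src=1000, vocab_tgt=1000,
                 d_model=256, gamma=1.0):
        super().__init__()
        # Stand-ins for a full ASR model and a full MT model.
        self.asr_encoder = nn.LSTM(num_mel, d_model, batch_first=True)
        self.asr_output = nn.Linear(d_model, vocab_src)   # source word posteriors
        self.mt_src_embed = nn.Embedding(vocab_src, d_model)
        self.mt_encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.mt_output = nn.Linear(d_model, vocab_tgt)
        self.gamma = gamma                                 # renormalization exponent

    def forward(self, speech_features):
        # ASR part: (simplified) source word posteriors per time step.
        enc, _ = self.asr_encoder(speech_features)
        asr_logits = self.asr_output(enc)                  # (B, T, V_src)

        # Tight integration: renormalized posteriors as a *soft* decision.
        # Sharpening with gamma and renormalizing stays differentiable,
        # unlike an argmax one-hot vector.
        posteriors = F.softmax(asr_logits, dim=-1)
        sharpened = posteriors ** self.gamma
        soft_decision = sharpened / sharpened.sum(dim=-1, keepdim=True)

        # Expected source embedding instead of a discrete embedding lookup;
        # gradients flow back from the MT loss into the ASR parameters.
        src_embed = soft_decision @ self.mt_src_embed.weight  # (B, T, d_model)

        # MT part reused unchanged (encoder shown; decoder omitted for brevity).
        mt_enc, _ = self.mt_encoder(src_embed)
        return asr_logits, self.mt_output(mt_enc)


# Toy usage: ASR and MT losses can now be backpropagated jointly.
model = TightCascade()
speech = torch.randn(2, 50, 80)                            # (batch, frames, mel bins)
asr_logits, tgt_logits = model(speech)
print(asr_logits.shape, tgt_logits.shape)
```

The key design point, under these assumptions, is that the MT encoder is not bypassed: it consumes expected embeddings computed from the renormalized posteriors, so all learned ASR and MT parameters remain in the joint optimization.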