Paper Title
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
Authors
Abstract
End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. In practice, however, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relatively), compared to the previous state of the art trained without additional data. The improvements on low-resource languages are even more significant (+398% relatively on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.