Paper Title
Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation
Paper Authors
Paper Abstract
End-to-end Speech Translation (ST) aims to translate source language speech into target language text without generating intermediate transcriptions. However, training end-to-end models relies on parallel ST data, which are difficult and expensive to obtain. Fortunately, supervised data for automatic speech recognition (ASR) and machine translation (MT) are usually more accessible, making zero-shot speech translation a promising direction. Existing zero-shot methods fail to align the speech and text modalities in a shared semantic space, and thus perform much worse than supervised ST methods. To enable zero-shot ST, we propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities. Specifically, we introduce a vector quantization module that discretizes the continuous representations of speech and text into a finite set of virtual tokens, and use ASR data to map corresponding speech and text to the same virtual token in a shared codebook. In this way, source language speech is embedded in the same semantic space as source language text, which can then be translated into target language text by an MT module. Experiments on multiple language pairs demonstrate that our zero-shot method significantly improves over the previous state of the art, and even performs on par with strong supervised ST baselines.
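To make the alignment mechanism concrete, below is a minimal sketch of a vector quantization layer of the kind the abstract describes: continuous speech or text features are snapped to their nearest entry in a shared codebook of "virtual tokens". This is an illustrative sketch, not the authors' implementation; the codebook size, feature dimension, the straight-through gradient trick, and all names (VectorQuantizer, num_codes, etc.) are assumptions.

```python
# Hypothetical sketch of a shared-codebook vector quantizer (not the paper's code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=512):  # sizes are illustrative assumptions
        super().__init__()
        # Shared codebook of "virtual tokens" used by both modalities.
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim) continuous speech or text representations.
        flat = x.reshape(-1, x.size(-1))                  # (B*T, dim)
        # Nearest codebook entry for each frame/token (Euclidean distance).
        dists = torch.cdist(flat, self.codebook.weight)   # (B*T, num_codes)
        indices = dists.argmin(dim=-1)                    # discrete virtual token ids
        quantized = self.codebook(indices).view_as(x)
        # Straight-through estimator: gradients flow past the non-differentiable argmin.
        quantized = x + (quantized - x).detach()
        return quantized, indices.view(x.shape[:-1])

# Usage: quantizing the outputs of a speech encoder and a text encoder through
# the SAME codebook places both modalities in one discrete semantic space;
# ASR pairs can then supervise mapping both sides to the same virtual tokens.
vq = VectorQuantizer()
speech_feats = torch.randn(2, 50, 512)  # dummy speech encoder output
text_feats = torch.randn(2, 12, 512)    # dummy text encoder output
q_speech, speech_tokens = vq(speech_feats)
q_text, text_tokens = vq(text_feats)
```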