TransFusion: Transcribing Speech with Multinomial Diffusion

Paper title

TransFusion: Transcribing Speech with Multinomial Diffusion

Authors

Matthew Baas, Kevin Eloff, Herman Kamper

Abstract

Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts have shown similar benefits for applying diffusion to unconditional text synthesis. Denoising diffusion models attempt to iteratively refine a sampled noise signal until it resembles a coherent signal (such as an image or written sentence). In this work we aim to see whether the benefits of diffusion models can also be realized for speech recognition. To this end, we propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features. Specifically, we propose TransFusion: a transcribing diffusion model which iteratively denoises a random character sequence into coherent text corresponding to the transcript of a conditioning utterance. We demonstrate comparable performance to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark. To the best of our knowledge, we are the first to apply denoising diffusion to speech recognition. We also propose new techniques for effectively sampling and decoding multinomial diffusion models. These are required because traditional methods of sampling from acoustic models are not possible with our new discrete diffusion approach. Code and trained models are available: https://github.com/RF5/transfusion-asr
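
To make the "random character sequence" in the abstract concrete, here is a minimal sketch of the forward (noising) step commonly used in multinomial diffusion, where each character is kept with probability `alpha_bar` and otherwise resampled uniformly from the vocabulary. It is an illustration under stated assumptions, not the authors' code: the `ALPHABET` vocabulary, the `forward_noise` helper, and the example `alpha_bar` values are all hypothetical, and the learned denoiser conditioned on speech features is not shown.

```python
import numpy as np

# Hypothetical character vocabulary (an assumption; the paper's may differ).
ALPHABET = list("abcdefghijklmnopqrstuvwxyz '")
K = len(ALPHABET)


def forward_noise(text, alpha_bar, rng):
    """Corrupt `text` with multinomial noise.

    Each position is sampled from
        Cat(alpha_bar * onehot(x_0) + (1 - alpha_bar) / K),
    so alpha_bar = 1 keeps the text intact and alpha_bar = 0 yields
    uniformly random characters.
    """
    idx = np.array([ALPHABET.index(c) for c in text])
    onehot = np.eye(K)[idx]                           # shape (len(text), K)
    probs = alpha_bar * onehot + (1.0 - alpha_bar) / K
    noisy = [rng.choice(K, p=p) for p in probs]       # one draw per character
    return "".join(ALPHABET[i] for i in noisy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative noise levels only; a real schedule would be chosen per timestep.
    for alpha_bar in (1.0, 0.7, 0.3, 0.0):
        print(alpha_bar, repr(forward_noise("hello world", alpha_bar, rng)))
```

A transcribing model of the kind the abstract describes would run this process in reverse: start from a fully random sequence and repeatedly predict a cleaner one, given the audio features.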
