Paper Title

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Authors

Hsu, Wei-Ning, Harwath, David, Song, Christopher, Glass, James

Abstract

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as a drop-in replacement for text.
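The pipeline described in the abstract can be illustrated with a minimal sketch: an image-to-unit captioner emits a sequence of discrete sub-word unit IDs, and a unit-to-speech synthesizer consumes those IDs in place of text. All names, shapes, and mappings below are illustrative assumptions, not the authors' actual models.

```python
# Sketch of the text-free image-to-speech pipeline, assuming a learned
# inventory of discrete sub-word units. The two stand-in functions only
# demonstrate the interface: discrete unit IDs act as "pseudo-text"
# connecting the captioning and synthesis modules.

UNIT_VOCAB_SIZE = 100  # assumed size of the learned discrete unit inventory

def image_to_units(image_features):
    """Stand-in captioner: map image features to discrete unit IDs."""
    # A real captioner would be a sequence model over visual features;
    # here we simply bucket each feature into the unit vocabulary.
    return [round(abs(f) * 1000) % UNIT_VOCAB_SIZE for f in image_features]

def units_to_speech(unit_ids, samples_per_unit=4):
    """Stand-in synthesizer: expand each unit ID into waveform samples."""
    waveform = []
    for u in unit_ids:
        # Each discrete unit maps to a short, fixed audio segment.
        waveform.extend([u / UNIT_VOCAB_SIZE] * samples_per_unit)
    return waveform

image_features = [0.12, -0.57, 0.33]   # hypothetical visual features
units = image_to_units(image_features)  # discrete "pseudo-text" caption
audio = units_to_speech(units)          # speech synthesized from the units
print(units, len(audio))
```

Because the synthesizer only ever sees unit IDs, any representation with the properties the paper identifies could be swapped in without touching the captioning side.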
