Paper Title

Towards Generating Diverse Audio Captions via Adversarial Training

Authors

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

Abstract

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and to describe the clip from various aspects using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.
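The training scheme described in the abstract alternates between updating two discriminators (one judging a caption's naturalness, one judging its semantic match to the audio) and updating the generator from their combined feedback. The following is a minimal, purely illustrative sketch of that alternation; all class names, scoring rules, and the REINFORCE-style generator update are hypothetical stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch of the C-GAN training alternation from the abstract:
# one caption generator vs. two hybrid discriminators. All components here
# are toy stand-ins (assumed names and update rules, not the paper's code).
import random

random.seed(0)

class CaptionGenerator:
    """Stand-in for any standard encoder-decoder captioning model."""
    def __init__(self):
        self.temperature = 1.0  # toy parameter nudged by the policy update

    def generate(self, audio_clip):
        # A real model would decode a sentence conditioned on the audio;
        # here we just sample a variable-length toy caption.
        words = ["a", "dog", "barks", "loudly", "nearby"]
        k = max(2, int(len(words) * min(self.temperature, 1.0)))
        return " ".join(random.sample(words, k))

    def policy_update(self, reward):
        # REINFORCE-style nudge (illustrative): low reward -> explore more.
        self.temperature += 0.1 * (0.5 - reward)

class Discriminator:
    """Toy scorer returning a probability that a caption is 'real'."""
    def __init__(self, bias):
        self.bias = bias

    def score(self, caption, audio_clip=None):
        # Longer captions score slightly higher in this toy scorer.
        return min(1.0, self.bias + 0.05 * len(caption.split()))

    def train_step(self, real, fake, audio_clip=None):
        # A real discriminator would take a gradient step to push the
        # 'real' score up and the 'fake' score down; here we only score.
        return {"real": self.score(real, audio_clip),
                "fake": self.score(fake, audio_clip)}

generator = CaptionGenerator()
naturalness_d = Discriminator(bias=0.3)  # judges caption fluency alone
semantic_d = Discriminator(bias=0.2)     # judges audio-caption match

audio_clip = "clip_001"
reference = "a dog barks loudly"

for step in range(5):
    fake = generator.generate(audio_clip)
    # 1) discriminator updates on real vs. generated captions
    naturalness_d.train_step(reference, fake)
    semantic_d.train_step(reference, fake, audio_clip)
    # 2) generator update from the combined discriminator reward
    reward = 0.5 * (naturalness_d.score(fake)
                    + semantic_d.score(fake, audio_clip))
    generator.policy_update(reward)
```

The key structural point the sketch tries to convey is that the generator is rewarded jointly by both discriminators, so it is pushed toward captions that are simultaneously fluent and semantically grounded, rather than optimizing a single likelihood objective.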
