Paper Title

Fine-Grained Grounding for Multimodal Speech Recognition

Authors

Srinivasan, Tejas, Sanabria, Ramon, Metze, Florian, Elliott, Desmond

Abstract

Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.
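The abstract contrasts a single global visual feature with fine-grained features from automatic object proposals that the model can attend over. The paper does not give implementation details here, so the following is only a minimal illustrative sketch (in NumPy, with hypothetical function names) of that contrast: soft attention over per-proposal features, conditioned on a decoder state, versus a global baseline that pools all proposals into one vector and discards localization.

```python
import numpy as np

def attend_over_proposals(decoder_state, proposal_feats):
    """Soft attention over object-proposal features (illustrative sketch).

    decoder_state:  (d,)  current ASR decoder hidden state
    proposal_feats: (k, d) features for k automatic object proposals
    Returns a (d,) context vector that localizes relevant image regions.
    """
    scores = proposal_feats @ decoder_state          # (k,) relevance scores
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over proposals
    return weights @ proposal_feats                  # attention-weighted sum

def global_feature(proposal_feats):
    """Global-feature baseline: pool all proposals into one vector,
    losing the ability to localize individual regions."""
    return proposal_feats.mean(axis=0)
```

When the decoder state is strongly aligned with one proposal (e.g., the region showing a masked entity), the attended context vector stays close to that proposal's feature, whereas the pooled global vector blurs it together with every other region; this is the localization ability the abstract credits for the improvements.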
