Paper Title
Multimodal Speech Recognition with Unstructured Audio Masking
Paper Authors
Paper Abstract
Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words are systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.
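To make the RandWordMask setup concrete, the sketch below illustrates one way such unstructured masking could be simulated during training: given word-level time alignments for an utterance, each word segment's acoustic frames are masked with some probability. All names and details here (rand_word_mask, mask_prob, zeroing the masked frames to simulate silence) are illustrative assumptions, not specifics taken from the paper.

```python
import random

import numpy as np


def rand_word_mask(features, alignments, mask_prob=0.3, rng=None):
    """Mask the frames of randomly chosen word segments.

    features   : (num_frames, feat_dim) array of acoustic features.
    alignments : list of (start_frame, end_frame) word segments.
    mask_prob  : probability of masking each word segment.

    Illustrative sketch only; the masking value and probability used
    in the paper may differ. Here, masked frames are zeroed out,
    simulating a silenced or corrupted word.
    """
    rng = rng or random.Random()
    masked = features.copy()
    for start, end in alignments:
        if rng.random() < mask_prob:
            # Any word segment can be masked, independent of its identity,
            # which is what makes the masking "unstructured".
            masked[start:end] = 0.0
    return masked


# Usage: mask random word segments of a 100-frame utterance with 3 words.
feats = np.random.randn(100, 40)
word_spans = [(0, 30), (30, 55), (55, 100)]
masked_feats = rand_word_mask(feats, word_spans, mask_prob=0.3)
```

The key contrast with prior work is visible in the loop: masking is applied to arbitrary word segments rather than to a fixed, predetermined set of words, so the model cannot rely on knowing in advance which words will be corrupted.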