Paper Title
Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition
Paper Authors
Paper Abstract
Audio-visual information fusion improves speech recognition in complex acoustic scenarios, e.g., noisy environments. An effective audio-visual fusion strategy needs to be explored to handle audio-visual alignment and modality reliability. Different from previous end-to-end approaches, where audio-visual fusion is performed after each modality has been encoded, in this paper we propose to integrate an attentive fusion block into the encoding process. We show that the proposed audio-visual fusion in the encoder module can enrich audio-visual representations, as the relevance between the two modalities is leveraged. In line with the transformer-based architecture, we implement the embedded fusion block as a multi-head attention based audio-visual fusion with one-way or two-way interactions. The proposed method fully combines the two streams and weakens the over-reliance on the audio modality. Experiments on the LRS3-TED dataset demonstrate that, compared to the state-of-the-art approach, the proposed method increases the recognition rate by 0.55%, 4.51% and 4.61% on average under clean, seen-noise and unseen-noise conditions, respectively.
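To make the one-way interaction described above concrete, below is a minimal PyTorch sketch of a multi-head cross-attention fusion block embedded in the encoding path, in which the audio stream queries the visual stream. This is an illustrative assumption, not the authors' released implementation: the class name AttentiveFusionBlock, the feature dimensions, and the residual/normalization layout are hypothetical choices for demonstration.

```python
import torch
import torch.nn as nn


class AttentiveFusionBlock(nn.Module):
    """Hypothetical sketch of one-way attentive audio-visual fusion.

    Audio features act as queries; visual features act as keys/values,
    so visually informed context is injected during encoding rather than
    after each modality has been fully encoded.
    """

    def __init__(self, d_model: int = 256, num_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        # Cross-modal multi-head attention: query = audio, key/value = visual.
        self.cross_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_audio,  d_model)
        # visual: (batch, T_visual, d_model)
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        # Residual connection keeps the original audio representation while
        # adding visual context, which can weaken over-reliance on audio alone.
        return self.norm(audio + fused)


if __name__ == "__main__":
    block = AttentiveFusionBlock()
    a = torch.randn(2, 120, 256)   # e.g., frame-level acoustic features
    v = torch.randn(2, 30, 256)    # e.g., lip-region visual features
    print(block(a, v).shape)       # torch.Size([2, 120, 256])
```

A two-way interaction could be sketched by adding a symmetric block in which the visual stream queries the audio stream and the two fused outputs are combined; the paper's exact stacking of such blocks inside the transformer encoder may differ from this sketch.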