Paper Title

Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Paper Authors

Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng

Paper Abstract

Audio-visual speech recognition (AVSR) has gained remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as audio is much easier to recognize than video in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which an agent dynamically harmonizes modality-invariant and modality-specific representations during autoregressive decoding. We customize a reward function directly related to the task-specific metric (i.e., word error rate), which encourages MSRL to explore the optimal integration strategy effectively. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance under both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noise.
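To make the mechanism described above concrete, below is a minimal, illustrative PyTorch sketch of the kind of per-step integration the abstract outlines: a small policy network decides, at each autoregressive decoding step, whether to follow the modality-invariant (fused audio-visual) token distribution or the visual modality-specific one, and is trained with a REINFORCE-style policy gradient whose reward is tied to word error rate. The names (MixingPolicy, decode_step), network sizes, Bernoulli action space, and reward shaping are all assumptions made for illustration, not the paper's exact design.

```python
# Illustrative sketch only: per-step mixing of a modality-invariant and a
# modality-specific token distribution, trained with a WER-based reward.
import torch
import torch.nn as nn


class MixingPolicy(nn.Module):
    """Maps the decoder hidden state to the probability of trusting the fused stream."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(h)).squeeze(-1)  # (batch,)


def decode_step(policy, h, logp_fused, logp_visual):
    """One autoregressive step: sample which stream to follow and mix accordingly."""
    p_fused = policy(h)                                # prob. of picking the fused stream
    dist = torch.distributions.Bernoulli(p_fused)
    action = dist.sample()                             # 1 -> fused, 0 -> visual-specific
    log_prob = dist.log_prob(action)                   # needed for the REINFORCE update
    a = action.unsqueeze(-1)                           # broadcast over the vocabulary
    logp_mixed = a * logp_fused + (1.0 - a) * logp_visual
    return logp_mixed, log_prob


def reinforce_loss(step_log_probs, reward, baseline=0.0):
    """Policy-gradient loss; `reward` would be, e.g., the negated WER of the hypothesis."""
    advantage = reward - baseline
    return -(advantage * torch.stack(step_log_probs).sum(dim=0)).mean()


# Toy usage with random tensors standing in for real model outputs.
if __name__ == "__main__":
    batch, hidden, vocab, steps = 2, 64, 100, 5
    policy = MixingPolicy(hidden)
    log_probs = []
    for _ in range(steps):
        h = torch.randn(batch, hidden)
        lp_fused = torch.log_softmax(torch.randn(batch, vocab), dim=-1)
        lp_visual = torch.log_softmax(torch.randn(batch, vocab), dim=-1)
        _, lp = decode_step(policy, h, lp_fused, lp_visual)
        log_probs.append(lp)
    reward = torch.tensor([0.3, -0.1])  # e.g., -(WER) per utterance (placeholder values)
    loss = reinforce_loss(log_probs, reward)
    loss.backward()
```

Sampling a discrete action keeps the REINFORCE log-probability well defined; a soft interpolation weight would instead require a continuous action distribution or a straight-through estimator.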
