Paper Title

FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire

Paper Authors

Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan

Paper Abstract

Lipreading is an impressive technique whose accuracy has improved markedly in recent years. However, existing methods for lipreading mainly build on autoregressive (AR) models, which generate target tokens one by one and suffer from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. NAR lipreading is a challenging task with many difficulties: 1) the discrepancy in sequence length between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper, we introduce three methods to reduce the gap between FastLR and AR models: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I\&F) module to model the correspondence between source video frames and the output text sequence. 2) To tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with an extra CTC loss. We also add an auxiliary autoregressive decoder to help the feature extraction of the encoder. 3) To overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) scheme for I\&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art lipreading model, with slight absolute WER increases of 1.5\% and 5.5\% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method.
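The integrate-and-fire idea mentioned in the abstract can be illustrated with a minimal sketch: per-frame weights are accumulated, and each time the accumulator crosses a threshold a token-level embedding "fires" as a weighted sum of the frames in the current segment, so the number of outputs is determined by the weights rather than fixed in advance. This is an illustrative toy in pure Python, not the paper's implementation; the function name, the threshold value, and the scalar-weight interface are assumptions.

```python
def integrate_and_fire(frames, alphas, threshold=1.0):
    """Toy continuous integrate-and-fire.

    frames: list of feature vectors (lists of floats), one per video frame.
    alphas: list of per-frame scalar weights (e.g. from a sigmoid over
            encoder states). Returns the fired token-level embeddings.
    """
    fired = []                          # emitted token-level embeddings
    acc = 0.0                           # weight accumulator
    segment = [0.0] * len(frames[0])    # running weighted sum of features
    for frame, alpha in zip(frames, alphas):
        if acc + alpha < threshold:
            # Integrate: the whole frame's weight joins the current segment.
            acc += alpha
            segment = [s + alpha * f for s, f in zip(segment, frame)]
        else:
            # Fire: spend just enough weight to reach the threshold,
            # emit the segment embedding...
            spent = threshold - acc
            fired.append([s + spent * f for s, f in zip(segment, frame)])
            # ...and carry the leftover weight into the next segment.
            acc = alpha - spent
            segment = [acc * f for f in frame]
    return fired
```

For example, four frames with weight 0.6 each yield two fired embeddings, showing how the mechanism maps a longer frame sequence to a shorter token sequence without a separate length predictor.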
