Paper Title

Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition

Paper Authors

Yu, Wentao, Zeiler, Steffen, Kolossa, Dorothea

Paper Abstract


For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. However, there is still an ongoing debate regarding the best combination strategy for multi-modal information, which should allow for the translation of these gains to large-vocabulary recognition. While an integration at the level of state-posterior probabilities, using dynamic stream weighting, is almost universally helpful for small-vocabulary systems, in large-vocabulary speech recognition, the recognition accuracy remains difficult to improve. In the following, we specifically consider the large-vocabulary task of the LRS2 database, and we investigate a broad range of integration strategies, comparing early integration and end-to-end learning with many versions of hybrid recognition and dynamic stream weighting. One aspect, which is shown to provide much benefit here, is the use of dynamic stream reliability indicators, which allow for hybrid architectures to strongly profit from the inclusion of visual information whenever the audio channel is distorted even slightly.
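The abstract's central idea, integration at the level of state-posterior probabilities with dynamic stream weighting, can be sketched roughly as follows. This is a minimal illustration under assumed conventions (log-domain posteriors, a per-frame audio weight predicted from reliability indicators such as estimated SNR); the function name and interface are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def dynamic_stream_weighting(log_post_audio, log_post_video, weights):
    """Frame-wise weighted combination of per-stream log state posteriors.

    log_post_audio, log_post_video: (T, S) arrays of log posteriors
        over S HMM states for T frames, one array per modality.
    weights: (T,) array of audio stream weights lambda_t in [0, 1];
        in a dynamic-weighting system these would be predicted per frame
        from reliability indicators (e.g. estimated SNR of the audio).
    """
    lam = weights[:, None]
    # log-linear combination: lambda_t * log p_audio + (1 - lambda_t) * log p_video
    combined = lam * log_post_audio + (1.0 - lam) * log_post_video
    # renormalize so each frame is again a valid log-posterior distribution
    combined -= np.logaddexp.reduce(combined, axis=1, keepdims=True)
    return combined
```

A frame with `lambda_t = 1` falls back to the audio stream alone, while a heavily distorted audio frame can be driven toward the video stream; the combined posteriors would then feed the hybrid recognizer's decoder in place of the single-stream posteriors.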
