Title

Multimodal Semi-Supervised Learning for Text Recognition

Authors

Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman

Abstract

Until recently, the number of public real-world text images was insufficient for training scene text recognizers. Therefore, most modern training methods rely on synthetic data and operate in a fully supervised manner. Nevertheless, the amount of public real-world text images has increased significantly lately, including a great deal of unlabeled data. Leveraging these resources requires semi-supervised approaches; however, the few existing methods do not account for the vision-language multimodal structure and are therefore suboptimal for state-of-the-art multimodal architectures. To bridge this gap, we present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality training phase. Notably, our method avoids extra training stages and maintains the current three-stage multimodal training procedure. Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training. More specifically, we extend an existing visual representation learning algorithm and propose the first contrastive-based method for scene text recognition. After pretraining the language model on a text corpus, we fine-tune the entire network via a sequential, character-level consistency regularization between weakly and strongly augmented views of text images. In a novel setup, consistency is enforced on each modality separately. Extensive experiments validate that our method outperforms the current training schemes and achieves state-of-the-art results on multiple scene text recognition benchmarks.
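To make the two core ideas in the abstract concrete, below is a minimal, self-contained PyTorch-style sketch, not the authors' implementation: (1) a vision-pretraining loss that unifies a supervised per-character recognition loss with an InfoNCE-style contrastive term on unlabeled views, and (2) the sequential, character-level consistency regularization between weakly and strongly augmented views. The recognizer `model`, its hypothetical `model.encode` feature extractor, the augmentation pipeline, and the loss weights are all illustrative assumptions.

```python
# Illustrative sketch only; `model`, `model.encode`, the augmentations,
# and the hyperparameters (tau, lam) are hypothetical placeholders.
import torch
import torch.nn.functional as F


def unified_pretrain_loss(model, labeled_imgs, labels, view1, view2,
                          tau=0.1, lam=1.0):
    """Supervised cross-entropy on labeled images plus an InfoNCE-style
    contrastive term that pulls two augmented views of each unlabeled
    image together (a simplified stand-in for the paper's scheme)."""
    # Supervised branch: per-character logits (B, T, C) vs. labels (B, T).
    logits = model(labeled_imgs)
    sup = F.cross_entropy(logits.flatten(0, 1), labels.flatten())

    # Contrastive branch: pool sequence features (B, T, D) -> (B, D),
    # then treat matching views as positives within the batch.
    z1 = F.normalize(model.encode(view1).mean(dim=1), dim=-1)
    z2 = F.normalize(model.encode(view2).mean(dim=1), dim=-1)
    sim = z1 @ z2.t() / tau                       # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    con = F.cross_entropy(sim, targets)

    return sup + lam * con


def consistency_loss(model, weak_imgs, strong_imgs):
    """Character-level consistency: the weakly augmented view provides
    detached soft targets; the strongly augmented view is trained to
    match them at every sequence position."""
    with torch.no_grad():
        targets = model(weak_imgs).softmax(dim=-1)     # (B, T, C)
    log_probs = model(strong_imgs).log_softmax(dim=-1)
    # Soft cross-entropy per character, averaged over batch and sequence.
    return -(targets * log_probs).sum(dim=-1).mean()
```

In the paper this consistency is applied to each modality's output separately (vision and the fused vision-language branch); the sketch above shows the per-character mechanism for a single output head.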
