Paper Title
C3-STISR: Scene Text Image Super-resolution with Triple Clues
Paper Authors
Paper Abstract
Scene text image super-resolution (STISR) has been regarded as an important pre-processing task for text recognition from low-resolution scene text images. Most recent approaches use the recognizer's feedback as clues to guide super-resolution. However, directly using the recognition clue has two problems: 1) Compatibility: it comes in the form of a probability distribution, which has an obvious modality gap with STISR, a pixel-level task; 2) Inaccuracy: it usually contains wrong information, which misleads the main task and degrades super-resolution performance. In this paper, we present a novel method, C3-STISR, that jointly exploits the recognizer's feedback, visual information, and linguistic information as clues to guide super-resolution. Here, the visual clue comes from images of the texts predicted by the recognizer, which are informative and more compatible with the STISR task, while the linguistic clue is generated by a pre-trained character-level language model, which is able to correct the predicted texts. We design effective extraction and fusion mechanisms for the triple cross-modal clues to generate comprehensive and unified guidance for super-resolution. Extensive experiments on TextZoom show that C3-STISR outperforms the SOTA methods in fidelity and recognition performance. Code is available at https://github.com/zhaominyiz/C3-STISR.
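To make the clue-fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of combining three cross-modal clues into one guidance tensor. It is not the paper's actual implementation (see the linked repository for that); it assumes each clue branch has already been projected to a common feature-map shape and merges them with learned per-pixel weights. The module and argument names are illustrative only.

```python
# Hypothetical sketch: fusing three clue feature maps into unified guidance.
# Names (ClueFusion, rec/vis/lng) are illustrative, not from the C3-STISR code.
import torch
import torch.nn as nn


class ClueFusion(nn.Module):
    """Fuse recognition, visual, and linguistic clue features with learned gates."""

    def __init__(self, dim: int):
        super().__init__()
        # Predict a per-pixel weight for each of the three clues.
        self.gate = nn.Sequential(
            nn.Conv2d(3 * dim, 3, kernel_size=1),
            nn.Softmax(dim=1),  # weights over the three clues sum to 1 per pixel
        )

    def forward(self, rec, vis, lng):
        # rec / vis / lng: (B, dim, H, W) feature maps from the three clue branches
        w = self.gate(torch.cat([rec, vis, lng], dim=1))  # (B, 3, H, W)
        return w[:, 0:1] * rec + w[:, 1:2] * vis + w[:, 2:3] * lng


# Usage example with dummy feature maps.
fusion = ClueFusion(dim=32)
guidance = fusion(torch.randn(2, 32, 16, 64),
                  torch.randn(2, 32, 16, 64),
                  torch.randn(2, 32, 16, 64))
print(guidance.shape)  # torch.Size([2, 32, 16, 64])
```

The soft per-pixel gating here is one plausible way to realize "comprehensive and unified guidance": it lets the network down-weight an inaccurate clue (e.g., a wrong recognition) locally rather than discarding it everywhere.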