Paper Title

Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering

Paper Authors

Chengyang Fang, Gangyan Zeng, Yu Zhou, Daiqing Wu, Can Ma, Dayong Hu, Weiping Wang

Paper Abstract

Texts in scene images convey critical information for scene understanding and reasoning. The abilities to read and to reason over this text are essential for a model in the text-based visual question answering (TextVQA) process. However, current TextVQA models do not center on the text and suffer from several limitations: without semantic guidance in the answer prediction process, a model is easily dominated by language biases and optical character recognition (OCR) errors. In this paper, we propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT). Equipped with these two modules, the semantics-centered model can resist language biases and the accumulated errors from OCR. Extensive experiments on the TextVQA and ST-VQA datasets show the effectiveness of our model: SC-Net surpasses previous works by a noticeable margin and is more reasonable for the TextVQA task.
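
The abstract only names the two modules, so below is a minimal, hypothetical PyTorch sketch of the instance-level contrastive idea behind ICSP (not the authors' released code): a small head projects each OCR instance's features into a semantic embedding space, and an InfoNCE-style loss pulls each prediction toward the ground-truth word embedding (e.g., a FastText vector) of the same instance while pushing it away from the other instances in the batch. All names here (`ContrastiveSemanticHead`, `feat_dim`, `sem_dim`, the temperature value) are illustrative assumptions.

```python
# Hypothetical sketch of an instance-level contrastive semantic prediction
# head in the spirit of ICSP. Assumption: each OCR token instance has a
# feature vector and a ground-truth semantic (word) embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveSemanticHead(nn.Module):
    def __init__(self, feat_dim: int = 768, sem_dim: int = 300,
                 temperature: float = 0.07):
        super().__init__()
        # Two-layer projector from OCR features into the semantic space.
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, sem_dim),
            nn.ReLU(),
            nn.Linear(sem_dim, sem_dim),
        )
        self.temperature = temperature

    def forward(self, ocr_feats: torch.Tensor) -> torch.Tensor:
        # ocr_feats: (N, feat_dim) features of N OCR token instances.
        return F.normalize(self.proj(ocr_feats), dim=-1)

    def contrastive_loss(self, pred_sem: torch.Tensor,
                         gt_sem: torch.Tensor) -> torch.Tensor:
        # pred_sem, gt_sem: (N, sem_dim); row i of gt_sem is the word
        # embedding of the i-th instance, i.e., the only positive pair.
        gt_sem = F.normalize(gt_sem, dim=-1)
        logits = pred_sem @ gt_sem.t() / self.temperature  # (N, N) sims
        targets = torch.arange(pred_sem.size(0), device=pred_sem.device)
        return F.cross_entropy(logits, targets)

# Toy usage: random tensors stand in for OCR features and FastText-style
# ground-truth embeddings.
head = ContrastiveSemanticHead()
ocr_feats = torch.randn(8, 768)
gt_embeddings = torch.randn(8, 300)
loss = head.contrastive_loss(head(ocr_feats), gt_embeddings)
print(loss.item())
```

Treating the diagonal of the similarity matrix as the positives is what makes this "instance-level": each OCR token is matched only to its own semantic embedding, so the head is trained to recover word semantics even when the recognized string itself is corrupted by OCR errors.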
