Paper Title

PreSTU: Pre-Training for Scene-Text Understanding

Authors

Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

Abstract

The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
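The abstract describes OCR-aware pre-training objectives in which the model must recognize scene text from an image and generate it. A minimal sketch of how such a pre-training example might be constructed is shown below; the helper name `build_ocr_example`, the reading-order heuristic, and the prefix/target split are illustrative assumptions, not the authors' exact recipe.

```python
# Hypothetical sketch of building one OCR-aware pre-training example.
# Assumption: an off-the-shelf OCR system returns (word, x, y) tuples;
# the model is conditioned on a prefix of the scene text and trained
# to generate the remainder (a split-style objective).

def build_ocr_example(ocr_boxes, split_ratio=0.5):
    """Sort OCR words into an approximate reading order
    (top-to-bottom, then left-to-right), then split them:
    the prefix conditions the encoder, the rest is the
    decoder's generation target."""
    ordered = [word for word, x, y in
               sorted(ocr_boxes, key=lambda b: (b[2], b[1]))]
    k = int(len(ordered) * split_ratio)
    prefix = " ".join(ordered[:k])   # appended to the model input
    target = " ".join(ordered[k:])   # text the decoder must produce
    return prefix, target

# Toy OCR output for a storefront image.
boxes = [("SALE", 120, 10), ("OPEN", 10, 10), ("today", 15, 60)]
prefix, target = build_ocr_example(boxes)
```

In this toy case the words sort to "OPEN SALE today", so the prefix is `"OPEN"` and the target is `"SALE today"`; setting `split_ratio=0` recovers a pure text-recognition objective where the whole scene text must be generated.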
