Paper Title
Confidence-aware Non-repetitive Multimodal Transformers for TextCaps
Paper Authors
Paper Abstract
When describing an image, reading the text in the visual scene is crucial to understanding the key information. Recent work explores the TextCaps task, i.e. image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover it in the generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; (3) repetition of words in predicted captions. To this end, we propose Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of reading, reasoning, and generation modules, in which the Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated words in captions. Our model outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 81.0 to 93.0. Our source code is publicly available.
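The repetition mask described in the abstract can be illustrated with a minimal sketch: during decoding, OCR tokens that have already been emitted are suppressed in the next-step prediction scores so the caption does not repeat them. This is not the authors' released implementation; the function and tensor names (`apply_repetition_mask`, `scores`, `generated_ids`, `ocr_token_ids`) are illustrative assumptions.

```python
import torch

def apply_repetition_mask(scores: torch.Tensor,
                          generated_ids: torch.Tensor,
                          ocr_token_ids: set) -> torch.Tensor:
    """Mask previously generated OCR tokens in the next-step scores.

    scores:        (batch, vocab_size) raw prediction scores for the next word
    generated_ids: (batch, t) token ids emitted so far
    ocr_token_ids: ids belonging to the dynamic OCR copy vocabulary
    """
    masked = scores.clone()
    for b in range(generated_ids.size(0)):
        for tok in generated_ids[b].tolist():
            # Only suppress OCR tokens; ordinary words may legitimately repeat.
            if tok in ocr_token_ids:
                masked[b, tok] = float('-inf')
    return masked

# Usage sketch inside a greedy decoding loop (model.step is assumed):
# scores = model.step(...)                                        # (batch, vocab)
# scores = apply_repetition_mask(scores, generated_ids, ocr_token_ids)
# next_ids = scores.argmax(dim=-1)
```

Masking only the OCR copy vocabulary, rather than all previously emitted words, reflects the intuition that scene-text tokens (e.g. a brand name) rarely need to appear twice in one caption, while common function words do.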