Paper Title
Boost Image Captioning with Knowledge Reasoning
Paper Authors
Paper Abstract
Automatically generating a human-like description for a given image is a promising research topic in artificial intelligence that has attracted a great deal of attention recently. Most existing attention methods explore the mapping relationships between words in a sentence and regions in an image; such an unpredictable matching manner sometimes causes inharmonious alignments that may reduce the quality of the generated captions. In this paper, we strive to reason about more accurate and meaningful captions. We first propose word attention to improve the correctness of visual attention when generating sequential descriptions word by word. This special word attention emphasizes word importance when focusing on different regions of the input image, and makes full use of internal annotation knowledge to assist the computation of visual attention. Then, in order to reveal those incomprehensible intentions that cannot be expressed straightforwardly by machines, we introduce a new strategy that injects external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning. Finally, we validate our model on two freely available captioning benchmarks: the Microsoft COCO dataset and the Flickr30k dataset. The results demonstrate that our approach achieves state-of-the-art performance and outperforms many existing approaches.
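To make the word-attention idea concrete, the following is a minimal PyTorch sketch of one plausible formulation: attention weights over the previously generated words produce a word context, which then conditions a standard additive (Bahdanau-style) visual attention over image regions. The module name `WordVisualAttention`, all dimensions, and the additive fusion scheme are illustrative assumptions for exposition, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordVisualAttention(nn.Module):
    """Sketch: word attention over generated words guides visual attention
    over image regions. Dimensions and fusion are assumptions, not the
    authors' exact formulation."""

    def __init__(self, feat_dim, hid_dim, embed_dim, att_dim):
        super().__init__()
        # Word attention: scores each previously generated word embedding.
        self.word_score = nn.Linear(embed_dim, 1)
        # Visual attention: scores each region, conditioned on the decoder
        # state and the word-attention context (additive attention).
        self.v_proj = nn.Linear(feat_dim, att_dim)
        self.h_proj = nn.Linear(hid_dim, att_dim)
        self.w_proj = nn.Linear(embed_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, regions, hidden, prev_words):
        # regions:    (B, R, feat_dim) region features from a CNN encoder
        # hidden:     (B, hid_dim)     current decoder hidden state
        # prev_words: (B, T, embed_dim) embeddings of words generated so far

        # ---- word attention: weigh the words generated so far ----
        w_logits = self.word_score(prev_words).squeeze(-1)       # (B, T)
        w_alpha = F.softmax(w_logits, dim=-1)                    # (B, T)
        word_ctx = (w_alpha.unsqueeze(-1) * prev_words).sum(1)   # (B, embed_dim)

        # ---- visual attention, guided by the word context ----
        e = torch.tanh(self.v_proj(regions)
                       + self.h_proj(hidden).unsqueeze(1)
                       + self.w_proj(word_ctx).unsqueeze(1))     # (B, R, att_dim)
        v_logits = self.score(e).squeeze(-1)                     # (B, R)
        v_alpha = F.softmax(v_logits, dim=-1)                    # (B, R)
        visual_ctx = (v_alpha.unsqueeze(-1) * regions).sum(1)    # (B, feat_dim)
        return visual_ctx, v_alpha, w_alpha

# Usage with toy tensors (batch of 4, 36 regions, 7 words generated so far):
att = WordVisualAttention(feat_dim=2048, hid_dim=512, embed_dim=300, att_dim=512)
ctx, v_alpha, w_alpha = att(torch.randn(4, 36, 2048),
                            torch.randn(4, 512),
                            torch.randn(4, 7, 300))
```

The key design point the abstract implies is that the visual attention logits depend on `word_ctx` in addition to the decoder state, so the importance the model assigns to already-generated words (the internal annotation knowledge) can steer which image regions are attended next; the knowledge-graph injection step is a separate component and is not sketched here.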