Paper Title
Exploring and Distilling Cross-Modal Information for Image Captioning
Paper Authors
Paper Abstract
Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet it remains difficult for current methods to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. Based on the Transformer, to perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts fine-grained regions and attributes with reference to the aspect vector for word selection. Our Transformer-based model achieves a CIDEr score of 129.3 in offline evaluation on the COCO testing set, with remarkable efficiency in terms of accuracy, speed, and parameter budget.
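To make the global-and-local idea concrete, the following is a minimal sketch, assuming the approach can be approximated by (1) pooling region and attribute features into an aspect vector conditioned on the caption context, and (2) re-attending to the fine-grained regions and attributes with that aspect vector before word selection. All names (GlobalLocalAttention, d_model, the tensor shapes) are illustrative assumptions, not the authors' code or exact architecture.

```python
# Hypothetical sketch of a global-and-local cross-modal attention step.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(2 * d_model, d_model)
        self.scale = d_model ** -0.5

    def attend(self, query, keys, values):
        # Scaled dot-product attention: query (B, 1, d), keys/values (B, N, d).
        scores = torch.matmul(self.q_proj(query),
                              self.k_proj(keys).transpose(-2, -1)) * self.scale
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, self.v_proj(values))

    def forward(self, caption_ctx, regions, attributes):
        # caption_ctx: (B, 1, d)  decoder state summarizing the caption so far.
        # regions:     (B, Nr, d) visual region features (e.g. detector outputs).
        # attributes:  (B, Na, d) semantic attribute embeddings.
        source = torch.cat([regions, attributes], dim=1)
        # Global exploring: pool the whole cross-modal source into an aspect vector.
        aspect = self.attend(caption_ctx, source, source)       # (B, 1, d)
        # Local distilling: re-attend to fine-grained regions and attributes
        # using the aspect vector as the query, then fuse for word selection.
        distilled = self.attend(aspect, source, source)          # (B, 1, d)
        return self.out(torch.cat([aspect, distilled], dim=-1))  # (B, 1, d)


if __name__ == "__main__":
    # Usage with random tensors standing in for real features.
    B, Nr, Na, d = 2, 36, 10, 512
    layer = GlobalLocalAttention(d)
    out = layer(torch.randn(B, 1, d), torch.randn(B, Nr, d), torch.randn(B, Na, d))
    print(out.shape)  # torch.Size([2, 1, 512])
```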