Paper title
SGRAM: Improving Scene Graph Parsing via Abstract Meaning Representation
Paper authors
Paper abstract
A scene graph is a structured semantic representation that models the content of an image or a text as a graph. Image-based scene graph generation has been studied actively until recently, whereas text-based scene graph generation has not. In this paper, we focus on the problem of parsing a scene graph from a textual description of a visual scene. The core idea is to use abstract meaning representation (AMR) instead of the dependency parsing mainly used in previous studies. AMR is a graph-based semantic formalism for natural language that abstracts the concepts of the words in a sentence, in contrast to dependency parsing, which considers dependency relationships among all words in a sentence. To this end, we design a simple yet effective two-stage scene graph parsing framework that utilizes AMR, SGRAM (Scene GRaph parsing via Abstract Meaning representation): 1) transforming a textual description of an image into an AMR graph (Text-to-AMR) and 2) encoding the AMR graph into a Transformer-based language model to generate a scene graph (AMR-to-SG). Experimental results show that the scene graphs generated by our framework outperform the dependency parsing-based model by 11.61% and the previous state-of-the-art model using a pre-trained Transformer language model by 3.78%. Furthermore, we apply SGRAM to the image retrieval task, one of the downstream tasks for scene graphs, and confirm the effectiveness of the scene graphs generated by our framework.
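The two-stage structure described in the abstract (Text-to-AMR, then AMR-to-SG) can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper both stages are learned models, whereas here each stage is replaced by a toy stand-in so that the data flow is visible. All names (`SceneGraph`, `toy_text_to_amr`, `toy_amr_to_sg`, `parse_scene_graph`) are hypothetical, the AMR string is written in the conventional PENMAN notation, and the choice of objects, attribute pairs, and relation triples as the scene graph contents is an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Set, Tuple


@dataclass
class SceneGraph:
    """Assumed container: objects, (object, attribute) pairs, (subject, predicate, object) triples."""
    objects: Set[str] = field(default_factory=set)
    attributes: Set[Tuple[str, str]] = field(default_factory=set)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)


def toy_text_to_amr(caption: str) -> str:
    """Stage 1 stand-in (hypothetical): a lookup table instead of a trained AMR parser."""
    lookup = {
        "a man rides a horse": "(r / ride-01 :ARG0 (m / man) :ARG1 (h / horse))",
    }
    return lookup[caption.lower().strip()]


def toy_amr_to_sg(amr: str) -> SceneGraph:
    """Stage 2 stand-in (hypothetical): a regex rule instead of a Transformer language model.

    Only handles a flat AMR of the form '(p / pred :ARG0 (x / subj) :ARG1 (y / obj))'.
    """
    pattern = r"\(\w+ / ([\w-]+) :ARG0 \(\w+ / ([\w-]+)\) :ARG1 \(\w+ / ([\w-]+)\)\)"
    match = re.fullmatch(pattern, amr)
    if match is None:
        raise ValueError("toy converter supports only a flat :ARG0/:ARG1 AMR")
    predicate, subject, obj = match.groups()
    # Drop the numeric sense suffix of the AMR concept, e.g. 'ride-01' -> 'ride'.
    predicate = re.sub(r"-\d+$", "", predicate)
    return SceneGraph(objects={subject, obj}, relations={(subject, predicate, obj)})


def parse_scene_graph(
    caption: str,
    text_to_amr: Callable[[str], str] = toy_text_to_amr,
    amr_to_sg: Callable[[str], SceneGraph] = toy_amr_to_sg,
) -> SceneGraph:
    """Compose the two stages: caption -> AMR graph -> scene graph."""
    return amr_to_sg(text_to_amr(caption))


graph = parse_scene_graph("A man rides a horse")
# graph.relations contains the single triple ("man", "ride", "horse")
```

Passing the two stages as callables keeps the composition independent of how each stage is realized, so the toy stand-ins could be swapped for trained models without changing `parse_scene_graph`.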