Paper Title
Analogical Reasoning for Visually Grounded Language Acquisition
Paper Authors
Paper Abstract
Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language knowledge, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI, by studying the task of Visually Grounded Language Acquisition (VLA). We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mappings and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNet), is trained on raw multimedia data (video frames and transcripts), and after observing a set of compositions such as "washing apple" or "cutting carrot", it can generalize and recognize new compositions in new video frames, such as "washing carrot" or "cutting apple". To this end, ARTNet refers to relevant instances in the training data and uses their visual features and captions to establish analogies with the query image. Then it chooses the suitable verb and noun to create a new composition that describes the new image best. Extensive experiments on an instructional video dataset demonstrate that the proposed method achieves significantly better generalization capability and recognition accuracy compared to state-of-the-art transformer models.
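To make the retrieve-and-compose idea in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of an analogical composition step: retrieve the training instances most visually similar to the query, pool the verb and noun evidence from their captions into prototypes, and pick the best-scoring verb/noun pair. The function name `analogical_compose`, the cosine-similarity retrieval, and the softmax-weighted prototypes are illustrative assumptions, not the paper's actual transformer-based reasoning module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def analogical_compose(query_feat: torch.Tensor,
                       memory_feats: torch.Tensor,
                       memory_verbs: torch.Tensor,
                       memory_nouns: torch.Tensor,
                       verb_embed: nn.Embedding,
                       noun_embed: nn.Embedding,
                       k: int = 5) -> tuple[int, int]:
    """Illustrative analogical composition (not the paper's exact model).

    query_feat:   (d,)   visual feature of the query frame
    memory_feats: (N, d) visual features of training instances
    memory_verbs: (N,)   verb token ids from the paired captions
    memory_nouns: (N,)   noun token ids from the paired captions
    """
    # 1. Retrieve the k training instances most similar to the query image.
    sims = F.cosine_similarity(memory_feats, query_feat.unsqueeze(0), dim=-1)
    topk = sims.topk(k).indices

    # 2. Aggregate verb/noun evidence from the retrieved analogies,
    #    weighted by visual similarity (softmax over the top-k scores).
    weights = F.softmax(sims[topk], dim=0)                              # (k,)
    verb_proto = (weights.unsqueeze(1) * verb_embed(memory_verbs[topk])).sum(0)
    noun_proto = (weights.unsqueeze(1) * noun_embed(memory_nouns[topk])).sum(0)

    # 3. Score the whole vocabulary against the prototypes and return
    #    the best verb and noun, which may form an unseen composition
    #    (e.g., verb from "washing apple" + noun from "cutting carrot").
    verb_id = (verb_embed.weight @ verb_proto).argmax().item()
    noun_id = (noun_embed.weight @ noun_proto).argmax().item()
    return verb_id, noun_id
```

Note the key property this sketch shares with the described method: the predicted verb and noun are selected independently from retrieved analogies, so the output composition ("washing carrot") need not appear anywhere in the training data.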