Thesis Title
Deep Neural Networks for Visual Reasoning
Thesis Author
Thesis Abstract
Visual perception and language understanding are fundamental components of human intelligence, enabling humans to understand and reason about objects and their interactions. For machines, this capacity to reason across the two modalities is crucial to building new human-robot collaborative systems. Recent advances in deep learning have produced sophisticated, but separate, representations of visual scenes and of language. Understanding the associations between the two modalities in a shared context for multimodal reasoning, however, remains a challenge. Focusing on the language and vision modalities, this thesis advances our understanding of how pivotal aspects of vision-and-language tasks can be exploited with neural networks to support reasoning. We derive these insights from a series of works that make a two-fold contribution: (i) effective mechanisms for selecting content and constructing temporal relations from dynamic visual scenes in response to a linguistic query, thereby preparing adequate knowledge for the reasoning process; and (ii) new frameworks for performing reasoning with neural networks by exploiting visual-linguistic associations, deduced either directly from data or guided by external priors.
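To make contribution (i) concrete, the following is a minimal sketch, not taken from the thesis, of what query-conditioned content selection could look like: frame features of a dynamic visual scene are weighted by their relevance to an encoded linguistic query via scaled dot-product attention. The function name select_content, the use of PyTorch, and all tensor dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_content(frame_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
    """Attend over frame features of a dynamic scene, conditioned on a linguistic query.

    frame_feats: (T, D) features for T frames; query_feat: (D,) encoded query.
    Returns a (D,) summary of the query-relevant visual content.
    """
    # Relevance of each frame to the query (scaled dot product).
    scores = frame_feats @ query_feat / frame_feats.shape[-1] ** 0.5  # (T,)
    weights = F.softmax(scores, dim=0)                                # (T,)
    # Weighted sum retains content from the most query-relevant frames.
    return weights @ frame_feats                                      # (D,)

# Illustrative usage: 30 frames with 512-d features and one 512-d query embedding.
frames = torch.randn(30, 512)
query = torch.randn(512)
context = select_content(frames, query)
print(context.shape)  # torch.Size([512])
```

Constructing temporal relations would build on such selected content, for example by relating attended summaries of different time segments; that step, and the reasoning frameworks of contribution (ii), are beyond the scope of this sketch.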