Paper Title
VU-BERT: A Unified Framework for Visual Dialog
Paper Authors
Paper Abstract
The visual dialog task attempts to train an agent to answer multi-turn questions about a given image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to employ modality-specific modules to model these interactions, which can be cumbersome. To fill this gap, we propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain visual embeddings, for the first time in visual dialog tasks, to simplify the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help in learning visual concepts, utterance dependencies, and the relationships between the two modalities. Finally, our VU-BERT achieves competitive performance (0.7287 NDCG score) on the VisDial v1.0 dataset.
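To make the patch-projection idea concrete, below is a minimal sketch of how an image can be turned into patch embeddings and concatenated with text token embeddings before a single transformer. It assumes a ViT-style linear patch projection; the image size, patch size, hidden size, and vocabulary size are illustrative placeholders, not VU-BERT's actual configuration.

```python
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """Split an image into fixed-size patches and linearly project each
    flattened patch to the transformer hidden size (ViT-style sketch)."""
    def __init__(self, image_size=224, patch_size=32, in_channels=3, hidden_size=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # images: (B, 3, H, W)
        patches = self.proj(images)                # (B, hidden, H/P, W/P)
        return patches.flatten(2).transpose(1, 2)  # (B, num_patches, hidden)

# Joint image-text input: concatenate patch embeddings with token embeddings
# (positional and segment embeddings omitted) and feed them to one transformer.
patch_embed = PatchProjection()
token_embed = nn.Embedding(30522, 768)             # vocabulary size is a placeholder
images = torch.randn(2, 3, 224, 224)
token_ids = torch.randint(0, 30522, (2, 40))       # dialog history + question tokens
joint_input = torch.cat([patch_embed(images), token_embed(token_ids)], dim=1)
print(joint_input.shape)                           # (2, 49 + 40, 768)
```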