Paper Title
VU-BERT: A Unified Framework for Visual Dialog
Paper Authors
Paper Abstract
The visual dialog task attempts to train an agent to answer multi-turn questions about a given image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to employ modality-specific modules to model these interactions, which can be cumbersome. To fill this gap, we propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain visual embeddings, for the first time in visual dialog tasks, to simplify the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help in learning visual concepts, utterance dependencies, and the relationships between the two modalities. Finally, our VU-BERT achieves competitive performance (0.7287 NDCG score) on the VisDial v1.0 dataset.
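To make the patch-projection idea concrete, below is a minimal sketch of how an image can be turned into patch embeddings and concatenated with text token embeddings before a single transformer. It assumes a ViT-style linear patch projection; the image size, patch size, hidden size, and vocabulary size are illustrative placeholders, not VU-BERT's actual configuration.

```python
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """Split an image into fixed-size patches and linearly project each
    flattened patch to the transformer hidden size (ViT-style sketch)."""
    def __init__(self, image_size=224, patch_size=32, in_channels=3, hidden_size=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # images: (B, 3, H, W)
        patches = self.proj(images)                # (B, hidden, H/P, W/P)
        return patches.flatten(2).transpose(1, 2)  # (B, num_patches, hidden)

# Joint image-text input: concatenate patch embeddings with token embeddings
# (positional and segment embeddings omitted) and feed them to one transformer.
patch_embed = PatchProjection()
token_embed = nn.Embedding(30522, 768)             # vocabulary size is a placeholder
images = torch.randn(2, 3, 224, 224)
token_ids = torch.randint(0, 30522, (2, 40))       # dialog history + question tokens
joint_input = torch.cat([patch_embed(images), token_embed(token_ids)], dim=1)
print(joint_input.shape)                           # (2, 49 + 40, 768)
```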