Paper Title
DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset
Paper Authors
Paper Abstract
As sharing images in instant messaging is a crucial factor, there has been active research on learning image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this paper, we propose an automated pipeline for constructing a multi-modal dialogue dataset that ensures both dialogue quality and image diversity with minimal human effort. In our pipeline, to guarantee coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments - specifically, the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency among the multiple images aligned to each utterance. Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in terms of quality and diversity in human evaluation. Our comprehensive experiments highlight that multi-modal dialogue models trained on our dataset achieve significantly better generalization performance on unseen dialogue datasets. We make our source code and dataset publicly available.
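As a rough illustration of the CLIP-similarity filtering step described in the abstract, the sketch below scores candidate images against a GPT-4-inferred image description and keeps only those above a similarity threshold. This is not the authors' released implementation: the checkpoint ("openai/clip-vit-base-patch32"), the 0.25 threshold, and the filter_images helper are illustrative assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper does not specify which CLIP variant is used here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_images(image_description: str, image_paths: list[str], threshold: float = 0.25) -> list[str]:
    """Keep images whose CLIP similarity to the description exceeds a threshold (hypothetical value)."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[image_description], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity between the description embedding and each candidate image embedding.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(-1)
    return [p for p, s in zip(image_paths, sims.tolist()) if s > threshold]

In this sketch, only images coherent with the utterance-level description survive, which is one plausible way to realize the consistency constraint between an utterance and its multiple aligned images.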