Paper Title
Interview: A Large-Scale Open-Source Corpus of Media Dialog
Paper Authors
Abstract
Existing conversational datasets consist either of written proxies for dialog or small-scale transcriptions of natural speech. We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts. Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance on existing spoken dialog datasets, demonstrating its usefulness in modeling real-world conversations. 'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems. In fact, experiments on two dialog tasks show that leveraging such labels improves performance over strong speaker-agnostic baselines and enables models to generate more specific and inquisitive responses in interview-style conversations.