Paper Title

RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling

Authors

Jun Quan, Shian Zhang, Qian Cao, Zizhong Li, Deyi Xiong

Abstract

In order to alleviate the shortage of multi-domain data and to capture discourse phenomena for task-oriented dialogue modeling, we propose RiSAWOZ, a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains, which is larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35%, respectively. Each dialogue is labeled with comprehensive dialogue annotations, including dialogue goal in the form of natural language description, domain, dialogue states and acts at both the user and system side. In addition to traditional dialogue annotations, we especially provide linguistic annotations on discourse phenomena, e.g., ellipsis and coreference, in dialogues, which are useful for dialogue coreference and ellipsis resolution tasks. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, statistics and analysis of the dataset. A series of benchmark models and results are reported, including natural language understanding (intent detection & slot filling), dialogue state tracking and dialogue context-to-text generation, as well as coreference and ellipsis resolution, which facilitate the baseline comparison for future research on this corpus.
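To make the annotation types listed above concrete, here is a minimal sketch of what a single annotated dialogue turn could look like. This is not the actual RiSAWOZ file format; all field names, slot names, and values are illustrative assumptions based only on the annotation categories the abstract mentions (dialogue goal, domain, dialogue state, user/system acts, and coreference/ellipsis annotations).

```python
# Hypothetical sketch of one annotated turn (NOT the real RiSAWOZ schema).
# Field names and values are assumptions for illustration only.
turn = {
    # Dialogue goal as a natural-language description, per the abstract.
    "dialogue_goal": "The user wants to book a table at a Sichuan restaurant for two people.",
    "domain": "restaurant",
    "user_utterance": "Book it for two people tonight.",
    # User-side dialogue act annotation (intent + slot filling).
    "user_acts": [{"act": "inform", "slot": "people", "value": "2"}],
    # Dialogue state: accumulated slot-value pairs per domain.
    "dialogue_state": {"restaurant": {"cuisine": "Sichuan", "people": "2"}},
    "system_utterance": "Your table for two is booked.",
    # System-side dialogue act annotation.
    "system_acts": [{"act": "confirm", "slot": "people", "value": "2"}],
    # Discourse-level annotation: "it" corefers with an earlier mention.
    "coreference": [{"mention": "it", "antecedent": "the Sichuan restaurant"}],
}

def tracked_slots(state: dict, domain: str) -> dict:
    """Return the slot-value pairs tracked for a given domain (empty if absent)."""
    return state.get(domain, {})

print(tracked_slots(turn["dialogue_state"], "restaurant"))
```

A dialogue state tracker would be evaluated on reproducing the `dialogue_state` field turn by turn, while the coreference/ellipsis annotations support the discourse-resolution tasks the abstract describes.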
