Paper Title
DART: Open-Domain Structured Data Record to Text Generation
Paper Authors
Paper Abstract
We present DART, an open-domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotation can be a costly process, especially when dealing with tables, which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure for extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merges heterogeneous sources from open-domain semantic parsing and dialogue-act-based meaning representation tasks by utilizing techniques such as tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimal post-editing. We present a systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.
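The idea of encoding a table's structure as semantic triples can be illustrated with a minimal sketch. It assumes that each column header has at most one parent header and that headers without a parent attach to the table title. The function name `extract_triples`, the `parent_of` mapping, and the toy table are hypothetical illustrations and are not taken from the DART codebase.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def extract_triples(
    title: str,
    parent_of: Dict[str, Optional[str]],
    row: Dict[str, str],
) -> List[Triple]:
    """Turn one table row into triples using a header dependency tree.

    Assumption (hypothetical): `parent_of` maps each header to its parent
    header, or to None when the header depends directly on the table title.
    """
    triples: List[Triple] = []
    for header, value in row.items():
        parent = parent_of.get(header)
        # The subject is the parent's cell value, or the title for top-level headers.
        subject = title if parent is None else row[parent]
        triples.append((subject, header, value))
    return triples


# Toy table with made-up values: "City" and "Wins" both describe "Team".
parent_of = {"Team": None, "City": "Team", "Wins": "Team"}
row = {"Team": "Team A", "City": "City X", "Wins": "3"}
triples = extract_triples("Example League", parent_of, row)
for triple in triples:
    print(triple)
```

Each header becomes a predicate, so the row above yields one triple linking the title to the team and two triples linking the team to its attributes.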