Paper Title


CiteSum: Citation Text-guided Scientific Extreme Summarization and Domain Adaptation with Limited Supervision

Paper Authors

Yuning Mao, Ming Zhong, Jiawei Han

Paper Abstract


Scientific extreme summarization (TLDR) aims to form ultra-short summaries of scientific papers. Previous efforts on curating scientific TLDR datasets failed to scale up due to the heavy human annotation and domain expertise required. In this paper, we propose a simple yet effective approach to automatically extracting TLDR summaries for scientific papers from their citation texts. Based on the proposed approach, we create a new benchmark CiteSum without human annotation, which is around 30 times larger than the previous human-curated dataset SciTLDR. We conduct a comprehensive analysis of CiteSum, examining its data characteristics and establishing strong baselines. We further demonstrate the usefulness of CiteSum by adapting models pre-trained on CiteSum (named CITES) to new tasks and domains with limited supervision. For scientific extreme summarization, CITES outperforms most fully-supervised methods on SciTLDR without any fine-tuning and obtains state-of-the-art results with only 128 examples. For news extreme summarization, CITES achieves significant gains on XSum over its base model (not pre-trained on CiteSum), e.g., +7.2 ROUGE-1 zero-shot performance and state-of-the-art few-shot performance. For news headline generation, CITES performs the best among unsupervised and zero-shot methods on Gigaword. Our dataset and code can be found at https://github.com/morningmoni/CiteSum.
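The core idea of treating a citation sentence as a ready-made ultra-short summary of the cited paper can be illustrated with a minimal sketch. This is not the paper's actual data-construction pipeline (see the linked repository for that); it is a hypothetical example assuming simple regex-based cleanup of citation markers, with `clean_citation` being an illustrative helper name:

```python
import re

def clean_citation(sentence: str) -> str:
    """Strip citation markers from a citing sentence so the remainder
    can serve as a candidate ultra-short (TLDR) summary of the cited paper."""
    # Drop numeric citation markers such as [12] or [3, 7]
    text = re.sub(r"\[\d+(?:\s*,\s*\d+)*\]", "", sentence)
    # Drop parenthetical author-year markers such as (Mao et al., 2022)
    text = re.sub(r"\([^()]*\d{4}[a-z]?\)", "", text)
    # Collapse the whitespace left behind by the removed markers
    return re.sub(r"\s+", " ", text).strip()

s = "CiteSum (Mao et al., 2022) [12] extracts TLDR summaries from citation texts."
print(clean_citation(s))
# -> CiteSum extracts TLDR summaries from citation texts.
```

A real pipeline would additionally need to locate which sentences in a citing paper actually describe the cited work and filter out low-quality candidates; this sketch only shows the marker-stripping step.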
