Paper Title

WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs

Authors

Hoang Thang Ta, Abu Bakar Siddiqur Rahman, Navonil Majumder, Amir Hussain, Lotfollah Najjar, Newton Howard, Soujanya Poria, Alexander Gelbukh

Abstract

As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset for generating short descriptions of Wikipedia articles, addressing the problem of text summarization. The dataset consists of over 80k English samples covering 6,987 topics. We set up a two-phase summarization method - description generation (Phase I) and candidate ranking (Phase II) - as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART show their superiority over other small-scale pre-trained models. By applying contrastive learning to the diverse input produced by beam search, the metric fusion-based ranking models significantly outperform the direct description generation models, by up to 22 ROUGE points on both the topic-exclusive and topic-independent splits. Furthermore, the Phase II descriptions are supported by human evaluation: over 45.33% were chosen against the gold descriptions, compared to 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities from the paragraphs, while the gold descriptions perform better on this task. The automatic generation of new descriptions reduces the human effort of creating them and enriches Wikidata-based knowledge graphs. Our paper has a practical impact on Wikipedia and Wikidata, since thousands of descriptions are missing. Finally, we expect WikiDes to be a useful dataset for related work on capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.
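The two-phase idea in the abstract - generate multiple candidate descriptions with beam search, then rank them and keep the best - can be sketched with a toy example. Note this is only an illustration, not the paper's code: the paper trains a metric fusion-based ranking model with contrastive learning, whereas here a simplified unigram-overlap ROUGE-1 score against the source paragraph stands in as the scoring function, and all names and sample strings are invented.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    # Unigram-overlap F1 (a simplified ROUGE-1; real evaluations use a library).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rank_candidates(candidates: list[str], source_paragraph: str) -> list[str]:
    # Phase II stand-in: score each beam-search candidate and sort best-first.
    # WikiDes trains a contrastive ranking model for this step; unigram overlap
    # with the source paragraph is only a toy proxy for that learned scorer.
    return sorted(candidates,
                  key=lambda c: rouge1_f1(c, source_paragraph),
                  reverse=True)

# Hypothetical source paragraph and beam-search outputs, for illustration only.
paragraph = ("WikiDes is a dataset built from Wikipedia articles and their "
             "Wikidata short descriptions for text summarization research.")
candidates = [
    "A dataset of Wikipedia articles for summarization",
    "An online encyclopedia",
    "A Wikipedia-based summarization dataset with Wikidata descriptions",
]
best = rank_candidates(candidates, paragraph)[0]
print(best)  # the candidate with the highest overlap score
```

In the paper's actual pipeline, the candidates would come from a fine-tuned T5 or BART model decoding with beam search (e.g. several returned sequences per input), and the ranker would be trained contrastively so that higher-quality candidates receive higher scores without access to the gold description at inference time.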
