MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization

Paper Title

MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization

Authors

Canwen Xu, Jiaxin Pei, Hongtao Wu, Yiyu Liu, Chenliang Li

Abstract

Recently, large-scale datasets have vastly facilitated the development in nearly all domains of Natural Language Processing. However, there is currently no cross-task dataset in NLP, which hinders the development of multi-task learning. We propose MATINF, the first jointly labeled large-scale dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions. Based on such rich information, MATINF is applicable for three major NLP tasks, including classification, question answering, and summarization. We benchmark existing methods and a novel multi-task baseline over MATINF to inspire further research. Our comprehensive comparison and experiments over MATINF and other datasets demonstrate the merits held by MATINF.
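
To make "jointly labeled" concrete, the following is a minimal sketch of how a single record could feed all three tasks. The abstract only states that each question-answer pair carries a human-labeled category and a user-generated question description; the field names, the helper functions, the input/target pairing chosen for each task, and the sample text are all assumptions made for illustration, not the dataset's actual schema or the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One jointly labeled item (hypothetical field names)."""
    question: str      # short question written by the user
    description: str   # longer user-generated question description
    answer: str        # answer paired with the question
    category: str      # human-labeled category


def classification_example(r: Record) -> tuple[str, str]:
    # Assumed pairing: question plus description as input, category as label.
    return (f"{r.question} {r.description}", r.category)


def qa_example(r: Record) -> tuple[str, str]:
    # Assumed pairing: question as input, answer as target.
    return (r.question, r.answer)


def summarization_example(r: Record) -> tuple[str, str]:
    # Assumed pairing: description as source text, question as its summary.
    return (r.description, r.question)


# Invented placeholder content, not taken from MATINF.
sample = Record(
    question="placeholder question",
    description="placeholder description with more detail",
    answer="placeholder answer",
    category="placeholder-category",
)

for build in (classification_example, qa_example, summarization_example):
    source, target = build(sample)
    print(f"{build.__name__}: {source!r} -> {target!r}")
```

Because every view is derived from the same record, a multi-task model can be trained on aligned examples rather than on three unrelated corpora, which is the gap the abstract says the dataset is meant to fill.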
