Paper Title

Topic Segmentation of Research Article Collections

Authors

Erion Çano, Benjamin Roth

Abstract

Collections of research article data harvested from the web have become common recently since they are important resources for experimenting on tasks such as named entity recognition, text summarization, or keyword generation. In fact, certain types of experiments require collections that are both large and topically structured, with records assigned to separate research disciplines. Unfortunately, the current collections of publicly available research articles are either small or heterogeneous and unstructured. In this work, we perform topic segmentation of a paper data collection that we crawled and produce a multitopic dataset of roughly seven million paper data records. We construct a taxonomy of topics extracted from the data records and then annotate each document with its corresponding topic from that taxonomy. As a result, it is possible to use this newly proposed dataset in two modalities: as a heterogeneous collection of documents from various disciplines or as a set of homogeneous collections, each from a single research topic.
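
The two usage modalities described in the abstract can be illustrated with a minimal sketch. The record layout here (the `title` and `topic` fields, the sample topic labels, and the helper `split_by_topic`) is a hypothetical assumption for illustration only, not the dataset's actual schema or the authors' code.

```python
from collections import defaultdict

# Hypothetical records: the field names ("title", "topic") and the topic
# labels are illustrative assumptions, not the dataset's real schema.
records = [
    {"title": "Paper A", "topic": "Computer Science"},
    {"title": "Paper B", "topic": "Medicine"},
    {"title": "Paper C", "topic": "Computer Science"},
]


def split_by_topic(records):
    """Group topic-annotated records into one collection per topic."""
    collections = defaultdict(list)
    for record in records:
        collections[record["topic"]].append(record)
    return dict(collections)


# Modality 1: use `records` as-is, a heterogeneous multi-topic collection.
# Modality 2: use the per-topic, homogeneous collections built below.
by_topic = split_by_topic(records)
```

Under these assumptions, `by_topic["Computer Science"]` would hold only the records annotated with that topic, while the original `records` list remains available as the mixed collection.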
