Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision

Paper Title

Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision

Authors

Xuan Wang, Xiangchen Song, Bangzheng Li, Yingjun Guan, Jiawei Han

Abstract

We created the CORD-NER dataset by running comprehensive named entity recognition (NER) on the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13). CORD-NER covers 75 fine-grained entity types: in addition to the common biomedical entity types (e.g., genes, chemicals and diseases), it covers many new entity types related explicitly to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses), which may benefit research on COVID-19-related viruses, spreading mechanisms, and potential vaccines. The CORD-NER annotation is a combination of four sources with different NER methods. Its quality surpasses that of SciSpacy, a fully supervised BioNER tool (over 10% higher F1 score on a sample set of documents). Moreover, CORD-NER supports incrementally adding new documents, as well as adding new entity types when needed, given dozens of seeds as input examples. We will continually update CORD-NER based on the incremental updates of the CORD-19 corpus and the improvement of our system.
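The abstract compares annotation quality by F1 score. As a reminder of what that metric measures, here is a minimal sketch of entity-level precision, recall, and F1 under exact span-and-type matching. The matching criterion, the `Span` representation, the `span_f1` helper, and the toy data are all assumptions made for illustration; the abstract does not describe the evaluation protocol actually used.

```python
from typing import Set, Tuple

# A span is (start_offset, end_offset, entity_type).
# This representation is an assumption, not taken from the paper.
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-type matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations, invented for illustration only.
gold = {(0, 10, "CORONAVIRUS"), (15, 22, "GENE_OR_GENOME"), (30, 41, "CHEMICAL")}
predicted = {(0, 10, "CORONAVIRUS"), (15, 22, "CHEMICAL"), (30, 41, "CHEMICAL")}

precision, recall, f1 = span_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the second predicted span has the right boundaries but the wrong type, so it counts as both a false positive and a false negative. A more lenient scheme (partial overlap, or boundary-only matching) would give different numbers.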
