Paper Title

Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision

Authors

Xuan Wang, Xiangchen Song, Bangzheng Li, Yingjun Guan, Jiawei Han

Abstract

We created this CORD-NER dataset with comprehensive named entity recognition (NER) on the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13). This CORD-NER dataset covers 75 fine-grained entity types: In addition to the common biomedical entity types (e.g., genes, chemicals and diseases), it covers many new entity types related explicitly to the COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses), which may benefit research on COVID-19 related virus, spreading mechanisms, and potential vaccines. CORD-NER annotation is a combination of four sources with different NER methods. The quality of CORD-NER annotation surpasses SciSpacy (over 10% higher on the F1 score based on a sample set of documents), a fully supervised BioNER tool. Moreover, CORD-NER supports incrementally adding new documents as well as adding new entity types when needed by adding dozens of seeds as the input examples. We will constantly update CORD-NER based on the incremental updates of the CORD-19 corpus and the improvement of our system.
