Paper title
Evaluation of Embedding Models for Automatic Extraction and Classification of Acknowledged Entities in Scientific Documents
Paper authors
Abstract
Acknowledgments in scientific papers can offer insight into aspects of the scientific community such as reward systems, collaboration patterns, and hidden research trends. The aim of this paper is to evaluate the performance of different embedding models on the task of automatically extracting and classifying acknowledged entities from the acknowledgment text of scientific papers. We implemented a named entity recognition (NER) task and trained models for it using the Flair NLP framework. Training was conducted using three default Flair NER models with two differently sized corpora. The Flair Embeddings model trained on the larger training corpus showed the best accuracy of 0.77. Our model is able to recognize six entity types: funding agency, grant number, individuals, university, corporation, and miscellaneous. The model is more precise for some entity types than for others: individuals and grant numbers achieved very good F1-scores above 0.9. Most previous work on acknowledgment analysis relied on manual evaluation of data and was therefore limited in the amount of data it could process. This model can be applied to the comprehensive analysis of acknowledgment texts and may make a substantial contribution to the field of automated acknowledgment analysis.
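The per-entity-type F1-scores mentioned in the abstract can be illustrated with a minimal sketch. It assumes exact span matching, which may differ from the evaluation actually used in the paper. The label names (`FUND`, `GRNB`, `IND`, `UNI`, `COR`, `MISC`), the function `per_type_f1`, and the toy spans are hypothetical and are not taken from the paper's data.

```python
from collections import Counter

# Hypothetical label names for the six entity types named in the abstract.
LABELS = ["FUND", "GRNB", "IND", "UNI", "COR", "MISC"]


def per_type_f1(gold, predicted):
    """Return {label: (precision, recall, f1)} using exact span matching.

    gold and predicted are sets of (start, end, label) tuples; a prediction
    counts as correct only if its boundaries and label both match.
    """
    true_pos = Counter(label for (_, _, label) in gold & predicted)
    gold_count = Counter(label for (_, _, label) in gold)
    pred_count = Counter(label for (_, _, label) in predicted)

    scores = {}
    for label in LABELS:
        tp = true_pos[label]
        precision = tp / pred_count[label] if pred_count[label] else 0.0
        recall = tp / gold_count[label] if gold_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example with invented spans (not data from the paper).
gold = {(0, 2, "IND"), (5, 9, "FUND"), (10, 11, "GRNB"), (14, 17, "UNI")}
predicted = {(0, 2, "IND"), (5, 8, "FUND"), (10, 11, "GRNB"), (14, 17, "COR")}

for label, (p, r, f1) in per_type_f1(gold, predicted).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the individual and grant-number spans match exactly and score 1.0, while the funding-agency span has a boundary error and the university span is mislabeled as a corporation, so those types score 0.0. This shows how a model can be much more reliable for some entity types than for others.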