Paper Title

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Paper Authors

Gu, Yu; Tinn, Robert; Cheng, Hao; Lucas, Michael; Usuyama, Naoto; Liu, Xiaodong; Naumann, Tristan; Gao, Jianfeng; Poon, Hoifung

Paper Abstract

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
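As a concrete illustration of how the released models might be used, the sketch below loads a PubMedBERT-style checkpoint through the Hugging Face transformers library and attaches a token-classification head with a plain BIO tag set, reflecting the abstract's finding that complex tagging schemes (such as BIOES) are unnecessary for NER with BERT models. This is a minimal sketch, not the authors' pipeline; the exact checkpoint name and the Chemical entity label set are assumptions for illustration.

```python
# Minimal usage sketch (NOT the authors' code): setting up biomedical NER
# fine-tuning on top of a domain-specific pretrained BERT.
# The checkpoint name is assumed to be the PubMedBERT model released on
# Hugging Face; the Chemical label set is a hypothetical example.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

# Plain BIO tags: the paper reports that more elaborate tagging schemes
# bring no benefit with BERT models.
labels = ["O", "B-Chemical", "I-Chemical"]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={tag: i for i, tag in enumerate(labels)},
)

# The token-classification head is freshly initialized, so outputs are
# meaningless until the model is fine-tuned on labeled NER data.
enc = tokenizer("Aspirin inhibits cyclooxygenase.", return_tensors="pt")
predictions = model(**enc).logits.argmax(dim=-1)  # shape: (1, seq_len)
```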
