Paper title
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
Paper authors
Paper abstract
Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.
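The "simple data selection strategies" mentioned in the abstract augment the task corpus with its nearest neighbors from a larger unlabeled pool. A minimal sketch of that idea, using toy bag-of-words vectors and cosine similarity in place of the learned embeddings the paper uses, with hypothetical mini-corpora for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; the paper uses learned sentence embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_select(task_texts, candidate_texts, k=2):
    # For each task example, keep its k most similar unlabeled candidates;
    # the union forms an augmented corpus for task-adaptive pretraining.
    cand_vecs = [embed(c) for c in candidate_texts]
    selected = set()
    for t in task_texts:
        tv = embed(t)
        ranked = sorted(range(len(candidate_texts)),
                        key=lambda i: cosine(tv, cand_vecs[i]),
                        reverse=True)
        selected.update(ranked[:k])
    return [candidate_texts[i] for i in sorted(selected)]

# Hypothetical example: a biomedical task sentence and a mixed-domain pool.
task = ["the drug reduced tumor growth"]
pool = ["stock prices fell sharply",
        "the tumor responded to the drug",
        "new drug trial shows reduced growth",
        "the movie was fantastic"]
augmented = knn_select(task, pool, k=2)
```

Here `knn_select` returns the two in-domain biomedical sentences, which would then be added to the pretraining corpus before fine-tuning on the labeled task data.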