Paper title
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
Paper authors
Paper abstract
Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.
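The "simple data selection strategies" mentioned in the abstract augment the task corpus with its nearest neighbors from a larger unlabeled pool. A minimal sketch of that idea, using toy bag-of-words vectors and cosine similarity in place of the learned embeddings the paper uses, with hypothetical mini-corpora for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; the paper uses learned sentence embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_select(task_texts, candidate_texts, k=2):
    # For each task example, keep its k most similar unlabeled candidates;
    # the union forms an augmented corpus for task-adaptive pretraining.
    cand_vecs = [embed(c) for c in candidate_texts]
    selected = set()
    for t in task_texts:
        tv = embed(t)
        ranked = sorted(range(len(candidate_texts)),
                        key=lambda i: cosine(tv, cand_vecs[i]),
                        reverse=True)
        selected.update(ranked[:k])
    return [candidate_texts[i] for i in sorted(selected)]

# Hypothetical example: a biomedical task sentence and a mixed-domain pool.
task = ["the drug reduced tumor growth"]
pool = ["stock prices fell sharply",
        "the tumor responded to the drug",
        "new drug trial shows reduced growth",
        "the movie was fantastic"]
augmented = knn_select(task, pool, k=2)
```

Here `knn_select` returns the two in-domain biomedical sentences, which would then be added to the pretraining corpus before fine-tuning on the labeled task data.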