Paper Title
Multi-Stage Pre-training for Low-Resource Domain Adaptation
Paper Authors
Paper Abstract
Transfer learning techniques are particularly useful in NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pre-trained language model (LM) on in-domain text before fine-tuning to downstream tasks. We show that extending the vocabulary of the LM with domain-specific terms leads to further gains. For greater effect, we utilize structure in the unlabeled data to create auxiliary synthetic tasks, which helps the LM transfer to downstream tasks. We apply these approaches incrementally on a pre-trained RoBERTa-large LM and show considerable performance gain on three tasks in the IT domain: Extractive Reading Comprehension, Document Ranking, and Duplicate Question Detection.
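The abstract mentions extending the LM's vocabulary with domain-specific terms before continued pre-training. The snippet below is a minimal sketch of how such vocabulary extension could look with the Hugging Face transformers library; it is not the authors' implementation, and the example IT-domain terms are hypothetical placeholders.

```python
# Illustrative sketch: extend a RoBERTa tokenizer with domain-specific terms
# and resize the model's embedding matrix accordingly. The term list below is
# a made-up example, not the vocabulary used in the paper.
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large")

# Hypothetical IT-domain terms that the base vocabulary would otherwise
# split into many subword pieces.
domain_terms = ["hypervisor", "kubectl", "systemd", "nfs_mount"]

num_added = tokenizer.add_tokens(domain_terms)
# New rows in the embedding matrix are randomly initialized and are then
# learned during continued (in-domain) masked-LM pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} domain tokens; vocab size is now {len(tokenizer)}")
```

After this step, the extended model would typically be further pre-trained on in-domain text so the new embeddings acquire useful representations before fine-tuning on the downstream tasks.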