Paper Title
Unsupervised Domain Clusters in Pretrained Language Models
Paper Authors
Paper Abstract
The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are often unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision -- suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall of sentence selection with respect to an oracle.
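To make the core idea concrete, the sketch below illustrates one way to probe for such unsupervised domain clusters; it is not the paper's exact pipeline. Sentences are embedded with a pretrained encoder from the `transformers` library by mean-pooling its last hidden states, and the resulting vectors are clustered with a Gaussian Mixture Model from scikit-learn. The model name `bert-base-uncased`, the pooling choice, the number of clusters, and the toy sentences are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): embed sentences with a pretrained
# encoder, mean-pool the last hidden states, and cluster the embeddings with a
# Gaussian Mixture Model to see whether "domain clusters" emerge unsupervised.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.mixture import GaussianMixture

MODEL_NAME = "bert-base-uncased"  # any pretrained encoder could be swapped in

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences, batch_size=32):
    """Return one vector per sentence: masked mean over token hidden states."""
    vecs = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state        # (batch, tokens, dim)
        mask = enc["attention_mask"].unsqueeze(-1)         # (batch, tokens, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)      # ignore padding tokens
        vecs.append(pooled)
    return torch.cat(vecs).numpy()

# Hypothetical unlabeled sentences from different (unknown) domains; a real
# experiment would use thousands of sentences per corpus.
sentences = [
    "The patient was administered 20 mg of the drug twice daily.",
    "The defendant appealed the ruling to a higher court.",
    "Press the power button for three seconds to reset the device.",
]

X = embed(sentences)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict(X))  # cluster id per sentence ~ an unsupervised "domain" label
```

Under this view, domain data selection reduces to scoring candidate sentences by their affinity to the cluster(s) occupied by a small in-domain seed set, rather than relying on predefined domain labels.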