Paper Title
Metadata Might Make Language Models Better
Paper Authors
Paper Abstract
This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT models on the enhanced input data, we provide a systematic evaluation of these models on a set of evaluation tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
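As a concrete illustration of the approach the abstract describes, the sketch below shows one way temporal, political and geographical metadata could be prepended to the input text before standard MLM fine-tuning with Hugging Face transformers. The prefix format and the field names (year, politics, place) are illustrative assumptions, not the paper's exact insertion strategy.

```python
# Minimal sketch: metadata-prefixed MLM fine-tuning of DistilBERT.
# The metadata prefix format below is a hypothetical example, not the
# paper's actual scheme.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

def add_metadata(example):
    # Prepend metadata as plain text before the article body;
    # field names are illustrative, not the paper's schema.
    prefix = f"{example['year']} {example['politics']} {example['place']} "
    return prefix + example["text"]

sample = {"year": "1857", "politics": "liberal", "place": "London",
          "text": "The debate in Parliament resumed yesterday."}
encoded = tokenizer(add_metadata(sample), truncation=True)

# Standard MLM objective: randomly mask 15% of tokens, metadata included.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
batch = collator([encoded])

loss = model(**batch).loss
loss.backward()  # one fine-tuning step; a real run loops over the corpus
```

Because the metadata tokens sit in the input like ordinary text, they are subject to the same random masking, which is what makes an evaluation task such as metadata mask-filling possible: the model can be asked to recover a masked year or place from the surrounding article.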