Paper Title

Metadata Might Make Language Models Better

Paper Authors

Kaspar Beelen, Daniel van Strien

Paper Abstract

This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT models on the enhanced input data, we provide a systematic evaluation of these models on a set of evaluation tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
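As a concrete illustration of the metadata-insertion idea in the abstract, the sketch below prepends temporal, political and geographical fields to each training example before masked-language-model fine-tuning. The abstract does not specify the exact input format the authors use, so the prefix layout, the field values and the helper name add_metadata_prefix are illustrative assumptions, not the paper's actual scheme.

```python
# A minimal sketch of one plausible metadata-insertion strategy: prepending
# metadata fields to the text before MLM fine-tuning. The prefix format and
# field names below are illustrative assumptions, not the paper's exact scheme.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

def add_metadata_prefix(text: str, year: str, politics: str, place: str) -> str:
    """Prepend temporal, political and geographical metadata so the model
    can condition on it alongside the article text."""
    return f"{year} {politics} {place} {tokenizer.sep_token} {text}"

example = add_metadata_prefix(
    "The electric telegraph has transformed the dissemination of news.",
    year="1855", politics="liberal", place="london",
)
inputs = tokenizer(example, return_tensors="pt", truncation=True)
# Standard MLM fine-tuning (e.g. Trainer + DataCollatorForLanguageModeling)
# would then randomly mask tokens in this combined sequence, including the
# metadata tokens themselves.
```

Because the metadata tokens sit in the same input stream as the article text, the standard random-masking objective will occasionally mask them too, which is what makes a metadata mask-filling evaluation of the kind described above possible.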
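Of the three evaluation tasks, pseudo-perplexity has a standard definition (Salazar et al., 2020): mask each token in turn and score how well the model restores it. A minimal sketch, assuming a Hugging Face DistilBERT checkpoint:

```python
# Pseudo-perplexity for a masked language model (Salazar et al., 2020):
# mask each token in turn, score the model's probability of restoring it,
# and exponentiate the average negative log-likelihood. Lower is better.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.eval()

def pseudo_perplexity(text: str) -> float:
    input_ids = tokenizer(text, return_tensors="pt").input_ids[0]
    nll = 0.0
    # Positions 0 and -1 are [CLS] and [SEP]; only content tokens are masked.
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nll -= log_probs[input_ids[i]].item()
    return math.exp(nll / (input_ids.size(0) - 2))

print(pseudo_perplexity("The electric telegraph has transformed the news."))
```

Each loop iteration requires one forward pass for a single masked position, so scoring a document costs one model call per token; this is the price of using an MLM, which has no left-to-right likelihood, as a text scorer.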
