Paper Title
Metadata Might Make Language Models Better
Paper Authors
Paper Abstract
This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT models on the enhanced input data, we provide a systematic evaluation of these models on a set of evaluation tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
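As a concrete illustration of the approach the abstract describes, the sketch below shows one way temporal, political and geographical metadata could be prepended to the input text before standard MLM fine-tuning with Hugging Face transformers. The prefix format and the field names (year, politics, place) are illustrative assumptions, not the paper's exact insertion strategy.

```python
# Minimal sketch: metadata-prefixed MLM fine-tuning of DistilBERT.
# The metadata prefix format below is a hypothetical example, not the
# paper's actual scheme.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

def add_metadata(example):
    # Prepend metadata as plain text before the article body;
    # field names are illustrative, not the paper's schema.
    prefix = f"{example['year']} {example['politics']} {example['place']} "
    return prefix + example["text"]

sample = {"year": "1857", "politics": "liberal", "place": "London",
          "text": "The debate in Parliament resumed yesterday."}
encoded = tokenizer(add_metadata(sample), truncation=True)

# Standard MLM objective: randomly mask 15% of tokens, metadata included.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
batch = collator([encoded])

loss = model(**batch).loss
loss.backward()  # one fine-tuning step; a real run loops over the corpus
```

Because the metadata tokens sit in the input like ordinary text, they are subject to the same random masking, which is what makes an evaluation task such as metadata mask-filling possible: the model can be asked to recover a masked year or place from the surrounding article.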