Paper Title


Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

Paper Authors

Tim Jansen, Yangling Tong, Victoria Zevallos, Pedro Ortiz Suarez

Paper Abstract


As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous practice. This, in turn, has introduced an important challenge for NLP practitioners, who are now confronted with the task of developing highly optimized models and pipelines for pre-processing large quantities of textual data, which implies effectively classifying and filtering multilingual, heterogeneous and noisy data at web scale. One of the main components of this pre-processing step for the pre-training corpora of large language models is the removal of adult and harmful content. In this paper we explore different methods for detecting adult and harmful content in multilingual heterogeneous web data. We first show how traditional methods for harmful content detection, which seemingly perform quite well on small and specialized datasets, quickly break down when confronted with heterogeneous noisy web data. We then resort to a perplexity-based approach, but with a twist: instead of using a so-called "clean" corpus to train a small language model and then using perplexity to select the documents with low perplexity, i.e., the documents that most resemble this "clean" corpus, we train solely on adult and harmful textual data and then select the documents having a perplexity value above a given threshold. This approach virtually clusters the documents into two distinct groups, which greatly facilitates the choice of the perplexity threshold and also allows us to obtain higher precision than with traditional classification methods for detecting adult and harmful content.
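The core idea in the abstract — train a small language model only on adult/harmful text, then keep the documents whose perplexity under that model is *above* a threshold (i.e., the documents least similar to the harmful corpus) — can be sketched with a toy character-bigram model. This is an illustrative assumption, not the paper's actual setup (the authors use a dedicated small LM at web scale); the corpora, threshold choice, and helper names below are all hypothetical stand-ins.

```python
import math
from collections import Counter

def train_char_bigram(corpus):
    """Train an add-one-smoothed character bigram model on a list of strings."""
    bigrams, unigrams = Counter(), Counter()
    for text in corpus:
        padded = "\x02" + text  # start-of-text marker
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    vocab_size = len(set(unigrams) | {b for (_, b) in bigrams})
    return bigrams, unigrams, vocab_size

def perplexity(text, model):
    """Per-character perplexity of `text` under the bigram model."""
    bigrams, unigrams, v = model
    padded = "\x02" + text
    log_prob, n = 0.0, 0
    for a, b in zip(padded, padded[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + v)  # Laplace smoothing
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical stand-ins for real adult/harmful training data.
harmful_train = ["xxx adult content xxx", "explicit adult material"]
model = train_char_bigram(harmful_train)

docs = ["more explicit adult content", "a scholarly article on astronomy"]
ppl = {d: perplexity(d, model) for d in docs}

# Documents scoring ABOVE the threshold look unlike the harmful training
# data and are kept; low-perplexity documents are flagged as harmful.
# Here the threshold is simply the mean of the two scores for illustration.
threshold = sum(ppl.values()) / len(ppl)
kept = [d for d in docs if ppl[d] > threshold]
```

The inversion relative to the usual "clean-corpus" perplexity filter is what matters: because the model is trained on harmful text, harmful-like documents cluster at low perplexity and everything else at high perplexity, making the threshold easier to place between the two groups.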
