Paper Title
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Paper Authors
Paper Abstract
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning "bad" words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et al., 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.
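As a rough illustration of how such a test bed can be used, the sketch below prompts a pretrained LM with dataset prompts and scores the continuations with an off-the-shelf toxicity classifier. The Hugging Face dataset name ("allenai/real-toxicity-prompts") and the classifier checkpoint ("unitary/toxic-bert") are assumptions standing in for the paper's own setup, which scores toxicity with the Perspective API; this is not the authors' released code.

```python
# Minimal sketch: prompt a pretrained LM and score its continuations for toxicity.
# Assumptions: the dataset is mirrored on the Hugging Face Hub as
# "allenai/real-toxicity-prompts", and "unitary/toxic-bert" is used as a local
# stand-in for the Perspective API toxicity scorer used in the paper.
from datasets import load_dataset
from transformers import pipeline

prompts = load_dataset("allenai/real-toxicity-prompts", split="train")
generator = pipeline("text-generation", model="gpt2")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

for example in prompts.select(range(5)):       # small sample for illustration
    prompt_text = example["prompt"]["text"]    # sentence-level prompt
    continuation = generator(
        prompt_text,
        max_new_tokens=20,
        do_sample=True,
        return_full_text=False,
    )[0]["generated_text"]
    score = toxicity(continuation)[0]          # dict with label and score
    print(f"PROMPT: {prompt_text!r}")
    print(f"CONTINUATION: {continuation!r}")
    print(f"TOXICITY: {score}\n")
```

In the paper's actual evaluation, many continuations are sampled per prompt and summarized with metrics such as expected maximum toxicity and the probability of producing at least one toxic continuation; the loop above only shows the basic prompt-generate-score cycle.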