Paper Title

Attacking Neural Text Detectors

Paper Authors

Max Wolff, Stuart Wolff

Paper Abstract

Machine learning based language models have recently made significant progress, which introduces a danger to spread misinformation. To combat this potential danger, several methods have been proposed for detecting text written by these language models. This paper presents two classes of black-box attacks on these detectors, one which randomly replaces characters with homoglyphs, and the other a simple scheme to purposefully misspell words. The homoglyph and misspelling attacks decrease a popular neural text detector's recall on neural text from 97.44% to 0.26% and 22.68%, respectively. Results also indicate that the attacks are transferable to other neural text detectors.
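
As a rough illustration of the two attack classes the abstract describes, the following Python sketch applies random homoglyph substitution and a simple word-misspelling perturbation to input text. It is a minimal sketch, not the paper's implementation: the homoglyph table, the replacement rate, and the adjacent-letter-swap misspelling rule are illustrative assumptions standing in for whatever character set and misspelling scheme the authors actually used.

import random

# Illustrative homoglyph map: Latin letters paired with visually similar
# Cyrillic characters. The paper's exact character set is not given here.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "x": "\u0445",  # Cyrillic small ha
}

def homoglyph_attack(text, rate=0.1, seed=0):
    """Randomly replace a fraction of mappable characters with homoglyphs."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)

def misspelling_attack(text, rate=0.1, seed=0):
    """Purposefully misspell words; here, by swapping two adjacent
    interior letters (an assumed stand-in for the paper's scheme)."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(w) - 2)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

if __name__ == "__main__":
    sample = "Machine generated text can be hard to detect."
    print(homoglyph_attack(sample, rate=0.5))
    print(misspelling_attack(sample, rate=0.5))

The attacked text looks unchanged to a human reader but tokenizes differently, which is why, per the abstract, such perturbations can sharply reduce a neural text detector's recall.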
