Paper Title
Linear Adversarial Concept Erasure
Paper Authors
Paper Abstract
Modern neural models trained on textual data rely on pre-trained representations that emerge without direct supervision. As these representations are increasingly being used in real-world applications, the inability to control their content becomes an increasingly important problem. We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept, in order to prevent linear predictors from recovering the concept. We model this problem as a constrained, linear maximin game, and show that existing solutions are generally not optimal for this task. We derive a closed-form solution for certain objectives, and propose a convex relaxation, R-LACE, that works well for others. When evaluated in the context of binary gender removal, the method recovers a low-dimensional subspace whose removal mitigates bias by intrinsic and extrinsic evaluation. We show that the method is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability.
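To make the erase-then-probe setting concrete, the sketch below removes a rank-k subspace from synthetic embeddings with an orthogonal projection and then checks whether a linear probe can still recover a binary concept. It is a minimal illustration of the problem setup only: the erased subspace is chosen with a simple class-mean-difference heuristic rather than the paper's maximin game or its convex relaxation, and the data, dimensions, and helper names (e.g. probe_accuracy) are assumptions made for the example.

```python
# Minimal sketch of linear concept erasure on synthetic data.
# NOT the paper's solver: the subspace is the class-mean-difference direction,
# used only to illustrate "project out a subspace, then re-probe".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d, k = 4000, 50, 1  # samples, embedding dimension, rank of the erased subspace

# Synthetic "embeddings": isotropic noise plus one hidden concept direction.
z = rng.integers(0, 2, size=n)                 # binary concept label (e.g. gender)
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
X = rng.normal(size=(n, d)) + np.outer(2 * z - 1, concept_dir)

X_tr, X_te, z_tr, z_te = train_test_split(X, z, test_size=0.3, random_state=0)

def probe_accuracy(X_tr, z_tr, X_te, z_te):
    """Test accuracy of a freshly trained linear probe recovering the concept."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr, z_tr)
    return clf.score(X_te, z_te)

# Candidate rank-k subspace to erase: the normalized class-mean difference.
mu_diff = X_tr[z_tr == 1].mean(axis=0) - X_tr[z_tr == 0].mean(axis=0)
B = (mu_diff / np.linalg.norm(mu_diff)).reshape(d, k)  # d x k orthonormal basis
P = np.eye(d) - B @ B.T                                 # projection removing span(B)

print("probe accuracy before erasure:", probe_accuracy(X_tr, z_tr, X_te, z_te))
print("probe accuracy after erasure: ", probe_accuracy(X_tr @ P, z_tr, X_te @ P, z_te))
```

On this synthetic data the probe drops to near-chance accuracy after the projection; the paper's contribution is choosing the erased subspace adversarially, so that this drop holds against the best linear predictor rather than against a single heuristic direction.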