Paper Title
Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe
Paper Authors
Paper Abstract
Privacy concerns have attracted increasing attention in data-driven products due to the tendency of machine learning models to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), provides a promising path to mitigating these privacy concerns, but previous approaches in this direction have typically failed to produce synthetic data of high quality. In this work, we show that a simple and practical recipe in the text domain is effective: simply fine-tuning a pretrained generative language model with DP enables the model to generate useful synthetic text with strong privacy protection. Through extensive empirical analyses on both benchmark and private customer data, we demonstrate that our method produces synthetic text that is competitive in terms of utility with its non-private counterpart, while providing strong protection against potential privacy leakage.
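To make the recipe concrete, below is a minimal sketch of how DP fine-tuning followed by sampling could look in practice, assuming GPT-2 as the pretrained generative language model and Opacus as the DP-SGD implementation. The model choice, the toy corpus, and all hyperparameters (target_epsilon, target_delta, max_grad_norm, learning rate) are illustrative assumptions rather than the paper's exact setup, and depending on the Opacus version, some of the model's layer types may require Opacus's functorch-based per-sample gradient fallback or a module replacement.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from opacus import PrivacyEngine

# Load a pretrained generative LM (GPT-2 here as an illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical private corpus; in practice this is the sensitive dataset
# whose synthetic version we want to release.
texts = ["example sensitive record one", "example sensitive record two"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=64,
                return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"])
loader = DataLoader(dataset, batch_size=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Attach DP-SGD: per-example gradient clipping plus Gaussian noise,
# calibrated so that training satisfies a target (epsilon, delta) budget.
privacy_engine = PrivacyEngine()
dp_model, optimizer, loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=4.0,   # illustrative privacy budget
    target_delta=1e-5,
    epochs=3,
    max_grad_norm=1.0,
)

dp_model.train()
for _ in range(3):
    for input_ids, attention_mask in loader:
        optimizer.zero_grad()
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding in the LM loss
        out = dp_model(input_ids=input_ids, attention_mask=attention_mask,
                       labels=labels)
        out.loss.backward()
        optimizer.step()

# Sample synthetic text from the DP fine-tuned model; dp_model shares its
# parameters with the original module, so we can generate from `model`.
model.eval()
prompt = torch.tensor([[tokenizer.bos_token_id]])
sample = model.generate(prompt, max_new_tokens=50, do_sample=True, top_p=0.9,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```

Because DP is closed under post-processing, text sampled from the DP fine-tuned model carries the same (epsilon, delta) guarantee as the training run, so only the generated samples, not the private corpus, need to be released downstream.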