Data Isotopes for Data Provenance in DNNs

Paper title

Data Isotopes for Data Provenance in DNNs

Authors

Emily Wenger, Xiuyu Li, Ben Y. Zhao, Vitaly Shmatikov

Abstract

Today, creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with little control over or knowledge of when their data is appropriated for model training. To empower users to counteract unwanted data use, we design, implement and evaluate a practical system that enables users to detect if their data was used to train a DNN model. We show how users can create special data points we call isotopes, which introduce "spurious features" into DNNs during training. With only query access to a trained model, and with no knowledge of the model training process or control over the data labels, a user can apply statistical hypothesis testing to detect if a model has learned the spurious features associated with their isotopes by training on the user's data. This effectively turns DNNs' vulnerability to memorization and spurious correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-service platforms and on larger models such as those trained on ImageNet, can use physical objects instead of digital marks, and remains generally robust against several adaptive countermeasures.
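
The abstract names statistical hypothesis testing as the detection step but does not say which test is used. As a rough illustration only, the sketch below assumes the user has queried the model on the same images with and without the isotope mark and recorded the probability assigned to the isotope's label each time. It then runs a one-sided paired sign-flip permutation test, which is a stand-in chosen here and not necessarily the authors' test. The function names, the `alpha` threshold, and the sample numbers are all hypothetical.

```python
import random


def paired_permutation_pvalue(marked_probs, clean_probs, n_resamples=10000, seed=0):
    """One-sided paired sign-flip test for mean(marked - clean) > 0.

    marked_probs[i] and clean_probs[i] are the model's probabilities for the
    isotope label on image i with and without the mark (hypothetical inputs).
    """
    if len(marked_probs) != len(clean_probs) or not marked_probs:
        raise ValueError("need two non-empty lists of equal length")

    diffs = [m - c for m, c in zip(marked_probs, clean_probs)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the mark has no effect, so the sign of
        # each paired difference is equally likely to be + or -.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if flipped >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def isotope_detected(marked_probs, clean_probs, alpha=0.05):
    """Return True if the shift toward the isotope label is significant."""
    return paired_permutation_pvalue(marked_probs, clean_probs) < alpha


if __name__ == "__main__":
    # Made-up query results for ten test images.
    marked = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51, 0.64, 0.57]
    clean = [0.11, 0.09, 0.15, 0.08, 0.12, 0.10, 0.14, 0.07, 0.13, 0.09]
    print(isotope_detected(marked, clean))
```

A paired design fits the query-only setting described in the abstract, because each marked image is compared against its own unmarked version and differences between images cancel out.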
