Paper Title

Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

Paper Authors

Jinkun Lin, Anqi Zhang, Mathias Lecuyer, Jinyang Li, Aurojit Panda, Siddhartha Sen

Paper Abstract

We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel's behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions ($k \ll N$ datapoints have large AME), our estimator requires only $O(k\log N)$ randomized submodel trainings, improving upon the best prior Shapley value estimators.
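To make the pipeline described above concrete, here is a minimal, hypothetical sketch in Python. It is not the authors' implementation: the toy dataset, the logistic-regression submodel, the fixed per-point inclusion probability `p`, the number of trainings `T`, the use of raw 0/1 inclusion indicators as regression features, the choice of "behavior" (predicted probability of the true class on one test point), and the LASSO penalty are all illustrative assumptions.

```python
# Hypothetical sketch of the abstract's pipeline: sample training subsets,
# train a submodel per subset, measure its behavior on one test prediction,
# then fit a sparse (LASSO) regression of behavior on inclusion indicators.
# All modeling choices below are assumptions for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)

# Toy training data plus one held-out test point whose prediction we explain.
X, y = make_classification(n_samples=201, n_features=10, random_state=0)
X_train, y_train = X[:200], y[:200]
x_test, y_test = X[200], y[200]

N = len(X_train)   # number of training points
T = 500            # number of randomized submodel trainings (assumed)
p = 0.5            # per-point inclusion probability (assumed sampling distribution)

inclusion = np.zeros((T, N))   # which points each submodel was trained on
utility = np.zeros(T)          # measured behavior of each submodel

for t in range(T):
    mask = rng.random(N) < p
    # Resample degenerate subsets that contain fewer than two classes.
    while len(np.unique(y_train[mask])) < 2:
        mask = rng.random(N) < p
    submodel = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
    inclusion[t] = mask
    # Behavior: predicted probability of the correct class for the test point.
    utility[t] = submodel.predict_proba(x_test.reshape(1, -1))[0, y_test]

# Jointly estimate per-point effects with LASSO; under the sparsity assumption,
# only a few training points should receive coefficients far from zero.
lasso = Lasso(alpha=1e-3).fit(inclusion, utility)
ame_estimates = lasso.coef_

top = np.argsort(-np.abs(ame_estimates))[:5]
print("Training points with the largest estimated effect:", top)
print("Estimated effect values:", ame_estimates[top])
```

The sparse regression is what lets the sample complexity scale with the number of influential points rather than the dataset size: if only a few coefficients are nonzero, far fewer submodel trainings are needed than the one-point-at-a-time estimators would require.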
