Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

Paper Title

Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

Authors

Jinkun Lin, Anqi Zhang, Mathias Lecuyer, Jinyang Li, Aurojit Panda, Siddhartha Sen

Abstract

We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel's behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions ($k \ll N$ datapoints have large AME), our estimator requires only $O(k\log N)$ randomized submodel trainings, improving upon the best prior Shapley value estimators.
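
The pipeline described in the abstract (sample random subsets, evaluate a submodel per subset, then run a LASSO regression on subset membership) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: `toy_utility` stands in for training and evaluating a real submodel, the inclusion probability `p` is fixed rather than drawn from a distribution, the features are simply centered inclusion indicators, and the LASSO is a small hand-written coordinate-descent solver. The names `TRUE_EFFECT`, `estimate_ame`, and all numeric settings are hypothetical.

```python
import random

N_POINTS = 30                      # size of the toy "training set" (assumption)
TRUE_EFFECT = {3: 1.0, 17: -0.8}   # hypothetical: only k = 2 points have a large effect


def toy_utility(subset, rng):
    """Stand-in for 'train a submodel on `subset` and measure its behavior'."""
    signal = sum(TRUE_EFFECT.get(i, 0.0) for i in subset)
    return signal + rng.gauss(0.0, 0.05)  # small evaluation noise


def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Minimize (1/(2m)) * sum_i (y_i - b - X_i . w)^2 + lam * ||w||_1."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    b = sum(y) / m
    resid = [yi - b for yi in y]   # residual y - b - Xw, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(m)) / m for j in range(n)]
    for _ in range(n_sweeps):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * resid[i] for i in range(m)) / m + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(m):
                    resid[i] -= X[i][j] * delta
                w[j] = new_wj
        # Refit the unpenalized intercept.
        shift = sum(resid) / m
        b += shift
        resid = [r - shift for r in resid]
    return b, w


def estimate_ame(n_subsets=120, p=0.5, lam=0.02, seed=0):
    """Randomized experiment: include each point with probability p, then regress."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_subsets):
        included = [rng.random() < p for _ in range(N_POINTS)]
        subset = [i for i, inc in enumerate(included) if inc]
        # Centered inclusion indicator: a coefficient then estimates the
        # difference in utility between including and excluding the point.
        X.append([(1.0 if inc else 0.0) - p for inc in included])
        y.append(toy_utility(subset, rng))
    _, w = lasso_coordinate_descent(X, y, lam)
    return w


if __name__ == "__main__":
    estimates = estimate_ame()
    top = sorted(range(N_POINTS), key=lambda i: -abs(estimates[i]))[:2]
    print("largest estimated effects:", [(i, round(estimates[i], 2)) for i in top])
```

In this toy setting the utility is additive, so each point's average marginal effect equals its entry in `TRUE_EFFECT` (zero for all other points). The L1 penalty shrinks the recovered coefficients slightly toward zero and drives most of the remaining ones to exactly zero, which is the sparsity the abstract relies on.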
