Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

Jinkun Lin, Anqi Zhang, Mathias Lécuyer, Jinyang Li, Aurojit Panda, Siddhartha Sen
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:13468-13504, 2022.

Abstract

We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel’s behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions ($k \ll N$ datapoints have large AME), our estimator requires only $O(k\log N)$ randomized submodel trainings, improving upon the best prior Shapley value estimators.
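To make the recipe in the abstract concrete, the sketch below illustrates the subset-sampling-plus-LASSO idea in Python. It is a simplified illustration under stated assumptions, not the paper's exact AME estimator: `train_and_evaluate` is a hypothetical user-supplied function that trains a submodel on a given subset of training indices and returns a scalar measurement of model behavior (e.g., the loss on one test prediction), and the sketch omits the feature rescaling the paper uses to tie the regression coefficients to the AME.

```python
# Simplified sketch of the subset-sampling + LASSO recipe from the abstract.
# NOT the paper's exact estimator; `train_and_evaluate` is a hypothetical
# user-supplied callable: subset of training indices -> scalar behavior value.
import numpy as np
from sklearn.linear_model import LassoCV

def estimate_contributions(train_and_evaluate, N, num_submodels=500,
                           inclusion_probs=(0.2, 0.4, 0.6, 0.8), seed=0):
    rng = np.random.default_rng(seed)
    X = np.zeros((num_submodels, N))   # subset-membership design matrix
    y = np.zeros(num_submodels)        # submodel behavior measurements
    for i in range(num_submodels):
        p = rng.choice(inclusion_probs)      # sample an inclusion probability
        mask = rng.random(N) < p             # include each point independently
        X[i] = mask.astype(float)
        y[i] = train_and_evaluate(np.flatnonzero(mask))  # train + evaluate
    # Jointly regress the measurements on subset composition; under sparsity
    # (few points with large effect), LASSO needs relatively few submodels.
    lasso = LassoCV(cv=5).fit(X, y)
    return lasso.coef_                 # per-point contribution estimates

# Hypothetical usage:
# scores = estimate_contributions(my_train_and_evaluate, N=len(train_set))
```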

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-lin22h,
  title     = {Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments},
  author    = {Lin, Jinkun and Zhang, Anqi and L{\'e}cuyer, Mathias and Li, Jinyang and Panda, Aurojit and Sen, Siddhartha},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {13468--13504},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/lin22h/lin22h.pdf},
  url       = {https://proceedings.mlr.press/v162/lin22h.html},
  abstract  = {We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel's behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions ($k \ll N$ datapoints have large AME), our estimator requires only $O(k\log N)$ randomized submodel trainings, improving upon the best prior Shapley value estimators.}
}
Endnote
%0 Conference Paper
%T Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments
%A Jinkun Lin
%A Anqi Zhang
%A Mathias Lécuyer
%A Jinyang Li
%A Aurojit Panda
%A Siddhartha Sen
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-lin22h
%I PMLR
%P 13468--13504
%U https://proceedings.mlr.press/v162/lin22h.html
%V 162
%X We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel's behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions ($k \ll N$ datapoints have large AME), our estimator requires only $O(k\log N)$ randomized submodel trainings, improving upon the best prior Shapley value estimators.
APA
Lin, J., Zhang, A., Lécuyer, M., Li, J., Panda, A., & Sen, S. (2022). Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:13468-13504. Available from https://proceedings.mlr.press/v162/lin22h.html.