Datamodels: Understanding Predictions with Data and Data with Predictions

Andrew Ilyas; Sung Min Park; Logan Engstrom; Guillaume Leclerc; Aleksander Madry

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9525-9587, 2022.

Abstract

We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed “target” example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset $S' \subset S$ (using only information about which examples of $S$ are contained in $S'$) predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the complexity of the underlying process being approximated (e.g. end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
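
The definition above can be illustrated with a minimal sketch of a linear datamodel on toy data. Everything in it is an assumption made for illustration: `train_and_evaluate` is a hypothetical stand-in for actually training a model on $S'$ and evaluating it on $x$ (here it is a noisy linear function so the script runs instantly), `true_effect` and `keep_prob` are made-up quantities, and ordinary least squares is used as the fitting procedure only because the abstract does not say how the linear datamodel is estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train = 20      # |S|, size of the full training set (toy value)
n_subsets = 500   # number of sampled subsets S' (toy value)
keep_prob = 0.5   # chance that each example of S lands in S' (assumption)

# Hypothetical per-example effects used only to simulate model outputs.
true_effect = rng.normal(size=n_train)

def train_and_evaluate(mask: np.ndarray) -> float:
    """Stand-in for 'train on S', evaluate on x'; a real pipeline retrains here."""
    return float(mask @ true_effect + 0.1 * rng.normal())

# Each row is the 0/1 indicator vector of one subset S' of S.
masks = (rng.random((n_subsets, n_train)) < keep_prob).astype(float)
outputs = np.array([train_and_evaluate(m) for m in masks])

# Linear datamodel: output ≈ theta · mask + bias, fit by least squares.
design = np.hstack([masks, np.ones((n_subsets, 1))])
params, *_ = np.linalg.lstsq(design, outputs, rcond=None)
theta, bias = params[:-1], params[-1]

def datamodel(mask: np.ndarray) -> float:
    """Predict the outcome for a subset using only its membership indicators."""
    return float(mask @ theta + bias)

# Compare prediction and simulated outcome on a subset not used for fitting.
new_mask = (rng.random(n_train) < keep_prob).astype(float)
predicted = datamodel(new_mask)
actual = train_and_evaluate(new_mask)
print(f"predicted={predicted:.3f}  actual={actual:.3f}")
```

In this sketch each entry of `theta` plays the role of the estimated effect of including one training example on the output for $x$, which is the quantity the listed applications (counterfactual prediction, brittleness, leakage) would build on.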

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-ilyas22a,
  title = 	 {Datamodels: Understanding Predictions with Data and Data with Predictions},
  author =       {Ilyas, Andrew and Park, Sung Min and Engstrom, Logan and Leclerc, Guillaume and Madry, Aleksander},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {9525--9587},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/ilyas22a/ilyas22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/ilyas22a.html},
  abstract = 	 {We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed ``target'' example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset $S' \subset S$ (using only information about which examples of $S$ are contained in $S'$) predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the complexity of the underlying process being approximated (e.g. end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.}
}

Endnote

%0 Conference Paper
%T Datamodels: Understanding Predictions with Data and Data with Predictions
%A Andrew Ilyas
%A Sung Min Park
%A Logan Engstrom
%A Guillaume Leclerc
%A Aleksander Madry
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-ilyas22a
%I PMLR
%P 9525--9587
%U https://proceedings.mlr.press/v162/ilyas22a.html
%V 162
%X We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed “target” example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset $S' \subset S$ (using only information about which examples of $S$ are contained in $S'$) predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the complexity of the underlying process being approximated (e.g. end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.

APA


Ilyas, A., Park, S. M., Engstrom, L., Leclerc, G., & Madry, A. (2022). Datamodels: Understanding Predictions with Data and Data with Predictions. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:9525-9587. Available from https://proceedings.mlr.press/v162/ilyas22a.html.
