Controlling Bias in Adaptive Data Analysis Using Information Theory

Daniel Russo; James Zou

Controlling Bias in Adaptive Data Analysis Using Information Theory

Daniel Russo, James Zou

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:1232-1240, 2016.

Abstract

Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.

Cite this Paper

BibTeX

@InProceedings{pmlr-v51-russo16,
  title = 	 {Controlling Bias in Adaptive Data Analysis Using Information Theory},
  author = 	 {Russo, Daniel and Zou, James},
  booktitle = 	 {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {1232--1240},
  year = 	 {2016},
  editor = 	 {Gretton, Arthur and Robert, Christian C.},
  volume = 	 {51},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Cadiz, Spain},
  month = 	 {09--11 May},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v51/russo16.pdf},
  url = 	 {https://proceedings.mlr.press/v51/russo16.html},
  abstract = 	 {Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.}
}

Endnote

%0 Conference Paper
%T Controlling Bias in Adaptive Data Analysis Using Information Theory
%A Daniel Russo
%A James Zou
%B Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2016
%E Arthur Gretton
%E Christian C. Robert	
%F pmlr-v51-russo16
%I PMLR
%P 1232--1240
%U https://proceedings.mlr.press/v51/russo16.html
%V 51
%X Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.

RIS

TY  - CPAPER
TI  - Controlling Bias in Adaptive Data Analysis Using Information Theory
AU  - Daniel Russo
AU  - James Zou
BT  - Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
DA  - 2016/05/02
ED  - Arthur Gretton
ED  - Christian C. Robert	
ID  - pmlr-v51-russo16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 51
SP  - 1232
EP  - 1240
L1  - http://proceedings.mlr.press/v51/russo16.pdf
UR  - https://proceedings.mlr.press/v51/russo16.html
AB  - Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.
ER  -

APA

Russo, D. & Zou, J.. (2016). Controlling Bias in Adaptive Data Analysis Using Information Theory. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 51:1232-1240 Available from https://proceedings.mlr.press/v51/russo16.html.

Controlling Bias in Adaptive Data Analysis Using Information Theory

Abstract

Cite this Paper

Related Material