Controlling Bias in Adaptive Data Analysis Using Information Theory

Daniel Russo, James Zou
; Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:1232-1240, 2016.

Abstract

Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.

Cite this Paper


BibTeX
@InProceedings{pmlr-v51-russo16, title = {Controlling Bias in Adaptive Data Analysis Using Information Theory}, author = {Daniel Russo and James Zou}, pages = {1232--1240}, year = {2016}, editor = {Arthur Gretton and Christian C. Robert}, volume = {51}, series = {Proceedings of Machine Learning Research}, address = {Cadiz, Spain}, month = {09--11 May}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v51/russo16.pdf}, url = {http://proceedings.mlr.press/v51/russo16.html}, abstract = {Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.} }
Endnote
%0 Conference Paper %T Controlling Bias in Adaptive Data Analysis Using Information Theory %A Daniel Russo %A James Zou %B Proceedings of the 19th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2016 %E Arthur Gretton %E Christian C. Robert %F pmlr-v51-russo16 %I PMLR %J Proceedings of Machine Learning Research %P 1232--1240 %U http://proceedings.mlr.press %V 51 %W PMLR %X Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.
RIS
TY - CPAPER TI - Controlling Bias in Adaptive Data Analysis Using Information Theory AU - Daniel Russo AU - James Zou BT - Proceedings of the 19th International Conference on Artificial Intelligence and Statistics PY - 2016/05/02 DA - 2016/05/02 ED - Arthur Gretton ED - Christian C. Robert ID - pmlr-v51-russo16 PB - PMLR SP - 1232 DP - PMLR EP - 1240 L1 - http://proceedings.mlr.press/v51/russo16.pdf UR - http://proceedings.mlr.press/v51/russo16.html AB - Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations. ER -
APA
Russo, D. & Zou, J.. (2016). Controlling Bias in Adaptive Data Analysis Using Information Theory. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, in PMLR 51:1232-1240

Related Material