CONTRA: Contrarian statistics for controlled variable selection

Mukund Sudarshan, Aahlad Puli, Lakshmi Subramanian, Sriram Sankararaman, Rajesh Ranganath
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:1900-1908, 2021.

Abstract

The holdout randomization test (HRT) discovers a set of covariates most predictive of a response. Given the covariate distribution, HRTs can explicitly control the false discovery rate (FDR). However, if this distribution is unknown and must be estimated from data, HRTs can inflate the FDR. To alleviate the inflation of FDR, we propose the contrarian randomization test (CONTRA), which is designed explicitly for scenarios where the covariate distribution must be estimated from data and may even be misspecified. Our key insight is to use an equal mixture of two “contrarian” probabilistic models in determining the importance of a covariate. One model is fit with the real data, while the other is fit using the same data, but with the covariate being tested replaced with samples from an estimate of the covariate distribution. CONTRA is flexible enough to achieve a power of 1 asymptotically, can reduce the FDR compared to state-of-the-art controlled variable selection (CVS) methods when the covariate distribution is misspecified, and is computationally efficient in high dimensions and large sample sizes. We further demonstrate the effectiveness of CONTRA on numerous synthetic benchmarks, and highlight its capabilities on a genetic dataset.
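
As a rough illustration of the baseline the abstract starts from, the sketch below computes an HRT-style p-value for a single covariate. It is not the CONTRA statistic, which the abstract does not specify in enough detail to reproduce. Everything here is an assumption made for illustration: the function name `hrt_pvalue`, the least-squares response model, the squared-error loss, and the Gaussian linear estimate of the covariate's conditional distribution.

```python
import numpy as np


def hrt_pvalue(x_tr, y_tr, x_ho, y_ho, j, n_null=200, seed=0):
    """HRT-style p-value for covariate j (hypothetical helper, illustrative only)."""
    rng = np.random.default_rng(seed)

    def design(x):
        # Add an intercept column to a covariate matrix.
        return np.column_stack([np.ones(len(x)), x])

    # Assumed response model: ordinary least squares fit on the training split.
    beta, *_ = np.linalg.lstsq(design(x_tr), y_tr, rcond=None)

    def holdout_loss(x):
        # Mean squared error of the fixed model on the holdout responses.
        return np.mean((y_ho - design(x) @ beta) ** 2)

    # Assumed covariate model: x_j given the other covariates is Gaussian
    # with a linear mean, estimated from the training split.
    others = np.delete(np.arange(x_tr.shape[1]), j)
    gamma, *_ = np.linalg.lstsq(design(x_tr[:, others]), x_tr[:, j], rcond=None)
    resid_sd = np.std(x_tr[:, j] - design(x_tr[:, others]) @ gamma)
    mean_ho = design(x_ho[:, others]) @ gamma

    observed = holdout_loss(x_ho)
    at_least_as_good = 0
    for _ in range(n_null):
        # Replace covariate j on the holdout split with a draw from the
        # estimated conditional distribution, then re-evaluate the loss.
        x_null = x_ho.copy()
        x_null[:, j] = mean_ho + resid_sd * rng.standard_normal(len(x_ho))
        at_least_as_good += holdout_loss(x_null) <= observed

    # Small p-value: the real covariate predicts better than its null draws.
    return (1 + at_least_as_good) / (1 + n_null)
```

Calling this once per covariate yields one p-value each; controlling the FDR would additionally require a multiple-testing step, which is omitted here. If the estimated conditional distribution is poor, these p-values can be miscalibrated, which is the failure mode the abstract says CONTRA targets.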

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-sudarshan21a,
  title = {CONTRA: Contrarian statistics for controlled variable selection},
  author = {Sudarshan, Mukund and Puli, Aahlad and Subramanian, Lakshmi and Sankararaman, Sriram and Ranganath, Rajesh},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages = {1900--1908},
  year = {2021},
  editor = {Banerjee, Arindam and Fukumizu, Kenji},
  volume = {130},
  series = {Proceedings of Machine Learning Research},
  month = {13--15 Apr},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v130/sudarshan21a/sudarshan21a.pdf},
  url = {https://proceedings.mlr.press/v130/sudarshan21a.html},
  abstract = {The holdout randomization test (HRT) discovers a set of covariates most predictive of a response. Given the covariate distribution, HRTs can explicitly control the false discovery rate (FDR). However, if this distribution is unknown and must be estimated from data, HRTs can inflate the FDR. To alleviate the inflation of FDR, we propose the contrarian randomization test (CONTRA), which is designed explicitly for scenarios where the covariate distribution must be estimated from data and may even be misspecified. Our key insight is to use an equal mixture of two “contrarian” probabilistic models in determining the importance of a covariate. One model is fit with the real data, while the other is fit using the same data, but with the covariate being tested replaced with samples from an estimate of the covariate distribution. CONTRA is flexible enough to achieve a power of 1 asymptotically, can reduce the FDR compared to state-of-the-art CVS methods when the covariate distribution is misspecified, and is computationally efficient in high dimensions and large sample sizes. We further demonstrate the effectiveness of CONTRA on numerous synthetic benchmarks, and highlight its capabilities on a genetic dataset.}
}
Endnote
%0 Conference Paper
%T CONTRA: Contrarian statistics for controlled variable selection
%A Mukund Sudarshan
%A Aahlad Puli
%A Lakshmi Subramanian
%A Sriram Sankararaman
%A Rajesh Ranganath
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-sudarshan21a
%I PMLR
%P 1900--1908
%U https://proceedings.mlr.press/v130/sudarshan21a.html
%V 130
%X The holdout randomization test (HRT) discovers a set of covariates most predictive of a response. Given the covariate distribution, HRTs can explicitly control the false discovery rate (FDR). However, if this distribution is unknown and must be estimated from data, HRTs can inflate the FDR. To alleviate the inflation of FDR, we propose the contrarian randomization test (CONTRA), which is designed explicitly for scenarios where the covariate distribution must be estimated from data and may even be misspecified. Our key insight is to use an equal mixture of two “contrarian” probabilistic models in determining the importance of a covariate. One model is fit with the real data, while the other is fit using the same data, but with the covariate being tested replaced with samples from an estimate of the covariate distribution. CONTRA is flexible enough to achieve a power of 1 asymptotically, can reduce the FDR compared to state-of-the-art CVS methods when the covariate distribution is misspecified, and is computationally efficient in high dimensions and large sample sizes. We further demonstrate the effectiveness of CONTRA on numerous synthetic benchmarks, and highlight its capabilities on a genetic dataset.
APA
Sudarshan, M., Puli, A., Subramanian, L., Sankararaman, S., & Ranganath, R. (2021). CONTRA: Contrarian statistics for controlled variable selection. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:1900-1908. Available from https://proceedings.mlr.press/v130/sudarshan21a.html.
