Knockoffs for the Mass: New Feature Importance Statistics with False Discovery Guarantees

Jaime Roquero Gimenez, Amirata Ghorbani, James Zou
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2125-2133, 2019.

Abstract

An important problem in machine learning and statistics is to identify features that causally affect the outcome. This is often impossible to do from purely observational data, and a natural relaxation is to identify features that are correlated with the outcome even conditioned on all other observed features. For example, we want to identify that smoking really is correlated with cancer conditioned on demographics. The knockoff procedure is a recent breakthrough in statistics that, in theory, can identify truly correlated features while guaranteeing that false discovery rate is controlled. The idea is to create synthetic data (knockoffs) that capture correlations among the features. However, there are substantial computational and practical challenges to generating and using knockoffs. This paper makes several key advances that enable knockoff application to be more efficient and powerful. We develop an efficient algorithm to generate valid knockoffs from Bayesian Networks. Then we systematically evaluate knockoff test statistics and develop new statistics with improved power. The paper combines new mathematical guarantees with systematic experiments on real and synthetic data.
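
For readers unfamiliar with how knockoffs lead to false discovery rate control, the selection step can be sketched as follows. This is a minimal illustration and is not taken from the paper: it assumes the feature statistics W_j have already been computed by some means (a large positive W_j suggests the real feature outperforms its knockoff), and it applies the data-dependent "knockoff+" threshold commonly used in the knockoff filter literature. The function names `knockoff_plus_threshold` and `knockoff_select` and the example values are hypothetical.

```python
import numpy as np


def knockoff_plus_threshold(W, q):
    """Return the smallest t > 0 whose estimated false discovery
    proportion (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) is at most q.

    Returns np.inf when no threshold qualifies (nothing is selected).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf


def knockoff_select(W, q):
    """Indices of features whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)


# Hypothetical statistics for eight features, target FDR level q = 0.25.
W_example = [5.0, 4.0, 3.0, 2.5, 2.0, -1.0, 0.5, -0.2]
print(knockoff_plus_threshold(W_example, 0.25))
print(knockoff_select(W_example, 0.25))
```

The negative statistics act as a stand-in count of false positives among the positive ones, which is why the ratio inside the loop serves as an estimate of the false discovery proportion. How the W_j are constructed, and how the knockoffs themselves are generated, are the subject of the paper and are not shown here.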

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-gimenez19a,
  title     = {Knockoffs for the Mass: New Feature Importance Statistics with False Discovery Guarantees},
  author    = {Gimenez, Jaime Roquero and Ghorbani, Amirata and Zou, James},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages     = {2125--2133},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume    = {89},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf},
  url       = {https://proceedings.mlr.press/v89/gimenez19a.html},
  abstract  = {An important problem in machine learning and statistics is to identify features that causally affect the outcome. This is often impossible to do from purely observational data, and a natural relaxation is to identify features that are correlated with the outcome even conditioned on all other observed features. For example, we want to identify that smoking really is correlated with cancer conditioned on demographics. The knockoff procedure is a recent breakthrough in statistics that, in theory, can identify truly correlated features while guaranteeing that false discovery rate is controlled. The idea is to create synthetic data (knockoffs) that capture correlations among the features. However, there are substantial computational and practical challenges to generating and using knockoffs. This paper makes several key advances that enable knockoff application to be more efficient and powerful. We develop an efficient algorithm to generate valid knockoffs from Bayesian Networks. Then we systematically evaluate knockoff test statistics and develop new statistics with improved power. The paper combines new mathematical guarantees with systematic experiments on real and synthetic data.}
}
Endnote
%0 Conference Paper
%T Knockoffs for the Mass: New Feature Importance Statistics with False Discovery Guarantees
%A Jaime Roquero Gimenez
%A Amirata Ghorbani
%A James Zou
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-gimenez19a
%I PMLR
%P 2125--2133
%U https://proceedings.mlr.press/v89/gimenez19a.html
%V 89
%X An important problem in machine learning and statistics is to identify features that causally affect the outcome. This is often impossible to do from purely observational data, and a natural relaxation is to identify features that are correlated with the outcome even conditioned on all other observed features. For example, we want to identify that smoking really is correlated with cancer conditioned on demographics. The knockoff procedure is a recent breakthrough in statistics that, in theory, can identify truly correlated features while guaranteeing that false discovery rate is controlled. The idea is to create synthetic data (knockoffs) that capture correlations among the features. However, there are substantial computational and practical challenges to generating and using knockoffs. This paper makes several key advances that enable knockoff application to be more efficient and powerful. We develop an efficient algorithm to generate valid knockoffs from Bayesian Networks. Then we systematically evaluate knockoff test statistics and develop new statistics with improved power. The paper combines new mathematical guarantees with systematic experiments on real and synthetic data.
APA
Gimenez, J.R., Ghorbani, A. & Zou, J. (2019). Knockoffs for the Mass: New Feature Importance Statistics with False Discovery Guarantees. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:2125-2133. Available from https://proceedings.mlr.press/v89/gimenez19a.html.
