Improving the Fairness of Chest X-ray Classifiers

Haoran Zhang, Natalie Dullerud, Karsten Roth, Lauren Oakden-Rayner, Stephen Pfohl, Marzyeh Ghassemi
Proceedings of the Conference on Health, Inference, and Learning, PMLR 174:204-233, 2022.

Abstract

Deep learning models have reached or surpassed human-level performance in the field of medical imaging, especially in disease diagnosis using chest x-rays. However, prior work has found that such classifiers can exhibit biases in the form of gaps in predictive performance across protected groups. In this paper, we question whether striving to achieve zero disparities in predictive performance (i.e. group fairness) is the appropriate fairness definition in the clinical setting, over minimax fairness, which focuses on maximizing the performance of the worst-case group. We benchmark the performance of nine methods in improving classifier fairness across these two definitions. We find, consistent with prior work on non-clinical data, that methods which strive to achieve better worst-group performance do not outperform simple data balancing. We also find that methods which achieve group fairness do so by worsening performance for all groups. In light of these results, we discuss the utility of fairness definitions in the clinical setting, advocating for an investigation of the bias-inducing mechanisms in the underlying data distribution whenever possible.
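The two fairness definitions the abstract contrasts can be made concrete with a small sketch. Given per-group performance scores (the group names and accuracy values below are hypothetical, not from the paper), group fairness drives the best-to-worst gap toward zero, while minimax fairness maximizes the worst group's score:

```python
# Hypothetical per-group accuracies, for illustration only.
group_acc = {"A": 0.85, "B": 0.78, "C": 0.90}

# Group fairness targets zero disparity: the gap between the
# best- and worst-performing protected groups.
disparity = max(group_acc.values()) - min(group_acc.values())

# Minimax fairness instead cares only about the worst-case group:
# a method is better if it raises this number, even if gaps remain.
worst_group_acc = min(group_acc.values())

print(round(disparity, 2))   # gap between best and worst group
print(worst_group_acc)       # worst-case group performance
```

Note the tension the paper highlights: a classifier can reach zero disparity by lowering every group's accuracy to the worst group's level, which satisfies group fairness while helping no one.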

Cite this Paper


BibTeX
@InProceedings{pmlr-v174-zhang22a,
  title     = {Improving the Fairness of Chest X-ray Classifiers},
  author    = {Zhang, Haoran and Dullerud, Natalie and Roth, Karsten and Oakden-Rayner, Lauren and Pfohl, Stephen and Ghassemi, Marzyeh},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages     = {204--233},
  year      = {2022},
  editor    = {Flores, Gerardo and Chen, George H and Pollard, Tom and Ho, Joyce C and Naumann, Tristan},
  volume    = {174},
  series    = {Proceedings of Machine Learning Research},
  month     = {07--08 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v174/zhang22a/zhang22a.pdf},
  url       = {https://proceedings.mlr.press/v174/zhang22a.html},
  abstract  = {Deep learning models have reached or surpassed human-level performance in the field of medical imaging, especially in disease diagnosis using chest x-rays. However, prior work has found that such classifiers can exhibit biases in the form of gaps in predictive performance across protected groups. In this paper, we question whether striving to achieve zero disparities in predictive performance (i.e. group fairness) is the appropriate fairness definition in the clinical setting, over minimax fairness, which focuses on maximizing the performance of the worst-case group. We benchmark the performance of nine methods in improving classifier fairness across these two definitions. We find, consistent with prior work on non-clinical data, that methods which strive to achieve better worst-group performance do not outperform simple data balancing. We also find that methods which achieve group fairness do so by worsening performance for all groups. In light of these results, we discuss the utility of fairness definitions in the clinical setting, advocating for an investigation of the bias-inducing mechanisms in the underlying data distribution whenever possible.}
}
Endnote
%0 Conference Paper
%T Improving the Fairness of Chest X-ray Classifiers
%A Haoran Zhang
%A Natalie Dullerud
%A Karsten Roth
%A Lauren Oakden-Rayner
%A Stephen Pfohl
%A Marzyeh Ghassemi
%B Proceedings of the Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Gerardo Flores
%E George H Chen
%E Tom Pollard
%E Joyce C Ho
%E Tristan Naumann
%F pmlr-v174-zhang22a
%I PMLR
%P 204--233
%U https://proceedings.mlr.press/v174/zhang22a.html
%V 174
%X Deep learning models have reached or surpassed human-level performance in the field of medical imaging, especially in disease diagnosis using chest x-rays. However, prior work has found that such classifiers can exhibit biases in the form of gaps in predictive performance across protected groups. In this paper, we question whether striving to achieve zero disparities in predictive performance (i.e. group fairness) is the appropriate fairness definition in the clinical setting, over minimax fairness, which focuses on maximizing the performance of the worst-case group. We benchmark the performance of nine methods in improving classifier fairness across these two definitions. We find, consistent with prior work on non-clinical data, that methods which strive to achieve better worst-group performance do not outperform simple data balancing. We also find that methods which achieve group fairness do so by worsening performance for all groups. In light of these results, we discuss the utility of fairness definitions in the clinical setting, advocating for an investigation of the bias-inducing mechanisms in the underlying data distribution whenever possible.
APA
Zhang, H., Dullerud, N., Roth, K., Oakden-Rayner, L., Pfohl, S. & Ghassemi, M. (2022). Improving the Fairness of Chest X-ray Classifiers. Proceedings of the Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 174:204-233. Available from https://proceedings.mlr.press/v174/zhang22a.html.