Collecting data when missingness is unknown: a method for improving model performance given under-reporting in patient populations

Kevin Wu; Dominik Dahlem; Christopher Hane; Eran Halperin; James Zou

Proceedings of the Conference on Health, Inference, and Learning, PMLR 209:229-242, 2023.

Abstract

Machine learning models for healthcare commonly use binary indicator variables to represent the diagnosis of specific health conditions in medical records. However, in populations with significant under-reporting, the absence of a recorded diagnosis does not rule out the presence of a condition, making it difficult to distinguish between negative and missing values. This effect, which we refer to as latent missingness, may lead to model degradation and perpetuate existing biases in healthcare. To address this issue, we propose that healthcare providers and payers allocate a budget towards data collection (e.g., subsidies for check-ups or lab tests). However, given finite resources, only a subset of data points can be collected. Additionally, most models are unable to be re-trained after deployment. In this paper, we propose a method for efficient data collection in order to maximize a fixed model’s performance on a given population. Through simulated and real-world data, we demonstrate the potential value of targeted data collection to address model degradation.

Cite this Paper

BibTeX

@InProceedings{pmlr-v209-wu23b,
  title = 	 {Collecting data when missingness is unknown: a method for improving model performance given under-reporting in patient populations},
  author =       {Wu, Kevin and Dahlem, Dominik and Hane, Christopher and Halperin, Eran and Zou, James},
  booktitle = 	 {Proceedings of the Conference on Health, Inference, and Learning},
  pages = 	 {229--242},
  year = 	 {2023},
  editor = 	 {Mortazavi, Bobak J. and Sarker, Tasmie and Beam, Andrew and Ho, Joyce C.},
  volume = 	 {209},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {22 Jun--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v209/wu23b/wu23b.pdf},
  url = 	 {https://proceedings.mlr.press/v209/wu23b.html},
  abstract = 	 {Machine learning models for healthcare commonly use binary indicator variables to represent the diagnosis of specific health conditions in medical records. However, in populations with significant under-reporting, the absence of a recorded diagnosis does not rule out the presence of a condition, making it difficult to distinguish between negative and missing values. This effect, which we refer to as latent missingness, may lead to model degradation and perpetuate existing biases in healthcare. To address this issue, we propose that healthcare providers and payers allocate a budget towards data collection (e.g., subsidies for check-ups or lab tests). However, given finite resources, only a subset of data points can be collected. Additionally, most models are unable to be re-trained after deployment. In this paper, we propose a method for efficient data collection in order to maximize a fixed model’s performance on a given population. Through simulated and real-world data, we demonstrate the potential value of targeted data collection to address model degradation.}
}

Endnote

%0 Conference Paper
%T Collecting data when missingness is unknown: a method for improving model performance given under-reporting in patient populations
%A Kevin Wu
%A Dominik Dahlem
%A Christopher Hane
%A Eran Halperin
%A James Zou
%B Proceedings of the Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Bobak J. Mortazavi
%E Tasmie Sarker
%E Andrew Beam
%E Joyce C. Ho
%F pmlr-v209-wu23b
%I PMLR
%P 229--242
%U https://proceedings.mlr.press/v209/wu23b.html
%V 209
%X Machine learning models for healthcare commonly use binary indicator variables to represent the diagnosis of specific health conditions in medical records. However, in populations with significant under-reporting, the absence of a recorded diagnosis does not rule out the presence of a condition, making it difficult to distinguish between negative and missing values. This effect, which we refer to as latent missingness, may lead to model degradation and perpetuate existing biases in healthcare. To address this issue, we propose that healthcare providers and payers allocate a budget towards data collection (e.g., subsidies for check-ups or lab tests). However, given finite resources, only a subset of data points can be collected. Additionally, most models are unable to be re-trained after deployment. In this paper, we propose a method for efficient data collection in order to maximize a fixed model’s performance on a given population. Through simulated and real-world data, we demonstrate the potential value of targeted data collection to address model degradation.

APA

Wu, K., Dahlem, D., Hane, C., Halperin, E. & Zou, J. (2023). Collecting data when missingness is unknown: a method for improving model performance given under-reporting in patient populations. Proceedings of the Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 209:229-242. Available from https://proceedings.mlr.press/v209/wu23b.html.
