Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift

Chao Ying, Jun Jin, Yi Guo, Xiudi Li, Muxuan Liang, Jiwei Zhao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:72505-72534, 2025.

Abstract

Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ying25a,
  title = {Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift},
  author = {Ying, Chao and Jin, Jun and Guo, Yi and Li, Xiudi and Liang, Muxuan and Zhao, Jiwei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {72505--72534},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ying25a/ying25a.pdf},
  url = {https://proceedings.mlr.press/v267/ying25a.html},
  abstract = {Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach.}
}
Endnote
%0 Conference Paper
%T Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift
%A Chao Ying
%A Jun Jin
%A Yi Guo
%A Xiudi Li
%A Muxuan Liang
%A Jiwei Zhao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ying25a
%I PMLR
%P 72505--72534
%U https://proceedings.mlr.press/v267/ying25a.html
%V 267
%X Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach.
APA
Ying, C., Jin, J., Guo, Y., Li, X., Liang, M., & Zhao, J. (2025). Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:72505-72534. Available from https://proceedings.mlr.press/v267/ying25a.html.
