Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens

Jean Feng, Adarsh Subbaswamy, Alexej Gossmann, Harvineet Singh, Berkman Sahiner, Mi-Ok Kim, Gene Anthony Pennello, Nicholas Petrick, Romain Pirracchio, Fan Xia
Proceedings of the Third Conference on Causal Learning and Reasoning, PMLR 236:587-608, 2024.

Abstract

After a machine learning (ML)-based system is deployed, monitoring its performance is important to ensure the safety and effectiveness of the algorithm over time. When an ML algorithm interacts with its environment, the algorithm can affect the data-generating mechanism and be a major source of bias when evaluating its standalone performance, an issue known as performativity. Although prior work has shown how to validate models in the presence of performativity using causal inference techniques, there has been little work on how to monitor models in the presence of performativity. Unlike the setting of model validation, there is much less agreement on which performance metrics to monitor. Different monitoring criteria impact how interpretable the resulting test statistic is, what assumptions are needed for identifiability, and the speed of detection. When this choice is further coupled with the decision to use observational versus interventional data, ML deployment teams are faced with a multitude of monitoring options. The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy and how causal reasoning can provide a systematic framework for choosing between these options. As a motivating example, we consider an ML-based risk prediction algorithm for predicting unplanned readmissions. Bringing together tools from causal inference and statistical process control, we consider six monitoring procedures (three candidate monitoring criteria and two data sources) and investigate their operating characteristics in simulation studies. Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal, which has real-world impacts on the design and documentation of ML monitoring systems.
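As one concrete illustration of the statistical-process-control side of such a monitoring strategy: a one-sided CUSUM chart accumulates evidence that a monitored performance statistic (here, a per-batch Brier score) has drifted above its in-control target and raises an alarm once the accumulated excess crosses a threshold. This is a generic, minimal sketch, not one of the six procedures studied in the paper; the metric, target, slack k, and threshold h below are hypothetical choices for illustration only.

import numpy as np

def cusum_alarm_time(scores, target, k=0.02, h=0.5):
    """Return the first index at which the upper CUSUM statistic exceeds h,
    or None if no alarm is raised. `scores` is a sequence of per-batch
    performance statistics; `target` is the expected in-control value;
    `k` is a slack allowance; `h` is the alarm threshold."""
    s = 0.0
    for t, x in enumerate(scores):
        # Accumulate only upward deviations beyond the slack k; reset at zero.
        s = max(0.0, s + (x - target - k))
        if s > h:
            return t
    return None

# Hypothetical example: per-batch Brier scores that drift upward mid-stream,
# mimicking a decay in model performance after deployment.
rng = np.random.default_rng(0)
in_control = rng.normal(0.15, 0.02, size=50)   # stable performance
drifted = rng.normal(0.22, 0.02, size=50)      # degraded performance
print(cusum_alarm_time(np.concatenate([in_control, drifted]), target=0.15))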

Cite this Paper


BibTeX
@InProceedings{pmlr-v236-feng24a,
  title     = {Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens},
  author    = {Feng, Jean and Subbaswamy, Adarsh and Gossmann, Alexej and Singh, Harvineet and Sahiner, Berkman and Kim, Mi-Ok and Pennello, Gene Anthony and Petrick, Nicholas and Pirracchio, Romain and Xia, Fan},
  booktitle = {Proceedings of the Third Conference on Causal Learning and Reasoning},
  pages     = {587--608},
  year      = {2024},
  editor    = {Locatello, Francesco and Didelez, Vanessa},
  volume    = {236},
  series    = {Proceedings of Machine Learning Research},
  month     = {01--03 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v236/feng24a/feng24a.pdf},
  url       = {https://proceedings.mlr.press/v236/feng24a.html}
}
Endnote
%0 Conference Paper
%T Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens
%A Jean Feng
%A Adarsh Subbaswamy
%A Alexej Gossmann
%A Harvineet Singh
%A Berkman Sahiner
%A Mi-Ok Kim
%A Gene Anthony Pennello
%A Nicholas Petrick
%A Romain Pirracchio
%A Fan Xia
%B Proceedings of the Third Conference on Causal Learning and Reasoning
%C Proceedings of Machine Learning Research
%D 2024
%E Francesco Locatello
%E Vanessa Didelez
%F pmlr-v236-feng24a
%I PMLR
%P 587--608
%U https://proceedings.mlr.press/v236/feng24a.html
%V 236
APA
Feng, J., Subbaswamy, A., Gossmann, A., Singh, H., Sahiner, B., Kim, M.-O., Pennello, G.A., Petrick, N., Pirracchio, R. & Xia, F. (2024). Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens. Proceedings of the Third Conference on Causal Learning and Reasoning, in Proceedings of Machine Learning Research 236:587-608. Available from https://proceedings.mlr.press/v236/feng24a.html.