EHR Safari: Data is Contextual

William Boag, Mercy Oladipo, Peter Szolovits
Proceedings of the 7th Machine Learning for Healthcare Conference, PMLR 182:391-408, 2022.

Abstract

In the last decade, machine learning (ML) has shown tremendous success in areas such as vision, language, strategic games, and more. Parallel to this, hospitals’ capacity for data collection has greatly increased with the adoption and continuing maturation of electronic health records (EHRs). The result of these trends has been a large degree of excitement and optimism about how ML will revolutionize healthcare once researchers get access to data. In this work, we present a cautionary tale of the instinct some computer scientists have to “let the data speak for itself.” Using a popular, public EHR dataset as a case study, we demonstrate numerous examples where a non-clinician’s intuition may lead to incorrect – and potentially harmful – modeling assumptions. We explore both non-obvious quirks in the data (i.e., hypothetical incorrect assumptions) and examples of published papers that misunderstood the data generating process (i.e., actual incorrect assumptions). This case study is meant to serve as a cautionary tale to encourage every data scientist to approach their projects with the humility to know what they can do well and what they cannot. Without the guidance of stakeholders that understand the data generating process, data scientists run the risk of “garbage-in, garbage-out” analysis because their models are not measuring meaningful relationships.

Cite this Paper


BibTeX
@InProceedings{pmlr-v182-boag22a,
  title = {EHR Safari: Data is Contextual},
  author = {Boag, William and Oladipo, Mercy and Szolovits, Peter},
  booktitle = {Proceedings of the 7th Machine Learning for Healthcare Conference},
  pages = {391--408},
  year = {2022},
  editor = {Lipton, Zachary and Ranganath, Rajesh and Sendak, Mark and Sjoding, Michael and Yeung, Serena},
  volume = {182},
  series = {Proceedings of Machine Learning Research},
  month = {05--06 Aug},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v182/boag22a/boag22a.pdf},
  url = {https://proceedings.mlr.press/v182/boag22a.html},
  abstract = {In the last decade, machine learning (ML) has shown tremendous success in areas such as vision, language, strategic games, and more. Parallel to this, hospitals’ capacity for data collection has greatly increased with the adoption and continuing maturation of electronic health records (EHRs). The result of these trends has been a large degree of excitement and optimism about how ML will revolutionize healthcare once researchers get access to data. In this work, we present a cautionary tale of the instinct some computer scientists have to “let the data speak for itself.” Using a popular, public EHR dataset as a case study, we demonstrate numerous examples where a non-clinician’s intuition may lead to incorrect – and potentially harmful – modeling assumptions. We explore both non-obvious quirks in the data (i.e., hypothetical incorrect assumptions) and examples of published papers that misunderstood the data generating process (i.e., actual incorrect assumptions). This case study is meant to serve as a cautionary tale to encourage every data scientist to approach their projects with the humility to know what they can do well and what they cannot. Without the guidance of stakeholders that understand the data generating process, data scientists run the risk of “garbage-in, garbage-out” analysis because their models are not measuring meaningful relationships.}
}
Endnote
%0 Conference Paper
%T EHR Safari: Data is Contextual
%A William Boag
%A Mercy Oladipo
%A Peter Szolovits
%B Proceedings of the 7th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Zachary Lipton
%E Rajesh Ranganath
%E Mark Sendak
%E Michael Sjoding
%E Serena Yeung
%F pmlr-v182-boag22a
%I PMLR
%P 391--408
%U https://proceedings.mlr.press/v182/boag22a.html
%V 182
%X In the last decade, machine learning (ML) has shown tremendous success in areas such as vision, language, strategic games, and more. Parallel to this, hospitals’ capacity for data collection has greatly increased with the adoption and continuing maturation of electronic health records (EHRs). The result of these trends has been a large degree of excitement and optimism about how ML will revolutionize healthcare once researchers get access to data. In this work, we present a cautionary tale of the instinct some computer scientists have to “let the data speak for itself.” Using a popular, public EHR dataset as a case study, we demonstrate numerous examples where a non-clinician’s intuition may lead to incorrect – and potentially harmful – modeling assumptions. We explore both non-obvious quirks in the data (i.e., hypothetical incorrect assumptions) and examples of published papers that misunderstood the data generating process (i.e., actual incorrect assumptions). This case study is meant to serve as a cautionary tale to encourage every data scientist to approach their projects with the humility to know what they can do well and what they cannot. Without the guidance of stakeholders that understand the data generating process, data scientists run the risk of “garbage-in, garbage-out” analysis because their models are not measuring meaningful relationships.
APA
Boag, W., Oladipo, M. & Szolovits, P. (2022). EHR Safari: Data is Contextual. Proceedings of the 7th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 182:391-408. Available from https://proceedings.mlr.press/v182/boag22a.html.
