A cross-study Analysis of Wearable Datasets and the Generalizability of Acute Illness Monitoring Models

Patrick Kasl, Severine Soltani, Lauryn Keeler Bruce, Varun Kumar Viswanath, Wendy Hartogensis, Amarnath Gupta, Ilkay Altintas, Stephan Dilchert, Frederick M Hecht, Ashley Mason, Benjamin L Smarr
Proceedings of the fifth Conference on Health, Inference, and Learning, PMLR 248:644-682, 2024.

Abstract

Large-scale wearable datasets are increasingly being used for biomedical research and to develop machine learning (ML) models for longitudinal health monitoring applications. However, it is largely unknown whether biases in these datasets lead to findings that do not generalize. Here, we present the first comparison of the data underlying multiple longitudinal, wearable-device-based datasets. We examine participant-level resting heart rate (HR) from four studies, each with thousands of wearable device users. We demonstrate that multiple regression, a community standard statistical approach, leads to conflicting conclusions about important demographic variables (age vs resting HR) and significant intra- and inter-dataset differences in HR. We then directly test the cross-dataset generalizability of a commonly used ML model trained for three existing day-level monitoring tasks: prediction of testing positive for a respiratory virus, flu symptoms, and fever symptoms. Regardless of task, most models showed relative performance loss on external datasets; most of this performance change can be attributed to concept shift between datasets. These findings suggest that research using large-scale, pre-existing wearable datasets might face bias and generalizability challenges similar to research in more established biomedical and ML disciplines. We hope that the findings from this study will encourage discussion in the wearable-ML community around standards that anticipate and account for challenges in dataset bias and model generalizability.
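The abstract's central experiment, training a day-level illness model on one cohort and scoring it on an external one, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the data is synthetic, the two features (a resting-HR anomaly and a step count), the gradient-boosted classifier, and the way "concept shift" is simulated (a weaker HR-label relationship in the external dataset) are placeholders, not the paper's actual features, model, or datasets.

```python
# Minimal sketch of a cross-dataset generalizability check for a
# day-level illness-detection model. Synthetic data throughout;
# features, classifier, and the simulated concept shift are
# illustrative assumptions, not the paper's pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_dataset(n, hr_label_slope):
    """Simulate day-level features and a binary illness label whose
    dependence on resting HR varies by dataset (concept shift)."""
    hr_anomaly = rng.normal(0, 1, n)   # z-scored resting-HR deviation
    steps = rng.normal(0, 1, n)        # z-scored daily step count
    logits = hr_label_slope * hr_anomaly - 0.5 * steps - 1.0
    y = rng.random(n) < 1 / (1 + np.exp(-logits))
    X = np.column_stack([hr_anomaly, steps])
    return X, y.astype(int)

# Dataset A, plus an "external" dataset B where the HR-illness
# relationship is weaker than the one the model was trained on.
X_a, y_a = make_dataset(5000, hr_label_slope=1.5)
X_b, y_b = make_dataset(5000, hr_label_slope=0.5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_a, y_a, test_size=0.3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Internal (held-out) vs. external performance, and the relative loss.
auc_internal = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
auc_external = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(f"internal AUROC: {auc_internal:.3f}")
print(f"external AUROC: {auc_external:.3f}")
print(f"relative change: {(auc_external - auc_internal) / auc_internal:+.1%}")
```

Because the only difference between the two synthetic cohorts is the label-generating relationship, the external AUROC drop in this sketch is attributable to concept shift by construction, which mirrors the attribution argument the abstract describes.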

Cite this Paper

BibTeX
@InProceedings{pmlr-v248-kasl24a,
  title     = {A cross-study Analysis of Wearable Datasets and the Generalizability of Acute Illness Monitoring Models},
  author    = {Kasl, Patrick and Soltani, Severine and Keeler Bruce, Lauryn and Kumar Viswanath, Varun and Hartogensis, Wendy and Gupta, Amarnath and Altintas, Ilkay and Dilchert, Stephan and Hecht, Frederick M and Mason, Ashley and Smarr, Benjamin L},
  booktitle = {Proceedings of the fifth Conference on Health, Inference, and Learning},
  pages     = {644--682},
  year      = {2024},
  editor    = {Pollard, Tom and Choi, Edward and Singhal, Pankhuri and Hughes, Michael and Sizikova, Elena and Mortazavi, Bobak and Chen, Irene and Wang, Fei and Sarker, Tasmie and McDermott, Matthew and Ghassemi, Marzyeh},
  volume    = {248},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--28 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v248/main/assets/kasl24a/kasl24a.pdf},
  url       = {https://proceedings.mlr.press/v248/kasl24a.html},
  abstract  = {Large-scale wearable datasets are increasingly being used for biomedical research and to develop machine learning (ML) models for longitudinal health monitoring applications. However, it is largely unknown whether biases in these datasets lead to findings that do not generalize. Here, we present the first comparison of the data underlying multiple longitudinal, wearable-device-based datasets. We examine participant-level resting heart rate (HR) from four studies, each with thousands of wearable device users. We demonstrate that multiple regression, a community standard statistical approach, leads to conflicting conclusions about important demographic variables (age vs resting HR) and significant intra- and inter-dataset differences in HR. We then directly test the cross-dataset generalizability of a commonly used ML model trained for three existing day-level monitoring tasks: prediction of testing positive for a respiratory virus, flu symptoms, and fever symptoms. Regardless of task, most models showed relative performance loss on external datasets; most of this performance change can be attributed to concept shift between datasets. These findings suggest that research using large-scale, pre-existing wearable datasets might face bias and generalizability challenges similar to research in more established biomedical and ML disciplines. We hope that the findings from this study will encourage discussion in the wearable-ML community around standards that anticipate and account for challenges in dataset bias and model generalizability.}
}
Endnote
%0 Conference Paper
%T A cross-study Analysis of Wearable Datasets and the Generalizability of Acute Illness Monitoring Models
%A Patrick Kasl
%A Severine Soltani
%A Lauryn Keeler Bruce
%A Varun Kumar Viswanath
%A Wendy Hartogensis
%A Amarnath Gupta
%A Ilkay Altintas
%A Stephan Dilchert
%A Frederick M Hecht
%A Ashley Mason
%A Benjamin L Smarr
%B Proceedings of the fifth Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Tom Pollard
%E Edward Choi
%E Pankhuri Singhal
%E Michael Hughes
%E Elena Sizikova
%E Bobak Mortazavi
%E Irene Chen
%E Fei Wang
%E Tasmie Sarker
%E Matthew McDermott
%E Marzyeh Ghassemi
%F pmlr-v248-kasl24a
%I PMLR
%P 644--682
%U https://proceedings.mlr.press/v248/kasl24a.html
%V 248
%X Large-scale wearable datasets are increasingly being used for biomedical research and to develop machine learning (ML) models for longitudinal health monitoring applications. However, it is largely unknown whether biases in these datasets lead to findings that do not generalize. Here, we present the first comparison of the data underlying multiple longitudinal, wearable-device-based datasets. We examine participant-level resting heart rate (HR) from four studies, each with thousands of wearable device users. We demonstrate that multiple regression, a community standard statistical approach, leads to conflicting conclusions about important demographic variables (age vs resting HR) and significant intra- and inter-dataset differences in HR. We then directly test the cross-dataset generalizability of a commonly used ML model trained for three existing day-level monitoring tasks: prediction of testing positive for a respiratory virus, flu symptoms, and fever symptoms. Regardless of task, most models showed relative performance loss on external datasets; most of this performance change can be attributed to concept shift between datasets. These findings suggest that research using large-scale, pre-existing wearable datasets might face bias and generalizability challenges similar to research in more established biomedical and ML disciplines. We hope that the findings from this study will encourage discussion in the wearable-ML community around standards that anticipate and account for challenges in dataset bias and model generalizability.
APA
Kasl, P., Soltani, S., Keeler Bruce, L., Kumar Viswanath, V., Hartogensis, W., Gupta, A., Altintas, I., Dilchert, S., Hecht, F.M., Mason, A. & Smarr, B.L. (2024). A cross-study Analysis of Wearable Datasets and the Generalizability of Acute Illness Monitoring Models. Proceedings of the fifth Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 248:644-682. Available from https://proceedings.mlr.press/v248/kasl24a.html.
