From Zero-Shot to Bedside: A Practical Playbook for Adapting Open-Source Large Language Models to Clinical Symptom Extraction

Li-Ching Chen, Travis Zack, Divneet Mandair, Aditya Mahadevan, Arvind Suresh, Yuta Ishiyama, Yiping Li, Julian C. Hong, Atul J. Butte
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1023-1046, 2026.

Abstract

Large language models (LLMs) are increasingly applied to clinical notes, but guidance on how to adapt open-source models to specific tasks and manage annotation quality at scale is limited. We present a playbook for fine-tuning LLMs on de-identified clinical notes from patients with pancreatic cancer, spanning both pre-diagnosis and on-treatment settings. We evaluate prompting strategies, contrast open-source models with GPT-4o, and explore disease-level versus task-specific adaptation. A key contribution is an LLM-assisted adjudication workflow in which models flag notes where predictions consistently conflict with initial human labels. This approach concentrated expert review on a small fraction of cases while identifying many true annotation errors, ultimately improving downstream model performance. We further examine the use of machine-generated annotations to augment limited expert labels, showing that balanced mixtures of synthetic and human data can enhance fine-tuned models. Our findings provide practical guidance for deploying open-source LLMs in clinical contexts, offering strategies to improve accuracy, reduce annotation burden, and enable privacy-preserving, site-adapted clinical natural language processing (NLP).
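The adjudication workflow described above could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `flag_for_adjudication`, the data shapes, and the disagreement threshold are all assumptions made for the example. The idea is to flag notes whose model predictions disagree with the initial human label across most repeated runs, so that expert review concentrates on that small subset.

```python
def flag_for_adjudication(human_labels, model_runs, threshold=0.8):
    """Flag notes for expert review (illustrative sketch).

    human_labels: {note_id: label} from the initial human annotation pass.
    model_runs:   list of {note_id: label}, one dict per model prediction run.
    threshold:    minimum fraction of runs that must disagree with the
                  human label for the note to be flagged.
    Returns a list of note_ids whose model predictions consistently
    conflict with the initial human label.
    """
    flagged = []
    for note_id, human in human_labels.items():
        preds = [run[note_id] for run in model_runs if note_id in run]
        if not preds:
            continue  # no model predictions for this note
        disagree_rate = sum(p != human for p in preds) / len(preds)
        if disagree_rate >= threshold:
            flagged.append(note_id)
    return flagged
```

Under this sketch, only consistently conflicting notes (e.g. a symptom the human marked present that every model run marked absent) reach the expert queue, which is how such a filter would reduce review burden while still surfacing likely annotation errors.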

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-chen26a,
  title = {From Zero-Shot to Bedside: A Practical Playbook for Adapting Open-Source Large Language Models to Clinical Symptom Extraction},
  author = {Chen, Li-Ching and Zack, Travis and Mandair, Divneet and Mahadevan, Aditya and Suresh, Arvind and Ishiyama, Yuta and Li, Yiping and Hong, Julian C. and Butte, Atul J.},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages = {1023--1046},
  year = {2026},
  editor = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume = {297},
  series = {Proceedings of Machine Learning Research},
  month = {13--14 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/chen26a/chen26a.pdf},
  url = {https://proceedings.mlr.press/v297/chen26a.html},
  abstract = {Large language models ({LLM}s) are increasingly applied to clinical notes, but guidance on how to adapt open-source models to specific tasks and manage annotation quality at scale is limited. We present a playbook for fine-tuning {LLM}s on de-identified clinical notes from patients with pancreatic cancer, spanning both pre-diagnosis and on-treatment settings. We evaluate prompting strategies, contrast open-source models with {GPT}-4o, and explore disease-level versus task-specific adaptation. A key contribution is an {LLM}-assisted adjudication workflow in which models flag notes where predictions consistently conflict with initial human labels. This approach concentrated expert review on a small fraction of cases while identifying many true annotation errors, ultimately improving downstream model performance. We further examine the use of machine-generated annotations to augment limited expert labels, showing that balanced mixtures of synthetic and human data can enhance fine-tuned models. Our findings provide practical guidance for deploying open-source {LLM}s in clinical contexts, offering strategies to improve accuracy, reduce annotation burden, and enable privacy-preserving, site-adapted clinical natural language processing ({NLP}).}
}
Endnote
%0 Conference Paper
%T From Zero-Shot to Bedside: A Practical Playbook for Adapting Open-Source Large Language Models to Clinical Symptom Extraction
%A Li-Ching Chen
%A Travis Zack
%A Divneet Mandair
%A Aditya Mahadevan
%A Arvind Suresh
%A Yuta Ishiyama
%A Yiping Li
%A Julian C. Hong
%A Atul J. Butte
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-chen26a
%I PMLR
%P 1023--1046
%U https://proceedings.mlr.press/v297/chen26a.html
%V 297
%X Large language models (LLMs) are increasingly applied to clinical notes, but guidance on how to adapt open-source models to specific tasks and manage annotation quality at scale is limited. We present a playbook for fine-tuning LLMs on de-identified clinical notes from patients with pancreatic cancer, spanning both pre-diagnosis and on-treatment settings. We evaluate prompting strategies, contrast open-source models with GPT-4o, and explore disease-level versus task-specific adaptation. A key contribution is an LLM-assisted adjudication workflow in which models flag notes where predictions consistently conflict with initial human labels. This approach concentrated expert review on a small fraction of cases while identifying many true annotation errors, ultimately improving downstream model performance. We further examine the use of machine-generated annotations to augment limited expert labels, showing that balanced mixtures of synthetic and human data can enhance fine-tuned models. Our findings provide practical guidance for deploying open-source LLMs in clinical contexts, offering strategies to improve accuracy, reduce annotation burden, and enable privacy-preserving, site-adapted clinical natural language processing (NLP).
APA
Chen, L., Zack, T., Mandair, D., Mahadevan, A., Suresh, A., Ishiyama, Y., Li, Y., Hong, J.C. & Butte, A.J. (2026). From Zero-Shot to Bedside: A Practical Playbook for Adapting Open-Source Large Language Models to Clinical Symptom Extraction. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1023-1046. Available from https://proceedings.mlr.press/v297/chen26a.html.