Cached Summary Embeddings for Memory-Efficient EHR Inference

Rafi Al Attrach; Rajna Fani; David Restrepo; Yugang Jia; Leo Anthony Celi; Peter Schuffler

Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:778-793, 2026.

Abstract

Transformer-based clinical prediction models face a deployment challenge: processing long patient histories can require memory that exceeds available resources in resource-constrained settings. We propose a deployment architecture that separates expensive historical encoding from lightweight inference. In an offline preprocessing phase, a clinical language model compresses each patient’s historical events into a fixed-size vector (768 dimensions, 5 KB per patient). At inference time, the prediction model processes only a short window of recent events, conditioned on the cached summary. Through 252 experiments on a 24-hour in-ICU mortality cohort from MIMIC-IV, we characterize when this architecture provides value. The benefit of cached summaries decays as the recent context window grows: a 6.5% relative AUROC improvement at $N$=8 recent events ($p < 0.001$) shrinks to a negligible 0.1% at $N$=256 (not statistically significant). We find that Feature-wise Linear Modulation (FiLM) outperforms token injection for integrating summaries ($p < 0.001$). Our results provide deployment guidance: when hardware constraints limit the recent context to 32 events or fewer, cached summaries recover meaningful predictive signal; when longer sequences are feasible, the caching overhead is not justified.
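
The conditioning step named in the abstract can be illustrated with a minimal NumPy sketch of Feature-wise Linear Modulation: the cached summary vector is projected to a per-feature scale (gamma) and shift (beta), which are applied to the hidden states of the recent events. Only the 768-dimensional summary and the window of N=8 recent events come from the abstract. The hidden size of 64, the single linear projections, the random inputs, and the name `film_condition` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def film_condition(recent_states, summary, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift every hidden state using parameters derived from the summary.

    recent_states: (n_recent, hidden_dim) hidden states of the recent events
    summary:       (summary_dim,) cached per-patient summary vector
    """
    gamma = summary @ w_gamma + b_gamma  # per-feature scale, shape (hidden_dim,)
    beta = summary @ w_beta + b_beta     # per-feature shift, shape (hidden_dim,)
    return gamma * recent_states + beta  # broadcasts over the n_recent events


rng = np.random.default_rng(0)
summary_dim, hidden_dim, n_recent = 768, 64, 8  # hidden_dim is an assumed value

# Random stand-ins for the cached summary and the encoded recent events.
summary = rng.standard_normal(summary_dim)
recent_states = rng.standard_normal((n_recent, hidden_dim))

# Hypothetical projection weights; in a trained model these would be learned.
w_gamma = rng.standard_normal((summary_dim, hidden_dim)) * 0.01
b_gamma = np.ones(hidden_dim)
w_beta = rng.standard_normal((summary_dim, hidden_dim)) * 0.01
b_beta = np.zeros(hidden_dim)

conditioned = film_condition(recent_states, summary, w_gamma, b_gamma, w_beta, b_beta)
print(conditioned.shape)
```

In this sketch the sequence length seen at inference time stays at N regardless of how long the patient history is, since the history enters only through the fixed-size summary vector.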

Cite this Paper

BibTeX

@InProceedings{pmlr-v333-al-attrach26a,
  title = {Cached Summary Embeddings for Memory-Efficient EHR Inference},
  author = {Al Attrach, Rafi and Fani, Rajna and Restrepo, David and Jia, Yugang and Celi, Leo Anthony and Schuffler, Peter},
  booktitle = {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = {778--793},
  year = {2026},
  editor = {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = {333},
  series = {Proceedings of Machine Learning Research},
  month = {29--30 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v333/main/assets/al-attrach26a/al-attrach26a.pdf},
  url = {https://proceedings.mlr.press/v333/al-attrach26a.html},
  abstract = {Transformer-based clinical prediction models face a deployment challenge: processing long patient histories can require memory that exceeds available resources in resource-constrained settings. We propose a deployment architecture that separates expensive historical encoding from lightweight inference. In an offline preprocessing phase, a clinical language model compresses each patient’s historical events into a fixed-size vector (768 dimensions, 5 KB per patient). At inference time, the prediction model processes only a short window of recent events, conditioned on the cached summary. Through 252 experiments on a 24-hour in-ICU mortality cohort from MIMIC-IV, we characterize when this architecture provides value. The benefit of cached summaries decays as the recent context window grows: a 6.5% relative AUROC improvement at $N$=8 recent events ($p < 0.001$) shrinks to a negligible 0.1% at $N$=256 (not statistically significant). We find that Feature-wise Linear Modulation (FiLM) outperforms token injection for integrating summaries ($p < 0.001$). Our results provide deployment guidance: when hardware constraints limit the recent context to 32 events or fewer, cached summaries recover meaningful predictive signal; when longer sequences are feasible, the caching overhead is not justified.}
}

Endnote

%0 Conference Paper
%T Cached Summary Embeddings for Memory-Efficient EHR Inference
%A Rafi Al Attrach
%A Rajna Fani
%A David Restrepo
%A Yugang Jia
%A Leo Anthony Celi
%A Peter Schuffler
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-al-attrach26a
%I PMLR
%P 778--793
%U https://proceedings.mlr.press/v333/al-attrach26a.html
%V 333
%X Transformer-based clinical prediction models face a deployment challenge: processing long patient histories can require memory that exceeds available resources in resource-constrained settings. We propose a deployment architecture that separates expensive historical encoding from lightweight inference. In an offline preprocessing phase, a clinical language model compresses each patient’s historical events into a fixed-size vector (768 dimensions, 5 KB per patient). At inference time, the prediction model processes only a short window of recent events, conditioned on the cached summary. Through 252 experiments on a 24-hour in-ICU mortality cohort from MIMIC-IV, we characterize when this architecture provides value. The benefit of cached summaries decays as the recent context window grows: a 6.5% relative AUROC improvement at $N$=8 recent events ($p < 0.001$) shrinks to a negligible 0.1% at $N$=256 (not statistically significant). We find that Feature-wise Linear Modulation (FiLM) outperforms token injection for integrating summaries ($p < 0.001$). Our results provide deployment guidance: when hardware constraints limit the recent context to 32 events or fewer, cached summaries recover meaningful predictive signal; when longer sequences are feasible, the caching overhead is not justified.

APA

Al Attrach, R., Fani, R., Restrepo, D., Jia, Y., Celi, L. A., & Schuffler, P. (2026). Cached Summary Embeddings for Memory-Efficient EHR Inference. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:778-793. Available from https://proceedings.mlr.press/v333/al-attrach26a.html.
