Cached Summary Embeddings for Memory-Efficient EHR Inference
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:778-793, 2026.
Abstract
Transformer-based clinical prediction models face a deployment challenge: processing long patient histories can require more memory than is available in resource-constrained settings. We propose a deployment architecture that separates expensive historical encoding from lightweight inference. In an offline preprocessing phase, a clinical language model compresses each patient’s historical events into a fixed-size vector (768 dimensions, 5 KB per patient). At inference time, the prediction model processes only a short window of recent events, conditioned on the cached summary. Through 252 experiments on a 24-hour in-ICU mortality cohort from MIMIC-IV, we characterize when this architecture provides value. The benefit of cached summaries decays as the recent context window grows: a 6.5% relative AUROC improvement at $N = 8$ recent events ($p < 0.001$) shrinks to a negligible 0.1% at $N = 256$ (not statistically significant). We also find that Feature-wise Linear Modulation (FiLM) outperforms token injection for integrating summaries ($p < 0.001$). Our results provide deployment guidance: when hardware constraints limit the recent context to 32 events or fewer, cached summaries recover meaningful predictive signal; when longer sequences are feasible, the caching overhead is not justified.
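The abstract does not spell out how the cached summary conditions the prediction model, so the following is only a minimal NumPy sketch of generic FiLM conditioning, not the authors' implementation. It assumes a single linear projection from the 768-dimensional summary to a per-feature scale (gamma) and shift (beta), applied to the hidden states of the N recent events. The function name `film_condition`, the hidden width of 64, and the weight initialization are all hypothetical choices made for illustration.

```python
import numpy as np


def film_condition(recent_hidden, summary, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation of recent-event states by a cached summary.

    recent_hidden: (N, d_model) hidden states for the N recent events
    summary:       (d_summary,) cached patient-history embedding
    w_gamma/w_beta: (d_summary, d_model) projection weights (hypothetical)
    b_gamma/b_beta: (d_model,) projection biases (hypothetical)
    """
    gamma = summary @ w_gamma + b_gamma  # per-feature scale, shape (d_model,)
    beta = summary @ w_beta + b_beta     # per-feature shift, shape (d_model,)
    # gamma and beta broadcast across the N recent events.
    return gamma * recent_hidden + beta


rng = np.random.default_rng(0)
d_summary, d_model, n_recent = 768, 64, 8  # d_model = 64 is an assumption

summary = rng.standard_normal(d_summary)
recent_hidden = rng.standard_normal((n_recent, d_model))

# With zero weights, gamma = 1 and beta = 0, so FiLM is the identity map.
w_zero = np.zeros((d_summary, d_model))
identity_out = film_condition(
    recent_hidden, summary, w_zero, np.ones(d_model), w_zero, np.zeros(d_model)
)

# With small random weights, the summary modulates every recent-event state.
w_gamma = 0.01 * rng.standard_normal((d_summary, d_model))
w_beta = 0.01 * rng.standard_normal((d_summary, d_model))
conditioned = film_condition(
    recent_hidden, summary, w_gamma, np.ones(d_model), w_beta, np.zeros(d_model)
)
```

In this sketch the sequence length stays at N, whereas token injection (the alternative named in the abstract) would typically add the summary as an extra input token. Whether the paper's variants are structured this way is not stated in the abstract.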