Proceedings of Machine Learning Research

Open-Ended Clinical Text Generation for Acute Care: Applying Reinforcement Learning with Clinically Grounded Rewards

Mon, 29 Jun 2026 00:00:00 +0000

Acute care clinicians generate critical clinical text—diagnoses, treatment plans, discharge instructions—under time pressure where errors can be life-threatening. Large proprietary AI models raise privacy concerns, while smaller models lack clinical quality. We extend reinforcement learning with verifiable rewards (RLVR) to open-ended clinical text generation using two generalizable reward patterns: equivalence-based rewards for medical synonymy and diagnosis matching, as well as rubric-based rewards for multi-dimensional quality assessment. Using group relative policy optimization, we trained compact 7–8 billion parameter models on diagnosis generation (MIMIC-III), discharge instructions (DischargeMe), and treatment planning (MTSamples). Trained models achieve clinical quality across tasks (best results: F1 0.48, 4.28/5.0, 4.47/5.0 respectively), matching or surpassing the performance of large proprietary GPT-based models, while enabling on-premise deployment, sub-second inference, and full privacy. Physician review confirmed superior content comprehensiveness and fewer dangerous errors versus base models. This demonstrates a practical pathway for deploying clinical text generation in acute care with generalizable reward design patterns.

Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

Mon, 29 Jun 2026 00:00:00 +0000

Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the reported strong average-case performance of these models does not necessarily reflect their actual utility when used in heterogeneous clinical settings, potentially masking weaker performance in medically significant scenarios. In this work, we use clinical context to provide a more holistic evaluation of models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a “pre-CXR” probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions. First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation is broken between prior context and the current CXR label, suggesting that model performance is highly sensitive to the underlying distribution of clinical context. Specifically, cases with high pre-test probabilities present a fundamentally more difficult visual classification task, highlighting a gap in clinical utility when models are applied to high-risk cohorts.

ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Mon, 29 Jun 2026 00:00:00 +0000

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

Mon, 29 Jun 2026 00:00:00 +0000

Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
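The confidence-based routing described above can be sketched as follows. This is a minimal illustration, not the TRIAGE implementation: the toy tier functions, the thresholds, and the use of the top-class probability as the confidence signal are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# A tier maps an input to class probabilities. These are hypothetical
# stand-ins for the cheap, medium, and expensive reasoning stages.
Tier = Callable[[object], Sequence[float]]


def route(sample: object, tiers: List[Tier], thresholds: List[float]) -> Tuple[int, int]:
    """Return (predicted_class, index_of_tier_that_answered).

    A tier's prediction is accepted when its top probability reaches that
    tier's threshold; otherwise the sample escalates to the next tier.
    The last tier always answers, so len(thresholds) == len(tiers) - 1.
    """
    assert len(thresholds) == len(tiers) - 1
    for i, tier in enumerate(tiers):
        probs = tier(sample)
        best = max(range(len(probs)), key=lambda k: probs[k])
        is_last = i == len(tiers) - 1
        if is_last or probs[best] >= thresholds[i]:
            return best, i
    raise RuntimeError("unreachable: the last tier always answers")


# Toy tiers (assumed behaviour, for illustration only).
def tier_l(x):
    return [0.9, 0.1] if x == "easy" else [0.55, 0.45]


def tier_m(x):
    return [0.3, 0.7] if x == "medium" else [0.5, 0.5]


def tier_h(x):
    return [0.2, 0.8]


TIERS = [tier_l, tier_m, tier_h]
THRESHOLDS = [0.8, 0.65]  # assumed exit thresholds for the first two tiers
```

Calling `route("easy", TIERS, THRESHOLDS)` exits at the first tier, while an ambiguous input falls through to the last one, which is the mechanism that lets easy samples avoid the expensive stage.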

DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift

Mon, 29 Jun 2026 00:00:00 +0000

Despite the burgeoning body of work on distribution shifts, provenance shift—where the relationship between data source and label changes at deployment—remains poorly understood and under-addressed. In this paper, we establish a formal connection between provenance shift, counterfactual invariance, and invariant learning to derive a learning objective for robustness. We then introduce DeconDTN-Toolkit, a specialized evaluation and remediation suite designed to simulate provenance shifts of varying degrees while maintaining the training protocol and the infrastructure of existing benchmarks. We reveal the vulnerability of Empirical Risk Minimization under provenance shift, introduce a robust out-of-distribution performance indicator, and conduct a comprehensive evaluation on existing algorithms. Our work provides both the theoretical grounding and the practical tools necessary to characterize the problem of confounding by provenance, and implementations of methods to mitigate it.

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

Mon, 29 Jun 2026 00:00:00 +0000

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.

Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models

Mon, 29 Jun 2026 00:00:00 +0000

Medical vision-language models can give different yes or no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med, which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions ($n=158$), the baseline flip rate is 14.6% and mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving $R^2 \approx 0.997$ on both medical and general text ($n=100$ prompts each, $p<0.001$ for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents mode collapse that occurs with pure consistency training while reducing flip rate from 14.6% to 4.4% ($p=0.002$, two-proportion z-test) and margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced ($n=250$), flip rate drops from 13.6% to 7.8%, mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
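As a rough check of the reported flip-rate comparison, a pooled two-proportion z-test can be sketched in a few lines. The counts 23/158 and 7/158 are assumptions inferred from the rounded rates 14.6% and 4.4%; the abstract does not state the raw counts or which variant of the test was used.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 23 flips before training, 7 after, out of 158 questions each.
z, p = two_proportion_z_test(23, 158, 7, 158)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Under these assumed counts the p-value lands close to the reported 0.002.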

Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Mon, 29 Jun 2026 00:00:00 +0000

Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model’s reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity. This technique especially benefits healthcare institutions with constrained computational capacity that cannot support iterative clinical training and want to adopt emerging general-domain model advances.

Bridging the Reliability Gap: INT8 Quantization Effects on Discrimination and Calibration in Medical Imaging

Mon, 29 Jun 2026 00:00:00 +0000

Deploying medical imaging classifiers often requires reduced-precision inference for practical latency and memory budgets, yet the impact of quantization on discrimination and calibration varies across tasks and architectures. We evaluate three public medical imaging datasets (BrainMRI, ChestXray, SkinCancer) and eight ImageNet-pretrained backbones under FP32, FP16, INT8 post-training quantization (PTQ), and INT8 quantization-aware training (QAT). We report macro one-vs-rest ROC-AUC and AUPRC, calibration metrics (ECE, Brier score), and efficiency metrics (throughput, p50 and p99 batch latency) measured on GPU and CPU. FP16 closely matches FP32 across datasets, while INT8-PTQ can introduce substantial and architecture-dependent degradation and calibration shifts. INT8-QAT largely recovers floating-point behavior while enabling integer inference. These results motivate evaluating accuracy, calibration, and efficiency together when selecting quantization strategies for clinical deployment.
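For readers unfamiliar with the two calibration metrics named above, the following sketch computes a multiclass Brier score and a top-label expected calibration error (ECE). The equal-width binning and the default of 10 bins are common conventions assumed here; the abstract does not specify the binning scheme used.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multiclass Brier score: mean squared error against one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def expected_calibration_error(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Top-label ECE with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = list(p).index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            # Weight each bin's confidence/accuracy gap by its share of samples.
            ece += len(b) / n * abs(acc - avg_conf)
    return ece
```

Comparing these two numbers before and after quantization, alongside ROC-AUC, is one way to see a calibration shift that discrimination metrics alone would miss.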

Survey-Aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review

Mon, 29 Jun 2026 00:00:00 +0000

Machine Learning (ML) models trained on complex health surveys such as the National Health and Nutrition Examination Survey (NHANES) often ignore primary sampling units, stratification variables, and sampling weights. This practice violates the independence assumptions of standard evaluation methods. As a result, estimates become biased, uncertainty is underestimated, and fairness assessments fail to reflect population-level disparities. We propose Survey-aware Machine Learning (SaML), a nine-step guideline that incorporates survey design metadata across the ML lifecycle. Through a scoping review of 16 methodological papers, we summarize existing work on weighted model training, design-based cross-validation, and survey-adjusted performance evaluation. We also identify gaps in hyperparameter tuning and deployment. We provide task-specific guidance that clarifies which steps are required for different analytical objectives. SaML provides a checklist for valid population inference from survey data.
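To illustrate why ignoring sampling weights can bias an evaluation, the sketch below contrasts unweighted accuracy with a survey-weighted point estimate. The function name and toy data are hypothetical, and the sketch covers only the point estimate: valid uncertainty estimates would also need the primary sampling units and strata, which are not modeled here.

```python
from typing import Sequence


def weighted_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int], weights: Sequence[float]
) -> float:
    """Accuracy where each respondent counts in proportion to a sampling weight."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)


# Toy data: the one misclassified respondent represents most of the population.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
weights = [100.0, 100.0, 700.0, 100.0]

unweighted = weighted_accuracy(y_true, y_pred, [1.0] * len(y_true))
weighted = weighted_accuracy(y_true, y_pred, weights)
print(unweighted, weighted)
```

In this toy case the unweighted figure overstates how well the model would perform on the population the survey represents.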

Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

Mon, 29 Jun 2026 00:00:00 +0000

Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized (i.e., timestamped) only after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from i2b2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates of 0.93 for GPT-5 and 0.76 for Llama 3.3 70B Instruct) and strong temporal ordering (concordance of 0.965 for GPT-5 and 0.908 for Llama 3.3 70B Instruct). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.

MAM: Multinomial Attention Masking for Foundation Models on Sparse Single-Cell RNA-seq Data

Mon, 29 Jun 2026 00:00:00 +0000

Single-cell RNA sequencing (scRNA-seq) has transformed biology by enabling the measurement of gene expression across millions of individual cells, revealing cellular heterogeneity that underlies development, disease progression, and treatment response. This has made scRNA-seq a central data modality in modern biology and drug discovery. Recently, transformer-based foundation models (FMs) have shown strong potential for scRNA-seq analysis, but they often rely on random masking during training. Due to the extreme sparsity of scRNA-seq datasets, conventional uniform masking samples genes without considering their biological importance. In this work, we propose Multinomial Attention Masking (MAM), a module that learns which gene positions are most informative to mask at each training step. We define a set of trainable latent vectors that attend over gene embeddings to produce attention maps, from which a multinomial sampler selects the highest-scoring positions for masking. We show that MAM improves FM pretraining performance and consistently outperforms uniform masking on cell-type classification tasks, while adding negligible computational overhead. Our work benefits researchers building FMs for sparse data and those who rely on accurate scRNA-seq analysis to study cell types and disease.

LUNGUAGE: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Mon, 29 Jun 2026 00:00:00 +0000

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation.

Proto4DME: Interpretable Cell Counting via Additive Prototype Density Decomposition and Optimal-Transport Coverage

Mon, 29 Jun 2026 00:00:00 +0000

Cell counting via density map estimation predicts a per-pixel density. Summing the density yields the final count, a common readout in clinical diagnostics and disease monitoring. Yet these models are often hard to audit when errors occur. We present Proto4DME, an interpretable density map estimator with faithful explanations by construction. The predicted density (and thus the count) is an additive, non-negative combination of contributions from learned visual patterns (prototypes). Prior prototype-based counting uses signed aggregation, which permits cancellation. In contrast, Proto4DME provides non-canceling attributions, in which increasing a prototype’s activation can only increase the predicted density. So prototype heatmaps correspond to positive contributions for the count. Proto4DME learns spatial prototype activation maps from backbone features and selects a compact set of prototypes using sparsity-inducing Hard-Concrete gates. To encourage diverse foreground coverage and prevent prototype collapse, we introduce an entropically-regularized optimal-transport coverage objective. It allocates ground-truth density mass across prototypes under capacity constraints and induces competition among prototypes. Across three microscopy benchmarks (MBM, ADI, and DCC), Proto4DME achieves competitive mean absolute error (MAE) while producing compact, auditable explanations that support error analysis and model debugging.

CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation

Mon, 29 Jun 2026 00:00:00 +0000

Off-policy evaluation (OPE) is critical for applying contextual bandit algorithms to high-stakes decision-making settings such as healthcare, where new treatment policies must be evaluated prior to deployment. Unfortunately, OPE techniques are inherently limited by the breadth of the available data, which may not be sufficient to evaluate the performance of a new policy. Recent work attempts to improve dataset coverage by adding expert-annotated counterfactual samples. However, such annotations are often imperfect and can lead to worse estimator performance than using no annotations at all. To better leverage imperfect annotations, we propose a family of OPE estimators grounded in the doubly robust (DR) framework, which combines importance sampling (IS) with a reward model (direct method, DM) for better statistical guarantees. We study three ways of incorporating counterfactual annotations. Under mild assumptions, we prove that using annotations within just the DM component yields the most desirable theoretical results. Experiments on multiple healthcare tasks, including real-world electronic health records (EHR) data, show that this strategy is most robust under misspecified reward models and inaccurate annotations. By addressing the challenges posed by imperfect annotations, this work broadens the applicability of OPE methods and facilitates safer deployment of decision-making policies in healthcare.
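The doubly robust combination of importance sampling and a reward model mentioned above can be sketched as follows. This is the standard DR estimator for contextual bandits, not the annotation-augmented estimators the paper proposes; all function and argument names are hypothetical.

```python
from typing import Callable, Sequence


def doubly_robust_value(
    contexts: Sequence[object],
    actions: Sequence[int],
    rewards: Sequence[float],
    behavior_probs: Sequence[float],
    target_policy: Callable[[object], Sequence[float]],
    reward_model: Callable[[object, int], float],
    n_actions: int,
) -> float:
    """Standard DR estimate of the target policy's value from logged data.

    behavior_probs[i] is the logging policy's probability of the logged action.
    """
    total = 0.0
    for x, a, r, pb in zip(contexts, actions, rewards, behavior_probs):
        pi = target_policy(x)  # action probabilities under the target policy
        # Direct-method term: model-based expected reward under the target policy.
        dm = sum(pi[k] * reward_model(x, k) for k in range(n_actions))
        # Importance-weighted correction for the reward model's error.
        correction = pi[a] / pb * (r - reward_model(x, a))
        total += dm + correction
    return total / len(rewards)
```

With a perfect reward model the correction term vanishes and the estimate equals the direct method; with a reward model that always returns zero it reduces to importance sampling. In this sketch, counterfactual annotations used only within the DM component would enter through the data used to fit `reward_model`.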

An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

Mon, 29 Jun 2026 00:00:00 +0000

As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
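The failure mode described above can be illustrated with a small sketch of uncertainty-based selective prediction. The function name, the deferral rule (keep the least uncertain fraction of cases), and the toy numbers are assumptions for illustration, not the paper's protocol.

```python
from typing import Sequence


def selective_accuracy(
    uncertainty: Sequence[float], correct: Sequence[int], coverage: float
) -> float:
    """Accuracy on the retained cases after deferring the most uncertain ones.

    coverage is the fraction of cases the model keeps (0 < coverage <= 1).
    """
    order = sorted(range(len(correct)), key=lambda i: uncertainty[i])
    keep = max(1, int(coverage * len(order)))
    kept = order[:keep]
    return sum(correct[i] for i in kept) / keep


# Toy miscalibrated model: wrong predictions are the confident ones.
uncertainty = [0.1, 0.2, 0.8, 0.9]
correct = [0, 0, 1, 1]

full = selective_accuracy(uncertainty, correct, coverage=1.0)
half = selective_accuracy(uncertainty, correct, coverage=0.5)
print(full, half)
```

Here deferring the most uncertain half of the cases lowers accuracy on the retained half, because the uncertainty scores rank correct predictions as the ones to defer.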

Generation of Bilingual Synthetic Clinical Notes for Realistic Data Augmentation

Mon, 29 Jun 2026 00:00:00 +0000

Synthetic clinical notes offer a promising solution to data scarcity and privacy constraints in clinical natural language processing. However, existing generation approaches often prioritize semantic accuracy while not adequately reproducing the linguistic and structural (i.e., surface) characteristics of real-world clinical documentation, limiting their utility for downstream clinical tasks. In this study, we propose an expert-informed prompt with feedback-loop generation framework to improve the fidelity of synthetic clinical notes across both semantic and surface-level dimensions. Using individual case safety reports from FAERS, we formulated synthetic note generation as a controlled text generation task conditioned on adverse drug reaction descriptions and clinical narratives. We evaluated the performance of the proposed approach by comparing it with other generation strategies (in-context learning and multi-agent generation) and prompting methods (base and expert-informed) under a unified experimental condition. Generation quality was assessed using embedding-based semantic similarity, surface-level statistical and distributional metrics, and blinded human evaluation. The feedback-loop generation framework achieved superior performance across semantic (mean clinical BERTScore = 0.885) and surface-level distributional metrics (token-level Jensen-Shannon divergence = 0.344), producing synthetic clinical notes that more closely resembled real-world clinical notes than other approaches. Expert-informed prompting further improved semantic fidelity and lexical diversity.
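A token-level Jensen-Shannon divergence of the kind reported above can be sketched as a comparison of two token frequency distributions. The whitespace-style token lists and the base-2 logarithm (which bounds the value between 0 and 1) are assumptions; the abstract does not state the tokenizer or the log base.

```python
import math
from collections import Counter
from typing import Sequence


def js_divergence(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Jensen-Shannon divergence (base 2) between two token distributions."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(ca.values()), sum(cb.values())
    jsd = 0.0
    for tok in set(ca) | set(cb):
        p, q = ca[tok] / na, cb[tok] / nb
        m = 0.5 * (p + q)  # mixture distribution; positive for every token seen
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


real = "pt c/o chest pain x2 days".split()
synthetic = "patient reports chest pain for two days".split()
print(js_divergence(real, synthetic))
```

Lower values mean the synthetic notes use tokens with frequencies closer to those of the real notes.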

H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration

Mon, 29 Jun 2026 00:00:00 +0000

Hospital administration departments handle a wide range of operational tasks and, in large hospitals, process over 10,000 requests per day, driving growing interest in LLM-based automation. However, prior work has focused primarily on patient–physician interactions or isolated administrative subtasks, failing to capture the complexity of real administrative workflows. To address this gap, we propose H-AdminSim, a comprehensive simulation framework that combines realistic data generation with multi-agent–based simulation of hospital administrative workflows. These tasks are quantitatively evaluated using detailed rubrics, enabling systematic comparison of LLMs. Through FHIR integration, H-AdminSim provides a unified and interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing the feasibility and performance of LLM-driven administrative automation.

Evaluating Robustness of LLM-Based Ambient Scribes for SOAP Note Generation

Mon, 29 Jun 2026 00:00:00 +0000

Clinical documentation is a major driver of clinician workload and burnout, motivating the adoption of ambient AI scribes that transcribe clinician-patient conversations into clinical notes. Safe deployment requires both transcript-grounded fidelity and robustness to upstream Automatic Speech Recognition (ASR) noise, properties that traditional ROUGE-like metrics do not capture. We propose a clinically grounded evaluation framework that decomposes notes into atomic, QNOTE-structured facts and applies a two-phase triangulated protocol: (1) align generated facts to clinician-authored gold notes to measure coverage, omission, contradiction, and candidate additions; (2) verify gold-absent generated facts against transcripts to distinguish valid elaborations from unsupported content. Across eight LLM-based note generators, we find that omissions are the primary source of contextual degradation (8.5%–24.0%), while contradictions remain relatively stable (6.2%–7.9%). A large majority of content initially flagged as “added” relative to gold is supported by the transcript (92%), highlighting the importance of transcript verification. Robustness analysis with controlled transcript-level perturbations shows that conversational redundancy often mitigates errors (38.6% recovery), whereas substitution errors (e.g., negation flips, medical homophones) are more likely to propagate when redundancy is absent. These results provide a structured approach for evaluating fidelity and robustness in clinical note generation and suggest practical considerations for safer deployment.

A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

Mon, 29 Jun 2026 00:00:00 +0000

Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the learned representations. Recent EEG foundation models have demonstrated promising transfer capabilities across tasks and datasets, motivating their growing use in neurotechnology and clinical applications. However, these models are typically evaluated under full fine-tuning on well-curated downstream datasets, a setting that does not reflect biomedical domain constraints such as limited labeled data, reduced sensor coverage, or parameter-efficient adaptation. In this work, we propose a multi-dimensional evaluation framework for assessing EEG models under realistic low-resource conditions. Empirical analysis of both supervised EEG models and recent EEG foundation models, including LaBraM, CSBrain, and CBraMod, across 6 different datasets is performed under the proposed multi-dimensional evaluation framework. We find that EEG foundation models consistently provide performance gains on long-context tasks such as sleep stage prediction and mental health state classification. In contrast, for short-window Brain Computer Interface style tasks, supervised models achieve comparable performance despite having substantially fewer parameters. Additional analyses demonstrate that current foundation models provide limited robustness to short-window tasks and channel constrained settings. Together, these findings motivate the use of multi-dimensional evaluation protocols that characterize model behavior under realistic use constraints.

Structured Treatment Modeling in Deep Survival Analysis via Hazard Factorization

Mon, 29 Jun 2026 00:00:00 +0000

Deep learning models trained on electronic health records are increasingly used for clinical risk prediction, yet modeling heterogeneous treatment effects remains challenging. Most approaches treat treatment as an undifferentiated covariate (S-Learner), conflating treatment effects with baseline risk, while training separate models for treated and untreated patients (T-Learner) suffers from treatment imbalance and sparsity. We propose a structured hazard factorization that decomposes the hazard into a shared baseline component and a treatment-specific hazard ratio network, enabling direct estimation of time-varying, covariate-dependent hazard ratios without post-hoc computation. By sharing a baseline while isolating treatment effects, the framework acts as a hybrid between S- and T-Learners, improving efficiency and reducing majority-group dominance under imbalance. We further extend the model with differentiable subgroup assignment for regularized treatment effect estimation and inverse propensity weighting to adjust for confounding. In simulations with known ground truth, our approach improves hazard ratio recovery while maintaining competitive survival prediction, and the subgroup extension recovers latent heterogeneity when assumptions hold. On two real-world clinical cohorts from the UK Clinical Practice Research Datalink, the framework produces time-varying hazard ratios and identifies subgroups characterized by established risk factors. Our results demonstrate that explicit hazard factorization provides useful inductive bias for incorporating treatment into deep survival models, bridging flexible neural architectures with hazard ratio estimation familiar to clinical practice.
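One way to write the factorization described above, for a binary treatment, is given below. The notation is an assumption introduced for illustration ($h_0$ for the shared baseline hazard, $g_\theta$ for the log hazard ratio network, $a$ for the treatment indicator); the paper's exact parameterization may differ.

```latex
\begin{align}
h(t \mid x, a) &= h_0(t \mid x)\,\exp\!\big(a \, g_\theta(t, x)\big), \qquad a \in \{0, 1\}, \\
\mathrm{HR}(t \mid x) &= \frac{h(t \mid x, 1)}{h(t \mid x, 0)} = \exp\!\big(g_\theta(t, x)\big).
\end{align}
```

Under this form the hazard ratio is read directly from the treatment-specific network, and it may vary with both time $t$ and covariates $x$, while treated and untreated patients share the same baseline $h_0$.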

Learning Under Extreme Label Imbalance in EHRs: A Dependency-Aware Loss for Multi-Label Classification

Mon, 29 Jun 2026 00:00:00 +0000

Extreme multi-label next-visit diagnosis forecasting from electronic health records is dominated by label sparsity. Each visit contains only a handful of positive ICD-10 codes among thousands of candidates, yet codes are strongly correlated through comorbidity structure. In this regime, standard element-wise objectives (such as focal and class-balanced losses) often maximize sensitivity at the cost of severe precision degradation, producing clinically impractical alert volumes. We propose an architecture-compatible dependency-aware ranking loss that (i) reweights per-code correctness under severe imbalance, (ii) aggregates errors with rank-based emphasis on the hardest labels, and (iii) regularizes predictions with a learned pairwise dependency term in the output space. Using an EHR Transformer backbone, we evaluate on the CPRD cohort ($V{=}1{,}538$ codes), benchmarking loss functions on 200,000 patients and validating scalability up to 3.2 million. The proposed objective shifts the precision–recall trade-off toward fewer false positives while maintaining competitive sensitivity, and preserves overall ranking quality (PRC–AUC comparable to weighted BCE). In addition, it yields an auditable population-level dependency matrix summarizing learned co-occurrence structure. These results suggest that explicit output-space structure can improve the precision–recall trade-off in sparse, high-dimensional next-visit diagnosis prediction from EHRs.

Conference on Health, Inference, and Learning (CHIL) 2026

Mon, 29 Jun 2026 00:00:00 +0000

The Conference on Health, Inference, and Learning (CHIL) focuses on advancing machine learning for health, bringing together clinicians and researchers across both industry and academia. Since 2022, CHIL has been an official conference of the Association for Health Learning and Inference (AHLI). This volume contains the proceedings of the seventh annual CHIL conference, held at the Seattle Children’s Research Institute in the US.

A Multi-Dataset Benchmark of Multiple Instance Learning for 3D Neuroimage Classification

Mon, 29 Jun 2026 00:00:00 +0000

Despite being resource-intensive to train, 3D convolutional neural networks (CNNs) have been the standard approach to classify CT and MRI scans. Recent work suggests that deep multiple instance learning (MIL) may be a more efficient alternative for 3D brain scans, especially when the pre-trained image encoder used to embed each 2D slice is frozen and only the pooling operation and classifier are trained. In this paper, we systematically compare simple MIL, attention-based MIL, 3D CNNs, and 3D ViTs across three CT and four MRI datasets, including two large datasets of at least 10,000 scans. Our goal is to help resource-constrained practitioners understand which neural networks work well for 3D neuroimages and why. We further compare design choices for MIL, including different encoders, pooling operations, and architectural orderings. We find that simple mean pooling MIL, without any learnable attention, matches or outperforms recent MIL or 3D NN alternatives on 4 of 6 moderate-sized tasks. This baseline remains competitive on two large datasets while being 25x faster to train. To explain mean pooling’s success, we examine per-slice attention quality and a semi-synthetic dataset where we can derive the best possible classifier via a Bayes estimator. This analysis reveals the limits of existing MIL approaches and suggests routes for future improvements.
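
The mean pooling baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each 2D slice has already been embedded by a frozen encoder, and the function names `mean_pool` and `predict_proba`, as well as the single linear classifier, are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(slice_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-slice embedding vectors into one scan-level vector."""
    n = len(slice_embeddings)
    if n == 0:
        raise ValueError("a scan needs at least one slice embedding")
    dim = len(slice_embeddings[0])
    return [sum(e[j] for e in slice_embeddings) / n for j in range(dim)]


def predict_proba(
    slice_embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Hypothetical linear classifier applied to the pooled scan vector."""
    pooled = mean_pool(slice_embeddings)
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Because the pooling step has no learnable parameters, only the classifier weights would be trained in this setup, and the pooled vector does not depend on slice order.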

TEMPO: Transformers for Temporal Disease Progression from Cross-Sectional Data

Mon, 29 Jun 2026 00:00:00 +0000

Event-Based Models (EBMs) infer biomarker progression from cross-sectional data, but they typically yield only ordinal sequences and rely on rigid model assumptions. We propose Tempo, a Transformer architecture that learns both ordinal and continuous event sequences through simulation-based supervised learning. Tempo uses two Transformer modules: one treats biomarkers as tokens to infer event sequencing; the other treats patients as tokens, representing each by their per-biomarker abnormality profile, to infer patients’ disease stages. On synthetic benchmarks, Tempo reduces normalized Kendall’s Tau distance by 52.89% and staging MAE by 25.33% compared to state-of-the-art SA-EBM, with larger reductions in high-dimensional settings (58.88% and 61.10%). Applied to ADNI, Tempo recovers a biologically plausible Alzheimer’s progression: early medial temporal atrophy, followed by amyloid accumulation and cognitive decline, and late-stage tau pathology with terminal acceleration of global neurodegeneration, broadly consistent with established disease models. Tempo also eliminates the need to derive custom inference algorithms and enables rapid empirical comparison of generative hypotheses.
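
For readers unfamiliar with the sequencing metric, the sketch below computes a normalized Kendall's Tau distance between two event orderings as the fraction of item pairs ranked in opposite order. It assumes the common normalization by the total number of pairs; the abstract does not state which variant is used, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence


def normalized_kendall_tau_distance(
    order_a: Sequence[Hashable], order_b: Sequence[Hashable]
) -> float:
    """Fraction of item pairs that the two orderings rank in opposite order."""
    if len(order_a) != len(order_b) or set(order_a) != set(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    if n < 2:
        return 0.0
    position_in_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields (x, y) with x before y in order_a;
    # the pair is discordant when x comes after y in order_b.
    discordant = sum(
        1 for x, y in combinations(order_a, 2)
        if position_in_b[x] > position_in_b[y]
    )
    return discordant / (n * (n - 1) / 2)
```

A value of 0 means the inferred ordering matches the reference exactly, and 1 means it is fully reversed.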

ML-Powered Triage and Queue Optimization for Resource-Constrained Free Clinics

Mon, 29 Jun 2026 00:00:00 +0000

Free clinics serve about 1.7 million uninsured Americans annually, yet operate under severe resource constraints that lead to missed urgent cases, inefficient patient flow, and long waits. Our machine learning (ML)-powered triage and queue optimization system is designed as a decision support tool specifically for resource-constrained free clinics. Our system combines a Random Forest (RF) classifier trained on MIMIC-IV-ED data with a multi-objective queue optimization algorithm presented via a staff-facing interface. Our triage model, when simulating free clinic deployment without vital sign equipment, achieves an 83.6% critical case detection rate and no dangerous misses on the test set (0.012% on holdout), optimizing for patient safety over raw accuracy. Monte Carlo simulation across 1,000 clinic sessions demonstrates a 72% reduction in wait times ($p<0.001$) for critical patients compared to first-come-first-served (FCFS) queue ordering. Unlike commercial triage systems that can be prohibitively expensive, our solution is built entirely on free and free-tier tools and designed for volunteer-staffed environments lacking trained intake nurses, vital sign monitors, and electronic health records (EHR). We developed the system as a tablet-optimized web application with real-time queue updates and physician queue overrides. Our work consists entirely of retrospective evaluation and simulation. Prospective studies in free clinic settings, in collaboration with clinicians and domain experts, are planned as a next step.
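
To make the contrast with FCFS concrete, the sketch below orders a waiting room by predicted acuity and breaks ties by arrival order. It is a deliberately simplified stand-in, not the multi-objective algorithm described above: the `Patient` fields, the 1-to-3 acuity scale, and both function names are assumptions for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Patient:
    name: str
    arrival: int  # arrival order, 0 = first to arrive
    acuity: int   # hypothetical scale: 1 = critical, 3 = routine


def fcfs_order(patients: Sequence[Patient]) -> List[str]:
    """First-come-first-served: sort by arrival only."""
    return [p.name for p in sorted(patients, key=lambda p: p.arrival)]


def acuity_order(patients: Sequence[Patient]) -> List[str]:
    """Serve the most urgent patient first; ties go to the earlier arrival."""
    heap = [(p.acuity, p.arrival, p.name) for p in patients]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

With four waiting patients where the second and fourth arrivals are critical, the acuity ordering moves both ahead of earlier routine arrivals, which is the kind of effect a wait-time simulation for critical patients would measure.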

Generating synthetic electronic health record data using agent-based models to evaluate machine learning robustness under mass casualty incidents

Mon, 29 Jun 2026 00:00:00 +0000

Machine learning (ML) models in healthcare are typically evaluated using curated real-world electronic health record (EHR) data. A key limitation of such evaluations is that they may fail to assess the robustness of ML models to changes in the data at deployment, which is a common issue because EHR data used for ML model development cannot capture all such changes. Mass casualty incidents (MCIs) caused by disasters are critical instances where this will be an issue, as they induce rare, uncertain, and novel changes to routine system conditions. Because real-world EHR data from MCIs are often limited or unavailable, assessing ML robustness under such conditions before deployment remains challenging. Here, we propose an agent-based modelling approach for generating synthetic EHR data to evaluate the robustness of ML models under MCI scenarios. We use real-world EHR data to develop and calibrate an agent-based model (ABM) of an emergency department (ED) that explicitly models patient arrivals, resource capacity, and clinical workflow. By changing these system conditions to reflect plausible MCI scenarios, the ED model generates synthetic versions of the real-world EHR data that exhibit shifts in system behaviour. Using these synthetic data, we test ML models for predicting length of stay. We observed consistent declines in recall under MCI conditions relative to baseline system conditions, resulting in an increase in the number of patients with prolonged length of stay that were missed by the ML models. These results highlight the impact of changes in system conditions on patient outcomes, EHR data, and ML model performance. Our work establishes ABM-based synthetic EHR data generation as a proactive and systematic approach for evaluating the robustness of ML models under MCI or other system conditions not captured in real-world EHR data, supporting the safer and more effective deployment of ML models in healthcare systems.

MSnet: A deep neural network based on piecewise-constant proposals within Multi-State event history analysis

Mon, 29 Jun 2026 00:00:00 +0000

Multi-state models are essential to represent realistic disease trajectories in oncology, yet most existing survival and deep-learning approaches either rely on restrictive Markov assumptions or fail to provide subject-specific transition risks. We propose MSnet, a deep learning framework for progressive semi-Markov multi-state processes with right-censoring. MSnet models transition-specific cumulative risks as functions of sojourn time using a multi-task architecture that flexibly integrates high-dimensional clinical and omics data. Experiments on simulated data and two real-world breast cancer cohorts show that MSnet improves predictive performance while yielding clinically interpretable transition dynamics, extending deep learning–based survival analysis to more realistic, patient-centered disease processes.

ASCENT: A Benchmark for Evaluating and Advancing Stepwise Diagnostic Reasoning in Large Language Models on Common Clinical Scenarios

Mon, 29 Jun 2026 00:00:00 +0000

Large language models (LLMs) excel at medical question answering yet are rarely evaluated on the stepwise diagnostic reasoning that defines real clinical workflows, where impressions are revised as information accumulates. We build Annotated Stepwise Clinical rEasoning for NaturalisTic Diagnosis (ASCENT), a clinician-annotated benchmark and training resource of 3,078 stepwise problems derived from MedQA-USMLE that decomposes each vignette into EMR-aligned steps (Findings, Impression, supporting Rationale), enabling evaluation of intermediate reasoning under incomplete information. Experiments and training with ASCENT revealed insights into how current LLMs handle stepwise diagnostic reasoning. Even strong reasoning models that perform well on MedQA-USMLE leave substantial headroom on ASCENT, and general-purpose frontier models trail further, exposing a persistent gap between fully informed and stepwise diagnosis. Fine-tuning Qwen2.5-7B and 32B on ASCENT yields measurable F1 gains over both pre-trained and HuatuoGPT-o1 CoT-trained baselines, with gains driven primarily by precision. Complementary robustness analyses (counterfactual perturbation, format-vs-content control, judge agreement, and rollout) further show that ASCENT-fine-tuned models rely on the diagnostic content of prior impressions rather than imitating their output format, while error propagation under rollout remains a key challenge for clinical deployment.

Does Where You Live Affect How You Feel? Causal Evidence from an Integrated Econometric and Machine Learning Framework

Mon, 29 Jun 2026 00:00:00 +0000

We investigate whether residential relocation causally improves subjective wellbeing by leveraging household relocations in the UK Household Longitudinal Survey as natural experiments. An integrated framework combining a difference-in-differences and synthetic control ensemble with a causal forest model is applied to nearly a decade of panel data. Relocation causes an immediate and sustained improvement of 8% in subjective wellbeing; a change in the built environment type (e.g. suburb to city) adds a further 5%. We demonstrate the complementarity and interoperability of canonical econometric and machine learning methods for causal inference on subjective panel data.

ALMo: Interactive Aim-Limit-Defined, Multi-Objective System for Personalized High-Dose-Rate Brachytherapy Treatment Planning and Visualization for Cervical Cancer

Mon, 29 Jun 2026 00:00:00 +0000

In complex clinical decision-making, clinicians must often track a variety of competing metrics defined by “aim” (ideal) and “limit” (strict) thresholds. Sifting through these high-dimensional tradeoffs to infer the optimal patient-specific strategy is cognitively demanding and historically prone to variability. In this paper, we address this challenge within the context of High-Dose-Rate (HDR) brachytherapy for cervical cancer, where planning requires strictly managing radiation hot spots while balancing tumor coverage against organ sparing. We present ALMo (Aim-Limit-defined Multi-Objective system), an interactive decision support system designed to infer and operationalize clinician intent. ALMo employs a novel optimization framework that minimizes manual input through automated parameter setup and enables flexible control over toxicity risks. Crucially, the system allows clinicians to navigate the Pareto surface of dosimetric tradeoffs by directly manipulating intuitive aim and limit values. In a retrospective evaluation of 25 clinical cases, ALMo generated treatment plans that consistently met or exceeded manual planning quality, with 65% of cases demonstrating dosimetric improvements. Furthermore, the system significantly enhanced efficiency, reducing average planning time to approximately 17 minutes, compared to the conventional 30–60 minutes. While validated in brachytherapy, ALMo demonstrates a generalized framework for streamlining interaction in multi-criteria clinical decision-making.

Enhancing Extubation Failure Prediction with LLM-Derived Features from Respiratory Therapy Clinical Notes

Mon, 29 Jun 2026 00:00:00 +0000

Invasive mechanical ventilation is a lifesaving therapy, but timely, safe discontinuation is essential to preventing extubation failure (EF) and related health risks. We present a novel approach to EF prediction that leverages features classified in free-text respiratory therapy notes using a large language model and logistic regression pipeline. Applied to a patient cohort from University of Washington Medicine, our method identifies clinically meaningful EF-related features that improve EF prediction performance when included alongside structured patient data. We further highlight how differences in target populations in prior EF prediction studies, such as heterogeneous inclusion criteria and EF definitions, can lead to systematic differences in model performance and hinder generalizability between studies.

Video-based Disease Progression Simulation

Mon, 29 Jun 2026 00:00:00 +0000

Modeling disease progression is crucial for improving the quality and efficacy of clinical diagnosis and prognosis, but it is often hindered by a lack of longitudinal medical image monitoring for individual patients. To address this challenge, we propose MedDream, the first video-based disease progression framework that enables controlled manipulation of disease-related image and video features, allowing precise and personalized simulations of disease progression. Our approach begins by recaptioning disease trajectory descriptions. Next, a controllable multi-round diffusion model simulates the disease progression state for each patient, creating realistic intermediate disease state sequences. Finally, a diffusion-based video transition generation model interpolates disease progression between these states. We validate our framework across three medical imaging domains: chest X-ray, fundus photography, and skin images. Our results demonstrate that MedDream significantly outperforms baseline models in generating coherent and clinically plausible disease trajectories. Two user studies with veteran physicians further validate the clinical relevance of the generated sequences. MedDream has the potential to assist healthcare providers in modeling disease trajectories, interpolating missing medical image data, and enhancing medical education through realistic, dynamic visualizations of disease progression.

ALPACA: A Reinforcement Learning Environment for Medication Repurposing and Treatment Optimization in Alzheimer’s Disease

Mon, 29 Jun 2026 00:00:00 +0000

Evaluating personalized, sequential treatment strategies for Alzheimer’s disease (AD) using clinical trials is often impractical due to long disease horizons and substantial inter-patient heterogeneity. To address these constraints, we present the Alzheimer’s Learning Platform for Adaptive Care Agents (ALPACA), an open-source, Gym-compatible reinforcement learning (RL) environment for systematically exploring personalized treatment strategies using existing therapies. ALPACA is powered by the Continuous Action-conditioned State Transitions (CAST) model trained on longitudinal trajectories from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), enabling medication-conditioned simulation of disease progression under alternative treatment decisions. We show that CAST autoregressively generates realistic medication-conditioned trajectories and that RL policies trained in ALPACA outperform no-treatment and behavior-cloned clinician baselines on memory-related outcomes. Interpretability analyses further indicated that the learned policies relied on clinically meaningful patient features when selecting actions. Overall, ALPACA provides a reusable in silico testbed for studying individualized sequential treatment decision-making for AD.

Towards the Anonymization of Masked Language Modeling

Mon, 29 Jun 2026 00:00:00 +0000

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models, fine-tuned and specialized on sensitive data, can memorize and then expose and regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of anonymization of language models, and thus promote their sharing. Specifically, we propose a Masked Language Modeling (MLM) methodology to specialize a BERT-like language model that prevents the model from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluated our approach on several models using a medical dataset and a corpus of legal texts, and compared it to different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masked language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.

LapDDPM: Spectral Perturbation Diffusion for Robust Single-Cell Manifold Generation

Mon, 29 Jun 2026 00:00:00 +0000

Generating high-fidelity and biologically plausible synthetic single-cell RNA sequencing (scRNA-seq) data is a critical challenge in computational biology, driven by the need to model high-dimensional, sparse, and non-linear cellular manifolds. Existing generative models often fail to capture the complex topology of cellular differentiation or lack robustness against technical noise and structural variability. We introduce LapDDPM, a novel conditional Graph Diffusion Probabilistic Model designed for robust manifold learning and high-fidelity generation. LapDDPM integrates graph-based inductive biases with score-based generative modeling, enhanced by a novel spectral adversarial perturbation mechanism. By systematically perturbing graph edge weights along principal spectral modes during training, our method acts as a Distributionally Robust Optimization (DRO) framework, enforcing invariance to structural noise. We further extend LapDDPM to spatial transcriptomics and multi-modal data, treating generation as a robust inverse problem on cellular graphs. Extensive experiments on diverse datasets, including PBMC3K, Dentate Gyrus, HLCA, Visium, and 10x Multiome, demonstrate that LapDDPM significantly outperforms state-of-the-art baselines in distribution matching, manifold preservation, and downstream utility, generating biologically coherent cell states.

SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation

Mon, 29 Jun 2026 00:00:00 +0000

Effective patient-provider communication is difficult to assess at scale. We examine whether large language models (LLMs) can track 20 social behaviors from clinical transcripts without fine-tuning. Across three model families and multiple prompting strategies, LLMs reliably detect social signals, though performance varies by patient race and visit segment. To address this variability under query-only API constraints, we introduce an agreement-weighted ensemble using group-level agreement patterns. This approach improves both accuracy and stability over the best individual model, demonstrating a practical pathway for scalable social signal tracking in clinical conversations.

Toward Improving Diagnostic Reasoning for Spina Bifida Care: Benchmarking LLM–Patient Interactions

Mon, 29 Jun 2026 00:00:00 +0000

Spina Bifida (SB) is a complex neural tube defect that presents multifaceted healthcare challenges requiring multidisciplinary management. While advances in foundation models (FMs) offer promising avenues for enhancing SB care through intelligent, context-aware support, existing models struggle to accurately identify and reason about SB’s diverse symptoms. This study benchmarks eight widely used large language models (LLMs) through qualitative and quantitative evaluations, focusing on their ability to address the unique medical challenges of SB. This study presents an inverse prompting technique aimed at guiding LLMs through a step-by-step diagnostic process. By incorporating a predefined set of symptoms relevant to SB, this approach prevents premature conclusions and enhances diagnostic reasoning, starting to address the Problem of Inclusion-Exclusion (PIE) as formulated in this study. Our evaluations reveal significant limitations in the LLMs’ abilities to accurately diagnose SB-related conditions, underscoring the need for specialized approaches. Building on these findings, this study proposes a novel framework that integrates a structured, symptom-based knowledge base specific to SB, enhancing the models’ contextual understanding and reasoning capabilities. This work highlights the potential of tailored AI solutions in improving access to care for individuals with SB, particularly in populations where gaps in knowledgeable providers persist. By addressing the shortcomings of general-purpose LLMs, our suggested framework aims to streamline SB care and improve patient outcomes, paving the way for more effective AI-assisted healthcare interventions in complex chronic conditions.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Mon, 29 Jun 2026 00:00:00 +0000

Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)–based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.

Cached Summary Embeddings for Memory-Efficient EHR Inference

Mon, 29 Jun 2026 00:00:00 +0000

Transformer-based clinical prediction models face a deployment challenge: processing long patient histories can require memory that exceeds available resources in resource-constrained settings. We propose a deployment architecture that separates expensive historical encoding from lightweight inference. In an offline preprocessing phase, a clinical language model compresses each patient’s historical events into a fixed-size vector (768 dimensions, 5 KB per patient). At inference time, the prediction model processes only a short window of recent events, conditioned on the cached summary. Through 252 experiments on a 24-hour in-ICU mortality cohort from MIMIC-IV, we characterize when this architecture provides value. The benefit of cached summaries decays as the recent context window grows: a 6.5% relative AUROC improvement at $N$=8 recent events ($p < 0.001$) shrinks to a negligible 0.1% at $N$=256 (not statistically significant). We find that Feature-wise Linear Modulation (FiLM) outperforms token injection for integrating summaries ($p < 0.001$). Our results provide deployment guidance: when hardware constraints limit the recent context to 32 events or fewer, cached summaries recover meaningful predictive signal; when longer sequences are feasible, the caching overhead is not justified.
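
Feature-wise Linear Modulation conditions a hidden vector on side information by scaling and shifting each feature. The sketch below shows that operation with a cached summary vector as the conditioning input. It is a minimal illustration rather than the paper's architecture: the single linear projections and the names `linear` and `film` are assumptions.

```python
from typing import List, Sequence


def linear(
    vec: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Dense layer: one output per weight row, computed as row . vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(
    hidden: Sequence[float],
    summary: Sequence[float],
    w_gamma: Sequence[Sequence[float]],
    b_gamma: Sequence[float],
    w_beta: Sequence[Sequence[float]],
    b_beta: Sequence[float],
) -> List[float]:
    """FiLM: per-feature scale (gamma) and shift (beta) predicted from the summary."""
    gamma = linear(summary, w_gamma, b_gamma)
    beta = linear(summary, w_beta, b_beta)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]
```

In contrast to token injection, which would add the summary as an extra element of the input sequence, this form leaves the sequence length unchanged and lets the summary modulate every recent-event representation.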