HiPro-CT: A Hierarchical Probabilistic Framework for 3D Medical Vision-Language Alignment
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4010-4025, 2026.
Abstract
The adaptation of vision-language models (VLMs) to 3D medical imaging is currently impeded by two fundamental bottlenecks: the dilution of local features caused by the granularity mismatch between volumetric data and textual reports, and the inability of deterministic embeddings to capture the inherent semantic uncertainty of clinical descriptions. To address these challenges, we propose HiPro-CT, a novel hierarchical probabilistic framework for 3D medical vision-language alignment. Unlike traditional point-based approaches, HiPro-CT maps images and texts into Gaussian probability distributions, utilizing variance to explicitly quantify uncertainty and enhance robustness against incompleteness and polysemy. We introduce a soft masked pooling strategy that performs weighted feature aggregation guided by anatomical masks, enabling precise organ-level alignment while preserving boundary context. Furthermore, we devise a hierarchical inclusion loss to enforce geometric constraints within the embedding space, ensuring that the deterministic global representations are geometrically grounded within the strictly more uncertain local distributions. Extensive experiments demonstrate that HiPro-CT significantly outperforms state-of-the-art deterministic baselines in zero-shot multi-abnormality detection and cross-modal retrieval, validating the efficacy of integrating fine-grained anatomical supervision with probabilistic representation learning.
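As a rough illustration of the soft masked pooling idea described in the abstract, the sketch below aggregates per-voxel feature vectors using soft anatomical mask weights in [0, 1], so that boundary voxels contribute partially rather than being cut off. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `soft_masked_pool` and `soft_masked_variance`, the flat list layout of voxels, and the `eps` stabiliser are all hypothetical. The weighted variance is only a moment-based stand-in for a diagonal Gaussian; the abstract does not state how HiPro-CT derives its variances, and the hierarchical inclusion loss is not reproduced here because its form is not given.

```python
def soft_masked_pool(features, mask, eps=1e-8):
    """Weighted mean of per-voxel feature vectors.

    features: list of equal-length feature vectors, one per voxel (assumed layout).
    mask:     list of soft weights in [0, 1], one per voxel (assumed layout).
    """
    total = sum(mask)
    dim = len(features[0])
    pooled = [0.0] * dim
    for feat, weight in zip(features, mask):
        for d in range(dim):
            pooled[d] += weight * feat[d]
    # eps guards against an all-zero mask (hypothetical stabiliser)
    return [p / (total + eps) for p in pooled]


def soft_masked_variance(features, mask, mean, eps=1e-8):
    """Weighted per-dimension variance around `mean` (illustrative stand-in only)."""
    total = sum(mask)
    dim = len(mean)
    var = [0.0] * dim
    for feat, weight in zip(features, mask):
        for d in range(dim):
            var[d] += weight * (feat[d] - mean[d]) ** 2
    return [v / (total + eps) for v in var]


# Toy example: four voxels with 2-D features; the third voxel lies on the
# organ boundary (weight 0.5) and the fourth lies outside the organ (weight 0).
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1.0, 1.0, 0.5, 0.0]

mu = soft_masked_pool(features, mask)
var = soft_masked_variance(features, mask, mu)
print("organ-level mean:", mu)
print("organ-level variance:", var)
```

In this toy setup the out-of-organ voxel has no influence on the pooled statistics, while the boundary voxel contributes at half weight, which is the behaviour a soft mask is meant to provide compared with a hard binary mask.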