HiPro-CT: A Hierarchical Probabilistic Framework for 3D Medical Vision-Language Alignment

Lin Lu; Zihan Liu; Chaoxiang Tang; Hui Zhang

Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4010-4025, 2026.

Abstract

The adaptation of vision-language models (VLMs) to 3D medical imaging is currently impeded by two fundamental bottlenecks: the dilution of local features caused by the granularity mismatch between volumetric data and textual reports, and the inability of deterministic embeddings to capture the inherent semantic uncertainty of clinical descriptions. To address these challenges, we propose HiPro-CT, a novel hierarchical probabilistic framework for 3D medical vision-language alignment. Unlike traditional point-based approaches, HiPro-CT maps images and texts into Gaussian probability distributions, utilizing variance to explicitly quantify uncertainty and enhance robustness against incompleteness and polysemy. We introduce a soft masked pooling strategy that performs weighted feature aggregation guided by anatomical masks, enabling precise organ-level alignment while preserving boundary context. Furthermore, we devise a hierarchical inclusion loss to enforce geometric constraints within the embedding space, ensuring that the deterministic global representations are geometrically grounded within the strictly more uncertain local distributions. Extensive experiments demonstrate that HiPro-CT significantly outperforms state-of-the-art deterministic baselines in zero-shot multi-abnormality detection and cross-modal retrieval, validating the efficacy of integrating fine-grained anatomical supervision with probabilistic representation learning.
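The soft masked pooling and Gaussian embedding ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the normalized weighted average, the function names `soft_masked_pool` and `gaussian_head`, the linear heads, and all tensor shapes below are assumptions chosen only for illustration.

```python
import numpy as np


def soft_masked_pool(features, soft_mask, eps=1e-6):
    """Weighted average of voxel features under a soft anatomical mask.

    Assumed form (not taken from the paper): sum(f * m) / sum(m).

    features:  (C, D, H, W) feature volume
    soft_mask: (D, H, W) weights in [0, 1]; fractional values near the
               organ boundary let boundary context contribute partially
    returns:   (C,) pooled feature vector
    """
    weights = soft_mask[None, ...]  # broadcast the mask over channels
    weighted_sum = (features * weights).sum(axis=(1, 2, 3))
    return weighted_sum / (soft_mask.sum() + eps)


def gaussian_head(pooled, w_mu, w_logvar):
    """Hypothetical linear heads producing a diagonal Gaussian (mean, variance)."""
    mu = w_mu @ pooled
    var = np.exp(w_logvar @ pooled)  # exponentiate so the variance stays positive
    return mu, var


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4, 4))

soft_mask = np.zeros((4, 4, 4))
soft_mask[1:3, 1:3, 1:3] = 1.0  # organ interior gets full weight
soft_mask[0, 1:3, 1:3] = 0.5    # softened boundary slice gets partial weight

pooled = soft_masked_pool(features, soft_mask)
mu, var = gaussian_head(
    pooled,
    rng.normal(size=(16, 8)),
    0.1 * rng.normal(size=(16, 8)),
)
print(pooled.shape, mu.shape, var.shape)
```

In this sketch the variance is what would carry the uncertainty the abstract refers to; how the paper parameterizes it, and how the hierarchical inclusion loss uses it, is not specified in the abstract and is not modeled here.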

Cite this Paper

BibTeX

@InProceedings{pmlr-v315-lu26b,
  title = 	 {HiPro-CT: A Hierarchical Probabilistic Framework for 3D Medical Vision-Language Alignment},
  author =       {Lu, Lin and Liu, Zihan and Tang, Chaoxiang and Zhang, Hui},
  booktitle = 	 {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages = 	 {4010--4025},
  year = 	 {2026},
  editor = 	 {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume = 	 {315},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {08--10 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v315/main/assets/lu26b/lu26b.pdf},
  url = 	 {https://proceedings.mlr.press/v315/lu26b.html},
  abstract = 	 {The adaptation of vision-language models (VLMs) to 3D medical imaging is currently impeded by two fundamental bottlenecks: the dilution of local features caused by the granularity mismatch between volumetric data and textual reports, and the inability of deterministic embeddings to capture the inherent semantic uncertainty of clinical descriptions. To address these challenges, we propose HiPro-CT, a novel hierarchical probabilistic framework for 3D medical vision-language alignment. Unlike traditional point-based approaches, HiPro-CT maps images and texts into Gaussian probability distributions, utilizing variance to explicitly quantify uncertainty and enhance robustness against incompleteness and polysemy. We introduce a soft masked pooling strategy that performs weighted feature aggregation guided by anatomical masks, enabling precise organ-level alignment while preserving boundary context. Furthermore, we devise a hierarchical inclusion loss to enforce geometric constraints within the embedding space, ensuring that the deterministic global representations are geometrically grounded within the strictly more uncertain local distributions. Extensive experiments demonstrate that HiPro-CT significantly outperforms state-of-the-art deterministic baselines in zero-shot multi-abnormality detection and cross-modal retrieval, validating the efficacy of integrating fine-grained anatomical supervision with probabilistic representation learning.}
}

Endnote

%0 Conference Paper
%T HiPro-CT: A Hierarchical Probabilistic Framework for 3D Medical Vision-Language Alignment
%A Lin Lu
%A Zihan Liu
%A Chaoxiang Tang
%A Hui Zhang
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-lu26b
%I PMLR
%P 4010--4025
%U https://proceedings.mlr.press/v315/lu26b.html
%V 315
%X The adaptation of vision-language models (VLMs) to 3D medical imaging is currently impeded by two fundamental bottlenecks: the dilution of local features caused by the granularity mismatch between volumetric data and textual reports, and the inability of deterministic embeddings to capture the inherent semantic uncertainty of clinical descriptions. To address these challenges, we propose HiPro-CT, a novel hierarchical probabilistic framework for 3D medical vision-language alignment. Unlike traditional point-based approaches, HiPro-CT maps images and texts into Gaussian probability distributions, utilizing variance to explicitly quantify uncertainty and enhance robustness against incompleteness and polysemy. We introduce a soft masked pooling strategy that performs weighted feature aggregation guided by anatomical masks, enabling precise organ-level alignment while preserving boundary context. Furthermore, we devise a hierarchical inclusion loss to enforce geometric constraints within the embedding space, ensuring that the deterministic global representations are geometrically grounded within the strictly more uncertain local distributions. Extensive experiments demonstrate that HiPro-CT significantly outperforms state-of-the-art deterministic baselines in zero-shot multi-abnormality detection and cross-modal retrieval, validating the efficacy of integrating fine-grained anatomical supervision with probabilistic representation learning.

APA

Lu, L., Liu, Z., Tang, C. & Zhang, H. (2026). HiPro-CT: A Hierarchical Probabilistic Framework for 3D Medical Vision-Language Alignment. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:4010-4025. Available from https://proceedings.mlr.press/v315/lu26b.html.
