Contrastive Patient-level Pretraining Enables Longitudinal and Multimodal Fusion for Lung Cancer Risk Prediction

Thomas Li; Lianrui Zuo; Yihao Liu; Aravind Krishnan; Kim L. Sandler; Thomas A Lasko; Fabien Maldonado; Bennett Allan Landman

Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, PMLR 301:1008-1020, 2026.

Abstract

Leveraging longitudinal and multimodal data is important for clinical predictive tasks. Contrastive language-image pretraining (CLIP) has been successful in learning multimodal representations by aligning paired images and captions, e.g., medical images and their corresponding radiology reports. However, in real clinical settings, the alignment of unpaired modalities, such as medical images and clinical notes collected at different times, is an open challenge, even though such data are ubiquitous in practice. This study conducts contrastive pretraining between longitudinal chest CTs and clinical variables at the patient level using a large public lung cancer screening dataset. Leveraging a time-distanced transformer to encode longitudinal imaging and an open-source text embedding to encode clinical variables, we optimize a contrastive loss between the embedded modalities from the same patient (positive pair) against those from different patients (negative pair). We find that finetuning the CLIP representation significantly improves prediction of lung cancer risk in two types of clinical populations (0.895 and 0.893 AUC) compared to conventional multimodal fusion (0.873 and 0.875 AUC) and single-modality baselines. These results demonstrate how contrastive patient-level pretraining can enable longitudinal and multimodal fusion without additional training data. We release our code and pre-trained weights at https://github.com/MASILab/lung-cplp.
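
The patient-level contrastive objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): it assumes a symmetric InfoNCE-style loss as used in CLIP, replaces the time-distanced transformer and the text embedding with random vectors, and uses a hypothetical function name and an assumed temperature of 0.07.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row i)[i]: row i's positive pair sits in column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def patient_contrastive_loss(image_emb, clinical_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input belongs to patient i.

    Hypothetical helper: the name and the temperature value are assumptions.
    """
    img = [normalize(v) for v in image_emb]
    cli = [normalize(v) for v in clinical_emb]
    # Similarity of every imaging embedding to every clinical embedding.
    logits = [
        [sum(a * b for a, b in zip(u, w)) / temperature for w in cli]
        for u in img
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-clinical and clinical-to-image directions.
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed)
    )


# Stand-in embeddings for 8 patients (random vectors, not real encoder outputs).
random.seed(0)
image_emb = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
matched = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in image_emb]
mismatched = matched[1:] + matched[:1]  # pair each patient with someone else

loss_matched = patient_contrastive_loss(image_emb, matched)
loss_mismatched = patient_contrastive_loss(image_emb, mismatched)
print(loss_matched < loss_mismatched)
```

In this toy setup the loss is lower when each patient's imaging embedding is paired with that patient's own clinical embedding than when the pairing is shifted to a different patient, which is the behavior the pretraining objective rewards.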

Cite this Paper

BibTeX

@InProceedings{pmlr-v301-li26c,
  title =     {Contrastive Patient-level Pretraining Enables Longitudinal and Multimodal Fusion for Lung Cancer Risk Prediction},
  author =    {Li, Thomas and Zuo, Lianrui and Liu, Yihao and Krishnan, Aravind and Sandler, Kim L. and Lasko, Thomas A and Maldonado, Fabien and Landman, Bennett Allan},
  booktitle = {Proceedings of The 8th International Conference on Medical Imaging with Deep Learning},
  pages =     {1008--1020},
  year =      {2026},
  editor =    {Tasdizen, Tolga and Elhabian, Shireen and Summers, Ronald and Chen, Chen and Koch, Lisa and Zhuang, Yan},
  volume =    {301},
  series =    {Proceedings of Machine Learning Research},
  month =     {09--11 Jul},
  publisher = {PMLR},
  pdf =       {https://raw.githubusercontent.com/mlresearch/v301/main/assets/li26c/li26c.pdf},
  url =       {https://proceedings.mlr.press/v301/li26c.html},
  abstract = 	 {Leveraging longitudinal and multimodal data is important for clinical predictive tasks. Contrastive language-image pretraining (CLIP) has been successful in learning multimodal representations by aligning paired images and captions, i.e. medical images and corresponding radiology report. However, in real clinical settings, the alignment of unpaired modalities, such as medical images and clinical notes collected at different times, is an open challenge, even though such data are ubiquitous in practice. This study conducts contrastive pretraining between longitudinal chest CTs and clinical variables on the patient level using a large public lung cancer screening dataset. Leveraging a time-distanced transformer to encode longitudinal imaging and an open-source text embedding to encode clinical variables, we optimize contrastive loss between the embedded modalities from same patient (positive pair) against those from different patients (negative pair). We find that finetuning the CLIP representation significantly improves prediction of lung cancer risk in two types of clinical populations (0.895 and 0.893 AUC) compared to conventional multimodal fusion (0.873 and 0.875 AUC) and single modality baselines. These results demonstrate how contrastive patient-level pretraining can enable longitudinal and multimodal fusion without additional training data. We released our code and pre-trained weights at https://github.com/MASILab/lung-cplp.}
}

Endnote

%0 Conference Paper
%T Contrastive Patient-level Pretraining Enables Longitudinal and Multimodal Fusion for Lung Cancer Risk Prediction
%A Thomas Li
%A Lianrui Zuo
%A Yihao Liu
%A Aravind Krishnan
%A Kim L. Sandler
%A Thomas A Lasko
%A Fabien Maldonado
%A Bennett Allan Landman
%B Proceedings of The 8th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Tolga Tasdizen
%E Shireen Elhabian
%E Ronald Summers
%E Chen Chen
%E Lisa Koch
%E Yan Zhuang
%F pmlr-v301-li26c
%I PMLR
%P 1008--1020
%U https://proceedings.mlr.press/v301/li26c.html
%V 301
%X Leveraging longitudinal and multimodal data is important for clinical predictive tasks. Contrastive language-image pretraining (CLIP) has been successful in learning multimodal representations by aligning paired images and captions, i.e. medical images and corresponding radiology report. However, in real clinical settings, the alignment of unpaired modalities, such as medical images and clinical notes collected at different times, is an open challenge, even though such data are ubiquitous in practice. This study conducts contrastive pretraining between longitudinal chest CTs and clinical variables on the patient level using a large public lung cancer screening dataset. Leveraging a time-distanced transformer to encode longitudinal imaging and an open-source text embedding to encode clinical variables, we optimize contrastive loss between the embedded modalities from same patient (positive pair) against those from different patients (negative pair). We find that finetuning the CLIP representation significantly improves prediction of lung cancer risk in two types of clinical populations (0.895 and 0.893 AUC) compared to conventional multimodal fusion (0.873 and 0.875 AUC) and single modality baselines. These results demonstrate how contrastive patient-level pretraining can enable longitudinal and multimodal fusion without additional training data. We released our code and pre-trained weights at https://github.com/MASILab/lung-cplp.

APA

Li, T., Zuo, L., Liu, Y., Krishnan, A., Sandler, K. L., Lasko, T. A., Maldonado, F., & Landman, B. A. (2026). Contrastive Patient-level Pretraining Enables Longitudinal and Multimodal Fusion for Lung Cancer Risk Prediction. Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 301:1008-1020. Available from https://proceedings.mlr.press/v301/li26c.html.
