Visual Medical Entity Linking with VELCRO
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1126-1140, 2026.
Abstract
We study a visual entity linking ({VEL}) problem in which a user selects a region of interest ({RoI}) in an image (e.g., a brain tumour) and queries a textual knowledge base ({KB}) for information about the {RoI}. To solve this problem using cross-modal embeddings such as {CLIP}, we can encode the {KB} entries, then either encode the whole image or just the cropped {RoI}, and run a similarity search between the query and the {KB} embeddings. However, using the entire image as the query may retrieve {KB} entries related to other aspects of the image beyond the {RoI}, whereas using the {RoI} alone as the query ignores context, which is critical for recognizing and linking complex entities in medical images. To address these shortcomings, we propose {VELCRO} – visual entity linking with contrastive {RoI} alignment – which adapts an image segmentation model to {VEL} by aligning the contextual embeddings produced by its decoder with the {KB} using contrastive learning. This strategy preserves the information contained in the surrounding image while focusing {KB} alignment on the {RoI}. Experiments on medical {VEL} show that {VELCRO} achieves 95.3% linking accuracy compared to 83.9% or lower for baselines.