RIS: Region-to-Image Search using ViT-like Embeddings
Proceedings of the Fourth Swiss AI Days, PMLR 309:56-66, 2026.
Abstract
We propose RIS (Region-to-Image Search), a two-stage framework for localized visual retrieval. RIS performs structural re-ranking directly within the latent embedding space of Vision Transformers, such as SigLIP2 and I-JEPA, bypassing traditional pixel-level verification. By matching a query Region of Interest (ROI) through a spatially consistent region-growing algorithm, the framework ensures geometric coherence across latent representations. Preliminary qualitative results demonstrate that this embedding-based re-ranking improves Top-5 retrieval accuracy by at least 10% over standalone global methods, providing a robust and efficient mechanism for localized forensic search.
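The abstract does not give implementation details, so the following is only a minimal sketch of how a two-stage pipeline of this kind might look, not the authors' method. It assumes patch embeddings are already extracted as a grid (for example from a ViT backbone), that the ROI appears in the candidate under a pure translation on the patch grid, and that a fixed cosine threshold decides whether a patch matches. The function names (`region_growing_score`, `retrieve`), the `threshold` and `shortlist` parameters, and the choice of the ROI center as the seed are all hypothetical.

```python
from collections import deque

import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def region_growing_score(query_patches, cand_patches, threshold=0.8):
    """Fraction of ROI patches matched by growing from a seed match.

    query_patches: (h, w, d) patch embeddings of the query ROI.
    cand_patches:  (H, W, d) patch embeddings of a candidate image.
    Assumption: the ROI is only translated on the patch grid.
    """
    q = l2_normalize(query_patches)
    c = l2_normalize(cand_patches)
    h, w, _ = q.shape
    H, W, _ = c.shape

    # Seed: the ROI's central patch and its best match in the candidate.
    sy, sx = h // 2, w // 2
    sims = c @ q[sy, sx]  # (H, W) cosine similarities
    cy, cx = np.unravel_index(np.argmax(sims), sims.shape)
    if sims[cy, cx] < threshold:
        return 0.0
    dy, dx = cy - sy, cx - sx  # translation fixed by the seed match

    visited = {(sy, sx)}
    queue = deque([(sy, sx)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (ny, nx) in visited or not (0 <= ny < h and 0 <= nx < w):
                continue
            ty, tx = ny + dy, nx + dx  # spatially consistent target patch
            if not (0 <= ty < H and 0 <= tx < W):
                continue
            if float(q[ny, nx] @ c[ty, tx]) >= threshold:
                visited.add((ny, nx))
                queue.append((ny, nx))
    return len(visited) / (h * w)


def retrieve(query_global, query_patches, db_globals, db_patches,
             shortlist=10, threshold=0.8):
    """Stage 1: global cosine shortlist. Stage 2: re-rank by region growing."""
    g = l2_normalize(db_globals) @ l2_normalize(query_global)
    top = np.argsort(-g)[:shortlist]
    scored = [
        (region_growing_score(query_patches, db_patches[i], threshold),
         float(g[i]), int(i))
        for i in top
    ]
    # Sort by local score first; the global similarity breaks ties.
    scored.sort(reverse=True)
    return [i for _, _, i in scored]
```

Here `retrieve` returns database indices ordered by the local score, so a candidate that contains the ROI can overtake one that merely has a higher global similarity. A real system would likely need to handle scale and rotation changes, which this translation-only sketch ignores.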