RIS: Region-to-Image Search using ViT-like Embeddings

Oussama Zayene, Lucas Genoud, Jean Hennebert, Houda Chabbi Drissi, Benoit de Raemy
Proceedings of the Fourth Swiss AI Days, PMLR 309:56-66, 2026.

Abstract

We propose RIS (Region-to-Image Search), a two-stage framework for localized visual retrieval. RIS performs structural re-ranking directly within the latent embedding space of Vision Transformers, such as SigLIP2 and I-JEPA, bypassing traditional pixel-level verification. By matching a query Region of Interest (ROI) through a spatially-consistent region-growing algorithm, the framework ensures geometric coherence across latent representations. Preliminary qualitative results demonstrate that this embedding-based re-ranking improves Top-5 retrieval accuracy by at least 10% over standalone global methods, providing a robust and efficient mechanism for localized forensic search.
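
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the data layout, similarity measure, threshold, or growth rule, so all of those are assumptions here. The function names (`global_rank`, `region_score`, `ris_search`), the dictionary layout, the use of cosine similarity, the 4-neighbour growth, and the threshold of 0.8 are hypothetical. Stage 1 ranks images by a global embedding; stage 2 re-ranks the shortlist by how much of the query ROI's patch grid can be grown into a candidate's patch grid at one fixed spatial offset.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def global_rank(query_vec, gallery, k):
    """Stage 1 (assumed): shortlist the k images whose global embedding is closest."""
    names = sorted(gallery,
                   key=lambda n: cosine(query_vec, gallery[n]["global"]),
                   reverse=True)
    return names[:k]

def region_score(roi, grid, thresh=0.8):
    """Stage 2 (assumed): fraction of ROI patches matched at one consistent offset.

    roi and grid map (row, col) -> patch embedding. Growth starts from a single
    ROI seed patch, which is a simplification chosen for brevity.
    """
    seed = min(roi)
    best = 0.0
    for anchor, vec in grid.items():
        if cosine(roi[seed], vec) < thresh:
            continue
        # Fixed offset from ROI coordinates to candidate coordinates.
        dr, dc = anchor[0] - seed[0], anchor[1] - seed[1]
        matched, frontier = {seed}, [seed]
        while frontier:
            r, c = frontier.pop()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb not in roi or nb in matched:
                    continue
                target = (nb[0] + dr, nb[1] + dc)
                if target in grid and cosine(roi[nb], grid[target]) >= thresh:
                    matched.add(nb)
                    frontier.append(nb)
        best = max(best, len(matched) / len(roi))
    return best

def ris_search(query_vec, roi, gallery, k=5):
    """Global shortlist followed by region-based re-ranking."""
    shortlist = global_rank(query_vec, gallery, k)
    return sorted(shortlist,
                  key=lambda n: region_score(roi, gallery[n]["patches"]),
                  reverse=True)
```

In this sketch each gallery entry is a dictionary with a `"global"` vector and a `"patches"` grid. A real system would take both from a Vision Transformer encoder; how RIS itself derives and compares them is described in the paper, not here.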

Cite this Paper


BibTeX
@InProceedings{pmlr-v309-zayene26a,
  title     = {RIS: Region-to-Image Search using ViT-like Embeddings},
  author    = {Zayene, Oussama and Genoud, Lucas and Hennebert, Jean and Drissi, Houda Chabbi and de Raemy, Benoit},
  booktitle = {Proceedings of the Fourth Swiss AI Days},
  pages     = {56--66},
  year      = {2026},
  editor    = {Kucharavy, Andrei and Delgado, Pamela and Schürch Todeschini, Valérie and Rumley, Sébastien},
  volume    = {309},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--25 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v309/main/assets/zayene26a/zayene26a.pdf},
  url       = {https://proceedings.mlr.press/v309/zayene26a.html},
  abstract  = {We propose RIS (Region-to-Image Search), a two-stage framework for localized visual retrieval. RIS performs structural re-ranking directly within the latent embedding space of Vision Transformers, such as SigLIP2 and I-JEPA, bypassing traditional pixel-level verification. By matching a query Region of Interest (ROI) through a spatially-consistent region-growing algorithm, the framework ensures geometric coherence across latent representations. Preliminary qualitative results demonstrate that this embedding-based re-ranking improves Top-5 retrieval accuracy by at least 10% over standalone global methods, providing a robust and efficient mechanism for localized forensic search.}
}
Endnote
%0 Conference Paper
%T RIS: Region-to-Image Search using ViT-like Embeddings
%A Oussama Zayene
%A Lucas Genoud
%A Jean Hennebert
%A Houda Chabbi Drissi
%A Benoit de Raemy
%B Proceedings of the Fourth Swiss AI Days
%C Proceedings of Machine Learning Research
%D 2026
%E Andrei Kucharavy
%E Pamela Delgado
%E Valérie Schürch Todeschini
%E Sébastien Rumley
%F pmlr-v309-zayene26a
%I PMLR
%P 56--66
%U https://proceedings.mlr.press/v309/zayene26a.html
%V 309
%X We propose RIS (Region-to-Image Search), a two-stage framework for localized visual retrieval. RIS performs structural re-ranking directly within the latent embedding space of Vision Transformers, such as SigLIP2 and I-JEPA, bypassing traditional pixel-level verification. By matching a query Region of Interest (ROI) through a spatially-consistent region-growing algorithm, the framework ensures geometric coherence across latent representations. Preliminary qualitative results demonstrate that this embedding-based re-ranking improves Top-5 retrieval accuracy by at least 10% over standalone global methods, providing a robust and efficient mechanism for localized forensic search.
APA
Zayene, O., Genoud, L., Hennebert, J., Drissi, H.C. & de Raemy, B. (2026). RIS: Region-to-Image Search using ViT-like Embeddings. Proceedings of the Fourth Swiss AI Days, in Proceedings of Machine Learning Research 309:56-66. Available from https://proceedings.mlr.press/v309/zayene26a.html.
