SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:7196-7213, 2025.

Abstract

Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address this, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval. The source code is available at https://github.com/nickhaidos/scenir-icml2025.
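The abstract proposes Graph Edit Distance as a deterministic ground truth for scene graph similarity. As a toy illustration (not taken from the paper's implementation), the idea can be sketched with `networkx`, treating objects as nodes and labeled relations as edges; the graphs and labels below are invented for the example:

```python
import networkx as nx

# Two toy scene graphs: nodes are detected objects,
# directed labeled edges are relations between them.
g1 = nx.DiGraph()
g1.add_edge("man", "horse", label="riding")
g1.add_edge("horse", "field", label="in")

g2 = nx.DiGraph()
g2.add_edge("man", "horse", label="feeding")
g2.add_edge("horse", "field", label="in")

# GED is the minimum number of node/edge insertions, deletions,
# and substitutions needed to turn one graph into the other.
# Here only the "riding" vs. "feeding" relation differs.
dist = nx.graph_edit_distance(
    g1, g2,
    edge_match=lambda a, b: a["label"] == b["label"],
)
print(dist)  # 1.0 — one edge substitution
```

Because the edit operations and their costs are fixed, the resulting similarity ranking is fully deterministic, unlike caption-based supervision, which varies with the choice of text encoder.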

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chaidos25a,
  title     = {{SCENIR}: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval},
  author    = {Chaidos, Nikolaos and Dimitriou, Angeliki and Lymperaiou, Maria and Stamou, Giorgos},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {7196--7213},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chaidos25a/chaidos25a.pdf},
  url       = {https://proceedings.mlr.press/v267/chaidos25a.html},
  abstract  = {Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address this, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval. The source code is available at https://github.com/nickhaidos/scenir-icml2025.}
}
Endnote
%0 Conference Paper
%T SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
%A Nikolaos Chaidos
%A Angeliki Dimitriou
%A Maria Lymperaiou
%A Giorgos Stamou
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chaidos25a
%I PMLR
%P 7196--7213
%U https://proceedings.mlr.press/v267/chaidos25a.html
%V 267
%X Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address this, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval. The source code is available at https://github.com/nickhaidos/scenir-icml2025.
APA
Chaidos, N., Dimitriou, A., Lymperaiou, M., & Stamou, G. (2025). SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:7196-7213. Available from https://proceedings.mlr.press/v267/chaidos25a.html.