Graph-Based Algorithms for Diverse Similarity Search

Piyush Anand, Piotr Indyk, Ravishankar Krishnaswamy, Sepideh Mahabadi, Vikas C. Raykar, Kirankumar Shiragur, Haike Xu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:1483-1504, 2025.

Abstract

Nearest neighbor search is a fundamental data structure problem with many applications. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number $r$ of points closest to the query, while in the second stage, the $r$ points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output. In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two stage process. Our algorithms are based on popular graph-based methods, which allows us to “piggy-back” on the existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of $k$ points approximately closest to the query, in time that only depends on $k$ and $\log \Delta$, where $\Delta$ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that using the standard two-stage approach.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-anand25a, title = {Graph-Based Algorithms for Diverse Similarity Search}, author = {Anand, Piyush and Indyk, Piotr and Krishnaswamy, Ravishankar and Mahabadi, Sepideh and Raykar, Vikas C. and Shiragur, Kirankumar and Xu, Haike}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {1483--1504}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/anand25a/anand25a.pdf}, url = {https://proceedings.mlr.press/v267/anand25a.html}, abstract = {Nearest neighbor search is a fundamental data structure problem with many applications. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number $r$ of points closest to the query, while in the second stage, the $r$ points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output. In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two stage process. Our algorithms are based on popular graph-based methods, which allows us to “piggy-back” on the existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of $k$ points approximately closest to the query, in time that only depends on $k$ and $\log \Delta$, where $\Delta$ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that using the standard two-stage approach.} }
Endnote
%0 Conference Paper %T Graph-Based Algorithms for Diverse Similarity Search %A Piyush Anand %A Piotr Indyk %A Ravishankar Krishnaswamy %A Sepideh Mahabadi %A Vikas C. Raykar %A Kirankumar Shiragur %A Haike Xu %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-anand25a %I PMLR %P 1483--1504 %U https://proceedings.mlr.press/v267/anand25a.html %V 267 %X Nearest neighbor search is a fundamental data structure problem with many applications. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number $r$ of points closest to the query, while in the second stage, the $r$ points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output. In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two stage process. Our algorithms are based on popular graph-based methods, which allows us to “piggy-back” on the existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of $k$ points approximately closest to the query, in time that only depends on $k$ and $\log \Delta$, where $\Delta$ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that using the standard two-stage approach.
APA
Anand, P., Indyk, P., Krishnaswamy, R., Mahabadi, S., Raykar, V.C., Shiragur, K. & Xu, H.. (2025). Graph-Based Algorithms for Diverse Similarity Search. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:1483-1504 Available from https://proceedings.mlr.press/v267/anand25a.html.

Related Material