Clustering Items From Adaptively Collected Inconsistent Feedback

Shubham Gupta, Peter W J Staar, Christian de Sainte Marie
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:604-612, 2024.

Abstract

We study clustering in a query-based model where the learner can repeatedly query an oracle to determine if two items belong to the same cluster. However, these queries are costly and the oracle’s responses are marred by inconsistency and noise. The learner’s goal is to adaptively make a small number of queries and return the correct clusters for \emph{all} $n$ items with high confidence. We develop efficient algorithms for this problem using the sequential hypothesis testing framework. We derive high probability upper bounds on their sample complexity (the number of queries they make) and complement this analysis with an information-theoretic lower bound. In particular, we show that our algorithm for two clusters is nearly optimal when the oracle’s error probability is a constant. Our experiments verify these findings and highlight a few shortcomings of our algorithms. Namely, we show that their sample complexity deviates from the lower bound when the error probability of the oracle depends on $n$. We suggest an improvement based on a more efficient sequential hypothesis test and demonstrate it empirically.

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-gupta24a, title = {Clustering Items From Adaptively Collected Inconsistent Feedback}, author = {Gupta, Shubham and W J Staar, Peter and de Sainte Marie, Christian}, booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics}, pages = {604--612}, year = {2024}, editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen}, volume = {238}, series = {Proceedings of Machine Learning Research}, month = {02--04 May}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v238/gupta24a/gupta24a.pdf}, url = {https://proceedings.mlr.press/v238/gupta24a.html}, abstract = {We study clustering in a query-based model where the learner can repeatedly query an oracle to determine if two items belong to the same cluster. However, these queries are costly and the oracle’s responses are marred by inconsistency and noise. The learner’s goal is to adaptively make a small number of queries and return the correct clusters for \emph{all} $n$ items with high confidence. We develop efficient algorithms for this problem using the sequential hypothesis testing framework. We derive high probability upper bounds on their sample complexity (the number of queries they make) and complement this analysis with an information-theoretic lower bound. In particular, we show that our algorithm for two clusters is nearly optimal when the oracle’s error probability is a constant. Our experiments verify these findings and highlight a few shortcomings of our algorithms. Namely, we show that their sample complexity deviates from the lower bound when the error probability of the oracle depends on $n$. We suggest an improvement based on a more efficient sequential hypothesis test and demonstrate it empirically.} }
Endnote
%0 Conference Paper %T Clustering Items From Adaptively Collected Inconsistent Feedback %A Shubham Gupta %A Peter W J Staar %A Christian de Sainte Marie %B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2024 %E Sanjoy Dasgupta %E Stephan Mandt %E Yingzhen Li %F pmlr-v238-gupta24a %I PMLR %P 604--612 %U https://proceedings.mlr.press/v238/gupta24a.html %V 238 %X We study clustering in a query-based model where the learner can repeatedly query an oracle to determine if two items belong to the same cluster. However, these queries are costly and the oracle’s responses are marred by inconsistency and noise. The learner’s goal is to adaptively make a small number of queries and return the correct clusters for \emph{all} $n$ items with high confidence. We develop efficient algorithms for this problem using the sequential hypothesis testing framework. We derive high probability upper bounds on their sample complexity (the number of queries they make) and complement this analysis with an information-theoretic lower bound. In particular, we show that our algorithm for two clusters is nearly optimal when the oracle’s error probability is a constant. Our experiments verify these findings and highlight a few shortcomings of our algorithms. Namely, we show that their sample complexity deviates from the lower bound when the error probability of the oracle depends on $n$. We suggest an improvement based on a more efficient sequential hypothesis test and demonstrate it empirically.
APA
Gupta, S., W J Staar, P. & de Sainte Marie, C.. (2024). Clustering Items From Adaptively Collected Inconsistent Feedback. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:604-612 Available from https://proceedings.mlr.press/v238/gupta24a.html.

Related Material