On Fine-Grained Distinct Element Estimation

Ilias Diakonikolas, Daniel Kane, Jasper C.H. Lee, Thanasis Pittas, David Woodruff, Samson Zhou
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:13643-13678, 2025.

Abstract

We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
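To make the setting concrete, below is a minimal, illustrative sketch in Python of the distributed distinct element estimation problem, using a standard k-minimum-values (KMV) baseline rather than the paper's protocol, together with a direct computation of the pairwise-collision parameter $C$ described in the abstract. All names here (`kmv_sketch`, `merge_sketches`, `pairwise_collisions`, the choice `k=64`) are hypothetical and chosen for illustration only; the paper's actual algorithm and its communication guarantees are not reproduced.

```python
# Illustrative baseline only: a standard KMV (k-minimum-values) sketch for
# distributed distinct element estimation. This is NOT the paper's protocol;
# it merely illustrates the setting and the pairwise-collision parameter C.
import hashlib
from collections import Counter


def _hash(x: int) -> float:
    """Map an element of the universe [n] to a pseudo-random value in (0, 1)."""
    h = hashlib.sha256(str(x).encode()).hexdigest()
    return (int(h, 16) % (1 << 53)) / float(1 << 53)


def kmv_sketch(items, k):
    """Each server keeps only the k smallest hash values of its local items."""
    return sorted({_hash(x) for x in items})[:k]


def merge_sketches(sketches, k):
    """Coordinator merges the per-server sketches and estimates F0."""
    merged = sorted(set().union(*map(set, sketches)))[:k]
    if len(merged) < k:                 # fewer than k distinct items overall
        return len(merged)
    return int((k - 1) / merged[-1])    # classic KMV estimator


def pairwise_collisions(server_sets):
    """Number of pairwise collisions C: pairs of servers sharing an element."""
    counts = Counter(x for s in server_sets for x in set(s))
    return sum(c * (c - 1) // 2 for c in counts.values())


# Usage: three servers holding overlapping subsets of the universe.
servers = [{1, 2, 3, 4}, {3, 4, 5, 6}, {6, 7, 8, 9}]
sketches = [kmv_sketch(s, k=64) for s in servers]
print("distinct-count estimate:", merge_sketches(sketches, k=64))  # true answer is 9
print("pairwise collisions C:", pairwise_collisions(servers))      # elements 3, 4, 6 -> C = 3
```

In this toy example the three servers overlap on only three elements, so $C$ is small; the paper's point is that in exactly this regime the worst-case $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ communication bound is overly pessimistic.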

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-diakonikolas25d,
  title     = {On Fine-Grained Distinct Element Estimation},
  author    = {Diakonikolas, Ilias and Kane, Daniel and Lee, Jasper C.H. and Pittas, Thanasis and Woodruff, David and Zhou, Samson},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {13643--13678},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/diakonikolas25d/diakonikolas25d.pdf},
  url       = {https://proceedings.mlr.press/v267/diakonikolas25d.html},
  abstract  = {We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.}
}
Endnote
%0 Conference Paper
%T On Fine-Grained Distinct Element Estimation
%A Ilias Diakonikolas
%A Daniel Kane
%A Jasper C.H. Lee
%A Thanasis Pittas
%A David Woodruff
%A Samson Zhou
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-diakonikolas25d
%I PMLR
%P 13643--13678
%U https://proceedings.mlr.press/v267/diakonikolas25d.html
%V 267
%X We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
APA
Diakonikolas, I., Kane, D., Lee, J.C.H., Pittas, T., Woodruff, D. & Zhou, S. (2025). On Fine-Grained Distinct Element Estimation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:13643-13678. Available from https://proceedings.mlr.press/v267/diakonikolas25d.html.