Revisiting DP-Means: fast scalable algorithms via parallelism and delayed cluster creation

Or Dinari, Oren Freifeld
Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR 180:579-588, 2022.

Abstract

DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly-efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means

Cite this Paper


BibTeX
@InProceedings{pmlr-v180-dinari22b, title = {Revisiting DP-Means: fast scalable algorithms via parallelism and delayed cluster creation}, author = {Dinari, Or and Freifeld, Oren}, booktitle = {Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence}, pages = {579--588}, year = {2022}, editor = {Cussens, James and Zhang, Kun}, volume = {180}, series = {Proceedings of Machine Learning Research}, month = {01--05 Aug}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v180/dinari22b/dinari22b.pdf}, url = {https://proceedings.mlr.press/v180/dinari22b.html}, abstract = {DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly-efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means} }
Endnote
%0 Conference Paper %T Revisiting DP-Means: fast scalable algorithms via parallelism and delayed cluster creation %A Or Dinari %A Oren Freifeld %B Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence %C Proceedings of Machine Learning Research %D 2022 %E James Cussens %E Kun Zhang %F pmlr-v180-dinari22b %I PMLR %P 579--588 %U https://proceedings.mlr.press/v180/dinari22b.html %V 180 %X DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly-efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means
APA
Dinari, O. & Freifeld, O.. (2022). Revisiting DP-Means: fast scalable algorithms via parallelism and delayed cluster creation. Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 180:579-588 Available from https://proceedings.mlr.press/v180/dinari22b.html.

Related Material