Massively Parallel Algorithms and Hardness for Single-Linkage Clustering under $\ell_p$ Distances

Grigory Yaroslavtsev, Adithya Vadapalli
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5600-5609, 2018.

Abstract

We present first massively parallel (MPC) algorithms and hardness of approximation results for computing Single-Linkage Clustering of n input d-dimensional vectors under Hamming, $\ell_1, \ell_2$ and $\ell_\infty$ distances. All our algorithms run in O(log n) rounds of MPC for any fixed d and achieve (1+\epsilon)-approximation for all distances (except Hamming for which we show an exact algorithm). We also show constant-factor inapproximability results for o(\log n)-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension depending on the distance used). Efficiency of implementation of our algorithms in Apache Spark is demonstrated through experiments on the largest available vector datasets from the UCI machine learning repository exhibiting speedups of several orders of magnitude.

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-yaroslavtsev18a, title = {Massively Parallel Algorithms and Hardness for Single-Linkage Clustering under $\ell_p$ Distances}, author = {Yaroslavtsev, Grigory and Vadapalli, Adithya}, booktitle = {Proceedings of the 35th International Conference on Machine Learning}, pages = {5600--5609}, year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80}, series = {Proceedings of Machine Learning Research}, month = {10--15 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v80/yaroslavtsev18a/yaroslavtsev18a.pdf}, url = {https://proceedings.mlr.press/v80/yaroslavtsev18a.html}, abstract = {We present first massively parallel (MPC) algorithms and hardness of approximation results for computing Single-Linkage Clustering of n input d-dimensional vectors under Hamming, $\ell_1, \ell_2$ and $\ell_\infty$ distances. All our algorithms run in O(log n) rounds of MPC for any fixed d and achieve (1+\epsilon)-approximation for all distances (except Hamming for which we show an exact algorithm). We also show constant-factor inapproximability results for o(\log n)-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension depending on the distance used). Efficiency of implementation of our algorithms in Apache Spark is demonstrated through experiments on the largest available vector datasets from the UCI machine learning repository exhibiting speedups of several orders of magnitude.} }
Endnote
%0 Conference Paper %T Massively Parallel Algorithms and Hardness for Single-Linkage Clustering under $\ell_p$ Distances %A Grigory Yaroslavtsev %A Adithya Vadapalli %B Proceedings of the 35th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2018 %E Jennifer Dy %E Andreas Krause %F pmlr-v80-yaroslavtsev18a %I PMLR %P 5600--5609 %U https://proceedings.mlr.press/v80/yaroslavtsev18a.html %V 80 %X We present first massively parallel (MPC) algorithms and hardness of approximation results for computing Single-Linkage Clustering of n input d-dimensional vectors under Hamming, $\ell_1, \ell_2$ and $\ell_\infty$ distances. All our algorithms run in O(log n) rounds of MPC for any fixed d and achieve (1+\epsilon)-approximation for all distances (except Hamming for which we show an exact algorithm). We also show constant-factor inapproximability results for o(\log n)-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension depending on the distance used). Efficiency of implementation of our algorithms in Apache Spark is demonstrated through experiments on the largest available vector datasets from the UCI machine learning repository exhibiting speedups of several orders of magnitude.
APA
Yaroslavtsev, G. & Vadapalli, A.. (2018). Massively Parallel Algorithms and Hardness for Single-Linkage Clustering under $\ell_p$ Distances. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5600-5609 Available from https://proceedings.mlr.press/v80/yaroslavtsev18a.html.

Related Material