Scaling Average-Linkage via Sparse Cluster Embeddings

Thomas Lavastida, Kefu Lu, Benjamin Moseley, Yuyan Wang
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:1429-1444, 2021.

Abstract

Average-linkage is one of the most popular hierarchical clustering algorithms. It is well known that average-linkage does not scale to large data sets due to the slow asymptotic running time. The fastest known implementation has running time quadratic in the number of data points. This paper presents a technique that we call cluster embedding. The embedding maps each cluster into a point in slightly higher dimensions. The pairwise distances between the mapped points approximate the average distance between clusters. By utilizing this embedding we scale the task of finding close pairs of clusters, which is a key step in average-linkage clustering. We achieve an approximate, sub-quadratic time implementation of average-linkage. We show theoretically the algorithm proposed in this paper achieves a near-linear running time and scales to large data sets. Moreover, its scalability empirically dominates average-linkage and typically offers 3-10x speed-up on large data sets.

Cite this Paper


BibTeX
@InProceedings{pmlr-v157-lavastida21a, title = {Scaling Average-Linkage via Sparse Cluster Embeddings}, author = {Lavastida, Thomas and Lu, Kefu and Moseley, Benjamin and Wang, Yuyan}, booktitle = {Proceedings of The 13th Asian Conference on Machine Learning}, pages = {1429--1444}, year = {2021}, editor = {Balasubramanian, Vineeth N. and Tsang, Ivor}, volume = {157}, series = {Proceedings of Machine Learning Research}, month = {17--19 Nov}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v157/lavastida21a/lavastida21a.pdf}, url = {https://proceedings.mlr.press/v157/lavastida21a.html}, abstract = {Average-linkage is one of the most popular hierarchical clustering algorithms. It is well known that average-linkage does not scale to large data sets due to the slow asymptotic running time. The fastest known implementation has running time quadratic in the number of data points. This paper presents a technique that we call cluster embedding. The embedding maps each cluster into a point in slightly higher dimensions. The pairwise distances between the mapped points approximate the average distance between clusters. By utilizing this embedding we scale the task of finding close pairs of clusters, which is a key step in average-linkage clustering. We achieve an approximate, sub-quadratic time implementation of average-linkage. We show theoretically the algorithm proposed in this paper achieves a near-linear running time and scales to large data sets. Moreover, its scalability empirically dominates average-linkage and typically offers 3-10x speed-up on large data sets.} }
Endnote
%0 Conference Paper %T Scaling Average-Linkage via Sparse Cluster Embeddings %A Thomas Lavastida %A Kefu Lu %A Benjamin Moseley %A Yuyan Wang %B Proceedings of The 13th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2021 %E Vineeth N. Balasubramanian %E Ivor Tsang %F pmlr-v157-lavastida21a %I PMLR %P 1429--1444 %U https://proceedings.mlr.press/v157/lavastida21a.html %V 157 %X Average-linkage is one of the most popular hierarchical clustering algorithms. It is well known that average-linkage does not scale to large data sets due to the slow asymptotic running time. The fastest known implementation has running time quadratic in the number of data points. This paper presents a technique that we call cluster embedding. The embedding maps each cluster into a point in slightly higher dimensions. The pairwise distances between the mapped points approximate the average distance between clusters. By utilizing this embedding we scale the task of finding close pairs of clusters, which is a key step in average-linkage clustering. We achieve an approximate, sub-quadratic time implementation of average-linkage. We show theoretically the algorithm proposed in this paper achieves a near-linear running time and scales to large data sets. Moreover, its scalability empirically dominates average-linkage and typically offers 3-10x speed-up on large data sets.
APA
Lavastida, T., Lu, K., Moseley, B. & Wang, Y.. (2021). Scaling Average-Linkage via Sparse Cluster Embeddings. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:1429-1444 Available from https://proceedings.mlr.press/v157/lavastida21a.html.

Related Material