Exact and approximate hierarchical clustering using A*

Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Avinava Dubey, Patrick Flaherty, Manzil Zaheer, Amr Ahmed, Kyle Cranmer, Andrew McCallum
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:2061-2071, 2021.

Abstract

Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
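The paper's method searches a trellis of partial clusterings, which is specific to its cost functions; as a general illustration of the A* principle the abstract refers to, the sketch below is a minimal, generic A* over a toy weighted graph (the graph and heuristic values are invented for illustration and are not from the paper). A* expands nodes in order of f(n) = g(n) + h(n), where g is the cost incurred so far and h is an admissible heuristic that never overestimates the remaining cost, which guarantees an optimal solution.

```python
import heapq

def a_star(graph, h, start, goal):
    """Generic A* best-first search.

    graph: dict mapping node -> list of (neighbor, edge_cost)
    h:     dict mapping node -> admissible lower bound on cost-to-goal
    Returns (total_cost, path) for the cheapest start->goal path,
    or None if the goal is unreachable.
    """
    # Frontier entries: (f = g + h, g, node, path so far).
    frontier = [(h[start], 0, start, [start])]
    best_g = {start: 0}  # cheapest known cost to reach each node
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nbr, w in graph.get(node, []):
            ng = g + w
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(frontier, (ng + h[nbr], ng, nbr, path + [nbr]))
    return None

# Toy instance (hypothetical): four nodes, h gives lower bounds to "D".
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)]}
h = {"A": 3, "B": 2, "C": 1, "D": 0}
cost, path = a_star(graph, h, "A", "D")
print(cost, path)  # 4 ['A', 'B', 'C', 'D']
```

The admissibility of h is what lets A* prune: states whose lower bound already exceeds the best-known solution need never be expanded, which is how the paper's exact algorithm avoids enumerating the full tree space.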

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-greenberg21a,
  title     = {Exact and approximate hierarchical clustering using A*},
  author    = {Greenberg, Craig S. and Macaluso, Sebastian and Monath, Nicholas and Dubey, Avinava and Flaherty, Patrick and Zaheer, Manzil and Ahmed, Amr and Cranmer, Kyle and McCallum, Andrew},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {2061--2071},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/greenberg21a/greenberg21a.pdf},
  url       = {https://proceedings.mlr.press/v161/greenberg21a.html},
  abstract  = {Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.}
}
Endnote
%0 Conference Paper
%T Exact and approximate hierarchical clustering using A*
%A Craig S. Greenberg
%A Sebastian Macaluso
%A Nicholas Monath
%A Avinava Dubey
%A Patrick Flaherty
%A Manzil Zaheer
%A Amr Ahmed
%A Kyle Cranmer
%A Andrew McCallum
%B Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2021
%E Cassio de Campos
%E Marloes H. Maathuis
%F pmlr-v161-greenberg21a
%I PMLR
%P 2061--2071
%U https://proceedings.mlr.press/v161/greenberg21a.html
%V 161
%X Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
APA
Greenberg, C.S., Macaluso, S., Monath, N., Dubey, A., Flaherty, P., Zaheer, M., Ahmed, A., Cranmer, K. & McCallum, A. (2021). Exact and approximate hierarchical clustering using A*. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:2061-2071. Available from https://proceedings.mlr.press/v161/greenberg21a.html.