Exact and approximate hierarchical clustering using A*

Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Avinava Dubey, Patrick Flaherty, Manzil Zaheer, Amr Ahmed, Kyle Cranmer, Andrew McCallum
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:2061-2071, 2021.

Abstract

Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
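The paper's method searches a trellis of partial clusterings, which is specific to its cost functions; as a general illustration of the A* principle the abstract refers to, the sketch below is a minimal, generic A* over a toy weighted graph (the graph and heuristic values are invented for illustration and are not from the paper). A* expands nodes in order of f(n) = g(n) + h(n), where g is the cost incurred so far and h is an admissible heuristic that never overestimates the remaining cost, which guarantees an optimal solution.

```python
import heapq

def a_star(graph, h, start, goal):
    """Generic A* best-first search.

    graph: dict mapping node -> list of (neighbor, edge_cost)
    h:     dict mapping node -> admissible lower bound on cost-to-goal
    Returns (total_cost, path) for the cheapest start->goal path,
    or None if the goal is unreachable.
    """
    # Frontier entries: (f = g + h, g, node, path so far).
    frontier = [(h[start], 0, start, [start])]
    best_g = {start: 0}  # cheapest known cost to reach each node
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nbr, w in graph.get(node, []):
            ng = g + w
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(frontier, (ng + h[nbr], ng, nbr, path + [nbr]))
    return None

# Toy instance (hypothetical): four nodes, h gives lower bounds to "D".
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)]}
h = {"A": 3, "B": 2, "C": 1, "D": 0}
cost, path = a_star(graph, h, "A", "D")
print(cost, path)  # 4 ['A', 'B', 'C', 'D']
```

The admissibility of h is what lets A* prune: states whose lower bound already exceeds the best-known solution need never be expanded, which is how the paper's exact algorithm avoids enumerating the full tree space.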

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-greenberg21a,
  title     = {Exact and approximate hierarchical clustering using A*},
  author    = {Greenberg, Craig S. and Macaluso, Sebastian and Monath, Nicholas and Dubey, Avinava and Flaherty, Patrick and Zaheer, Manzil and Ahmed, Amr and Cranmer, Kyle and McCallum, Andrew},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {2061--2071},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/greenberg21a/greenberg21a.pdf},
  url       = {https://proceedings.mlr.press/v161/greenberg21a.html},
  abstract  = {Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.}
}
Endnote
%0 Conference Paper
%T Exact and approximate hierarchical clustering using A*
%A Craig S. Greenberg
%A Sebastian Macaluso
%A Nicholas Monath
%A Avinava Dubey
%A Patrick Flaherty
%A Manzil Zaheer
%A Amr Ahmed
%A Kyle Cranmer
%A Andrew McCallum
%B Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2021
%E Cassio de Campos
%E Marloes H. Maathuis
%F pmlr-v161-greenberg21a
%I PMLR
%P 2061--2071
%U https://proceedings.mlr.press/v161/greenberg21a.html
%V 161
%X Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
APA
Greenberg, C.S., Macaluso, S., Monath, N., Dubey, A., Flaherty, P., Zaheer, M., Ahmed, A., Cranmer, K. & McCallum, A. (2021). Exact and approximate hierarchical clustering using A*. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:2061-2071. Available from https://proceedings.mlr.press/v161/greenberg21a.html.