Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds

Sepehr Assadi, Vaggos Chatziafratis, Jakub \Lącki, Vahab Mirrokni, Chen Wang
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:4643-4702, 2022.

Abstract

The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by the modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC’16]. Assuming that the input is an $n$-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs: – With O(n polylog n) space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm by Charikar and Chatziafratis [SODA’17]. – When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an $o(log(n)/loglog(n))$ factor, even when allowed polylog(n) passes over the input and exponential time. – In the most stringent setting of polylog{n} space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective value. – Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering requires storing almost the entire input even if allowed exponential time. Our algorithmic results establish a general structural result that proves that cut sparsifiers of input graphs can preserve the cost of “balanced” hierarchical trees to within some constant factor, and thus can be used in place of the original (dense) graphs when solving HC. Our lower bound results involve establishing a new streaming lower bound for a novel problem “One-vs-Many-Expanders”, which can be of independent interest.

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-assadi22a, title = {Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds}, author = {Assadi, Sepehr and Chatziafratis, Vaggos and \L\k{a}cki, Jakub and Mirrokni, Vahab and Wang, Chen}, booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory}, pages = {4643--4702}, year = {2022}, editor = {Loh, Po-Ling and Raginsky, Maxim}, volume = {178}, series = {Proceedings of Machine Learning Research}, month = {02--05 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v178/assadi22a/assadi22a.pdf}, url = {https://proceedings.mlr.press/v178/assadi22a.html}, abstract = {The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by the modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC’16]. Assuming that the input is an $n$-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs: – With O(n polylog n) space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm by Charikar and Chatziafratis [SODA’17]. – When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an $o(log(n)/loglog(n))$ factor, even when allowed polylog(n) passes over the input and exponential time. – In the most stringent setting of polylog{n} space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective value. – Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering requires storing almost the entire input even if allowed exponential time. Our algorithmic results establish a general structural result that proves that cut sparsifiers of input graphs can preserve the cost of “balanced” hierarchical trees to within some constant factor, and thus can be used in place of the original (dense) graphs when solving HC. Our lower bound results involve establishing a new streaming lower bound for a novel problem “One-vs-Many-Expanders”, which can be of independent interest.} }
Endnote
%0 Conference Paper %T Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds %A Sepehr Assadi %A Vaggos Chatziafratis %A Jakub \Lącki %A Vahab Mirrokni %A Chen Wang %B Proceedings of Thirty Fifth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2022 %E Po-Ling Loh %E Maxim Raginsky %F pmlr-v178-assadi22a %I PMLR %P 4643--4702 %U https://proceedings.mlr.press/v178/assadi22a.html %V 178 %X The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by the modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC’16]. Assuming that the input is an $n$-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs: – With O(n polylog n) space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm by Charikar and Chatziafratis [SODA’17]. – When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an $o(log(n)/loglog(n))$ factor, even when allowed polylog(n) passes over the input and exponential time. – In the most stringent setting of polylog{n} space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective value. – Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering requires storing almost the entire input even if allowed exponential time. Our algorithmic results establish a general structural result that proves that cut sparsifiers of input graphs can preserve the cost of “balanced” hierarchical trees to within some constant factor, and thus can be used in place of the original (dense) graphs when solving HC. Our lower bound results involve establishing a new streaming lower bound for a novel problem “One-vs-Many-Expanders”, which can be of independent interest.
APA
Assadi, S., Chatziafratis, V., \Lącki, J., Mirrokni, V. & Wang, C.. (2022). Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:4643-4702 Available from https://proceedings.mlr.press/v178/assadi22a.html.

Related Material