Clusterability: A Theoretical Study

Margareta Ackerman, Shai Ben-David
; Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, PMLR 5:1-8, 2009.

Abstract

We investigate measures of the clusterability of data sets. Namely, ways to define how ‘strong’ or ‘conclusive’ is the clustering structure of a given data set. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon; Although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is easy (computationally). We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.

Cite this Paper


BibTeX
@InProceedings{pmlr-v5-ackerman09a, title = {Clusterability: A Theoretical Study}, author = {Margareta Ackerman and Shai Ben-David}, booktitle = {Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics}, pages = {1--8}, year = {2009}, editor = {David van Dyk and Max Welling}, volume = {5}, series = {Proceedings of Machine Learning Research}, address = {Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA}, month = {16--18 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v5/ackerman09a/ackerman09a.pdf}, url = {http://proceedings.mlr.press/v5/ackerman09a.html}, abstract = {We investigate measures of the clusterability of data sets. Namely, ways to define how ‘strong’ or ‘conclusive’ is the clustering structure of a given data set. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon; Although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is easy (computationally). We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.} }
Endnote
%0 Conference Paper %T Clusterability: A Theoretical Study %A Margareta Ackerman %A Shai Ben-David %B Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2009 %E David van Dyk %E Max Welling %F pmlr-v5-ackerman09a %I PMLR %J Proceedings of Machine Learning Research %P 1--8 %U http://proceedings.mlr.press %V 5 %W PMLR %X We investigate measures of the clusterability of data sets. Namely, ways to define how ‘strong’ or ‘conclusive’ is the clustering structure of a given data set. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon; Although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is easy (computationally). We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.
RIS
TY - CPAPER TI - Clusterability: A Theoretical Study AU - Margareta Ackerman AU - Shai Ben-David BT - Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics PY - 2009/04/15 DA - 2009/04/15 ED - David van Dyk ED - Max Welling ID - pmlr-v5-ackerman09a PB - PMLR SP - 1 DP - PMLR EP - 8 L1 - http://proceedings.mlr.press/v5/ackerman09a/ackerman09a.pdf UR - http://proceedings.mlr.press/v5/ackerman09a.html AB - We investigate measures of the clusterability of data sets. Namely, ways to define how ‘strong’ or ‘conclusive’ is the clustering structure of a given data set. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon; Although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is easy (computationally). We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem. ER -
APA
Ackerman, M. & Ben-David, S.. (2009). Clusterability: A Theoretical Study. Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, in PMLR 5:1-8

Related Material