New results on information theoretic clustering

Ferdinando Cicalese, Eduardo Laber, Lucas Murtinho
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:1242-1251, 2019.

Abstract

We study the problem of optimizing the clustering of a set of vectors when the quality of the clustering is measured by the Entropy or the Gini impurity measure. Our results contribute to the state of the art both in terms of best known approximation guarantees and inapproximability bounds: (i) we give the first polynomial-time algorithm for Entropy-impurity-based clustering with an approximation guarantee independent of the number of vectors, and (ii) we show that the problem of clustering based on Entropy impurity does not admit a PTAS. The latter also implies an inapproximability result for the information-theoretic clustering of probability distributions, closing a problem left open in [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]. We also report experiments with a new clustering method designed on top of the theoretical tools leading to the above results. These experiments suggest that our method is of practical interest, in particular when the number of clusters is large.
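
To make the objective concrete, below is a minimal sketch of how the clustering cost is typically computed in this setting, assuming the standard weighted-impurity definitions (the impurity of a cluster is the 1-norm of its summed vector times the Entropy, respectively the Gini index, of the normalized sum). The function names and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def entropy_impurity(cluster):
    """Entropy impurity of a cluster of non-negative vectors:
    ||u||_1 * H(u / ||u||_1), where u is the component-wise sum."""
    u = np.sum(cluster, axis=0)
    w = u.sum()
    if w == 0:
        return 0.0
    p = u / w
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -w * np.sum(p * np.log2(p))

def gini_impurity(cluster):
    """Gini impurity of a cluster: ||u||_1 * (1 - sum_i (u_i / ||u||_1)^2)."""
    u = np.sum(cluster, axis=0)
    w = u.sum()
    if w == 0:
        return 0.0
    p = u / w
    return w * (1.0 - np.sum(p ** 2))

def clustering_cost(vectors, labels, impurity=entropy_impurity):
    """Total cost of a clustering: sum of the per-cluster impurities."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    return sum(impurity(vectors[labels == c]) for c in np.unique(labels))

# Example: 4 vectors (e.g., class-count histograms) split into 2 clusters.
V = [[9, 1], [8, 2], [1, 9], [2, 8]]
print(clustering_cost(V, [0, 0, 1, 1]))   # homogeneous clusters: lower cost
print(clustering_cost(V, [0, 1, 0, 1]))   # mixed clusters: higher cost
```

In the first labeling each cluster's summed vector stays concentrated on one coordinate, so both impurity measures are small; the mixed assignment produces nearly uniform cluster sums and a strictly larger total cost.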

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-cicalese19a,
  title     = {New results on information theoretic clustering},
  author    = {Cicalese, Ferdinando and Laber, Eduardo and Murtinho, Lucas},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {1242--1251},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/cicalese19a/cicalese19a.pdf},
  url       = {https://proceedings.mlr.press/v97/cicalese19a.html}
}
Endnote
%0 Conference Paper
%T New results on information theoretic clustering
%A Ferdinando Cicalese
%A Eduardo Laber
%A Lucas Murtinho
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-cicalese19a
%I PMLR
%P 1242--1251
%U https://proceedings.mlr.press/v97/cicalese19a.html
%V 97
APA
Cicalese, F., Laber, E., & Murtinho, L. (2019). New results on information theoretic clustering. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1242-1251. Available from https://proceedings.mlr.press/v97/cicalese19a.html.
