Explainable k-Means and k-Medians Clustering

Michal Moshkovitz; Sanjoy Dasgupta; Cyrus Rashtchian; Nave Frost

Explainable k-Means and k-Medians Clustering

Michal Moshkovitz, Sanjoy Dasgupta, Cyrus Rashtchian, Nave Frost

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7055-7065, 2020.

Abstract

Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and any clustering based on a tree with k leaves must incur an Omega(log k) approximation factor compared to the optimal clustering. On the positive side, for two means/medians, we show that a single threshold cut can achieve a constant factor approximation, and we give nearly-matching lower bounds; for general k > 2, we design an efficient algorithm that leads to an O(k) approximation to the optimal k-medians and an O(k^2) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-moshkovitz20a,
  title = 	 {Explainable k-Means and k-Medians Clustering},
  author =       {Moshkovitz, Michal and Dasgupta, Sanjoy and Rashtchian, Cyrus and Frost, Nave},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {7055--7065},
  year = 	 {2020},
  editor = 	 {III, Hal Daumé and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/moshkovitz20a/moshkovitz20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/moshkovitz20a.html},
  abstract = 	 {Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and any clustering based on a tree with k leaves must incur an Omega(log k) approximation factor compared to the optimal clustering. On the positive side, for two means/medians, we show that a single threshold cut can achieve a constant factor approximation, and we give nearly-matching lower bounds; for general k > 2, we design an efficient algorithm that leads to an O(k) approximation to the optimal k-medians and an O(k^2) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.}
}

Endnote

%0 Conference Paper
%T Explainable k-Means and k-Medians Clustering
%A Michal Moshkovitz
%A Sanjoy Dasgupta
%A Cyrus Rashtchian
%A Nave Frost
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh	
%F pmlr-v119-moshkovitz20a
%I PMLR
%P 7055--7065
%U https://proceedings.mlr.press/v119/moshkovitz20a.html
%V 119
%X Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and any clustering based on a tree with k leaves must incur an Omega(log k) approximation factor compared to the optimal clustering. On the positive side, for two means/medians, we show that a single threshold cut can achieve a constant factor approximation, and we give nearly-matching lower bounds; for general k > 2, we design an efficient algorithm that leads to an O(k) approximation to the optimal k-medians and an O(k^2) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.

APA

Moshkovitz, M., Dasgupta, S., Rashtchian, C. & Frost, N.. (2020). Explainable k-Means and k-Medians Clustering. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7055-7065 Available from https://proceedings.mlr.press/v119/moshkovitz20a.html.

Explainable k-Means and k-Medians Clustering

Abstract

Cite this Paper

Related Material