Robust $k$-means++
Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR 124:799-808, 2020.
Abstract
A good seeding or initialization of cluster centers for the $k$-means method is important from both theoretical and practical standpoints. The $k$-means objective is inherently non-robust and sensitive to outliers. A popular seeding such as $k$-means++ [3], which is more likely to pick outliers in the worst case, may compound this drawback, thereby affecting the quality of clustering on noisy data. For any $0 < \delta \leq 1$, we show that using a mixture of $D^{2}$ sampling [3] and uniform sampling, we can pick $O(k/\delta)$ candidate centers with the following guarantee: they contain some $k$ centers that give an $O(1)$-approximation to the optimal robust $k$-means solution while discarding at most $\delta n$ more points than the outliers discarded by the optimal solution. That is, if the optimal solution discards its farthest $\beta n$ points as outliers, our solution discards its farthest $(\beta + \delta) n$ points as outliers. The constant factor in our $O(1)$-approximation does not depend on $\delta$. This is an improvement over previous results for $k$-means with outliers based on LP relaxation and rounding [7] and local search [17]. The $O(k/\delta)$-sized subset can be found in time $O(ndk)$. Our \emph{robust} $k$-means++ is also easily amenable to scalable, faster, parallel implementations of $k$-means++ [5]. Our empirical results compare the above \emph{robust} variant of $k$-means++ with the usual $k$-means++, uniform random seeding, threshold $k$-means++ [6], and local search on real-world and synthetic data.
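The abstract only sketches the seeding idea, so the following Python snippet is a minimal illustration of how a mixture of $D^{2}$ and uniform sampling could be used to build a candidate set of size $O(k/\delta)$. The function name `robust_kmeanspp_seeding`, the equal 1/2 mixture weights, and the candidate-set size $\lceil k/\delta \rceil$ are illustrative assumptions, not the paper's exact procedure or constants.

```python
import numpy as np

def robust_kmeanspp_seeding(X, k, delta, seed=None):
    """Sketch of robust seeding: pick roughly k/delta candidate centers
    by mixing D^2 sampling with uniform sampling (assumed 1/2-1/2 mix)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(np.ceil(k / delta))            # assumed candidate-set size O(k/delta)

    # Start from a uniformly random point.
    centers = [X[rng.integers(n)]]
    d2 = np.sum((X - centers[0]) ** 2, axis=1)   # squared distance to nearest center

    while len(centers) < m:
        if rng.random() < 0.5:
            # D^2 sampling: probability proportional to squared distance,
            # which favors points far from the current centers.
            idx = rng.choice(n, p=d2 / d2.sum())
        else:
            # Uniform sampling: guards against D^2 repeatedly picking outliers.
            idx = rng.integers(n)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)
```

As a usage sketch, `robust_kmeanspp_seeding(X, k=10, delta=0.1)` would return about $10k$ candidates, from which a final set of $k$ centers would then be selected; the guarantee in the paper concerns the existence of such a good $k$-subset within the candidates.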