Sets Clustering

Ibrahim Jubran; Murad Tukan; Alaa Maalouf; Dan Feldman

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4994-5005, 2020.

Abstract

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of fixed-size sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\|{p}-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result generalizes easily to any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that runs in time near linear in $n$. This is the first such result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility location are also provided.
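As an illustration of the objective defined in the abstract, the following is a minimal sketch that evaluates the sets-$k$-means cost by brute force: for each input set it takes the smallest squared distance between any of its points and any center, then sums over all sets. It is not the authors' released code. The names `sq_dist` and `sets_kmeans_cost`, the optional `weights` argument (meant to mimic how a weighted core-set would be evaluated), and the toy data are all assumptions made for this example.

```python
from itertools import product


def sq_dist(p, c):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_kmeans_cost(sets, centers, weights=None):
    """Sum over all input sets P of min_{p in P, c in centers} ||p - c||^2.

    `weights` is a hypothetical per-set weight (default 1 for every set),
    included only to show how a weighted subset would be scored.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        w * min(sq_dist(p, c) for p, c in product(P, centers))
        for P, w in zip(sets, weights)
    )


# Toy input (made up): n = 3 sets of 2 points each in the plane, k = 2 centers.
sets = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(1.0, 0.0), (9.0, 9.0)],
    [(4.0, 4.0), (10.0, 0.0)],
]
centers = [(0.0, 0.0), (4.0, 5.0)]

cost = sets_kmeans_cost(sets, centers)
print(cost)  # per-set minima are 0, 1 and 1, so this prints 2.0
```

Each set contributes only through its single closest point-center pair, which is what distinguishes this objective from ordinary $k$-means on the union of all points.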

Cite this Paper

BibTeX


@InProceedings{pmlr-v119-jubran20a,
  title = 	 {Sets Clustering},
  author =       {Jubran, Ibrahim and Tukan, Murad and Maalouf, Alaa and Feldman, Dan},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {4994--5005},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/jubran20a/jubran20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/jubran20a.html},
  abstract = 	 {The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of fixed sized sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\|{p}-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalized for any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.}
}

Endnote

%0 Conference Paper
%T Sets Clustering
%A Ibrahim Jubran
%A Murad Tukan
%A Alaa Maalouf
%A Dan Feldman
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-jubran20a
%I PMLR
%P 4994--5005
%U https://proceedings.mlr.press/v119/jubran20a.html
%V 119
%X The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of fixed sized sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\|{p}-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalized for any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.

APA


Jubran, I., Tukan, M., Maalouf, A. & Feldman, D. (2020). Sets Clustering. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:4994-5005. Available from https://proceedings.mlr.press/v119/jubran20a.html.
