[edit]
Sets Clustering
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4994-5005, 2020.
Abstract
The input to the \emph{sets-k-means} problem is an integer k≥1 and a set P={P1,⋯,Pn} of fixed sized sets in Rd. The goal is to compute a set C of k centers (points) in Rd that minimizes the sum ∑P∈Pmin of squared distances to these sets. An \emph{\varepsilon-core-set} for this problem is a weighted subset of \mathcal{P} that approximates this sum up to 1\pm\varepsilon factor, for \emph{every} set C of k centers in \mathbb{R}^d. We prove that such a core-set of O(\log^2{n}) sets always exists, and can be computed in O(n\log{n}) time, for every input \mathcal{P} and every fixed d,k\geq 1 and \varepsilon \in (0,1). The result easily generalized for any metric space, distances to the power of z>0, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS (1+\varepsilon approximation) for the sets-k-means problem that takes time near linear in n. This is the first result even for sets-mean on the plane (k=1, d=2). Open source code and experimental results for document classification and facility locations are also provided.