Approximate Cluster Recovery from Noisy Labels

Buddhima Gamlath; Silvio Lattanzi; Ashkan Norouzi-Fard; Ola Svensson

Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:1463-1509, 2022.

Abstract

Designing algorithms for machine learning problems targeting beyond worst-case analysis and, in particular, analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic k-means clustering problem in the presence of noisy labels. In this problem, in addition to a set of points and parameter $k$, we receive cluster labels of each point generated by either an adversarial or a random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the k-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor $1+o(1)$ of the optimum even when the label of each point is perturbed with a large probability (think 99%). In contrast, we show that the side-information with adversarial perturbations is as hard as the original problem even if only a small $\epsilon$ fraction of the labels are perturbed. We complement this negative result by giving a simple algorithm in the case when the adversary is only allowed to perturb an $\epsilon$ fraction of the labels in each cluster.
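The input model described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm; it only shows what a noisy-label instance might look like and how the k-means cost of a labeling is measured. The function names `kmeans_cost` and `perturb_labels` are hypothetical, and the noise model used here (each label independently replaced by a uniformly random label with probability `p`) is an assumption made for illustration; the paper's exact perturbation model may differ.

```python
import random


def kmeans_cost(points, labels, k):
    """Sum of squared distances from each point to the mean of its labeled cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for j in range(dim):
            sums[c][j] += p[j]
    # Empty clusters get no center; no point refers to them.
    centers = [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    return sum(
        sum((p[j] - centers[c][j]) ** 2 for j in range(dim))
        for p, c in zip(points, labels)
    )


def perturb_labels(labels, k, p, rng):
    """Assumed noise model: with probability p, replace a label by a uniform random one."""
    return [rng.randrange(k) if rng.random() < p else c for c in labels]


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    optimal = [0, 0, 1, 1]
    noisy = perturb_labels(optimal, k=2, p=0.99, rng=random.Random(0))
    print("optimal cost:", kmeans_cost(points, optimal, 2))
    print("noisy labels:", noisy, "cost:", kmeans_cost(points, noisy, 2))
```

In this toy instance the cost of the noisy labeling can only match or exceed the optimal cost; the question studied in the paper is how much of the optimal structure can still be recovered from such labels.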

Cite this Paper

BibTeX


@InProceedings{pmlr-v178-gamlath22a,
  title = 	 {Approximate Cluster Recovery from Noisy Labels},
  author =       {Gamlath, Buddhima and Lattanzi, Silvio and Norouzi-Fard, Ashkan and Svensson, Ola},
  booktitle = 	 {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages = 	 {1463--1509},
  year = 	 {2022},
  editor = 	 {Loh, Po-Ling and Raginsky, Maxim},
  volume = 	 {178},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {02--05 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v178/gamlath22a/gamlath22a.pdf},
  url = 	 {https://proceedings.mlr.press/v178/gamlath22a.html},
  abstract = 	 {Designing algorithms for machine learning problems targeting beyond worst-case analysis and, in particular, analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic k-means clustering problem in the presence of noisy labels. In this problem, in addition to a set of points and parameter $k$, we receive cluster labels of each point generated by either an adversarial or a random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the k-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor $1+o(1)$ of the optimum even when the label of each point is perturbed with a large probability (think 99\%). In contrast, we show that the side-information with adversarial perturbations is as hard as the original problem even if only a small $\epsilon$ fraction of the labels are perturbed. We complement this negative result by giving a simple algorithm in the case when the adversary is only allowed to perturb an $\epsilon$ fraction of the labels per \emph{each cluster}.}
}

Endnote

%0 Conference Paper
%T Approximate Cluster Recovery from Noisy Labels
%A Buddhima Gamlath
%A Silvio Lattanzi
%A Ashkan Norouzi-Fard
%A Ola Svensson
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-gamlath22a
%I PMLR
%P 1463--1509
%U https://proceedings.mlr.press/v178/gamlath22a.html
%V 178
%X Designing algorithms for machine learning problems targeting beyond worst-case analysis and, in particular, analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic k-means clustering problem in the presence of noisy labels.  In this problem, in addition to a set of points and parameter \(k\), we receive cluster labels of each point generated by either an adversarial or a random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the k-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor $1+o(1)$ of the optimum even when the label of each point is perturbed with a large probability (think 99%). In contrast, we show that the side-information with adversarial perturbations is as hard as the original problem even if only a small $\epsilon$ fraction of the labels are  perturbed. We complement this negative result by giving a simple algorithm in the case when the adversary is only allowed to perturb an $\epsilon$ fraction of the labels per \emph{each cluster}.

APA


Gamlath, B., Lattanzi, S., Norouzi-Fard, A. & Svensson, O. (2022). Approximate Cluster Recovery from Noisy Labels. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:1463-1509. Available from https://proceedings.mlr.press/v178/gamlath22a.html.
