Differentially Private Clustering in High-Dimensional Euclidean Spaces

Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, Hongyang Zhang
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:322-331, 2017.

Abstract

We study the problem of clustering sensitive data while preserving the privacy of individuals represented in the dataset, which has broad applications in practical machine learning and data analysis tasks. Although the problem has been widely studied in the context of low-dimensional, discrete spaces, much remains unknown concerning private clustering in high-dimensional Euclidean spaces $\mathbb{R}^d$. In this work, we give differentially private and efficient algorithms achieving strong guarantees for $k$-means and $k$-median clustering when $d=\Omega(\mathsf{polylog}(n))$. Our algorithm achieves clustering loss at most $\log^3(n)\mathsf{OPT}+\mathsf{poly}(\log n,d,k)$, advancing the state-of-the-art result of $\sqrt{d}\mathsf{OPT}+\mathsf{poly}(\log n,d^d,k^d)$. We also study the case where the data points are $s$-sparse and show that the clustering loss can scale logarithmically with $d$, i.e., $\log^3(n)\mathsf{OPT}+\mathsf{poly}(\log n,\log d,k,s)$. Experiments on both synthetic and real datasets verify the effectiveness of the proposed method.

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-balcan17a, title = {Differentially Private Clustering in High-Dimensional {E}uclidean Spaces}, author = {Maria-Florina Balcan and Travis Dick and Yingyu Liang and Wenlong Mou and Hongyang Zhang}, booktitle = {Proceedings of the 34th International Conference on Machine Learning}, pages = {322--331}, year = {2017}, editor = {Precup, Doina and Teh, Yee Whye}, volume = {70}, series = {Proceedings of Machine Learning Research}, month = {06--11 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v70/balcan17a/balcan17a.pdf}, url = { http://proceedings.mlr.press/v70/balcan17a.html }, abstract = {We study the problem of clustering sensitive data while preserving the privacy of individuals represented in the dataset, which has broad applications in practical machine learning and data analysis tasks. Although the problem has been widely studied in the context of low-dimensional, discrete spaces, much remains unknown concerning private clustering in high-dimensional Euclidean spaces $\mathbb{R}^d$. In this work, we give differentially private and efficient algorithms achieving strong guarantees for $k$-means and $k$-median clustering when $d=\Omega(\mathsf{polylog}(n))$. Our algorithm achieves clustering loss at most $\log^3(n)\mathsf{OPT}+\mathsf{poly}(\log n,d,k)$, advancing the state-of-the-art result of $\sqrt{d}\mathsf{OPT}+\mathsf{poly}(\log n,d^d,k^d)$. We also study the case where the data points are $s$-sparse and show that the clustering loss can scale logarithmically with $d$, i.e., $\log^3(n)\mathsf{OPT}+\mathsf{poly}(\log n,\log d,k,s)$. Experiments on both synthetic and real datasets verify the effectiveness of the proposed method.} }
Endnote
%0 Conference Paper %T Differentially Private Clustering in High-Dimensional Euclidean Spaces %A Maria-Florina Balcan %A Travis Dick %A Yingyu Liang %A Wenlong Mou %A Hongyang Zhang %B Proceedings of the 34th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2017 %E Doina Precup %E Yee Whye Teh %F pmlr-v70-balcan17a %I PMLR %P 322--331 %U http://proceedings.mlr.press/v70/balcan17a.html %V 70 %X We study the problem of clustering sensitive data while preserving the privacy of individuals represented in the dataset, which has broad applications in practical machine learning and data analysis tasks. Although the problem has been widely studied in the context of low-dimensional, discrete spaces, much remains unknown concerning private clustering in high-dimensional Euclidean spaces $\mathbb{R}^d$. In this work, we give differentially private and efficient algorithms achieving strong guarantees for $k$-means and $k$-median clustering when $d=\Omega(\mathsf{polylog}(n))$. Our algorithm achieves clustering loss at most $\log^3(n)\mathsf{OPT}+\mathsf{poly}(\log n,d,k)$, advancing the state-of-the-art result of $\sqrt{d}\mathsf{OPT}+\mathsf{poly}(\log n,d^d,k^d)$. We also study the case where the data points are $s$-sparse and show that the clustering loss can scale logarithmically with $d$, i.e., $\log^3(n)\mathsf{OPT}+\mathsf{poly}(\log n,\log d,k,s)$. Experiments on both synthetic and real datasets verify the effectiveness of the proposed method.
APA
Balcan, M., Dick, T., Liang, Y., Mou, W. & Zhang, H.. (2017). Differentially Private Clustering in High-Dimensional Euclidean Spaces. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:322-331 Available from http://proceedings.mlr.press/v70/balcan17a.html .

Related Material