Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy

Kareem Amin, Alex Kulesza, Andres Munoz, Sergei Vassilvitskii
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:263-271, 2019.

Abstract

Differentially private learning algorithms protect individual participants in the training dataset by guaranteeing that their presence does not significantly change the resulting model. In order to make this promise, such algorithms need to know the maximum contribution that can be made by a single user: the more data an individual can contribute, the more noise will need to be added to protect them. While most existing analyses assume that the maximum contribution is known and fixed in advance (indeed, it is often assumed that each user contributes only a single example), we argue that in practice there is a meaningful choice to be made. On the one hand, if we allow users to contribute large amounts of data, we may end up adding excessive noise to protect a few outliers, even when the majority contribute only modestly. On the other hand, limiting users to small contributions keeps noise levels low at the cost of potentially discarding significant amounts of excess data, thus introducing bias. Here, we characterize this trade-off for an empirical risk minimization setting, showing that in general there is a “sweet spot” that depends on measurable properties of the dataset, but that there is also a concrete cost to privacy that cannot be avoided simply by collecting more data.
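
The trade-off described above is easiest to see on a toy statistic rather than the paper's empirical risk minimization setting. The sketch below is an editor's illustration under assumed conditions (values in [0, 1], the Laplace mechanism, a hypothetical cap parameter), not the authors' algorithm: truncating each user to at most cap values discards data and introduces bias, while the Laplace noise scale needed for epsilon-differential privacy grows linearly with cap.

import numpy as np

def dp_sum_with_cap(user_data, cap, epsilon, rng=None):
    # Keep at most `cap` values per user; anything beyond that is discarded,
    # which is the source of bias. With every value in [0, 1], removing one
    # user changes the truncated sum by at most `cap`, so the L1 sensitivity
    # is `cap` and Laplace noise with scale cap / epsilon gives
    # epsilon-differential privacy; larger caps mean more noise (variance).
    rng = rng if rng is not None else np.random.default_rng()
    truncated_sum = sum(sum(values[:cap]) for values in user_data)
    return truncated_sum + rng.laplace(scale=cap / epsilon)

# Sweeping the cap exposes the "sweet spot": small caps throw away data from
# heavy contributors, large caps inflate the noise added to protect everyone.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    users = [rng.random(rng.integers(1, 50)).tolist() for _ in range(200)]
    for cap in (1, 5, 20, 50):
        print(cap, dp_sum_with_cap(users, cap, epsilon=1.0, rng=rng))
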

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-amin19a,
  title =     {Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy},
  author =    {Amin, Kareem and Kulesza, Alex and Munoz, Andres and Vassilvitskii, Sergei},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages =     {263--271},
  year =      {2019},
  editor =    {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume =    {97},
  series =    {Proceedings of Machine Learning Research},
  month =     {09--15 Jun},
  publisher = {PMLR},
  pdf =       {http://proceedings.mlr.press/v97/amin19a/amin19a.pdf},
  url =       {https://proceedings.mlr.press/v97/amin19a.html}
}
Endnote
%0 Conference Paper
%T Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy
%A Kareem Amin
%A Alex Kulesza
%A Andres Munoz
%A Sergei Vassilvitskii
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-amin19a
%I PMLR
%P 263--271
%U https://proceedings.mlr.press/v97/amin19a.html
%V 97
APA
Amin, K., Kulesza, A., Munoz, A. & Vassilvitskii, S. (2019). Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:263-271. Available from https://proceedings.mlr.press/v97/amin19a.html.
