Effectively Using Public Data in Privacy Preserving Machine Learning

Milad Nasr, Saeed Mahloujifar, Xinyu Tang, Prateek Mittal, Amir Houmansadr
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:25718-25732, 2023.

Abstract

Differentially private (DP) machine learning techniques are notorious for their degradation of model utility (e.g., they degrade classification accuracy). A recent line of work has demonstrated that leveraging public data can improve the trade-off between privacy and utility when training models with DP guaranteed. In this work, we further explore the potential of using public data in DP models, showing that utility gains can in fact be significantly higher than what shown in prior works. Specifically, we introduce DOPE-SGD, a modified DP-SGD algorithm that leverages public data during its training. DOPE-SGD uses public data in two complementary ways: (1) it uses advance augmentation techniques that leverages public data to generate synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a modified gradient clipping mechanism (which is a standard technique in DP training) to change the origin of gradient vectors using the information inferred from available public and synthetic data, therefore boosting utility. We also introduce a technique to ensemble intermediate DP models by leveraging the post processing property of differential privacy to further improve the accuracy of the predictions. Our experimental results demonstrate the effectiveness of our approach in improving the state-of-the-art in DP machine learning across multiple datasets, network architectures, and application domains. For instance, assuming access to $2,000$ public images, and for a privacy budget of $\varepsilon=2,\delta=10^{-5}$, our technique achieves an accuracy of $75.1%$ on CIFAR10, significantly higher than $68.1%$ achieved by the state of the art.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-nasr23a, title = {Effectively Using Public Data in Privacy Preserving Machine Learning}, author = {Nasr, Milad and Mahloujifar, Saeed and Tang, Xinyu and Mittal, Prateek and Houmansadr, Amir}, booktitle = {Proceedings of the 40th International Conference on Machine Learning}, pages = {25718--25732}, year = {2023}, editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan}, volume = {202}, series = {Proceedings of Machine Learning Research}, month = {23--29 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v202/nasr23a/nasr23a.pdf}, url = {https://proceedings.mlr.press/v202/nasr23a.html}, abstract = {Differentially private (DP) machine learning techniques are notorious for their degradation of model utility (e.g., they degrade classification accuracy). A recent line of work has demonstrated that leveraging public data can improve the trade-off between privacy and utility when training models with DP guaranteed. In this work, we further explore the potential of using public data in DP models, showing that utility gains can in fact be significantly higher than what shown in prior works. Specifically, we introduce DOPE-SGD, a modified DP-SGD algorithm that leverages public data during its training. DOPE-SGD uses public data in two complementary ways: (1) it uses advance augmentation techniques that leverages public data to generate synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a modified gradient clipping mechanism (which is a standard technique in DP training) to change the origin of gradient vectors using the information inferred from available public and synthetic data, therefore boosting utility. We also introduce a technique to ensemble intermediate DP models by leveraging the post processing property of differential privacy to further improve the accuracy of the predictions. Our experimental results demonstrate the effectiveness of our approach in improving the state-of-the-art in DP machine learning across multiple datasets, network architectures, and application domains. For instance, assuming access to $2,000$ public images, and for a privacy budget of $\varepsilon=2,\delta=10^{-5}$, our technique achieves an accuracy of $75.1%$ on CIFAR10, significantly higher than $68.1%$ achieved by the state of the art.} }
Endnote
%0 Conference Paper %T Effectively Using Public Data in Privacy Preserving Machine Learning %A Milad Nasr %A Saeed Mahloujifar %A Xinyu Tang %A Prateek Mittal %A Amir Houmansadr %B Proceedings of the 40th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2023 %E Andreas Krause %E Emma Brunskill %E Kyunghyun Cho %E Barbara Engelhardt %E Sivan Sabato %E Jonathan Scarlett %F pmlr-v202-nasr23a %I PMLR %P 25718--25732 %U https://proceedings.mlr.press/v202/nasr23a.html %V 202 %X Differentially private (DP) machine learning techniques are notorious for their degradation of model utility (e.g., they degrade classification accuracy). A recent line of work has demonstrated that leveraging public data can improve the trade-off between privacy and utility when training models with DP guaranteed. In this work, we further explore the potential of using public data in DP models, showing that utility gains can in fact be significantly higher than what shown in prior works. Specifically, we introduce DOPE-SGD, a modified DP-SGD algorithm that leverages public data during its training. DOPE-SGD uses public data in two complementary ways: (1) it uses advance augmentation techniques that leverages public data to generate synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a modified gradient clipping mechanism (which is a standard technique in DP training) to change the origin of gradient vectors using the information inferred from available public and synthetic data, therefore boosting utility. We also introduce a technique to ensemble intermediate DP models by leveraging the post processing property of differential privacy to further improve the accuracy of the predictions. Our experimental results demonstrate the effectiveness of our approach in improving the state-of-the-art in DP machine learning across multiple datasets, network architectures, and application domains. For instance, assuming access to $2,000$ public images, and for a privacy budget of $\varepsilon=2,\delta=10^{-5}$, our technique achieves an accuracy of $75.1%$ on CIFAR10, significantly higher than $68.1%$ achieved by the state of the art.
APA
Nasr, M., Mahloujifar, S., Tang, X., Mittal, P. & Houmansadr, A.. (2023). Effectively Using Public Data in Privacy Preserving Machine Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:25718-25732 Available from https://proceedings.mlr.press/v202/nasr23a.html.

Related Material