Why Is Public Pretraining Necessary for Private Model Training?

Arun Ganesh, Mahdi Haghifam, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Guha Thakurta, Lun Wang
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:10611-10627, 2023.

Abstract

In the privacy-utility tradeoff of a model trained on benchmark language and vision tasks, remarkable improvements have been widely reported when the model is pretrained on public data. Some gain is expected, as these models inherit the benefits of transfer learning, which is the standard motivation in non-private settings. However, the stark contrast in the gain from pretraining between non-private and private machine learning suggests that the gain in the latter is rooted in a fundamentally different cause. To explain this phenomenon, we hypothesize that the non-convex loss landscape of model training requires the optimization algorithm to go through two phases. In the first, the algorithm needs to select a good “basin” in the loss landscape. In the second, the algorithm solves an easy optimization problem within that basin. The former is a harder problem to solve with private data, while the latter is harder to solve with public data due to distribution shift or data scarcity. Guided by this intuition, we provide theoretical constructions that provably demonstrate the separation between private training with and without public pretraining. Further, systematic experiments on CIFAR-10 and LibriSpeech provide supporting evidence for our hypothesis.
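To make the two-phase picture concrete, below is a minimal, illustrative sketch of the "public pretraining, then private fine-tuning" recipe the abstract refers to. Everything here is an assumption for illustration rather than the authors' setup: the model (SmallNet), the data loaders, and all hyperparameters are placeholders, and the private phase is a bare-bones DP-SGD loop (per-example gradient clipping plus Gaussian noise), not the paper's theoretical construction.

```python
# Illustrative only: phase 1 = ordinary SGD on public data ("pick a basin"),
# phase 2 = DP-SGD on private data ("easy optimization inside the basin").
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    """Placeholder model; any architecture would do for this sketch."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU())
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

def pretrain_public(model, public_loader, epochs=1, lr=1e-2):
    """Phase 1 (public data): non-private SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in public_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()

def finetune_private(model, private_loader, epochs=1, lr=1e-1,
                     clip_norm=1.0, noise_multiplier=1.0):
    """Phase 2 (private data): DP-SGD with per-example clipping + Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(epochs):
        for x, y in private_loader:           # one DP-SGD step per batch
            accum = [torch.zeros_like(p) for p in params]
            for i in range(x.shape[0]):       # per-example gradients (slow but explicit)
                model.zero_grad()
                F.cross_entropy(model(x[i:i+1]), y[i:i+1]).backward()
                grads = [p.grad.detach().clone() for p in params]
                total = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
                scale = min(1.0, clip_norm / (total + 1e-12))   # clip L2 norm to clip_norm
                for a, g in zip(accum, grads):
                    a.add_(g * scale)
            with torch.no_grad():
                for p, a in zip(params, accum):
                    noise = torch.randn_like(a) * noise_multiplier * clip_norm
                    p.add_(-(lr / x.shape[0]) * (a + noise))    # noisy averaged update
```

In the paper's framing, the public phase does the hard, non-convex work of selecting a good basin, so the noisy, clipped private updates only need to solve the comparatively easy problem inside that basin; the clipping and Gaussian noise are what make the second phase differentially private.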

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-ganesh23a,
  title     = {Why Is Public Pretraining Necessary for Private Model Training?},
  author    = {Ganesh, Arun and Haghifam, Mahdi and Nasr, Milad and Oh, Sewoong and Steinke, Thomas and Thakkar, Om and Guha Thakurta, Abhradeep and Wang, Lun},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {10611--10627},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/ganesh23a/ganesh23a.pdf},
  url       = {https://proceedings.mlr.press/v202/ganesh23a.html}
}
Endnote
%0 Conference Paper
%T Why Is Public Pretraining Necessary for Private Model Training?
%A Arun Ganesh
%A Mahdi Haghifam
%A Milad Nasr
%A Sewoong Oh
%A Thomas Steinke
%A Om Thakkar
%A Abhradeep Guha Thakurta
%A Lun Wang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-ganesh23a
%I PMLR
%P 10611--10627
%U https://proceedings.mlr.press/v202/ganesh23a.html
%V 202
APA
Ganesh, A., Haghifam, M., Nasr, M., Oh, S., Steinke, T., Thakkar, O., Guha Thakurta, A. & Wang, L. (2023). Why Is Public Pretraining Necessary for Private Model Training? Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:10611-10627. Available from https://proceedings.mlr.press/v202/ganesh23a.html.
