Secure Distributed Training at Scale

Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:7679-7739, 2022.

Abstract

Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address this is for several smaller groups to pool their computational resources and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server, making it infeasible to apply them to large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
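
To make the failure mode concrete, the following minimal sketch (not the protocol proposed in the paper) shows how a single Byzantine peer can dominate naive gradient averaging, and how a simple robust aggregator such as a coordinate-wise median limits the damage. All names, sizes, and constants below are illustrative assumptions.

# Illustrative sketch only: naive averaging vs. a simple robust aggregator.
# This is not the paper's protocol; it only demonstrates why Byzantine
# tolerance is needed when pooling updates from untrusted peers.
import numpy as np

rng = np.random.default_rng(0)
n_peers, dim = 8, 4
true_grad = rng.normal(size=dim)

# Honest peers send noisy estimates of the true gradient.
updates = true_grad + 0.1 * rng.normal(size=(n_peers, dim))

# One Byzantine peer sends an arbitrarily large malicious update.
updates[0] = 1e6 * np.ones(dim)

naive_avg = updates.mean(axis=0)         # dominated by the single attacker
robust_agg = np.median(updates, axis=0)  # tolerates a minority of bad peers

print("error of naive mean:   ", np.linalg.norm(naive_avg - true_grad))
print("error of robust median:", np.linalg.norm(robust_agg - true_grad))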

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-gorbunov22a,
  title     = {Secure Distributed Training at Scale},
  author    = {Gorbunov, Eduard and Borzunov, Alexander and Diskin, Michael and Ryabinin, Max},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {7679--7739},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/gorbunov22a/gorbunov22a.pdf},
  url       = {https://proceedings.mlr.press/v162/gorbunov22a.html}
}
APA
Gorbunov, E., Borzunov, A., Diskin, M. & Ryabinin, M. (2022). Secure Distributed Training at Scale. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:7679-7739. Available from https://proceedings.mlr.press/v162/gorbunov22a.html.
