DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:903-912, 2018.

Abstract

Distributed model training is vulnerable to Byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur a prohibitive computational overhead in large-scale settings, and their convergence guarantees often require strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are used by the parameter server to eliminate the effects of adversarial updates. DRACO comes with problem-independent robustness guarantees, and the model that it trains is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches.
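The sketch below is a minimal illustration of the redundancy idea described in the abstract, not the paper's actual encoding schemes: each gradient is assigned to several compute nodes, and the parameter server recovers it by keeping the value reported by the majority of its replicas, so a minority of adversarial copies is outvoted. The function name, the repetition-style assignment, and the majority-vote decoder are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed repetition-style redundancy with majority
# voting); not DRACO's actual encoding/decoding schemes.
import numpy as np

def majority_vote_decode(replicas):
    """Return the gradient value reported by most replicas.

    replicas: list of np.ndarray, the r copies of one gradient reported by
    the nodes assigned to it (some of which may be adversarial). With r
    replicas, up to floor((r - 1) / 2) corrupted copies can be outvoted.
    """
    # Group byte-identical replicas and return the most common one.
    counts = {}
    for g in replicas:
        key = g.tobytes()
        tally, _ = counts.get(key, (0, g))
        counts[key] = (tally + 1, g)
    return max(counts.values(), key=lambda c: c[0])[1]

# Toy example: 3 replicas of a 2-dimensional gradient, one of them corrupted.
honest = np.array([0.5, -1.0])
reports = [honest, np.array([100.0, 100.0]), honest]  # middle replica is Byzantine
recovered = majority_vote_decode(reports)
assert np.array_equal(recovered, honest)
```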

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-chen18l,
  title     = {{DRACO}: {B}yzantine-resilient Distributed Training via Redundant Gradients},
  author    = {Chen, Lingjiao and Wang, Hongyi and Charles, Zachary and Papailiopoulos, Dimitris},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {903--912},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/chen18l/chen18l.pdf},
  url       = {https://proceedings.mlr.press/v80/chen18l.html},
  abstract  = {Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur a prohibitive computational overhead in large-scale settings, and their convergence guarantees often require strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are used by the parameter server to eliminate the effects of adversarial updates. DRACO comes with problem-independent robustness guarantees, and the model that it trains is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times, to orders of magnitude faster than median-based approaches.}
}
Endnote
%0 Conference Paper
%T DRACO: Byzantine-resilient Distributed Training via Redundant Gradients
%A Lingjiao Chen
%A Hongyi Wang
%A Zachary Charles
%A Dimitris Papailiopoulos
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-chen18l
%I PMLR
%P 903--912
%U https://proceedings.mlr.press/v80/chen18l.html
%V 80
%X Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur a prohibitive computational overhead in large-scale settings, and their convergence guarantees often require strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are used by the parameter server to eliminate the effects of adversarial updates. DRACO comes with problem-independent robustness guarantees, and the model that it trains is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times, to orders of magnitude faster than median-based approaches.
APA
Chen, L., Wang, H., Charles, Z. & Papailiopoulos, D. (2018). DRACO: Byzantine-resilient Distributed Training via Redundant Gradients. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:903-912. Available from https://proceedings.mlr.press/v80/chen18l.html.