Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Cong Xie, Sanmi Koyejo, Indranil Gupta
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6893-6901, 2019.

Abstract

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
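The abstract does not spell out how workers are scored, so the following is only a minimal sketch of the suspicion-and-ranking idea, not the paper's implementation. It assumes each candidate gradient is scored by the loss decrease it produces on a trusted evaluation loss, minus a penalty on its squared norm, and that the server averages the top-ranked candidates. The function name `suspicion_aggregate` and the parameters `lr`, `rho`, and `num_trimmed` are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Sequence


def suspicion_aggregate(
    params: Sequence[float],
    candidates: Sequence[Sequence[float]],
    loss: Callable[[Sequence[float]], float],
    lr: float,
    rho: float,
    num_trimmed: int,
) -> List[float]:
    """Rank candidate gradients by a score and average the best ones.

    Assumed score (illustrative): loss(params) - loss(params - lr * g)
    - rho * ||g||^2. The num_trimmed lowest-scoring candidates are dropped.
    """
    if not 0 <= num_trimmed < len(candidates):
        raise ValueError("num_trimmed must leave at least one candidate")

    base = loss(params)

    def score(g: Sequence[float]) -> float:
        # Take a trial step with this candidate and measure the loss decrease.
        stepped = [p - lr * gi for p, gi in zip(params, g)]
        # Penalize large gradients so that huge updates are not favored.
        return base - loss(stepped) - rho * sum(gi * gi for gi in g)

    # Highest score first; the most suspicious candidates end up last.
    ranked = sorted(candidates, key=score, reverse=True)
    kept = ranked[: len(candidates) - num_trimmed]

    # Coordinate-wise mean of the candidates that survived the ranking.
    return [sum(col) / len(kept) for col in zip(*kept)]
```

With three workers of which two send corrupted gradients, setting `num_trimmed=2` keeps only the best-ranked candidate, which illustrates how a ranking of this kind does not need a majority of non-faulty workers.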

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-xie19b,
  title     = {Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance},
  author    = {Xie, Cong and Koyejo, Sanmi and Gupta, Indranil},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {6893--6901},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/xie19b/xie19b.pdf},
  url       = {http://proceedings.mlr.press/v97/xie19b.html},
  abstract  = {We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.}
}
Endnote
%0 Conference Paper
%T Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance
%A Cong Xie
%A Sanmi Koyejo
%A Indranil Gupta
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-xie19b
%I PMLR
%P 6893--6901
%U http://proceedings.mlr.press/v97/xie19b.html
%V 97
%X We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
APA
Xie, C., Koyejo, S. & Gupta, I. (2019). Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:6893-6901. Available from http://proceedings.mlr.press/v97/xie19b.html.
