Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6893-6901, 2019.
Abstract
We present Zeno, a technique for making distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results, which assumed that a majority of the nodes are non-faulty; we require only that at least one worker is non-faulty. Our key idea is to suspect workers that are potentially defective. Since suspicion alone is likely to produce false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
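The ranking-based preference mechanism can be sketched as follows: each candidate gradient is scored by how much it would decrease the loss on a small validation sample, penalized by its magnitude, and the lowest-scoring (most suspicious) candidates are dropped before averaging. This is a minimal illustrative sketch; the function names, the toy quadratic loss, and the hyperparameter values `gamma` and `rho` are assumptions for the example, not the paper's exact implementation.

```python
import numpy as np

def zeno_score(grad, x, loss_fn, gamma=0.1, rho=0.001):
    """Suspicion score for one candidate gradient: the estimated loss
    decrease from taking a step of size gamma along the gradient,
    minus a penalty rho * ||grad||^2 on its magnitude.
    Higher score = less suspicious."""
    return loss_fn(x) - loss_fn(x - gamma * grad) - rho * np.dot(grad, grad)

def zeno_aggregate(grads, x, loss_fn, b, gamma=0.1, rho=0.001):
    """Rank all candidate gradients by score, suspect the b
    lowest-scoring ones as faulty, and average the rest."""
    scores = [zeno_score(g, x, loss_fn, gamma, rho) for g in grads]
    keep = np.argsort(scores)[b:]  # indices of the n - b highest scores
    return np.mean([grads[i] for i in keep], axis=0)
```

For example, with a toy loss `f(x) = ||x||^2 / 2`, an honestly computed gradient points toward the minimum and scores high, while an adversarial gradient pointing away from it scores low and is filtered out even though it may dominate a plain average.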