Fault Tolerance in Iterative-Convergent Machine Learning

Aurick Qiao, Bryon Aragam, Bingjing Zhang, Eric Xing
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5220-5230, 2019.

Abstract

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, can reduce the cost of partial failures by 78%–95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.
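To make the self-correcting property concrete, below is a minimal sketch (not the paper's SCAR system; the least-squares objective, learning rate, checkpoint interval, and perturbation are invented for illustration). Plain gradient descent absorbs an arbitrary mid-training perturbation of the parameters, while a naive periodic checkpoint represents the rollback point that traditional checkpoint-based fault tolerance would restore.

```python
# Illustrative sketch only (assumed setup, not the paper's method): gradient
# descent on a synthetic least-squares problem tolerates a parameter
# perturbation mid-training, because each iteration pulls the iterate back
# toward the optimum.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
x_true = rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=200)

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b) / len(b)

x = np.zeros(10)
lr = 0.05
checkpoint = x.copy()
for t in range(500):
    if t % 100 == 0:
        checkpoint = x.copy()  # naive periodic checkpoint (rollback alternative)
    if t == 250:
        # simulate a partial failure: part of the model is corrupted/reset
        x[:5] = rng.normal(size=5)
        print(f"iter {t}: perturbed, loss={loss(x):.4f}")
    x = x - lr * grad(x)  # the iterative-convergent update self-corrects the error

print(f"final loss after self-correction: {loss(x):.6f}")
print(f"loss at last checkpoint (rollback alternative): {loss(checkpoint):.6f}")
```

In this toy setting the loss returns to its pre-perturbation trajectory within a few dozen iterations; the paper's framework bounds the worst-case cost of such perturbations and uses it to design recovery strategies.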

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-qiao19a,
  title = {Fault Tolerance in Iterative-Convergent Machine Learning},
  author = {Qiao, Aurick and Aragam, Bryon and Zhang, Bingjing and Xing, Eric},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages = {5220--5230},
  year = {2019},
  editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = {97},
  series = {Proceedings of Machine Learning Research},
  month = {09--15 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v97/qiao19a/qiao19a.pdf},
  url = {https://proceedings.mlr.press/v97/qiao19a.html},
  abstract = {Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, can reduce the cost of partial failures by 78%{–}95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.}
}
Endnote
%0 Conference Paper
%T Fault Tolerance in Iterative-Convergent Machine Learning
%A Aurick Qiao
%A Bryon Aragam
%A Bingjing Zhang
%A Eric Xing
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-qiao19a
%I PMLR
%P 5220--5230
%U https://proceedings.mlr.press/v97/qiao19a.html
%V 97
%X Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, can reduce the cost of partial failures by 78%–95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.
APA
Qiao, A., Aragam, B., Zhang, B. & Xing, E.. (2019). Fault Tolerance in Iterative-Convergent Machine Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:5220-5230 Available from https://proceedings.mlr.press/v97/qiao19a.html.

Related Material