Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training

Tehila Dahan, Kfir Yehuda Levy
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:9806-9833, 2024.

Abstract

In this paper, we investigate the challenging framework of Byzantine-robust training in distributed machine learning (ML) systems, focusing on enhancing both efficiency and practicality. As distributed ML systems become integral for complex ML tasks, ensuring resilience against Byzantine failures—where workers may contribute incorrect updates due to malice or error—gains paramount importance. Our first contribution is the introduction of the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels, while requiring low computational demands. Additionally, we propose harnessing a recently developed gradient estimation technique based on a double-momentum strategy within the Byzantine context. Our paper highlights its theoretical and practical advantages for Byzantine-robust training, especially in simplifying the tuning process and reducing the reliance on numerous hyperparameters. The effectiveness of this technique is supported by theoretical insights within the stochastic convex optimization (SCO) framework and corroborated by empirical evidence.
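The abstract only names the two techniques. As a point of reference, the following minimal Python sketch illustrates the general meta-aggregation pattern that CTMA belongs to: a baseline robust aggregator (here, a coordinate-wise median) supplies a center, the messages farthest from that center are trimmed, and the rest are averaged together with the center. The helper names, the Euclidean trimming rule, and the assumption that the number of Byzantine workers is known are illustrative choices, not the authors' exact CTMA algorithm.

    import numpy as np

    def coordinate_wise_median(msgs):
        # A standard baseline robust aggregator (not specific to this paper).
        return np.median(msgs, axis=0)

    def centered_trimmed_meta_aggregate(msgs, baseline_aggregator, num_byzantine):
        # Hypothetical sketch of a "wrap a baseline aggregator" meta-aggregation step;
        # the paper's CTMA rule may differ in its trimming and weighting details.
        msgs = np.asarray(msgs, dtype=float)
        n = msgs.shape[0]
        center = baseline_aggregator(msgs)             # robust anchor point
        dists = np.linalg.norm(msgs - center, axis=1)  # distance of each message to the anchor
        keep = np.argsort(dists)[: n - num_byzantine]  # keep the messages closest to the anchor
        # Average the kept messages, substituting the anchor for each trimmed one.
        return (msgs[keep].sum(axis=0) + num_byzantine * center) / n

    # Usage: 10 workers, 2 of them Byzantine, 5-dimensional gradient messages.
    rng = np.random.default_rng(0)
    honest = rng.normal(loc=1.0, scale=0.1, size=(8, 5))
    byzantine = rng.normal(loc=50.0, scale=5.0, size=(2, 5))
    msgs = np.vstack([honest, byzantine])
    agg = centered_trimmed_meta_aggregate(msgs, coordinate_wise_median, num_byzantine=2)
    print(agg)  # close to the honest mean (about 1.0 per coordinate)

The appeal of this wrapper pattern, as the abstract describes, is that it reuses any existing baseline aggregator and adds only a distance computation and a sort on top of it, keeping the extra computational cost low.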

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-dahan24a,
  title     = {Fault Tolerant {ML}: Efficient Meta-Aggregation and Synchronous Training},
  author    = {Dahan, Tehila and Levy, Kfir Yehuda},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {9806--9833},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/dahan24a/dahan24a.pdf},
  url       = {https://proceedings.mlr.press/v235/dahan24a.html}
}
Endnote
%0 Conference Paper
%T Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training
%A Tehila Dahan
%A Kfir Yehuda Levy
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-dahan24a
%I PMLR
%P 9806--9833
%U https://proceedings.mlr.press/v235/dahan24a.html
%V 235
APA
Dahan, T. & Levy, K. Y. (2024). Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:9806-9833. Available from https://proceedings.mlr.press/v235/dahan24a.html.
