Natural Compression for Distributed Deep Learning
Proceedings of Mathematical and Scientific Machine Learning, PMLR 190:129-141, 2022.
Abstract
Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: {\em natural compression ($C_{\text{nat}}$)}. Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive integer) power of two, which can be computed in a “natural” way by ignoring the mantissa. We show that, compared to no compression, $C_{\text{nat}}$ increases the second moment of the compressed vector by at most the tiny factor $\frac{9}{8}$, which means that the effect of $C_{\text{nat}}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by $C_{\text{nat}}$ are substantial, leading to a {\em $3$-$4\times$ improvement in overall theoretical running time}. For applications requiring more aggressive compression, we generalize $C_{\text{nat}}$ to {\em natural dithering}, which we prove is {\em exponentially better} than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and they offer a new state of the art in both theory and practice.
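To make the rounding rule concrete, the following is a minimal NumPy sketch of elementwise randomized rounding to the nearest power of two, with the rounding probabilities chosen so that the operator is unbiased. The function name `natural_compression` and its signature are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def natural_compression(x, rng=None):
    """Illustrative sketch of C_nat: round each nonzero entry of x, at random
    and in an unbiased way, to one of the two neighboring powers of two,
    keeping the sign. (Hypothetical helper, not the authors' implementation.)"""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)

    nz = x != 0
    mag = np.abs(x[nz])

    # Lower neighboring power of two: 2^floor(log2 |x|).
    low = np.exp2(np.floor(np.log2(mag)))
    high = 2.0 * low

    # Probability of rounding up, chosen so that E[C_nat(x)] = x.
    p_up = (mag - low) / low
    rounded = np.where(rng.random(mag.shape) < p_up, high, low)

    out[nz] = np.sign(x[nz]) * rounded
    return out
```

In this sketch the output equals the input in expectation, while each nonzero entry can be encoded by its sign and exponent alone (the mantissa is discarded), which is the source of the communication savings described in the abstract.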