Natural Compression for Distributed Deep Learning

Samuel Horváth, Chen-Yu Ho, Ludovit Horvath, Atal Narayan Sahu, Marco Canini, Peter Richtárik
Proceedings of Mathematical and Scientific Machine Learning, PMLR 190:129-141, 2022.

Abstract

Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple, yet theoretically and practically effective compression technique: natural compression ($C_{\text{nat}}$). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that, compared to no compression, $C_{\text{nat}}$ increases the second moment of the compressed vector by no more than the tiny factor $\frac{9}{8}$, which means that the effect of $C_{\text{nat}}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by $C_{\text{nat}}$ are substantial, leading to a $3$-$4\times$ improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize $C_{\text{nat}}$ to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer a new state of the art in both theory and practice.
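As a concrete illustration of the operator described above, here is a minimal NumPy sketch of the randomized power-of-two rounding, not the authors' reference implementation; the function name natural_compression and the vectorized formulation are ours.

import numpy as np

def natural_compression(x, rng=None):
    """Entrywise randomized rounding to a nearest power of two.

    A sketch of the scheme described in the abstract: each nonzero
    entry is rounded to one of the two adjacent powers of two, with
    probabilities chosen so that the compressor is unbiased.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)                    # zeros stay zero
    nz = np.nonzero(x)
    mag = np.abs(x[nz])
    lo = np.exp2(np.floor(np.log2(mag)))      # power of two below |x|
    hi = 2.0 * lo                             # power of two above |x|
    # Rounding up with probability (|x| - lo)/lo makes the output
    # unbiased: E[C_nat(x)] = x entrywise.
    p_up = (mag - lo) / lo
    out[nz] = np.sign(x[nz]) * np.where(rng.random(mag.shape) < p_up, hi, lo)
    return out

A short calculation recovers the abstract's claims. Writing a nonzero entry as $|x| = 2^e(1+m)$ with mantissa $m \in [0,1)$, the rounding-up probability above is exactly $m$, which is the sense in which the compressor is computed "by ignoring the mantissa": the sign and exponent bits are kept, and the mantissa only drives one random draw. Moreover, $\mathbb{E}[C_{\text{nat}}(x)^2]/x^2 = (3t-2)/t^2$ for $t = 1+m \in [1,2)$, which is maximized at $t = \frac{4}{3}$ with value exactly $\frac{9}{8}$.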

Cite this Paper

BibTeX
@InProceedings{pmlr-v190-horvoth22a,
  title     = {Natural Compression for Distributed Deep Learning},
  author    = {Horv\'{a}th, Samuel and Ho, Chen-Yu and Horvath, Ludovit and Sahu, Atal Narayan and Canini, Marco and Richt\'{a}rik, Peter},
  booktitle = {Proceedings of Mathematical and Scientific Machine Learning},
  pages     = {129--141},
  year      = {2022},
  editor    = {Dong, Bin and Li, Qianxiao and Wang, Lei and Xu, Zhi-Qin John},
  volume    = {190},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--17 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v190/horvoth22a/horvoth22a.pdf},
  url       = {https://proceedings.mlr.press/v190/horvoth22a.html},
  abstract  = {Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: {\em natural compression ($C_{\text{nat}}$)}. Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a “natural” way by ignoring the mantissa. We show that compared to no compression, $C_{\text{nat}}$ increases the second moment of the compressed vector by not more than the tiny factor $\frac{9}{8}$, which means that the effect of $C_{\text{nat}}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communications savings enabled by $C_{\text{nat}}$ are substantial, leading to {\em $3$-$4\times$ improvement in overall theoretical running time}. For applications requiring more aggressive compression, we generalize $C_{\text{nat}}$ to {\em natural dithering}, which we prove is {\em exponentially better} than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer new state-of-the-art both in theory and practice.}
}
Endnote
%0 Conference Paper
%T Natural Compression for Distributed Deep Learning
%A Samuel Horváth
%A Chen-Yu Ho
%A Ludovit Horvath
%A Atal Narayan Sahu
%A Marco Canini
%A Peter Richtárik
%B Proceedings of Mathematical and Scientific Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Bin Dong
%E Qianxiao Li
%E Lei Wang
%E Zhi-Qin John Xu
%F pmlr-v190-horvoth22a
%I PMLR
%P 129--141
%U https://proceedings.mlr.press/v190/horvoth22a.html
%V 190
%X Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: {\em natural compression ($C_{\text{nat}}$)}. Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a “natural” way by ignoring the mantissa. We show that compared to no compression, $C_{\text{nat}}$ increases the second moment of the compressed vector by not more than the tiny factor $\frac{9}{8}$, which means that the effect of $C_{\text{nat}}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communications savings enabled by $C_{\text{nat}}$ are substantial, leading to {\em $3$-$4\times$ improvement in overall theoretical running time}. For applications requiring more aggressive compression, we generalize $C_{\text{nat}}$ to {\em natural dithering}, which we prove is {\em exponentially better} than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer new state-of-the-art both in theory and practice.
APA
Horváth, S., Ho, C., Horvath, L., Sahu, A.N., Canini, M. & Richtárik, P. (2022). Natural Compression for Distributed Deep Learning. Proceedings of Mathematical and Scientific Machine Learning, in Proceedings of Machine Learning Research 190:129-141. Available from https://proceedings.mlr.press/v190/horvoth22a.html.
