Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

Georgii Sergeevich Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Valerievich Dimitrov, Ivan Oseledets
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:26363-26381, 2023.

Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs that, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
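
To make the idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: during the forward pass of a pointwise nonlinearity, only a few-bit bin index of each input element is saved, and the backward pass reconstructs an approximate derivative from a small piecewise-constant codebook. The 2-bit boundaries and levels here are illustrative placeholders, not the optimal values the paper obtains via dynamic programming, and a real implementation would bit-pack four 2-bit indices per byte; `FewBitGELU` is a hypothetical name, not the authors' released API.

```python
# Hedged sketch: few-bit backward for GELU with a 2-bit (4-level)
# piecewise-constant approximation of the derivative.
import math
import torch

# Illustrative 2-bit quantization: interval boundaries and the constant
# derivative value used inside each interval (hypothetical numbers, NOT
# the DP-optimal ones from the paper).
BOUNDARIES = torch.tensor([-2.0, 0.0, 2.0])       # splits the real line into 4 bins
LEVELS = torch.tensor([-0.02, 0.16, 0.85, 1.02])  # one derivative value per bin


class FewBitGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save only the bin index of each element instead of the
        # full-precision input (uint8 here for simplicity; a real
        # implementation would bit-pack 4 indices per byte).
        idx = torch.bucketize(x, BOUNDARIES.to(x.device)).to(torch.uint8)
        ctx.save_for_backward(idx)
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (idx,) = ctx.saved_tensors
        # Look up the piecewise-constant derivative approximation and
        # multiply by the incoming gradient.
        approx_deriv = LEVELS.to(grad_output.device)[idx.long()]
        return grad_output * approx_deriv


def gelu_derivative(x: torch.Tensor) -> torch.Tensor:
    # Exact derivative of GELU(x) = x * Phi(x): Phi(x) + x * phi(x).
    phi = torch.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    return Phi + x * phi


if __name__ == "__main__":
    x = torch.randn(8, requires_grad=True)
    FewBitGELU.apply(x).sum().backward()
    # Compare the few-bit backward against the exact derivative.
    print(x.grad)
    print(gelu_derivative(x.detach()))
```

In this sketch the memory saved per activation comes from replacing a 32-bit (or 16-bit) input tensor with a 2-bit index tensor; the drop-in replacements mentioned in the abstract wrap the same pattern behind the usual module interfaces for all popular nonlinearities.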

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-novikov23a, title = {Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction}, author = {Novikov, Georgii Sergeevich and Bershatsky, Daniel and Gusak, Julia and Shonenkov, Alex and Dimitrov, Denis Valerievich and Oseledets, Ivan}, booktitle = {Proceedings of the 40th International Conference on Machine Learning}, pages = {26363--26381}, year = {2023}, editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan}, volume = {202}, series = {Proceedings of Machine Learning Research}, month = {23--29 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v202/novikov23a/novikov23a.pdf}, url = {https://proceedings.mlr.press/v202/novikov23a.html}, abstract = {Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs that, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element. We show that such approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks.} }
Endnote
%0 Conference Paper %T Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction %A Georgii Sergeevich Novikov %A Daniel Bershatsky %A Julia Gusak %A Alex Shonenkov %A Denis Valerievich Dimitrov %A Ivan Oseledets %B Proceedings of the 40th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2023 %E Andreas Krause %E Emma Brunskill %E Kyunghyun Cho %E Barbara Engelhardt %E Sivan Sabato %E Jonathan Scarlett %F pmlr-v202-novikov23a %I PMLR %P 26363--26381 %U https://proceedings.mlr.press/v202/novikov23a.html %V 202 %X Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs that, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element. We show that such approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks.
APA
Novikov, G.S., Bershatsky, D., Gusak, J., Shonenkov, A., Dimitrov, D.V. & Oseledets, I. (2023). Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:26363-26381. Available from https://proceedings.mlr.press/v202/novikov23a.html.