Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation

Yuchen Yang, Yingdong Shi, Cheems Wang, Xiantong Zhen, Yuxuan Shi, Jun Xu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:56357-56381, 2024.

Abstract

Fine-tuning pretrained large models on downstream tasks is an important problem that nonetheless suffers from huge memory overhead due to their large-scale parameters. This work strives to reduce the memory overhead of fine-tuning from the perspectives of the activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which establishes the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives to the GELU and SiLU activation functions, which use the derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing redundant activation memory usage. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce peak memory usage by up to $\sim 30\%$. Our code is released on GitHub.
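
The activation-function idea above (keep the exact GELU/SiLU forward pass, but backpropagate through a cheaper ReLU-style derivative so that only a compact mask, rather than the full-precision activation, needs to be stored) can be sketched with a custom autograd function. The snippet below is a minimal PyTorch illustration under that reading of the abstract, not the authors' released code; the name ApproxGELU and the single-threshold mask are assumptions, whereas the paper combines the derivatives of several ReLUs for a closer approximation.

import torch


class ApproxGELU(torch.autograd.Function):
    """Sketch: exact GELU forward, ReLU-style derivative in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Save only a boolean mask instead of the full-precision input.
        # (A packed bit mask would be smaller still; a bool tensor keeps the sketch simple.)
        ctx.save_for_backward(x > 0)
        # Forward pass is unchanged: the layer still computes the exact GELU.
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Backward pass uses the derivative of ReLU (a 0/1 step) as a stand-in
        # for the true GELU derivative.
        return grad_output * mask.to(grad_output.dtype)


if __name__ == "__main__":
    x = torch.randn(4, 8, requires_grad=True)
    ApproxGELU.apply(x).sum().backward()
    print(x.grad)

Because only the mask is saved in the forward pass, this layer's activation memory shrinks from a float tensor to a boolean one; the Memory-Sharing Backpropagation strategy described in the abstract addresses the remaining activations by letting two adjacent layers share a single activation buffer.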

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-yang24u,
  title     = {Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation},
  author    = {Yang, Yuchen and Shi, Yingdong and Wang, Cheems and Zhen, Xiantong and Shi, Yuxuan and Xu, Jun},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {56357--56381},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/yang24u/yang24u.pdf},
  url       = {https://proceedings.mlr.press/v235/yang24u.html},
  abstract  = {Fine-tuning pretrained large models on downstream tasks is an important problem that nonetheless suffers from huge memory overhead due to their large-scale parameters. This work strives to reduce the memory overhead of fine-tuning from the perspectives of the activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which establishes the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives to the GELU and SiLU activation functions, which use the derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing redundant activation memory usage. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce peak memory usage by up to $\sim 30\%$. Our code is released on GitHub.}
}
Endnote
%0 Conference Paper
%T Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation
%A Yuchen Yang
%A Yingdong Shi
%A Cheems Wang
%A Xiantong Zhen
%A Yuxuan Shi
%A Jun Xu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-yang24u
%I PMLR
%P 56357--56381
%U https://proceedings.mlr.press/v235/yang24u.html
%V 235
%X Fine-tuning pretrained large models on downstream tasks is an important problem that nonetheless suffers from huge memory overhead due to their large-scale parameters. This work strives to reduce the memory overhead of fine-tuning from the perspectives of the activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which establishes the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives to the GELU and SiLU activation functions, which use the derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing redundant activation memory usage. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce peak memory usage by up to $\sim 30\%$. Our code is released on GitHub.
APA
Yang, Y., Shi, Y., Wang, C., Zhen, X., Shi, Y. & Xu, J. (2024). Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:56357-56381. Available from https://proceedings.mlr.press/v235/yang24u.html.