On the Duality between Gradient Transformations and Adapters

Lucas Torroba Hennigen, Hunter Lang, Han Guo, Yoon Kim
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23030-23046, 2025.

Abstract

We study memory-efficient optimization of neural networks (in particular language models) with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map’s transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter’s parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
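To make the stated duality concrete, the following is a minimal NumPy sketch of the simplest case: a single plain-SGD step on one weight matrix. It checks numerically that projecting the gradient with a fixed linear map S, stepping in the low-dimensional space, and mapping back via S's transpose yields the same weights as reparameterizing W = W0 + S^T B (a one-sided linear adapter with the projection side frozen) and taking the same SGD step on B alone. The least-squares loss, the random projection S, and all variable names are illustrative assumptions rather than the paper's setup, and stateful optimizers such as Adam are not covered by this sketch.

import numpy as np

rng = np.random.default_rng(0)
m, n, r, batch, lr = 8, 5, 3, 16, 0.1   # r < m: optimization happens in an r-dimensional space

W0 = rng.normal(size=(m, n))            # base weight matrix
S  = rng.normal(size=(r, m))            # fixed linear map from the m-dim space to the r-dim space
X  = rng.normal(size=(batch, m))        # toy inputs
Y  = rng.normal(size=(batch, n))        # toy targets

def grad_wrt_W(W):
    # Gradient of the least-squares loss 0.5 * ||X W - Y||^2 with respect to W.
    return X.T @ (X @ W - Y)

# View 1: gradient transformation. Project the gradient to r dimensions,
# take the optimizer step there, and map the update back with S^T.
g_low = S @ grad_wrt_W(W0)                        # low-dimensional gradient, shape (r, n)
W_gradient_view = W0 + S.T @ (-lr * g_low)

# View 2: one-sided linear adapter. Reparameterize W = W0 + S^T B with W0 and S
# frozen, and take the same SGD step on B only (chain rule gives dL/dB = S dL/dW).
B = np.zeros((r, n))
B = B - lr * (S @ grad_wrt_W(W0 + S.T @ B))
W_adapter_view = W0 + S.T @ B

print(np.allclose(W_gradient_view, W_adapter_view))   # True: the two views produce identical weights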

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hennigen25a,
  title     = {On the Duality between Gradient Transformations and Adapters},
  author    = {Hennigen, Lucas Torroba and Lang, Hunter and Guo, Han and Kim, Yoon},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {23030--23046},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hennigen25a/hennigen25a.pdf},
  url       = {https://proceedings.mlr.press/v267/hennigen25a.html},
  abstract  = {We study memory-efficient optimization of neural networks (in particular language models) with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map’s transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter’s parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.}
}
Endnote
%0 Conference Paper
%T On the Duality between Gradient Transformations and Adapters
%A Lucas Torroba Hennigen
%A Hunter Lang
%A Han Guo
%A Yoon Kim
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hennigen25a
%I PMLR
%P 23030--23046
%U https://proceedings.mlr.press/v267/hennigen25a.html
%V 267
%X We study memory-efficient optimization of neural networks (in particular language models) with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map’s transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter’s parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
APA
Hennigen, L.T., Lang, H., Guo, H. & Kim, Y. (2025). On the Duality between Gradient Transformations and Adapters. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:23030-23046. Available from https://proceedings.mlr.press/v267/hennigen25a.html.
