Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

Yu Xin Li, Felix Dangel, Derek Tam, Colin Raffel
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34252-34270, 2025.

Abstract

The diagonal of a model’s Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher is estimated by computing the squared gradient of the model’s outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (Squared gradient accumulator as an approximation of the Fisher) consistently performs similarly to the Fisher while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.
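To make the idea concrete, the following is a minimal PyTorch sketch (illustrative only, not the authors' released code) contrasting the two quantities: the usual diagonal-Fisher estimate built by averaging squared gradients over extra batches, and the "Squisher", which simply reads out Adam's second-moment accumulator (the moving average v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2, stored as "exp_avg_sq") that training has already produced. The gradient used below is that of the training loss, a common simplification of the Fisher's expectation over the model's output distribution; the function names and the batch-averaging scheme are assumptions for illustration.

import torch

def fisher_diagonal(model, loss_fn, data_loader, n_batches=100):
    # Conventional estimate: average the squared gradient of the loss
    # over a few hundred batches (the extra cost the paper aims to avoid).
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
    return {name: f / n_batches for name, f in fisher.items()}

def squisher(model, optimizer):
    # "Free" alternative: recycle Adam's squared-gradient accumulator
    # ("exp_avg_sq" in torch.optim.Adam's per-parameter state), which is
    # already a moving average of squared gradients from training.
    scores = {}
    for name, p in model.named_parameters():
        state = optimizer.state.get(p, {})
        if "exp_avg_sq" in state:
            scores[name] = state["exp_avg_sq"].detach().clone()
    return scores

In downstream uses of the Fisher (e.g., as per-parameter sensitivity weights), the dictionary returned by squisher would be dropped in wherever the Fisher dictionary is expected; per the abstract, this substitution performs comparably to the Fisher across five applications.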

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25j,
  title     = {Fishers for Free? {A}pproximating the {F}isher Information Matrix by Recycling the Squared Gradient Accumulator},
  author    = {Li, Yu Xin and Dangel, Felix and Tam, Derek and Raffel, Colin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {34252--34270},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25j/li25j.pdf},
  url       = {https://proceedings.mlr.press/v267/li25j.html},
  abstract  = {The diagonal of a model’s Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher is estimated by computing the squared gradient of the model’s outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (Squared gradient accumulator as an approximation of the Fisher) consistently performs similarly to the Fisher while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.}
}
Endnote
%0 Conference Paper
%T Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator
%A Yu Xin Li
%A Felix Dangel
%A Derek Tam
%A Colin Raffel
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25j
%I PMLR
%P 34252--34270
%U https://proceedings.mlr.press/v267/li25j.html
%V 267
%X The diagonal of a model’s Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher is estimated by computing the squared gradient of the model’s outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (Squared gradient accumulator as an approximation of the Fisher) consistently performs similarly to the Fisher while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.
APA
Li, Y.X., Dangel, F., Tam, D. & Raffel, C. (2025). Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:34252-34270. Available from https://proceedings.mlr.press/v267/li25j.html.
