Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik; Tim Lawson; Conor Houghton; Laurence Aitchison

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16178-16236, 2025.

Abstract

Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-farnik25a,
  title = 	 {{J}acobian Sparse Autoencoders: Sparsify Computations, Not Just Activations},
  author =       {Farnik, Lucy and Lawson, Tim and Houghton, Conor and Aitchison, Laurence},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {16178--16236},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/farnik25a/farnik25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/farnik25a.html},
  abstract = 	 {Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.}
}

Endnote

%0 Conference Paper
%T Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
%A Lucy Farnik
%A Tim Lawson
%A Conor Houghton
%A Laurence Aitchison
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-farnik25a
%I PMLR
%P 16178--16236
%U https://proceedings.mlr.press/v267/farnik25a.html
%V 267
%X Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.

APA

Farnik, L., Lawson, T., Houghton, C., & Aitchison, L. (2025). Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16178-16236. Available from https://proceedings.mlr.press/v267/farnik25a.html.
