Probabilistic TopK Sparse Autoencoder for Interpreting the Activations of Large Language Models

Raymond Li, Chuyuan Li, Gabriel Murray, Giuseppe Carenini
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:888-894, 2026.

Abstract

Sparse Autoencoders (SAEs) have emerged as a popular solution for extracting interpretable features from language model activations. However, existing SAE designs suffer from deterministic activations that starve gradients to “dead” components, and produce uncalibrated coefficients that provide no meaningful notion of uncertainty. To address these limitations, we introduce Probabilistic TopK SAEs, a novel approach that augments TopK SAEs with probabilistic gating through a Binary Concrete distribution. This stochastic sampling helps mitigate gradient starvation to dead neurons while producing coefficient magnitudes that are more correlated with the confidence of feature presence. Empirical experiments with GPT-2 and Qwen3 show that our method achieves consistent Pareto improvements over the baselines in high-sparsity settings (a small number of activated features) while maintaining a larger set of alive dictionary features.
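
As a rough illustration of the gating idea described in the abstract, the following Python sketch draws Binary Concrete (relaxed Bernoulli) gate samples and keeps the k largest per input. The abstract does not give the formulation, so everything here is an assumption: the function names, the use of encoder pre-activations as gate logits, the temperature value, and the choice to use the sampled gate as the coefficient. It is not the authors' implementation.

```python
import numpy as np


def binary_concrete_sample(logits, temperature, rng):
    """Draw relaxed Bernoulli (Binary Concrete) samples in (0, 1).

    Adds Logistic(0, 1) noise to the logits and squashes with a
    temperature-scaled sigmoid. Hypothetical helper, not from the paper.
    """
    # Keep u away from 0 and 1 so the logs stay finite.
    u = rng.uniform(low=1e-6, high=1.0 - 1e-6, size=logits.shape)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))


def probabilistic_topk_gate(pre_acts, k, temperature, rng):
    """Keep the k units with the largest sampled gates in each row.

    Assumption: pre-activations double as gate logits, and the sampled
    gate value is used directly as the sparse coefficient.
    """
    gates = binary_concrete_sample(pre_acts, temperature, rng)
    # Indices of the k largest gate values along the feature axis.
    top_idx = np.argpartition(-gates, k - 1, axis=-1)[..., :k]
    mask = np.zeros_like(gates)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return mask * gates


rng = np.random.default_rng(0)
pre_acts = rng.standard_normal((4, 16))  # 4 inputs, 16 dictionary features
codes = probabilistic_topk_gate(pre_acts, k=3, temperature=0.5, rng=rng)
```

Because the selection depends on noisy samples rather than on the raw pre-activations alone, a unit with a low pre-activation still has some chance of landing in the top k, which is one plausible reading of how stochastic gating could route gradient to otherwise dead units.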

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-li26a,
  title = {Probabilistic TopK Sparse Autoencoder for Interpreting the Activations of Large Language Models},
  author = {Li, Raymond and Li, Chuyuan and Murray, Gabriel and Carenini, Giuseppe},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {888--894},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/li26a/li26a.pdf},
  url = {https://proceedings.mlr.press/v318/li26a.html},
  abstract = {Sparse Autoencoders (SAEs) have emerged as a popular solution for extracting interpretable features from language model activations. However, existing SAE designs suffer from deterministic activations that starve gradients to “dead” components, and produce uncalibrated coefficients that provide no meaningful notion of uncertainty. To address these limitations, we introduce Probabilistic TopK SAEs, a novel approach that augments TopK SAEs with probabilistic gating through a Binary Concrete distribution. This stochastic sampling helps mitigate gradient starvation to dead neurons while producing coefficient magnitudes that are more correlated with the confidence of feature presence. Empirical experiments with GPT-2 and Qwen3 show that our method achieves consistent Pareto improvements over the baselines in high-sparsity settings (a small number of activated features) while maintaining a larger set of alive dictionary features.}
}
Endnote
%0 Conference Paper
%T Probabilistic TopK Sparse Autoencoder for Interpreting the Activations of Large Language Models
%A Raymond Li
%A Chuyuan Li
%A Gabriel Murray
%A Giuseppe Carenini
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-li26a
%I PMLR
%P 888--894
%U https://proceedings.mlr.press/v318/li26a.html
%V 318
%X Sparse Autoencoders (SAEs) have emerged as a popular solution for extracting interpretable features from language model activations. However, existing SAE designs suffer from deterministic activations that starve gradients to “dead” components, and produce uncalibrated coefficients that provide no meaningful notion of uncertainty. To address these limitations, we introduce Probabilistic TopK SAEs, a novel approach that augments TopK SAEs with probabilistic gating through a Binary Concrete distribution. This stochastic sampling helps mitigate gradient starvation to dead neurons while producing coefficient magnitudes that are more correlated with the confidence of feature presence. Empirical experiments with GPT-2 and Qwen3 show that our method achieves consistent Pareto improvements over the baselines in high-sparsity settings (a small number of activated features) while maintaining a larger set of alive dictionary features.
APA
Li, R., Li, C., Murray, G. & Carenini, G. (2026). Probabilistic TopK Sparse Autoencoder for Interpreting the Activations of Large Language Models. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:888-894. Available from https://proceedings.mlr.press/v318/li26a.html.
