Probabilistic TopK Sparse Autoencoder for Interpreting the Activations of Large Language Models
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:888-894, 2026.
Abstract
Sparse Autoencoders (SAEs) have emerged as a popular tool for extracting interpretable features from language model activations. However, existing SAE designs rely on deterministic activations that starve “dead” components of gradient signal, and they produce uncalibrated coefficients that carry no meaningful notion of uncertainty. To address these limitations, we introduce Probabilistic TopK SAEs, which augment TopK SAEs with probabilistic gating through a Binary Concrete distribution. The stochastic sampling helps mitigate gradient starvation of dead neurons while producing coefficient magnitudes that correlate more strongly with the confidence that a feature is present. Experiments with GPT-2 and Qwen3 show that our method achieves consistent Pareto improvements over the baselines in high-sparsity settings (a small number of activated features) while maintaining a larger set of alive dictionary features.