Dead Feature Counts in Sparse Autoencoders Predict Underlying Deep Q Networks’ Effectiveness

Coleman DuPlessie
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:385-397, 2026.

Abstract

Sparse autoencoders (SAEs) are machine learning models that can be used to express the inner workings of certain other models as human-interpretable features. While sparse autoencoders work well when applied to language models, there has been little research that investigates the extent to which they generalize to other applications of machine learning. This work investigates the application of SAEs to a deep Q network trained to complete a simple task. We find that, although SAEs tend to perform well and find a number of human-interpretable features, they contain a large number of "dead features" that never activate, which suggests that more research is necessary to adapt SAEs to the unique tasks reinforcement learning models solve. In particular, we note that the most effective deep Q networks trained to complete a task tend to result in sparse autoencoders with a consistent quantity of dead features. This suggests that these sparse autoencoders may in some sense be capturing the "optimal" or "true" number of features needed to solve the toy problem we study, and the high number of dead features may simply imply that additional live features past a certain quantity are unhelpful.

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-duplessie26a,
  title     = {Dead Feature Counts in Sparse Autoencoders Predict Underlying Deep Q Networks’ Effectiveness},
  author    = {DuPlessie, Coleman},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {385--397},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/duplessie26a/duplessie26a.pdf},
  url       = {https://proceedings.mlr.press/v322/duplessie26a.html},
  abstract  = {Sparse autoencoders (SAEs) are machine learning models that can be used to express the inner workings of certain other models as human-interpretable features. While sparse autoencoders work well when applied to language models, there has been little research that investigates the extent to which they generalize to other applications of machine learning. This work investigates the application of SAEs to a deep Q network trained to complete a simple task. We find that, although SAEs tend to perform well and find a number of human-interpretable features, they contain a large number of "dead features" that never activate, which suggests that more research is necessary to adapt SAEs to the unique tasks reinforcement learning models solve. In particular, we note that the most effective deep Q networks trained to complete a task tend to result in sparse autoencoders with a consistent quantity of dead features. This suggests that these sparse autoencoders may in some sense be capturing the "optimal" or "true" number of features needed to solve the toy problem we study, and the high number of dead features may simply imply that additional live features past a certain quantity are unhelpful.}
}
Endnote
%0 Conference Paper
%T Dead Feature Counts in Sparse Autoencoders Predict Underlying Deep Q Networks’ Effectiveness
%A Coleman DuPlessie
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-duplessie26a
%I PMLR
%P 385--397
%U https://proceedings.mlr.press/v322/duplessie26a.html
%V 322
%X Sparse autoencoders (SAEs) are machine learning models that can be used to express the inner workings of certain other models as human-interpretable features. While sparse autoencoders work well when applied to language models, there has been little research that investigates the extent to which they generalize to other applications of machine learning. This work investigates the application of SAEs to a deep Q network trained to complete a simple task. We find that, although SAEs tend to perform well and find a number of human-interpretable features, they contain a large number of "dead features" that never activate, which suggests that more research is necessary to adapt SAEs to the unique tasks reinforcement learning models solve. In particular, we note that the most effective deep Q networks trained to complete a task tend to result in sparse autoencoders with a consistent quantity of dead features. This suggests that these sparse autoencoders may in some sense be capturing the "optimal" or "true" number of features needed to solve the toy problem we study, and the high number of dead features may simply imply that additional live features past a certain quantity are unhelpful.
APA
DuPlessie, C. (2026). Dead Feature Counts in Sparse Autoencoders Predict Underlying Deep Q Networks’ Effectiveness. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:385-397. Available from https://proceedings.mlr.press/v322/duplessie26a.html.