Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer

Yilong Chen, Junyuan Shang, Zhenyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8931-8954, 2025.

Abstract

Transformer models scale hidden dimensions inefficiently because parameters expand uniformly with the hidden size. Examining the sparsity of hidden dimensions, we observe that only a small subset of dimensions is highly activated: some dimensions are commonly activated across tokens, while others are uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features with specialized sub-dimensions dynamically routed per token. To address the potential information loss from sparsity, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with minimal computational overhead, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters, and 3.7% higher performance with a 3$\times$ expansion of total parameters at a constant activated-parameter cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden-dimension sparsity.
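
To make the abstract's description concrete, below is a minimal sketch of a layer that combines always-active shared sub-dimensions with top-k routed specialized sub-dimension groups. It is an assumption based only on the abstract above, not the paper's released implementation: the class name MoHDLayer, the split sizes, the softmax top-k router, the n_groups/top_k scaling factor, and the linear fusion projection are all illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHDLayer(nn.Module):
    """Illustrative MoHD-style layer (hypothetical): a shared block of hidden
    dimensions plus specialized dimension groups routed per token (top-k),
    with activation scaling and a group-fusion projection."""

    def __init__(self, d_model=512, n_shared=128, n_groups=8, top_k=2):
        super().__init__()
        assert (d_model - n_shared) % n_groups == 0
        self.n_shared, self.n_groups, self.top_k = n_shared, n_groups, top_k
        self.group_size = (d_model - n_shared) // n_groups
        self.router = nn.Linear(d_model, n_groups)   # per-token routing scores
        self.fusion = nn.Linear(d_model, d_model)    # group-fusion projection

    def forward(self, x):                            # x: (batch, seq, d_model)
        shared, special = x.split([self.n_shared, x.size(-1) - self.n_shared], dim=-1)
        special = special.view(*x.shape[:-1], self.n_groups, self.group_size)

        scores = F.softmax(self.router(x), dim=-1)   # (batch, seq, n_groups)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)

        # Keep only the top-k specialized groups per token; others are zeroed.
        gate = torch.zeros_like(scores).scatter_(-1, top_idx, top_val)
        gated = special * gate.unsqueeze(-1)

        # Activation scaling: compensate for the dimensions that were dropped.
        gated = gated * (self.n_groups / self.top_k)

        out = torch.cat([shared, gated.flatten(-2)], dim=-1)
        return self.fusion(out)                      # fuse back to d_model


if __name__ == "__main__":
    layer = MoHDLayer()
    h = torch.randn(2, 16, 512)
    print(layer(h).shape)  # torch.Size([2, 16, 512])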

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25bc,
  title     = {Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer},
  author    = {Chen, Yilong and Shang, Junyuan and Zhang, Zhenyu and Sheng, Jiawei and Liu, Tingwen and Wang, Shuohuan and Sun, Yu and Wu, Hua and Wang, Haifeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {8931--8954},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25bc/chen25bc.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25bc.html},
  abstract  = {Transformer models encounter inefficiency when scaling hidden dimensions due to the uniform expansion of parameters. When delving into the sparsity of hidden dimensions, we observe that only a small subset of dimensions are highly activated, where some dimensions are commonly activated across tokens, and some others uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features and dynamically routes specialized sub-dimensions per token. To address the potential information loss from sparsity, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with minimal computational increases, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters and 3.7% higher performance with 3$\times$ total parameters expansion at constant activated parameters cost. MoHD offers a new perspective for scaling the model, showcasing the potential of hidden dimension sparsity.}
}
Endnote
%0 Conference Paper
%T Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer
%A Yilong Chen
%A Junyuan Shang
%A Zhenyu Zhang
%A Jiawei Sheng
%A Tingwen Liu
%A Shuohuan Wang
%A Yu Sun
%A Hua Wu
%A Haifeng Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25bc
%I PMLR
%P 8931--8954
%U https://proceedings.mlr.press/v267/chen25bc.html
%V 267
%X Transformer models encounter inefficiency when scaling hidden dimensions due to the uniform expansion of parameters. When delving into the sparsity of hidden dimensions, we observe that only a small subset of dimensions are highly activated, where some dimensions are commonly activated across tokens, and some others uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features and dynamically routes specialized sub-dimensions per token. To address the potential information loss from sparsity, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with minimal computational increases, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters and 3.7% higher performance with 3$\times$ total parameters expansion at constant activated parameters cost. MoHD offers a new perspective for scaling the model, showcasing the potential of hidden dimension sparsity.
APA
Chen, Y., Shang, J., Zhang, Z., Sheng, J., Liu, T., Wang, S., Sun, Y., Wu, H. & Wang, H. (2025). Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8931-8954. Available from https://proceedings.mlr.press/v267/chen25bc.html.