Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Zhou Xun
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:26261-26282, 2025.

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
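As a concrete illustration of the decoupled-vocabulary idea described above, the following PyTorch sketch pairs a large, hashed bigram input embedding with an output head that keeps the original tokenizer vocabulary. The class name, the bigram construction, and the hashing trick used to bound the multi-gram table size are illustrative assumptions, not the paper's reference implementation.

import torch
import torch.nn as nn

class OverTokenizedEmbedding(nn.Module):
    """Embeds each position as its ordinary token embedding plus a hashed
    bigram embedding built from the current token and its predecessor."""

    def __init__(self, vocab_size: int, ngram_table_size: int, d_model: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.ngram_table_size = ngram_table_size
        self.base = nn.Embedding(vocab_size, d_model)         # ordinary 1-gram input embedding
        self.ngram = nn.Embedding(ngram_table_size, d_model)  # enlarged "over-tokenized" input table

    def forward(self, ids: torch.Tensor) -> torch.Tensor:     # ids: (batch, seq)
        x = self.base(ids)
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0                                         # pad the first position
        # Hash (previous token, current token) pairs into the finite bigram table.
        ngram_ids = (prev * self.vocab_size + ids) % self.ngram_table_size
        return x + self.ngram(ngram_ids)

# Usage: the language-model head still projects to the original vocabulary,
# so only the input side is scaled up.
emb = OverTokenizedEmbedding(vocab_size=32_000, ngram_table_size=1_000_000, d_model=256)
head = nn.Linear(256, 32_000, bias=False)
tokens = torch.randint(0, 32_000, (2, 16))
logits = head(emb(tokens))                                     # shape (2, 16, 32000)

The design point this sketch is meant to convey is that the input embedding table can grow (here via hashed multi-gram IDs) without changing the size or cost of the output softmax.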

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-huang25bb,
  title     = {Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling},
  author    = {Huang, Hongzhi and Zhu, Defa and Wu, Banggu and Zeng, Yutao and Wang, Ya and Min, Qiyang and Xun, Zhou},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {26261--26282},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/huang25bb/huang25bb.pdf},
  url       = {https://proceedings.mlr.press/v267/huang25bb.html}
}
Endnote
%0 Conference Paper
%T Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
%A Hongzhi Huang
%A Defa Zhu
%A Banggu Wu
%A Yutao Zeng
%A Ya Wang
%A Qiyang Min
%A Zhou Xun
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-huang25bb
%I PMLR
%P 26261--26282
%U https://proceedings.mlr.press/v267/huang25bb.html
%V 267
APA
Huang, H., Zhu, D., Wu, B., Zeng, Y., Wang, Y., Min, Q. & Xun, Z. (2025). Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:26261-26282. Available from https://proceedings.mlr.press/v267/huang25bb.html.
