From Language Models over Tokens to Language Models over Characters

Tim Vieira, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O’Donnell, Ryan Cotterell
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:61391-61412, 2025.

Abstract

Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before being passed to the token-level language model. As a result, the tokenizer and subsequent processing are highly sensitive to how the prompt is specified (e.g., whether or not it ends with a space). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method accurately approximates the character-level distribution at reasonably fast speeds, and that it yields a significant improvement in the language model’s compression rate (bits/byte).
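To make the prompt-boundary sensitivity mentioned in the abstract concrete, below is a minimal sketch that is not from the paper. It assumes the Hugging Face transformers library and the publicly available GPT-2 BPE tokenizer; the model name and the printed token strings in the comments are illustrative assumptions, not results from the paper.

    # Minimal sketch (assumes Hugging Face `transformers` and the GPT-2 tokenizer).
    # It shows that the same character-level prompt, with and without a trailing
    # space, maps to different token strings, so a token-level language model
    # conditions on different contexts in the two cases.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    prompt = "The capital of France is"

    print(tokenizer.tokenize(prompt))        # e.g. ['The', 'Ġcapital', 'Ġof', 'ĠFrance', 'Ġis']
    print(tokenizer.tokenize(prompt + " "))  # e.g. ['The', 'Ġcapital', 'Ġof', 'ĠFrance', 'Ġis', 'Ġ']

Roughly speaking, a character-level language model sidesteps this issue because the probability of a character string is obtained by marginalizing over the token strings that spell it out; the paper's exact and approximate conversion algorithms target this character-level distribution.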

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-vieira25a,
  title     = {From Language Models over Tokens to Language Models over Characters},
  author    = {Vieira, Tim and Lebrun, Benjamin and Giulianelli, Mario and Gastaldi, Juan Luis and DuSell, Brian and Terilla, John and O'Donnell, Timothy J. and Cotterell, Ryan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {61391--61412},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/vieira25a/vieira25a.pdf},
  url       = {https://proceedings.mlr.press/v267/vieira25a.html}
}
Endnote
%0 Conference Paper
%T From Language Models over Tokens to Language Models over Characters
%A Tim Vieira
%A Benjamin Lebrun
%A Mario Giulianelli
%A Juan Luis Gastaldi
%A Brian DuSell
%A John Terilla
%A Timothy J. O’Donnell
%A Ryan Cotterell
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-vieira25a
%I PMLR
%P 61391--61412
%U https://proceedings.mlr.press/v267/vieira25a.html
%V 267
APA
Vieira, T., Lebrun, B., Giulianelli, M., Gastaldi, J.L., DuSell, B., Terilla, J., O’Donnell, T.J. & Cotterell, R. (2025). From Language Models over Tokens to Language Models over Characters. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:61391-61412. Available from https://proceedings.mlr.press/v267/vieira25a.html.
