Language Models over Canonical Byte-Pair Encodings

Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian Dusell, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O’Donnell, Ryan Cotterell
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:61413-61443, 2025.

Abstract

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
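To make the distinction concrete, here is a minimal Python sketch (not from the paper; it assumes the tiktoken package and GPT-2's byte-level BPE) contrasting the canonical tokenization of a string with one of the many noncanonical token strings that decode to the same characters:

```python
import tiktoken  # assumed dependency: pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's byte-level BPE tokenizer

text = " hello world"

# Canonical encoding: the unique token string the deterministic BPE
# tokenizer produces, and hence the only encoding of this string that
# can ever appear in training data.
canonical = enc.encode(text)

# One of exponentially many noncanonical encodings: tokenize each
# character separately. It decodes to the same character string, but the
# deterministic tokenizer would never emit it.
noncanonical = [tok for ch in text for tok in enc.encode(ch)]

assert enc.decode(canonical) == text
assert enc.decode(noncanonical) == text
assert canonical != noncanonical  # same string, different token strings
print(canonical)
print(noncanonical)
```

An unconstrained token-level model assigns positive probability to both token strings; the paper's two approaches (conditioning at test time, or a canonical-by-construction parameterization) ensure that only the canonical one receives probability mass.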

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-vieira25b,
  title     = {Language Models over Canonical Byte-Pair Encodings},
  author    = {Vieira, Tim and Liu, Tianyu and Pasti, Clemente and Emara, Yahya and Dusell, Brian and Lebrun, Benjamin and Giulianelli, Mario and Gastaldi, Juan Luis and O'Donnell, Timothy J. and Cotterell, Ryan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {61413--61443},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/vieira25b/vieira25b.pdf},
  url       = {https://proceedings.mlr.press/v267/vieira25b.html},
  abstract  = {Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.}
}
Endnote
%0 Conference Paper
%T Language Models over Canonical Byte-Pair Encodings
%A Tim Vieira
%A Tianyu Liu
%A Clemente Pasti
%A Yahya Emara
%A Brian Dusell
%A Benjamin Lebrun
%A Mario Giulianelli
%A Juan Luis Gastaldi
%A Timothy J. O’Donnell
%A Ryan Cotterell
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-vieira25b
%I PMLR
%P 61413--61443
%U https://proceedings.mlr.press/v267/vieira25b.html
%V 267
%X Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
APA
Vieira, T., Liu, T., Pasti, C., Emara, Y., Dusell, B., Lebrun, B., Giulianelli, M., Gastaldi, J.L., O’Donnell, T.J. & Cotterell, R. (2025). Language Models over Canonical Byte-Pair Encodings. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:61413-61443. Available from https://proceedings.mlr.press/v267/vieira25b.html.