Flexible and Efficient Grammar-Constrained Decoding

Kanghee Park, Timothy Zhou, Loris D’Antoni
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:48262-48275, 2025.

Abstract

Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can “align” with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
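
The abstract describes the core GCD mechanism: at each decoding step, any token whose continuation provably leaves the grammar is masked out before sampling. The Python sketch below illustrates that idea only; it is not the paper's algorithm (which additionally handles the alignment between subword tokens and grammar tokens), and the ToyGrammar prefix recognizer and mask_logits helper are hypothetical names introduced for illustration.

import math

class ToyGrammar:
    # Hypothetical stand-in for a CFG prefix recognizer: accepts every
    # string that is a prefix of some balanced-parentheses word.
    def accepts_prefix(self, s):
        depth = 0
        for ch in s:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:      # closed more than opened: unrecoverable
                    return False
            else:                  # symbol outside the grammar's alphabet
                return False
        return True                # can still be extended to a valid word

def mask_logits(logits, prefix, vocab, grammar):
    # Set the logit of every token that provably leads outside the
    # grammar to -inf, so that token can never be sampled.
    out = list(logits)
    for i, tok in enumerate(vocab):
        if not grammar.accepts_prefix(prefix + tok):
            out[i] = -math.inf
    return out

vocab = ["(", ")", "()", "(("]
logits = [0.1, 0.9, 0.4, 0.2]
print(mask_logits(logits, "", vocab, ToyGrammar()))
# -> [0.1, -inf, 0.4, 0.2]: ")" is masked, since no balanced word starts
#    with a closing parenthesis.

The expensive part that the paper targets is the offline step this sketch glosses over: precomputing, for a real subword vocabulary and a real CFG, which tokens can follow which grammar states, so that the online mask at each step is cheap to compute.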

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-park25l,
  title     = {Flexible and Efficient Grammar-Constrained Decoding},
  author    = {Park, Kanghee and Zhou, Timothy and D'Antoni, Loris},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {48262--48275},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/park25l/park25l.pdf},
  url       = {https://proceedings.mlr.press/v267/park25l.html},
  abstract  = {Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can “align” with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.}
}
Endnote
%0 Conference Paper
%T Flexible and Efficient Grammar-Constrained Decoding
%A Kanghee Park
%A Timothy Zhou
%A Loris D’Antoni
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-park25l
%I PMLR
%P 48262--48275
%U https://proceedings.mlr.press/v267/park25l.html
%V 267
%X Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can “align” with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
APA
Park, K., Zhou, T. & D’Antoni, L. (2025). Flexible and Efficient Grammar-Constrained Decoding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:48262-48275. Available from https://proceedings.mlr.press/v267/park25l.html.
