Optimal Inference Schedules for Masked Diffusion Models

Sitan Chen, Kevin Cong, Jerry Li
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1279-1311, 2026.

Abstract

A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called \emph{diffusion language models}, of which the \emph{masked diffusion model} (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work in Li and Cai (2025) obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, \emph{exact} characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of \emph{univariate function approximation}. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong \emph{a priori} knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its \emph{total correlation} and \emph{dual total correlation}, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-chen26e, title = {Optimal Inference Schedules for Masked Diffusion Models}, author = {Chen, Sitan and Cong, Kevin and Li, Jerry}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {1279--1311}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/chen26e/chen26e.pdf}, url = {https://proceedings.mlr.press/v336/chen26e.html}, abstract = { A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called \emph{diffusion language models}, of which the \emph{masked diffusion model} (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work in Li and Cai (2025) obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, \emph{exact} characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of \emph{univariate function approximation}. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong \emph{a priori} knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its \emph{total correlation} and \emph{dual total correlation}, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length. } }
Endnote
%0 Conference Paper %T Optimal Inference Schedules for Masked Diffusion Models %A Sitan Chen %A Kevin Cong %A Jerry Li %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-chen26e %I PMLR %P 1279--1311 %U https://proceedings.mlr.press/v336/chen26e.html %V 336 %X A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called \emph{diffusion language models}, of which the \emph{masked diffusion model} (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work in Li and Cai (2025) obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, \emph{exact} characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of \emph{univariate function approximation}. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong \emph{a priori} knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its \emph{total correlation} and \emph{dual total correlation}, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.
APA
Chen, S., Cong, K. & Li, J.. (2026). Optimal Inference Schedules for Masked Diffusion Models. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:1279-1311 Available from https://proceedings.mlr.press/v336/chen26e.html.

Related Material