Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability

Dirk van der Hoeven, Julia Olkhovskaya, Tim van Erven
Proceedings of The 37th International Conference on Algorithmic Learning Theory, PMLR 313:1-38, 2026.

Abstract

We consider the fundamental problem of estimating a discrete distribution on a domain of size $K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big) /n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big) /n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big) /n$, i.e. without the $\ln K$ multiplying $\ln (1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by the maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and $\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.

Cite this Paper


BibTeX
@InProceedings{pmlr-v313-hoeven26a, title = {Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability}, author = {van der Hoeven, Dirk and Olkhovskaya, Julia and van Erven, Tim}, booktitle = {Proceedings of The 37th International Conference on Algorithmic Learning Theory}, pages = {1--38}, year = {2026}, editor = {Telgarsky, Matus and Ullman, Jonathan}, volume = {313}, series = {Proceedings of Machine Learning Research}, month = {23--26 Feb}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v313/main/assets/hoeven26a/hoeven26a.pdf}, url = {https://proceedings.mlr.press/v313/hoeven26a.html}, abstract = {We consider the fundamental problem of estimating a discrete distribution on a domain of size $K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big) /n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big) /n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big) /n$, i.e. without the $\ln K$ multiplying $\ln (1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by the maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and $\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.} }
Endnote
%0 Conference Paper %T Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability %A Dirk van der Hoeven %A Julia Olkhovskaya %A Tim van Erven %B Proceedings of The 37th International Conference on Algorithmic Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Matus Telgarsky %E Jonathan Ullman %F pmlr-v313-hoeven26a %I PMLR %P 1--38 %U https://proceedings.mlr.press/v313/hoeven26a.html %V 313 %X We consider the fundamental problem of estimating a discrete distribution on a domain of size $K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big) /n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big) /n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big) /n$, i.e. without the $\ln K$ multiplying $\ln (1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by the maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and $\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.
APA
van der Hoeven, D., Olkhovskaya, J. & van Erven, T.. (2026). Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability. Proceedings of The 37th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 313:1-38 Available from https://proceedings.mlr.press/v313/hoeven26a.html.

Related Material