Estimation of Large Zipfian Distributions with Sort and Snap

Peter Matthew Jacobs, Anirban Bhattacharya, Debdeep Pati, Lekha Patel, Jeff M. Phillips
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:190-198, 2025.

Abstract

We study the estimation of Zipfian distributions under $L_1$ loss, and provide near minimax optimal bounds in several regimes. Specifically, we assume observations arrive from a known alphabet, and with a known decay rate parametrizing the Zipfian, but we do not know a priori which alphabet elements have larger probability than others. We present a novel Sort and Snap estimator, which uses the empirical proportions to sort the alphabet, and then snaps them to the associated term from the Zipfian distribution. For arbitrary decay rates and smaller alphabet sizes, as well as for large decay rates and large alphabet sizes, we show an exact or minor variant of this estimator is near minimax optimal and has exponential improvement over the standard empirical proportion estimator. However, for small decay rates and larger alphabet sizes a simulation study indicates the standard empirical proportion estimator is competitive with Sort and Snap procedures. In addition to providing nearly tight bounds for important high-dimensional estimation problems, we believe the Sort and Snap estimator, and its analysis, will have independent interest.

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-jacobs25a, title = {Estimation of Large Zipfian Distributions with Sort and Snap}, author = {Jacobs, Peter Matthew and Bhattacharya, Anirban and Pati, Debdeep and Patel, Lekha and Phillips, Jeff M.}, booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics}, pages = {190--198}, year = {2025}, editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz}, volume = {258}, series = {Proceedings of Machine Learning Research}, month = {03--05 May}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/jacobs25a/jacobs25a.pdf}, url = {https://proceedings.mlr.press/v258/jacobs25a.html}, abstract = {We study the estimation of Zipfian distributions under $L_1$ loss, and provide near minimax optimal bounds in several regimes. Specifically, we assume observations arrive from a known alphabet, and with a known decay rate parametrizing the Zipfian, but we do not know a priori which alphabet elements have larger probability than others. We present a novel Sort and Snap estimator, which uses the empirical proportions to sort the alphabet, and then snaps them to the associated term from the Zipfian distribution. For arbitrary decay rates and smaller alphabet sizes, as well as for large decay rates and large alphabet sizes, we show an exact or minor variant of this estimator is near minimax optimal and has exponential improvement over the standard empirical proportion estimator. However, for small decay rates and larger alphabet sizes a simulation study indicates the standard empirical proportion estimator is competitive with Sort and Snap procedures. In addition to providing nearly tight bounds for important high-dimensional estimation problems, we believe the Sort and Snap estimator, and its analysis, will have independent interest.} }
Endnote
%0 Conference Paper %T Estimation of Large Zipfian Distributions with Sort and Snap %A Peter Matthew Jacobs %A Anirban Bhattacharya %A Debdeep Pati %A Lekha Patel %A Jeff M. Phillips %B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2025 %E Yingzhen Li %E Stephan Mandt %E Shipra Agrawal %E Emtiyaz Khan %F pmlr-v258-jacobs25a %I PMLR %P 190--198 %U https://proceedings.mlr.press/v258/jacobs25a.html %V 258 %X We study the estimation of Zipfian distributions under $L_1$ loss, and provide near minimax optimal bounds in several regimes. Specifically, we assume observations arrive from a known alphabet, and with a known decay rate parametrizing the Zipfian, but we do not know a priori which alphabet elements have larger probability than others. We present a novel Sort and Snap estimator, which uses the empirical proportions to sort the alphabet, and then snaps them to the associated term from the Zipfian distribution. For arbitrary decay rates and smaller alphabet sizes, as well as for large decay rates and large alphabet sizes, we show an exact or minor variant of this estimator is near minimax optimal and has exponential improvement over the standard empirical proportion estimator. However, for small decay rates and larger alphabet sizes a simulation study indicates the standard empirical proportion estimator is competitive with Sort and Snap procedures. In addition to providing nearly tight bounds for important high-dimensional estimation problems, we believe the Sort and Snap estimator, and its analysis, will have independent interest.
APA
Jacobs, P.M., Bhattacharya, A., Pati, D., Patel, L. & Phillips, J.M.. (2025). Estimation of Large Zipfian Distributions with Sort and Snap. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:190-198 Available from https://proceedings.mlr.press/v258/jacobs25a.html.

Related Material