A Bayesian nonparametric approach to count-min sketch under power-law data streams

Emanuele Dolera, Stefano Favaro, Stefano Peluchetti
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:226-234, 2021.

Abstract

The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens’ frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token’s frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-dolera21a, title = { A Bayesian nonparametric approach to count-min sketch under power-law data streams }, author = {Dolera, Emanuele and Favaro, Stefano and Peluchetti, Stefano}, booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics}, pages = {226--234}, year = {2021}, editor = {Banerjee, Arindam and Fukumizu, Kenji}, volume = {130}, series = {Proceedings of Machine Learning Research}, month = {13--15 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v130/dolera21a/dolera21a.pdf}, url = {https://proceedings.mlr.press/v130/dolera21a.html}, abstract = { The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens’ frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token’s frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data. } }
Endnote
%0 Conference Paper %T A Bayesian nonparametric approach to count-min sketch under power-law data streams %A Emanuele Dolera %A Stefano Favaro %A Stefano Peluchetti %B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2021 %E Arindam Banerjee %E Kenji Fukumizu %F pmlr-v130-dolera21a %I PMLR %P 226--234 %U https://proceedings.mlr.press/v130/dolera21a.html %V 130 %X The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens’ frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token’s frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data.
APA
Dolera, E., Favaro, S. & Peluchetti, S.. (2021). A Bayesian nonparametric approach to count-min sketch under power-law data streams . Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:226-234 Available from https://proceedings.mlr.press/v130/dolera21a.html.

Related Material