A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data

Ping Li, Cun-Hui Zhang
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:477-496, 2011.

Abstract

Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\Delta = |1-\alpha|\approx 0$. To guarantee that the error of the estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) on Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The follow-up work (Li, 2009b) provides a practical algorithm, which, however, is difficult to analyze theoretically. In this paper, we propose a new, accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm attains an upper bound given by the Fisher information and is in fact close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.
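The reduction from entropy to moments rests on a standard identity: the Rényi entropy $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ converges to the Shannon entropy as $\alpha \to 1$, so an accurate estimate of the $\alpha$-th moment with $\Delta = |1-\alpha|$ small yields an accurate entropy estimate. The sketch below is only a minimal numerical illustration of this limit on a toy distribution with exact probabilities; it is not the paper's streaming algorithm, and the distribution `p` is an arbitrary example.

```python
import math

# Toy distribution (e.g., normalized counts from a data stream).
p = [0.5, 0.25, 0.125, 0.125]

def shannon_entropy(p):
    """H = -sum_i p_i log p_i (natural log, i.e., nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """H_alpha = log(sum_i p_i^alpha) / (1 - alpha), for alpha != 1.

    As alpha -> 1 (Delta = |1 - alpha| -> 0), H_alpha -> Shannon entropy.
    """
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

H = shannon_entropy(p)
for delta in (0.1, 0.01, 0.001):  # Delta = |1 - alpha|
    alpha = 1.0 + delta
    err = abs(renyi_entropy(p, alpha) - H)
    print(f"alpha = {alpha:.3f}: |H_alpha - H| = {err:.6f}")
```

The printed gap shrinks roughly linearly in $\Delta$, which is why the estimation problem concentrates on moments with $\alpha$ very close to $1$, and why the dependence of the sample complexity on $\Delta$ matters so much.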

Cite this Paper


BibTeX
@InProceedings{pmlr-v19-li11a,
  title     = {A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data},
  author    = {Li, Ping and Zhang, Cun-Hui},
  booktitle = {Proceedings of the 24th Annual Conference on Learning Theory},
  pages     = {477--496},
  year      = {2011},
  editor    = {Kakade, Sham M. and von Luxburg, Ulrike},
  volume    = {19},
  series    = {Proceedings of Machine Learning Research},
  address   = {Budapest, Hungary},
  month     = {09--11 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v19/li11a/li11a.pdf},
  url       = {https://proceedings.mlr.press/v19/li11a.html},
  abstract  = {Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\Delta = |1-\alpha|\approx 0$. To guarantee that the error of the estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) on Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The follow-up work (Li, 2009b) provides a practical algorithm, which, however, is difficult to analyze theoretically. In this paper, we propose a new, accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm attains an upper bound given by the Fisher information and is in fact close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.}
}
Endnote
%0 Conference Paper
%T A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
%A Ping Li
%A Cun-Hui Zhang
%B Proceedings of the 24th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2011
%E Sham M. Kakade
%E Ulrike von Luxburg
%F pmlr-v19-li11a
%I PMLR
%P 477--496
%U https://proceedings.mlr.press/v19/li11a.html
%V 19
%X Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\Delta = |1-\alpha|\approx 0$. To guarantee that the error of the estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) on Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The follow-up work (Li, 2009b) provides a practical algorithm, which, however, is difficult to analyze theoretically. In this paper, we propose a new, accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm attains an upper bound given by the Fisher information and is in fact close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.
RIS
TY - CPAPER
TI - A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
AU - Ping Li
AU - Cun-Hui Zhang
BT - Proceedings of the 24th Annual Conference on Learning Theory
DA - 2011/12/21
ED - Sham M. Kakade
ED - Ulrike von Luxburg
ID - pmlr-v19-li11a
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 19
SP - 477
EP - 496
L1 - http://proceedings.mlr.press/v19/li11a/li11a.pdf
UR - https://proceedings.mlr.press/v19/li11a.html
AB - Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\Delta = |1-\alpha|\approx 0$. To guarantee that the error of the estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) on Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The follow-up work (Li, 2009b) provides a practical algorithm, which, however, is difficult to analyze theoretically. In this paper, we propose a new, accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm attains an upper bound given by the Fisher information and is in fact close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.
ER -
APA
Li, P. &amp; Zhang, C. (2011). A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data. Proceedings of the 24th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 19:477-496. Available from https://proceedings.mlr.press/v19/li11a.html.