A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data

Ping Li; Cun-Hui Zhang

A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data

Ping Li, Cun-Hui Zhang

Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:477-496, 2011.

Abstract

Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\delta= |1-\alpha|\approx 0$. To guarantee that the error of estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) in Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The followup work (Li, 2009b) provides a practical algorithm, which is however difficult to analyze theoretically. In this paper, we propose a new accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.

Cite this Paper

BibTeX


@InProceedings{pmlr-v19-li11a,
  title = 	 {A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data},
  author = 	 {Li, Ping and Zhang, Cun-Hui},
  booktitle = 	 {Proceedings of the 24th Annual Conference on Learning Theory},
  pages = 	 {477--496},
  year = 	 {2011},
  editor = 	 {Kakade, Sham M. and von Luxburg, Ulrike},
  volume = 	 {19},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Budapest, Hungary},
  month = 	 {09--11 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v19/li11a/li11a.pdf},
  url = 	 {https://proceedings.mlr.press/v19/li11a.html},
  abstract = 	 {Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining.  To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th  moment with $\delta= |1-\alpha|\approx 0$. To guarantee that the error of estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) in Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still  expensive. The followup work (Li, 2009b) provides a  practical algorithm, which is however  difficult to analyze theoretically. In this paper, we propose a new   accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$.  In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is  close to $100\%$  statistically optimal.  An empirical study is  conducted to verify the accuracy of our algorithm.}
}

Endnote

%0 Conference Paper
%T A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
%A Ping Li
%A Cun-Hui Zhang
%B Proceedings of the 24th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2011
%E Sham M. Kakade
%E Ulrike von Luxburg	
%F pmlr-v19-li11a
%I PMLR
%P 477--496
%U https://proceedings.mlr.press/v19/li11a.html
%V 19
%X Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining.  To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th  moment with $\delta= |1-\alpha|\approx 0$. To guarantee that the error of estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) in Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still  expensive. The followup work (Li, 2009b) provides a  practical algorithm, which is however  difficult to analyze theoretically. In this paper, we propose a new   accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$.  In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is  close to $100\%$  statistically optimal.  An empirical study is  conducted to verify the accuracy of our algorithm.

RIS


TY  - CPAPER
TI  - A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
AU  - Ping Li
AU  - Cun-Hui Zhang
BT  - Proceedings of the 24th Annual Conference on Learning Theory
DA  - 2011/12/21
ED  - Sham M. Kakade
ED  - Ulrike von Luxburg	
ID  - pmlr-v19-li11a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 19
SP  - 477
EP  - 496
L1  - http://proceedings.mlr.press/v19/li11a/li11a.pdf
UR  - https://proceedings.mlr.press/v19/li11a.html
AB  - Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining.  To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th  moment with $\delta= |1-\alpha|\approx 0$. To guarantee that the error of estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) in Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still  expensive. The followup work (Li, 2009b) provides a  practical algorithm, which is however  difficult to analyze theoretically. In this paper, we propose a new   accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$.  In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is  close to $100\%$  statistically optimal.  An empirical study is  conducted to verify the accuracy of our algorithm.
ER  -

APA


Li, P. & Zhang, C.. (2011). A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data. Proceedings of the 24th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 19:477-496 Available from https://proceedings.mlr.press/v19/li11a.html.

Related Material

Download PDF