A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:477-496, 2011.
Abstract
Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\Delta = |1-\alpha|\approx 0$. To guarantee that the error of the estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) on Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The follow-up work (Li, 2009b) provides a practical algorithm, which is, however, difficult to analyze theoretically. In this paper, we propose a new accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm attains an upper bound determined by the Fisher information, and in fact it is close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.
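As a quick illustration of the reduction the abstract relies on (a minimal sketch, not code from the paper): the Rényi entropy of order $\alpha$, $H_\alpha = \frac{1}{1-\alpha}\log\left(F_\alpha / F_1^\alpha\right)$ where $F_\alpha = \sum_i a_i^\alpha$ is the $\alpha$-th frequency moment, converges to the Shannon entropy as $\Delta = |1-\alpha| \to 0$. The sketch below computes the moments exactly on synthetic Zipf-distributed counts (a hypothetical stream, not data from the paper) to show this convergence; in the streaming setting, $F_\alpha$ would instead be estimated from stable random projections.

```python
import numpy as np

def shannon_entropy(p):
    """Exact Shannon entropy (natural log) of a probability vector p."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy_from_moment(a, alpha):
    """Renyi entropy H_alpha = log(F_alpha / F_1^alpha) / (1 - alpha),
    computed from the alpha-th frequency moment F_alpha = sum_i a_i^alpha."""
    F_alpha = np.sum(a ** alpha)
    F_1 = np.sum(a)
    return np.log(F_alpha / F_1 ** alpha) / (1.0 - alpha)

rng = np.random.default_rng(0)
a = rng.zipf(1.5, size=1000).astype(float)  # hypothetical stream counts
p = a / a.sum()

H = shannon_entropy(p)
for Delta in (0.1, 0.01, 0.001):
    alpha = 1.0 - Delta                     # Delta = |1 - alpha| -> 0
    print(f"Delta={Delta:<6} H_alpha={renyi_entropy_from_moment(a, alpha):.6f}  H={H:.6f}")
```

Running this shows $H_\alpha$ agreeing with $H$ to roughly $\Delta$-order accuracy, which is why small $\Delta$ is essential and why the sample complexity's dependence on $\Delta$ (from $1/\Delta^2$ down to $\Delta$-free in this paper) dominates the cost of entropy estimation.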