A simple sketching algorithm for entropy estimation over streaming data
Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, PMLR 31:196-206, 2013.
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to αare lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as αapproaches 1: the maximally skewed, strictly stable distribution with α= 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ> 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ= 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.