A simple sketching algorithm for entropy estimation over streaming data

Peter Clifford, Ioana Cosma
Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, PMLR 31:196-206, 2013.

Abstract

We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to αare lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as αapproaches 1: the maximally skewed, strictly stable distribution with α= 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ> 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ= 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.

Cite this Paper


BibTeX
@InProceedings{pmlr-v31-clifford13a, title = {A simple sketching algorithm for entropy estimation over streaming data}, author = {Clifford, Peter and Cosma, Ioana}, booktitle = {Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics}, pages = {196--206}, year = {2013}, editor = {Carvalho, Carlos M. and Ravikumar, Pradeep}, volume = {31}, series = {Proceedings of Machine Learning Research}, address = {Scottsdale, Arizona, USA}, month = {29 Apr--01 May}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v31/clifford13a.pdf}, url = {https://proceedings.mlr.press/v31/clifford13a.html}, abstract = {We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to αare lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as αapproaches 1: the maximally skewed, strictly stable distribution with α= 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ> 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ= 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity. } }
Endnote
%0 Conference Paper %T A simple sketching algorithm for entropy estimation over streaming data %A Peter Clifford %A Ioana Cosma %B Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2013 %E Carlos M. Carvalho %E Pradeep Ravikumar %F pmlr-v31-clifford13a %I PMLR %P 196--206 %U https://proceedings.mlr.press/v31/clifford13a.html %V 31 %X We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to αare lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as αapproaches 1: the maximally skewed, strictly stable distribution with α= 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ> 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ= 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
RIS
TY - CPAPER TI - A simple sketching algorithm for entropy estimation over streaming data AU - Peter Clifford AU - Ioana Cosma BT - Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics DA - 2013/04/29 ED - Carlos M. Carvalho ED - Pradeep Ravikumar ID - pmlr-v31-clifford13a PB - PMLR DP - Proceedings of Machine Learning Research VL - 31 SP - 196 EP - 206 L1 - http://proceedings.mlr.press/v31/clifford13a.pdf UR - https://proceedings.mlr.press/v31/clifford13a.html AB - We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking alpha sufficiently close to 1. However, practical guidelines for parameter calibration with respect to αare lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as αapproaches 1: the maximally skewed, strictly stable distribution with α= 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ> 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ= 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity. ER -
APA
Clifford, P. & Cosma, I.. (2013). A simple sketching algorithm for entropy estimation over streaming data. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 31:196-206 Available from https://proceedings.mlr.press/v31/clifford13a.html.

Related Material