Composable Sketches for Functions of Frequencies: Beyond the Worst Case

Edith Cohen, Ofir Geri, Rasmus Pagh
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:2057-2067, 2020.

Abstract

Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and $p$th frequency moments with $p>2$, are known to require polynomial size sketches in the worst case. We explore performance beyond the worst case under two different types of assumptions. The first is having access to noisy \emph{advice} on item frequencies. This continues the line of work of Hsu et al. (ICLR 2019), who assume predictions are provided by a machine learning model. The second is providing guaranteed performance on a restricted class of input frequency distributions that are better aligned with what is observed in practice. This extends the work on heavy hitters under Zipfian distributions in a seminal paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and empirically that "in practice" small polylogarithmic-size sketches provide accuracy for "hard" functions.

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-cohen20a, title = {Composable Sketches for Functions of Frequencies: Beyond the Worst Case}, author = {Cohen, Edith and Geri, Ofir and Pagh, Rasmus}, booktitle = {Proceedings of the 37th International Conference on Machine Learning}, pages = {2057--2067}, year = {2020}, editor = {III, Hal Daumé and Singh, Aarti}, volume = {119}, series = {Proceedings of Machine Learning Research}, month = {13--18 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v119/cohen20a/cohen20a.pdf}, url = {https://proceedings.mlr.press/v119/cohen20a.html}, abstract = {Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and $p$th frequency moments with $p>2$, are known to require polynomial size sketches in the worst case. We explore performance beyond the worst case under two different types of assumptions. The first is having access to noisy \emph{advice} on item frequencies. This continues the line of work of Hsu et al. (ICLR 2019), who assume predictions are provided by a machine learning model. The second is providing guaranteed performance on a restricted class of input frequency distributions that are better aligned with what is observed in practice. This extends the work on heavy hitters under Zipfian distributions in a seminal paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and empirically that "in practice" small polylogarithmic-size sketches provide accuracy for "hard" functions.} }
Endnote
%0 Conference Paper %T Composable Sketches for Functions of Frequencies: Beyond the Worst Case %A Edith Cohen %A Ofir Geri %A Rasmus Pagh %B Proceedings of the 37th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2020 %E Hal Daumé III %E Aarti Singh %F pmlr-v119-cohen20a %I PMLR %P 2057--2067 %U https://proceedings.mlr.press/v119/cohen20a.html %V 119 %X Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and $p$th frequency moments with $p>2$, are known to require polynomial size sketches in the worst case. We explore performance beyond the worst case under two different types of assumptions. The first is having access to noisy \emph{advice} on item frequencies. This continues the line of work of Hsu et al. (ICLR 2019), who assume predictions are provided by a machine learning model. The second is providing guaranteed performance on a restricted class of input frequency distributions that are better aligned with what is observed in practice. This extends the work on heavy hitters under Zipfian distributions in a seminal paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and empirically that "in practice" small polylogarithmic-size sketches provide accuracy for "hard" functions.
APA
Cohen, E., Geri, O. & Pagh, R.. (2020). Composable Sketches for Functions of Frequencies: Beyond the Worst Case. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:2057-2067 Available from https://proceedings.mlr.press/v119/cohen20a.html.

Related Material