CountSketches, Feature Hashing and the Median of Three

Kasper Green Larsen; Rasmus Pagh; Jakub Tětek

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6011-6020, 2021.

Abstract

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$.
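The construction described in the abstract can be illustrated with a small sketch. This is a toy version under stated assumptions, not the authors' implementation: the class name `CountSketch`, its method names, and the use of stored, fully random bucket and sign tables are all choices made here for brevity (practical implementations typically use hash functions instead of explicit tables). It keeps $2t-1$ independent rows of $s$ counters each, so the sketch has dimension $(2t-1)s$, and estimates a coordinate as the median of the per-row estimates.

```python
import random
import statistics


class CountSketch:
    """Toy CountSketch: 2t-1 independent rows, each with s counters."""

    def __init__(self, dim, s, t, seed=0):
        rng = random.Random(seed)
        self.rows = 2 * t - 1
        self.s = s
        # Assumption for illustration: explicit random tables stand in for
        # the bucket and sign hash functions of each row.
        self.bucket = [[rng.randrange(s) for _ in range(dim)]
                       for _ in range(self.rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(dim)]
                     for _ in range(self.rows)]
        self.table = [[0.0] * s for _ in range(self.rows)]

    def update(self, i, value):
        """Add `value` to coordinate i of the sketched vector."""
        for r in range(self.rows):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * value

    def estimate(self, i):
        """Median of the 2t-1 per-row estimates of coordinate i."""
        return statistics.median(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )


# Usage: sketch a vector with a single nonzero coordinate.
sketch = CountSketch(dim=1000, s=64, t=2)   # t = 2 gives the median of three
sketch.update(7, 5.0)
print(sketch.estimate(7))                   # prints 5.0 (no other coordinate collides)
```

With $t = 2$ the estimator is the median of three row estimates, which is the small-constant-$t$ setting the abstract refers to. For vectors with many nonzero coordinates the estimate is no longer exact, and its error depends on collisions within each row.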

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-larsen21a,
  title = 	 {CountSketches, Feature Hashing and the Median of Three},
  author =       {Larsen, Kasper Green and Pagh, Rasmus and T{\v{e}}tek, Jakub},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {6011--6020},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/larsen21a/larsen21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/larsen21a.html},
  abstract = 	 {In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$.}
}

Endnote

%0 Conference Paper
%T CountSketches, Feature Hashing and the Median of Three
%A Kasper Green Larsen
%A Rasmus Pagh
%A Jakub Tětek
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-larsen21a
%I PMLR
%P 6011--6020
%U https://proceedings.mlr.press/v139/larsen21a.html
%V 139
%X In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$.

APA


Larsen, K.G., Pagh, R. & Tětek, J. (2021). CountSketches, Feature Hashing and the Median of Three. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6011-6020. Available from https://proceedings.mlr.press/v139/larsen21a.html.
