On Densification for Minwise Hashing

Tung Mai, Anup Rao, Matt Kapilevich, Ryan Rossi, Yasin Abbasi-Yadkori, Ritwik Sinha
Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:831-840, 2020.

Abstract

One Permutation Hashing (OPH) is a significantly more efficient alternative to the popular minwise hashing. To produce a sketch of size $k$, OPH requires just one hash function whereas the classical minwise hashing requires $k$ hash functions. However, OPH does not have the desirable locality sensitive hashing (LSH) property that is important for indexing. [Srivastava and Li 2014, ICML] proposed the novel idea of densification to produce LSH sketches from OPH sketches, and gave the first densification routine. In this paper, we give a necessary and sufficient condition for a densification routine to result in LSH sketches when applied to OPH sketches. Furthermore, we give a novel densification routine that for every input, takes O($k \log k$) time in expectation and achieves better variance than the previous best bound obtained by [Srivastava 2017]. The running time of the densification routine given in [Srivastava 2017] for worst case inputs is O($k^2$) in expectation.

Cite this Paper


BibTeX
@InProceedings{pmlr-v115-mai20a, title = {On Densification for Minwise Hashing}, author = {Mai, Tung and Rao, Anup and Kapilevich, Matt and Rossi, Ryan and Abbasi-Yadkori, Yasin and Sinha, Ritwik}, booktitle = {Proceedings of The 35th Uncertainty in Artificial Intelligence Conference}, pages = {831--840}, year = {2020}, editor = {Adams, Ryan P. and Gogate, Vibhav}, volume = {115}, series = {Proceedings of Machine Learning Research}, month = {22--25 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v115/mai20a/mai20a.pdf}, url = {https://proceedings.mlr.press/v115/mai20a.html}, abstract = {One Permutation Hashing (OPH) is a significantly more efficient alternative to the popular minwise hashing. To produce a sketch of size $k$, OPH requires just one hash function whereas the classical minwise hashing requires $k$ hash functions. However, OPH does not have the desirable locality sensitive hashing (LSH) property that is important for indexing. [Srivastava and Li 2014, ICML] proposed the novel idea of densification to produce LSH sketches from OPH sketches, and gave the first densification routine. In this paper, we give a necessary and sufficient condition for a densification routine to result in LSH sketches when applied to OPH sketches. Furthermore, we give a novel densification routine that for every input, takes O($k \log k$) time in expectation and achieves better variance than the previous best bound obtained by [Srivastava 2017]. The running time of the densification routine given in [Srivastava 2017] for worst case inputs is O($k^2$) in expectation.} }
Endnote
%0 Conference Paper %T On Densification for Minwise Hashing %A Tung Mai %A Anup Rao %A Matt Kapilevich %A Ryan Rossi %A Yasin Abbasi-Yadkori %A Ritwik Sinha %B Proceedings of The 35th Uncertainty in Artificial Intelligence Conference %C Proceedings of Machine Learning Research %D 2020 %E Ryan P. Adams %E Vibhav Gogate %F pmlr-v115-mai20a %I PMLR %P 831--840 %U https://proceedings.mlr.press/v115/mai20a.html %V 115 %X One Permutation Hashing (OPH) is a significantly more efficient alternative to the popular minwise hashing. To produce a sketch of size $k$, OPH requires just one hash function whereas the classical minwise hashing requires $k$ hash functions. However, OPH does not have the desirable locality sensitive hashing (LSH) property that is important for indexing. [Srivastava and Li 2014, ICML] proposed the novel idea of densification to produce LSH sketches from OPH sketches, and gave the first densification routine. In this paper, we give a necessary and sufficient condition for a densification routine to result in LSH sketches when applied to OPH sketches. Furthermore, we give a novel densification routine that for every input, takes O($k \log k$) time in expectation and achieves better variance than the previous best bound obtained by [Srivastava 2017]. The running time of the densification routine given in [Srivastava 2017] for worst case inputs is O($k^2$) in expectation.
APA
Mai, T., Rao, A., Kapilevich, M., Rossi, R., Abbasi-Yadkori, Y. & Sinha, R.. (2020). On Densification for Minwise Hashing. Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, in Proceedings of Machine Learning Research 115:831-840 Available from https://proceedings.mlr.press/v115/mai20a.html.

Related Material