Scaling up Simhash

Rameshwar Pratap, Anup Deshmukh, Pratheeksha Nair, Anirudh Ravi
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:705-720, 2020.

Abstract

The seminal work of Charikar (2002) gives a space-efficient sketching algorithm (Simhash) that compresses real-valued vectors to binary vectors while maintaining an estimate of the cosine similarity between any pair of the original real-valued vectors. In this work, we propose a sketching algorithm, Simsketch, that can be applied on top of the results obtained from Simhash. This further reduces the data dimension while maintaining an estimate of the cosine similarity between the original real-valued vectors. As a consequence, it helps in scaling up the performance of Simhash. We present theoretical bounds for our result and complement them with experiments on public datasets. Our proposed algorithm is simple and efficient, and can therefore be adopted in practice.
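As a rough illustration of the Simhash primitive the paper builds on (not the authors' Simsketch extension, whose details are not given here), the following NumPy sketch compresses real-valued vectors to the sign bits of random projections and recovers a cosine-similarity estimate from the Hamming distance between sketches. The dimensions and sketch length below are arbitrary choices for the example.

```python
import numpy as np

def simhash(x, planes):
    # One bit per random hyperplane: the sign of the projection of x.
    return (planes @ x >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
d, k = 100, 2000                       # original dimension, sketch length in bits
planes = rng.standard_normal((k, d))   # k random hyperplanes, shared by all vectors

u = rng.standard_normal(d)
v = u + 0.3 * rng.standard_normal(d)   # a vector similar to u

hu, hv = simhash(u, planes), simhash(v, planes)

# For Simhash, Pr[bit differs] = theta/pi where theta is the angle between
# u and v, so the Hamming distance yields an angle (hence cosine) estimate.
theta_hat = np.pi * np.mean(hu != hv)
est = np.cos(theta_hat)
true = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

With k = 2000 bits, `est` typically lands within a few hundredths of the true cosine similarity, at a fraction of the storage of the original real-valued vectors.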

Cite this Paper


BibTeX
@InProceedings{pmlr-v129-pratap20b,
  title     = {Scaling up Simhash},
  author    = {Pratap, Rameshwar and Deshmukh, Anup and Nair, Pratheeksha and Ravi, Anirudh},
  booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
  pages     = {705--720},
  year      = {2020},
  editor    = {Pan, Sinno Jialin and Sugiyama, Masashi},
  volume    = {129},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--20 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v129/pratap20b/pratap20b.pdf},
  url       = {https://proceedings.mlr.press/v129/pratap20b.html},
  abstract  = {The seminal work of (Charikar, 2002) gives a space efficient sketching algorithm (Simhash) which compresses real-valued vectors to binary vectors while maintaining an estimate of the Cosine similarity between any pairs of original real-valued vectors. In this work, we propose a sketching algorithm -- Simsketch -- that can be applied on top of the results obtained from Simhash. This further reduces the data dimension while maintaining an estimate of the Cosine similarity between original real-valued vectors. As a consequence, it helps in scaling up the performance of Simhash. We present theoretical bounds of our result and complement it with experimentation on public datasets. Our proposed algorithm is simple, efficient, and therefore can be adopted in practice.}
}
Endnote
%0 Conference Paper
%T Scaling up Simhash
%A Rameshwar Pratap
%A Anup Deshmukh
%A Pratheeksha Nair
%A Anirudh Ravi
%B Proceedings of The 12th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Sinno Jialin Pan
%E Masashi Sugiyama
%F pmlr-v129-pratap20b
%I PMLR
%P 705--720
%U https://proceedings.mlr.press/v129/pratap20b.html
%V 129
%X The seminal work of (Charikar, 2002) gives a space efficient sketching algorithm (Simhash) which compresses real-valued vectors to binary vectors while maintaining an estimate of the Cosine similarity between any pairs of original real-valued vectors. In this work, we propose a sketching algorithm – Simsketch – that can be applied on top of the results obtained from Simhash. This further reduces the data dimension while maintaining an estimate of the Cosine similarity between original real-valued vectors. As a consequence, it helps in scaling up the performance of Simhash. We present theoretical bounds of our result and complement it with experimentation on public datasets. Our proposed algorithm is simple, efficient, and therefore can be adopted in practice.
APA
Pratap, R., Deshmukh, A., Nair, P. & Ravi, A. (2020). Scaling up Simhash. Proceedings of The 12th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 129:705-720. Available from https://proceedings.mlr.press/v129/pratap20b.html.