Randomness Efficient Feature Hashing for Sparse Binary Data

Rameshwar Pratap, Karthik Revanuru, Anirudh Ravi, Raghav Kulkarni
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:689-704, 2020.

Abstract

We present sketching algorithms for sparse binary datasets that maintain a binary version of the dataset after sketching, while simultaneously preserving multiple similarity measures such as Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance on the same sketch. A major advantage of our algorithms is that they are randomness efficient: they require significantly fewer random bits for sketching, logarithmic in the dimension, whereas competing algorithms require a number of bits linear in the dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be deployed efficiently in a distributed setting. We present a theoretical analysis of our approach and complement it with extensive experiments on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset. We verify this assumption empirically and observe that it holds on several real-world datasets.
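To illustrate the general idea of binary-preserving feature hashing described above, the following is a minimal generic sketch: each input coordinate is hashed to one of a small number of buckets, and bits landing in the same bucket are OR-ed together, so the sketch stays binary. This is a hedged illustration of the general technique only, not the paper's randomness-efficient construction (which uses far fewer random bits than the full bucket map drawn here); the function names and parameters are hypothetical.

```python
import random

def binary_sketch(x, dim, sketch_dim, seed=0):
    """Compress a sparse binary vector into a smaller binary vector.

    x: set of indices with value 1 (sparse representation).
    dim: original dimension; sketch_dim: compressed dimension.
    Note: this draws a random bucket for every coordinate, i.e. it uses
    O(dim log sketch_dim) random bits -- the paper's point is that this
    randomness budget can be reduced to logarithmic in the dimension.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(sketch_dim) for _ in range(dim)]
    out = [0] * sketch_dim
    for i in x:
        out[bucket[i]] = 1  # OR-aggregation keeps the sketch binary
    return out

def jaccard(a, b):
    """Jaccard similarity of two equal-length binary vectors."""
    inter = sum(1 for u, v in zip(a, b) if u and v)
    union = sum(1 for u, v in zip(a, b) if u or v)
    return inter / union if union else 0.0
```

Because the sketch is itself binary, the same similarity estimators (Jaccard, Hamming, inner product) can be applied directly to the compressed vectors, which is the property the abstract emphasizes.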

Cite this Paper


BibTeX
@InProceedings{pmlr-v129-pratap20a,
  title     = {Randomness Efficient Feature Hashing for Sparse Binary Data},
  author    = {Pratap, Rameshwar and Revanuru, Karthik and Ravi, Anirudh and Kulkarni, Raghav},
  booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
  pages     = {689--704},
  year      = {2020},
  editor    = {Pan, Sinno Jialin and Sugiyama, Masashi},
  volume    = {129},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--20 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v129/pratap20a/pratap20a.pdf},
  url       = {https://proceedings.mlr.press/v129/pratap20a.html},
  abstract  = {We present sketching algorithms for sparse binary datasets, which maintain binary version of the dataset after sketching, while simultaneously preserving multiple similarity measures such as Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance, on the same sketch. A major advantage of our algorithms is that they are randomness efficient, and require significantly less number of random bits for sketching – logarithmic in dimension, while other competitive algorithms require linear in dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be efficiently deployed in a distributive setting. We present a theoretical analysis of our approach and complement them with extensive experimentations on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset. We empirically verify the assumption and notice that it holds on several real-world datasets.}
}
Endnote
%0 Conference Paper
%T Randomness Efficient Feature Hashing for Sparse Binary Data
%A Rameshwar Pratap
%A Karthik Revanuru
%A Anirudh Ravi
%A Raghav Kulkarni
%B Proceedings of The 12th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Sinno Jialin Pan
%E Masashi Sugiyama
%F pmlr-v129-pratap20a
%I PMLR
%P 689--704
%U https://proceedings.mlr.press/v129/pratap20a.html
%V 129
%X We present sketching algorithms for sparse binary datasets, which maintain binary version of the dataset after sketching, while simultaneously preserving multiple similarity measures such as Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance, on the same sketch. A major advantage of our algorithms is that they are randomness efficient, and require significantly less number of random bits for sketching – logarithmic in dimension, while other competitive algorithms require linear in dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be efficiently deployed in a distributive setting. We present a theoretical analysis of our approach and complement them with extensive experimentations on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset. We empirically verify the assumption and notice that it holds on several real-world datasets.
APA
Pratap, R., Revanuru, K., Ravi, A. & Kulkarni, R. (2020). Randomness Efficient Feature Hashing for Sparse Binary Data. Proceedings of The 12th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 129:689-704. Available from https://proceedings.mlr.press/v129/pratap20a.html.