Randomness Efficient Feature Hashing for Sparse Binary Data

Rameshwar Pratap; Karthik Revanuru; Anirudh Ravi; Raghav Kulkarni

Randomness Efficient Feature Hashing for Sparse Binary Data

Rameshwar Pratap, Karthik Revanuru, Anirudh Ravi, Raghav Kulkarni

Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:689-704, 2020.

Abstract

We present sketching algorithms for sparse binary datasets, which maintain binary version of the dataset after sketching, while simultaneously preserving multiple similarity measures such as Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance, on the same sketch. A major advantage of our algorithms is that they are randomness efficient, and require significantly less number of random bits for sketching – logarithmic in dimension, while other competitive algorithms require linear in dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be efficiently deployed in a distributive setting. We present a theoretical analysis of our approach and complement them with extensive experimentations on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset. We empirically verify the assumption and notice that it holds on several real-world datasets.

Cite this Paper

BibTeX


@InProceedings{pmlr-v129-pratap20a,
  title = 	 {Randomness Efficient Feature Hashing for Sparse Binary Data},
  author =       {Pratap, Rameshwar and Revanuru, Karthik and Ravi, Anirudh and Kulkarni, Raghav},
  booktitle = 	 {Proceedings of The 12th Asian Conference on Machine Learning},
  pages = 	 {689--704},
  year = 	 {2020},
  editor = 	 {Pan, Sinno Jialin and Sugiyama, Masashi},
  volume = 	 {129},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--20 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v129/pratap20a/pratap20a.pdf},
  url = 	 {https://proceedings.mlr.press/v129/pratap20a.html},
  abstract = 	 {We present sketching algorithms for sparse binary datasets, which maintain    binary version of the dataset after sketching,   while simultaneously preserving multiple similarity measures such as Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance, on the same sketch. A major advantage of our algorithms is that
 they are randomness efficient, and require significantly less number of random bits for sketching – logarithmic in   dimension, while other competitive algorithms require linear in   dimension. Our proposed algorithms are   efficient,  offer a compact sketch of the dataset, and can be efficiently deployed in a distributive setting. We present a theoretical analysis of our approach and complement them with extensive experimentations on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset.   We empirically verify  the assumption and notice that it  holds on several  real-world datasets.}
}

Endnote

%0 Conference Paper
%T Randomness Efficient Feature Hashing for Sparse Binary Data
%A Rameshwar Pratap
%A Karthik Revanuru
%A Anirudh Ravi
%A Raghav Kulkarni
%B Proceedings of The 12th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Sinno Jialin Pan
%E Masashi Sugiyama	
%F pmlr-v129-pratap20a
%I PMLR
%P 689--704
%U https://proceedings.mlr.press/v129/pratap20a.html
%V 129
%X We present sketching algorithms for sparse binary datasets, which maintain    binary version of the dataset after sketching,   while simultaneously preserving multiple similarity measures such as Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance, on the same sketch. A major advantage of our algorithms is that
 they are randomness efficient, and require significantly less number of random bits for sketching – logarithmic in   dimension, while other competitive algorithms require linear in   dimension. Our proposed algorithms are   efficient,  offer a compact sketch of the dataset, and can be efficiently deployed in a distributive setting. We present a theoretical analysis of our approach and complement them with extensive experimentations on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset.   We empirically verify  the assumption and notice that it  holds on several  real-world datasets.

APA


Pratap, R., Revanuru, K., Ravi, A. & Kulkarni, R.. (2020). Randomness Efficient Feature Hashing for Sparse Binary Data. Proceedings of The 12th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 129:689-704 Available from https://proceedings.mlr.press/v129/pratap20a.html.

Related Material

Download PDF