Coding for Random Projections

Ping Li; Michael Mitzenmacher; Anshumali Shrivastava

Coding for Random Projections

Ping Li, Michael Mitzenmacher, Anshumali Shrivastava

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):676-684, 2014.

Abstract

The method of random projections has become popular for large-scale applications in statistical learning, information retrieval, bio-informatics and other applications. Using a well-designed \textbfcoding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed. In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that \textbfuniform quantization outperforms the standard and influential method \citeProc:Datar_SCG04, which used a \em window-and-random offset scheme. Indeed, we argue that in many cases coding with just a small number of bits suffices. Furthermore, we also develop a \textbfnon-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM). Proofs and additional experiments are available at \em arXiv:1308.2218. In the context of using coded random projections for \textbfapproximate near neighbor search by building hash tables (\em arXiv:1403.8144) \citeReport:RPCodeLSH2014, we show that the step of random offset in \citeProc:Datar_SCG04 is again not needed and may hurt the performance. Furthermore, we show that, unless the target similarity level is high, it usually suffices to use only 1 or 2 bits to code each hashed value for this task. Section \refsec_LSH presents some experimental results for LSH.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-lie14,
  title = 	 {Coding for Random Projections},
  author = 	 {Li, Ping and Mitzenmacher, Michael and Shrivastava, Anshumali},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {676--684},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Bejing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/lie14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/lie14.html},
  abstract = 	 {The method of random projections has become  popular for large-scale applications in statistical learning, information retrieval, bio-informatics  and other applications.  Using a well-designed \textbfcoding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed.   In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that \textbfuniform quantization outperforms the standard and influential method \citeProc:Datar_SCG04, which used a \em window-and-random offset scheme. Indeed, we argue that in many cases coding with just a small number of bits suffices.  Furthermore, we also  develop a \textbfnon-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM). Proofs and additional experiments  are available at \em arXiv:1308.2218.      In the context of using coded random projections for \textbfapproximate near neighbor search by building hash tables (\em arXiv:1403.8144) \citeReport:RPCodeLSH2014, we show that the step of random offset in \citeProc:Datar_SCG04 is  again not needed  and may hurt the performance. Furthermore, we show that, unless the target similarity level is high, it usually suffices to use only 1 or 2 bits to code each hashed value for this task. Section \refsec_LSH presents some experimental results for LSH.}
}

Endnote

%0 Conference Paper
%T Coding for Random Projections
%A Ping Li
%A Michael Mitzenmacher
%A Anshumali Shrivastava
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara	
%F pmlr-v32-lie14
%I PMLR
%P 676--684
%U https://proceedings.mlr.press/v32/lie14.html
%V 32
%N 2
%X The method of random projections has become  popular for large-scale applications in statistical learning, information retrieval, bio-informatics  and other applications.  Using a well-designed \textbfcoding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed.   In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that \textbfuniform quantization outperforms the standard and influential method \citeProc:Datar_SCG04, which used a \em window-and-random offset scheme. Indeed, we argue that in many cases coding with just a small number of bits suffices.  Furthermore, we also  develop a \textbfnon-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM). Proofs and additional experiments  are available at \em arXiv:1308.2218.      In the context of using coded random projections for \textbfapproximate near neighbor search by building hash tables (\em arXiv:1403.8144) \citeReport:RPCodeLSH2014, we show that the step of random offset in \citeProc:Datar_SCG04 is  again not needed  and may hurt the performance. Furthermore, we show that, unless the target similarity level is high, it usually suffices to use only 1 or 2 bits to code each hashed value for this task. Section \refsec_LSH presents some experimental results for LSH.

RIS


TY  - CPAPER
TI  - Coding for Random Projections
AU  - Ping Li
AU  - Michael Mitzenmacher
AU  - Anshumali Shrivastava
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara	
ID  - pmlr-v32-lie14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 676
EP  - 684
L1  - http://proceedings.mlr.press/v32/lie14.pdf
UR  - https://proceedings.mlr.press/v32/lie14.html
AB  - The method of random projections has become  popular for large-scale applications in statistical learning, information retrieval, bio-informatics  and other applications.  Using a well-designed \textbfcoding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed.   In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that \textbfuniform quantization outperforms the standard and influential method \citeProc:Datar_SCG04, which used a \em window-and-random offset scheme. Indeed, we argue that in many cases coding with just a small number of bits suffices.  Furthermore, we also  develop a \textbfnon-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM). Proofs and additional experiments  are available at \em arXiv:1308.2218.      In the context of using coded random projections for \textbfapproximate near neighbor search by building hash tables (\em arXiv:1403.8144) \citeReport:RPCodeLSH2014, we show that the step of random offset in \citeProc:Datar_SCG04 is  again not needed  and may hurt the performance. Furthermore, we show that, unless the target similarity level is high, it usually suffices to use only 1 or 2 bits to code each hashed value for this task. Section \refsec_LSH presents some experimental results for LSH.
ER  -

APA


Li, P., Mitzenmacher, M. & Shrivastava, A.. (2014). Coding for Random Projections. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):676-684 Available from https://proceedings.mlr.press/v32/lie14.html.

Related Material

Download PDF