Random Decision Hashing for Massive Data Learning

Xiatian Zhang; Wei Fan; Nan Du

Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 41:65-80, 2015.

Abstract

In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient for large learning problems. The random decision trees algorithm is an efficient and decent learning algorithm, but the complexity of its tree structure makes it inefficient for big data problems. Inspired by theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative and very flexible to deploy over clusters of machines, and is thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm can easily scale up to millions of data samples and features while still achieving up to 17% and 800% improvement in accuracy and efficiency, respectively, over existing algorithms, with moderate memory consumption.
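
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea it names: label-independent locality sensitive hashing used for non-iterative classification. It is not the paper's algorithm. The class name `HashedBucketClassifier`, the parameters `n_tables` and `n_bits`, the choice of random-hyperplane (sign) hashing, and the averaging of per-bucket label frequencies across tables are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class HashedBucketClassifier:
    """Illustrative ensemble of random-hyperplane hash tables.

    Hypothetical sketch, NOT the method from the paper: each table hashes a
    point to a bucket using random hyperplanes chosen without looking at the
    labels, and each bucket stores label counts gathered in a single pass.
    """

    def __init__(self, n_tables=10, n_bits=4, seed=0):
        self.n_tables = n_tables
        self.n_bits = n_bits
        self.rng = random.Random(seed)

    def _key(self, planes, x):
        # One bit per hyperplane: which side of the plane x falls on.
        return tuple(sum(w * v for w, v in zip(p, x)) >= 0.0 for p in planes)

    def fit(self, X, y):
        dim = len(X[0])
        self.classes = sorted(set(y))
        # Hyperplanes are drawn at random, independent of the labels.
        self.planes = [
            [[self.rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(self.n_bits)]
            for _ in range(self.n_tables)
        ]
        # Single pass over the data: count labels per bucket in every table.
        self.tables = [defaultdict(lambda: defaultdict(int))
                       for _ in range(self.n_tables)]
        for x, label in zip(X, y):
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, x)][label] += 1
        return self

    def predict_proba(self, x):
        totals = {c: 0.0 for c in self.classes}
        used = 0
        for planes, table in zip(self.planes, self.tables):
            counts = table.get(self._key(planes, x))
            if not counts:
                continue  # empty bucket: this table abstains
            n = sum(counts.values())
            for c in self.classes:
                totals[c] += counts.get(c, 0) / n
            used += 1
        if used == 0:
            # No table has seen this bucket: fall back to a uniform guess.
            return {c: 1.0 / len(self.classes) for c in self.classes}
        return {c: v / used for c, v in totals.items()}

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(self.classes, key=lambda c: proba[c])


if __name__ == "__main__":
    X = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]]
    y = [1, 1, 0, 0]
    clf = HashedBucketClassifier(n_tables=5, n_bits=4, seed=0).fit(X, y)
    print(clf.predict([4.0, 4.0]), clf.predict([-4.0, -4.0]))
```

Because each table is built independently and training is a single counting pass, a design of this kind could in principle be split across machines by table or by data shard, which is the sort of deployment flexibility the abstract refers to.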

Cite this Paper

BibTeX


@InProceedings{pmlr-v41-zhang15,
  title = {Random Decision Hashing for Massive Data Learning},
  author = {Zhang, Xiatian and Fan, Wei and Du, Nan},
  booktitle = {Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = {65--80},
  year = {2015},
  editor = {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume = {41},
  series = {Proceedings of Machine Learning Research},
  month = {10 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v41/zhang15.pdf},
  url = {https://proceedings.mlr.press/v41/zhang15.html},
  abstract = {In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient to address large learning problems. Random decision trees algorithm is an efficient and decent learning algorithm, but the complexity of tree structure makes the algorithm inefficient for the big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative, very flexible to deploy over clusters of machines, and thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm can easily scale up to millions of data samples and features while still achieves at most 17\% and 800\% improvement in accuracy and efficiency respectively with moderate memory consumption over existing algorithms.}
}

Endnote

%0 Conference Paper
%T Random Decision Hashing for Massive Data Learning
%A Xiatian Zhang
%A Wei Fan
%A Nan Du
%B Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2015
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v41-zhang15
%I PMLR
%P 65--80
%U https://proceedings.mlr.press/v41/zhang15.html
%V 41
%X In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient to address large learning problems. Random decision trees algorithm is an efficient and decent learning algorithm, but the complexity of tree structure makes the algorithm inefficient for the big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative, very flexible to deploy over clusters of machines, and thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm can easily scale up to millions of data samples and features while still achieves at most 17% and 800% improvement in accuracy and efficiency respectively with moderate memory consumption over existing algorithms.

RIS


TY  - CPAPER
TI  - Random Decision Hashing for Massive Data Learning
AU  - Xiatian Zhang
AU  - Wei Fan
AU  - Nan Du
BT  - Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2015/08/31
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v41-zhang15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 41
SP  - 65
EP  - 80
L1  - http://proceedings.mlr.press/v41/zhang15.pdf
UR  - https://proceedings.mlr.press/v41/zhang15.html
AB  - In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient to address large learning problems. Random decision trees algorithm is an efficient and decent learning algorithm, but the complexity of tree structure makes the algorithm inefficient for the big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative, very flexible to deploy over clusters of machines, and thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm can easily scale up to millions of data samples and features while still achieves at most 17% and 800% improvement in accuracy and efficiency respectively with moderate memory consumption over existing algorithms.
ER  -

APA


Zhang, X., Fan, W., & Du, N. (2015). Random Decision Hashing for Massive Data Learning. Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 41:65-80. Available from https://proceedings.mlr.press/v41/zhang15.html.
