Random Decision Hashing for Massive Data Learning

Xiatian Zhang, Wei Fan, Nan Du
Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 41:65-80, 2015.

Abstract

In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient for large learning problems. The random decision trees algorithm is an efficient and effective learner, but the complexity of its tree structures makes it inefficient on big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative and flexible to deploy over clusters of machines, and is thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm easily scales to millions of data samples and features while achieving up to 17% higher accuracy and an 800% improvement in efficiency over existing algorithms, with moderate memory consumption.
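
The abstract only sketches the mechanism, so the following is a minimal illustration of the standard building block it names: unsupervised locality sensitive hashing used for classification, here with random-hyperplane (SimHash-style) hash keys and a majority vote over the training labels that share a query's buckets. The class name, parameters, and voting rule are illustrative assumptions, not the authors' exact algorithm.

import numpy as np
from collections import defaultdict, Counter

class LSHClassifier:
    """Illustrative sketch, not the paper's method: hash training points
    with random hyperplanes (SimHash-style LSH), then classify a query by
    majority vote over the labels that fall into the same buckets."""

    def __init__(self, n_bits=16, n_tables=8, seed=0):
        self.n_bits = n_bits            # hash-key length per table
        self.n_tables = n_tables        # independent hash tables
        self.rng = np.random.default_rng(seed)
        self.planes = None              # one set of random hyperplanes per table
        self.tables = None              # bucket key -> Counter of class labels
        self.prior = None               # fallback label for empty buckets

    def _key(self, planes, x):
        # The sign pattern of x against each hyperplane is the bucket key.
        bits = (x @ planes.T) >= 0
        return bits.tobytes()

    def fit(self, X, y):
        d = X.shape[1]
        self.planes = [self.rng.standard_normal((self.n_bits, d))
                       for _ in range(self.n_tables)]
        self.tables = [defaultdict(Counter) for _ in range(self.n_tables)]
        # One pass over the data per table; no iterative optimization.
        for planes, table in zip(self.planes, self.tables):
            for xi, yi in zip(X, y):
                table[self._key(planes, xi)][yi] += 1
        self.prior = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        preds = []
        for x in X:
            votes = Counter()
            for planes, table in zip(self.planes, self.tables):
                votes.update(table.get(self._key(planes, x), {}))
            preds.append(votes.most_common(1)[0][0] if votes else self.prior)
        return np.array(preds)

if __name__ == "__main__":
    # Tiny synthetic usage example with hypothetical data.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 20))
    y = (X[:, 0] > 0).astype(int)
    clf = LSHClassifier().fit(X[:800], y[:800])
    print("holdout accuracy:", (clf.predict(X[800:]) == y[800:]).mean())

Because every hash table is built in a single pass over the data and the tables are independent of one another, this kind of training is non-iterative and shards naturally across machines, which is consistent with the scalability argument in the abstract.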

Cite this Paper


BibTeX
@InProceedings{pmlr-v41-zhang15,
  title     = {Random Decision Hashing for Massive Data Learning},
  author    = {Zhang, Xiatian and Fan, Wei and Du, Nan},
  booktitle = {Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages     = {65--80},
  year      = {2015},
  editor    = {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume    = {41},
  series    = {Proceedings of Machine Learning Research},
  month     = {10 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v41/zhang15.pdf},
  url       = {https://proceedings.mlr.press/v41/zhang15.html},
  abstract  = {In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient for large learning problems. The random decision trees algorithm is an efficient and effective learner, but the complexity of its tree structures makes it inefficient on big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative and flexible to deploy over clusters of machines, and is thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm easily scales to millions of data samples and features while achieving up to 17% higher accuracy and an 800% improvement in efficiency over existing algorithms, with moderate memory consumption.}
}
EndNote
%0 Conference Paper
%T Random Decision Hashing for Massive Data Learning
%A Xiatian Zhang
%A Wei Fan
%A Nan Du
%B Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2015
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v41-zhang15
%I PMLR
%P 65--80
%U https://proceedings.mlr.press/v41/zhang15.html
%V 41
%X In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient for large learning problems. The random decision trees algorithm is an efficient and effective learner, but the complexity of its tree structures makes it inefficient on big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative and flexible to deploy over clusters of machines, and is thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm easily scales to millions of data samples and features while achieving up to 17% higher accuracy and an 800% improvement in efficiency over existing algorithms, with moderate memory consumption.
RIS
TY - CPAPER
TI - Random Decision Hashing for Massive Data Learning
AU - Xiatian Zhang
AU - Wei Fan
AU - Nan Du
BT - Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA - 2015/08/31
ED - Wei Fan
ED - Albert Bifet
ED - Qiang Yang
ED - Philip S. Yu
ID - pmlr-v41-zhang15
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 41
SP - 65
EP - 80
L1 - http://proceedings.mlr.press/v41/zhang15.pdf
UR - https://proceedings.mlr.press/v41/zhang15.html
AB - In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient for large learning problems. The random decision trees algorithm is an efficient and effective learner, but the complexity of its tree structures makes it inefficient on big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative and flexible to deploy over clusters of machines, and is thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm easily scales to millions of data samples and features while achieving up to 17% higher accuracy and an 800% improvement in efficiency over existing algorithms, with moderate memory consumption.
ER -
APA
Zhang, X., Fan, W. & Du, N. (2015). Random Decision Hashing for Massive Data Learning. Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 41:65-80. Available from https://proceedings.mlr.press/v41/zhang15.html.

Related Material

Download PDF: http://proceedings.mlr.press/v41/zhang15.pdf