Random Decision Hashing for Massive Data Learning

[edit]

Xiatian Zhang, Wei Fan, Nan Du ;
Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 41:65-80, 2015.

Abstract

In the era of Big Data, the iterative nature of most traditional learning algorithms renders them increasingly inefficient to address large learning problems. Random decision trees algorithm is an efficient and decent learning algorithm, but the complexity of tree structure makes the algorithm inefficient for the big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative, very flexible to deploy over clusters of machines, and thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm can easily scale up to millions of data samples and features while still achieves at most 17% and 800% improvement in accuracy and efficiency respectively with moderate memory consumption over existing algorithms.

Related Material