Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark

Zhuolin Qiu; Bin Wu; Bai Wang; Le Yu

Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:17-28, 2014.

Abstract

In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the Big Data community and is well suited to iterative and interactive algorithms. Our approach splits the dataset into P*P partitions, then shuffles and recombines them into P sub-datasets according to rules that avoid sampling conflicts, so that each sub-dataset contains only P partitions. The sub-datasets are then processed one after another, each in parallel. Although this increases the number of iterations, the method reduces data communication overhead, makes good use of Spark’s efficient iterative execution, and yields a significant speedup on large-scale datasets in our experiments.
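The abstract does not spell out the rules used to recombine the P*P partitions, so the following is only a minimal sketch under an assumed rule: documents and vocabulary words are each cut into P groups, cell (row, col) of the grid holds the tokens of document group `row` and word group `col`, and sub-dataset k collects the cells on the k-th wrapped diagonal. Cells on one diagonal share no document group and no word group, which is one way to keep concurrent samplers from touching the same document-topic or word-topic counts. The function names and the equal-width grouping by id are hypothetical choices for illustration, not taken from the paper.

```python
from collections import defaultdict


def partition_tokens(tokens, num_docs, vocab_size, P):
    """Assign each (doc_id, word_id) token to one cell of a P x P grid.

    Assumption: ids are grouped into P equal-width ranges. A real system
    would more likely balance groups by token count.
    """
    grid = defaultdict(list)
    for doc_id, word_id in tokens:
        row = doc_id * P // num_docs       # document group index
        col = word_id * P // vocab_size    # word group index
        grid[(row, col)].append((doc_id, word_id))
    return grid


def diagonal_schedule(P):
    """Return P sub-datasets, each a list of P grid cells.

    Sub-dataset k holds cells (row, (row + k) % P), so no two cells in
    the same sub-dataset share a row or a column.
    """
    return [[(row, (row + k) % P) for row in range(P)] for k in range(P)]


if __name__ == "__main__":
    # Each printed line is one sub-dataset; its cells could be sampled in
    # parallel, while the sub-datasets themselves run one after another.
    for k, cells in enumerate(diagonal_schedule(3)):
        print(k, cells)
```

With P = 3, the second sub-dataset is `[(0, 1), (1, 2), (2, 0)]`: three cells that cover every document group and every word group exactly once. One full pass over the data then takes P sequential steps, which matches the abstract's remark that the number of iterations grows.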

Cite this Paper

BibTeX


@InProceedings{pmlr-v36-qiu14,
  title = 	 {Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark},
  author = 	 {Qiu, Zhuolin and Wu, Bin and Wang, Bai and Yu, Le},
  booktitle = 	 {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = 	 {17--28},
  year = 	 {2014},
  editor = 	 {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume = 	 {36},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {New York, New York, USA},
  month = 	 {24 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v36/qiu14.pdf},
  url = 	 {https://proceedings.mlr.press/v36/qiu14.html},
  abstract = 	 {In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing, which has been the talk of the Big Data town for a while. It is suitable for iterative and interactive algorithm. Our approach splits the dataset into P*P partitions, shuffles and recombines these partitions into P sub-datasets using rules to avoid conflicts of sampling, where each of P sub-datasets only contains P partitions, and then parallel processes each sub-dataset one by one. Despite increasing the number of iterations, this method reduces data communication overhead, makes good use of Spark’s efficient iterative execution and results in significant speedup on large-scale datasets in our experiments.}
}

Endnote

%0 Conference Paper
%T Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark
%A Zhuolin Qiu
%A Bin Wu
%A Bai Wang
%A Le Yu
%B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2014
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v36-qiu14
%I PMLR
%P 17--28
%U https://proceedings.mlr.press/v36/qiu14.html
%V 36
%X In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing, which has been the talk of the Big Data town for a while. It is suitable for iterative and interactive algorithm. Our approach splits the dataset into P*P partitions, shuffles and recombines these partitions into P sub-datasets using rules to avoid conflicts of sampling, where each of P sub-datasets only contains P partitions, and then parallel processes each sub-dataset one by one. Despite increasing the number of iterations, this method reduces data communication overhead, makes good use of Spark’s efficient iterative execution and results in significant speedup on large-scale datasets in our experiments.

RIS


TY  - CPAPER
TI  - Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark
AU  - Zhuolin Qiu
AU  - Bin Wu
AU  - Bai Wang
AU  - Le Yu
BT  - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2014/08/13
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v36-qiu14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 36
SP  - 17
EP  - 28
L1  - http://proceedings.mlr.press/v36/qiu14.pdf
UR  - https://proceedings.mlr.press/v36/qiu14.html
AB  - In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing, which has been the talk of the Big Data town for a while. It is suitable for iterative and interactive algorithm. Our approach splits the dataset into P*P partitions, shuffles and recombines these partitions into P sub-datasets using rules to avoid conflicts of sampling, where each of P sub-datasets only contains P partitions, and then parallel processes each sub-dataset one by one. Despite increasing the number of iterations, this method reduces data communication overhead, makes good use of Spark’s efficient iterative execution and results in significant speedup on large-scale datasets in our experiments.
ER  -

APA


Qiu, Z., Wu, B., Wang, B. & Yu, L. (2014). Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 36:17-28. Available from https://proceedings.mlr.press/v36/qiu14.html.
