Fast Randomized PCA for Sparse Data

Xu Feng; Yuyang Xie; Mingye Song; Wenjian Yu; Jie Tang

Fast Randomized PCA for Sparse Data

Xu Feng, Yuyang Xie, Mingye Song, Wenjian Yu, Jie Tang

Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:710-725, 2018.

Abstract

Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis, information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm (Halko et al., 2011), but is largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X faster than the \texttt{svds} in Matlab with little error. The algorithm computes the first 100 principal components of a large information retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional methods fail due to the out-of-memory issue.

Cite this Paper

BibTeX

@InProceedings{pmlr-v95-feng18a,
  title = 	 {Fast Randomized PCA for Sparse Data},
  author =       {Feng, Xu and Xie, Yuyang and Song, Mingye and Yu, Wenjian and Tang, Jie},
  booktitle = 	 {Proceedings of The 10th Asian Conference on Machine Learning},
  pages = 	 {710--725},
  year = 	 {2018},
  editor = 	 {Zhu, Jun and Takeuchi, Ichiro},
  volume = 	 {95},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--16 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v95/feng18a/feng18a.pdf},
  url = 	 {https://proceedings.mlr.press/v95/feng18a.html},
  abstract = 	 {Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis, information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm  (Halko et al., 2011), but is largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X faster than the \texttt{svds} in Matlab with little error. The algorithm computes the first 100 principal components of a large information retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional methods fail due to the out-of-memory issue.}
}

Endnote

%0 Conference Paper
%T Fast Randomized PCA for Sparse Data
%A Xu Feng
%A Yuyang Xie
%A Mingye Song
%A Wenjian Yu
%A Jie Tang
%B Proceedings of The 10th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jun Zhu
%E Ichiro Takeuchi	
%F pmlr-v95-feng18a
%I PMLR
%P 710--725
%U https://proceedings.mlr.press/v95/feng18a.html
%V 95
%X Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis, information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm  (Halko et al., 2011), but is largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X faster than the \texttt{svds} in Matlab with little error. The algorithm computes the first 100 principal components of a large information retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional methods fail due to the out-of-memory issue.

APA

Feng, X., Xie, Y., Song, M., Yu, W. & Tang, J.. (2018). Fast Randomized PCA for Sparse Data. Proceedings of The 10th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 95:710-725 Available from https://proceedings.mlr.press/v95/feng18a.html.

Related Material

Download PDF