The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez; Yiqun Zhang; Wellington Cabrera

Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:88-103, 2014.

Abstract

SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.
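
The abstract does not spell out the layout of the summarization matrix, so the following is only a minimal sketch of the general idea of one-pass, mergeable sufficient statistics, not the authors' SciDB operator. It assumes each record is augmented to z = (1, x1, ..., xd, y) and that the summary is the sum of the outer products z z^T, which holds the row count, the linear sums and the quadratic sums in a single matrix. The function names and the toy data are hypothetical.

```python
from typing import Iterable, List, Sequence

Matrix = List[List[float]]


def partial_gamma(rows: Iterable[Sequence[float]], width: int) -> Matrix:
    """Accumulate the sum of z z^T over one partition in a single pass.

    Each row has `width` values; z is the row with a leading 1 (assumed layout).
    """
    size = width + 1
    gamma = [[0.0] * size for _ in range(size)]
    for row in rows:
        z = [1.0, *row]
        for i in range(size):
            zi = z[i]
            for j in range(size):
                gamma[i][j] += zi * z[j]
    return gamma


def merge(a: Matrix, b: Matrix) -> Matrix:
    """Combine two partial summaries; plain addition is all that is needed."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical toy data: two partitions, each row is (x1, x2, y).
part_a = [(1.0, 2.0, 3.0), (2.0, 0.0, 1.0)]
part_b = [(0.0, 1.0, 4.0)]

# Each partition could be summarized by a different worker, then merged.
gamma = merge(partial_gamma(part_a, 3), partial_gamma(part_b, 3))

n = gamma[0][0]                          # number of rows
linear_sums = gamma[0][1:]               # per-column sums
quadratic = [r[1:] for r in gamma[1:]]   # sums of pairwise products
```

Because the partial summaries combine by addition, the data is read once and the partitions can be processed independently. Models such as linear regression or PCA can then, under this assumed layout, be derived from the small matrix rather than from the full data set.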

Cite this Paper

BibTeX


@InProceedings{pmlr-v36-ordonez14,
  title = 	 {The Gamma Operator for Big Data Summarization on an Array DBMS},
  author = 	 {Ordonez, Carlos and Zhang, Yiqun and Cabrera, Wellington},
  booktitle = 	 {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = 	 {88--103},
  year = 	 {2014},
  editor = 	 {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume = 	 {36},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {New York, New York, USA},
  month = 	 {24 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v36/ordonez14.pdf},
  url = 	 {https://proceedings.mlr.press/v36/ordonez14.html},
  abstract = 	 {SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.}
}

Endnote

%0 Conference Paper
%T The Gamma Operator for Big Data Summarization on an Array DBMS
%A Carlos Ordonez
%A Yiqun Zhang
%A Wellington Cabrera
%B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2014
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v36-ordonez14
%I PMLR
%P 88--103
%U https://proceedings.mlr.press/v36/ordonez14.html
%V 36
%X SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.

RIS


TY  - CPAPER
TI  - The Gamma Operator for Big Data Summarization on an Array DBMS
AU  - Carlos Ordonez
AU  - Yiqun Zhang
AU  - Wellington Cabrera
BT  - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2014/08/13
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v36-ordonez14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 36
SP  - 88
EP  - 103
L1  - http://proceedings.mlr.press/v36/ordonez14.pdf
UR  - https://proceedings.mlr.press/v36/ordonez14.html
AB  - SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.
ER  -

APA


Ordonez, C., Zhang, Y., & Cabrera, W. (2014). The Gamma Operator for Big Data Summarization on an Array DBMS. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 36:88-103. Available from https://proceedings.mlr.press/v36/ordonez14.html.
