The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez, Yiqun Zhang, Wellington Cabrera
Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:88-103, 2014.

Abstract

SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.
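
The one-pass, parallel-friendly idea in the abstract can be illustrated with a small sketch. This is not the paper's SciDB operator: it assumes the summarization matrix is accumulated as the sum of outer products z zᵀ with z = [1, x, y], so that it holds the row count, the linear sums and the sums of cross-products. The function names `gamma_one_pass`, `merge` and `mean_and_variance`, and the toy data, are hypothetical and chosen only for illustration.

```python
def gamma_one_pass(rows):
    """Accumulate the sum of z z^T, with z = [1, x_1..x_d, y], in one pass.

    Assumed layout (illustrative, not taken from the page above):
    gamma[0][0] is the row count n, gamma[0][j] the sum of column j,
    and gamma[j][k] the sum of products of columns j and k.
    """
    gamma = None
    for x, y in rows:
        z = [1.0, *x, y]
        if gamma is None:
            gamma = [[0.0] * len(z) for _ in z]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                gamma[a][b] += za * zb
    return gamma


def merge(g1, g2):
    """Combine partial matrices computed on disjoint partitions of the rows."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def mean_and_variance(gamma, j):
    """Derive the mean and population variance of column j from gamma alone."""
    n = gamma[0][0]
    mean = gamma[0][j] / n
    return mean, gamma[j][j] / n - mean * mean


# Toy data: each row is (x, y) with two x columns.
rows = [([1.0, 2.0], 3.0), ([2.0, 0.0], 1.0), ([3.0, 4.0], 5.0), ([4.0, 2.0], 3.0)]

full = gamma_one_pass(rows)
# Two "workers" each summarize half the rows; the partial results are added.
parts = merge(gamma_one_pass(rows[:2]), gamma_one_pass(rows[2:]))

print(full == parts)
print(mean_and_variance(full, 1))
```

Because the partial matrices simply add, each partition can be summarized independently and combined afterwards, and downstream statistics are derived from the small matrix without reading the data set again.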

Cite this Paper


BibTeX
@InProceedings{pmlr-v36-ordonez14,
  title = {The Gamma Operator for Big Data Summarization on an Array DBMS},
  author = {Carlos Ordonez and Yiqun Zhang and Wellington Cabrera},
  booktitle = {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = {88--103},
  year = {2014},
  editor = {Wei Fan and Albert Bifet and Qiang Yang and Philip S. Yu},
  volume = {36},
  series = {Proceedings of Machine Learning Research},
  address = {New York, New York, USA},
  month = {24 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v36/ordonez14.pdf},
  url = {http://proceedings.mlr.press/v36/ordonez14.html},
  abstract = {SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.}
}
Endnote
%0 Conference Paper
%T The Gamma Operator for Big Data Summarization on an Array DBMS
%A Carlos Ordonez
%A Yiqun Zhang
%A Wellington Cabrera
%B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2014
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v36-ordonez14
%I PMLR
%J Proceedings of Machine Learning Research
%P 88--103
%U http://proceedings.mlr.press
%V 36
%W PMLR
%X SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.
RIS
TY  - CPAPER
TI  - The Gamma Operator for Big Data Summarization on an Array DBMS
AU  - Carlos Ordonez
AU  - Yiqun Zhang
AU  - Wellington Cabrera
BT  - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
PY  - 2014/08/13
DA  - 2014/08/13
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v36-ordonez14
PB  - PMLR
SP  - 88
DP  - PMLR
EP  - 103
L1  - http://proceedings.mlr.press/v36/ordonez14.pdf
UR  - http://proceedings.mlr.press/v36/ordonez14.html
AB  - SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.
ER  -
APA
Ordonez, C., Zhang, Y. & Cabrera, W. (2014). The Gamma Operator for Big Data Summarization on an Array DBMS. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in PMLR 36:88-103
