A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries

David Matusevich, Carlos Ordonez
; Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:61-76, 2014.

Abstract

CClustering is an important problem in Statistics and Machine Learning that is usually solved using Likelihood Maximization Methods, of which the Expectation-Maximization Algorithm (EM) is the most common. In this work we present an SQL implementation of an algorithm merging Markov Chain Monte Carlo methods with the EM algorithm to find qualitatively better solutions for the clustering problem. Even though SQL is not optimized for complex calculations, as it is constrained to work on tables and columns, it is unparalleled in handling all aspects of storage management, security of the information, fault management, etc. Our algorithm makes use of these characteristics to produce portable solutions that are comparable to the results obtained by other algorithms and are more efficient since the calculations are all performed inside the DBMS. To simplify the calculation we use very simple scalar UDFs, of a type that is available in most DBMS. The solution has linear time complexity on the size of the data set and it has a linear speedup with the number of servers in the cluster. This was achieved using sufficient statistics and a simplified model that assigns the data-points to different clusters during the E-step in an incremental manner and the introduction of a Sampling step in order to explore the solution space in a more efficient manner. Preliminary experiments show very good agreement with standard solutions.

Cite this Paper


BibTeX
@InProceedings{pmlr-v36-matusevich14, title = {A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries}, author = {David Matusevich and Carlos Ordonez}, booktitle = {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications}, pages = {61--76}, year = {2014}, editor = {Wei Fan and Albert Bifet and Qiang Yang and Philip S. Yu}, volume = {36}, series = {Proceedings of Machine Learning Research}, address = {New York, New York, USA}, month = {24 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v36/matusevich14.pdf}, url = {http://proceedings.mlr.press/v36/matusevich14.html}, abstract = {CClustering is an important problem in Statistics and Machine Learning that is usually solved using Likelihood Maximization Methods, of which the Expectation-Maximization Algorithm (EM) is the most common. In this work we present an SQL implementation of an algorithm merging Markov Chain Monte Carlo methods with the EM algorithm to find qualitatively better solutions for the clustering problem. Even though SQL is not optimized for complex calculations, as it is constrained to work on tables and columns, it is unparalleled in handling all aspects of storage management, security of the information, fault management, etc. Our algorithm makes use of these characteristics to produce portable solutions that are comparable to the results obtained by other algorithms and are more efficient since the calculations are all performed inside the DBMS. To simplify the calculation we use very simple scalar UDFs, of a type that is available in most DBMS. The solution has linear time complexity on the size of the data set and it has a linear speedup with the number of servers in the cluster. This was achieved using sufficient statistics and a simplified model that assigns the data-points to different clusters during the E-step in an incremental manner and the introduction of a Sampling step in order to explore the solution space in a more efficient manner. Preliminary experiments show very good agreement with standard solutions.} }
Endnote
%0 Conference Paper %T A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries %A David Matusevich %A Carlos Ordonez %B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications %C Proceedings of Machine Learning Research %D 2014 %E Wei Fan %E Albert Bifet %E Qiang Yang %E Philip S. Yu %F pmlr-v36-matusevich14 %I PMLR %J Proceedings of Machine Learning Research %P 61--76 %U http://proceedings.mlr.press %V 36 %W PMLR %X CClustering is an important problem in Statistics and Machine Learning that is usually solved using Likelihood Maximization Methods, of which the Expectation-Maximization Algorithm (EM) is the most common. In this work we present an SQL implementation of an algorithm merging Markov Chain Monte Carlo methods with the EM algorithm to find qualitatively better solutions for the clustering problem. Even though SQL is not optimized for complex calculations, as it is constrained to work on tables and columns, it is unparalleled in handling all aspects of storage management, security of the information, fault management, etc. Our algorithm makes use of these characteristics to produce portable solutions that are comparable to the results obtained by other algorithms and are more efficient since the calculations are all performed inside the DBMS. To simplify the calculation we use very simple scalar UDFs, of a type that is available in most DBMS. The solution has linear time complexity on the size of the data set and it has a linear speedup with the number of servers in the cluster. This was achieved using sufficient statistics and a simplified model that assigns the data-points to different clusters during the E-step in an incremental manner and the introduction of a Sampling step in order to explore the solution space in a more efficient manner. Preliminary experiments show very good agreement with standard solutions.
RIS
TY - CPAPER TI - A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries AU - David Matusevich AU - Carlos Ordonez BT - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications PY - 2014/08/13 DA - 2014/08/13 ED - Wei Fan ED - Albert Bifet ED - Qiang Yang ED - Philip S. Yu ID - pmlr-v36-matusevich14 PB - PMLR SP - 61 DP - PMLR EP - 76 L1 - http://proceedings.mlr.press/v36/matusevich14.pdf UR - http://proceedings.mlr.press/v36/matusevich14.html AB - CClustering is an important problem in Statistics and Machine Learning that is usually solved using Likelihood Maximization Methods, of which the Expectation-Maximization Algorithm (EM) is the most common. In this work we present an SQL implementation of an algorithm merging Markov Chain Monte Carlo methods with the EM algorithm to find qualitatively better solutions for the clustering problem. Even though SQL is not optimized for complex calculations, as it is constrained to work on tables and columns, it is unparalleled in handling all aspects of storage management, security of the information, fault management, etc. Our algorithm makes use of these characteristics to produce portable solutions that are comparable to the results obtained by other algorithms and are more efficient since the calculations are all performed inside the DBMS. To simplify the calculation we use very simple scalar UDFs, of a type that is available in most DBMS. The solution has linear time complexity on the size of the data set and it has a linear speedup with the number of servers in the cluster. This was achieved using sufficient statistics and a simplified model that assigns the data-points to different clusters during the E-step in an incremental manner and the introduction of a Sampling step in order to explore the solution space in a more efficient manner. Preliminary experiments show very good agreement with standard solutions. ER -
APA
Matusevich, D. & Ordonez, C.. (2014). A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in PMLR 36:61-76

Related Material