Shared Execution of Clustering Tasks

Padmashree Ravindra; Rajeev Gupta; Kemafor Anyanwu

Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 41:81-96, 2015.

Abstract

Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
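The idea of sharing work across several k-means runs over the same input can be illustrated with a small sketch. This is not the paper's algorithm: it is a minimal single-machine illustration, under the assumption that the tasks differ only in k, of how one scan of the data per iteration can serve every task at once. The function names (`nearest`, `shared_kmeans`) and the random initialisation are hypothetical choices made for the example.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))

def shared_kmeans(points, ks, iterations=10, seed=0):
    """Run one k-means task per value in ks, sharing each pass over points."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumed initialisation: each task samples its own starting centroids.
    tasks = [[list(p) for p in rng.sample(points, k)] for k in ks]
    for _ in range(iterations):
        # Per-task accumulators: coordinate sums and counts for every cluster.
        sums = [[[0.0] * dim for _ in cents] for cents in tasks]
        counts = [[0] * len(cents) for cents in tasks]
        for point in points:                    # a single scan of the input ...
            for t, cents in enumerate(tasks):   # ... is reused by every task
                j = nearest(point, cents)
                counts[t][j] += 1
                for d in range(dim):
                    sums[t][j][d] += point[d]
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for t, cents in enumerate(tasks):
            for j in range(len(cents)):
                if counts[t][j]:
                    cents[j] = [s / counts[t][j] for s in sums[t][j]]
    return tasks

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = shared_kmeans(points, ks=[1, 2])
```

Here `result[t]` holds the final centroids for the task with `ks[t]` clusters. In a MapReduce-style setting, the inner loop over tasks would roughly correspond to the map side and the centroid recomputation to the reduce side, though the paper's actual design may differ.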

Cite this Paper

BibTeX


@InProceedings{pmlr-v41-ravindra15,
  title = 	 {Shared Execution of Clustering Tasks},
  author = 	 {Ravindra, Padmashree and Gupta, Rajeev and Anyanwu, Kemafor},
  booktitle = 	 {Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = 	 {81--96},
  year = 	 {2015},
  editor = 	 {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume = 	 {41},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v41/ravindra15.pdf},
  url = 	 {https://proceedings.mlr.press/v41/ravindra15.html},
  abstract = 	 {Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the \textit{map} and \textit{reduce} phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40\% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.}
}

Endnote

%0 Conference Paper
%T Shared Execution of Clustering Tasks
%A Padmashree Ravindra
%A Rajeev Gupta
%A Kemafor Anyanwu
%B Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2015
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v41-ravindra15
%I PMLR
%P 81--96
%U https://proceedings.mlr.press/v41/ravindra15.html
%V 41
%X Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.

RIS


TY  - CPAPER
TI  - Shared Execution of Clustering Tasks
AU  - Padmashree Ravindra
AU  - Rajeev Gupta
AU  - Kemafor Anyanwu
BT  - Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2015/08/31
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v41-ravindra15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 41
SP  - 81
EP  - 96
L1  - http://proceedings.mlr.press/v41/ravindra15.pdf
UR  - https://proceedings.mlr.press/v41/ravindra15.html
AB  - Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
ER  -

APA


Ravindra, P., Gupta, R. & Anyanwu, K. (2015). Shared Execution of Clustering Tasks. Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 41:81-96. Available from https://proceedings.mlr.press/v41/ravindra15.html.
