Shared Execution of Clustering Tasks

Padmashree Ravindra, Rajeev Gupta, Kemafor Anyanwu
Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 41:81-96, 2015.

Abstract

Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
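
To illustrate the general idea of sharing work across clustering tasks, the sketch below (in Python, not taken from the paper) performs one Lloyd iteration for several k-means tasks while scanning the input only once. The function names `nearest` and `shared_kmeans_iteration`, and the in-memory dictionary of tasks, are assumptions made for this example; the paper's algorithms target Hadoop/MapReduce and Spark and may differ substantially.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def shared_kmeans_iteration(points, tasks):
    """Run one Lloyd iteration for every task using a single pass over points.

    points: list of equal-length numeric tuples.
    tasks:  dict mapping a (hypothetical) task name to its current centroids;
            tasks may differ in k or in their initial centroids.
    Returns a dict mapping each task name to its updated centroids.
    """
    dim = len(points[0])
    # Per-task accumulators: coordinate sums and member counts per centroid.
    sums = {name: [[0.0] * dim for _ in cents] for name, cents in tasks.items()}
    counts = {name: [0] * len(cents) for name, cents in tasks.items()}

    for point in points:  # the only scan of the input data
        for name, cents in tasks.items():
            j = nearest(point, cents)
            counts[name][j] += 1
            for d, value in enumerate(point):
                sums[name][j][d] += value

    updated = {}
    for name, cents in tasks.items():
        updated[name] = [
            # Keep the old centroid if no point was assigned to it.
            [s / counts[name][j] for s in sums[name][j]] if counts[name][j] else list(cents[j])
            for j in range(len(cents))
        ]
    return updated


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tasks = {"k1": [(5.0, 5.0)], "k2": [(1.0, 1.0), (9.0, 1.0)]}
result = shared_kmeans_iteration(points, tasks)
```

Here each point is read once and then assigned under every task, so the cost of reading the data is paid once per iteration rather than once per task. This sketch shares only the scan; any sharing of distance computations across tasks is left out.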

Cite this Paper


BibTeX
@InProceedings{pmlr-v41-ravindra15,
  title = {Shared Execution of Clustering Tasks},
  author = {Ravindra, Padmashree and Gupta, Rajeev and Anyanwu, Kemafor},
  booktitle = {Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = {81--96},
  year = {2015},
  editor = {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume = {41},
  series = {Proceedings of Machine Learning Research},
  month = {10 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v41/ravindra15.pdf},
  url = {https://proceedings.mlr.press/v41/ravindra15.html},
  abstract = {Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the \textit{map} and \textit{reduce} phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.}
}
Endnote
%0 Conference Paper
%T Shared Execution of Clustering Tasks
%A Padmashree Ravindra
%A Rajeev Gupta
%A Kemafor Anyanwu
%B Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2015
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v41-ravindra15
%I PMLR
%P 81--96
%U https://proceedings.mlr.press/v41/ravindra15.html
%V 41
%X Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
RIS
TY  - CPAPER
TI  - Shared Execution of Clustering Tasks
AU  - Padmashree Ravindra
AU  - Rajeev Gupta
AU  - Kemafor Anyanwu
BT  - Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2015/08/31
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v41-ravindra15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 41
SP  - 81
EP  - 96
L1  - http://proceedings.mlr.press/v41/ravindra15.pdf
UR  - https://proceedings.mlr.press/v41/ravindra15.html
AB  - Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
ER  -
APA
Ravindra, P., Gupta, R., & Anyanwu, K. (2015). Shared Execution of Clustering Tasks. Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 41:81-96. Available from https://proceedings.mlr.press/v41/ravindra15.html.
