Anytime Concurrent Clustering of Multiple Streams with an Indexing Tree
Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 41:19-32, 2015.
With the advancement of data generation technologies such as sensor networks, multiple data streams are continuously generated. Clustering multiple data streams is challenging because the requirement to produce a clustering at any time becomes more critical. We aim to cluster multiple data streams concurrently, and in this paper we report our work in progress. ClusTree is an anytime clustering algorithm for a single stream. It uses a hierarchical tree structure to index micro-clusters, which are summary statistics for streaming data objects. We design a dynamic, concurrent indexing tree structure that extends the ClusTree structure to achieve more granular micro-clusters (summaries) of multiple streams at any time. We devise algorithms to concurrently search, expand, and update the hierarchical tree structure that stores the micro-clusters, along with an algorithm for anytime concurrent clustering of multiple streams. As this is work in progress, we plan to test our proposed algorithms on sensor data sets and evaluate the space and time complexity of creating and accessing micro-clusters. We will also evaluate the quality of clustering in terms of the number of created clusters and compare our technique with other approaches.
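
To make the notion of a micro-cluster as "summary statistics" concrete, the following is a minimal sketch, assuming the common cluster-feature form (count, per-dimension linear sum, per-dimension squared sum). The names `MicroCluster`, `insert`, `merge`, and `summarize` are hypothetical and chosen for illustration; this is not the paper's concurrent indexing tree, only the additive summary that such a tree could index.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence


@dataclass
class MicroCluster:
    """Hypothetical cluster feature (n, LS, SS) summarizing a set of points."""
    dim: int
    n: int = 0
    linear_sum: List[float] = field(default_factory=list)
    squared_sum: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start from an empty summary when no sums are supplied.
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim
        if not self.squared_sum:
            self.squared_sum = [0.0] * self.dim

    def insert(self, point: Sequence[float]) -> None:
        """Absorb one streaming object into the summary."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def merge(self, other: "MicroCluster") -> None:
        """Add another summary component-wise (summaries are additive)."""
        self.n += other.n
        for i in range(self.dim):
            self.linear_sum[i] += other.linear_sum[i]
            self.squared_sum[i] += other.squared_sum[i]

    def centroid(self) -> List[float]:
        return [ls / self.n for ls in self.linear_sum]

    def variance(self) -> List[float]:
        # Per-dimension variance: E[x^2] - (E[x])^2.
        return [
            ss / self.n - (ls / self.n) ** 2
            for ls, ss in zip(self.linear_sum, self.squared_sum)
        ]


def summarize(points: Iterable[Sequence[float]], dim: int) -> MicroCluster:
    """Build a summary of one stream's points (illustrative helper)."""
    mc = MicroCluster(dim=dim)
    for p in points:
        mc.insert(p)
    return mc


if __name__ == "__main__":
    # Two made-up streams, summarized separately and then merged.
    stream_a = [(1.0, 2.0), (3.0, 4.0)]
    stream_b = [(5.0, 6.0)]
    merged = summarize(stream_a, dim=2)
    merged.merge(summarize(stream_b, dim=2))
    print(merged.n, merged.centroid())
```

Because the three components are plain sums, summaries built independently for different streams can be combined without revisiting the raw objects, which is one reason this representation suits a multi-stream setting.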