Proceedings of Machine Learning Research

Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, held on 10 August 2015. Published as Volume 41 by the Proceedings of Machine Learning Research on 31 August 2015.

Volume Edited by: Wei Fan, Albert Bifet, Qiang Yang, Philip S. Yu
Series Editors: Neil D. Lawrence, Mark Reid
https://proceedings.mlr.press/v41/
Wed, 08 Feb 2023 10:40:31 +0000

Random Decision Hashing for Massive Data Learning
In the era of Big Data, the iterative nature of most traditional learning algorithms makes them increasingly inefficient for large learning problems. The random decision trees algorithm is an efficient and decent learning algorithm, but the complexity of its tree structure makes it inefficient for big data problems. Inspired by the theoretical analyses of random decision trees, we propose a randomized algorithm for big data classification tasks based on unsupervised locality sensitive hashing. Our algorithm is essentially non-iterative, very flexible to deploy over clusters of machines, and thus able to handle large datasets efficiently. Experiments on real datasets demonstrate that the proposed algorithm can easily scale up to millions of data samples and features while still achieving up to 17% and 800% improvement in accuracy and efficiency, respectively, over existing algorithms, with moderate memory consumption.
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/zhang15.html

Taking machine learning research online with OpenML
OpenML is an online platform where scientists can automatically log and share machine learning data sets, code, and experiments, organize them online, and build directly on the work of others.
It helps to automate many tedious aspects of research, is readily integrated into several machine learning tools, and offers easy-to-use APIs. It also enables large-scale and real-time collaboration, allowing researchers to share their very latest results while keeping track of their impact and reuse. The combined and linked results provide a wealth of information to speed up research, assist people while analyzing data, or automate the process altogether.
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/vanschoren15.html

Big Data with ADAMS
ADAMS is a modular open-source Java framework for developing workflows, available for academic research as well as commercial applications. It integrates data mining applications such as MOA, WEKA, MEKA and R, image and video processing and feature generation capabilities, spreadsheet and database access, visualizations, GIS, web services, and fast prototyping of new functionality using scripting languages (Groovy/Jython).
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/reutemann15.html

Anytime Concurrent Clustering of Multiple Streams with an Indexing Tree
With the advancement of data generation technologies such as sensor networks, multiple data streams are continuously generated. Clustering multiple data streams is challenging, as the requirement of clustering at any time becomes more critical. We aim to cluster multiple data streams concurrently, and in this paper we report our work in progress. ClusTree is an anytime clustering algorithm for a single stream. It uses a hierarchical tree structure to index micro-clusters, which are summary statistics for streaming data objects. We design a dynamic, concurrent indexing tree structure that extends the ClusTree structure to achieve more granular micro-clusters (summaries) of multiple streams at any time.
We devised algorithms to search, expand and update the hierarchical tree structure that stores micro-clusters concurrently, along with an algorithm for anytime concurrent clustering of multiple streams. As this is work in progress, we plan to test our proposed algorithms on sensor data sets and evaluate the space and time complexity of creating and accessing micro-clusters. We will also evaluate the quality of clustering in terms of the number of created clusters and compare our technique with other approaches.
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/razavi15.html

Shared Execution of Clustering Tasks
Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times (with different values of k, different clustering attributes, or different initial centroids) before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
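The idea of sharing one scan of the input across several k-means runs can be illustrated with a minimal sketch. It is not the authors' algorithm and not a Hadoop or Spark implementation; the function name `shared_kmeans_step` and the layout of the `runs` dictionary are assumptions made for illustration. Each call performs one Lloyd iteration for every run while reading the points only once.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shared_kmeans_step(
    points: Sequence[Point],
    runs: Dict[str, List[Point]],
) -> Dict[str, List[Point]]:
    """One Lloyd iteration for every run, using a single scan of `points`.

    `runs` maps a (hypothetical) run name to its current centroids; runs may
    differ in k or in their initial centroids.
    """
    # Per-run accumulators: coordinate sums and member counts per centroid.
    sums = {name: [[0.0] * len(c) for c in cents] for name, cents in runs.items()}
    counts = {name: [0] * len(cents) for name, cents in runs.items()}

    for p in points:  # the only pass over the data
        for name, cents in runs.items():
            nearest = min(
                range(len(cents)),
                key=lambda i: squared_distance(p, cents[i]),
            )
            counts[name][nearest] += 1
            for d, x in enumerate(p):
                sums[name][nearest][d] += x

    updated: Dict[str, List[Point]] = {}
    for name, cents in runs.items():
        new_cents: List[Point] = []
        for j, old in enumerate(cents):
            n = counts[name][j]
            if n == 0:
                # Empty cluster: keep the previous centroid unchanged.
                new_cents.append(tuple(old))
            else:
                new_cents.append(tuple(s / n for s in sums[name][j]))
        updated[name] = new_cents
    return updated
```

Calling the function repeatedly until the centroids stop changing completes all runs with one data scan per iteration, rather than one scan per run per iteration.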
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/ravindra15.html

Scalable Multidimensional Hierarchical Bayesian Modeling on Spark
We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies and selected features to perform inference along multiple dimensions. In particular, we focus on the problem of estimating click rates for {Advertiser, Publisher, User} tuples where both the Advertisers and the Publishers are organized as hierarchies that capture broad contextual information at different levels of granularity. Typically, the click rates are low and the coverage of the hierarchies and dimensions is sparse. To overcome these difficulties, we decompose the joint prior of the three-dimensional Click-Through-Rate (CTR) using tensor decomposition and propose a Multidimensional Hierarchical Bayesian framework (abbreviated as MadHab). We set up a specific framework for each dimension to model dimension-specific characteristics. More specifically, we consider a hierarchical beta process prior for each of the Advertiser and Publisher dimensions and a feature-dependent mixture model for the User dimension. Besides the centralized implementation, we propose a distributed inference algorithm on Spark, which makes the model highly scalable and suited for large-scale data mining applications. We demonstrate that, on a real-world ad campaign platform, our framework can effectively discriminate extremely rare events in terms of their click propensity.
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/ormandi15.html

Adaptive Stream Clustering Using Incremental Graph Maintenance
Challenges for clustering streaming data are getting continuously more sophisticated.
This trend is driven by the emerging requirements of the applications in which these algorithms are used and by the properties of the stream itself. Some of these properties are the continuous data arrival, the time-critical processing of objects, the evolution of the data streams, the presence of outliers and the varying densities of the data. Because the stream evolves continuously over its lifetime, it is crucial that the algorithm autonomously detects clusters of arbitrary shape, with different densities, and a varying number of clusters. Recently, the first hierarchical density-based stream clustering algorithm based on cluster stability, called HASTREAM, was proposed. Although the algorithm was able to meet the above-mentioned requirements, it inherited the main drawback of density-based hierarchical clustering algorithms, namely the efficiency issues. In this paper we propose I-HASTREAM, a first density-based hierarchical clustering algorithm that requires considerably less computation time than HASTREAM. Our proposed method utilizes and introduces techniques from the graph theory domain to devise an incremental update of the underlying model instead of repeatedly performing the expensive calculations on the huge graph. Specifically, Prim's algorithm for constructing the minimal spanning tree is adapted by introducing novel, incremental maintenance of the tree by vertex and edge insertion and deletion. An extensive experimental evaluation on real-world datasets shows that I-HASTREAM is considerably faster than a state-of-the-art hierarchical density-based stream clustering approach while delivering almost the same clustering quality.
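One standard way to maintain a minimum spanning tree under vertex insertion, shown here only as a hedged sketch and not as the I-HASTREAM procedure itself, relies on the fact that the new tree only needs edges of the old tree plus the edges incident to the new vertex. Prim's algorithm is therefore rerun on that small candidate set rather than on the full graph. The function names `prim_mst` and `insert_vertex` and the `(weight, u, v)` edge encoding are assumptions for illustration; the sketch assumes a connected graph and does not cover deletion.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[float, Hashable, Hashable]  # (weight, u, v), undirected


def prim_mst(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> List[Edge]:
    """Prim's algorithm with a binary heap; returns the tree edges."""
    adj: Dict[Hashable, List[Edge]] = {v: [] for v in vertices}
    for w, u, v in edges:
        adj[u].append((w, u, v))
        adj[v].append((w, v, u))
    if not adj:
        return []

    start = next(iter(adj))
    visited = {start}
    heap = list(adj[start])
    heapq.heapify(heap)
    tree: List[Edge] = []

    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # edge would close a cycle
        visited.add(v)
        tree.append((w, u, v))
        for edge in adj[v]:
            if edge[2] not in visited:
                heapq.heappush(heap, edge)
    return tree


def insert_vertex(
    vertices: Iterable[Hashable],
    mst_edges: Iterable[Edge],
    new_vertex: Hashable,
    new_edges: Iterable[Edge],
) -> List[Edge]:
    """Update the tree after adding `new_vertex` with its incident edges.

    Old non-tree edges are never needed again, so Prim runs only on the
    old tree edges plus the new vertex's edges.
    """
    candidates = list(mst_edges) + list(new_edges)
    return prim_mst(list(vertices) + [new_vertex], candidates)
```

With n existing vertices and d new edges, the rerun touches n - 1 + d candidate edges instead of every edge of the graph, which is where the saving over a full recomputation comes from.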
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/hassani15.html

Preface
Recent years have witnessed a dramatic increase in our ability to collect data from various sensors and devices, in different formats, from independent or connected applications. This data flood has outpaced our capability to process, analyze, store and understand these datasets. Consider Internet data. The web pages indexed by Google numbered around one million in 1998, but quickly reached 1 billion in 2000 and had already exceeded 1 trillion in 2008. This rapid expansion is accelerated by the dramatic increase in acceptance of social networking applications, such as Facebook, Twitter, Weibo, etc., that allow users to create content freely and amplify the already huge Web volume. Furthermore, with mobile phones becoming the sensory gateway to real-time data on different aspects of people's lives, the vast amount of data that mobile carriers can potentially process to improve our daily life has significantly outpaced our past CDR (call data record)-based processing for billing purposes only. It can be foreseen that Internet of Things (IoT) applications will raise the scale of data to an unprecedented level. People and devices (from home coffee machines to cars, buses, railway stations and airports) are all loosely connected. Trillions of such connected components will generate a huge data ocean, and valuable information must be discovered from the data to help improve quality of life and make our world a better place. For example, after we get up every morning, in order to optimize our commute time to work and complete the optimization before we arrive at the office, the system needs to process information ranging from traffic, weather, construction and police activities to our calendar schedules, and perform deep optimization under tight time constraints.
In all these applications, we are facing significant challenges in leveraging the vast amount of data, including challenges in (1) system capabilities, (2) algorithmic design, and (3) business models. The aim of this workshop is to bring together people from both academia and industry to present their most recent work related to big-data issues, and to exchange ideas and thoughts in order to advance this big-data challenge, which has been considered one of the most exciting opportunities in the past 10 years.
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/fan15.html

Interactive Visual Big Data Analytics for Large Area Farm Biosecurity Monitoring: i-EKbase System
In this industrial application paper, a novel application of salad leaf disease detection has been developed using a combination of big data analytics and on-field multi-dimensional sensing. We propose a cloud computing based intelligent big data analysis and interactive visual analytics platform to predict farm hot spots with a high probability of potential biosecurity threats, together with an early monitoring system that aims to save the farm from significant economic damage.
Mon, 31 Aug 2015 00:00:00 +0000
https://proceedings.mlr.press/v41/dutta15.html