Proceedings of Machine Learning Research
Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Held in New York, New York, USA on 24 August 2014
Published as Volume 36 by the Proceedings of Machine Learning Research on 13 August 2014.
Volume Edited by:
Wei Fan
Albert Bifet
Qiang Yang
Philip S. Yu
Series Editors:
Neil D. Lawrence
Mark Reid
http://proceedings.mlr.press/v36/
Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning
In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.
http://proceedings.mlr.press/v36/takamatsu14.html
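The coarse feature vector abstract above describes storing each real value in fewer than 32 bits. As a rough illustration only, the sketch below assumes plain uniform quantization over a known range [lo, hi]. The function names and the quantization rule are hypothetical and are not taken from the paper. It shows why the worst-case rounding error roughly halves with every extra bit.

```python
def quantize(values, bits, lo, hi):
    """Map each value in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1          # highest code value
    scale = levels / (hi - lo)
    # Clamp so that out-of-range inputs still produce a valid code.
    return [min(levels, max(0, round((v - lo) * scale))) for v in values]


def dequantize(codes, bits, lo, hi):
    """Map integer codes back to approximate real values."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels         # spacing between representable values
    return [lo + c * step for c in codes]


def max_error(values, bits, lo, hi):
    """Largest absolute round-trip error for the given bit width."""
    restored = dequantize(quantize(values, bits, lo, hi), bits, lo, hi)
    return max(abs(a - b) for a, b in zip(values, restored))
```

For in-range inputs the round-trip error is bounded by half a step, (hi - lo) / (2 * (2**bits - 1)), so it shrinks exponentially in the number of bits under this assumed scheme.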
Frequent Subgraph Discovery in Large Attributed Streaming Graphs
The problem of finding frequent subgraphs in large dynamic graphs has so far only considered a dynamic graph as being represented by a series of static snapshots taken at various points in time. This representation of a dynamic graph does not lend itself well to real-time processing of real-world graphs like social networks or internet traffic, which consist of a stream of nodes and edges. In this paper we propose an algorithm that discovers the frequent subgraphs present in a graph represented by a stream of labeled nodes and edges. Our algorithm is efficient and is easily tuned by the user to produce interesting patterns from various kinds of graph data. In our model, updates to the graph arrive in the form of batches which contain new nodes and edges. Our algorithm continuously reports the frequent subgraphs that are estimated to be found in the entire graph as each batch arrives. We evaluate our system using five large dynamic graph datasets: the Hetrec 2011 challenge data, Twitter, DBLP, and two synthetic datasets. We evaluate our approach against two popular large graph miners, i.e., SUBDUE and GERM. Our experimental results show that we can find the same frequent subgraphs as a non-incremental approach applied to snapshot graphs, and in less time.
http://proceedings.mlr.press/v36/ray14.html
Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark
In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing, which has been the talk of the Big Data town for a while. It is suitable for iterative and interactive algorithms. Our approach splits the dataset into P*P partitions, then shuffles and recombines these partitions into P sub-datasets using rules that avoid sampling conflicts, so that each of the P sub-datasets contains only P partitions. It then processes the sub-datasets one by one, each in parallel. Despite increasing the number of iterations, this method reduces data communication overhead, makes good use of Spark’s efficient iterative execution and results in significant speedup on large-scale datasets in our experiments.
http://proceedings.mlr.press/v36/qiu14.html
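The LDA abstract above does not spell out the rules used to recombine the P*P partitions. One common choice in partitioned Gibbs sampling, assumed here purely for illustration, is a diagonal schedule: index each partition by (document block, word block) and let sub-dataset d hold the blocks (i, (i + d) mod P). No two blocks in a sub-dataset then share a document block or a word block, so they can be sampled in parallel without touching the same counts. The function names below are hypothetical.

```python
def diagonal_schedule(p):
    """Return p sub-datasets; sub-dataset d holds blocks (i, (i + d) % p)."""
    return [[(i, (i + d) % p) for i in range(p)] for d in range(p)]


def is_conflict_free(sub_dataset):
    """True if no two blocks share a document block (row) or word block (column)."""
    rows = [r for r, _ in sub_dataset]
    cols = [c for _, c in sub_dataset]
    return len(set(rows)) == len(rows) and len(set(cols)) == len(cols)


if __name__ == "__main__":
    for d, blocks in enumerate(diagonal_schedule(3)):
        print(d, blocks, is_conflict_free(blocks))
```

Under this assumed rule the P sub-datasets together cover every one of the P*P partitions exactly once per sweep.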
iPARAS: Incremental Construction of Parameter Space for Online Association Mining
Association rule mining is known to be computationally intensive, yet real-time decision-making applications are increasingly intolerant to delays. The state-of-the-art PARAS solution, a parameter space framework for online association mining, enables efficient rule mining by compactly indexing the final ruleset and providing efficient query-time redundancy resolution. Unfortunately, like many association mining models, PARAS was designed for static data. Modern transaction databases undergo regular data updates that quickly invalidate existing rules or introduce new rules for the PARAS index. Reloading the PARAS index from scratch is impractical, because even minor data changes would require complete rule inference and redundancy resolution steps. We propose to tackle this open problem by designing an incremental parameter space construction approach, called iPARAS, that utilizes the previous mining result to minimally adjust the ruleset and associated redundancy relationships. iPARAS features two innovative techniques. First, iPARAS provides an end-to-end solution, composed of three algorithms, to efficiently update the final ruleset in the parameter space. Second, iPARAS designs a compact data structure to maintain the complex redundancy relationships. Overall, iPARAS achieves a speed-up of several times on parameter space construction for transaction databases compared to the state-of-the-art online association rule mining system PARAS.
http://proceedings.mlr.press/v36/qin14.html
The Gamma Operator for Big Data Summarization on an Array DBMS
SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.
http://proceedings.mlr.press/v36/ordonez14.html
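The Gamma operator abstract above relies on sufficient statistics computed in one pass and in parallel. The sketch below is a generic illustration, not the SciDB operator. It assumes the statistics are the row count n, the linear sums L and the quadratic sums Q, and all function names are hypothetical. Summaries from separate partitions combine by simple addition, which is what makes parallel evaluation straightforward.

```python
def summarize(rows, d):
    """One pass over d-dimensional rows: count n, linear sums L, quadratic sums Q."""
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for i in range(d):
            L[i] += x[i]
            for j in range(d):
                Q[i][j] += x[i] * x[j]
    return n, L, Q


def merge(a, b):
    """Combine the summaries of two disjoint partitions by element-wise addition."""
    n1, L1, Q1 = a
    n2, L2, Q2 = b
    L = [u + v for u, v in zip(L1, L2)]
    Q = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(Q1, Q2)]
    return n1 + n2, L, Q


def covariance(n, L, Q):
    """Population covariance from the summary alone: Q/n - (L/n)(L/n)^T."""
    d = len(L)
    return [[Q[i][j] / n - (L[i] / n) * (L[j] / n) for j in range(d)]
            for i in range(d)]
```

Once n, L and Q are available, quantities such as means and covariances (the inputs to PCA or linear regression) can be derived without rereading the data.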
A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries
Clustering is an important problem in Statistics and Machine Learning that is usually solved using Likelihood Maximization Methods, of which the Expectation-Maximization Algorithm (EM) is the most common. In this work we present an SQL implementation of an algorithm merging Markov Chain Monte Carlo methods with the EM algorithm to find qualitatively better solutions for the clustering problem. Even though SQL is not optimized for complex calculations, as it is constrained to work on tables and columns, it is unparalleled in handling all aspects of storage management, security of the information, fault management, etc. Our algorithm makes use of these characteristics to produce portable solutions that are comparable to the results obtained by other algorithms and are more efficient since the calculations are all performed inside the DBMS. To simplify the calculation we use very simple scalar UDFs, of a type that is available in most DBMSs. The solution has linear time complexity in the size of the data set and it has a linear speedup with the number of servers in the cluster. This was achieved using sufficient statistics, a simplified model that assigns the data points to different clusters during the E-step in an incremental manner, and the introduction of a Sampling step in order to explore the solution space more efficiently. Preliminary experiments show very good agreement with standard solutions.
http://proceedings.mlr.press/v36/matusevich14.html
A Fast Distributed Stochastic Gradient Descent Algorithm for Matrix Factorization
The accuracy and effectiveness of the matrix factorization technique were well demonstrated in the Netflix movie recommendation contest. Among the numerous solutions for matrix factorization, Stochastic Gradient Descent (SGD) is one of the most widely used algorithms. However, as a sequential approach, the SGD algorithm cannot directly be used in a Distributed Cluster Environment (DCE). In this paper, we propose a fast distributed SGD algorithm named FDSGD for matrix factorization, which can run efficiently in DCE. The algorithm addresses the data sharing problem with an independent storage system, avoiding data synchronization that can significantly degrade performance. It addresses the synchronous operation problem in DCE with a distributed synchronization tool, so that cooperating distributed threads can run in a coordinated manner.
http://proceedings.mlr.press/v36/li14.html
FAQ: A Framework for Fast Approximate Query Processing on Temporal Data
Temporal queries on time evolving data are at the heart of a broad range of business and network intelligence applications, including consumer behavior analysis, trend analysis, temporal pattern mining, sentiment analysis on social media, cyber security, and network monitoring. In this work, we present an innovative data structure called Fast Approximate Query-able (FAQ) which provides a unified framework for temporal query processing on Big Data. FAQ uses a novel composition of data sketching, wavelet-style differencing for temporal compression, and quantization, and handles diverse kinds of queries including distinct counts, set membership, frequency estimation, top-K, p-norms, empirical entropy, and distance queries such as histogram ℓ_p-norm distance (including Euclidean and Manhattan distance), cosine similarity, Jaccard coefficient, and rank correlation. Experiments on a real-life multi-dimensional network monitoring data set demonstrate speedups of 92x achieved by FAQ over a flat representation of data for a mixed temporal query workload.
http://proceedings.mlr.press/v36/khurana14.html
Parallel Graph Mining with GPUs
Frequent graph mining is an important though computationally hard problem because it requires enumerating a possibly exponential number of candidate subgraph patterns, and checking their presence in a database of graphs. In this paper, we propose a novel approach for parallel graph mining on GPUs, which have emerged as a relatively cheap but powerful architecture for general purpose computing. However, the thread model for GPUs is different from that of CPUs, which makes the parallelization of graph mining algorithms on GPUs a challenging task. We investigate the major challenges for GPU-based graph mining. We perform extensive experiments on several real-world and synthetic datasets, achieving speedups of up to 9x over the sequential algorithm.
http://proceedings.mlr.press/v36/kessl14.html
High density-focused uncertainty sampling for active learning over evolving stream data
Data labeling is an expensive and time-consuming task, hence carefully choosing which labels to use for training a model is becoming increasingly important. In the active learning setting, a classifier is trained by querying labels from a small representative fraction of data. While many approaches exist for non-streaming scenarios, few works consider the challenges of the data stream setting. We propose a new active learning method for evolving data streams based on a combination of density and prediction uncertainty (DBALStream). Our approach decides whether to label an instance by considering whether it lies in a high-density partition of the data space. This allows focusing labelling efforts on the regions of the instance space where more data is concentrated; hence, the benefits of learning a more accurate classifier are expected to be higher. Instance density is approximated in an online manner by a sliding window mechanism, a standard technique for data streams. We compare our method with state-of-the-art active learning strategies over benchmark datasets. The experimental analysis demonstrates good predictive performance of the new approach.
http://proceedings.mlr.press/v36/ienco14.html
From Tweets to Stories: Using Stream-Dashboard to weave the twitter data stream into dynamic cluster models
Social media has recently emerged as an invaluable source of information for decision making. Social media information reflects the interests of virtual communities in a spontaneous and timely manner. The need to understand the massive streams of data generated by social media platforms, such as Twitter and Facebook, has motivated researchers to use machine learning techniques to try to discover knowledge in real time. In this paper, we adapt our recently developed stream cluster mining, tracking and validation framework, Stream-Dashboard, to support detecting and tracking evolving discussion clusters in Twitter. The effectiveness of Stream-Dashboard in telling stories is illustrated by analyzing a couple of stories related to the Louisville Cardinals’ basketball championship. We further validate the detected story lines, which are automatically mined from user-generated tweets, against an alternative source, Google Trends, which is based on search queries.
http://proceedings.mlr.press/v36/hawwash14.html
Towards Optimal Execution of Density-based Clustering on Heterogeneous Hardware
Data Clustering is an important and highly utilized data mining technique in various application domains. With ever increasing data volumes in the era of big data, the efficient execution of clustering algorithms is a fundamental prerequisite to gain understanding and acquire novel, previously unknown knowledge from data. To establish an efficient execution, the clustering algorithms have to be re-engineered to fully exploit the provided hardware capabilities. Shared-memory multiprocessor systems like graphics processing units (GPUs) provide extremely high parallelism combined with a high bandwidth transfer at low cost. The availability of such computing units increases with upcoming processors, where a common CPU and various computing units, like GPU, are tightly coupled using a unified shared memory hierarchy. In this paper, we consider density-based clustering for such heterogeneous systems. In particular, we optimize the configuration of CUDA-DClust (a density-based clustering algorithm for GPUs) and show that our configuration approach enables an efficient and deterministic execution. Our configuration approach is based on data as well as hardware properties, so that we are able to adjust the algorithm execution in both directions. In our evaluation, we show the applicability of our approach and present open challenges which have to be solved next.
http://proceedings.mlr.press/v36/habich14.html
Preface
The aim of this workshop is to bring together people from both academia and industry to present their most recent work related to big-data issues, and exchange ideas and thoughts in order to advance this big-data challenge, which has been considered as one of the most exciting opportunities in the past 10 years. Recent years have witnessed a dramatic increase in our ability to collect data from various sensors and devices, in different formats, from independent or connected applications. This data flood has outpaced our capability to process, analyze, store and understand these datasets. Consider the Internet data. The web pages indexed by Google were around one million in 1998, but quickly reached 1 billion in 2000 and had already exceeded 1 trillion in 2008. This rapid expansion is accelerated by the dramatic increase in acceptance of social networking applications, such as Facebook, Twitter, Weibo, etc., that allow users to create content freely and amplify the already huge Web volume. Furthermore, with mobile phones becoming the sensory gateway to get real-time data on people from different aspects, the vast amount of data that mobile carriers can potentially process to improve our daily life has significantly outpaced our past CDR (call data record)-based processing for billing purposes only. It can be foreseen that Internet of things (IoT) applications will raise the scale of data to an unprecedented level. People and devices (from home coffee machines to cars, to buses, railway stations and airports) are all loosely connected. Trillions of such connected components will generate a huge data ocean, and valuable information must be discovered from the data to help improve quality of life and make our world a better place.
For example, after we get up every morning, in order to optimize our commute time to work and complete the optimization before we arrive at the office, the system needs to process information ranging from traffic, weather, construction, and police activities to our calendar schedules, and perform deep optimization under tight time constraints. In all these applications, we are facing significant challenges in leveraging the vast amount of data, including challenges in (1) system capabilities, (2) algorithmic design, and (3) business models.
http://proceedings.mlr.press/v36/fan14.html
Ensembles of Adaptive Model Rules from High-Speed Data Streams
The volume and velocity of data are increasing at astonishing rates. In order to extract knowledge from this huge amount of information there is a need for efficient on-line learning algorithms. Rule-based algorithms produce models that are easy to understand and can be used almost offhand. Ensemble methods combine several predicting models to improve the quality of prediction. In this paper, a new on-line ensemble method that combines a set of rule-based models is proposed to solve regression problems from data streams. Experimental results using synthetic and real time-evolving data streams show the proposed method significantly improves the performance of the single rule-based learner, and outperforms two state-of-the-art regression algorithms for data streams.
http://proceedings.mlr.press/v36/duarte14.html
Scalable Graph Building from Text Data
In this paper we propose NNCTPH, a new MapReduce algorithm that is able to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses an exhaustive search inside the buckets to build the graph. It also uses multiple stages to join the different unconnected subgraphs. We experimentally test the algorithm on different datasets consisting of the subjects of spam emails. Although the algorithm is still at an early development stage, it already proves to be four times faster than a MapReduce implementation of NN-Descent, for the same quality of the produced graph.
http://proceedings.mlr.press/v36/debatty14.html
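The NNCTPH abstract above describes binning texts into buckets and then searching exhaustively inside each bucket. The sketch below illustrates only that bucket-then-brute-force idea on a single machine. It does not implement Context Triggered Piecewise Hashing or MapReduce: the bucket key and the word-set Jaccard similarity are stand-ins chosen for brevity, and every name is hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings (illustrative stand-in)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def bucket_knn(texts, k, bucket_key, similarity=jaccard):
    """Approximate k-NN graph: compare each text only with texts in its own bucket."""
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        buckets[bucket_key(text)].append(idx)

    graph = {}
    for members in buckets.values():
        for i in members:
            # Exhaustive search restricted to the bucket of text i.
            scored = sorted(
                ((similarity(texts[i], texts[j]), j) for j in members if j != i),
                reverse=True,
            )
            graph[i] = [j for _, j in scored[:k]]
    return graph
```

Texts that land in different buckets are never compared, which is why the resulting graph is approximate and why a later joining stage, as the abstract mentions, is needed to connect separate subgraphs.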
Scalable Heterogeneous Transfer Ranking
In this paper, we propose to study the problem of heterogeneous transfer ranking, a transfer learning problem with heterogeneous features, in order to utilize the rich large-scale labeled data in popular languages to help the ranking task in less popular languages. We develop a large-margin algorithm, namely LM-HTR, to solve the problem by mapping the input features in both the source domain and target domain into a shared latent space and simultaneously minimizing the feature reconstruction loss and prediction loss. We analyze the theoretical bound of the prediction loss and develop fast algorithms via stochastic gradient descent so that our model can scale to large applications. Experimental results on two application datasets demonstrate the advantages of our algorithms over other state-of-the-art methods.
http://proceedings.mlr.press/v36/bahadori14.html