Proceedings of Machine Learning Research
Proceedings of The 8th Asian Conference on Machine Learning
Held in The University of Waikato, Hamilton, New Zealand on 16-18 November 2016
Published as Volume 63 by the Proceedings of Machine Learning Research on 20 November 2016.
Volume Edited by:
Robert J. Durrant
Kee-Eung Kim
Series Editors:
Neil D. Lawrence
Mark Reid
http://proceedings.mlr.press/v63/
Mon, 29 May 2017 07:40:01 +0000
Learning Feature Aware Metric
Distance Metric Learning (DML) aims to find a distance metric that reveals feature relationships and satisfies restrictions between instances, for distance-based classifiers such as kNN. Most DML methods take all features into consideration while leaving feature importance identification untouched. Feature selection methods, on the other hand, focus only on feature weights and are seldom designed directly for distance-based classifiers. In this paper, we propose a Feature AwaRe Metric learning (FARM) method which not only learns an appropriate metric for distance constraints but also discovers significant features and their relationships. In the FARM approach, we treat a distance metric as a combination of feature weighting and feature relationship discovery factors. Decoupling the metric into these two parts facilitates flexible regularization for feature importance selection as well as feature relationship construction. Simulations on artificial datasets clearly reveal the comprehensiveness of feature weighting for FARM. Experiments on real datasets validate the improvement in classification performance and the efficiency of our FARM approach.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/ye4.html
Enhancing Topic Modeling on Short Texts with Crowdsourcing
Topic modeling is now widely used in text archive analytics to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective at building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics. Unlike traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges for the design of simple tasks that non-experts can finish in seconds. In this paper, we design a new topic modeling technique that fully exploits basic human intuitions about reading short texts. By asking human workers to subjectively measure the similarity between short text pairs, the accuracy of topic modeling algorithms can be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide an analysis of the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the resulting topics on English tweets and Chinese microblogs while requesting only a small number of labels from the crowd.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/yang54.html
Multitask Principal Component Analysis
Principal Component Analysis (PCA) is a canonical and well-studied tool for dimensionality reduction. However, when few data are available, the poor quality of the covariance estimator at its core may compromise its performance. We address this issue by casting PCA into a multitask framework and, in doing so, show how to solve several related PCA problems simultaneously. To this end, we propose a novel formulation of the PCA problem relying on a novel regularization. This regularization is based on a distance between subspaces, and the whole problem is solved as an optimization problem over a Riemannian manifold. We experimentally demonstrate the usefulness of our approach as pre-processing for EEG signals.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/yamane65.html
Modelling Symbolic Music: Beyond the Piano Roll
In this paper, we consider the problem of probabilistically modelling symbolic music data. We introduce a representation which reduces polyphonic music to a univariate categorical sequence. In this way, we are able to apply state-of-the-art natural language processing techniques, namely the long short-term memory sequence model. The representation we employ permits arbitrary rhythmic structure, which we assume to be given. We show that our model is effective on all four piano roll based benchmark datasets. We further improve our model by augmenting our training data set with transpositions of the original pieces through all musical keys, thereby convincingly advancing the state of the art on these benchmark problems. We also fit models to music which is unconstrained in its rhythmic structure, discuss the properties of this model, and provide musical samples which are more sophisticated than previously possible with this class of recurrent neural network sequence models. We also provide our newly preprocessed data set of non piano-roll music data. To facilitate future work we describe and provide a new carefully preprocessed dataset of 19700 classical MIDI music files, significantly more than previously available.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/walder88.html
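The transposition-based augmentation mentioned in the abstract above can be illustrated with a short sketch. This is not the authors' code: it assumes a piece is given simply as a list of MIDI pitch numbers (0 to 127), and the names `transpose` and `augment_with_transpositions`, as well as the choice of offsets from -6 to +5 semitones to cover twelve keys, are hypothetical.

```python
from typing import List, Optional

MIDI_LOW, MIDI_HIGH = 0, 127  # valid MIDI note-number range


def transpose(pitches: List[int], semitones: int) -> Optional[List[int]]:
    """Shift every pitch by `semitones`; return None if any note leaves the MIDI range."""
    shifted = [p + semitones for p in pitches]
    if any(p < MIDI_LOW or p > MIDI_HIGH for p in shifted):
        return None
    return shifted


def augment_with_transpositions(pitches: List[int]) -> List[List[int]]:
    """Return the piece in twelve keys (offsets -6..+5, an assumed convention).

    Transpositions that would fall outside the MIDI range are skipped.
    """
    copies = []
    for offset in range(-6, 6):
        shifted = transpose(pitches, offset)
        if shifted is not None:
            copies.append(shifted)
    return copies


if __name__ == "__main__":
    c_major_triad = [60, 64, 67]  # C4, E4, G4
    for version in augment_with_transpositions(c_major_triad):
        print(version)
```

Offset 0 is included, so the original piece is always one of the returned copies.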
Secure Approximation Guarantee for Cryptographically Private Empirical Risk Minimization
Privacy concerns have become increasingly important in many machine learning (ML) problems. We study empirical risk minimization (ERM) problems under secure multi-party computation (MPC) frameworks. The main technical tools for MPC have been developed based on cryptography. One limitation of current cryptographically private ML is that it is computationally intractable to evaluate non-linear functions such as logarithmic or exponential functions. Therefore, for a class of ERM problems such as logistic regression, in which non-linear function evaluations are required, one can only obtain approximate solutions. In this paper, we introduce a novel cryptographically private tool called the secure approximation guarantee (SAG) method. The key property of the SAG method is that, given an arbitrary approximate solution, it can provide a non-probabilistic, assumption-free bound on the approximation quality under a cryptographically secure computation framework. We demonstrate the benefit of the SAG method by applying it to several problems, including a practical privacy-preserving data analysis task on genomic and clinical information.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/takada48.html
Cost Sensitive Online Multiple Kernel Classification
Mining data streams has been an important open research problem in the era of big data analytics. This paper investigates supervised machine learning techniques for mining data streams with application to online anomaly detection. Unlike conventional data mining tasks, mining data streams for online anomaly detection has several challenges: (i) data arriving sequentially and increasing rapidly; (ii) highly class-imbalanced distributions; and (iii) complex anomaly patterns that could evolve dynamically. To tackle these challenges, we propose Cost-Sensitive Online Multiple Kernel Classification (CSOMKC) for comprehensively mining data streams and demonstrate its application to online anomaly detection. Specifically, CSOMKC learns a kernel-based cost-sensitive prediction model for imbalanced data streams in a sequential or online learning fashion, in which a pool of multiple diverse kernels is dynamically explored. The optimal kernel predictor and the multiple kernel combination are learnt together, while class imbalance issues are addressed simultaneously. We perform theoretical and extensive empirical analysis of the proposed algorithms.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/sahoo56.html
Linearized Alternating Direction Method of Multipliers for Constrained Nonconvex Regularized Optimization
In this paper, we consider a class of constrained nonconvex regularized minimization problems, where the constraints are linear. It has been reported in the literature that nonconvex regularization usually yields a solution with more desirable sparse structural properties than convex regularization. However, it is not easy to obtain the proximal mapping associated with nonconvex regularization, due to the imposed linear constraints. In this paper, the optimization problem with linear constraints is solved by the Linearized Alternating Direction Method of Multipliers (LADMM). Moreover, we present a detailed convergence analysis of the LADMM algorithm for solving nonconvex compositely regularized optimization with a large class of nonconvex penalties. Experimental results on several real-world datasets validate the efficacy of the proposed algorithm.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/qiao37.html
Preface
Preface for ACML 2016
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/preface.html
A Bayesian Nonparametric Approach for Multi-label Classification
Many real-world applications require multi-label classification, where multiple target labels are assigned to each instance. In multi-label classification, there exist intrinsic correlations between the labels and features. These correlations are beneficial for the multi-label classification task since they reflect the coexistence of the input and output spaces, which can be exploited for prediction. Traditional classification methods have attempted to reveal these correlations in different ways. However, existing methods incur high computational complexity in finding such correlation structures. Furthermore, these approaches cannot identify a suitable number of label-feature correlation patterns. In this paper, we propose a Bayesian nonparametric (BNP) framework for multi-label classification that can automatically learn and exploit the unknown number of multi-label correlations. We utilize recent techniques in stochastic inference to derive a cheap (but efficient) posterior inference algorithm for the model. In addition, our model can naturally exploit the useful information from samples with missing labels. Furthermore, we extend the model to update parameters in an online fashion, which highlights the flexibility of our model compared with existing approaches. We compare our method with state-of-the-art multi-label classification algorithms on real-world datasets using both complete and missing label settings. Our model achieves better classification accuracy while consistently running an order of magnitude faster than the baselines.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/nguyen93.html
Multiple Kernel Learning with Data Augmentation
The motivations of the multiple kernel learning (MKL) approach are to increase kernel expressiveness and to avoid an expensive grid search over a wide spectrum of kernels. A large amount of work has been proposed to improve MKL in terms of computational cost and sparsity of the solution. However, these studies still either require an expensive grid search on the model parameters or scale unsatisfactorily with the numbers of kernels and training samples. In this paper, we address these issues by combining MKL, the Stochastic Gradient Descent (SGD) framework, and a data augmentation technique. Our proposed method is developed as follows. We first develop a maximum-a-posteriori (MAP) view of MKL under a probabilistic setting, described by a graphical model. This view allows us to develop a data augmentation technique that makes inference for finding the optimal parameters feasible, as opposed to the traditional approach of training MKL via convex optimization techniques. As a result, we can use the standard SGD framework to learn the weight matrix and extend the model to support online learning. We validate our method on several benchmark datasets in both batch and online settings. The experimental results show that our proposed method can learn the parameters in a principled way that eliminates the expensive grid search while gaining a significant computational speedup compared with state-of-the-art baselines.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/nguyen19.html
Simulation and Calibration of a Fully Bayesian Marked Multidimensional Hawkes Process with Dissimilar Decays
We propose a simulation method for multidimensional Hawkes processes based on the superposition theory of point processes. This formulation allows us to design efficient simulations for Hawkes processes with differing exponentially decaying intensities. We demonstrate that inter-arrival times can be decomposed into simpler auxiliary variables that can be sampled directly, giving exact simulation with no approximation. We establish that the auxiliary variables provide information on the parent process for each event time. The correctness of the algorithm is shown by verifying the simulated intensities against their theoretical moments. A modular inference procedure combining Gibbs sampling, via adaptive rejection sampling, with Metropolis-Hastings steps is presented. Finally, we compare our proposed simulation method against existing methods and find a significant improvement in algorithm speed. Our inference algorithm is used to discover the strengths of mutual excitation in real dark networks.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/lim83.html
Localized Multiple Kernel Learning—A Convex Approach
We propose a localized approach to multiple kernel learning that can be formulated as a convex optimization problem over a given cluster structure. For this formulation, we obtain generalization error guarantees and derive an optimization algorithm based on the Fenchel dual representation. Experiments on real-world datasets from the application domains of computational biology and computer vision show that convex localized multiple kernel learning can achieve higher prediction accuracies than its global and non-convex local counterparts.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/lei63.html
Bank of Weight Filters for Deep CNNs
Convolutional neural networks (CNNs) are seen to be extremely effective in many large object recognition tasks. One of the reasons for this is that they also learn appropriate features from the training data. The convolutional layers of a CNN contain these feature-generating filters, whose weights are learnt. However, this entails learning millions of weights (across different layers), and hence learning times are very long even on the best available hardware. In some studies in transfer learning it has been observed that a network learnt on one task can be reused on another task (with some fine-tuning). In this context, this paper presents a systematic study of the exchangeability of weight filters of CNNs across different object recognition tasks. The paper proposes the concept of a bank of weight filters (BWF), which consists of all the weight vectors of filters learnt by different CNNs on different tasks. The BWF can be viewed at multiple levels of granularity, such as network level, layer level and filter level. Through extensive empirical investigations we show that one can efficiently learn CNNs for new tasks by randomly selecting from the bank of filters to initialize the convolutional layers of the new CNN. Our study is done at all the levels of granularity mentioned above. Our results show that the concept of BWF proposed here offers a very good strategy for initializing the filters while learning CNNs. We also show that the dependency among the filters and the layers of the CNN is not strict. One can choose any pre-trained filter, instead of a fixed pre-trained net as a whole, for initialization. This paper is a first step in the direction of creating and characterizing a Universal BWF for efficient learning of CNNs.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/kumaraswamy29.html
Collaborative Recurrent Neural Networks for Dynamic Recommender Systems
Modern technologies enable us to record sequences of online user activity at an unprecedented scale. Although such activity logs are abundantly available, most approaches to recommender systems are based on the rating-prediction paradigm, ignoring temporal and contextual aspects of user behavior revealed by temporal, recurrent patterns. In contrast to explicit ratings, such activity logs can be collected in a non-intrusive way and can offer richer insights into the dynamics of user preferences, which could potentially lead to more accurate user models. In this work we advocate studying this ubiquitous form of data and, by combining ideas from latent factor models for collaborative filtering and language modeling, propose a novel, flexible and expressive collaborative sequence model based on recurrent neural networks. The model is designed to capture a user’s contextual state as a personalized hidden vector by summarizing cues from a data-driven, thus variable, number of past time steps, and represents items by a real-valued embedding. We found that, by exploiting the inherent structure in the data, our formulation leads to an efficient and practical method. Furthermore, we demonstrate the versatility of our model by applying it to two different tasks, music recommendation and mobility prediction, and we show empirically that our model consistently outperforms static and non-collaborative methods.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/ko101.html
Deep Gate Recurrent Neural Network
This paper explores the possibility of using multiplicative gates to build two recurrent neural network structures. These two structures, called Deep Simple Gated Unit (DSGU) and Simple Gated Unit (SGU), are designed for learning long-term dependencies. Compared to traditional Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), both structures require fewer parameters and less computation time in sequence classification tasks. Unlike GRU and LSTM, which require more than one gate to control information flow in the network, SGU and DSGU use only one multiplicative gate to control the flow of information. We show that this difference can accelerate learning in tasks that require long dependency information. We also show that DSGU is more numerically stable than SGU. In addition, we propose a standard way of representing the inner structure of an RNN, called the RNN Conventional Graph (RCG), which helps to analyze the relationship between the input units and hidden units of an RNN.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/gao30.html
Learning from Survey Training Samples: Rate Bounds for Horvitz-Thompson Risk Minimizers
The generalization ability of minimizers of the empirical risk in the context of binary classification has been investigated under a wide variety of complexity assumptions for the collection of classifiers over which optimization is performed. In contrast, the vast majority of the works dedicated to this issue stipulate that the training dataset used to compute the empirical risk functional is composed of i.i.d. observations and involve sharp control of uniform deviation of i.i.d. averages from their expectation. Beyond the cases where training data are drawn uniformly without replacement among a large i.i.d. sample or modelled as a realization of a weakly dependent sequence of r.v.’s, statistical guarantees when the data used to train a classifier are drawn by means of a more general sampling/survey scheme and exhibit a complex dependence structure have not been documented in the literature yet. It is the main purpose of this paper to show that the theory of empirical risk minimization can be extended to situations where statistical learning is based on survey samples and knowledge of the related (first order) inclusion probabilities. Precisely, we prove that minimizing a (possibly biased) weighted version of the empirical risk, referred to as the (approximate) Horvitz-Thompson risk (HT risk), over a class of controlled complexity leads to a rate for the excess risk of the order $O_{\mathbb{P}}((\kappa_N (\log N)/n)^{1/2})$ with $\kappa_N = (n/N)/\min_{i \le N} \pi_i$, when data are sampled by means of a rejective scheme of (deterministic) size $n$ within a statistical population of cardinality $N \ge n$, a generalization of basic sampling without replacement with unequal probability weights $\pi_i > 0$. Extensions to other sampling schemes are then established by a coupling argument.
Beyond theoretical results, numerical experiments are displayed in order to show the relevance of HT risk minimization and that ignoring the sampling scheme used to generate the training dataset may completely jeopardize the learning procedure.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/clemencon64.html
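The weighted empirical risk described in the abstract above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the classical Horvitz-Thompson form, in which each sampled loss is reweighted by the inverse of its first-order inclusion probability and the total is divided by the population size N. The function names `horvitz_thompson_risk` and `kappa` are hypothetical.

```python
from typing import Sequence


def horvitz_thompson_risk(losses: Sequence[float],
                          inclusion_probs: Sequence[float],
                          population_size: int) -> float:
    """Estimate the population-average loss from a survey sample.

    losses[i] is the loss on the i-th sampled observation and
    inclusion_probs[i] is its first-order inclusion probability pi_i.
    """
    if len(losses) != len(inclusion_probs):
        raise ValueError("losses and inclusion_probs must have equal length")
    if any(p <= 0.0 or p > 1.0 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weighted_total = sum(l / p for l, p in zip(losses, inclusion_probs))
    return weighted_total / population_size


def kappa(sample_size: int, population_size: int, min_inclusion_prob: float) -> float:
    """kappa_N = (n / N) / min_i pi_i, the factor appearing in the stated rate."""
    return (sample_size / population_size) / min_inclusion_prob


if __name__ == "__main__":
    # Equal probabilities pi_i = n / N: the estimate reduces to the plain sample mean.
    print(horvitz_thompson_risk([1.0, 0.0, 1.0], [0.3, 0.3, 0.3], 10))
    print(kappa(3, 10, 0.3))
```

With equal inclusion probabilities the factor kappa_N equals 1, and it grows as the smallest inclusion probability shrinks relative to n/N.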
Unifying Topic, Sentiment & Preference in an HDP-Based Rating Regression Model for Online Reviews
This paper proposes a new HDP-based online review rating regression model named Topic-Sentiment-Preference Regression Analysis (TSPRA). TSPRA combines topics (i.e. product aspects), word sentiment and user preference as regression factors, and is able to perform topic clustering, review rating prediction, sentiment analysis and what we introduce as “critical aspect” analysis together in one framework. TSPRA extends sentiment approaches by taking the key concept of “user preference” in collaborative filtering (CF) models into consideration, while it is distinct from current CF models in decoupling “user preference” and “sentiment” as independent factors. Our experiments conducted on 22 Amazon datasets show overwhelmingly better performance in rating prediction against a state-of-the-art model, FLAME (2015), in terms of error, Pearson’s correlation and number of inverted pairs. For sentiment analysis, we compare the derived word sentiments against a public sentiment resource, SenticNet3, and our sentiment estimations clearly make more sense in the context of online reviews. Last, as a result of the de-correlation of “user preference” from “sentiment”, TSPRA is able to evaluate a new concept, “critical aspects”, defined as the product aspects that users care about seriously but that are commented on negatively in reviews. Improvement of such “critical aspects” could be most effective in enhancing user experience.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/chen33.html
Non-Linear Smoothed Transductive Network Embedding with Text Information
Network embedding is a classical task which aims to map the nodes of a network to low-dimensional vectors. Most previous network embedding methods are trained in an unsupervised scheme. The learned node embeddings can then be used as inputs to many machine learning tasks, such as node classification and attribute inference. However, the discriminative power of the node embeddings may be improved by considering the node label information and the node attribute information. Inspired by traditional semi-supervised learning techniques, we explore training the node embeddings and the node classifiers simultaneously with the text attribute information in a flexible framework. We present NLSTNE (Non-Linear Smoothed Transductive Network Embedding), a transductive network embedding method whose embeddings are enhanced by modeling the non-linear pairwise similarity between the nodes and the non-linear relationship between the nodes and the text attributes. We use the node classification task to evaluate the quality of node embeddings learned by different models on four real-world network datasets. The experimental results demonstrate that our model outperforms several state-of-the-art network embedding methods.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/chen115.html
Hierarchical Probabilistic Matrix Factorization with Network Topology for Multi-relational Social Network
Link prediction in multi-relational social networks has attracted much attention. For instance, we may care about the chance of two users being friends based on their contacts through other channels, e.g., SMS and phone calls. In previous work, matrix factorization models are typically applied to single-relational networks; however, two challenges arise in extending them to multi-relational networks. First, the interaction of different relation types is hard to capture. Second, there is the cold-start problem, as prediction for new entities in multi-relational networks becomes even more challenging. In this article we propose a novel method called Hierarchical Probabilistic Matrix Factorization with Network Topology (HPMFNT). Our model exploits the network topology by extending the Katz index to multi-relational settings, which can efficiently model the multidimensional interplay via the auxiliary information from other relationships. We also utilize the extended Katz index along with entity attributes to solve the cold-start problem. Experiments on two real-world datasets show that our model outperforms the state of the art by a significant margin.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/bai103.html
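For readers unfamiliar with the Katz index that the abstract above builds on, the following sketch computes the standard single-relation version as a truncated series: the score between two nodes sums, over path lengths k, the number of length-k walks weighted by beta^k. The multi-relational extension proposed in the paper is not reproduced here; the function names and the truncation parameter `max_length` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_index(adjacency: Matrix, beta: float, max_length: int) -> Matrix:
    """Truncated Katz index: sum over k = 1..max_length of beta**k * A**k.

    For beta below the reciprocal of the largest eigenvalue of A, the full
    series converges to (I - beta * A)^(-1) - I.
    """
    power = [row[:] for row in adjacency]                   # A^1
    scores = [[beta * v for v in row] for row in adjacency]
    weight = beta
    for _ in range(2, max_length + 1):
        power = mat_mul(power, adjacency)                   # A^k
        weight *= beta                                      # beta^k
        scores = [[s + weight * p for s, p in zip(s_row, p_row)]
                  for s_row, p_row in zip(scores, power)]
    return scores


if __name__ == "__main__":
    # Path graph 0 - 1 - 2: nodes 0 and 2 are linked only through a length-2 walk.
    path = [[0.0, 1.0, 0.0],
            [1.0, 0.0, 1.0],
            [0.0, 1.0, 0.0]]
    for row in katz_index(path, beta=0.1, max_length=2):
        print(row)
```

A small beta discounts long walks heavily, so directly connected pairs receive much higher scores than pairs joined only through intermediaries.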
EcoICA: Skewness-based ICA via Eigenvectors of Cumulant Operator
Independent component analysis (ICA) is an important unsupervised learning method. Most popular ICA methods, such as FastICA and JADE, use kurtosis as a metric of non-Gaussianity to maximize. However, their assumption of kurtotic sources may not always be satisfied in practice. For weak-kurtosic but skewed sources, kurtosis-based methods could fail while skewness-based methods seem more promising, where skewness is another non-Gaussianity metric measuring the asymmetry of signals. Partly due to the common assumption of signal symmetry, skewness-based ICA has not been systematically studied in spite of some existing works. In this paper, we take a systematic approach to develop EcoICA, a new skewness-based ICA method for weak-kurtosic but skewed sources. Specifically, we design a new cumulant operator, define its eigenvalues and eigenvectors, reveal their connections with the ICA model to formulate the EcoICA problem, and use the Jacobi method to solve it. Experiments on both synthetic and real data show the superior performance of EcoICA over existing kurtosis-based and skewness-based methods for skewed sources. In particular, EcoICA is less sensitive to sample size, noise, and outliers than other methods. Studies on face recognition further confirm the usefulness of EcoICA in classification.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Song94.html
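The two non-Gaussianity measures contrasted in the abstract above, skewness and kurtosis, can be computed from a sample as standardized third and fourth central moments. The sketch below only illustrates these statistics; it is not the EcoICA cumulant operator, and the function names are hypothetical.

```python
from typing import Sequence


def standardized_moment(xs: Sequence[float], order: int) -> float:
    """Central moment of the given order divided by the variance to the power order/2."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0.0:
        raise ValueError("sample has zero variance")
    central = sum((x - mean) ** order for x in xs) / n
    return central / variance ** (order / 2)


def skewness(xs: Sequence[float]) -> float:
    """Third standardized moment: zero for a symmetric sample."""
    return standardized_moment(xs, 3)


def excess_kurtosis(xs: Sequence[float]) -> float:
    """Fourth standardized moment minus 3, so a Gaussian has value zero."""
    return standardized_moment(xs, 4) - 3.0


if __name__ == "__main__":
    symmetric = [-1.0, 0.0, 1.0]
    right_skewed = [0.0, 0.0, 0.0, 1.0]
    print(skewness(symmetric), skewness(right_skewed))
    print(excess_kurtosis([-1.0, 1.0, -1.0, 1.0]))
```

A source can have excess kurtosis near zero while still being clearly skewed, which is the regime the abstract refers to as weak-kurtosic but skewed.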
Proper Inner Product with Mean Displacement for Gaussian Noise Invariant ICA
Independent Component Analysis (ICA) is a classical method for Blind Source Separation (BSS). In this paper, we are interested in ICA in the presence of noise, i.e., the noisy ICA problem. Pseudo-Euclidean Gradient Iteration (PEGI) is a recent cumulant-based method that defines a pseudo-Euclidean inner product to replace a quasi-whitening step in Gaussian noise invariant ICA. However, PEGI has two major limitations: 1) the pseudo-Euclidean inner product is improper because it violates the positive definiteness of an inner product; 2) the inner product matrix is orthogonal by design but has gross errors or imperfections due to sample-based estimation. This paper proposes a new cumulant-based ICA method named PIMD to address these two problems. We first define a Proper Inner product (PI) with proven positive definiteness and then relax the centering preprocessing step to a mean displacement (MD) step. Both PI and MD aim to improve the orthogonality of the inner product matrix and the recovery of independent components (ICs) in sample-based estimation. We adopt a gradient iteration step to find the ICs for PIMD. Experiments on both synthetic and real data show the respective effectiveness of PI and MD as well as the superiority of PIMD over competing ICA methods. Moreover, MD can improve the performance of other ICA methods as well.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Song106.html
An Efficient Approach for Multi-Sentence Compression
Multi-Sentence Compression (MSC) is of great value to many real-world applications, such as guided microblog summarization, opinion summarization and newswire summarization. Recently, word graph-based approaches have been proposed and have become popular in MSC. Their key assumption is that redundancy among a set of related sentences provides a reliable way to generate informative and grammatical sentences. In this paper, we propose an effective approach to enhance word graph-based MSC and tackle the issue that most state-of-the-art MSC approaches are confronted with, i.e., improving both informativity and grammaticality at the same time. Our approach consists of three main components: (1) a merging method based on Multiword Expressions (MWE); (2) a mapping strategy based on synonymy between words; (3) a re-ranking step to identify the best compression candidates generated using a POS-based language model (POS-LM). We demonstrate the effectiveness of this novel approach using a dataset made of clusters of English newswire sentences. The observed improvements in informativity and grammaticality of the generated compressions show an up to 44% error reduction over state-of-the-art MSC systems.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/ShafieiBavani24.html
Echo State Hoeffding Tree Learning
Nowadays, real-time classification of Big Data streams is becoming essential in a variety of application domains. While decision trees are powerful and easy-to-deploy approaches for accurate and fast learning from data streams, they are unable to capture the strong temporal dependences typically present in the input data. Recurrent Neural Networks are an alternative solution that includes an internal memory to capture these temporal dependences; however, their training is computationally very expensive and slow to converge, and requires a large number of hyper-parameters to be tuned. Reservoir Computing was proposed to reduce the computational requirements of the training phase, but it still includes a feed-forward layer which requires a large number of parameters to be tuned. In this work we propose a novel architecture for real-time classification based on the combination of a Reservoir and a decision tree. This combination reduces the number of hyper-parameters while still maintaining the good temporal properties of recurrent neural networks. The capabilities of the proposed architecture to learn some typical string-based functions with strong temporal dependences are evaluated in the paper. We show how the new architecture is able to incrementally learn these functions in real time with fast adaptation to unknown sequences. We also study the influence of the reduced number of hyper-parameters on the behaviour of the proposed solution.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Marron87.html
Geometry-aware stationary subspace analysis
In many real-world applications data exhibits non-stationarity, i.e., its distribution changes over time. One approach to handling non-stationarity is to remove or minimize it before attempting to analyze the data. In the context of brain computer interface (BCI) data analysis this is sometimes achieved using stationary subspace analysis (SSA). The classic SSA method finds a matrix that projects the data onto a stationary subspace by optimizing a cost function based on a matrix divergence. In this work we present an alternative method for SSA based on a symmetrized version of this matrix divergence. We show that this frames the problem in terms of distances between symmetric positive definite (SPD) matrices, suggesting a geometric interpretation of the problem. Stemming from this geometric viewpoint, we introduce and analyze a method which utilizes the geometry of the SPD matrix manifold and the invariance properties of its metrics. Most notably we show that these invariances alleviate the need to whiten the input matrices, a common step in many SSA methods which often introduces error. We demonstrate the usefulness of our technique in experiments on both synthetic and real-world data.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Horev84.html
Learning Distance Metrics for Multi-Label Classification
Distance metric learning is a well-studied problem in the field of machine learning, where it is typically used to improve the accuracy of instance-based learning techniques. In this paper we propose a distance metric learning algorithm that is specialised for multi-label classification tasks, rather than the multiclass setting considered by most work in this area. The method trains an embedder that can transform instances into a feature space where Euclidean distance provides an estimate of the Jaccard distance between the corresponding label vectors. In addition to a linear Mahalanobis-style metric, we also present a nonlinear extension that provides a substantial boost in performance. We show that this technique significantly improves upon current approaches for instance-based multi-label classification, and also enables interesting data visualisations.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Gouk8.html
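As a rough illustration of the target quantity named in the abstract above, the Jaccard distance between two binary label vectors can be computed as in the following minimal sketch. The function name `jaccard_distance` and the convention that two empty label sets are at distance 0 are assumptions made for this example; the sketch shows only the distance that the learned embedding is said to approximate, not the authors' embedder or training procedure.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two binary label vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    # Labels present in both vectors, and labels present in at least one.
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two empty label sets are at distance 0.
    if union == 0:
        return 0.0
    return 1.0 - intersection / union


# Two instances sharing 2 of the 4 labels that appear in either vector.
print(jaccard_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```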
Learnability of Non-I.I.D.
Learnability has always been one of the most central problems in learning theory. Most previous studies on this issue were based on the assumption that the samples are drawn independently and identically according to an underlying (unknown) distribution. The i.i.d. assumption, however, does not hold in many real applications. In this paper, we study the learnability of problems where the samples are drawn from an empirical process of a stationary β-mixing sequence, a widely used assumption under which the dependence between training samples weakens over time. By utilizing the independent blocks technique, we provide a sufficient and necessary condition for learnability, that is, average stability is equivalent to learnability with AERM (Asymptotic Empirical Risk Minimization) in the non-i.i.d. learning setting. In addition, we discuss the generalization error when the test variable is dependent on the training sample.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Gao09.html
Long Short-term Memory Network over Rhetorical Structure Theory for Sentence-level Sentiment Analysis
Using deep learning models for sentence-level sentiment analysis is still a challenging task. The long short-term memory (LSTM) network solves the vanishing gradient problem of the recurrent neural network (RNN), but its linear chain structure cannot capture text structure information. Tree-LSTM was subsequently proposed; it uses the LSTM forget gate to skip sub-trees that have little effect on the results, and achieves good performance. This illustrates that the chain-structured LSTM depends more strongly on text structure. However, Tree-LSTM cannot clearly identify which sub-trees are important and which have little effect. We propose a simple model which uses Rhetorical Structure Theory (RST) for text parsing. By building an LSTM network on the RST parse structure, we make full use of the structural characteristics of LSTM to automatically enhance the nucleus information and filter the satellite information of the text. Furthermore, this approach can build representations that capture the relations between segments of text, which can improve semantic representations of text. Experimental results show that this method not only achieves higher classification accuracy but also trains quickly.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Fu62.html
Improving Distributed Word Representation and Topic Model by Word-Topic Mixture Model
We propose a Word-Topic Mixture (WTM) model to improve word representation and topic modeling simultaneously. First, it introduces initial external word embeddings into the Topical Word Embeddings (TWE) model, which is based on the Latent Dirichlet Allocation (LDA) model, to learn word embeddings and topic vectors. Then the results learned from TWE are integrated into LDA by defining the probability distribution over topic vectors and word embeddings, following the idea of the latent feature model with LDA (LFLDA), while minimizing the KL divergence between the new topic-word distribution function and the original one. The experimental results show that the WTM model performs better on word representation and topic detection than some state-of-the-art models.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Fu60.html
Fast Collaborative Filtering from Implicit Feedback with Provable Guarantees
Building recommendation algorithms is one of the most challenging tasks in Machine Learning. Although most recommendation systems are built on explicit feedback available from users in the form of ratings or text, a majority of applications do not receive such feedback. Here we consider the recommendation task where the only available data is the record of user-item interactions in web applications over time, in terms of subscription or purchase of items; this is known as implicit feedback recommendation. There is usually a massive amount of such user-item interaction data available for any web application. Algorithms like PLSI or Matrix Factorization run several iterations through the dataset and may prove very expensive for large datasets. Here we propose a recommendation algorithm based on the Method of Moments, which involves factorization of second and third order moments of the dataset. Our algorithm can be proven to be globally convergent using PAC learning theory. Further, we show how to extract the parameters using only three passes through the entire dataset. This results in a highly scalable algorithm that scales up to millions of users even on a machine with a single-core processor and 8 GB RAM and produces competitive performance in comparison with existing algorithms.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Dasgupta79.html
Random Fourier Features For Operator-Valued Kernels
Devoted to multi-task learning and structured output learning, operator-valued kernels provide a flexible tool to build vector-valued functions in the context of Reproducing Kernel Hilbert Spaces. To scale up these methods, we extend the celebrated Random Fourier Feature methodology to get an approximation of operator-valued kernels. We propose a general principle for Operator-valued Random Fourier Feature construction relying on a generalization of Bochner’s theorem for translation-invariant operator-valued Mercer kernels. We prove the uniform convergence of the kernel approximation for bounded and unbounded operator random Fourier features using an appropriate Bernstein matrix concentration inequality. An experimental proof-of-concept shows the quality of the approximation and the efficiency of the corresponding linear models on example datasets.
Sun, 20 Nov 2016 00:00:00 +0000
http://proceedings.mlr.press/v63/Brault39.html
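For readers unfamiliar with the scalar Random Fourier Feature methodology that the abstract above says it extends, the following is a minimal sketch for the ordinary scalar-valued Gaussian kernel only. It is not the operator-valued construction of the paper. All names (`sample_rff_params`, `rff_features`, `gaussian_kernel`), the bandwidth, the number of features, and the test points are assumptions chosen for illustration.

```python
import math
import random


def sample_rff_params(dim, num_features, sigma, rng):
    """Draw frequencies and phase offsets for a Gaussian kernel of bandwidth sigma."""
    # By Bochner's theorem, the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))
    # is the Fourier transform of a normal density with standard deviation 1 / sigma.
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
              for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return omegas, offsets


def rff_features(x, omegas, offsets):
    """Map x to the random feature vector sqrt(2 / D) * cos(omega . x + b)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(omegas, offsets)]


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel value, used here only for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


rng = random.Random(0)  # fixed seed so the run is reproducible
omegas, offsets = sample_rff_params(dim=3, num_features=2000, sigma=1.0, rng=rng)
x = [0.1, -0.4, 0.3]
y = [0.5, 0.2, -0.1]

# The inner product of the feature vectors approximates the kernel value.
approx = sum(p * q for p, q in zip(rff_features(x, omegas, offsets),
                                   rff_features(y, omegas, offsets)))
exact = gaussian_kernel(x, y, 1.0)
print(abs(approx - exact))  # approximation error; shrinks as num_features grows
```

A linear model trained on `rff_features` outputs then stands in for a kernel method, which is the source of the scalability the abstract refers to.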