Proceedings of Machine Learning Research

Proceedings of Machine Learning Research Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics Held in La Palma, Canary Islands on 21-23 April 2012 Published as Volume 22 by the Proceedings of Machine Learning Research on 21 March 2012. Volume Edited by: Neil D. Lawrence Mark Girolami Series Editors: Neil D. Lawrence https://proceedings.mlr.press/v22/ Wed, 08 Feb 2023 10:39:01 +0000 Wed, 08 Feb 2023 10:39:01 +0000 Jekyll v3.9.3 Approximate Inference in Additive Factorial HMMs with Application to Energy Disaggregation This paper considers additive factorial hidden Markov models, an extension to HMMs where the state factors into multiple independent chains, and the output is an additive function of all the hidden states. Although such models are very powerful, accurate inference is unfortunately difficult: exact inference is not computationally tractable, and existing approximate inference techniques are highly susceptible to local optima. In this paper we propose an alternative inference method for such models, which exploits their additive structure by 1) looking at the observed difference signal of the observation, 2) incorporating a “robust” mixture component that can account for unmodeled observations, and 3) constraining the posterior to allow at most one hidden state to change at a time. Combining these elements we develop a convex formulation of approximate inference that is computationally efficient, has no issues of local optima, and which performs much better than existing approaches in practice. The method is motivated by the problem of energy disaggregation, the task of taking a whole home electricity signal and decomposing it into its component appliances; applied to this task, our algorithm achieves state-of-the-art performance, and is able to separate many appliances almost perfectly using just the total aggregate signal. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zico12.html https://proceedings.mlr.press/v22/zico12.html Beta-Negative Binomial Process and Poisson Factor Analysis A beta-negative binomial (BNB) process is proposed, leading to a beta-gamma-Poisson process, which may be viewed as a “multi-scoop” generalization of the beta-Bernoulli process. The BNB process is augmented into a beta-gamma-gamma-Poisson hierarchical structure, and applied as a nonparametric Bayesian prior for an infinite Poisson factor analysis model. A finite approximation for the beta process Levy random measure is constructed for convenient implementation. Efficient MCMC computations are performed with data augmentation and marginalization techniques. Encouraging results are shown on document count matrix factorization. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhou12c.html https://proceedings.mlr.press/v22/zhou12c.html Online Incremental Feature Learning with Denoising Autoencoders While determining model complexity is an important problem in machine learning, many feature learning algorithms rely on cross-validation to choose an optimal number of features, which is usually infeasible for online learning from a massive stream of data. In this paper, we propose an incremental feature learning algorithm to determine the optimal model complexity for large-scale, online datasets based on the denoising autoencoder. This algorithm is composed of two processes: adding features and merging features. Specifically, it adds new features to minimize the objective function’s residual and merges similar features to obtain a compact feature representation and prevent over-fitting. Our experiments show that the model quickly converges to the optimal number of features in a large-scale online setting, and outperforms the (non-incremental) denoising autoencoder, as well as deep belief networks and stacked denoising autoencoders for classification tasks. Further, the algorithm is particularly effective in recognizing new patterns when the data distribution changes over time in the massive online data stream. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhou12b.html https://proceedings.mlr.press/v22/zhou12b.html Multi-label Subspace Ensemble A challenging problem of multi-label learning is that both the label space and the model complexity will grow rapidly with the increase in the number of labels, and thus makes the available training samples insufficient for training a proper model. In this paper, we eliminate this problem by learning a mapping of each label in the feature space as a robust subspace, and formulating the prediction as finding the group sparse representation of a given instance on the subspace ensemble. We term this approach as “multi-label subspace ensemble (MSE)”. In the training stage, the data matrix is decomposed as the sum of several low-rank matrices and a sparse residual via a randomized optimization, where each low-rank part defines a subspace mapped by a label. In the prediction stage, the group sparse representation on the subspace ensemble is estimated by group lasso. Experiments on several benchmark datasets demonstrate the appealing performance of MSE. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhou12a.html https://proceedings.mlr.press/v22/zhou12a.html Sparse Additive Machine We develop a high dimensional nonparametric classification method named sparse additive machine (SAM), which can be viewed as a functional version of support vector machines (SVM) combined with sparse additive modeling. SAM is related to multiple kernel learning (MKL), but is computationally more efficient and amenable to theoretical analysis. In terms of computation, we develop an efficient accelerated proximal gradient descent algorithm which is also scalable to large data sets with a provable O(1/k^2) convergence rate and k is the number of iterations. In terms of theory, we provide the oracle properties of SAM under asymptotic frameworks. Empirical results on3 both synthetic and real data are reported to back up our theory. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhao12.html https://proceedings.mlr.press/v22/zhao12.html Scaling up Kernel SVM on Limited Resources: A Low-rank Linearization Approach Kernel Support Vector Machine delivers state-of-the-art results in non-linear classification, but the need to maintain a large number of support vectors poses a challenge in large scale training and testing. In contrast, linear SVM is much more scalable even on limited computing recourses (e.g. daily life PCs), but the learned model cannot capture non-linear concepts. To scale up kernel SVM on limited resources, we propose a low-rank linearization approach that transforms a non-linear SVM to a linear one via a novel, approximate empirical kernel map computed from efficient low-rank approximation of kernel matrices. We call it LLSVM (Low-rank Linearized SVM). We theoretically study the gap between the solutions of the optimal and approximate kernel map, which is used in turn to provide important guidance on the sampling based kernel approximations. Our algorithm inherits high efficiency of linear SVMs and rich repesentability of kernel classifiers. Evaluation against large-scale linear and kernel SVMs on several truly large data sets shows that the proposed method achieves a better tradeoff between scalability and model representability. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhang12d.html https://proceedings.mlr.press/v22/zhang12d.html An Autoregressive Approach to Nonparametric Hierarchical Dependent Modeling We propose a conditional autoregression framework for a collection of random probability measures. Under this framework, we devise a conditional autoregressive Dirichlet process (DP) that we call one-parameter dependent DP (wDDP). The appealing properties of this specification are that it has two equivalent representations and its inference can be implemented in a conditional Polya urn scheme. Moreover, these two representations bear a resemblance to the Polya urn scheme and the stick-breaking representation in the conventional DP. We apply this wDDP to Bayesian multivariate-response regression problems. An efficient Markov chain Monte Carlo algorithm is developed for Bayesian computation and prediction. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhang12c.html https://proceedings.mlr.press/v22/zhang12c.html A Composite Likelihood View for Multi-Label Classification Given limited training samples, learning to classify multiple labels is challenging. Problem decomposition is widely used in this case, where the original problem is decomposed into a set of easier-to-learn subproblems, and predictions from subproblems are combined to make the final decision. In this paper we show the connection between composite likelihoods and many multi-label decomposition methods, e.g., one-vs-all, one-vs-one, calibrated label ranking, probabilistic classifier chain. This connection holds promise for improving problem decomposition in both the choice of subproblems and the combination of subproblem decisions. As an attempt to exploit this connection, we design a composite marginal method that improves pairwise decomposition. Pairwise label comparisons, which seem to be a natural choice for subproblems, are replaced by bivariate label densities, which are more informative and natural components in a composite likelihood. For combining subproblem decisions, we propose a new mean-field approximation that minimizes the notion of composite divergence and is potentially more robust to inaccurate estimations in subproblems. Empirical studies on five data sets show that, given limited training samples, the proposed method outperforms many alternatives. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhang12b.html https://proceedings.mlr.press/v22/zhang12b.html Efficient Distributed Linear Classification Algorithms via the Alternating Direction Method of Multipliers Linear classification has demonstrated success in many areas of applications. Modern algorithms for linear classification can train reasonably good models while going through the data in only tens of rounds. However, large data often does not fit in the memory of a single machine, which makes the bottleneck in large-scale learning the disk I/O, not the CPU. Following this observation, Yu et al. (2010) made significant progress in reducing disk usage, and their algorithms now outperform LIBLINEAR. In this paper, rather than optimizing algorithms on a single machine, we propose and implement distributed algorithms that achieve parallel disk loading and access the disk only once. Our large-scale learning algorithms are based on the framework of alternating direction methods of multipliers. The framework derives a subproblem that remains to be solved efficiently for which we propose using dual coordinate descent. Our experimental evaluations on large datasets demonstrate that the proposed algorithms achieve significant speedup over the classifier proposed by Yu et al. running on a single machine. Our algorithms are faster than existing distributed solvers, such as Zinkevich et al. (2010)’s parallel stochastic gradient descent and Vowpal Wabbit. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/zhang12a.html https://proceedings.mlr.press/v22/zhang12a.html Quilting Stochastic Kronecker Product Graphs to Generate Multiplicative Attribute Graphs We describe the first sub-quadratic sampling algorithm for Multiplicative Attribute Graph Model (MAGM) of Kim and Leskovec (2010). We exploit the close connection between MAGM and the Kronecker Product Graph Model (KPGM) of Leskovec et al. (2010), and show that to sample a graph from a MAGM it suffices to sample small number of KPGM graphs and quilt them together. Under a restricted set of technical conditions, our algorithm runs in $O((\log_2(n))^3 |E|)$ time, where n is the number of nodes and |E| is the number of edges in the sampled graph. We demonstrate the scalability of our algorithm via extensive empirical evaluation; we can sample a MAGM graph with 8 million nodes and 20 billion edges in under 6 hours. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/yun12.html https://proceedings.mlr.press/v22/yun12.html Forward Basis Selection for Sparse Approximation over Dictionary Recently, forward greedy selection method has been successfully applied to approximately solve sparse learning problems, characterized by a trade-off between sparsity and accuracy. In this paper, we generalize this method to the setup of sparse approximation over a pre-fixed dictionary. A fully corrective forward selection algorithm is proposed along with convergence analysis. The per-iteration computational overhead of the proposed algorithm is dominated by a subproblem of linear optimization over the dictionary and a subproblem to optimally adjust the aggregation weights. The former is cheaper in several applications than the Euclidean projection while the latter is typically an unconstrained optimization problem which is relatively easy to solve. Furthermore, we extend the proposed algorithm to the setting of non-negative/convex sparse approximation over a dictionary.Applications of our algorithms to several concrete learning problems are explored with efficiency validated on benchmark data sets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/yuan12.html https://proceedings.mlr.press/v22/yuan12.html Transductive Learning of Structural SVMs via Prior Knowledge Constraints Reducing the number of labeled examples required to learn accurate prediction models is an important problem in structured output prediction. In this paper we propose a new transductive structural SVM algorithm that learns by incorporating prior knowledge constraints on unlabeled data. Our formulation supports different types of prior knowledge constraints, and can be trained efficiently. Experiments on two citation and advertisement segmentation tasks show that our transductive structural SVM can learn effectively from unlabeled data, achieving similar prediction accuracies when compared against other state-of-art algorithms. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/yu12.html https://proceedings.mlr.press/v22/yu12.html Perturbation based Large Margin Approach for Ranking The use of the standard hinge loss for structured outputs, for the learning to rank problem, faces two main caveats: (a) the label space, the set of all possible permutations of items to be ranked, is too large, and also less amenable to the usual dynamic-programming based techniques used for structured outputs, and (b) the supervision or training data consists of instances with multiple labels per input, instead of just a single label. The most natural way to deal with such multiple labels leads, unfortunately, to a non-convex surrogate. In this paper, we propose a general class of perturbation-based surrogates that leverage the large margin approach, and are convex. We show that the standard hinge surrogate for classification actually falls within this class. We also find a surrogate within this class, for the ranking problem, that does not suffer from the caveats mentioned above. Indeed, our experiments demonstrate that it performs better than other candidate large margin proposals on both synthetic and real world ranking datasets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/yang12.html https://proceedings.mlr.press/v22/yang12.html Active Learning from Multiple Knowledge Sources Some supervised learning tasks do not fit the usual single annotator scenario. In these problems, ground-truth may not exist and multiple annotators are generally available. A few approaches have been proposed to address this learning problem. In this setting active learning (AL), the problem of optimally selecting unlabeled samples for labeling, offers new challenges and has received little attention. In multiple annotator AL, it is not sufficient to select a sample for labeling since, in addition, an optimal annotator must also be selected. This setting is of great interest as annotators’ expertise generally varies and could depend on the given sample itself; additionally, some annotators may be adversarial. Thus, clearly the information provided by some annotators should be more valuable than that provided by others and it could vary across data points. We propose an AL approach for this new scenario motivated by information theoretic principles. Specifically, we focus on maximizing the information that an annotator label provides about the true (but unknown) label of the data point. We develop this concept, propose an algorithm for active learning, and experimentally validate the proposed approach. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/yan12.html https://proceedings.mlr.press/v22/yan12.html Robust Multi-task Regression with Grossly Corrupted Observations We consider the multiple-response regression problem, where the response is subject to *sparse gross errors*, in the high-dimensional setup. We propose a tractable regularized M-estimator that is robust to such error, where the sum of two individual regularization terms are used: the first one encourages row-sparse regression parameters, and the second one encourages a sparse error term. We obtain non-asymptotical estimation error bounds of the proposed method. To the best of our knowledge, this is the first analysis of the robust multi-task regression problem with gross errors. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/xu12b.html https://proceedings.mlr.press/v22/xu12b.html Statistical Optimization in High Dimensions We consider optimization problems whose parameters are known only approximately, based on a noisy sample. Of particular interest is the high-dimensional regime, where the number of samples is roughly equal to the dimensionality of the problem, and the noise magnitude may greatly exceed the magnitude of the signal itself. This setup falls far outside the traditional scope of Robust and Stochastic optimization. We propose three algorithms to address this setting, combining ideas from statistics, machine learning, and robust optimization. In the important case where noise artificially increases the dimensionality of the parameters, we show that combining robust optimization and dimensionality reduction can result in high-quality solutions at greatly reduced computational cost. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/xu12a.html https://proceedings.mlr.press/v22/xu12a.html Primal-Dual methods for sparse constrained matrix completion We develop scalable algorithms for regular and non-negative matrix completion. In particular, we base the methods on trace-norm regularization that induces a low rank predicted matrix. The regularization problem is solved via a constraint generation method that explicitly maintains a sparse dual and the corresponding low rank primal solution. We provide a new dual block coordinate descent algorithm for solving the dual problem with a few spectral constraints. Empirical results illustrate the effectiveness of our method in comparison to recently proposed alternatives. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/xin12.html https://proceedings.mlr.press/v22/xin12.html Causality with Gates An intervention on a variable removes the influences that usually have a causal effect on that variable. Gates are a general-purpose graphical modelling notation for representing such context-specific independencies in the structure of a graphical model. We extend d-separation to cover gated graphical models and show that it subsumes do calculus when gates are used to represent interventions. We also show how standard message passing inference algorithms, such as belief propagation, can be applied to the gated graph. This demonstrates that causal reasoning can be performed by probabilistic inference alone. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/winn12.html https://proceedings.mlr.press/v22/winn12.html Generalized Optimal Reverse Prediction Recently it has been shown that classical supervised and unsupervised training methods can be unified as special cases of so-called “optimal reverse prediction": predicting inputs from target labels while optimizing over both model parameters and missing labels. Although this perspective establishes links between classical training principles, the existing formulation only applies to linear predictors under squared loss, hence is extremely limited. We generalize the formulation of optimal reverse prediction to arbitrary Bregman divergences, and more importantly to nonlinear predictors. This extension is achieved by establishing a new, generalized form of forward-reverse minimization equivalence that holds for arbitrary matching losses. Several benefits follow. First, a new variant of Bregman divergence clustering can be recovered that incorporates a non-linear data reconstruction model. Second, normalized-cut and kernel-based extensions can be formulated coherently. Finally, a new semi-supervised training principle can be recovered for classification problems that demonstrates advantages over the state of the art. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/white12.html https://proceedings.mlr.press/v22/white12.html Nonlinear low-dimensional regression using auxiliary coordinates When doing regression with inputs and outputs that are high-dimensional, it often makes sense to reduce the dimensionality of the inputs before mapping to the outputs. Much work in statistics and machine learning, such as reduced-rank regression, slice inverse regression and their variants, has focused on linear dimensionality reduction, or on estimating the dimensionality reduction first and then the mapping. We propose a method where both the dimensionality reduction and the mapping can be nonlinear and are estimated jointly. Our key idea is to define an objective function where the low-dimensional coordinates are free parameters, in addition to the dimensionality reduction and the mapping. This has the effect of decoupling many groups of parameters from each other, affording a far more effective optimization than if using a deep network with nested mappings, and to use a good initialization from slice inverse regression or spectral methods. Our experiments with image and robot applications show our approach to improve over direct regression and various existing approaches. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/wang12.html https://proceedings.mlr.press/v22/wang12.html A Hybrid Neural Network-Latent Topic Model This paper introduces a hybrid model that combines a neural network with a latent topic model. The neural network provides a low dimensional embedding for the input data, whose subsequent distribution is captured by the topic model. The neural network thus acts as a trainable feature extractor while the topic model captures the group structure of the data. Following an initial pretraining phase to separately initialize each part of the model, a unified training scheme is introduced that allows for discriminative training of the entire model. The approach is evaluated on visual data in scene classification task, where the hybrid model is shown to outperform models based solely on neural networks or topic models, as well as other baseline methods. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/wan12.html https://proceedings.mlr.press/v22/wan12.html Minimax Rates of Estimation for Sparse PCA in High Dimensions We study sparse principal components analysis in the high-dimensional setting, where p (the number of variables) can be much larger than n (the number of observations). We prove optimal minimax lower and upper bounds on the estimation error for the first leading eigenvector when it belongs to an \ell_q ball for q ∈[0,1]. Our bounds are tight in p and n for all q ∈[0, 1] over a wide class of distributions. The upper bound is obtained by analyzing the performance of \ell_q-constrained PCA. In particular, our results provide convergence rates for \ell_1-constrained PCA. Our methods and arguments are also extendable to multi-dimensional subspace estimation. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/vu12.html https://proceedings.mlr.press/v22/vu12.html Bayesian Group Factor Analysis We introduce a factor analysis model that summarizes the dependencies between observed variable groups, instead of dependencies between individual variables as standard factor analysis does. A group may correspond to one view of the same set of objects, one of many data sets tied by co-occurrence, or a set of alternative variables collected from statistics tables to measure one property of interest. We show that by assuming group-wise sparse factors, active in a subset of the sets, the variation can be decomposed into factors explaining relationships between the sets and factors explaining away set-specific variation. We formulate the assumptions in a Bayesian model providing the factors, and apply the model to two data analysis tasks, in neuroimaging and chemical systems biology. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/virtanen12.html https://proceedings.mlr.press/v22/virtanen12.html Krylov Subspace Descent for Deep Learning In this paper, we propose a second order optimization method to learn models where both the dimensionality of the parameter space and the number of training samples is high. In our method, we construct on each iteration a Krylov subspace formed by the gradient and an approximation to the Hessian matrix, and then use a subset of the training data samples to optimize over this subspace. As with the Hessian Free (HF) method of Martens (2010), the Hessian matrix is never explicitly constructed, and is computed using a subset of data. In practice, as in HF, we typically use a positive definite substitute for the Hessian matrix such as the Gauss-Newton matrix. We investigate the effectiveness of our proposed method on deep neural networks, and compare its performance to widely used methods such as stochastic gradient descent, conjugate gradient descent and L-BFGS, and also to HF. Our method leads to faster convergence than either L-BFGS or HF, and generally performs better than either of them in cross-validation accuracy. It is also simpler and more general than HF, as it does not require a positive semidefinite approximation of the Hessian matrix to work well nor the setting of a damping parameter. The chief drawback versus HF is the need for memory to store a basis for the Krylov subspace. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/vinyals12.html https://proceedings.mlr.press/v22/vinyals12.html Learning from Weak Teachers This paper addresses the problem of learning when high-quality labeled examples are an expensive resource, while samples with error-prone labeling (for example generated by crowdsourcing) are readily available. We introduce a formal framework for such learning scenarios with label sources of varying quality, and we propose a parametric model for such label sources (“weak teachers”), reflecting the intuition that their labeling is likely to be correct in label-homogeneous regions but may deteriorate near classification boundaries. We consider learning when the learner has access to weakly labeled random samples and, on top of that, can actively query the correct labels of sample points of its choice. We propose a learning algorithm for this scenario, analyze its sample complexity and prove that, under certain conditions on the underlying data distribution, our learner can utilize the weak labels to reduce the number of expert labels it requires. We view this paper as a first step towards the development of a theory of learning from labels generated by teachers of varying accuracy, a scenario that is relevant in various practical applications. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/urner12.html https://proceedings.mlr.press/v22/urner12.html A Bayesian Analysis of the Radioactive Releases of Fukushima The Fukushima Daiichi disaster 11 March, 2011 is considered the largest nuclear accident since the 1986 Chernobyl disaster and has been rated at level 7 on the International Nuclear Event Scale. As different radioactive materials have different effects to human body, it is important to know the types of nuclides and their levels of concentration from the recorded mixture of radiations to well take necessary measures. We presently formulate a Bayesian generative model for the data available on radioactive releases from the Fukushima Daiichi disaster across Japan. The model can infer from the sparsely sampled measurements what nuclides are present as well as their concentration levels. An important property of the proposed model is that it admits unique recovery of the parameters. On synthetic data we demonstrate that our model is able to infer the underlying components and on data from the Fukushima Daiichi plant we establish that the model is able to well account for the data. We further demonstrate how the model extends to include all the available measurements recorded throughout Japan. The model can be considered a first attempt to apply Bayesian learning unsupervised in order to give a more detailed account also of the latent structure present in the data of the Fukushima Daiichi disaster. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/tomioka12.html https://proceedings.mlr.press/v22/tomioka12.html Fast Variational Mode-Seeking Mode-seeking algorithms (e.g., mean-shift) constitute a class of powerful non-parametric clustering methods, but they are slow. We present VMS, a dual-tree based variational EM framework for mode-seeking that greatly accelerates performance. VMS has a number of pleasing properties: it generalizes across different mode-seeking algorithms, it does not have typical homoscedasticity constraints on kernel bandwidths, and it is the first truly sub-quadratic acceleration method that maintains provable convergence for a well-defined objective function. Experimental results demonstrate acceleration benefits over competing methods and show that VMS is particularly desirable for data sets of massive size, where a coarser approximation is needed to improve the computational efficiency. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/thiesson12.html https://proceedings.mlr.press/v22/thiesson12.html Randomized Optimum Models for Structured Prediction One approach to modeling structured discrete data is to describe the probability of states via an energy function and Gibbs distribution. A recurring difficulty in these models is the computation of the partition function, which may require an intractable sum. However, in many such models, the mode can be found efficiently even when the partition function is unavailable. Recent work on Perturb-and-MAP (PM) models (Papandreou and Yuille, 2011) has exploited this discrepancy to approximate the Gibbs distribution for Markov random fields (MRFs). Here, we explore a broader class of models, called Randomized Optimum Models (RandOMs), which include PM as a special case. This new class of models encompasses not only MRFs, but also other models that have intractable partition functions yet permit efficient mode-finding, such as those based on bipartite matchings, shortest paths, or connected components in a graph. We develop likelihood-based learning algorithms for RandOMs, which, empirical results indicate, can produce better models than PM. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/tarlow12b.html https://proceedings.mlr.press/v22/tarlow12b.html Structured Output Learning with High Order Loss Functions Often when modeling structured domains, it is desirable to leverage information that is not naturally expressed as simply a label. Examples include knowledge about the evaluation measure that will be used at test time, and partial (weak) label information. When the additional information has structure that factorizes according to small subsets of variables (i.e., is \emphlow order, or \emphdecomposable), several approaches can be used to incorporate it into a learning procedure. Our focus in this work is the more challenging case, where the additional information does not factorize according to low order graphical model structure; we call this the \emphhigh order case. We propose to formalize various forms of this additional information as high order loss functions, which may have complex interactions over large subsets of variables. We then address the computational challenges inherent in learning according to such loss functions, particularly focusing on the loss-augmented inference problem that arises in large margin learning; we show that learning with high order loss functions is often practical, giving strong empirical results, with one popular and several novel high-order loss functions, in several settings. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/tarlow12a.html https://proceedings.mlr.press/v22/tarlow12a.html Multiresolution Deep Belief Networks Motivated by the observation that coarse and fine resolutions of an image reveal different structures in the underlying visual phenomenon, we present a model based on the Deep Belief Network (DBN) which learns features from the multiscale representation of images. A Laplacian Pyramid is first constructed for each image. DBNs are then trained separately at each level of the pyramid. Finally, a top level RBM combines these DBNs into a single network we call the Multiresolution Deep Belief Network (MrDBN). Experiments show that MrDBNs generalize better than standard DBNs on NORB classification and TIMIT phone recognition. In the domain of generative learning, we demonstrate the superiority of MrDBNs at modeling face images. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/tang12.html https://proceedings.mlr.press/v22/tang12.html Lifted Variable Elimination with Arbitrary Constraints Lifted probabilistic inference algorithms exploit regularities in the structure of graphical models to perform inference more efficiently. More specifically, they identify groups of interchangeable variables and perform inference once for each group, as opposed to once for each variable. The groups are defined by means of constraints, so the flexibility of the grouping is determined by the expressivity of the constraint language. Existing approaches for exact lifted inference rely on (in)equality constraints. We show how inference methods can be generalized to work with arbitrary constraints. This allows them to capture a broader range of symmetries, leading to more opportunities for lifting. We empirically demonstrate that this improves inference efficiency with orders of magnitude, allowing exact inference in cases where until now only approximate inference was feasible. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/taghipour12.html https://proceedings.mlr.press/v22/taghipour12.html On Estimation and Selection for Topic Models This article describes posterior maximization for topic models, identifying computational and conceptual gains from inference under a non-standard parametrization. We then show that fitted parameters can be used as the basis for a novel approach to marginal likelihood estimation, via block-diagonal approximation to the information matrix, that facilitates choosing the number of latent topics. This likelihood-based model selection is complemented with a goodness-of-fit analysis built around estimated residual dispersion. Examples are provided to illustrate model selection as well as to compare our estimation against standard alternative techniques. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/taddy12.html https://proceedings.mlr.press/v22/taddy12.html Fast Learning Rate of Multiple Kernel Learning: Trade-Off between Sparsity and Smoothness We investigate the learning rate of multiple kernel leaning (MKL) with L1 and elastic-net regularizations. The elastic-net regularization is a composition of an L1-regularizer for inducing the sparsity and an L2-regularizer for controlling the smoothness. We focus on a sparse setting where the total number of kernels is large but the number of non-zero components of the ground truth is relatively small, and show sharper convergence rates than the learning rates ever shown for both L1 and elastic-net regularizations. Our analysis shows there appears a trade-off between the sparsity and the smoothness when it comes to selecting which of L1 and elastic-net regularizations to use; if the ground truth is smooth, the elastic-net regularization is preferred, otherwise the L1 regularization is preferred. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/suzuki12.html https://proceedings.mlr.press/v22/suzuki12.html Constrained 1-Spectral Clustering An important form of prior information in clustering comes in the form of cannot-link and must-link constraints of instances. We present a generalization of the popular spectral clustering technique which integrates such constraints. Motivated by the recently proposed 1-spectral clustering for the unconstrained normalized cut problem, our method is based on a tight relaxation of the constrained normalized cut into a continuous optimization problem. Opposite to all other methods which have been suggested for constrained spectral clustering, we can always guarantee to satisfy all constraints. Moreover, our soft formulation allows to optimize a trade-off between normalized cut and the number of violated constraints. An efficient implementation is provided which scales to large datasets. We outperform consistently all other proposed methods in the experiments. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/sundar12.html https://proceedings.mlr.press/v22/sundar12.html Efficient and Exact MAP-MRF Inference using Branch and Bound We propose two novel Branch-and-Bound (BB) methods to efficiently solve exact MAP-MRF inference on problems with a large number of states (per variable) H. By organizing the data in a suitable structure, the time complexity of our best method for evaluating the bound at each branch is reduced from O(H^2) to O(H). This permits searching for the MAP solution in O(BH+Q) instead of O(BH^2) (without using the data structure), where B is the number of branches and Q is the one-time cost to build the data structure which is proportional to H^2. Our analysis on synthetic data shows that, given a limited time budget, our method solves problems that are characterized by a much larger number of states when compared to state-of-the-art exact inference algorithms (e.g., Marinescu and Dechter (2007); Sontag et al. (2008)) and other baseline BB methods. We further show that our method is well suited for computer vision and computational biology problems where the state space (per variable) is large. In particular, our approach is an order of magnitude faster on average than Sontag et al.’s Cluster Pursuit (CP) on human pose estimation problems. Moreover, given a time budget of up to 20 minutes, our method consistently solves more protein design problems than CP does. Finally, we successfully explore different branching strategies and ways to utilize domain knowledge of the problem to significantly reduce the empirical inference time. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/sun12.html https://proceedings.mlr.press/v22/sun12.html Learning Fourier Sparse Set Functions Can we learn a sparse graph from observing the value of a few random cuts? This and more general problems can be reduced to the challenge of learning set functions known to have sparse Fourier support contained in some collection. We prove that if we choose O(k log^4 n) sets uniformly at random, then with high probability, observing any k-sparse function on those sets is sufficient to recover that function exactly. We further show that other properties, such as symmetry or submodularity imply structure in the Fourier spectrum, which can be exploited to further reduce sample complexity. One interesting special case is that it suffices to observe O(|E| log^4 |V|) values of a cut function to recover a graph. We demonstrate the effectiveness of our results on two real-world reconstruction problems: graph sketching and obtaining fast approximate surrogates to expensive submodular objective functions. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/stobbe12.html https://proceedings.mlr.press/v22/stobbe12.html Bayesian Inference for Change Points in Dynamical Systems with Reusable States - a Chinese Restaurant Process Approach We study a model of a stochastic process with unobserved parameters which suddenly change at random times. The possible parameter values are assumed to be from a finite but unknown set. Using a Chinese restaurant process prior over parameters we develop an efficient MCMC procedure for Bayesian inference. We demonstrate the significance of our approach with an application to systems biology data. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/stimberg12.html https://proceedings.mlr.press/v22/stimberg12.html Flexible Martingale Priors for Deep Hierarchies When building priors over trees for Bayesian hierarchical models, there is a tension between maintaining desirable theoretical properties such as infinite exchangeability and important practical properties such as the ability to increase the depth of the tree to accommodate new data. We resolve this tension by presenting a family of infinitely exchangeable priors over discrete tree structures that allows the depth of the tree to grow with the data, and then showing that our family contains all hierarchical models with certain mild symmetry properties. We also show that deep hierarchical models are in general intimately tied to a process called a martingale, and use Doob’s martingale convergence theorem to demonstrate some unexpected properties of deep hierarchies. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/steinhardt12.html https://proceedings.mlr.press/v22/steinhardt12.html Testing for Membership to the IFRA and the NBU Classes of Distributions This paper provides test procedures to determine whether the probability distribution underlying a set of non-negative valued samples belongs to the Increasing Failure Rate Average (IFRA) class or the New Better than Used (NBU) class. Membership of a distribution to one of these classes is known to have implications which are important in reliability, queuing theory, game theory and other disciplines. Our proposed test is based on the Kolmogorov-Smirnov distance between an empirical cumulative hazard function and its best approximation from the class of distributions constituting the null hypothesis. It turns out that the least favorable distribution, which produces the largest probability of Type I error of each of the tests, is the exponential distribution. This fact is used to produce an appropriate cut-off or p-value. Monte Carlo simulations are conducted to check small sample size (i.e., significance) and power of the test. Usefulness of the test is illustrated through the analysis of a set of monthly family expenditure data collected by the National Sample Survey Organization of the Government of India. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/srivastava12.html https://proceedings.mlr.press/v22/srivastava12.html Consistency and Rates for Clustering with DBSCAN We propose a simple and efficient modification of the popular DBSCAN clustering algorithm. This modification is able to detect the most interesting vertical threshold level in an automated, data-driven way. We establish both consistency and optimal learning rates for this modification. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/sriperumbudur12.html https://proceedings.mlr.press/v22/sriperumbudur12.html Joint Estimation of Structured Sparsity and Output Structure in Multiple-Output Regression via Inverse-Covariance Regularization We consider the problem of learning a sparse regression model for predicting multiple related outputs given high-dimensional inputs, where related outputs are likely to share common relevant inputs. Most of the previous methods for learning structured sparsity assumed that the structure over the outputs is known a priori, and focused on designing regularization functions that encourage structured sparsity reflecting the given output structure. In this paper, we propose a new approach for sparse multiple-output regression that can jointly learn both the output structure and regression coefficients with structured sparsity. Our approach reformulates the standard regression model into an alternative parameterization that leads to a conditional Gaussian graphical model, and employes an inverse-covariance regularization. We show that the orthant-wise quasi-Newton algorithm developed for L1-regularized log-linear model can be adopted for a fast optimization for our method. We demonstrate our method on simulated datasets and real datasets from genetics and finances applications. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/sohn12.html https://proceedings.mlr.press/v22/sohn12.html On Nonparametric Guidance for Learning Autoencoder Representations Unsupervised discovery of latent representations, in addition to being useful for density modeling, visualisation and exploratory data analysis, is also increasingly important for learning features relevant to discriminative tasks. Autoencoders, in particular, have proven to be an effective way to learn latent codes that reflect meaningful variations in data. A continuing challenge, however, is guiding an autoencoder toward representations that are useful for particular tasks. A complementary challenge is to find codes that are invariant to irrelevant transformations of the data. The most common way of introducing such problem-specific guidance in autoencoders has been through the incorporation of a parametric component that ties the latent representation to the label information. In this work, we argue that a preferable approach relies instead on a nonparametric guidance mechanism. Conceptually, it ensures that there exists a function that can predict the label information, without explicitly instantiating that function. The superiority of this guidance mechanism is confirmed on two datasets. In particular, this approach is able to incorporate invariance information (lighting, elevation, etc.) from the small NORB object recognition dataset and yields state-of-the-art performance for a single layer, non-convolutional network. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/snoek12.html https://proceedings.mlr.press/v22/snoek12.html Low rank continuous-space graphical models Constructing tractable dependent probability distributions over structured continuous random vectors is a central problem in statistics and machine learning. It has proven difficult to find general constructions for models in which efficient exact inference is possible, outside of the classical cases of models with restricted graph structure (chain, tree, etc.) and linear-Gaussian or discrete potentials. In this work we identify a tree graphical model class in which exact inference can be performed efficiently, owing to a certain “low-rank” structure in the potentials. We explore this new class of models by applying the resulting inference methods to neural spike rate estimation and motion-capture joint-angle smoothing tasks. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/smith12.html https://proceedings.mlr.press/v22/smith12.html On Bisubmodular Maximization Bisubmodularity extends the concept of submodularity to set functions with two arguments. We show how bisubmodular maximization leads to richer value-of-information problems, using examples in sensor placement and feature selection. We present the first constant-factor approximation algorithm for a wide class of bisubmodular maximizations. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/singh12.html https://proceedings.mlr.press/v22/singh12.html Multi-armed Bandit Problems with History In this paper we consider the stochastic multi-armed bandit problem. However, unlike in the conventional version of this problem, we do not assume that the algorithm starts from scratch. Many applications offer observations of (some of) the arms even before the algorithm starts. We propose three novel multi-armed bandit algorithms that can exploit this data. An upper bound on the regret is derived in each case. The results show that a logarithmic amount of historic data can reduce regret from logarithmic to constant. The effectiveness of the proposed algorithms are demonstrated on a large-scale malicious URL detection problem. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/shivaswamy12.html https://proceedings.mlr.press/v22/shivaswamy12.html Complexity of Bethe Approximation This paper resolves a common complexity issue in the Bethe approximation of statistical physics and the sum-product Belief Propagation (BP) algorithm of artificial intelligence. The Bethe approximation reduces the problem of computing the partition function in a graphical model to that of solving a set of non-linear equations, so-called the Bethe equation. On the other hand, the BP algorithm is a popular heuristic method for estimating marginal distribution in a graphical model. Although they are inspired and developed from different directions, Yedidia, Freeman and Weiss (2004) established a somewhat surprising connection: the BP algorithm solves the Bethe equation if it converges (however, it often does not). This naturally motivates the following important question to understand their limitations and empirical successes: the Bethe equation is computationally easy to solve? We present a message passing algorithm solving the Bethe equation in polynomial number of bitwise operations for arbitrary binary graphical models of n nodes where the maximum degree in the underlying graph is O(log n). Our algorithm, an alternative to BP fixing its convergence issue, is the first fully polynomial-time approximation scheme for the BP fixed point computation in such a large class of graphical models. Moreover, we believe that our technique is of broader interest to understand the computational complexity of the cavity method in statistical physics. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/shin12.html https://proceedings.mlr.press/v22/shin12.html Sparsistency of the Edge Lasso over Graphs The fused lasso was proposed recently to enable recovery of high-dimensional patterns which are piece-wise constant on a graph, by penalizing the \ell_1-norm of differences of measurements at vertices that share an edge. While there have been some attempts at coming up with efficient algorithms for solving the fused lasso optimization, a theoretical analysis of its performance is mostly lacking except for the simple linear graph topology. In this paper, we investigate \em sparsistency of fused lasso for general graph structures, i.e. its ability to correctly recover the exact support of piece-wise constant graph-structured patterns asymptotically (for large-scale graphs). To emphasize this distinction over previous work, we will refer to it as Edge Lasso. We focus on the (structured) normal means setting, and our results provide necessary and sufficient conditions on the graph properties as well as the signal-to-noise ratio needed to ensure sparsistency. We examplify our results using simple graph-structured patterns, and demonstrate that in some cases fused lasso is sparsistent at very weak signal-to-noise ratios (scaling as \sqrt(\log n)/|A|, where n is the number of vertices in the graph and A is the smallest set of vertices with constant activation). In other cases, it performs no better than thresholding the difference of measurements at vertices which share an edge (which requires signal-to-noise ratio that scales as \sqrt\log n). Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/sharpnack12.html https://proceedings.mlr.press/v22/sharpnack12.html Using More Data to Speed-up Training Time In many recent applications, data is plentiful. By now, we have a rather clear understanding of how more data can be used to improve the accuracy of learning algorithms. Recently, there has been a growing interest in understanding how more data can be leveraged to reduce the required training runtime. In this paper, we study the runtime of learning as a function of the number of available training examples, and underscore the main high-level techniques. We provide the first formal positive result showing that even in the unrealizable case, the runtime can decrease exponentially while only requiring a polynomial growth of the number of examples. Our construction corresponds to a synthetic learning problem and an interesting open question is whether the tradeoff can be shown for more natural learning problems. We spell out several interesting candidates of natural learning problems for which we conjecture that there is a tradeoff between computational and sample complexity. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/shalev-shwartz12.html https://proceedings.mlr.press/v22/shalev-shwartz12.html Fast Variational Bayesian Inference for Non-Conjugate Matrix Factorization Models Probabilistic matrix factorization methods aim to extract meaningful correlation structure from an incomplete data matrix by postulating low rank constraints. Recently, variational Bayesian (VB) inference techniques have successfully been applied to such large scale bilinear models. However, current algorithms are of the alternate updating or stochastic gradient descent type, slow to converge and prone to getting stuck in shallow local minima. While for MAP or maximum margin estimation, singular value shrinkage algorithms have been proposed which can far outperform alternate updating, this methodological avenue remains unexplored for Bayesian techniques. In this paper, we show how to combine a recent singular value shrinkage characterization of fully observed spherical Gaussian VB matrix factorization with augmented Lagrangian techniques in order to obtain efficient VB inference for general MF models with arbitrary likelihood potentials. In particular, we show how to handle Poisson and Bernoulli potentials, far more suited for most MF applications than Gaussian likelihoods. Our algorithm can be run even for very large models and is easily implemented in \em Matlab. It outperforms MAP estimation on a range of real-world datasets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/seeger12.html https://proceedings.mlr.press/v22/seeger12.html Markov Logic Mixtures of Gaussian Processes: Towards Machines Reading Regression Data We propose a novel mixtures of Gaussian processes model in which the gating function is interconnected with a probabilistic logical model, in our case Markov logic networks. In this way, the resulting mixed graphical model, called Markov logic mixtures of Gaussian processes (MLxGP), solves joint Bayesian non-parametric regression and probabilistic relational inference tasks. In turn, MLxGP facilitates novel, interesting tasks such as regression based on logical constraints or drawing probabilistic logical conclusions about regression data, thus putting “machines reading regression data” in reach. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/schiegg12.html https://proceedings.mlr.press/v22/schiegg12.html Infinite-Dimensional Kalman Filtering Approach to Spatio-Temporal Gaussian Process Regression We show how spatio-temporal Gaussian process (GP) regression problems (or the equivalent Kriging problems) can be formulated as infinite-dimensional Kalman filtering and Rauch-Tung-Striebel (RTS) smoothing problems, and present a procedure for converting spatio-temporal covariance functions into infinite-dimensional stochastic differential equations (SDEs). The resulting infinite-dimensional SDEs belong to the class of stochastic pseudo-differential equations and can be numerically treated using the methods developed for deterministic counterparts of the equations. The scaling of the computational cost in the proposed approach is linear in the number of time steps as opposed to the cubic scaling of the direct GP regression solution. We also show how separable covariance functions lead to a finite-dimensional Kalman filtering and RTS smoothing problem, present analytical and numerical examples, and discuss numerical methods for computing the solutions. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/sarkka12.html https://proceedings.mlr.press/v22/sarkka12.html Age-Layered Expectation Maximization for Parameter Learning in Bayesian Networks The expectation maximization (EM) algorithm is a popular algorithm for parameter estimation in models with hidden variables. However, the algorithm has several non-trivial limitations, a significant one being variation in eventual solutions found, due to convergence to local optima. Several techniques have been proposed to allay this problem, for example initializing EM from multiple random starting points and selecting the highest likelihood out of all runs. In this work, we a) show that this method can be very expensive computationally for difficult Bayesian networks, and b) in response we propose an age-layered EM approach (ALEM) that efficiently discards less promising runs well before convergence. Our experiments show a significant reduction in the number of iterations, typically two- to four-fold, with minimal or no reduction in solution quality, indicating the potential for ALEM to streamline parameter estimation in Bayesian networks. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/saluja12.html https://proceedings.mlr.press/v22/saluja12.html Local Anomaly Detection Anomalies with spatial and temporal stamps arise in a number of applications including communication networks, traffic monitoring and video analysis. In these applications anomalies are temporally or spatially localized but otherwise unknown. We propose a novel graph-based statistical notion that unifies the idea of temporal and spatial locality. This notion lends itself to an elegant characterization of optimal decision rules and in turn suggests corresponding empirical rules based on local nearest neighbor distances. We compute a single composite score for the entire spatio-temporal data sample based on the local neighborhood distances. We declare data samples as containing local anomalies based on the composite score. We show that such rules not only asymptotically guarantee desired false alarm control but are also asymptotically optimal. We also show that our composite scoring scheme overcomes the inherent resolution issues of alternative multi-comparison approaches that are based on fusing the outcomes of location-by-location comparisons. We then verify our algorithms on synthetic and real data sets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/saligrama12.html https://proceedings.mlr.press/v22/saligrama12.html Domain Adaptation: A Small Sample Statistical Approach We study the prevalent problem when a test distribution differs from the training distribution. We consider a setting where our training set consists of a small number of sample domains, but where we have many samples in each domain. Our goal is to generalize to a new domain. For example, we may want to learn a similarity function using only certain classes of objects, but we desire that this similarity function be applicable to object classes not present in our training sample (e.g. we might seek to learn that “dogs are similar to dogs” even though images of dogs were absent from our training set). Our theoretical analysis shows that we can select many more features than domains while avoiding overfitting by utilizing data-dependent variance properties. We present a greedy feature selection algorithm based on using T-statistics. Our experiments validate this theory showing that our T-statistic based greedy feature selection is more robust at avoiding overfitting than the classical greedy procedure. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/salakhutdinov12.html https://proceedings.mlr.press/v22/salakhutdinov12.html Exploiting Unrelated Tasks in Multi-Task Learning We study the problem of learning a group of principal tasks using a group of auxiliary tasks, unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about which tasks are unrelated can lead to sparser and more informative representations for each task, essentially screening out idiosyncrasies of the data distribution. We propose a novel method which builds on a prior multitask methodology by favoring a shared low dimensional representation within each group of tasks. In addition, we impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. We further discuss a condition which ensures convexity of the optimization problem and argue that it can be solved by alternating minimization. We present experiments on synthetic and real data, which indicate that incorporating unrelated tasks can improve significantly over standard multi-task learning methods. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/romera12.html https://proceedings.mlr.press/v22/romera12.html Universal Measurement Bounds for Structured Sparse Signal Recovery Standard compressive sensing results state that to exactly recover an s sparse signal in Rp, one requires O(s log p) measurements. While this bound is extremely useful in practice, often real world signals are not only sparse, but also exhibit structure in the sparsity pattern. We focus on group-structured patterns in this paper. Under this model, groups of signal coefficients are active (or inactive) together. The groups are prede- fined, but the particular set of groups that are active (i.e., in the signal support) must be learned from measurements. We show that exploiting knowledge of groups can further reduce the number of measurements required for exact signal recovery, and derive universal bounds for the number of measurements needed. The bound is universal in the sense that it only depends on the number of groups under consideration, and not the particulars of the groups (e.g., compositions, sizes, ex- tents, overlaps, etc.). Experiments show that our result holds for a variety of overlapping group configurations. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/rao12.html https://proceedings.mlr.press/v22/rao12.html A Differentially Private Stochastic Gradient Descent Algorithm for Multiparty Classification We consider the problem of developing privacy preserving machine learning algorithms in a distributed multiparty setting. Here different parties own different parts of a data set, and the goal is to learn a classifier from the entire data set without any party revealing any information about the individual data points it owns. Pathak et al (2010) recently proposed a solution to this problem in which each party learns a local classifier from its own data, and a third party then aggregates these classifiers in a privacy-preserving manner using a cryptographic scheme. The generalization performance of their algorithm is sensitive to the number of parties and the relative fractions of data owned by the different parties. In this paper, we describe a new differentially private algorithm for the multiparty setting that uses a stochastic gradient descent based procedure to directly optimize the overall multiparty objective rather than combining classifiers learned from optimizing local objectives. The algorithm achieves a slightly weaker form of differential privacy than that of Pathak et al (2010), but provides improved generalization guarantees that do not depend on the number of parties or the relative sizes of the individual data sets. Experimental results corroborate our theoretical findings. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/rajkumar12.html https://proceedings.mlr.press/v22/rajkumar12.html Deep Learning Made Easier by Linear Transformations in Perceptrons We transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero activation and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. This transformation aims at separating the problems of learning the linear and nonlinear parts of the whole input-output mapping, which has many benefits. We study the theoretical properties of the transformation by noting that they make the Fisher information matrix closer to a diagonal matrix, and thus standard gradient closer to the natural gradient. We experimentally confirm the usefulness of the transformations by noting that they make basic stochastic gradient learning competitive with state-of-the-art learning algorithms in speed, and that they seem also to help find solutions that generalize better. The experiments include both classification of small images and learning a low-dimensional representation for images by using a deep unsupervised auto-encoder network. The transformations were beneficial in all cases, with and without regularization and with networks from two to five hidden layers. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/raiko12.html https://proceedings.mlr.press/v22/raiko12.html Nonparametric Estimation of Conditional Information and Divergences In this paper we propose new nonparametric estimators for a family of conditional mutual information and divergences. Our estimators are easy to compute; they only use simple k nearest neighbor based statistics. We prove that the proposed conditional information and divergence estimators are consistent under certain conditions, and demonstrate their consistency and applicability by numerical experiments on simulated and on real data as well. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/poczos12.html https://proceedings.mlr.press/v22/poczos12.html Informative Priors for Markov Blanket Discovery We present a novel interpretation of information theoretic feature selection as optimization of a discriminative model. We show that this formulation coincides with a group of mutual information based filter heuristics in the literature, and show how our probabilistic framework gives a well-founded extension for informative priors. We then derive a particular sparsity prior that recovers the well-known IAMB algorithm (Tsamardinos & Aliferis, 2003) and extend it to create a novel algorithm, IAMB-IP, that includes domain knowledge priors. In empirical evaluations, we find the new algorithm to improve Markov Blanket recovery even when a misspecified prior was used, in which half the prior knowledge was incorrect. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/pocock12.html https://proceedings.mlr.press/v22/pocock12.html Fast interior-point inference in high-dimensional sparse, penalized state-space models We present an algorithm for fast posterior inference in penalized high-dimensional state-space models, suitable in the case where a few measurements are taken in each time step. We assume that the state prior and observation likelihoods are log-concave and have a special structure that allows fast matrix-vector operations. We derive a second-order algorithm for computing the maximum a posteriori state path estimate, where the cost per iteration scales linearly both in time and memory. This is done by computing an approximate Newton direction using an efficient forward-backward scheme based on a sequence of low rank updates. We formalize the conditions under which our algorithm is applicable and prove its stability and convergence. We show that the state vector can be drawn from a large class of prior distributions without affecting the linear complexity of our algorithm. This class includes both Gaussian and nonsmooth sparse and group sparse priors for which we employ an interior point modification of our algorithm. We discuss applications in text modeling and neuroscience. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/pnevmatikakis12.html https://proceedings.mlr.press/v22/pnevmatikakis12.html Learning Low-order Models for Enforcing High-order Statistics Models such as pairwise conditional random fields (CRFs) are extremely popular in computer vision and various other machine learning disciplines. However, they have limited expressive power and often cannot represent the posterior distribution correctly. While learning the parameters of such models which have insufficient expressivity, researchers use loss functions to penalize certain misrepresentations of the solution space. Till now, researchers have used only simplistic loss functions such as the hamming loss, to enable efficient inference. The paper shows how sophisticated and useful higher order loss functions can be incorporated in the learning process. These loss functions ensure that the MAP solution does not deviate much from the ground truth in terms of certain \emphhigher order statistics. We propose a learning algorithm which uses the recently proposed lower-envelope representation of higher order functions to transform them to pairwise functions, which allow efficient inference. We test the efficacy of our method on the problem of foreground-background image segmentation. Experimental results show that the incorporation of higher order loss functions in the learning formulation using our method leads to much better results compared to those obtained by using the traditional hamming loss. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/pletscher12b.html https://proceedings.mlr.press/v22/pletscher12b.html Part & Clamp: Efficient Structured Output Learning Discriminative training for general graphical models is a challenging task, due to the intractability of the partition function. We propose a computationally efficient approach to estimate the partition sum in a structured learning problem. The key idea is a lower bound of the partition sum that can be evaluated in a fixed number of message passing iterations. The bound makes use of a subset of the variables, a feedback vertex set, which allows us to decompose the graph into tractable parts. Furthermore, a tightening strategy for the bound is presented, which finds the states of the feedback vertex set that maximally increase the bound, and clamps them. Based on this lower bound we derive batch and online learning algorithms and demonstrate their effectiveness on a computer vision problem. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/pletscher12a.html https://proceedings.mlr.press/v22/pletscher12a.html Approximate Inference by Intersecting Semidefinite Bound and Local Polytope Inference in probabilistic graphical models can be represented as a constrained optimization problem of a free-energy functional. Substantial research has been focused on the approximation of the constraint set, also known as the marginal polytope. This paper presents a novel inference algorithm that tightens and solves the optimization problem by intersecting the popular local polytope and the semidefinite outer bound of the marginal polytope. Using dual decomposition, our method separates the optimization problem into two subproblems: a semidefinite program (SDP), which is solved by a low-rank SDP algorithm, and a free-energy based optimization problem, which is solved by convex belief propagation. Our method has a very reasonable computational complexity and its actual running time is typically within a small factor (=10) of convex belief propagation. Tested on both synthetic data and a real-world protein side-chain packing benchmark, our method significantly outperforms tree-reweighted belief propagation in both marginal probability inference and MAP inference. Our method is competitive with the state-of-the-art in MRF inference, solving all protein tasks solved by the recently presented MPLP method, and beating MPLP on lattices with strong edge potentials. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/peng12.html https://proceedings.mlr.press/v22/peng12.html On a Connection between Maximum Variance Unfolding, Shortest Path Problems and IsoMap We present an equivalent formulation of the Maximum Variance Unfolding (MVU) problem in terms of distance matrices. This yields a novel interpretation of the MVU problem as a regularized version of the shortest path problem on a graph. This interpretation enables us to establish an asymptotic convergence result for the case that the underlying data are drawn from a Riemannian manifold which is isometric to a convex subset of Euclidean space. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/paprotny12.html https://proceedings.mlr.press/v22/paprotny12.html Stick-Breaking Beta Processes and the Poisson Process We show that the stick-breaking construction of the beta process due to \citePaisley:2010 can be obtained from the characterization of the beta process as a Poisson process. Specifically, we show that the mean measure of the underlying Poisson process is equal to that of the beta process. We use this underlying representation to derive error bounds on truncated beta processes that are tighter than those in the literature. We also develop a new MCMC inference algorithm for beta processes, based in part on our new Poisson process construction. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/paisley12.html https://proceedings.mlr.press/v22/paisley12.html Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis Most speech analysis/synthesis systems are based on the basic physical model of speech production - the acoustic tube model. There are two main drawbacks with current speech analysis methods. First, a common design paradigm seems to build a special-purpose signal-processing front-end followed by (when appropriate) a back-end based on probabilistic models. A difficulty is that most features are nonlinear operators of the speech waveform, whose statistical behavior is hard to be modeled. Second, different tasks of speech analysis are carried out separately. These practices are admittedly useful but not optimal due to the incomplete use of available information. These examinations motivate us to directly model the spectrogram and to integrate together the three fundamental speech parameters - the pitch, energy and spectral envelope. We successfully devise such a model called probabilistic acoustic tube (PAT) model. The integration is performed in a principled manner with explicit physical meaning. We demonstrate the capability of PAT for a number of speech analysis/synthesis tasks, such as pitch tracking under both clean and additive noise conditions, speech synthesis, and phoneme clustering. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/ou12.html https://proceedings.mlr.press/v22/ou12.html Bayesian Quadrature for Ratios We describe a novel approach to quadrature for ratios of probabilistic integrals, such as are used to compute posterior probabilities. It offers performance superior to Monte Carlo methods by exploiting a Bayesian quadrature framework. We improve upon previous Bayesian quadrature techniques by explicitly modelling the non-negativity of our integrands, and the correlations that exist between them. It offers most where the integrand is multi-modal and expensive to evaluate, as is commonplace in exoplanets research; we demonstrate the efficacy of our method on data from the Kepler spacecraft. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/osborne12.html https://proceedings.mlr.press/v22/osborne12.html Beyond Logarithmic Bounds in Online Learning We prove logarithmic regret bounds that depend on the loss L_T^* of the competitor rather than on the number T of time steps. In the general online convex optimization setting, our bounds hold for any smooth and exp-concave loss (such as the square loss or the logistic loss). This bridges the gap between the O(ln T) regret exhibited by exp-concave losses and the O(sqrt(L_T^*)) regret exhibited by smooth losses. We also show that these bounds are tight for specific losses, thus they cannot be improved in general. For online regression with square loss, our analysis can be used to derive a sparse randomized variant of the online Newton step, whose expected number of updates scales with the algorithm’s loss. For online classification, we prove the first logarithmic mistake bounds that do not rely on prior knowledge of a bound on the competitor’s norm. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/orabona12.html https://proceedings.mlr.press/v22/orabona12.html A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared and therefore the sets of relevant features are allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/niu12.html https://proceedings.mlr.press/v22/niu12.html The adversarial stochastic shortest path problem with unknown transition probabilities We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse through a finite directed acyclic graph with random transitions while maximizing the obtained rewards along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode. Previous work was concerned with the case when the stochastic dynamics is known ahead of time, whereas the main novelty of this paper is that this assumption is lifted. We propose an algorithm called “follow the perturbed optimistic policy” that combines ideas from the “follow the perturbed leader” method for online learning of arbitrary sequences and “upper confidence reinforcement learning”, an algorithm for regret minimization in Markovian decision processes (with a fixed reward function). We prove that the expected cumulative regret of our algorithm is of order L X A\sqrtT up to logarithmic factors, where L is the length of the longest path in the graph, \X is the state space, \A is the action space and T is the number of episodes. To our knowledge this is the first algorithm that learns and controls stochastic and adversarial components in an online fashion at the same time. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/neu12.html https://proceedings.mlr.press/v22/neu12.html Deep Boltzmann Machines as Feed-Forward Hierarchies The deep Boltzmann machine is a powerful model that extracts the hierarchical structure of observed data. While inference is typically slow due to its undirected nature, we argue that the emerging feature hierarchy is still explicit enough to be traversed in a feed-forward fashion. The claim is corroborated by training a set of deep neural networks on real data and measuring the evolution of the representation layer after layer. The analysis reveals that the deep Boltzmann machine produces a feed-forward hierarchy of increasingly invariant representations that clearly surpasses the layer-wise approach. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/montavon12.html https://proceedings.mlr.press/v22/montavon12.html Lifted Linear Programming Lifted inference approaches have rendered large, previously intractable probabilistic inference problems quickly solvable by handling whole sets of indistinguishable objects together. Triggered by this success, we show that another important AI technique is liftable, too, namely linear programming. Intuitively, given a linear program (LP), we employ a lifted variant of Gaussian belief propagation (GaBP) to solve the systems of linear equations arising when running an interior-point method to solve the LP. However, this naive solution cannot make use of standard solvers for linear equations and is doomed to construct lifted networks in each iteration of the interior-point method again, an operation that can itself be quite costly. To address both issues, we show how to read off an equivalent LP from the lifted GaBP computations that can be solved using any off-the-shelf LP solver. We prove the correctness of this compilation approach, including a lifted duality theorem, and experimentally demonstrate that it can greatly reduce the cost of solving LPs. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/mladenov12.html https://proceedings.mlr.press/v22/mladenov12.html Max-Margin Min-Entropy Models We propose a new family of latent variable models called max-margin min-entropy (M3E) models, which define a distribution over the output and the hidden variables conditioned on the input. Given an input, an M3E model predicts the output with the smallest corresponding Renyi entropy of generalized distribution. This is equivalent to minimizing a score that consists of two terms: (i) the negative log-likelihood of the output, ensuring that the output has a high probability; and (ii) a measure of uncertainty over the distribution of the hidden variables conditioned on the input and the output, ensuring that there is little confusion in the values of the hidden variables. Given a training dataset, the parameters of an M3E model are learned by maximizing the margin between the Renyi entropies of the ground-truth output and all other incorrect outputs. Training an M3E can be viewed as minimizing an upper bound on a user-defined loss, and includes, as a special case, the latent support vector machine framework. We demonstrate the efficacy of M3E models on two standard machine learning applications, discriminative motif finding and image classification, using publicly available datasets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/miller12.html https://proceedings.mlr.press/v22/miller12.html Globally Optimizing Graph Partitioning Problems Using Message Passing Graph partitioning algorithms play a central role in data analysis and machine learning. Most useful graph partitioning criteria correspond to optimizing a ratio between the cut and the size of the partitions, this ratio leads to an NP-hard problem that is only solved approximately. This makes it difficult to know whether failures of the algorithm are due to failures of the optimization or to the criterion being optimized. In this paper we present a framework that seeks and finds the optimal solution of several NP-hard graph partitioning problems. We use a classical approach to ratio problems where we repeatedly ask whether the optimal solution is greater than or less than some constant - lambda. Our main insight is the equivalence between this “lambda question” and performing inference in a graphical model with many local potentials and one high-order potential. We show that this specific form of the high-order potential is amenable to message-passing algorithms and how to obtain a bound on the optimal solution from the messages. Our experiments show that in many cases our approach yields the global optimum and improves the popular spectral solution. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/mezuman12.html https://proceedings.mlr.press/v22/mezuman12.html Movement Segmentation and Recognition for Imitation Learning In human movement learning, it is most common to teach constituent elements of complex movements in isolation, before chaining them into complex movements. Segmentation and recognition of observed movement could thus proceed out of this existing knowledge, which is directly compatible with movement generation. In this paper, we address exactly this scenario. We assume that a library of movement primitives has already been taught, and we wish to identify elements of the library in a complex motor act, where the individual elements have been smoothed together, and, occasionally, there might be a movement segment that is not in our library yet. We employ a flexible machine learning representation of movement primitives based on learnable nonlinear attractor system. For the purpose of movement segmentation and recognition, it is possible to reformulate this representation as a controlled linear dynamical system. An Expectation-Maximization algorithm can be developed to estimate the open parameters of a movement primitive from the library, using as input an observed trajectory piece. If no matching primitive from the library can be found, a new primitive is created. This process allows a straightforward sequential segmentation of observed movement into known and new primitives, which are suitable for robot imitation learning. We illustrate our approach with synthetic examples and data collected from human movement. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/meier12.html https://proceedings.mlr.press/v22/meier12.html Adaptive MCMC with Bayesian Optimization This paper proposes a new randomized strategy for adaptive MCMC using Bayesian optimization. This approach applies to non-differentiable objective functions and trades off exploration and exploitation to reduce the number of potentially costly objective function evaluations. We demonstrate the strategy in the complex setting of sampling from constrained, discrete and densely connected probabilistic graphical models where, for each variation of the problem, one needs to adjust the parameters of the proposal mechanism automatically to ensure efficient mixing of the Markov chains. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/mahendran12.html https://proceedings.mlr.press/v22/mahendran12.html Efficient Gaussian Process Inference for Short-Scale Spatio-Temporal Modeling This paper presents an efficient Gaussian process inference scheme for modeling shortscale phenomena in spatio-temporal datasets. Our model uses a sum of separable, compactly supported covariance functions, which yields a full covariance matrix represented in terms of small sparse matrices operating either on the spatial or temporal domain. The proposed inference procedure is based on Gibbs sampling, in which samples from the conditional distribution of the latent function values are obtained by applying a simple linear transformation to samples drawn from the joint distribution of the function values and the observations. We make use of the proposed model structure and the conjugate gradient method to compute the required transformation. In the experimental part, the proposed algorithm is compared to the standard approach using the sparse Cholesky decomposition and it is shown to be much faster and computationally feasible for 100-1000 times larger datasets. We demonstrate the advantages of the proposed method in the problem of reconstructing sea surface temperature, which requires processing of a real-world dataset with 10^6 observations. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/luttinen12.html https://proceedings.mlr.press/v22/luttinen12.html Closed-Form Entropy Limits - A Tool to Monitor Likelihood Optimization of Probabilistic Generative Models The maximization of the data likelihood under a given probabilistic generative model is the essential goal of many algorithms for unsupervised learning. If expectation maximization is used for optimization, a lower bound on the data likelihood, the free-energy, is optimized. The parameter-dependent part of the free-energy (the difference between free-energy and posterior entropy) is the essential entity in the derivation of learning algorithms. Here we show that for many common generative models the optimal values of the parameter-dependent part can be derived in closed-form. These closed-form expressions are hereby given as sums of the negative (differential) entropies of the individual model distributions. We apply our theoretical results to derive such closed-form expressions for a number of common and recent models, including probabilistic PCA, factor analysis, different versions of sparse coding, and Linear Dynamical Systems. The main contribution of this work is theoretical but we show how the derived results can be used to efficiently compute free-energies, and how they can be used for consistency checks of learning algorithms. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/lucke12.html https://proceedings.mlr.press/v22/lucke12.html A Simple Geometric Interpretation of SVM using Stochastic Adversaries We present a minimax framework for classification that considers stochastic adversarial perturbations to the training data. We show that for binary classification it is equivalent to SVM, but with a very natural interpretation of regularization parameter. In the multiclass case, we obtain that our formulation is equivalent to regularizing the hinge loss with the maximum norm of the weight vector (i.e., the two-infinity norm). We test this new regularization scheme and show that it is competitive with the Frobenius regularization commonly used for multiclass SVM. We proceed to analyze various forms of stochastic perturbations and obtain compact optimization problems for the optimal classifiers. Taken together, our results illustrate the advantage of using stochastic perturbations rather than deterministic ones, as well as offer a simple geometric interpretation for SVM optimization in the non-separable case. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/livni12.html https://proceedings.mlr.press/v22/livni12.html High-Dimensional Structured Feature Screening Using Binary Markov Random Fields Feature screening is a useful feature selection approach for high-dimensional data when the goal is to identify all the features relevant to the response variable. However, common feature screening methods do not take into account the correlation structure of the covariate space. We propose the concept of a feature relevance network, a binary Markov random field to represent the relevance of each individual feature by potentials on the nodes, and represent the correlation structure by potentials on the edges. By performing inference on the feature relevance network, we can accordingly select relevant features. Our algorithm does not yield sparsity, which is different from the particular popular family of feature selection approaches based on penalized least squares or penalized pseudo-likelihood. We give one concrete algorithm under this framework and show its superior performance over common feature selection methods in terms of prediction error and recovery of the truly relevant features on real-world data and synthetic data. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/liu12b.html https://proceedings.mlr.press/v22/liu12b.html Exact Subspace Segmentation and Outlier Detection by Low-Rank Representation In this work, we address the following matrix recovery problem: suppose we are given a set of data points containing two parts, one part consists of samples drawn from a union of multiple subspaces and the other part consists of outliers. We do not know which data points are outliers, or how many outliers there are. The rank and number of the subspaces are unknown either. Can we detect the outliers and segment the samples into their right subspaces, efficiently and exactly? We utilize a so-called Low-Rank Representation (LRR) method to solve this problem, and prove that under mild technical conditions, any solution to LRR exactly recover the row space of the samples and detect the outliers as well. Since the subspace membership is provably determined by the row space, this further implies that LRR can perform exact subspace segmentation and outlier detection, in an efficient way. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/liu12a.html https://proceedings.mlr.press/v22/liu12a.html Efficient Sampling from Combinatorial Space via Bridging MCMC sampling has been extensively studied and used in probabilistic inference. Many algorithms rely on local updates to explore the space, often resulting in slow convergence or failure to mix when there is no path from one set of states to another via local changes. We propose an efficient method for sampling from combinatorial spaces that addresses these issues via “bridging states” that facilitate the communication between different parts of the space. Such states can be created dynamically, providing more flexibility than methods relying on specific space structures to design jump proposals. Theoretical analysis of the approach yields bounds on mixing times. Empirical analysis demonstrates the practical utility on two problems: constrained map labeling and inferring partial order of object layers in a video. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/lin12.html https://proceedings.mlr.press/v22/lin12.html Data dependent kernels in nearly-linear time We propose a method to efficiently construct data dependent kernels which can make use of large quantities of (unlabeled) data. Our construction makes an approximation in the standard construction of semi-supervised kernels in Sindhwani et al. (2005). In typical cases these kernels can be computed in nearly-linear time (in the amount of data), improving on the cubic time of the standard construction, enabling large scale semi-supervised learning in a variety of contexts. The methods are validated on semi-supervised and unsupervised problems on data sets containing upto 64,000 sample points. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/lever12.html https://proceedings.mlr.press/v22/lever12.html Efficient Hypergraph Clustering Data clustering is an essential problem in data mining, machine learning and computer vision. In this paper we present a novel method for the hypergraph clustering problem, in which second or higher order affinities between sets of data points are considered. Our algorithm has important theoretical properties, such as convergence and satisfaction of first order necessary optimality conditions. It is based on an efficient iterative procedure, which by updating the cluster membership of all points in parallel, is able to achieve state of the art results in very few steps. We outperform current hypergraph clustering methods especially in terms of computational speed, but also in terms of accuracy. Moreover, we show that our method could be successfully applied both to higher-order assignment problems and to image segmentation. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/leordeanu12.html https://proceedings.mlr.press/v22/leordeanu12.html Preface Preface to the Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics April 21-23, 2012 La Palma, Canary Islands. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/lawrence12.html https://proceedings.mlr.press/v22/lawrence12.html Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets We propose a new method for comparing learning algorithms on multiple tasks which is based on a novel non-parametric test that we call the Poisson binomial test. The key aspect of this work is that we provide a formal definition for what is meant to have an algorithm that is better than another. Also, we are able to take into account the dependencies induced when evaluating classifiers on the same test set. Finally we make optimal use (in the Bayesian sense) of all the testing data we have. We demonstrate empirically that our approach is more reliable than the sign test and the Wilcoxon signed rank test, the current state of the art for algorithm comparisons. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/lacoste12.html https://proceedings.mlr.press/v22/lacoste12.html Message-Passing Algorithms for MAP Estimation Using DC Programming We address the problem of finding the most likely assignment or MAP estimation in a Markov random field. We analyze the linear programming formulation of MAP through the lens of difference of convex functions (DC) programming, and use the concave-convex procedure (CCCP) to develop efficient message-passing solvers. The resulting algorithms are guaranteed to converge to a global optimum of the well-studied local polytope, an outer bound on the MAP marginal polytope. To tighten the outer bound, we show how to combine it with the mean-field based inner bound and, again, solve it using CCCP. We also identify a useful relationship between the DC formulations and some recently proposed algorithms based on Bregman divergence. Experimentally, this hybrid approach produces optimal solutions for a range of hard OR problems and near-optimal solutions for standard benchmarks. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kumar12.html https://proceedings.mlr.press/v22/kumar12.html Marginal Regression For Multitask Learning Variable selection is an important practical problem that arises in analysis of many high-dimensional datasets. Convex optimization procedures, that arise from relaxing the NP-hard subset selection procedure, e.g., the Lasso or Dantzig selector, have become the focus of intense theoretical investigations. Although many efficient algorithms exist that solve these problems, finding a solution when the number of variables is large, e.g., several hundreds of thousands in problems arising in genome-wide association analysis, is still computationally challenging. A practical solution for these high-dimensional problems is the marginal regression, where the output is regressed on each variable separately. We investigate theoretical properties of the marginal regression in a multitask framework. Our contribution include: i) sharp analysis for the marginal regression in a single task setting with random design, ii) sufficient conditions for the multitask screening to select the relevant variables, iii) a lower bound on the Hamming distance convergence for multitask variable selection problems. A simulation study further demonstrates the performance of the marginal regression. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kolar12.html https://proceedings.mlr.press/v22/kolar12.html Multiple Texture Boltzmann Machines We assess the generative power of the mPoT-model of [10] with tiled-convolutional weight sharing as a model for visual textures by specifically training on this task, evaluating model performance on texture synthesis and inpainting tasks using quantitative metrics. We also analyze the relative importance of the mean and covariance parts of the mPoT model by comparing its performance to those of its subcomponents, tiled-convolutional versions of the PoT/FoE and Gaussian-Bernoulli restricted Boltzmann machine (GB-RBM). Our results suggest that while state-of-the-art or better performance can be achieved using the mPoT, similar performance can be achieved with the mean-only model. We then develop a model for multiple textures based on the GB-RBM, using a shared set of weights but texture-specific hidden unit biases. We show comparable performance of the multiple texture model to individually trained texture models. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kivinen12.html https://proceedings.mlr.press/v22/kivinen12.html Regression for sets of polynomial equations We propose a method called ideal regression for approximating an arbitrary system of polynomial equations by a system of a particular type. Using techniques from approximate computational algebraic geometry, we show how we can solve ideal regression directly without resorting to numerical optimization. Ideal regression is useful whenever the solution to a learning problem can be described by a system of polynomial equations. As an example, we demonstrate how to formulate Stationary Subspace Analysis (SSA), a source separation problem, in terms of ideal regression, which also yields a consistent estimator for SSA. We then compare this estimator in simulations with previous optimization-based approaches for SSA. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kiraly12.html https://proceedings.mlr.press/v22/kiraly12.html Bayesian Classifier Combination Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under certain restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination (which differs from model averaging) in the context of classification. This framework explicitly models the relationship between each model’s output and the unknown true label. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure starting from a classic statistical model proposed by Dawid and Skene (1979) and using MCMC to add more complex but important features to the model. Comparisons on several data sets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kim12.html https://proceedings.mlr.press/v22/kim12.html A Stick-Breaking Likelihood for Categorical Data Analysis with Latent Gaussian Models The development of accurate models and efficient algorithms for the analysis of multivariate categorical data are important and long-standing problems in machine learning and computational statistics. In this paper, we focus on modeling categorical data using Latent Gaussian Models (LGMs). We propose a novel stick-breaking likelihood function for categorical LGMs that exploits accurate linear and quadratic bounds on the logistic log-partition function, leading to an effective variational inference and learning framework. We thoroughly compare our approach to existing algorithms for multinomial logit/probit likelihoods on several problems, including inference in multinomial Gaussian process classification and learning in latent factor models. Our extensive comparisons demonstrate that our stick-breaking model effectively captures correlation in discrete data and is well suited for the analysis of categorical data. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/khan12.html https://proceedings.mlr.press/v22/khan12.html Online Clustering of Processes The problem of online clustering is considered in the case where each data point is a sequence generated by a stationary ergodic process. Data arrive in an online fashion so that the sample received at every time-step is either a continuation of some previously received sequence or a new sequence. The dependence between the sequences can be arbitrary. No parametric or independence assumptions are made; the only assumption is that the marginal distribution of each sequence is stationary and ergodic. A novel, computationally efficient algorithm is proposed and is shown to be asymptotically consistent (under a natural notion of consistency). The performance of the proposed algorithm is evaluated on simulated data, as well as on real datasets (motion classification). Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/khaleghi12.html https://proceedings.mlr.press/v22/khaleghi12.html On Bayesian Upper Confidence Bounds for Bandit Problems Stochastic bandit problems have been analyzed from two different perspectives: a frequentist view, where the parameter is a deterministic unknown quantity, and a Bayesian approach, where the parameter is drawn from a prior distribution. We show in this paper that methods derived from this second perspective prove optimal when evaluated using the frequentist cumulated regret as a measure of performance. We give a general formulation for a class of Bayesian index policies that rely on quantiles of the posterior distribution. For binary bandits, we prove that the corresponding algorithm, termed Bayes-UCB, satisfies finite-time regret bounds that imply its asymptotic optimality. More generally, Bayes-UCB appears as an unifying framework for several variants of the UCB algorithm addressing different bandit problems (parametric multi-armed bandits, Gaussian bandits with unknown mean and variance, linear bandits). But the generality of the Bayesian approach makes it possible to address more challenging models. In particular, we show how to handle linear bandits with sparsity constraints by resorting to Gibbs sampling. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kaufmann12.html https://proceedings.mlr.press/v22/kaufmann12.html Random Feature Maps for Dot Product Kernels Approximating non-linear kernels using feature maps has gained a lot of interest in recent years due to applications in reducing training and testing times of SVM classifiers and other kernel based learning algorithms. We extend this line of work and present low distortion embeddings for dot product kernels into linear Euclidean spaces. We base our results on a classical result in harmonic analysis characterizing all dot product kernels and use it to define randomized feature maps into explicit low dimensional Euclidean spaces in which the native dot product provides an approximation to the dot product kernel with high confidence. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/kar12.html https://proceedings.mlr.press/v22/kar12.html High-dimensional Sparse Inverse Covariance Estimation using Greedy Methods In this paper we consider the task of estimating the non-zero pattern of the sparse inverse covariance matrix of a zero-mean Gaussian random vector from a set of iid samples. Note that this is also equivalent to recovering the underlying graph structure of a sparse Gaussian Markov Random Field (GMRF). We present two novel greedy approaches to solving this problem. The first estimates the non-zero covariates of the overall inverse covariance matrix using a series of global forward and backward greedy steps. The second estimates the neighborhood of each node in the graph separately, again using greedy forward and backward steps, and combines the intermediate neighborhoods to form an overall estimate. The principal contribution of this paper is a rigorous analysis of the sparsistency, or consistency in recovering the sparsity pattern of the inverse covariance matrix. Surprisingly, we show that both the local and global greedy methods learn the full structure of the model with high probability given just O(d log(p)) samples, which is a significant improvement over state of the art L1-regularized Gaussian MLE (Graphical Lasso) that requires O(d^2 log(p)) samples. Moreover, the restricted eigenvalue and smoothness conditions imposed by our greedy methods are much weaker than the strong irrepresentable conditions required by the L1-regularization based methods. We corroborate our results with extensive simulations and examples, comparing our local and global greedy methods to the L1-regularized Gaussian MLE as well as the nodewise L1-regularized linear regression (Neighborhood Lasso). Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/johnson12.html https://proceedings.mlr.press/v22/johnson12.html Detecting Network Cliques with Radon Basis Pursuit In this paper, we propose a novel formulation of the network clique detection problem by introducing a general network data representation framework. We show connections between our formulation with a new algebraic tool, namely Radon basis pursuit in homogeneous spaces. Such a connection allows us to identify rigorous recovery conditions for clique detection problems. Practical approximation algorithms are also developed for solving empirical problems and their usefulness is demonstrated on real-world datasets. Our work connects two seemingly different areas: network data analysis and compressed sensing, which helps to bridge the gap between the research of network data and the classical theory of statistical learning and signal processing. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/jiang12.html https://proceedings.mlr.press/v22/jiang12.html A Variance Minimization Criterion to Active Learning on Graphs We consider the problem of active learning over the vertices in a graph, without feature representation. Our study is based on the common graph smoothness assumption, which is formulated in a Gaussian random field model. We analyze the probability distribution over the unlabeled vertices conditioned on the label information, which is a multivariate normal with the mean being the harmonic solution over the field. Then we select the nodes to label such that the total variance of the distribution on the unlabeled data, as well as the expected prediction error, is minimized. In this way, the classifier we obtain is theoretically more robust. Compared with existing methods, our algorithm has the advantage of selecting data in a batch offline mode with solid theoretical support. We show improved performance over existing label selection criteria on several real world data sets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/ji12.html https://proceedings.mlr.press/v22/ji12.html Subset Infinite Relational Models We propose a new probabilistic generative model for analyzing sparse and noisy pairwise relational data, such as friend-links on SNSs and customer records in online shops. Real-world relational data often include a large portion of non-informative pairwise data entries. Many existing stochastic blockmodels suffer from these irrelevant data entries because of their rather simpler forms of priors. The proposed model newly incorporates a latent variable that explicitly indicates whether each data entry is relevant or not to diminish the bad effects associated with such irrelevant data. Through experimental results using synthetic and real data sets, we show that the proposed model can extract clusters with stronger relations among data within the cluster than clusters obtained by the conventional model. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/ishiguro12.html https://proceedings.mlr.press/v22/ishiguro12.html Variable Selection for Gaussian Graphical Models We present a variable-selection structure learning approach for Gaussian graphical models. Unlike standard sparseness promoting techniques, our method aims at selecting the most-important variables besides simply sparsifying the set of edges. Through simulations, we show that our method outperforms the state-of-the-art in recovering the ground truth model. Our method also exhibits better generalization performance in a wide range of complex real-world datasets: brain fMRI, gene expression, NASDAQ stock prices and world weather. We also show that our resulting networks are more interpretable in the context of brain fMRI analysis, while retaining discriminability. From an optimization perspective, we show that a block coordinate descent method generates a sequence of positive definite solutions. Thus, we reduce the original problem into a sequence of strictly convex (\ell_1,\ell_p) regularized quadratic minimization subproblems for p∈{2,∞}. Our algorithm is well founded since the optimal solution of the maximization problem is unique and bounded. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/honorio12.html https://proceedings.mlr.press/v22/honorio12.html Stochastic Bandit Based on Empirical Moments In the multiarmed bandit problem a gambler chooses an arm of a slot machine to pull considering a tradeoff between exploration and exploitation. We study the stochastic bandit problem where each arm has a reward distribution supported in [0,1]. For this model, there exists a policy which achieves the theoretical bound asymptotically. However the optimal policy requires a computation of a convex optimization which involves the empirical distribution of each arm. In this paper, we propose a policy which exploits the first d empirical moments for arbitrary d fixed in advance. We show that the performance of the policy approaches the theoretical bound as d increases. This policy can be implemented by solving polynomial equations, which we derive the explicit solution for d smaller than 5. By choosing appropriate d, the proposed policy realizes a tradeoff between the computational complexity and the expected regret. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/honda12.html https://proceedings.mlr.press/v22/honda12.html Maximum Margin Temporal Clustering Temporal Clustering (TC) refers to the factorization of multiple time series into a set of non-overlapping segments that belong to k temporal clusters. Existing methods based on extensions of generative models such as k-means or Switching Linear Dynamical Systems often lead to intractable inference and lack a mechanism for feature selection, critical when dealing with high dimensional data. To overcome these limitations, this paper proposes Maximum Margin Temporal Clustering (MMTC). MMTC simultaneously determines the start and the end of each segment, while learning a multi-class Support Vector Machine to discriminate among temporal clusters. MMTC extends Maximum Margin Clustering in two ways: first, it incorporates the notion of TC, and second, it introduces additional constraints to achieve better balance between clusters. Experiments on clustering human actions and bee dancing motions illustrate the benefits of our approach compared to state-of-the-art methods. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/hoai12.html https://proceedings.mlr.press/v22/hoai12.html Kernel Topic Models Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents’ mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/hennig12.html https://proceedings.mlr.press/v22/hennig12.html Exchangeability Characterizes Optimality of Sequential Normalized Maximum Likelihood and Bayesian Prediction with Jeffreys Prior We study online prediction of individual sequences under logarithmic loss with parametric constant experts. The optimal strategy, normalized maximum likelihood (NML), is computationally demanding and requires the length of the game to be known. We consider two simpler strategies: sequential normalized maximum likelihood (SNML), which computes the NML forecasts at each round as if it were the last round, and Bayesian prediction. Under appropriate conditions, both are known to achieve near-optimal regret. In this paper, we investigate when these strategies are optimal. We show that SNML is optimal iff the joint distribution on sequences defined by SNML is exchangeable. In the case of exponential families, this is equivalent to the optimality of any Bayesian prediction strategy, and the optimal prior is Jeffreys prior. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/hedayati12.html https://proceedings.mlr.press/v22/hedayati12.html Information Theoretic Model Validation for Spectral Clustering Model validation constitutes a fundamental step in data clustering. The central question is: Which cluster model and how many clusters are most appropriate for a certain application? In this study, we introduce a method for the validation of spectral clustering based upon approximation set coding. In particular, we compare correlation and pairwise clustering to analyze the correlations of temporal gene expression profiles. To evaluate and select clustering models, we calculate their reliable informativeness. Experimental results in the context of gene expression analysis show that pairwise clustering yields superior amounts of reliable information. The analysis results are consistent with the Bayesian Information Criterion (BIC), and exhibit higher generality than BIC. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/haghir12.html https://proceedings.mlr.press/v22/haghir12.html Evaluation of marginal likelihoods via the density of states Bayesian model comparison involves the evaluation of the marginal likelihood, the expectation of the likelihood under the prior distribution. Typically, this high-dimensional integral over all model parameters is approximated using Markov chain Monte Carlo methods. Thermodynamic integration is a popular method to estimate the marginal likelihood by using samples from annealed posteriors. Here we show that there exists a robust and flexible alternative. The new method estimates the density of states, which counts the number of states associated with a particular value of the likelihood. If the density of states is known, computation of the marginal likelihood reduces to a one- dimensional integral. We outline a maximum likelihood procedure to estimate the density of states from annealed posterior samples. We apply our method to various likelihoods and show that it is superior to thermodynamic integration in that it is more flexible with regard to the annealing schedule and the family of bridging distributions. Finally, we discuss the relation of our method with Skilling’s nested sampling. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/habeck12.html https://proceedings.mlr.press/v22/habeck12.html Locality Preserving Feature Learning Locality Preserving Indexing (LPI) has been quite successful in tackling document analysis problems, such as clustering or classification. The approach relies on the Locality Preserving Criterion, which preserves the locality of the data points. However, LPI takes every word in a data corpus into account, even though many words may not be useful for document clustering. To overcome this problem, we propose an approach called Locality Preserving Feature Learning (LPFL), which incorporates feature selection into LPI. Specifically, we aim to find a subset of features, and learn a linear transformation to optimize the Locality Preserving Criterion based on these features. The resulting optimization problem is a mixed integer programming problem, which we relax into a constrained Frobenius norm minimization problem, and solve using a variation of Alternating Direction Method (ADM). ADM, which iteratively updates the linear transformation matrix, the residue matrix and the Lagrangian multiplier, is theoretically guaranteed to converge at the rate O(1/t). Experiments on benchmark document datasets show that our proposed method outperforms LPI, as well as other state-of-the-art document analysis approaches. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/gu12.html https://proceedings.mlr.press/v22/gu12.html Bayesian regularization of non-homogeneous dynamic Bayesian networks by globally coupling interaction parameters To relax the homogeneity assumption of classical dynamic Bayesian networks (DBNs), various recent studies have combined DBNs with multiple changepoint processes. The underlying assumption is that the parameters associated with time series segments delimited by multiple changepoints are a priori independent. Under weak regularity conditions, the parameters can be integrated out in the likelihood, leading to a closed-form expression of the marginal likelihood. However, the assumption of prior independence is unrealistic in many real-world applications, where the segment-specific regulatory relationships among the interdependent quantities tend to undergo gradual evolutionary adaptations. We therefore propose a Bayesian coupling scheme to introduce systematic information sharing among the segment-specific interaction parameters. We investigate the effect this model improvement has on the network reconstruction accuracy in a reverse engineering context, where the objective is to learn the structure of a gene regulatory network from temporal gene expression profiles. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/grzegorzyk12.html https://proceedings.mlr.press/v22/grzegorzyk12.html SpeedBoost: Anytime Prediction with Uniform Near-Optimality We present SpeedBoost, a natural extension of functional gradient descent, for learning anytime predictors, which automatically trade computation time for predictive accuracy by selecting from a set of simpler candidate predictors. These anytime predictors not only generate approximate predictions rapidly, but are capable of using extra resources at prediction time, when available, to improve performance. We also demonstrate how our framework can be used to select weak predictors which target certain subsets of the data, allowing for efficient use of computational resources on difficult examples. We also show that variants of the SpeedBoost algorithm produce predictors which are provably competitive with any possible sequence of weak predictors with the same total complexity. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/grubb12.html https://proceedings.mlr.press/v22/grubb12.html On Average Reward Policy Evaluation in Infinite-State Partially Observable Systems We investigate the problem of estimating the average reward of given decision policies in discrete-time controllable dynamical systems with finite action and observation sets, but possibly infinite state space. Unlike in systems with finite state spaces, in infinite–state systems the expected reward for some policies might not exist, so policy evaluation, which is a key step in optimal control methods, might fail. Our main analysis tool is Ergodic theory, which allows learning potentially useful quantities from the system without building a model. Our main contribution is three-fold. First, we present several dynamical systems that demonstrate the difficulty of learning in the general case, without making additional assumptions. We state the necessary condition that the underlying system must satisfy to be amenable for learning. Second, we discuss the relationship between this condition and state-of-the-art predictive representations, and we show that there are systems that satisfy the above condition but cannot be modeled by such representations. Third, we establish sufficient conditions for average-reward policy evaluation in this setting. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/grinberg12.html https://proceedings.mlr.press/v22/grinberg12.html Scalable Inference on Kingman’s Coalescent using Pair Similarity We present a scalable sequential Monte Carlo algorithm and its greedy counterpart for models based on Kingman’s coalescent. We utilize fast nearest neighbor algorithms to limit expensive computations to only a subset of data point pairs. For a dataset size of n, the resulting algorithm has O(n log n) computational complexity. We empirically verify that we achieve a large speedup in computation. When the gain in speed is used for increasing the number of particles, we can often obtain significantly better samples than those of previous algorithms. We apply our algorithm for learning visual taxonomies of birds on 6033 examples, a dataset size for which previous algorithms fail to be feasible. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/gorur12.html https://proceedings.mlr.press/v22/gorur12.html Regularization Paths with Guarantees for Convex Semidefinite Optimization We devise a simple algorithm for computing an approximate solution path for parameterized semidefinite convex optimization problems that is guaranteed to be epsilon-close to the exact solution path. As a consequence, we can compute the entire regularization path for many regularized matrix completion and factorization approaches, as well as nuclear norm or weighted nuclear norm regularized convex optimization problems. This also includes robust PCA and variants of sparse PCA. On the theoretical side, we show that the approximate solution path has low complexity. This implies that the whole solution path can be computed efficiently. Our experiments demonstrate the practicality of the approach for large matrix completion problems. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/giesen12.html https://proceedings.mlr.press/v22/giesen12.html UPAL: Unbiased Pool Based Active Learning In this paper we address the problem of pool based active learning, and provide an algorithm, called UPAL, that works by minimizing the unbiased estimator of the risk of a hypothesis in a given hypothesis space. For the space of linear classifiers and the squared loss we show that UPAL is equivalent to an exponentially weighted average forecaster. Exploiting some recent results regarding the spectra of random matrices allows us to analyze UPAL with squared losses for the noiseless setting. Empirical comparison with an active learner implementation in Vowpal Wabbit, and a previously proposed pool based active learner implementation show good empirical performance and better scalability. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/ganti12.html https://proceedings.mlr.press/v22/ganti12.html Hierarchical Latent Dictionaries for Models of Brain Activation In this work, we propose a hierarchical latent dictionary approach to estimate the time-varying mean and covariance of a process for which we have only limited noisy samples. We fully leverage the limited sample size and redundancy in sensor measurements by transferring knowledge through a hierarchy of lower dimensional latent processes. As a case study, we utilize Magnetoencephalography (MEG) recordings of brain activity to identify the word being viewed by a human subject. Specifically, we identify the word category for a single noisy MEG recording, when given only limited noisy samples on which to train. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/fyshe12.html https://proceedings.mlr.press/v22/fyshe12.html Factorized Asymptotic Bayesian Inference for Mixture Modeling This paper proposes a novel Bayesian approximation inference method for mixture modeling. Our key idea is to factorize marginal log-likelihood using a variational distribution over latent variables. An asymptotic approximation, a factorized information criterion (FIC), is obtained by applying the Laplace method to each of the factorized components. In order to evaluate FIC, we propose factorized asymptotic Bayesian inference (FAB), which maximizes an asymptotically-consistent lower bound of FIC. FIC and FAB have several desirable properties: 1) asymptotic consistency with the marginal log-likelihood, 2) automatic component selection on the basis of an intrinsic shrinkage mechanism, and 3) parameter identifiability in mixture modeling. Experimental results show that FAB outperforms state-of-the-art VB methods. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/fujimaki12.html https://proceedings.mlr.press/v22/fujimaki12.html Semiparametric Pseudo-Likelihood Estimation in Markov Random Fields Probabilistic graphical models for continuous variables can be built out of either parametric or nonparametric conditional density estimators. While several research efforts have been focusing on parametric approaches (such as Gaussian models), kernel-based estimators are still the only viable and well-understood option for nonparametric density estimation. This paper develops a semiparametric estimator of probability density functions based on the nonparanormal transformation, which has been recently proposed for mapping arbitrarily distributed data samples onto normally distributed datasets. Pointwise and uniform consistency properties are established for the developed method. The resulting density model is then applied to pseudo-likelihood estimation in Markov random fields. An experimental evaluation on data distributed according to a variety of density functions indicates that such semiparametric Markov random field models significantly outperform both their Gaussian and kernel-based alternatives in terms of prediction accuracy. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/freno12.html https://proceedings.mlr.press/v22/freno12.html No Internal Regret via Neighborhood Watch We present an algorithm which attains O(\sqrtT) internal (and thus external) regret for finite games with partial monitoring under the local observability condition. Recently, this condition has been shown by Bartok, Pal, and Szepesvari (2011) to imply the O(\sqrtT) rate for partial monitoring games against an i.i.d. opponent, and the authors conjectured that the same holds for non-stochastic adversaries. Our result is in the affirmative, and it completes the characterization of possible rates for finite partial-monitoring games, an open question stated by Cesa-Bianchi, Lugosi, and Stoltz (2006). Our regret guarantees also hold for the more general model of partial monitoring with random signals. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/foster12.html https://proceedings.mlr.press/v22/foster12.html High-Rank Matrix Completion This paper considers the problem of completing a matrix with many missing entries under the assumption that the columns of the matrix belong to a union of multiple low-rank subspaces. This generalizes the standard low-rank matrix completion problem to situations in which the matrix rank can be quite high or even full rank. Since the columns belong to a union of subspaces, this problem may also be viewed as a missing-data version of the subspace clustering problem. Let X be an nxN matrix whose (complete) columns lie in a union of at most k subspaces, each of rank = r n, and assume Nkn. The main result of the paper shows that under mild assumptions each column of X can be perfectly recovered with high probability from an incomplete version so long as at least C r N \log^2(n) entries of X are observed uniformly at random, with C1 a constant depending on the usual incoherence conditions, the geometrical arrangement of subspaces, and the distribution of columns over the subspaces. The result is illustrated with numerical experiments and an application to Internet distance matrix completion and topology identification. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/eriksson12.html https://proceedings.mlr.press/v22/eriksson12.html Statistical test for consistent estimation of causal effects in linear non-Gaussian models In many fields of science researchers are faced with the problem of estimating causal effects from non-experimental data. A key issue is to avoid inconsistent estimators due to confounding by measured or unmeasured covariates, a problem commonly solved by ’adjusting for’ a subset of the observed variables. When the data generating process can be represented by a directed acyclic graph, and this graph structure is known, there exist simple graphical procedures for determining which subset of covariates should be adjusted for to obtain consistent estimators of the causal effects. However, when the graph is not known no general and complete procedures for this task are available. In this paper we introduce such a method for linear non-Gaussian models, requiring only partial knowledge about the temporal ordering of the variables: We provide a simple statistical test for inferring whether an estimator of a causal effect is consistent when controlling for a subset of measured covariates, and we present heuristics to search for such a set. We show empirically that this statistical test identifies consistent vs inconsistent estimates, and that the search heuristics outperform the naive approach of adjusting for all observed covariates. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/entner12.html https://proceedings.mlr.press/v22/entner12.html Lightning-speed Structure Learning of Nonlinear Continuous Networks Graphical models are widely used to reason about high-dimensional domains. Yet, learning the structure of the model from data remains a formidable challenge, particularly in complex continuous domains. We present a highly accelerated structure learning approach for continuous densities based on the recently introduced copula Bayesian network representation. For two common copula families, we prove that the expected likelihood of a building block edge in the model is monotonic in Spearman’s rank correlation measure. We also show numerically that the same relationship holds for many other copula families. This allows us to perform structure learning while bypassing costly parameter estimation as well as explicit computation of the log-likelihood function. We demonstrate the merit of our approach for structure learning in three varied real-life domains. Importantly, the computational benefits are such that they open the door for practical scaling-up of structure learning in complex nonlinear continuous domains. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/elidan12b.html https://proceedings.mlr.press/v22/elidan12b.html Copula Network Classifiers (CNCs) The task of classification is of paramount importance and extensive research has been aimed at developing general purpose classifiers that can be used effectively in a variety of domains. Network-based classifiers, such as the tree augmented naive Bayes model, are appealing since they are easily interpretable, can naturally handle missing data, and are often quite effective. Yet, for complex domains with continuous explanatory variables, practical performance is often sub-optimal. To overcome this limitation, we introduce Copula Network Classifiers (CNCs), a model that combines the flexibility of a graph based representation with the modeling power of copulas. As we demonstrate on ten varied continuous real-life datasets, CNCs offer better overall performance than linear and non-linear standard generative models, as well as discriminative RBF and polynomial kernel SVMs. In addition, since no parameter tuning is required, CNCs can be trained dramatically faster than SVMs. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/elidan12a.html https://proceedings.mlr.press/v22/elidan12a.html Error bounds for Kernel Fisher Linear Discriminant in Gaussian Hilbert space We give a non-trivial, non-asymptotic upper bound on the classification error of the popular Kernel Fisher Linear Discriminant classifier under the assumption that the kernel-induced space is a Gaussian Hilbert space. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/durrant12.html https://proceedings.mlr.press/v22/durrant12.html Lifted coordinate descent for learning with trace-norm regularization We consider the minimization of a smooth loss with trace-norm regularization, which is a natural objective in multi-class and multi-task learning. Even though the problem is convex, existing approaches rely on optimizing a non-convex variational bound, which is not guaranteed to converge, or repeatedly perform singular-value decomposition, which prevents scaling beyond moderate matrix sizes. We lift the non-smooth convex problem into an infinitely dimensional smooth problem and apply coordinate descent to solve it. We prove that our approach converges to the optimum, and is competitive or outperforms state of the art. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/dudik12.html https://proceedings.mlr.press/v22/dudik12.html Generic Methods for Optimization-Based Modeling "Energy” models for continuous domains can be applied to many problems, but often suffer from high computational expense in training, due to the need to repeatedly minimize the energy function to high accuracy. This paper considers a modified setting, where the model is trained in terms of results after optimization is truncated to a fixed number of iterations. We derive “backpropagating” versions of gradient descent, heavy-ball and LBFGS. These are simple to use, as they require as input only routines to compute the gradient of the energy with respect to the domain and parameters. Experimental results on denoising and image labeling problems show that learning with truncated optimization greatly reduces computational expense compared to “full” fitting. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/domke12.html https://proceedings.mlr.press/v22/domke12.html A metric learning perspective of SVM: on the relation of LMNN and SVM Support Vector Machines, SVMs, and the Large Margin Nearest Neighbor algorithm, LMNN, are two very popular learning algorithms with quite different learning biases. In this paper we bring them into a unified view and show that they have a much stronger relation than what is commonly thought. We analyze SVMs from a metric learning perspective and cast them as a metric learning problem, a view which helps us uncover the relations of the two algorithms. We show that LMNN can be seen as learning a set of local SVM-like models in a quadratic space. Along the way and inspired by the metric-based interpretation of SVMs we derive a novel variant of SVMs, ε-SVM, to which LMNN is even more similar. We give a unified view of LMNN and the different SVM variants. Finally we provide some preliminary experiments on a number of benchmark datasets in which show that ε-SVM compares favorably both with respect to LMNN and SVM. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/do12.html https://proceedings.mlr.press/v22/do12.html Deterministic Annealing for Semi-Supervised Structured Output Learning In this paper we propose a new approach for semi-supervised structured output learning. Our approach uses relaxed labeling on unlabeled data to deal with the combinatorial nature of the label space and further uses domain constraints to guide the learning. Since the overall objective is non-convex, we alternate between the optimization of the model parameters and the label distribution of unlabeled data. The alternating optimization coupled with deterministic annealing helps us achieve better local optima and as a result our approach leads to better constraint satisfaction during inference. Experimental results on sequence labeling benchmarks show superior performance of our approach compared to Constraint Driven Learning (CoDL) and Posterior Regularization (PR). Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/dhillon12.html https://proceedings.mlr.press/v22/dhillon12.html There’s a Hole in My Data Space: Piecewise Predictors for Heterogeneous Learning Problems We study statistical learning problems where the data space is multimodal and heterogeneous, and constructing a single global predictor is difficult. We address such problems by iteratively identifying high-error regions in the data space and learning specialized predictors for these regions. While the idea of composing localized predictors is not new, our approach is unique in that we actively seek out predictors that clump errors together, making it easier to isolate the problematic regions. When errors are clumped together they are also easier to interpret and resolve through appropriate feature engineering and data preprocessing. We present an error-clumping classification algorithm based on a convex optimization problem, and an efficient stochastic optimization algorithm for this problem. We theoretically motivate our approach with a novel sample complexity analysis for piecewise predictors, and empirically demonstrate its behavior on an illustrative classification problem. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/dekel12.html https://proceedings.mlr.press/v22/dekel12.html Protocols for Learning Classifiers on Distributed Data We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/daume12.html https://proceedings.mlr.press/v22/daume12.html Hierarchical Relative Entropy Policy Search Many real-world problems are inherently hi- erarchically structured. The use of this struc- ture in an agent’s policy may well be the key to improved scalability and higher per- formance. However, such hierarchical struc- tures cannot be exploited by current policy search algorithms. We will concentrate on a basic, but highly relevant hierarchy - the ’mixed option’ policy. Here, a gating network first decides which of the options to execute and, subsequently, the option-policy deter- mines the action. In this paper, we reformulate learning a hi- erarchical policy as a latent variable estima- tion problem and subsequently extend the Relative Entropy Policy Search (REPS) to the latent variable case. We show that our Hierarchical REPS can learn versatile solu- tions while also showing an increased perfor- mance in terms of learning speed and quality of the found policy in comparison to the non- hierarchical approach. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/daniel12.html https://proceedings.mlr.press/v22/daniel12.html Wilks’ phenomenon and penalized likelihood-ratio test for nonparametric curve registration The problem of curve registration appears in many different areas of applications ranging from neuroscience to road traffic modeling. In the present work, we propose a nonparametric testing framework in which we develop a generalized likelihood ratio test to perform curve registration. We first prove that, under the null hypothesis, the resulting test statistic is asymptotically distributed as a chi-squared random variable (Wilks’ phenomenon). We also prove that the proposed test is consistent, extiti.e., its power is asymptotically equal to 1. Finite sample properties of the proposed methodology are demonstrated by numerical simulations. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/dalalyan12.html https://proceedings.mlr.press/v22/dalalyan12.html Gaussian Processes for time-marked time-series data In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian Process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/cunningham12.html https://proceedings.mlr.press/v22/cunningham12.html Fast, Exact Model Selection and Permutation Testing for l2-Regularized Logistic Regression Regularized logistic regression is a standard classification method used in statistics and machine learning. Unlike regularized least squares problems such as ridge regression, the parameter estimates cannot be computed in closed-form and instead must be estimated using an iterative technique. This paper addresses the computational problem of regularized logistic regression that is commonly encountered in model selection and classifier statistical significance testing, in which a large number of related logistic regression problems must be solved for. Our proposed approach solves the problems simultaneously through an iterative technique, which also garners computational efficiencies by leveraging the redundancies across the related problems. We demonstrate analytically that our method provides a substantial complexity reduction, which is further validated by our results on real-world datasets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/conroy12.html https://proceedings.mlr.press/v22/conroy12.html Minimax hypothesis testing for curve registration This paper is concerned with the problem of goodness-of-fit for curve registration, and more precisely for the shifted curve model, whose application field reaches from computer vision and road trafic prediction to medicine. We give bounds for the asymptotic minimax separation rate, when the functions in the alternative lie in Sobolev balls and the separation from the null hypothesis is measured by the l2-norm. We use the generalized likelihood ratio to build a nonadaptive procedure depending on a tuning parameter, which we choose in an optimal way according to the smoothness of the ambient space. Then, a Bonferroni procedure is applied to give an adaptive test over a range of Sobolev balls. Both achieve the asymptotic minimax separation rates, up to possible logarithmic factors. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/collier12.html https://proceedings.mlr.press/v22/collier12.html Online Clustering with Experts Approximating the k-means clustering objective with an online learning algorithm is an open problem. We introduce a family of online clustering algorithms by extending algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors in order to re-weight the experts, the algorithms compute an approximation to the current value of the k-means objective obtained by each expert. When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithms obtain approximation guarantees with respect to the k-means objective. The form of these online clustering approximation guarantees is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Notably, our approximation bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. Our algorithm’s empirical performance tracks that of the best clustering algorithm in its expert set. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/choromanska12.html https://proceedings.mlr.press/v22/choromanska12.html Classifier Cascade for Minimizing Feature Evaluation Cost Machine learning algorithms are increasingly used in large-scale industrial settings. Here, the operational cost during test-time has to be taken into account when an algorithm is designed. This operational cost is affected by the average running time and the computation time required for feature extraction. When a diverse set of features is used, the latter can vary drastically. In this paper we propose an algorithm that constructs a cascade of classifiers that explicitly trades-off operational cost and classifier accuracy while accounting for on-demand feature extraction costs. Different from previous work, our algorithm re-optimizes trained classifiers and allows expensive features to be scheduled at any stage within the cascade to minimize overall cost. Experiments on actual web-search ranking data sets demonstrate that our framework leads to drastic test-time improvements. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/chen12c.html https://proceedings.mlr.press/v22/chen12c.html A Two-Graph Guided Multi-task Lasso Approach for eQTL Mapping Learning a small number of genetic variants associated with multiple complex genetic traits is of practical importance and remains challenging due to the high dimensional nature of data. In this paper, we proposed a two-graph guided multi-task Lasso to address this issue with an emphasis on estimating subnetwork-to-subnetwork associations in expression quantitative trait loci (eQTL) mapping. The proposed model can learn such subnetwork-to-subnetwork associations and therefore can be seen as a generalization of several state-of-the-art multi-task feature selection methods. Additionally, this model has a nice property of allowing flexible structured sparsity on both feature and label domains. Simulation study shows the improved performance of our model and a human eQTL data set is analyzed to further demonstrate the applications of the model. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/chen12b.html https://proceedings.mlr.press/v22/chen12b.html Structured Sparse Canonical Correlation Analysis In this paper, we propose to apply sparse canonical correlation analysis (sparse CCA) to an important genome-wide association study problem, eQTL mapping. Existing sparse CCA models do not incorporate structural information among variables such as pathways of genes. This work extends the sparse CCA so that it could exploit either the pre-given or unknown group structure via the structured-sparsity-inducing penalty. Such structured penalty poses new challenge on optimization techniques. To address this challenge, by specializing the excessive gap framework, we develop a scalable primal-dual optimization algorithm with a fast rate of convergence. Empirical results show that the proposed optimization algorithm is more efficient than existing state-of-the-art methods. We also demonstrate the effectiveness of the structured sparse CCA on both simulated and genetic datasets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/chen12a.html https://proceedings.mlr.press/v22/chen12a.html Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit We consider a linear stochastic bandit problem where the dimension K of the unknown parameter heta is larger than the sampling budget n. Since usual linear bandit algorithms have a regret in O(K\sqrtn), it is in general impossible to obtain a sub-linear regret without further assumption. In this paper we make the assumption that heta is S-sparse, i.e. has at most S-non-zero components, and that the space of arms is the unit ball for the ||.||_2 norm. We combine ideas from Compressed Sensing and Bandit Theory to derive an algorithm with a regret bound in O(S\sqrtn). We detail an application to the problem of optimizing a function that depends on many variables but among which only a small number of them (initially unknown) are relevant. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/carpentier12.html https://proceedings.mlr.press/v22/carpentier12.html Optimistic planning for Markov decision processes The reinforcement learning community has recently intensified its interest in online planning methods, due to their relative independence on the state space size. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree node expansions. While this optimistic planning requires a finite number of actions and possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and performance bound for some interesting classes of MDPs illustrates the algorithm works better when there are fewer near-optimal policies and less uniform transition probabilities. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/busoniu12.html https://proceedings.mlr.press/v22/busoniu12.html On Sparse, Spectral and Other Parameterizations of Binary Probabilistic Models This paper studies issues relating to the parameterization of probability distributions over binary data sets. Several such parameterizations of models for binary data are known, including the Ising, generalized Ising, canonical and full parameterizations. We also discuss a parameterization that we call the “spectral parameterization”, which has received significantly less coverage in existing literature. We provide this parameterization with a spectral interpretation by casting log-linear models in terms of orthogonal Walsh-Hadamard harmonic expansions. Using various standard and group sparse regularizers for structural learning, we provide a comprehensive theoretical and empirical comparison of these parameterizations. We show that the spectral parameterization, along with the canonical, has the best performance and sparsity levels, while the spectral does not depend on any particular reference state. The spectral interpretation also provides a new starting point for analyzing the statistics of binary data sets; we measure the magnitude of higher order interactions in the underlying distributions for several data sets. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/buchman12.html https://proceedings.mlr.press/v22/buchman12.html A Family of MCMC Methods on Implicitly Defined Manifolds Traditional MCMC methods are only applicable to distributions which can be defined on \mathbbR^n. However, there exist many application domains where the distributions cannot easily be defined on a Euclidean space. To address this limitation, we propose a general constrained version of Hamiltonian Monte Carlo, and give conditions under which the Markov chain is convergent. Based on this general framework we define a family of MCMC methods which can be applied to sample from distributions on non-linear manifolds. We demonstrate the effectiveness of our approach on a variety of problems including sampling from the Bingham-von Mises-Fisher distribution, collaborative filtering and human pose estimation. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/brubaker12.html https://proceedings.mlr.press/v22/brubaker12.html Sample Complexity of Composite Likelihood We present the first PAC bounds for learning parameters of Conditional Random Fields (CRFs) with general structures over discrete and real-valued variables. Our bounds apply to composite likelihood, which generalizes maximum likelihood and pseudolikelihood. Moreover, we show that the only existing algorithm with a PAC bound for learning high-treewidth discrete models can be viewed as a computationally inefficient method for computing pseudolikelihood. We present an extensive empirical study of the statistical efficiency of these estimators, as predicted by our bounds. Finally, we use our bounds to show how to construct computationally and statistically efficient composite likelihood estimators. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/bradley12.html https://proceedings.mlr.press/v22/bradley12.html Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing Open-text semantic parsers are designed to interpret any statement in natural language by inferring a corresponding meaning representation (MR - a formal representation of its sense). Unfortunately, large scale systems cannot be easily machine-learned due to lack of directly supervised data. We propose a method that learns to assign MRs to a wide range of text (using a dictionary of more than 70,000 words mapped to more than 40,000 entities) thanks to a training scheme that combines learning from knowledge bases (e.g. WordNet) with learning from raw text. The model jointly learns representations of words, entities and MRs via a multi-task training process operating on these diverse sources of data. Hence, the system ends up providing methods for knowledge acquisition and word-sense disambiguation within the context of semantic parsing in a single elegant framework. Experiments on these various tasks indicate the promise of the approach. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/bordes12.html https://proceedings.mlr.press/v22/bordes12.html History-alignment models for bias-aware prediction of virological response to HIV combination therapy The relevant HIV data sets used for predicting outcomes of HIV combination therapies suffer from several problems: uneven therapy representation, different treatment backgrounds of the samples and uneven representation with respect to the level of therapy experience. Also, they comprise only viral strain(s) that can be detected in the patients’ blood serum. The approach presented in this paper tackles these issues by considering not only the most recent therapies but also the different treatment backgrounds of the samples making up the clinical data sets when predicting the outcomes of HIV therapies. For this purpose, we introduce a similarity measure for sequences of therapies and use it for training separate linear models for predicting therapy outcome for each target sample. Compared to the most commonly used approach that encodes all available treatment information only by specific input features our approach has the advantage of delivering significantly more accurate predictions for therapy-experienced patients and for rare therapies. Additionally, the sample-specific models are more interpretable which is very important in medical applications. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/bogojeska12.html https://proceedings.mlr.press/v22/bogojeska12.html CorrLog: Correlated Logistic Models for Joint Prediction of Multiple Labels In this paper, we present a simple but effective method for multi-label classification (MLC), termed Correlated Logistic Models (Corrlog), which extends multiple Independent Logistic Regressions (ILRs) by modeling the pairwise correlation between labels. Algorithmically, we propose an efficient method for learning parameters of Corrlog, which is based on regularized maximum pseudo-likelihood estimation and has a linear computational complexity with respect to the number of labels. Theoretically, we show that Corrlog enjoys a satisfying generalization bound which is independent of the number of labels. The effectiveness of Corrlog on modeling label correlations is illustrated by a toy example, and further experiments on real data show that Corrlog achieves competitive performance compared with popular MLC algorithms. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/bian12.html https://proceedings.mlr.press/v22/bian12.html Controlling Selection Bias in Causal Inference Selection bias, caused by preferential exclusion of samples from the data, is a major obstacle to valid causal and statistical inferences; it cannot be removed by randomized experiments and can hardly be detected in either experimental or observational studies. This paper highlights several graphical and algebraic methods capable of mitigating and sometimes eliminating this bias. These nonparametric methods generalize previously reported results, and identify the type of knowledge that is needed for reasoning in the presence of selection bias. Specifically, we derive a general condition together with a procedure for deciding recoverability of the odds ratio (OR) from s-biased data. We show that recoverability is feasible if and only if our condition holds. We further offer a new method of controlling selection bias using instrumental variables that permits the recovery of other effect measures besides OR. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/bareinboim12.html https://proceedings.mlr.press/v22/bareinboim12.html Adaptive Metropolis with Online Relabeling We propose a novel adaptive MCMC algorithm named AMOR (Adaptive Metropolis with Online Relabeling) for efficiently simulating from permutation-invariant targets occurring in, for example, Bayesian analysis of mixture models. An important feature of the algorithm is to tie the adaptation of the proposal distribution to the choice of a particular restriction of the target to a domain where label switching cannot occur. The algorithm relies on a stochastic approximation procedure for which we exhibit a Lyapunov function that formally defines the criterion used for selecting the relabeling rule. This criterion reveals an interesting connection with the problem of optimal quantifier design in vector quantization which was only implicit in previous works on the label switching problem. In benchmark examples, the algorithm turns out to be fast converging and efficient at selecting meaningful non-trivial relabeling rules to allow accurate parameter inference. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/bardenet12.html https://proceedings.mlr.press/v22/bardenet12.html A General Framework for Structured Sparsity via Proximal Optimization We study a generalized framework for structured sparsity. It extends the well-known methods of Lasso and Group Lasso by incorporating additional constraints on the variables as part of a convex optimization problem. This framework provides a straightforward way of favoring prescribed sparsity patterns, such as orderings, contiguous regions and overlapping groups, among others. Available optimization methods are limited to specific constraint sets and tend to not scale well with sample size and dimensionality. We propose a first order proximal method, which builds upon results on fixed points and successive approximations. The algorithm can be applied to a general class of conic and norm constraints sets and relies on a proximity operator subproblem which can be computed numerically. Experiments on different regression problems demonstrate state-of-the art statistical performance, which improves over Lasso, Group Lasso and StructOMP. They also demonstrate the efficiency of the optimization algorithm and its scalability with the size of the problem. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/baldassarre12.html https://proceedings.mlr.press/v22/baldassarre12.html Scalable Personalization of Long-Term Physiological Monitoring: Active Learning Methodologies for Epileptic Seizure Onset Detection Patient-specific algorithms to detect adverse clinical events during long-term physiological monitoring substantially improve performance relative to patient-nonspecific ones. However, these algorithms often rely on the availability of expert hand-labeled data for training, which severely restricts the scalability of personalized monitoring within a real-world setting. While active learning offers a natural framework to address this issue, the relative merits of different active learning methodologies have not been extensively studied in the setting of developing clinically useful detectors for infrequent time-series events. In this paper, we identify a core set of principles that are relative to the specific goal of personalized long-term physiological monitoring. We describe and compare different approaches for initialization, batch selection and termination within the active learning process. We position this work in the context of epileptic seizure onset detection. When evaluated on a database of scalp EEG recordings from 23 epileptic patients, we show that a combined distance- and diversity-based measure to determine the data to be queried, max-min clustering for identification of the initialization set, and a comparison of consecutive support vector sets to guide termination results in an active learning-based detector that can achieve similar performance to a patient-specific detector while requiring two orders of magnitude fewer labeled examples for training. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/balakrishnan12b.html https://proceedings.mlr.press/v22/balakrishnan12b.html Minimax rates for homology inference Often, high dimensional data lie close to a low-dimensional submanifold and it is of interest to understand the geometry of these submanifolds. The homology groups of a (sub)manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance, about the connected components, holes, tunnels and (sometimes) the dimension of the manifold. In this paper, we consider the statistical problem of estimating the homology of a manifold from noisy samples under several different noise models. We derive upper and lower bounds on the minimax risk for this problem. Our upper bounds are based on estimators which are constructed from a union of balls of appropriate radius around carefully selected sample points. In each case we establish complementary lower bounds using Le Cam’s lemma. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/balakrishnan12a.html https://proceedings.mlr.press/v22/balakrishnan12a.html Graphlet decomposition of a weighted network We consider the problem of modeling networks with nonnegative edge weights. We develop a \emphbit-string decomposition (BSD) for weighted networks, a new representation of social information based on social structure, with an underlying semi-parametric statistical model. We develop a scalable inference algorithm, which combines Expectation-Maximization with Bron-Kerbosch in a novel fashion, for estimating the model’s parameters from a network sample. We present theoretical descriptions to the computational complexity of the method. Finally, we demonstrate the performance of the proposed methodology for synthetic data, academic networks from Facebook and finding communities in a historical data from 19th century. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/azari12.html https://proceedings.mlr.press/v22/azari12.html Memory-efficient inference in dynamic graphical models using multiple cores We introduce the archipelagos algorithm for memory-efficient multi-core inference in dynamic graphical models. By making use of several processors running in parallel, the archipelagos algorithm uses exponentially less memory compared to basic forward-backward message passing algorithms (O(log T) compared to O(T) on sequences of length T) and, under often-satisfied assumptions on the relative speed of passing forward and backward messages, runs no slower. We also describe a simple variant of the algorithm that achieves a factor of two speedup over forward-backward on a single core. Experiments with our implementation of archipelagos for the computation of posterior marginal probabilities in an HMM validate the space/time complexity analysis: using four cores, the required memory on our test problem was reduced from 8 GB to 319 KB (a factor of 25000) relative to forward-backward, but completed in essentially the same time. The archipelagos algorithm applies to any dynamic graphical model, including dynamic Bayesian networks, conditional random fields, and hidden conditional random fields. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/andrew12.html https://proceedings.mlr.press/v22/andrew12.html Factorized Diffusion Map Approximation Diffusion maps are among the most powerful Machine Learning tools to analyze and work with complex high-dimensional datasets. Unfortunately, the estimation of these maps from a finite sample is known to suffer from the curse of dimensionality. Motivated by other machine learning models for which the existence of structure in the underlying distribution of data can reduce the complexity of estimation, we study and show how the factorization of the underlying distribution into independent subspaces can help us to estimate diffusion maps more accurately. Building upon this result, we propose and develop an algorithm that can automatically factorize a high dimensional data space in order to minimize the error of estimation of its diffusion map, even in the case when the underlying distribution is not decomposable. Experiments on both the synthetic and real-world datasets demonstrate improved estimation performance of our method over the regular diffusion-map framework. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/amizadeh12.html https://proceedings.mlr.press/v22/amizadeh12.html Sparse Higher-Order Principal Components Analysis Traditional tensor decompositions such as the CANDECOMP / PARAFAC (CP) and Tucker decompositions yield higher-order principal components that have been used to understand tensor data in areas such as neuroimaging, microscopy, chemometrics, and remote sensing. Sparsity in high-dimensional matrix factorizations and principal components has been well-studied exhibiting many benefits; less attention has been given to sparsity in tensor decompositions. We propose two novel tensor decompositions that incorporate sparsity: the Sparse Higher-Order SVD and the Sparse CP Decomposition. The latter solves a 1-norm penalized relaxation of the single-factor CP optimization problem, thereby automatically selecting relevant features for each tensor factor. Through experiments and a scientific data analysis example, we demonstrate the utility of our methods for dimension reduction, feature selection, signal recovery, and exploratory data analysis of high-dimensional tensors. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/allen12.html https://proceedings.mlr.press/v22/allen12.html Contextual Bandit Learning with Predictable Rewards Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action and receives a reward based on the action and context. We consider this problem under a realizability assumption: there exists a function in a (known) function class, always capable of predicting the expected reward, given the action and context. Under this assumption, we show three things. We present a new algorithm–Regressor Elimination – with a regret similar to the agnostic setting (i.e. in the absence of realizability assumption). We prove a new lower bound showing no algorithm can achieve superior performance in the worst case even with the realizability assumption. However, we do show that for \emphany set of policies (mapping contexts to actions), there is a distribution over rewards (given context) such that our new algorithm has \em constant regret unlike the previous approaches. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/agarwal12.html https://proceedings.mlr.press/v22/agarwal12.html Discriminative Mixtures of Sparse Latent Fields for Risk Management We describe a simple and efficient approach to learning structures of sparse high-dimensional latent variable models. Standard algorithms either learn structures of specific predefined forms, or estimate sparse graphs in the data space ignoring the possibility of the latent variables. In contrast, our method learns rich dependencies and allows for latent variables that may confound the relations between the observations. We extend the model to conditional mixtures with side information and non-Gaussian marginal distributions of the observations. We then show that our model may be used for learning sparse latent variable structures corresponding to multiple unknown states, and for uncovering features useful for explaining and predicting structural changes. We apply the model to real-world financial data with heavy-tailed marginals covering the low- and high- market volatility periods of 2005-2011. We show that our method tends to give rise to significantly higher likelihoods of test data than standard network learning methods exploiting the sparsity assumption. We also demonstrate that our approach may be practical for financial stress testing and visualization of dependencies between financial instruments. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/agakov12.html https://proceedings.mlr.press/v22/agakov12.html Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits We introduce a novel technique, which we call online-to-confidence-set conversion. The technique allows us to construct high-probability confidence sets for linear prediction with correlated inputs given the predictions of any algorithm (e.g., online LASSO, exponentiated gradient algorithm, online least-squares, p-norm algorithm) targeting online learning with linear predictors and the quadratic loss. By construction, the size of the confidence set is directly governed by the regret of the online learning algorithm. Constructing tight confidence sets is interesting on its own, but the new technique is given extra weight by the fact having access tight confidence sets underlies a number of important problems. The advantage of our construction here is that progress in constructing better algorithms for online prediction problems directly translates into tighter confidence sets. In this paper, this is demonstrated in the case of linear stochastic bandits. In particular, we introduce the sparse variant of linear stochastic bandits and show that a recent online algorithm together with our online-to-confidence-set conversion allows one to derive algorithms that can exploit if the reward is a function of a sparse linear combination of the components of the chosen action. Wed, 21 Mar 2012 00:00:00 +0000 https://proceedings.mlr.press/v22/abbasi-yadkori12.html https://proceedings.mlr.press/v22/abbasi-yadkori12.html