Proceedings of Machine Learning Research
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics
Held in Fort Lauderdale, FL, USA on 11-13 April 2011
Published as Volume 15 by the Proceedings of Machine Learning Research on 14 June 2011.
Volume Edited by:
Geoffrey Gordon
David Dunson
Miroslav Dudík
Series Editors:
Neil D. Lawrence
http://proceedings.mlr.press/v15/
Sat, 14 Jul 2018 22:22:39 +0000
Jekyll v3.7.3
Two-Layer Multiple Kernel Learning
Multiple Kernel Learning (MKL) aims to learn kernel machines for solving a real machine learning problem (e.g. classification) by exploring the combinations of multiple kernels. The traditional MKL approach is in general “shallow” in the sense that the target kernel is simply a linear (or convex) combination of some base kernels. In this paper, we investigate a framework of Multi-Layer Multiple Kernel Learning (MLMKL) that aims to learn “deep” kernel machines by exploring the combinations of multiple kernels in a multi-layer structure, which goes beyond the conventional MKL approach. Through a multiple layer mapping, the proposed MLMKL framework offers higher flexibility than the regular MKL for finding the optimal kernel for applications. As the first attempt at this new MKL framework, we present a Two-Layer Multiple Kernel Learning (2LMKL) method together with two efficient algorithms for classification tasks. We analyze their generalization performance and conduct an extensive set of experiments over 16 benchmark datasets, in which encouraging results showed that our method outperformed the conventional MKL methods. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhuang11a.html
Error Analysis of Laplacian Eigenmaps for Semi-supervised Learning
We study the error and sample complexity of semi-supervised learning by Laplacian Eigenmaps at the limit of infinite unlabeled data. We provide a bound on the error, and show that it is controlled by the graph Laplacian regularizer. Our analysis also gives guidance on the choice of the number of eigenvectors k to use: when the data lies on a d-dimensional domain, the optimal choice of k is of order (n/\log(n))^{\frac{d}{d+2}}, yielding an asymptotic error rate of (n/\log(n))^{-\frac{2}{2+d}}. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhou11c.html
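As a rough illustration of the orders quoted in the abstract of "Error Analysis of Laplacian Eigenmaps for Semi-supervised Learning", the sketch below evaluates (n/\log(n))^{d/(d+2)} and (n/\log(n))^{-2/(2+d)} for a few sample sizes. The function names are hypothetical, and all constants are ignored, so the values indicate growth and decay only and are not a recommended k for any particular dataset.

```python
import math


def eigenvector_count_order(n: int, d: int) -> float:
    """Order of the optimal number of eigenvectors: (n / log n)^(d / (d + 2)).

    Assumption: multiplicative constants are dropped, so this gives a scale only.
    """
    if n < 2 or d < 1:
        raise ValueError("need n >= 2 labeled points and dimension d >= 1")
    return (n / math.log(n)) ** (d / (d + 2))


def error_rate_order(n: int, d: int) -> float:
    """Order of the asymptotic error rate: (n / log n)^(-2 / (2 + d))."""
    if n < 2 or d < 1:
        raise ValueError("need n >= 2 labeled points and dimension d >= 1")
    return (n / math.log(n)) ** (-2 / (2 + d))


if __name__ == "__main__":
    d = 2  # hypothetical intrinsic dimension of the data domain
    for n in (10**3, 10**4, 10**5):
        print(n, eigenvector_count_order(n, d), error_rate_order(n, d))
```

For fixed d, the first quantity grows with n while the second shrinks, which is the trade-off the abstract describes.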
Semi-supervised Learning by Higher Order Regularization
In semi-supervised learning, at the limit of infinite unlabeled points while fixing labeled ones, the solutions of several graph Laplacian regularization based algorithms were shown by Nadler et al. (2009) to degenerate to constant functions with “spikes” at labeled points in \R^d for d\ge 2. These optimization problems all use the graph Laplacian regularizer as a common penalty term. In this paper, we address this problem by using regularization based on an iterated Laplacian, which is equivalent to a higher order Sobolev semi-norm. Alternatively, it can be viewed as a generalization of the thin plate spline to an unknown submanifold in high dimensions. We also discuss relationships between Reproducing Kernel Hilbert Spaces and Green’s functions. Experimental results support our analysis by showing consistently improved results using iterated Laplacians. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhou11b.html
Dependent Hierarchical Beta Process for Image Interpolation and Denoising
A dependent hierarchical beta process (dHBP) is developed as a prior for data that may be represented in terms of a sparse set of latent features, with covariate-dependent feature usage. The dHBP is applicable to general covariates and data models, imposing that signals with similar covariates are likely to be manifested in terms of similar features. Coupling the dHBP with the Bernoulli process, and upon marginalizing out the dHBP, the model may be interpreted as a covariate-dependent hierarchical Indian buffet process. As applications, we consider interpolation and denoising of an image, with covariates defined by the location of image patches within an image. Two types of noise models are considered: (i) typical white Gaussian noise; and (ii) spiky noise of arbitrary amplitude, distributed uniformly at random. In these examples, the features correspond to the atoms of a dictionary, learned based upon the data under test (without a priori training data). State-of-the-art performance is demonstrated, with efficient inference using hybrid Gibbs, Metropolis-Hastings and slice sampling. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhou11a.html
Multi-Label Output Codes using Canonical Correlation Analysis
Traditional error-correcting output codes (ECOCs) decompose a multi-class classification problem into many binary problems. Although it seems natural to use ECOCs for multi-label problems as well, doing so naively creates issues related to: the validity of the encoding, the efficiency of the decoding, the predictability of the generated codeword, and the exploitation of the label dependency. Using canonical correlation analysis, we propose an error-correcting code for multi-label classification. Label dependency is characterized as the most predictable directions in the label space, which are extracted as canonical output variates and encoded into the codeword. Predictions for the codeword define a graphical model of labels with both Bernoulli potentials (from classifiers on the labels) and Gaussian potentials (from regression on the canonical output variates). Decoding is performed by efficient mean-field approximation. We establish connections between the proposed code and research areas such as compressed sensing and ensemble learning. Some of these connections contribute to better understanding of the new code, and others lead to practical improvements in code design. In our empirical study, the proposed code leads to substantial improvements compared to various competitors in music emotion classification and outdoor scene recognition. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhang11c.html
Generalization Bound for Infinitely Divisible Empirical Process
In this paper, we study the generalization bound for an empirical process of samples independently drawn from an infinitely divisible (ID) distribution, which is termed the ID empirical process. In particular, based on a martingale method, we develop deviation inequalities for the sequence of random variables of an ID distribution. By applying the obtained deviation inequalities, we then show the generalization bound for the ID empirical process based on the annealed Vapnik-Chervonenkis (VC) entropy. Afterward, according to Sauer's lemma, we get the generalization bound for the ID empirical process based on the VC dimension. Finally, by using the resulting bound, we analyze the asymptotic convergence of the ID empirical process and show that its convergence rate can reach $O\left(\left(\frac{\Lambda_\mathcal{F}(2N)}{N}\right)^\frac{1}{1.3}\right)$, which is faster than the results for the generic i.i.d. empirical process (Vapnik, 1999).
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhang11b.html
Discussion of “Learning Equivalence Classes of Acyclic Models with Latent and Selection Variables from Multiple Datasets with Overlapping Variables”
Discussion of <a href="http://proceedings.mlr.press/v15/tillman11a.html">Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables</a>.
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zhang11a.html
An Instantiation-Based Theorem Prover for First-Order Programming
First-order programming (FOP) is a new representation language that combines the strengths of mixed-integer linear programming (MILP) and first-order logic (FOL). In this paper we describe a novel feasibility proving system for FOP formulas that combines MILP solving with instance-based methods from theorem proving. This prover allows us to perform lifted inference by repeatedly refining a propositional MILP. We prove that this procedure is sound and refutationally complete: if a formula is infeasible our solver will demonstrate this fact in finite time. We conclude by demonstrating an implementation of our decision procedure on a simple first-order planning problem. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/zawadzki11a.html
A Finite Newton Algorithm for Non-degenerate Piecewise Linear Systems
We investigate Newton-type optimization methods for solving piecewise linear systems (PLS) with a non-degenerate coefficient matrix. Such systems arise, for example, from the numerical solution of linear complementarity problems, which are useful for modeling several learning and optimization problems. In this paper, we propose an effective damped Newton method, namely PLS-DN, to find the exact solution of non-degenerate PLS. PLS-DN exhibits a provable semi-iterative property, i.e., the algorithm converges globally to the exact solution in a finite number of iterations. The rate of convergence is shown to be at least linear before termination. We emphasize the applications of our method to modeling, from a novel perspective of PLS, several statistical learning problems such as elitist Lasso, non-negative least squares and support vector machines. Numerical results on synthetic and benchmark data sets are presented to demonstrate the effectiveness and efficiency of PLS-DN on these problems. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/yuan11a.html
Efficient variable selection in support vector machines via the alternating direction method of multipliers
The support vector machine (SVM) is a widely used tool for classification. Although commonly understood as a method of finding the maximum-margin hyperplane, it can also be formulated as a regularized function estimation problem, corresponding to a hinge loss function plus an l2-norm regularization term. The doubly regularized support vector machine (DrSVM) is a variant of the standard SVM, which introduces an additional l1-norm regularization term on the fitted coefficients. The combined l1 and l2 regularization, termed the elastic net penalty, has the interesting property of achieving simultaneous variable selection and margin-maximization within a single framework. However, because of the nonsmoothness of both the loss function and the regularization term, there is no efficient method to solve DrSVM for large scale problems. Here we develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) to solve the optimization problem in DrSVM. The utility of the method is further illustrated using both simulated and real-world datasets.
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/ye11a.html
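The DrSVM abstract describes an objective made of a hinge loss plus l1-norm and l2-norm penalties on the coefficients. A minimal sketch of evaluating such an objective is given below. The function name, the averaging of the hinge loss over samples, and the 1/2 factor on the l2 term are assumptions made for illustration and may differ from the paper's exact scaling. The sketch only evaluates the objective and does not implement the ADMM solver.

```python
from typing import Sequence


def drsvm_objective(
    w: Sequence[float],
    b: float,
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam1: float,
    lam2: float,
) -> float:
    """Hinge loss plus elastic net penalty (scaling conventions are assumptions).

    w, b : linear classifier f(x) = w . x + b
    X, y : samples and labels in {-1, +1}
    lam1 : weight of the l1-norm penalty (encourages sparse w)
    lam2 : weight of the squared l2-norm penalty (margin term)
    """
    if len(X) == 0 or len(X) != len(y):
        raise ValueError("X and y must be non-empty and of equal length")
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge += max(0.0, 1.0 - yi * score)  # zero once the margin is at least 1
    l1 = sum(abs(wj) for wj in w)
    l2_squared = sum(wj * wj for wj in w)
    return hinge / len(X) + lam1 * l1 + 0.5 * lam2 * l2_squared


if __name__ == "__main__":
    X = [[2.0, 0.0], [-2.0, 0.0]]  # toy data, separable along the first feature
    y = [1, -1]
    print(drsvm_objective([1.0, 0.0], 0.0, X, y, lam1=0.1, lam2=0.2))
```

Both the hinge term and the l1 term are non-differentiable at some points, which is the nonsmoothness the abstract cites as the obstacle for large scale solvers.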
Bridging the Language Gap: Topic Adaptation for Documents with Different Technicality
The language-gap, for example between low-literacy laypersons and highly-technical experts, is a fundamental barrier for cross-domain knowledge transfer. This paper seeks to close the gap at the thematic level via topic adaptation, i.e., adjusting topical structures for cross-domain documents according to a domain factor such as technicality. We present a probabilistic model for this purpose based on joint modeling of topic and technicality. The proposed tLDA model explicitly encodes the interplay between topic and technicality hierarchies, providing an effective topic-bridge between lay and expert documents. We demonstrate the usefulness of tLDA with an application to consumer medical informatics. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/yang11b.html
The Sample Complexity of Self-Verifying Bayesian Active Learning
We prove that access to a prior distribution over target functions can dramatically improve the sample complexity of self-terminating active learning algorithms, so that it is always better than the known results for prior-dependent passive learning. In particular, this is in stark contrast to the analysis of prior-independent algorithms, where there are simple known learning problems for which no self-terminating algorithm can provide this guarantee for all priors. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/yang11a.html
Cross-Domain Object Matching with Model Selection
The goal of cross-domain object matching (CDOM) is to find correspondence between two sets of objects in different domains in an unsupervised way. Photo album summarization is a typical application of CDOM, where photos are automatically aligned into a designed frame expressed in the Cartesian coordinate system. CDOM is usually formulated as finding a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. A state-of-the-art CDOM method employs a kernel-based dependency measure, but it has a drawback that the kernel parameter needs to be determined manually. In this paper, we propose alternative CDOM methods that can naturally address the model selection problem. Through experiments on image matching, unpaired voice conversion, and photo album summarization tasks, the effectiveness of the proposed methods is demonstrated. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/yamada11a.html
Multicore Gibbs Sampling in Dense, Unstructured Graphs
Multicore computing is on the rise, but algorithms such as Gibbs sampling are fundamentally sequential and may require close consideration to be made parallel. Existing techniques either exploit sparse problem structure or make approximations to the algorithm; in this work, we explore an alternative to these ideas. We develop a parallel Gibbs sampling algorithm for shared-memory systems that does not require any independence structure among the variables yet does not approximate the sampling distributions. Our method uses a look-ahead sampler, which uses bounds to attempt to sample variables before the results of other threads are made available. We demonstrate our algorithm on Gibbs sampling in Boltzmann machines and latent Dirichlet allocation (LDA). We show in experiments that our algorithm achieves near linear speed-up in the number of cores, is faster than existing exact samplers, and is nearly as fast as approximate samplers while maintaining the correct stationary distribution. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/xu11a.html
Hierarchical Probabilistic Models for Group Anomaly Detection
Statistical anomaly detection typically focuses on finding individual data point anomalies. Often the most interesting or unusual things in a data set are not odd individual points, but rather larger scale phenomena that only become apparent when groups of data points are considered. In this paper, we propose two hierarchical probabilistic models for detecting such group anomalies. We evaluate our methods on synthetic data as well as astronomical data from the Sloan Digital Sky Survey. The experimental results show that the proposed models are effective in detecting group anomalies. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/xiong11a.html
Relational Learning with One Network: An Asymptotic Analysis
Theoretical analysis of structured learning methods has focused primarily on domains where the data consist of independent (albeit structured) examples. Although the statistical relational learning (SRL) community has recently developed many classification methods for graph and network domains, much of this work has focused on modeling domains where there is a single network for learning. For example, we could learn a model to predict the political views of users in an online social network, based on the friendship relationships among users. In this example, the data would be drawn from a single large network (e.g., Facebook) and increasing the data size would correspond to acquiring a larger graph. Although SRL methods can successfully improve classification in these types of domains, there has been little theoretical analysis addressing single network domains. In particular, the asymptotic properties of estimation are not clear if the size of the model grows with the size of the network. In this work, we focus on outlining the conditions under which learning from a single network will be asymptotically consistent and normal. Moreover, we compare the properties of maximum likelihood estimation (MLE) with those of generalized maximum pseudolikelihood estimation (MPLE) and use the resulting understanding to propose novel MPLE estimators for single network domains. We include empirical analysis on both synthetic and real network data to illustrate the findings. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/xiang11a.html
Discussion of “The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling”
Discussion of <a href="http:paisley11a.html">The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling</a>.
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/wood11a.html
Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation
We describe a general method of transforming arbitrary programming languages into probabilistic programming languages with straightforward MCMC inference engines. Random choices in the program are “named” with information about their position in an execution trace; these names are used in conjunction with a database of randomness to implement MCMC inference in the space of execution traces. We encode naming information using lightweight source-to-source compilers. Our method enables us to reuse existing infrastructure (compilers, interpreters, etc.) with minimal additional code, implying fast models with low development overhead. We illustrate the technique on two languages, one functional and one imperative: Bher, a compiled version of the Church language which eliminates interpretive overhead of the original MIT-Church implementation, and Stochastic Matlab, a new open-source language. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/wingate11a.html
Information Theoretical Clustering via Semidefinite Programming
We propose techniques of convex optimization for information theoretical clustering. The clustering objective is to maximize the mutual information between data points and cluster assignments. We formulate this problem first as an instance of MAX K CUT on weighted graphs. We then apply the technique of semidefinite programming (SDP) relaxation to obtain a convex SDP problem. We show how the solution of the SDP problem can be further improved with a low-rank refinement heuristic. The low-rank solution reveals more clearly the cluster structure of the data. Empirical studies on several datasets demonstrate the effectiveness of our approach. In particular, the approach outperforms several other clustering algorithms when compared on standard evaluation metrics. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/wang11b.html
Online Variational Inference for the Hierarchical Dirichlet Process
The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric model that can be used to model mixed-membership data with a potentially infinite number of components. It has been applied widely in probabilistic topic modeling, where the data are documents and the components are distributions of terms that reflect recurring patterns (or “topics”) in the collection. Given a document collection, posterior inference is used to determine the number of topics needed and to characterize their distributions. One limitation of HDP analysis is that existing posterior inference algorithms require multiple passes through all the data; these algorithms are intractable for very large scale applications. We propose an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data. Our algorithm is significantly faster than traditional inference algorithms for the HDP, and lets us analyze much larger data sets. We illustrate the approach on two large collections of text, showing improved performance over online LDA, the finite counterpart to the HDP topic model. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/wang11a.html
Active Boosted Learning (ActBoost)
Active learning deals with the problem of selecting a small subset of examples to label, from a pool of unlabeled data, for training a good classifier. We develop an active learning algorithm in the boosting framework. In contrast to much of the recent work, which has focused on selecting the most ambiguous unlabeled example to label based on the current learned classifier, our algorithm selects examples to maximally reduce the volume of the version space of feasible boosted classifiers. We show that under suitable sparsity assumptions, this strategy achieves the generalization error performance of a boosted classifier trained on the entire data set while only selecting logarithmically many unlabeled samples to label. We also establish a partial negative result, in that without imposing structural assumptions it is difficult to guarantee generalization error performance. We explicitly characterize our convergence rate in terms of the sign pattern differences produced by the weak learners on the unlabeled data. We also present a convex relaxation to account for the non-convex sparse structure and show that the computational complexity of the resulting algorithm scales polynomially in the number of weak learners. We test ActBoost on several datasets to illustrate its performance and demonstrate its robustness to initialization. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/trapeznikov11a.html
Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables
While there has been considerable research in learning probabilistic graphical models from data for predictive and causal inference, almost all existing algorithms assume a single dataset of i.i.d. observations for all variables. For many applications, it may be impossible or impractical to obtain such datasets, but multiple datasets of i.i.d. observations for different subsets of these variables may be available. Tillman et al. (2009) showed how directed graphical models learned from such datasets can be integrated to construct an equivalence class of structures over all variables. While their procedure is correct, it assumes that the structures integrated do not entail contradictory conditional independences and dependences for variables in their intersections. While this assumption is reasonable asymptotically, it rarely holds in practice with finite samples due to the frequency of statistical errors. We propose a new correct procedure for learning such equivalence classes directly from the multiple datasets which avoids this problem and is thus more practically useful. Empirical results indicate our method is not only more accurate, but also faster and requires less memory.
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/tillman11a.html
Estimating Probabilities in Recommendation Systems
Modeling ranked data is an essential component in a number of important applications including recommendation systems and web-search. In many cases, judges omit preference among unobserved items and between unobserved and observed items. This case of analyzing incomplete rankings is very important from a practical perspective and yet has not been fully studied due to considerable computational difficulties. We show how to avoid such computational difficulties and efficiently construct a non-parametric model for rankings with missing items. We demonstrate our approach and show how it applies in the context of collaborative filtering. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/sun11a.html
Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure
Graphical models are often used “inappropriately,” with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/stoyanov11a.html
Machine Learning Markets
Prediction markets show considerable promise for developing flexible mechanisms for machine learning. Here, machine learning markets for multivariate systems are defined, and a utility-based framework is established for their analysis. It is shown that, by varying agent utility functions, such markets can implement model combination methods used in machine learning, such as product of experts and mixture of experts approaches, as equilibrium pricing models. They can implement models composed of local potentials, and message passing methods. Prediction markets also allow for more flexible combinations, by combining multiple different utility functions. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/storkey11a.html
Kernel Belief Propagation
We propose a nonparametric generalization of belief propagation, Kernel Belief Propagation (KBP), for pairwise Markov random fields. Messages are represented as functions in a reproducing kernel Hilbert space (RKHS), and message updates are simple linear operations in the RKHS. KBP makes none of the assumptions commonly required in classical BP algorithms: the variables need not arise from a finite domain or a Gaussian distribution, nor must their relations take any particular parametric form. Rather, the relations between variables are represented implicitly, and are learned nonparametrically from training data. KBP has the advantage that it may be used on any domain where kernels are defined (R^d, strings, groups), even where explicit parametric models are not known, or closed form expressions for the BP updates do not exist. The computational cost of message updates in KBP is polynomial in the training data size. We also propose a constant time approximate message update procedure by representing messages using a small number of basis functions. In experiments, we apply KBP to image denoising, depth prediction from still images, and protein configuration prediction: KBP is faster than competing classical and nonparametric approaches (by orders of magnitude, in some cases), while providing significantly more accurate results. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/song11a.html
Spectral Chinese Restaurant Processes: Nonparametric Clustering Based on Similarities
We introduce a new nonparametric clustering model which combines the recently proposed distance-dependent Chinese restaurant process (dd-CRP) and non-linear, spectral methods for dimensionality reduction. Our model retains the ability of nonparametric methods to learn the number of clusters from data. At the same time it addresses two key limitations of nonparametric Bayesian methods: modeling data that are not exchangeable and have many correlated features. Spectral methods use the similarity between documents to map them into a low-dimensional spectral space where we then compare several clustering methods. Our experiments on handwritten digits and text documents show that nonparametric methods such as the CRP or dd-CRP can perform as well as or better than k-means and also recover the true number of clusters. We improve the performance of the dd-CRP in spectral space by incorporating the original similarity matrix in its prior. This simple modification results in better performance than all other methods we compared to. We offer a new formulation and first experimental evaluation of a general Gibbs sampler for mixture modeling with distance-dependent CRPs. [pdf]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/socher11a.html
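The abstract of "Spectral Chinese Restaurant Processes" notes that nonparametric methods such as the CRP learn the number of clusters from data. For readers unfamiliar with the prior, the sketch below draws a partition from the standard (exchangeable) Chinese restaurant process. It is not the distance-dependent variant or the spectral model of the paper, and the function name, the concentration parameter `alpha`, and the seeding are illustrative assumptions.

```python
import random
from typing import List, Tuple


def sample_crp(n: int, alpha: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Draw table assignments for n customers from a standard CRP.

    Customer i (0-based) joins existing table k with probability
    counts[k] / (i + alpha) and opens a new table with probability
    alpha / (i + alpha). Returns (assignments, table sizes).
    """
    if n < 0 or alpha <= 0:
        raise ValueError("need n >= 0 and alpha > 0")
    rng = random.Random(seed)
    assignments: List[int] = []
    counts: List[int] = []
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)  # total unnormalized mass is i + alpha
        table = len(counts)              # default: open a new table
        cumulative = 0.0
        for k, size in enumerate(counts):
            cumulative += size
            if r < cumulative:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments, counts


if __name__ == "__main__":
    seats, sizes = sample_crp(n=20, alpha=1.0, seed=42)
    print("assignments:", seats)
    print("table sizes:", sizes)
```

The number of occupied tables is random and tends to grow slowly with n, which is what allows the number of clusters to be inferred rather than fixed in advance.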
Assisting Main Task Learning by Heterogeneous Auxiliary Tasks with Applications to Skin Cancer Screening
In typical classification problems, high level concept features provided by a domain expert are usually available during classifier training but not during its deployment. We address this problem from a multitask learning (MTL) perspective by treating these features as auxiliary learning tasks. Previous efforts in MTL have mostly assumed that all tasks have the same input space. However, auxiliary tasks can have different input spaces, since their learning targets are different. Thus, to handle cases with heterogeneous input, in this paper we present a newly developed model using heterogeneous auxiliary tasks to help main task learning. First, we formulate a convex optimization problem for the proposed model, and then, we analyze its hypothesis class and derive true risk bounds. Finally, we compare the proposed model with other relevant methods when applied to the problem of skin cancer screening and public datasets. Our results show that the performance of the proposed method is highly competitive compared to other relevant methods. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/situ11a.html
Asymptotic Theory for Linear-Chain Conditional Random Fields
In this theoretical paper we develop an asymptotic theory for Linear-Chain Conditional Random Fields (L-CRFs) and apply it to derive conditions under which the Maximum Likelihood Estimates (MLEs) of the model weights are strongly consistent. We first define L-CRFs for infinite sequences and analyze some of their basic properties. Then we establish conditions under which ergodicity of the observations implies ergodicity of the joint sequence of observations and labels. This result is the key ingredient to derive conditions for strong consistency of the MLEs. Interesting findings are that the consistency crucially depends on the limit behavior of the Hessian of the likelihood function and that, asymptotically, the state feature functions do not matter. [pdf][supplementary]
Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/sinn11a.html
http://proceedings.mlr.press/v15/sinn11a.htmlMixed Cumulative Distribution NetworksDirected acyclic graphs (DAGs) are a popular framework to express multivariate probability distributions. Acyclic directed mixed graphs (ADMGs) are generalizations of DAGs that can succinctly capture much richer sets of conditional independencies, and are especially useful in modeling the effects of latent variables implicitly. Unfortunately, there are currently no parameterizations of general ADMGs. In this paper, we apply recent work on cumulative distribution networks and copulas to propose one general construction for ADMG models. We consider a simple parameter estimation approach, and report some encouraging experimental results. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/silva11a.html
http://proceedings.mlr.press/v15/silva11a.htmlSpectral Clustering on a BudgetSpectral clustering is a modern and well known method for performing data clustering. However, it depends on the availability of a similarity matrix, which in many applications can be non-trivial to obtain. In this paper, we focus on the problem of performing spectral clustering under a budget constraint, where there is a limit on the number of entries which can be queried from the similarity matrix. We propose two algorithms for this problem, and study them theoretically and experimentally. These algorithms allow a tradeoff between computational efficiency and actual performance, and are also relevant for the problem of speeding up standard spectral clustering. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/shamir11a.html
http://proceedings.mlr.press/v15/shamir11a.htmlFast Convergent Algorithms for Expectation Propagation Approximate Bayesian InferenceWe propose a novel algorithm to solve the expectation propagation relaxation of Bayesian inference for continuous-variable graphical models. In contrast to most previous algorithms, our method is provably convergent. By marrying convergent EP ideas from (Opper & Winther, 2005) with covariance decoupling techniques (Wipf & Nagarajan, 2008; Nickisch & Seeger, 2009), it runs at least an order of magnitude faster than the most common EP solver. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/seeger11a.html
http://proceedings.mlr.press/v15/seeger11a.htmlOnline Learning of Multiple Tasks and Their RelationshipsWe propose an Online MultiTask Learning (OMTL) framework which simultaneously learns the task weight vectors as well as the task relatedness adaptively from the data. Our work is in contrast with prior work on online multitask learning, which assumes fixed task relatedness a priori. Furthermore, whereas prior work in such settings assumes only positively correlated tasks, our framework can capture negative correlations as well. Our proposed framework learns the task relationship matrix by framing the objective function as a Bregman divergence minimization problem for positive definite matrices. Subsequently, we exploit this adaptively learned task-relationship matrix to select the most informative samples in an online multitask active learning setting. Experimental results on a number of real-world datasets and comparisons with numerous baselines establish the efficacy of our proposed approach. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/saha11b.html
http://proceedings.mlr.press/v15/saha11b.htmlImproved Regret Guarantees for Online Smooth Convex Optimization with Bandit FeedbackThe study of online convex optimization in the bandit setting was initiated by Kleinberg (2004) and Flaxman et al. (2005). Such a setting models a decision maker that has to make decisions in the face of adversarially chosen convex loss functions. Moreover, the only information the decision maker receives is the losses. The identity of the loss functions themselves is not revealed. In this setting, we reduce the gap between the best known lower and upper bounds for the class of smooth convex functions, i.e. convex functions with a Lipschitz continuous gradient. Building upon existing work on self-concordant regularizers and one-point gradient estimation, we give the first algorithm whose expected regret, ignoring constant and logarithmic factors, is O(T^{2/3}). [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/saha11a.html
http://proceedings.mlr.press/v15/saha11a.htmlA Reduction of Imitation Learning and Structured Prediction to No-Regret Online LearningSequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/ross11a.html
http://proceedings.mlr.press/v15/ross11a.htmlOn NDCG Consistency of Listwise Ranking MethodsWe examine the consistency of listwise ranking methods with respect to the popular Normalized Discounted Cumulative Gain (NDCG) criterion. The most successful listwise approaches replace NDCG with a surrogate that is easier to optimize. We characterize NDCG consistent surrogates to discover a surprising fact: several commonly used surrogates are NDCG inconsistent. We then show how to change them so that they become NDCG consistent in a strong but natural sense. An explicit characterization of strong NDCG consistency is provided. Going beyond qualitative consistency considerations, we also give quantitative statements that enable us to transform the excess error, as measured in the surrogate, to the excess error in comparison to the Bayes optimal ranking function for NDCG. Finally, we also derive improved results if a certain natural “low noise” or “large margin” condition holds. Our experiments demonstrate that ensuring NDCG consistency does improve the performance of listwise ranking methods on real-world datasets. Moreover, a novel surrogate function suggested by our theoretical results leads to further improvements over NDCG consistent versions of existing surrogates. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/ravikumar11a.html
http://proceedings.mlr.press/v15/ravikumar11a.htmlOn the Estimation of alpha-DivergencesWe propose new nonparametric, consistent Renyi-alpha and Tsallis-alpha divergence estimators for continuous distributions. Given two independent and identically distributed samples, a ‘brute force’ approach would be simply to estimate the underlying densities, and plug these densities into the corresponding formulas. However, it is not our goal to consistently estimate these possibly high dimensional densities, and our algorithm avoids estimating them. We will use simple k-nearest-neighbor distance (k-NN) based statistics, and interestingly enough, we will still be able to prove that the proposed divergence estimators are consistent under certain conditions. We will also show how to use them for mutual information estimation, and demonstrate their efficiency by some numerical experiments. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/poczos11a.html
http://proceedings.mlr.press/v15/poczos11a.htmlDirectional Statistics on PermutationsDistributions over permutations arise in applications ranging from multi-object tracking to ranking. The difficulty of dealing with these distributions is caused by the size of their domain, which is factorial in the number of entities (n!). The direct definition of a multinomial distribution over permutation space is impractical for all but a very small n. In this work we propose an embedding of all n! permutations for a given n in a surface of a hypersphere defined in R^((n-1)^2). As a result, we acquire the ability to define continuous distributions over a hypersphere with all the benefits of directional statistics. We provide polynomial time projections between the continuous hypersphere representation and the n!-element permutation space. The framework provides a way to use continuous directional probability densities and the methods developed thereof for establishing densities over permutations. As a demonstration of the benefits of the framework we derive an inference procedure for a state-space model over permutations. We demonstrate the approach with applications and comparisons to existing models. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/plis11a.html
http://proceedings.mlr.press/v15/plis11a.htmlFaithfulness in Chain Graphs: The Gaussian CaseThis paper deals with chain graphs under the classic Lauritzen-Wermuth-Frydenberg interpretation. We prove that almost all the regular Gaussian distributions that factorize with respect to a chain graph are faithful to it. This result has three important consequences. First, chain graphs are more powerful than undirected graphs and acyclic directed graphs for representing regular Gaussian distributions, as some of these distributions can be represented exactly by the former but not by the latter. Second, the moralization and c-separation criteria for reading independencies from a chain graph are complete, in the sense that they identify all the independencies that can be identified from the chain graph alone. Third, some definitions of equivalence in chain graphs coincide and, thus, they have the same graphical characterization.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/pena11a.html
http://proceedings.mlr.press/v15/pena11a.htmlGenerative Modeling for Maximizing Precision and Recall in Information VisualizationInformation visualization has recently been formulated as an information retrieval problem, where the goal is to find similar data points based on the visualized nonlinear projection, and the visualization is optimized to maximize a compromise between (smoothed) precision and recall. We turn the visualization into a generative modeling task where a simple user model parameterized by the data coordinates is optimized, neighborhood relations are the observed data, and straightforward maximum likelihood estimation corresponds to Stochastic Neighbor Embedding (SNE). While SNE maximizes pure recall, adding a mixture component that “explains away” misses allows our generative model to focus on maximizing precision as well. The resulting model is a generative solution to maximizing tradeoffs between precision and recall. The model outperforms earlier models in terms of precision and recall and in external validation by unsupervised classification. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/peltonen11a.html
http://proceedings.mlr.press/v15/peltonen11a.htmlThe Discrete Infinite Logistic Normal Distribution for Mixed-Membership ModelingWe present the discrete infinite logistic normal distribution (DILN, “Dylan”), a Bayesian nonparametric prior for mixed membership models. DILN is a generalization of the hierarchical Dirichlet process (HDP) that models correlation structure between the weights of the atoms at the group level. We derive a representation of DILN as a normalized collection of gamma-distributed random variables, and study its statistical properties. We consider applications to topic modeling and derive a variational Bayes algorithm for approximate posterior inference. We study the empirical performance of the DILN topic model on four corpora, comparing performance with the HDP and the correlated topic model. Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/paisley11a.html
http://proceedings.mlr.press/v15/paisley11a.htmlAdaptive Bandits: Towards the best history-dependent strategyWe consider multi-armed bandit games with possibly adaptive opponents. We introduce models Θ of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model θ* ∈ Θ. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Θ. This allows modeling opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When Θ = {θ}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by \tilde O(\sqrt{TAC}), where C is the number of classes of θ. Now, when many models are available, all known algorithms achieving a nice regret O(\sqrt{T}) are unfortunately not tractable and scale poorly with the number of models |Θ|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3} C^{1/3} \log(|Θ|)^{1/2}. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/odalric11a.html
http://proceedings.mlr.press/v15/odalric11a.htmlMaximum Volume ClusteringThe large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are preferable, is a useful alternative to the large margin principle. In this paper, we introduce a clustering model based on the large volume principle called maximum volume clustering (MVC), and propose two algorithms to solve it approximately: a soft-label and a hard-label MVC algorithm, based on sequential quadratic programming and semi-definite programming, respectively. Our MVC model includes spectral clustering and maximum margin clustering as special cases, and is substantially more general. We also establish the finite sample stability and an error bound for the soft-label MVC method. Experiments show that the proposed MVC approach compares favorably with state-of-the-art clustering algorithms. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/niu11b.html
http://proceedings.mlr.press/v15/niu11b.htmlDimensionality Reduction for Spectral ClusteringSpectral clustering is a flexible clustering methodology that is applicable to a variety of data types and has the particular virtue that it makes few assumptions on cluster shapes. It has become popular in a variety of application areas, particularly in computational vision and bioinformatics. The approach appears, however, to be particularly sensitive to irrelevant and noisy dimensions in the data. We thus introduce an approach that automatically learns the relevant dimensions and spectral clustering simultaneously. We pursue an augmented form of spectral clustering in which an explicit projection operator is incorporated in the relaxed optimization functional. We optimize this functional over both the projection and the spectral embedding. Experiments on simulated and real data show that this approach yields significant improvements in the performance of spectral clustering.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/niu11a.html
http://proceedings.mlr.press/v15/niu11a.htmlTopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked DocumentsPopular algorithms for modeling the influence of entities in networked data, such as PageRank, work by analyzing the hyperlink structure, but ignore the contents of documents. However, influence is often topic dependent, e.g., a web page of high influence in politics may be an unknown entity in sports. We design a new model called TopicFlow, which combines ideas from network flow and topic modeling, to learn this notion of topic-specific influences of hyperlinked documents in a completely unsupervised fashion. On the task of citation recommendation, which is an instance of capturing influence, the TopicFlow model, when combined with TF-IDF based cosine similarity, outperforms several competitive baselines by as much as 11.8%. Our empirical study of the model's output on the ACL corpus demonstrates its ability to identify topically influential documents. The TopicFlow model is also competitive with the state-of-the-art Relational Topic Models in predicting the likelihood of unseen text on two different data sets. Due to its ability to learn topic-specific flows across each hyperlink, the TopicFlow model can be a powerful visualization tool to track the diffusion of topics across a citation network.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/nallapati11a.html
http://proceedings.mlr.press/v15/nallapati11a.htmlCan matrix coherence be efficiently and accurately estimated?Matrix coherence has recently been used to characterize the ability to extract global information from a subset of matrix entries in the context of low-rank approximations and other sampling-based algorithms. The significance of these results crucially hinges upon the possibility of efficiently and accurately testing this coherence assumption. This paper addresses precisely this issue. We introduce a novel sampling-based algorithm for estimating coherence, present associated estimation guarantees and report the results of extensive experiments for coherence estimation. The quality of the estimation guarantees we present depends on the very coherence value being estimated, but this turns out to be an inherent property of sampling-based coherence estimation, as shown by our lower bound. In practice, however, we find that these theoretically unfavorable scenarios rarely appear, as our algorithm efficiently and accurately estimates coherence across a wide range of datasets, and these estimates are excellent predictors of the effectiveness of sampling-based matrix approximation on a case-by-case basis. These results are significant as they reveal the extent to which coherence assumptions made in a number of recent machine learning publications are testable. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/mohri11a.html
http://proceedings.mlr.press/v15/mohri11a.htmlFollow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 RegularizationWe prove that many mirror descent algorithms for online convex optimization (such as online gradient descent) have an equivalent interpretation as follow-the-regularized-leader (FTRL) algorithms. This observation makes the relationships between many commonly used algorithms explicit, and provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles L1 regularization explicitly, it has been observed that the FTRL-style Regularized Dual Averaging (RDA) algorithm is even more effective at producing sparsity. Our results demonstrate that the key difference between these algorithms is how they handle the cumulative L1 penalty. While FOBOS handles the L1 term exactly on any given update, we show that it is effectively using subgradient approximations to the L1 penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm, which we introduce, can be seen as a hybrid of these two algorithms, and significantly outperforms both on a large, real-world dataset. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/mcmahan11b.html
http://proceedings.mlr.press/v15/mcmahan11b.htmlDiscussion of “Contextual Bandit Algorithms with Supervised Learning Guarantees”Discussion of <a href="http:beygelzimer11a.html">Contextual Bandit Algorithms with Supervised Learning Guarantees</a>.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/mcmahan11a.html
http://proceedings.mlr.press/v15/mcmahan11a.htmlEstimating beta-mixing coefficientsThe literature on statistical learning for time series assumes the asymptotic independence or “mixing” of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the beta-mixing rate based on a single stationary sample path and show it is L1-risk consistent. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/mcdonald11a.html
http://proceedings.mlr.press/v15/mcdonald11a.htmlOnline Learning of Structured Predictors with Multiple KernelsTraining structured predictors often requires considerable time spent selecting features or tweaking the kernel. Multiple kernel learning (MKL) sidesteps this issue by embedding the kernel learning into the training procedure. Despite the recent progress towards efficiency of MKL algorithms, the structured output case remains an open research front. We propose a family of online algorithms able to tackle variants of MKL and group-LASSO, for which we show regret, convergence, and generalization bounds. Experiments on handwriting recognition and dependency parsing attest to the success of the approach. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/martins11a.html
http://proceedings.mlr.press/v15/martins11a.htmlCAKE: Convex Adaptive Kernel Density EstimationIn this paper we present a generalization of kernel density estimation called Convex Adaptive Kernel Density Estimation (CAKE) that replaces single bandwidth selection by a convex aggregation of kernels at all scales, where the convex aggregation is allowed to vary from one training point to another, treating the fundamental problem of heterogeneous smoothness in a novel way. Learning the CAKE estimator given a training set reduces to solving a single convex quadratic programming problem. We derive rates of convergence of CAKE-like estimators to the true underlying density under smoothness assumptions on the class and show that, given a sufficiently large sample, the mean squared error of such estimators is optimal in a minimax sense. We also give a risk bound of the CAKE estimator in terms of its empirical risk. We empirically compare CAKE to other density estimators proposed in the statistics literature for handling heterogeneous smoothness on different synthetic and natural distributions. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/mahapatruni11a.html
http://proceedings.mlr.press/v15/mahapatruni11a.htmlLearning mixtures of Gaussians with maximum-a-posteriori oracleWe consider the problem of estimating the parameters of a mixture of distributions, where each component distribution is from a given parametric family, e.g. exponential, Gaussian, etc. We define a learning model in which the learner has access to a “maximum-a-posteriori” oracle which, given any sample from a mixture of distributions, tells the learner which component distribution was the most likely to have generated it. We describe a learning algorithm in this setting which accurately estimates the parameters of a mixture of k spherical Gaussians in R^d, assuming the component Gaussians satisfy a mild separation condition. Our algorithm uses only polynomially many (in d, k) samples and oracle calls, and our separation condition is much weaker than those required by unsupervised learning algorithms like [Arora 01, Vempala 02]. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/mahalanabis11a.html
http://proceedings.mlr.press/v15/mahalanabis11a.htmlHidden-Unit Conditional Random FieldsThe paper explores a generalization of conditional random fields (CRFs) in which binary stochastic hidden units appear between the data and the labels. Hidden-unit CRFs are potentially more powerful than standard CRFs because they can represent nonlinear dependencies at each frame. The hidden units in these models also learn to discover latent distributed structure in the data that improves classification. We derive efficient algorithms for inference and learning in these models by observing that the hidden units are conditionally independent given the data and the labels. Finally, we show that hidden-unit CRFs perform well in experiments on a range of tasks, including optical character recognition, text classification, protein structure prediction, and part-of-speech tagging. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/maaten11b.html
http://proceedings.mlr.press/v15/maaten11b.htmlDiscussion of “Spectral Dimensionality Reduction via Maximum Entropy”Discussion of <a href="http:lawrence11a.html">Spectral Dimensionality Reduction via Maximum Entropy</a>.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/maaten11a.html
http://proceedings.mlr.press/v15/maaten11a.htmlLearning Class-relevant Features and Class-irrelevant Features via a Hybrid third-order RBMRestricted Boltzmann Machines are commonly used in unsupervised learning to extract features from training data. Since these features are learned for regenerating training data, a classifier based on them has to be trained. If only a few of the learned features are discriminative, the other non-discriminative features will distract the classifier during the training process and thus waste computing resources for testing. In this paper, we present a hybrid third-order Restricted Boltzmann Machine in which class-relevant features (for recognizing) and class-irrelevant features (for generating only) are learned simultaneously. As the classification task uses only the class-relevant features, the test itself becomes very fast. We show that class-irrelevant features help class-relevant features to focus on the recognition task and introduce useful regularization effects to reduce the norms of class-relevant features. Thus, there is no need to use weight-decay for the parameters of this model. Experiments on the MNIST, NORB and Caltech101 Silhouettes datasets show very promising results. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/luo11a.html
http://proceedings.mlr.press/v15/luo11a.htmlA Fast Algorithm for Recovery of Jointly Sparse Vectors based on the Alternating Direction MethodsStandard compressive sensing (CS) aims to recover a sparse signal from a single measurement vector (SMV), which is known as the SMV model. In this paper, we consider the recovery of jointly sparse signals in the multiple measurement vector (MMV) scenario, where the signal is represented as a matrix and the sparsity of the signal occurs in a common location set. The sparse MMV model can be formulated as a matrix (2,1)-norm minimization problem. However, the (2,1)-norm minimization problem is much more difficult to solve than l1-norm minimization. In this paper, we propose a very fast algorithm, called MMV-ADM, for jointly sparse signal recovery in MMV settings based on the alternating direction method (ADM). MMV-ADM alternately updates the signal matrix, the Lagrangian multiplier and the residue, and all update rules only involve matrix or vector multiplications and summations, so it is simple, easy to implement and much faster than the state-of-the-art method MMVprox. Numerical simulations show that MMV-ADM is at least dozens of times faster than MMVprox with comparable recovery accuracy. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/lu11a.html
http://proceedings.mlr.press/v15/lu11a.htmlGroup Orthogonal Matching Pursuit for Logistic RegressionWe consider a matching pursuit approach for variable selection and estimation in logistic regression models. Specifically, we propose Logistic Group Orthogonal Matching Pursuit (Logit-GOMP), which extends the Group-OMP procedure originally proposed for linear regression models, to select groups of variables in logistic regression models, given a predefined grouping structure within the explanatory variables. We theoretically characterize the performance of Logit-GOMP in terms of predictive accuracy, and also provide conditions under which Logit-GOMP is able to identify the correct (groups of) variables. Our results are non-asymptotic in contrast to classical consistency results for logistic regression which only apply in the asymptotic limit where the dimensionality is fixed or is restricted to grow slowly with the sample size. We conduct empirical evaluation on simulated data sets and the real world problem of splice site detection in DNA sequences. The results indicate that Logit-GOMP compares favorably to Logistic Group Lasso both in terms of variable selection and prediction accuracy. We also provide a generic version of our algorithm that applies to the wider class of generalized linear models. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/lozano11a.html
http://proceedings.mlr.press/v15/lozano11a.htmlLearning Scale Free Networks by Reweighted L1 regularizationMethods for L1-type regularization have been widely used in Gaussian graphical model selection tasks to encourage sparse structures. However, often we would like to include more structural information than mere sparsity. In this work, we focus on learning so-called “scale-free” models, a common feature that appears in many real-world networks. We replace the L1 regularization with a power law regularization and optimize the objective function by a sequence of iteratively reweighted L1 regularization problems, where the regularization coefficients of nodes with high degree are reduced, encouraging the appearance of hubs with high degree. Our method can be easily adapted to improve any existing L1-based method, such as graphical lasso, neighborhood selection, and JSRM, when the underlying networks are believed to be scale free or have dominating hubs. We demonstrate in simulation that our method significantly outperforms a baseline L1 method at learning scale-free networks and hub networks, and also illustrate its behavior on gene expression data.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/liu11a.html
http://proceedings.mlr.press/v15/liu11a.htmlBayesian Hierarchical Cross-ClusteringMost clustering algorithms assume that all dimensions of the data can be described by a single structure. Cross-clustering (or multi-view clustering) allows multiple structures, each applying to a subset of the dimensions. We present a novel approach to cross-clustering, based on approximating the solution to a Cross Dirichlet Process mixture (CDPM) model [Shafto et al., 2006; Mansinghka et al., 2009]. Our bottom-up, deterministic approach results in a hierarchical clustering of dimensions, and at each node, a hierarchical clustering of data points. We also present a randomized approximation, based on a truncated hierarchy, that scales linearly in the number of levels. Results on synthetic and real-world data sets demonstrate that the cross-clustering-based algorithms perform as well as or better than the clustering-based algorithms, our deterministic approaches perform as well as the MCMC-based CDPM, and the randomized approximation provides a remarkable speed-up relative to the full deterministic approximation with minimal cost in predictive error. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/li11c.html
http://proceedings.mlr.press/v15/li11c.htmlConfidence Weighted Mean Reversion Strategy for On-Line Portfolio SelectionThis paper proposes a novel on-line portfolio selection strategy named “Confidence Weighted Mean Reversion” (CWMR). Inspired by the mean reversion principle and the confidence weighted online learning technique, CWMR models a portfolio vector as a Gaussian distribution, and sequentially updates the distribution by following the mean reversion trading principle. The CWMR strategy is able to effectively exploit the power of mean reversion for on-line portfolio selection. Extensive experiments on various real markets demonstrate the effectiveness of our strategy in comparison to the state of the art. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/li11b.html
http://proceedings.mlr.press/v15/li11b.htmlSpectral Dimensionality Reduction via Maximum EntropyWe introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting probabilistic models are based on GRFs. The resulting model is a nonlinear generalization of principal component analysis. We show that parameter fitting in the locally linear embedding is approximate maximum likelihood in these models. We develop new algorithms that directly maximize the likelihood and show that these new algorithms are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/lawrence11a.html
http://proceedings.mlr.press/v15/lawrence11a.htmlThe Neural Autoregressive Distribution EstimatorWe describe a new approach for modeling the distribution of high-dimensional vectors of discrete variables. This model is inspired by the restricted Boltzmann machine (RBM), which has been shown to be a powerful model of such distributions. However, an RBM typically does not provide a tractable distribution estimator, since evaluating the probability it assigns to some given observation requires the computation of the so-called partition function, which itself is intractable for RBMs of even moderate size. Our model circumvents this difficulty by decomposing the joint distribution of observations into tractable conditional distributions and modeling each conditional using a non-linear function similar to a conditional of an RBM. Our model can also be interpreted as an autoencoder wired such that its output can be used to assign valid probabilities to observations. We show that this new model outperforms other multivariate binary distribution estimators on several datasets and performs similarly to a large (but intractable) RBM. Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/larochelle11a.html
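The autoregressive decomposition described in the Neural Autoregressive Distribution Estimator abstract can be illustrated with a small sketch. It assumes the usual sigmoid parameterisation of each conditional, p(x_i = 1 | x_<i) = sigm(b_i + V_i · sigm(c + W_<i x_<i)); the names `nade_log_prob` and `random_params`, the sizes D and H, and the random weights are illustrative assumptions, not taken from the paper. Because every conditional is a proper Bernoulli distribution, the probabilities of all 2^D binary vectors sum to one without any partition function.

```python
import itertools
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def nade_log_prob(x, W, V, b, c):
    """log p(x) = sum_i log p(x_i | x_<i) for a binary vector x."""
    D, H = len(x), len(c)
    a = list(c)  # hidden pre-activation; depends only on x_<i at step i
    log_p = 0.0
    for i in range(D):
        h = [sigmoid(a_k) for a_k in a]
        p_i = sigmoid(b[i] + sum(V[i][k] * h[k] for k in range(H)))
        log_p += math.log(p_i if x[i] == 1 else 1.0 - p_i)
        # Fold x_i into the pre-activation used by the next conditional.
        for k in range(H):
            a[k] += W[k][i] * x[i]
    return log_p


def random_params(D, H, seed=0):
    """Hypothetical random weights, used only to exercise the code."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]
    V = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(D)]
    b = [rng.gauss(0, 1) for _ in range(D)]
    c = [rng.gauss(0, 1) for _ in range(H)]
    return W, V, b, c


D, H = 4, 3
W, V, b, c = random_params(D, H)
# Sum p(x) over all 2^D binary vectors; a valid distribution sums to 1.
total = sum(math.exp(nade_log_prob(x, W, V, b, c))
            for x in itertools.product([0, 1], repeat=D))
print(total)
```

Evaluating one vector costs O(DH) here because the hidden pre-activation is updated incrementally rather than recomputed for each conditional.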
http://proceedings.mlr.press/v15/larochelle11a.htmlRobust Bayesian Matrix FactorisationWe analyse the noise arising in collaborative filtering when formalised as a probabilistic matrix factorisation problem. We show empirically that modelling row- and column-specific variances is important, the noise being in general non-Gaussian and heteroscedastic. We also advocate for the use of a Student-t prior for the latent features as the standard Gaussian is included as a special case. We derive several variational inference algorithms and estimate the hyperparameters by type-II maximum likelihood. Experiments on real data show that the predictive performance is significantly improved. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/lakshminarayanan11a.html
http://proceedings.mlr.press/v15/lakshminarayanan11a.htmlApproximate inference for the loss-calibrated BayesianWe consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task – and associated losses – for which the posterior could be used. We argue that this can be suboptimal and propose instead to loss-calibrate the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/lacoste_julien11a.html
http://proceedings.mlr.press/v15/lacoste_julien11a.htmlOn Time Varying Undirected GraphsThe time-varying multivariate Gaussian distribution and the undirected graph associated with it, as introduced in Zhou et al. (2008), provide a useful statistical framework for modeling complex dynamic networks. In many application domains, it is of high importance to estimate the graph structure of the model consistently for the purpose of scientific discovery. In this short note, we show that under suitable technical conditions the structure of the undirected graphical model can be consistently estimated in the high dimensional setting, when the dimensionality of the model is allowed to diverge with the sample size. The model selection consistency is shown for the procedure proposed in Zhou et al. (2008) and for the modified neighborhood selection procedure of Meinshausen and Bühlmann (2006). [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/kolar11a.html
http://proceedings.mlr.press/v15/kolar11a.htmlConvex envelopes of complexity controlling penalties: the case against premature envelopmentConvex envelopes of the cardinality and rank function, l_1 and nuclear norm, have gained immense popularity due to their sparsity inducing properties. This gave rise to a natural approach to building objectives with sparse optima whereby such convex penalties are added to another objective. Such a heuristic approach to objective building does not always work. For example, addition of an L_1 penalty to the KL-divergence fails to induce any sparsity, as the L_1 norm of any vector in a simplex is a constant. However, a convex envelope of KL and a cardinality penalty can be obtained that indeed trades off sparsity and KL-divergence. We consider cases of two composite penalties, elastic net and fused lasso, which combine multiple desiderata. In both of these cases, we show that a hard objective relaxed to obtain penalties can be more tightly approximated. Further, by construction, it is impossible to get a better convex approximation than the ones we derive. Thus, constructing a joint envelope across different parts of the objective provides means to trade off tightness and computational cost. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/jojic11a.html
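The remark in the convex envelopes abstract that an L_1 penalty cannot induce sparsity on the simplex is easy to check numerically. The sketch below builds probability vectors with different numbers of non-zero entries and shows that their L_1 norms coincide while their cardinalities differ; the helper names and the sampling scheme are illustrative assumptions.

```python
import random


def l1_norm(v):
    return sum(abs(t) for t in v)


def cardinality(v, tol=1e-12):
    """Number of entries that are non-zero up to a small tolerance."""
    return sum(1 for t in v if abs(t) > tol)


def random_simplex_point(n, k, rng):
    """A point on the n-dimensional simplex with exactly k non-zeros."""
    support = rng.sample(range(n), k)
    raw = [rng.random() + 0.1 for _ in support]  # strictly positive
    s = sum(raw)
    v = [0.0] * n
    for idx, r in zip(support, raw):
        v[idx] = r / s
    return v


rng = random.Random(0)
points = [random_simplex_point(10, k, rng) for k in (1, 3, 10)]
norms = [l1_norm(p) for p in points]
cards = [cardinality(p) for p in points]
for k, n in zip(cards, norms):
    print(f"non-zeros: {k:2d}   L1 norm: {n:.6f}")
```

Since the penalty takes the same value for a one-hot vector and for a dense one, adding it to a KL objective over the simplex only shifts the objective by a constant.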
http://proceedings.mlr.press/v15/jojic11a.htmlConvergent Decomposition Solvers for Tree-reweighted Free EnergiesWe investigate minimization of tree-reweighted free energies for the purpose of obtaining approximate marginal probabilities and upper bounds on the partition function of cyclic graphical models. The solvers we present for this problem work by directly tightening tree-reweighted upper bounds. As a result, they are particularly efficient for tree-reweighted energies arising from a small number of spanning trees. While this assumption may seem restrictive at first, we show how small sets of trees can be constructed in a principled manner. An appealing property of our algorithms, which results from the problem decomposition, is that they are embarrassingly parallel. In contrast to the original message passing algorithm introduced for this problem, we obtain global convergence guarantees. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/jancsary11a.html
http://proceedings.mlr.press/v15/jancsary11a.htmlOn Learning Discrete Graphical Models using Group-Sparse RegularizationWe study the problem of learning the graph structure associated with general discrete graphical models (each variable can take any of m > 1 values, the clique factors have maximum size c >= 2) from samples, under high-dimensional scaling where the number of variables p could be larger than the number of samples n. We provide a quantitative consistency analysis of a procedure based on node-wise multi-class logistic regression with group-sparse regularization. We first consider general m-ary pairwise models, where each factor depends on at most two variables. We show that when the number of samples scales as n > K (m-1)^2 d^2 \log((m-1)^2 (p-1)) (where d is the maximum degree and K a fixed constant) the procedure succeeds in recovering the graph with high probability. For general models with c-way factors, the natural multi-way extension of the pairwise method quickly becomes very computationally complex. So we studied the effectiveness of using the pairwise method even while the true model has higher order factors. Surprisingly, we show that under slightly more stringent conditions, the pairwise procedure still recovers the graph structure, when the samples scale as n > K (m-1)^2 d^{\frac{3}{2}c - 1} \log((m-1)^c (p-1)^{c-1}). [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/jalali11a.html
http://proceedings.mlr.press/v15/jalali11a.htmlImproved Loss Bounds For Multiple Kernel LearningWe propose two new generalization error bounds for multiple kernel learning (MKL). First, using the bound of Srebro and Ben-David (2006) as a starting point, we derive a new version which uses a simple counting argument for the choice of kernels in order to generate a tighter bound when 1-norm regularization (sparsity) is imposed in the kernel learning problem. The second bound is a Rademacher complexity bound which is additive in the (logarithmic) kernel complexity and margin term. This dependency is superior to all previously published Rademacher bounds for learning a convex combination of kernels, including the recent bound of Cortes et al. (2010), which exhibits a multiplicative interaction. We illustrate the tightness of our bounds with simulations. [pdf] [errata]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/hussain11a.html
http://proceedings.mlr.press/v15/hussain11a.htmlFast b-matching via Sufficient Selection Belief PropagationThis article describes scalability enhancements to a previously established belief propagation algorithm that solves bipartite maximum weight b-matching. The previous algorithm required O(|V|+|E|) space and O(|V||E|) time, whereas we apply improvements to reduce the space to O(|V|) and the time to O(|V|^2.5) in the expected case (though worst case time is still O(|V||E|)). The space improvement is most significant in cases where edge weights are determined by a function of node descriptors, such as a distance or kernel function. In practice, we demonstrate maximum weight b-matchings to be solvable on graphs with hundreds of millions of edges in only a few hours of compute time on a modern personal computer without parallelization, whereas neither the memory nor the time requirement of previously known algorithms would have allowed graphs of this scale. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/huang11a.html
http://proceedings.mlr.press/v15/huang11a.htmlOptimal Distributed Market-Based Planning for Multi-Agent Systems with Shared ResourcesMarket-based algorithms have become popular in collaborative multi-agent planning due to their simplicity, distributedness, low communication requirements, and proven success in domains such as task allocation and robotic exploration. Most existing market-based algorithms, however, suffer from two main drawbacks: resource prices must be carefully handcrafted for each problem domain, and there is no guarantee on final solution quality. We present an optimal market-based algorithm, derived from a mixed integer program formulation of planning problems. Our method is based on two well-known techniques for optimization: Dantzig-Wolfe decomposition and Gomory cuts. The former prices resources optimally for a relaxed version of the problem, while the latter introduces new derivative resources to correct pricing imbalances that arise from the relaxation. Our algorithm is applicable to a wide variety of multi-agent planning domains. We provide optimality guarantees and demonstrate the effectiveness of our algorithm in both centralized and distributed settings on synthetic planning problems. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/hong11a.html
http://proceedings.mlr.press/v15/hong11a.htmlEvolving Cluster Mixed-Membership Blockmodel for Time-Evolving NetworksTime-evolving networks are a natural presentation for dynamic social and biological interactions. While latent space models are gaining popularity in network modeling and analysis, previous works mostly ignore networks with temporal behavior and multi-modal actor roles. Furthermore, prior knowledge, such as division and grouping of social actors or biological specificity of molecular functions, has not been systematically exploited in network modeling. In this paper, we develop a network model featuring a state space mixture prior that tracks complex actor latent role changes through time. We provide a fast variational inference algorithm for learning our model, and validate it with simulations and held-out likelihood comparisons on real-world time-evolving networks. Finally, we demonstrate our model’s utility as a network analysis tool, by applying it to United States Congress voting data. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/ho11b.html
http://proceedings.mlr.press/v15/ho11b.htmlMultiscale Community Blockmodel for Network ExplorationReal world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric Multiscale Community Blockmodel (MSCB) to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross-community interactions. By using the nested Chinese Restaurant Process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model in real-world datasets such as predator-prey networks and citation networks. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/ho11a.html
http://proceedings.mlr.press/v15/ho11a.htmlPrefacePreface to the Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics April 11-13, 2011, Fort Lauderdale, FL, USA.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/gordon11a.html
http://proceedings.mlr.press/v15/gordon11a.htmlParallel Gibbs Sampling: From Colored Fields to Thin Junction TreesWe explore the task of constructing a parallel Gibbs sampler, to improve both mixing and the exploration of high likelihood states. Recent work in parallel Gibbs sampling has focused on update schedules which do not guarantee convergence to the intended stationary distribution. In this work, we propose two methods to construct parallel Gibbs samplers guaranteed to draw from the targeted distribution. The first method, called the Chromatic sampler, uses graph coloring to construct a direct parallelization of the classic sequential scan Gibbs sampler. In the case of 2-colorable models we relate the Chromatic sampler to the Synchronous Gibbs sampler (which draws all variables simultaneously in parallel), and reveal new ergodic properties of Synchronous Gibbs chains. Our second method, the Splash sampler, is a complementary strategy which can be used when the variables are tightly coupled. This constructs and samples multiple blocks in parallel, using a novel locking protocol and an iterative junction tree generation algorithm. We further improve the Splash sampler through adaptive tree construction. We demonstrate the benefits of our two sampling algorithms on large synthetic and real-world models using a 32 processor multi-core system. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/gonzalez11a.html
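The graph-coloring idea behind the Chromatic sampler in the Parallel Gibbs Sampling abstract can be sketched on a small Ising grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid model, the coupling value, the greedy coloring, and all function names are hypothetical, and the per-color updates are executed serially where a real implementation would run them in parallel.

```python
import math
import random


def grid_neighbors(rows, cols):
    """4-connected grid: maps each node (r, c) to its neighbor list."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            nbrs[(r, c)] = [(i, j) for i, j in cand
                            if 0 <= i < rows and 0 <= j < cols]
    return nbrs


def greedy_coloring(nbrs):
    """Give each node the smallest color unused by its colored neighbors."""
    color = {}
    for v in sorted(nbrs):
        used = {color[u] for u in nbrs[v] if u in color}
        color[v] = next(k for k in range(len(nbrs)) if k not in used)
    return color


def chromatic_sweep(state, nbrs, classes, coupling, rng):
    """One sweep: update the color classes one after another."""
    for block in classes:
        # Nodes in a block share no edges, so each conditional reads only
        # nodes of other colors; these updates could run concurrently.
        updates = {}
        for v in block:
            field = coupling * sum(state[u] for u in nbrs[v])
            p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
            updates[v] = 1 if rng.random() < p_up else -1
        state.update(updates)


nbrs = grid_neighbors(4, 4)
color = greedy_coloring(nbrs)
classes = [[v for v in sorted(nbrs) if color[v] == k]
           for k in sorted(set(color.values()))]
rng = random.Random(0)
state = {v: rng.choice([-1, 1]) for v in nbrs}
for _ in range(100):
    chromatic_sweep(state, nbrs, classes, coupling=0.3, rng=rng)
print(len(classes), "color classes")
```

On a grid the greedy coloring yields the two checkerboard classes, which is the 2-colorable case the abstract relates to the Synchronous Gibbs sampler.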
http://proceedings.mlr.press/v15/gonzalez11a.htmlDeep Sparse Rectifier Neural NetworksWhile logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra-unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/glorot11a.html
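The "sparse representations with true zeros" mentioned in the Deep Sparse Rectifier Neural Networks abstract can be seen in a few lines. The sketch below applies the rectifier max(0, a) and, for contrast, tanh to the same pre-activations; the standard normal pre-activations are an assumption made for illustration only and do not come from the paper.

```python
import math
import random


def rectifier(a):
    """Rectifier activation: passes positive inputs, zeroes the rest."""
    return max(0.0, a)


rng = random.Random(0)
# Hypothetical pre-activations drawn from a standard normal distribution.
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(1000)]

relu_out = [rectifier(a) for a in pre_activations]
tanh_out = [math.tanh(a) for a in pre_activations]

relu_sparsity = sum(1 for h in relu_out if h == 0.0) / len(relu_out)
tanh_sparsity = sum(1 for h in tanh_out if h == 0.0) / len(tanh_out)
print(f"exact zeros with rectifier: {relu_sparsity:.1%}")
print(f"exact zeros with tanh:      {tanh_sparsity:.1%}")
```

With symmetric pre-activations roughly half of the rectifier outputs are exactly zero, whereas tanh only shrinks values toward zero without reaching it.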
http://proceedings.mlr.press/v15/glorot11a.htmlLearning from positive and unlabeled examples by enforcing statistical significanceGiven a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as “positive” with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test. For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution of this problem computed by a one-class SVM applied on a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature-vector shifted by the average feature-vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in Bioinformatics and its results outperform common solutions to this problem. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/geurts11a.html
http://proceedings.mlr.press/v15/geurts11a.htmlBlock-sparse Solutions using Kernel Block RIP and its Application to Group LassoWe propose Kernel Block Restricted Isometry Property (KB-RIP) as a generalization of the well-studied RIP and prove a variety of results. First, we present a “sum-of-norms”-minimization based formulation of the sparse recovery problem and prove that under certain conditions on KB-RIP, it recovers the optimal sparse solution exactly. The Group Lasso formulation, widely used as a good heuristic, arises naturally from the Lagrangian relaxation of our formulation. Second, we present an efficient combinatorial algorithm for provable sparse recovery under similar assumptions on KB-RIP. As a side product, this result improves the previous best assumptions on RIP under which a combinatorial algorithm was known. Finally, we provide numerical evidence to illustrate that not only are our sum-of-norms-minimization formulation and combinatorial algorithm significantly faster than Lasso, they also outperform Lasso in terms of recovery. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/garg11a.html
http://proceedings.mlr.press/v15/garg11a.htmlA Dynamic Relational Infinite Feature Model for Longitudinal Social NetworksReal-world relational data sets, such as social networks, often involve measurements over time. We propose a Bayesian nonparametric latent feature model for such data, where the latent features for each actor in the network evolve according to a Markov process, extending recent work on similar models for static networks. We show how the number of features and their trajectories for each actor can be inferred simultaneously and demonstrate the utility of this model on prediction tasks using both synthetic and real-world data. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/foulds11b.html
http://proceedings.mlr.press/v15/foulds11b.htmlRevisiting MAP Estimation, Message Passing and Perfect GraphsGiven a graphical model, one of the most useful queries is to find the most likely configuration of its variables. This task, known as the maximum a posteriori (MAP) problem, can be solved efficiently via message passing techniques when the graph is a tree, but is NP-hard for general graphs. Jebara (2009) shows that the MAP problem can be converted into the stable set problem, which can be solved in polynomial time for a broad class of graphs known as perfect graphs via a linear programming relaxation technique. This is a result of great theoretical interest. However, the article additionally claims that max-product linear programming (MPLP) message passing techniques of Globerson and Jaakkola (2007) are also guaranteed to solve these problems exactly and efficiently. We investigate this claim, show that it does not hold, and repair it with alternative message passing algorithms. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/foulds11a.html
http://proceedings.mlr.press/v15/foulds11a.htmlA novel greedy algorithm for Nyström approximationThe Nyström method is an efficient technique for obtaining a low-rank approximation of a large kernel matrix based on a subset of its columns. The quality of the Nyström approximation highly depends on the subset of columns used, which are usually selected using random sampling. This paper presents a novel recursive algorithm for calculating the Nyström approximation, and an effective greedy criterion for column selection. Further, a very efficient variant is proposed for greedy sampling, which works on random partitions of data instances. Experiments on benchmark data sets show that the proposed greedy algorithms achieve significant improvements in approximating kernel matrices, with minimum overhead in run time.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/farahat11a.html
http://proceedings.mlr.press/v15/farahat11a.htmlActive Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected SimilaritiesHierarchical clustering based on pairwise similarities is a common tool used in a broad range of scientific applications. However, in many problems it may be expensive to obtain or compute similarities between the items to be clustered. This paper investigates the possibility of hierarchical clustering of N items based on a small subset of pairwise similarities, significantly less than the complete set of N(N-1)/2 similarities. First, we show that, if the intracluster similarities exceed intercluster similarities, then it is possible to correctly determine the hierarchical clustering from as few as 3N log N similarities. We demonstrate that this order of magnitude saving in the number of pairwise similarities necessitates sequentially selecting which similarities to obtain in an adaptive fashion, rather than picking them at random. Finally, we propose an active clustering method that is robust to a limited fraction of anomalous similarities, and show how even in the presence of these noisy similarity values we can resolve the hierarchical clustering using only O(N log^2 N) pairwise similarities. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/eriksson11a.html
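The saving claimed in the Active Clustering abstract can be made concrete by comparing the two counts it quotes, N(N-1)/2 for the complete set and 3N log N for the adaptive scheme. The abstract does not state the base of the logarithm, so base 2 is an assumption here, as are the function names and the sample values of N.

```python
import math


def all_pairs(n):
    """Size of the complete set of pairwise similarities, N(N-1)/2."""
    return n * (n - 1) // 2


def adaptive_budget(n, log=math.log2):
    """The 3 N log N figure from the abstract; the log base is assumed."""
    return 3 * n * log(n)


for n in (10, 100, 1000, 10000):
    full = all_pairs(n)
    budget = adaptive_budget(n)
    print(f"N={n:6d}  all pairs={full:10d}  3N log N={budget:12.0f}  "
          f"ratio={full / budget:8.2f}")
```

For very small N the 3N log N figure exceeds the number of pairs, so the saving only appears as N grows, where the quadratic count dominates.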
http://proceedings.mlr.press/v15/eriksson11a.htmlBagged Structure Learning of Bayesian NetworkWe present a novel approach for density estimation using Bayesian networks when faced with scarce and partially observed data. Our approach relies on Efron’s bootstrap framework, and replaces the standard model selection score by a bootstrap aggregation objective aimed at sifting out bad decisions during the learning procedure. Unlike previous bootstrap or MCMC based approaches that are only aimed at recovering specific structural features, we learn a concrete density model that can be used for probabilistic generalization. To make use of our objective when some of the data is missing, we propose a bagged structural EM procedure that does not incur the heavy computational cost typically associated with a bootstrap-based approach. We compare our bagged objective to the Bayesian score and the Bayesian information criterion (BIC), as well as other bootstrap-based model selection objectives, and demonstrate its effectiveness in improving generalization performance for varied real-life datasets. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/elidan11a.html
http://proceedings.mlr.press/v15/elidan11a.htmlA conditional game for comparing approximationsWe present a “conditional game” to be played between two approximate inference algorithms. We prove that exact inference is an optimal strategy and demonstrate how the game can be used to estimate the relative accuracy of two different approximations in the absence of exact marginals. Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/eaton11a.html
http://proceedings.mlr.press/v15/eaton11a.htmlOptimal and Robust Price Experimentation: Learning by LotteryThis paper studies optimal price learning for one or more items. We introduce the Schrödinger price experiment (SPE) which superimposes classical price experiments using lotteries, and thereby extracts more information from each customer interaction. If buyers are perfectly rational we show that there exist SPEs that in the limit of infinite superposition learn optimally and exploit optimally. We refer to the new resulting mechanism as the hopeful mechanism (HM) since although it is incentive compatible, buyers can deviate with extreme consequences for the seller at very little cost to themselves. For real-world settings we propose a robust version of the approach which takes the form of a Markov decision process where the actions are functions. We provide approximate policies motivated by the best of sampled set (BOSS) algorithm coupled with approximate Bayesian inference. Numerical studies show that the proposed method significantly increases seller revenue compared to classical price experimentation, even for the single-item case. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/dance11a.html
http://proceedings.mlr.press/v15/dance11a.htmlA Spike and Slab Restricted Boltzmann MachineWe introduce the spike and slab Restricted Boltzmann Machine, characterized by having both a real-valued vector, the slab, and a binary variable, the spike, associated with each unit in the hidden layer. The model possesses some practical properties such as being amenable to Block Gibbs sampling as well as being capable of generating similar latent representations of the data to the recently introduced mean and covariance Restricted Boltzmann Machine. We illustrate how the spike and slab Restricted Boltzmann Machine achieves competitive performance on the CIFAR-10 object recognition task. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/courville11a.html
http://proceedings.mlr.press/v15/courville11a.htmlDiscussion of “A conditional game for comparing approximations”Discussion of “A conditional game for comparing approximations”.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/conitzer11a.html
http://proceedings.mlr.press/v15/conitzer11a.htmlDeep Learning for Efficient Discriminative ParsingWe propose a new fast purely discriminative algorithm for natural language parsing, based on a “deep” recurrent convolutional graph transformer network (GTN). Assuming a decomposition of a parse tree into a stack of “levels”, the network predicts a level of the tree taking into account predictions of previous levels. Using only a few basic text features, we show similar performance (in F1 score) to existing pure discriminative parsers and existing “benchmark” parsers (like Collins parser, probabilistic context-free grammars based), with a huge speed advantage. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/collobert11a.html
http://proceedings.mlr.press/v15/collobert11a.htmlAn Analysis of Single-Layer Networks in Unsupervised Feature LearningA great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR-10 by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR-10, NORB, and STL datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (“stride”) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance - so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively). [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/coates11a.html
http://proceedings.mlr.press/v15/coates11a.htmlContextual Bandits with Linear Payoff FunctionsIn this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d-dimensional feature vectors, we prove an O(\sqrt{Td\ln^3(KT\ln(T)/\delta)}) regret bound that holds with probability 1-\delta for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of \Omega(\sqrt{Td}) for this setting, matching the upper bound up to logarithmic factors. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/chu11a.html
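An upper confidence bound rule for linear payoffs, the kind of algorithm the Contextual Bandits abstract analyses, can be sketched as follows. This is a generic LinUCB-style sketch under stated assumptions, not necessarily the exact procedure or constants from the paper: the class name, the exploration weight `alpha`, the simulated environment, and the ground-truth vector are all hypothetical. The inverse of the regularised design matrix is maintained with the Sherman-Morrison rank-one update.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


class LinUCB:
    """Ridge-regression payoff estimate plus a confidence-width bonus."""

    def __init__(self, d, alpha=1.0):
        self.alpha = alpha  # exploration weight (assumed, not tuned)
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)]
                      for i in range(d)]  # inverse of I + sum of x x^T
        self.b = [0.0] * d  # sum of reward * x

    def select(self, contexts):
        theta = mat_vec(self.A_inv, self.b)

        def ucb(x):
            width = math.sqrt(max(0.0, dot(x, mat_vec(self.A_inv, x))))
            return dot(theta, x) + self.alpha * width

        return max(range(len(contexts)), key=lambda a: ucb(contexts[a]))

    def update(self, x, reward):
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
            self.b[i] += reward * x[i]


def unit_vector(d, rng):
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [t / norm for t in v]


def simulate(T=2000, K=5, seed=0):
    """Cumulative regret of LinUCB versus uniformly random play."""
    rng = random.Random(seed)
    theta_true = [0.6, -0.3, 0.5]  # hypothetical ground truth
    agent = LinUCB(len(theta_true), alpha=1.0)
    regret_ucb = regret_random = 0.0
    for _ in range(T):
        contexts = [unit_vector(len(theta_true), rng) for _ in range(K)]
        means = [dot(theta_true, x) for x in contexts]
        best = max(means)
        a = agent.select(contexts)
        agent.update(contexts[a], means[a] + rng.gauss(0.0, 0.1))
        regret_ucb += best - means[a]
        regret_random += best - sum(means) / K  # expected random regret
    return regret_ucb, regret_random


regret_ucb, regret_random = simulate()
print(f"LinUCB regret: {regret_ucb:.1f}   random play: {regret_random:.1f}")
```

The bonus term shrinks along directions that have been played often, so exploration concentrates on contexts whose payoff estimate is still uncertain.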
http://proceedings.mlr.press/v15/chu11a.htmlConcave Gaussian Variational Approximations for Inference in Large-Scale Bayesian Linear ModelsTwo popular approaches to forming bounds in approximate Bayesian inference are local variational methods and minimal Kullback-Leibler divergence methods. For a large class of models we explicitly relate the two approaches, showing that the local variational method is equivalent to a weakened form of Kullback-Leibler Gaussian approximation. This gives a strong motivation to develop efficient methods for KL minimisation. An important and previously unproven property of the KL variational Gaussian bound is that it is a concave function in the parameters of the Gaussian for log concave sites. This observation, along with compact concave parametrisations of the covariance, enables us to develop fast scalable optimisation procedures to obtain lower bounds on the marginal likelihood in large scale Bayesian linear models. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/challis11a.html
http://proceedings.mlr.press/v15/challis11a.htmlSwitch-Reset Models : Exact and Approximate InferenceReset models are constrained switching latent Markov models in which the dynamics either continues according to a standard model, or the latent variable is resampled. We consider exact marginal inference in this class of models and their extension, the switch-reset models. A further convenient class of conjugate-exponential reset models is also discussed. For a length T time-series, exact filtering scales with T squared and smoothing T cubed. We discuss approximate filtering and smoothing routines that scale linearly with T. Applications are given to change-point models and reset linear dynamical systems. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/bracegirdle11a.html
http://proceedings.mlr.press/v15/bracegirdle11a.htmlRelative Entropy Inverse Reinforcement LearningWe consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near)-optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using learned MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/boularias11a.html
http://proceedings.mlr.press/v15/boularias11a.htmlDomain Adaptation with Coupled SubspacesDomain adaptation algorithms address a key issue in applied machine learning: How can we train a system under a source distribution but achieve high performance under a different target distribution? We tackle this question for divergent distributions where crucial predictive target features may not even have support under the source distribution. In this setting, the key intuition is that if we can link target-specific features to source features, we can learn effectively using only source labeled data. We formalize this intuition, as well as the assumptions under which such coupled learning is possible. This allows us to give finite sample target error bounds (using only source training data) and an algorithm which performs at the state-of-the-art on two natural language processing adaptation tasks which are characterized by novel target features. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/blitzer11a.html
http://proceedings.mlr.press/v15/blitzer11a.htmlContextual Bandit Algorithms with Supervised Learning GuaranteesWe address the problem of competing with any large set of N policies in the non-stochastic bandit setting, where the learner must repeatedly select among K actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of [Auer et al. 2002] called Exp4.P, which with high probability incurs regret at most O(\sqrt{KT\ln N}). Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically on a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC-dimension d while incurring regret at most O(\sqrt{Td\ln T}) with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning. Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/beygelzimer11a.html
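The remark about the large variance of importance-weighted estimates in the "Contextual Bandit Algorithms with Supervised Learning Guarantees" abstract can be made concrete with a short sketch. It is not taken from the paper, and the function names are hypothetical. It shows the standard estimator that divides the observed reward by the probability of the chosen action: the estimate is unbiased, but its variance grows roughly like 1/p as that probability p shrinks.

```python
def importance_weighted_estimate(chosen, reward, probs):
    """Reward estimate for every action, given feedback for one action.

    Only the chosen action's reward is observed; dividing it by the
    probability of choosing that action keeps the estimate unbiased.
    """
    return [reward / probs[a] if a == chosen else 0.0
            for a in range(len(probs))]


def estimator_mean_and_variance(true_reward, p):
    """Exact mean and variance of the estimate for a single action.

    The estimate equals true_reward / p with probability p (the action
    was chosen) and 0 otherwise.
    """
    mean = p * (true_reward / p)
    second_moment = p * (true_reward / p) ** 2
    return mean, second_moment - mean ** 2


# Usage: the mean stays at the true reward while the variance blows up.
for p in (0.5, 0.1, 0.01):
    mean, variance = estimator_mean_and_variance(1.0, p)
    print(f"p={p}: mean={mean:.3f}, variance={variance:.3f}")
```

For a reward of 1 the variance is 1/p - 1, so rarely chosen actions dominate the noise. This is the effect the abstract cites as the obstacle to a high-probability bound for Exp4.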
http://proceedings.mlr.press/v15/beygelzimer11a.htmlDeep Learners Benefit More from Out-of-Distribution ExamplesRecent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did beat previously published results and reached human-level performance.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/bengio11b.html
http://proceedings.mlr.press/v15/bengio11b.htmlDiscussion of “The Neural Autoregressive Distribution Estimator”Discussion of <a href="http:larochelle11a.html">The Neural Autoregressive Distribution Estimator</a>.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/bengio11a.html
http://proceedings.mlr.press/v15/bengio11a.htmlActive Diagnosis under Persistent Noise with Unknown Noise Distribution: A Rank-Based ApproachWe consider a problem of active diagnosis, where the goal is to efficiently identify an unknown object by sequentially selecting, and observing, the responses to binary valued queries. We assume that query observations are noisy, and further that the noise is persistent, meaning that repeating a query does not change the response. Previous work in this area either assumed the knowledge of the query noise distribution, or that the noise level is sufficiently low so that the unknown object can be identified with high accuracy. We make no such assumptions, and introduce an algorithm that returns a ranked list of objects, such that the expected rank of the true object is optimized. Furthermore, our algorithm does not require knowledge of the query noise distribution. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/bellala11a.html
http://proceedings.mlr.press/v15/bellala11a.htmlTighter Relaxations for MAP-MRF Inference: A Local Primal-Dual Gap based Separation AlgorithmWe propose an efficient and adaptive method for MAP-MRF inference that provides increasingly tighter upper and lower bounds on the optimal objective. Similar to Sontag et al. (2008), our method starts by solving the first-order LOCAL(G) linear programming relaxation. This is followed by an adaptive tightening of the relaxation where we incrementally add higher-order interactions to enforce proper marginalization over groups of variables. Computing the best interaction to add is an NP-hard problem. We show good solutions to this problem can be readily obtained from “local primal-dual gaps” given the current primal solution and a dual reparameterization vector. This is not only extremely efficient, but in contrast to previous approaches, also allows us to search over prohibitively large sets of candidate interactions to add. We demonstrate the superiority of our approach on MAP-MRF inference problems encountered in computer vision. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/batra11a.html
http://proceedings.mlr.press/v15/batra11a.htmlUnsupervised Supervised Learning II: Margin-Based Classification without LabelsMany popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing margin-based risk functions. Traditionally, these risk functions are computed based on a labeled dataset. We develop a novel technique for estimating such risks using only unlabeled data and knowledge of p(y). We prove that the proposed risk estimator is consistent on high-dimensional datasets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers using exclusively unlabeled data. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/balasubramanian11a.html
http://proceedings.mlr.press/v15/balasubramanian11a.htmlStatistical Optimization of Non-Negative Matrix FactorizationNon-Negative Matrix Factorization (NMF) is a dimensionality reduction method that has been shown to be very useful for a variety of tasks in machine learning and data mining. One of the fastest algorithms for NMF is the Block Principal Pivoting method (BPP), which follows a block coordinate descent approach. The optimization in each iteration involves solving a large number of expensive least squares problems. Taking the view that the design matrix was generated by a stochastic process, and using the asymptotic normality of the least squares estimator, we propose a method for improving the performance of BPP. Our method starts with a small subset of the columns and rows of the original matrix and uses frequentist hypothesis tests to adaptively increase the size of the problem. This achieves two objectives: 1) during the initial phase of the algorithm we solve far fewer, much smaller sized least squares problems and 2) all hypothesis tests failing while using all the data represents a principled, automatic stopping criterion. Our experiments on three real world datasets show that our algorithm significantly improves the performance of the original BPP algorithm. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/balan11a.html
http://proceedings.mlr.press/v15/balan11a.htmlDynamic Policy Programming with Function ApproximationIn this paper, we consider the problem of planning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative method, called dynamic policy programming (DPP), which updates the parametrized policy by a Bellman-like iteration. For the discrete state-action case, we establish sup-norm loss bounds for the performance of the policy induced by DPP and prove that it asymptotically converges to the optimal policy. Then, we generalize our approach to large-scale (continuous) state-action problems using function approximation techniques. We provide sup-norm performance-loss bounds for approximate DPP and compare these bounds with the standard results from approximate dynamic programming (ADP), showing that approximate DPP results in a tighter asymptotic bound than standard ADP methods. We also numerically compare the performance of DPP to other ADP and RL methods. We observe that approximate DPP asymptotically outperforms other methods on the mountain-car problem. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/azar11a.html
http://proceedings.mlr.press/v15/azar11a.htmlPolytope samplers for inference in ill-posed inverse problemsWe consider linear ill-posed inverse problems y=Ax, in which we want to infer many count parameters x from few count observations y, where the matrix A is binary and has some unimodularity property. Such problems are typical in applications such as contingency table analysis and network tomography (on which we present testing results). These properties of A have a geometrical implication for the solution space: it is a convex integer polytope. We develop a novel approach to characterize this polytope in terms of its vertices, by taking advantage of the geometrical intuitions behind the Hermite normal form decomposition of the matrix A, and of a newly defined pivoting operation to travel across vertices. Next, we use this characterization to develop three (exact) polytope samplers for x with emphasis on uniform distributions. We showcase one of these samplers on simulated and real data. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/airoldi11a.html
http://proceedings.mlr.press/v15/airoldi11a.htmlOnline Inference for the Infinite Topic-Cluster Model: Storylines from Streaming TextWe present the time-dependent topic-cluster model, a hierarchical approach for combining Latent Dirichlet Allocation and clustering via the Recurrent Chinese Restaurant Process. It inherits the advantages of both of its constituents, namely interpretability and concise representation. We show how it can be applied to streaming collections of objects such as real world feeds in a news portal. We provide details of a parallel Sequential Monte Carlo algorithm to perform inference in the resulting graphical model which scales to hundreds of thousands of documents. [pdf][supplementary]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/ahmed11a.html
http://proceedings.mlr.press/v15/ahmed11a.htmlLinear-Time Estimators for Propensity ScoresWe present linear-time estimators for three popular covariate shift correction and propensity scoring algorithms: logistic regression (LR), kernel mean matching (KMM), and maximum entropy mean matching (MEMM). This allows applications in situations where both treatment and control groups are large. We also show that the last two algorithms differ only in their choice of regularizer (\ell_2 of the Radon-Nikodym derivative vs. maximum entropy). Experiments show that all methods scale well. [pdf]Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/agarwal11c.html
http://proceedings.mlr.press/v15/agarwal11c.htmlGenerative Kernels for Exponential FamiliesIn this paper, we propose a family of kernels for the data distributions belonging to the exponential family. We call these kernels generative kernels because they take into account the generative process of the data. Our proposed method considers the geometry of the data distribution to build a set of efficient closed-form kernels best suited for that distribution. We compare our generative kernels on multinomial data and observe improved empirical performance across the board. Moreover, our generative kernels perform significantly better when training size is small, an important property of the generative models.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/agarwal11b.html
http://proceedings.mlr.press/v15/agarwal11b.htmlDiscussion of “Learning Scale Free Networks by Reweighted L1 regularization”Discussion of <a href="http:liu11a.html">Learning Scale Free Networks by Reweighted L1 regularization</a>.Tue, 14 Jun 2011 00:00:00 +0000
http://proceedings.mlr.press/v15/agarwal11a.html
http://proceedings.mlr.press/v15/agarwal11a.html