Proceedings of Machine Learning Research

Two-Layer Multiple Kernel Learning

Tue, 14 Jun 2011 00:00:00 +0000

Multiple Kernel Learning (MKL) aims to learn kernel machines for solving a real machine learning problem (e.g. classification) by exploring the combinations of multiple kernels. The traditional MKL approach is in general “shallow” in the sense that the target kernel is simply a linear (or convex) combination of some base kernels. In this paper, we investigate a framework of Multi-Layer Multiple Kernel Learning (MLMKL) that aims to learn “deep” kernel machines by exploring the combinations of multiple kernels in a multi-layer structure, which goes beyond the conventional MKL approach. Through a multiple layer mapping, the proposed MLMKL framework offers higher flexibility than the regular MKL for finding the optimal kernel for applications. As a first attempt at this new MKL framework, we present a Two-Layer Multiple Kernel Learning (2LMKL) method together with two efficient algorithms for classification tasks. We analyze their generalization performance and conduct an extensive set of experiments over 16 benchmark datasets, in which our method outperforms the conventional MKL methods.
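To make the "shallow" versus "deep" distinction concrete, here is a minimal sketch in plain Python. The base kernels, their parameters, and the function names are illustrative assumptions. In particular, `two_layer_kernel` (an exponential applied to the combined kernel) is only one hypothetical way to stack a second layer and is not claimed to be the construction used in the paper.

```python
import math

def linear_kernel(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def combined_kernel(x, y, weights):
    """Shallow MKL: a convex combination of three base kernels.

    The choice of base kernels (linear, RBF with gamma 0.5 and 2.0)
    is an assumption made for this example.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    base = [linear_kernel(x, y), rbf_kernel(x, y, 0.5), rbf_kernel(x, y, 2.0)]
    return sum(w * k for w, k in zip(weights, base))

def two_layer_kernel(x, y, weights):
    """Hypothetical second layer: exponentiate the combined kernel.

    The exponential of a positive semi-definite kernel is again a
    valid kernel, so this composition stays within kernel methods.
    """
    return math.exp(combined_kernel(x, y, weights))
```

In this sketch the weights are fixed by hand; an MKL method would learn them jointly with the classifier.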

Error Analysis of Laplacian Eigenmaps for Semi-supervised Learning

Tue, 14 Jun 2011 00:00:00 +0000

We study the error and sample complexity of semi-supervised learning by Laplacian Eigenmaps in the limit of infinite unlabeled data. We provide a bound on the error, and show that it is controlled by the graph Laplacian regularizer. Our analysis also gives guidance to the choice of the number of eigenvectors $k$ to use: when the data lies on a $d$-dimensional domain, the optimal choice of $k$ is of order $(n/\log(n))^{\frac{d}{d+2}}$, yielding an asymptotic error rate of $(n/\log(n))^{-\frac{2}{d+2}}$.
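The scaling rule for $k$ can be evaluated numerically. The sketch below assumes $n$ denotes the number of labeled points and that the unspecified constant in "of order" is 1. Both the `constant` parameter and the function names are illustrative, so the values indicate growth rates only.

```python
import math

def suggested_num_eigenvectors(n, d, constant=1.0):
    """Evaluate constant * (n / log n)^(d / (d + 2)), rounded to an integer >= 1.

    `constant` stands in for the unspecified factor hidden by "of order".
    """
    if n < 2 or d < 1:
        raise ValueError("need n >= 2 and d >= 1")
    return max(1, round(constant * (n / math.log(n)) ** (d / (d + 2))))

def asymptotic_error_rate(n, d):
    """Evaluate (n / log n)^(-2 / (d + 2)), up to a constant factor."""
    if n < 2 or d < 1:
        raise ValueError("need n >= 2 and d >= 1")
    return (n / math.log(n)) ** (-2 / (d + 2))

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, suggested_num_eigenvectors(n, d=2), asymptotic_error_rate(n, d=2))
```

Larger $d$ pushes the exponent of $k$ toward 1 and the exponent of the error rate toward 0, which is the usual curse-of-dimensionality pattern.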

Semi-supervised Learning by Higher Order Regularization

Tue, 14 Jun 2011 00:00:00 +0000

In semi-supervised learning, at the limit of infinite unlabeled points while fixing labeled ones, the solutions of several graph Laplacian regularization based algorithms were shown by Nadler et al. (2009) to degenerate to constant functions with “spikes” at labeled points in $\mathbb{R}^d$ for $d \ge 2$. These optimization problems all use the graph Laplacian regularizer as a common penalty term. In this paper, we address this problem by using regularization based on an iterated Laplacian, which is equivalent to a higher order Sobolev semi-norm. Alternatively, it can be viewed as a generalization of the thin plate spline to an unknown submanifold in high dimensions. We also discuss relationships between Reproducing Kernel Hilbert Spaces and Green’s functions. Experimental results support our analysis by showing consistently improved results using iterated Laplacians.
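As a small illustration of the iterated-Laplacian penalty $f^\top L^m f$, the sketch below uses the unnormalized Laplacian of a path graph. The graph, the test functions, and the helper names are assumptions chosen for illustration; this is not the paper's experimental setup.

```python
def path_graph_laplacian(n):
    """Unnormalized Laplacian L = D - A of a path graph on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def iterated_laplacian_penalty(L, f, m):
    """Compute f^T L^m f by applying L to f a total of m times."""
    g = list(f)
    for _ in range(m):
        g = matvec(L, g)
    return sum(a * b for a, b in zip(f, g))

if __name__ == "__main__":
    L = path_graph_laplacian(5)
    spike = [0.0, 0.0, 1.0, 0.0, 0.0]   # isolated "spike" at one node
    ramp = [0.0, 0.25, 0.5, 0.75, 1.0]  # smooth linear ramp
    for m in (1, 2):
        print(m, iterated_laplacian_penalty(L, spike, m),
              iterated_laplacian_penalty(L, ramp, m))
```

On this toy graph the spike is penalized far more heavily relative to the ramp when $m = 2$ than when $m = 1$, which matches the motivation for higher order regularization.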

Dependent Hierarchical Beta Process for Image Interpolation and Denoising

Tue, 14 Jun 2011 00:00:00 +0000

A dependent hierarchical beta process (dHBP) is developed as a prior for data that may be represented in terms of a sparse set of latent features, with covariate-dependent feature usage. The dHBP is applicable to general covariates and data models, imposing that signals with similar covariates are likely to be manifested in terms of similar features. Coupling the dHBP with the Bernoulli process, and upon marginalizing out the dHBP, the model may be interpreted as a covariate-dependent hierarchical Indian buffet process. As applications, we consider interpolation and denoising of an image, with covariates defined by the location of image patches within an image. Two types of noise models are considered: (i) typical white Gaussian noise; and (ii) spiky noise of arbitrary amplitude, distributed uniformly at random. In these examples, the features correspond to the atoms of a dictionary, learned based upon the data under test (without a priori training data). State-of-the-art performance is demonstrated, with efficient inference using hybrid Gibbs, Metropolis-Hastings and slice sampling.

Multi-Label Output Codes using Canonical Correlation Analysis

Tue, 14 Jun 2011 00:00:00 +0000

Traditional error-correcting output codes (ECOCs) decompose a multi-class classification problem into many binary problems. Although it seems natural to use ECOCs for multi-label problems as well, doing so naively creates issues related to: the validity of the encoding, the efficiency of the decoding, the predictability of the generated codeword, and the exploitation of the label dependency. Using canonical correlation analysis, we propose an error-correcting code for multi-label classification. Label dependency is characterized as the most predictable directions in the label space, which are extracted as canonical output variates and encoded into the codeword. Predictions for the codeword define a graphical model of labels with both Bernoulli potentials (from classifiers on the labels) and Gaussian potentials (from regression on the canonical output variates). Decoding is performed by efficient mean-field approximation. We establish connections between the proposed code and research areas such as compressed sensing and ensemble learning. Some of these connections contribute to better understanding of the new code, and others lead to practical improvements in code design. In our empirical study, the proposed code leads to substantial improvements compared to various competitors in music emotion classification and outdoor scene recognition.

Generalization Bound for Infinitely Divisible Empirical Process

Tue, 14 Jun 2011 00:00:00 +0000

In this paper, we study the generalization bound for an empirical process of samples independently drawn from an infinitely divisible (ID) distribution, which we term the ID empirical process. In particular, based on a martingale method, we develop deviation inequalities for the sequence of random variables of an ID distribution. By applying the obtained deviation inequalities, we then show the generalization bound for the ID empirical process based on the annealed Vapnik-Chervonenkis (VC) entropy. Afterward, according to Sauer's lemma, we get the generalization bound for the ID empirical process based on the VC dimension. Finally, by using the resulting bound, we analyze the asymptotic convergence of the ID empirical process and show that its convergence rate can reach $O\left(\left(\frac{\Lambda_\mathcal{F}(2N)}{N}\right)^\frac{1}{1.3}\right)$, which is faster than the results for the generic i.i.d. empirical process (Vapnik, 1999).
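The step from an entropy-based bound to a VC-dimension bound relies on Sauer's lemma, which states that a function class with VC dimension $d$ can produce at most $\sum_{i=0}^{d} \binom{n}{i}$ distinct labelings of $n$ points. The sketch below only evaluates that standard bound; the function name is illustrative and it does not reproduce the paper's inequalities.

```python
from math import comb

def sauer_bound(n, d):
    """Upper bound on the number of labelings of n points
    realizable by a class with VC dimension d (Sauer's lemma)."""
    if n < 0 or d < 0:
        raise ValueError("n and d must be non-negative")
    return sum(comb(n, i) for i in range(min(d, n) + 1))

if __name__ == "__main__":
    # Growth is polynomial in n once n exceeds d, rather than 2^n.
    for n in (5, 10, 20):
        print(n, sauer_bound(n, 3), 2 ** n)
```

For $n \le d$ the bound equals $2^n$, and for $n > d$ it grows only polynomially in $n$, which is what lets the VC dimension control the bound.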

Discussion of “Learning Equivalence Classes of Acyclic Models with Latent and Selection Variables from Multiple Datasets with Overlapping Variables”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables.

An Instantiation-Based Theorem Prover for First-Order Programming

Tue, 14 Jun 2011 00:00:00 +0000

First-order programming (FOP) is a new representation language that combines the strengths of mixed-integer linear programming (MILP) and first-order logic (FOL). In this paper we describe a novel feasibility proving system for FOP formulas that combines MILP solving with instance-based methods from theorem proving. This prover allows us to perform lifted inference by repeatedly refining a propositional MILP. We prove that this procedure is sound and refutationally complete: if a formula is infeasible our solver will demonstrate this fact in finite time. We conclude by demonstrating an implementation of our decision procedure on a simple first-order planning problem.

A Finite Newton Algorithm for Non-degenerate Piecewise Linear Systems

Tue, 14 Jun 2011 00:00:00 +0000

We investigate Newton-type optimization methods for solving piecewise linear systems (PLS) with a non-degenerate coefficient matrix. Such systems arise, for example, from the numerical solution of the linear complementarity problem, which is useful for modeling several learning and optimization problems. In this paper, we propose an effective damped Newton method, namely PLS-DN, to find the exact solution of non-degenerate PLS. PLS-DN exhibits a provable semi-iterative property, i.e., the algorithm converges globally to the exact solution in a finite number of iterations. The rate of convergence is shown to be at least linear before termination. We emphasize the applications of our method to modeling, from a novel perspective of PLS, several statistical learning problems such as elitist Lasso, non-negative least squares and support vector machines. Numerical results on synthetic and benchmark data sets are presented to demonstrate the effectiveness and efficiency of PLS-DN on these problems.

Efficient variable selection in support vector machines via the alternating direction method of multipliers

Tue, 14 Jun 2011 00:00:00 +0000

The support vector machine (SVM) is a widely used tool for classification. Although commonly understood as a method of finding the maximum-margin hyperplane, it can also be formulated as a regularized function estimation problem, corresponding to a hinge loss function plus an $\ell_2$-norm regularization term. The doubly regularized support vector machine (DrSVM) is a variant of the standard SVM, which introduces an additional $\ell_1$-norm regularization term on the fitted coefficients. The combined $\ell_1$ and $\ell_2$ regularization, termed elastic net penalty, has the interesting property of achieving simultaneous variable selection and margin-maximization within a single framework. However, because of the nonsmoothness of both the loss function and the regularization term, there is no efficient method to solve DrSVM for large scale problems. Here we develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) to solve the optimization problem in DrSVM. The utility of the method is further illustrated using both simulated and real-world datasets.
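The objective described above (hinge loss plus an elastic net penalty) can be written down directly. In the sketch below, the averaging of the hinge loss, the factor 1/2 on the $\ell_2$ term, and the parameter names `lam1` and `lam2` are assumptions, since the abstract does not fix a scaling. The soft-thresholding helper is the standard proximal operator of the $\ell_1$ norm that ADMM-style solvers commonly use. It is shown as a generic building block, not as the paper's exact update.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def drsvm_objective(w, b, X, y, lam1, lam2):
    """Mean hinge loss + lam1 * ||w||_1 + (lam2 / 2) * ||w||_2^2.

    Labels in y are expected to be +1 or -1. The scaling conventions
    are assumptions made for this example.
    """
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    hinge /= len(X)
    l1 = sum(abs(wj) for wj in w)
    l2_sq = sum(wj * wj for wj in w)
    return hinge + lam1 * l1 + 0.5 * lam2 * l2_sq

def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0
```

Soft-thresholding sets small coefficients exactly to zero, which is the mechanism behind the variable selection property of the $\ell_1$ term.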

Bridging the Language Gap: Topic Adaptation for Documents with Different Technicality

Tue, 14 Jun 2011 00:00:00 +0000

The language-gap, for example between low-literacy laypersons and highly-technical experts, is a fundamental barrier for cross-domain knowledge transfer. This paper seeks to close the gap at the thematic level via topic adaptation, i.e., adjusting topical structures for cross-domain documents according to a domain factor such as technicality. We present a probabilistic model for this purpose based on joint modeling of topic and technicality. The proposed $\tau$LDA model explicitly encodes the interplay between topic and technicality hierarchies, providing an effective topic-bridge between lay and expert documents. We demonstrate the usefulness of $\tau$LDA with an application to consumer medical informatics.

The Sample Complexity of Self-Verifying Bayesian Active Learning

Tue, 14 Jun 2011 00:00:00 +0000

We prove that access to a prior distribution over target functions can dramatically improve the sample complexity of self-terminating active learning algorithms, so that it is always better than the known results for prior-dependent passive learning. In particular, this is in stark contrast to the analysis of prior-independent algorithms, where there are simple known learning problems for which no self-terminating algorithm can provide this guarantee for all priors.

Cross-Domain Object Matching with Model Selection

Tue, 14 Jun 2011 00:00:00 +0000

The goal of cross-domain object matching (CDOM) is to find correspondence between two sets of objects in different domains in an unsupervised way. Photo album summarization is a typical application of CDOM, where photos are automatically aligned into a designed frame expressed in the Cartesian coordinate system. CDOM is usually formulated as finding a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. A state-of-the-art CDOM method employs a kernel-based dependency measure, but it has a drawback that the kernel parameter needs to be determined manually. In this paper, we propose alternative CDOM methods that can naturally address the model selection problem. Through experiments on image matching, unpaired voice conversion, and photo album summarization tasks, the effectiveness of the proposed methods is demonstrated.

Multicore Gibbs Sampling in Dense, Unstructured Graphs

Tue, 14 Jun 2011 00:00:00 +0000

Multicore computing is on the rise, but algorithms such as Gibbs sampling are fundamentally sequential and may require close consideration to be made parallel. Existing techniques either exploit sparse problem structure or make approximations to the algorithm; in this work, we explore an alternative to these ideas. We develop a parallel Gibbs sampling algorithm for shared-memory systems that does not require any independence structure among the variables yet does not approximate the sampling distributions. Our method uses a look-ahead sampler, which uses bounds to attempt to sample variables before the results of other threads are made available. We demonstrate our algorithm on Gibbs sampling in Boltzmann machines and latent Dirichlet allocation (LDA). We show in experiments that our algorithm achieves near linear speed-up in the number of cores, is faster than existing exact samplers, and is nearly as fast as approximate samplers while maintaining the correct stationary distribution.

Hierarchical Probabilistic Models for Group Anomaly Detection

Tue, 14 Jun 2011 00:00:00 +0000

Statistical anomaly detection typically focuses on finding individual data point anomalies. Often the most interesting or unusual things in a data set are not odd individual points, but rather larger scale phenomena that only become apparent when groups of data points are considered. In this paper, we propose two hierarchical probabilistic models for detecting such group anomalies. We evaluate our methods on synthetic data as well as astronomical data from the Sloan Digital Sky Survey. The experimental results show that the proposed models are effective in detecting group anomalies.

Relational Learning with One Network: An Asymptotic Analysis

Tue, 14 Jun 2011 00:00:00 +0000

Theoretical analysis of structured learning methods has focused primarily on domains where the data consist of independent (albeit structured) examples. Although the statistical relational learning (SRL) community has recently developed many classification methods for graph and network domains, much of this work has focused on modeling domains where there is a single network for learning. For example, we could learn a model to predict the political views of users in an online social network, based on the friendship relationships among users. In this example, the data would be drawn from a single large network (e.g., Facebook) and increasing the data size would correspond to acquiring a larger graph. Although SRL methods can successfully improve classification in these types of domains, there has been little theoretical analysis of single network domains. In particular, the asymptotic properties of estimation are not clear if the size of the model grows with the size of the network. In this work, we focus on outlining the conditions under which learning from a single network will be asymptotically consistent and normal. Moreover, we compare the properties of maximum likelihood estimation (MLE) with those of generalized maximum pseudolikelihood estimation (MPLE) and use the resulting understanding to propose novel MPLE estimators for single network domains. We include empirical analysis on both synthetic and real network data to illustrate the findings.

Discussion of “The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling.

Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation

Tue, 14 Jun 2011 00:00:00 +0000

We describe a general method of transforming arbitrary programming languages into probabilistic programming languages with straightforward MCMC inference engines. Random choices in the program are “named” with information about their position in an execution trace; these names are used in conjunction with a database of randomness to implement MCMC inference in the space of execution traces. We encode naming information using lightweight source-to-source compilers. Our method enables us to reuse existing infrastructure (compilers, interpreters, etc.) with minimal additional code, implying fast models with low development overhead. We illustrate the technique on two languages, one functional and one imperative: Bher, a compiled version of the Church language which eliminates interpretive overhead of the original MIT-Church implementation, and Stochastic Matlab, a new open-source language.

Information Theoretical Clustering via Semidefinite Programming

Tue, 14 Jun 2011 00:00:00 +0000

We propose techniques of convex optimization for information theoretical clustering. The clustering objective is to maximize the mutual information between data points and cluster assignments. We formulate this problem first as an instance of MAX K CUT on weighted graphs. We then apply the technique of semidefinite programming (SDP) relaxation to obtain a convex SDP problem. We show how the solution of the SDP problem can be further improved with a low-rank refinement heuristic. The low-rank solution reveals more clearly the cluster structure of the data. Empirical studies on several datasets demonstrate the effectiveness of our approach. In particular, the approach outperforms several other clustering algorithms when compared on standard evaluation metrics.

Online Variational Inference for the Hierarchical Dirichlet Process

Tue, 14 Jun 2011 00:00:00 +0000

The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric model that can be used to model mixed-membership data with a potentially infinite number of components. It has been applied widely in probabilistic topic modeling, where the data are documents and the components are distributions of terms that reflect recurring patterns (or “topics”) in the collection. Given a document collection, posterior inference is used to determine the number of topics needed and to characterize their distributions. One limitation of HDP analysis is that existing posterior inference algorithms require multiple passes through all the data—these algorithms are intractable for very large scale applications. We propose an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data. Our algorithm is significantly faster than traditional inference algorithms for the HDP, and lets us analyze much larger data sets. We illustrate the approach on two large collections of text, showing improved performance over online LDA, the finite counterpart to the HDP topic model.

Active Boosted Learning (ActBoost)

Tue, 14 Jun 2011 00:00:00 +0000

Active learning deals with the problem of selecting a small subset of examples to label, from a pool of unlabeled data, for training a good classifier. We develop an active learning algorithm in the boosting framework. In contrast to much of the recent work, which has focused on selecting the most ambiguous unlabeled example to label based on the current learned classifier, our algorithm selects examples to maximally reduce the volume of the version space of feasible boosted classifiers. We show that under suitable sparsity assumptions, this strategy achieves the generalization error performance of a boosted classifier trained on the entire data set while only selecting logarithmically many unlabeled samples to label. We also establish a partial negative result, in that without imposing structural assumptions it is difficult to guarantee generalization error performance. We explicitly characterize our convergence rate in terms of the sign pattern differences produced by the weak learners on the unlabeled data. We also present a convex relaxation to account for the non-convex sparse structure and show that the computational complexity of the resulting algorithm scales polynomially in the number of weak learners. We test ActBoost on several datasets to illustrate its performance and demonstrate its robustness to initialization.

Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables

Tue, 14 Jun 2011 00:00:00 +0000

While there has been considerable research in learning probabilistic graphical models from data for predictive and causal inference, almost all existing algorithms assume a single dataset of i.i.d. observations for all variables. For many applications, it may be impossible or impractical to obtain such datasets, but multiple datasets of i.i.d. observations for different subsets of these variables may be available. Tillman et al. (2009) showed how directed graphical models learned from such datasets can be integrated to construct an equivalence class of structures over all variables. While their procedure is correct, it assumes that the structures integrated do not entail contradictory conditional independences and dependences for variables in their intersections. While this assumption is reasonable asymptotically, it rarely holds in practice with finite samples due to the frequency of statistical errors. We propose a new correct procedure for learning such equivalence classes directly from the multiple datasets which avoids this problem and is thus more practically useful. Empirical results indicate our method is not only more accurate, but also faster and requires less memory.

Estimating Probabilities in Recommendation Systems

Tue, 14 Jun 2011 00:00:00 +0000

Modeling ranked data is an essential component in a number of important applications including recommendation systems and web-search. In many cases, judges omit preference among unobserved items and between unobserved and observed items. This case of analyzing incomplete rankings is very important from a practical perspective and yet has not been fully studied due to considerable computational difficulties. We show how to avoid such computational difficulties and efficiently construct a non-parametric model for rankings with missing items. We demonstrate our approach and show how it applies in the context of collaborative filtering.

Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure

Tue, 14 Jun 2011 00:00:00 +0000

Graphical models are often used “inappropriately,” with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.

Machine Learning Markets

Tue, 14 Jun 2011 00:00:00 +0000

Prediction markets show considerable promise for developing flexible mechanisms for machine learning. Here, machine learning markets for multivariate systems are defined, and a utility-based framework is established for their analysis. It is shown that such markets can implement model combination methods used in machine learning, such as product-of-experts and mixture-of-experts approaches, as equilibrium pricing models by varying agent utility functions. They can also implement models composed of local potentials, and message passing methods. Prediction markets also allow for more flexible combinations, by combining multiple different utility functions.

Kernel Belief Propagation

Tue, 14 Jun 2011 00:00:00 +0000

We propose a nonparametric generalization of belief propagation, Kernel Belief Propagation (KBP), for pairwise Markov random fields. Messages are represented as functions in a reproducing kernel Hilbert space (RKHS), and message updates are simple linear operations in the RKHS. KBP makes none of the assumptions commonly required in classical BP algorithms: the variables need not arise from a finite domain or a Gaussian distribution, nor must their relations take any particular parametric form. Rather, the relations between variables are represented implicitly, and are learned nonparametrically from training data. KBP has the advantage that it may be used on any domain where kernels are defined ($\mathbb{R}^d$, strings, groups), even where explicit parametric models are not known, or closed form expressions for the BP updates do not exist. The computational cost of message updates in KBP is polynomial in the training data size. We also propose a constant time approximate message update procedure by representing messages using a small number of basis functions. In experiments, we apply KBP to image denoising, depth prediction from still images, and protein configuration prediction: KBP is faster than competing classical and nonparametric approaches (by orders of magnitude, in some cases), while providing significantly more accurate results.

Spectral Chinese Restaurant Processes: Nonparametric Clustering Based on Similarities

Tue, 14 Jun 2011 00:00:00 +0000

We introduce a new nonparametric clustering model which combines the recently proposed distance-dependent Chinese restaurant process (dd-CRP) and non-linear, spectral methods for dimensionality reduction. Our model retains the ability of nonparametric methods to learn the number of clusters from data. At the same time it addresses two key limitations of nonparametric Bayesian methods: modeling data that are not exchangeable and have many correlated features. Spectral methods use the similarity between documents to map them into a low-dimensional spectral space where we then compare several clustering methods. Our experiments on handwritten digits and text documents show that nonparametric methods such as the CRP or dd-CRP can perform as well as or better than $k$-means and also recover the true number of clusters. We improve the performance of the dd-CRP in spectral space by incorporating the original similarity matrix in its prior. This simple modification results in better performance than all other methods we compared to. We offer a new formulation and first experimental evaluation of a general Gibbs sampler for mixture modeling with distance-dependent CRPs.

Assisting Main Task Learning by Heterogeneous Auxiliary Tasks with Applications to Skin Cancer Screening

Tue, 14 Jun 2011 00:00:00 +0000

In typical classification problems, high level concept features provided by a domain expert are usually available during classifier training but not during its deployment. We address this problem from a multitask learning (MTL) perspective by treating these features as auxiliary learning tasks. Previous efforts in MTL have mostly assumed that all tasks have the same input space. However, auxiliary tasks can have different input spaces, since their learning targets are different. Thus, to handle cases with heterogeneous input, in this paper we present a newly developed model using heterogeneous auxiliary tasks to help main task learning. First, we formulate a convex optimization problem for the proposed model, and then, we analyze its hypothesis class and derive true risk bounds. Finally, we compare the proposed model with other relevant methods when applied to the problem of skin cancer screening and public datasets. Our results show that the performance of the proposed method is highly competitive compared to other relevant methods.

Asymptotic Theory for Linear-Chain Conditional Random Fields

Tue, 14 Jun 2011 00:00:00 +0000

In this theoretical paper we develop an asymptotic theory for Linear-Chain Conditional Random Fields (L-CRFs) and apply it to derive conditions under which the Maximum Likelihood Estimates (MLEs) of the model weights are strongly consistent. We first define L-CRFs for infinite sequences and analyze some of their basic properties. Then we establish conditions under which ergodicity of the observations implies ergodicity of the joint sequence of observations and labels. This result is the key ingredient to derive conditions for strong consistency of the MLEs. Interesting findings are that the consistency crucially depends on the limit behavior of the Hessian of the likelihood function and that, asymptotically, the state feature functions do not matter.

Mixed Cumulative Distribution Networks

Tue, 14 Jun 2011 00:00:00 +0000

Directed acyclic graphs (DAGs) are a popular framework to express multivariate probability distributions. Acyclic directed mixed graphs (ADMGs) are generalizations of DAGs that can succinctly capture much richer sets of conditional independencies, and are especially useful in modeling the effects of latent variables implicitly. Unfortunately, there are currently no parameterizations of general ADMGs. In this paper, we apply recent work on cumulative distribution networks and copulas to propose one general construction for ADMG models. We consider a simple parameter estimation approach, and report some encouraging experimental results.

Spectral Clustering on a Budget

Tue, 14 Jun 2011 00:00:00 +0000

Spectral clustering is a modern and well known method for performing data clustering. However, it depends on the availability of a similarity matrix, which in many applications can be non-trivial to obtain. In this paper, we focus on the problem of performing spectral clustering under a budget constraint, where there is a limit on the number of entries which can be queried from the similarity matrix. We propose two algorithms for this problem, and study them theoretically and experimentally. These algorithms allow a tradeoff between computational efficiency and actual performance, and are also relevant for the problem of speeding up standard spectral clustering.

Fast Convergent Algorithms for Expectation Propagation Approximate Bayesian Inference

Tue, 14 Jun 2011 00:00:00 +0000

We propose a novel algorithm to solve the expectation propagation relaxation of Bayesian inference for continuous-variable graphical models. In contrast to most previous algorithms, our method is provably convergent. By marrying convergent EP ideas from Opper & Winther (2005) with covariance decoupling techniques (Wipf & Nagarajan, 2008; Nickisch & Seeger, 2009), it runs at least an order of magnitude faster than the most common EP solver.

Online Learning of Multiple Tasks and Their Relationships

Tue, 14 Jun 2011 00:00:00 +0000

We propose an Online MultiTask Learning (OMTL) framework which simultaneously learns the task weight vectors as well as the task relatedness adaptively from the data. Our work is in contrast with prior work on online multitask learning, which assumes fixed task relatedness a priori. Furthermore, whereas prior work in such settings assumes only positively correlated tasks, our framework can capture negative correlations as well. Our proposed framework learns the task relationship matrix by framing the objective function as a Bregman divergence minimization problem for positive definite matrices. Subsequently, we exploit this adaptively learned task-relationship matrix to select the most informative samples in an online multitask active learning setting. Experimental results on a number of real-world datasets and comparisons with numerous baselines establish the efficacy of our proposed approach.

Improved Regret Guarantees for Online Smooth Convex Optimization with Bandit Feedback

Tue, 14 Jun 2011 00:00:00 +0000

The study of online convex optimization in the bandit setting was initiated by Kleinberg (2004) and Flaxman et al. (2005). Such a setting models a decision maker that has to make decisions in the face of adversarially chosen convex loss functions. Moreover, the only information the decision maker receives are the losses. The identity of the loss functions themselves is not revealed. In this setting, we reduce the gap between the best known lower and upper bounds for the class of smooth convex functions, i.e. convex functions with a Lipschitz continuous gradient. Building upon existing work on self-concordant regularizers and one-point gradient estimation, we give the first algorithm whose expected regret, ignoring constant and logarithmic factors, is $O(T^{2/3})$.
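
As a rough illustration of one-point gradient estimation, the sketch below uses the classic spherical estimator associated with Flaxman et al. (2005) in one dimension, where the random direction is simply a sign. It is not the algorithm of this paper: the quadratic `loss`, the step `delta`, and the helper name `one_point_estimate` are assumptions chosen for illustration.

```python
def one_point_estimate(f, x, delta, u):
    """Gradient estimate built from a single evaluation f(x + delta * u).

    In d dimensions the scaling factor would be d / delta; here d = 1
    and u is a sign in {-1, +1}.
    """
    return (1.0 / delta) * f(x + delta * u) * u


def loss(x):
    # Hypothetical smooth convex loss, used only for this sketch.
    return x * x


x, delta = 1.0, 0.1
# Averaging over both possible directions gives the estimator's expectation.
estimates = [one_point_estimate(loss, x, delta, u) for u in (-1.0, 1.0)]
mean_estimate = sum(estimates) / len(estimates)
```

Each individual estimate is far from the true derivative, but their average matches it for this quadratic loss. This large variance of single-evaluation estimates is one reason bandit regret bounds are weaker than full-information ones.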

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Tue, 14 Jun 2011 00:00:00 +0000

Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.

On NDCG Consistency of Listwise Ranking Methods

Tue, 14 Jun 2011 00:00:00 +0000

We examine the consistency of listwise ranking methods with respect to the popular Normalized Discounted Cumulative Gain (NDCG) criterion. The most successful listwise approaches replace NDCG with a surrogate that is easier to optimize. We characterize NDCG consistent surrogates to discover a surprising fact: several commonly used surrogates are NDCG inconsistent. We then show how to change them so that they become NDCG consistent in a strong but natural sense. An explicit characterization of strong NDCG consistency is provided. Going beyond qualitative consistency considerations, we also give quantitative statements that enable us to transform the excess error, as measured in the surrogate, to the excess error in comparison to the Bayes optimal ranking function for NDCG. Finally, we also derive improved results if a certain natural “low noise” or “large margin” condition holds. Our experiments demonstrate that ensuring NDCG consistency does improve the performance of listwise ranking methods on real-world datasets. Moreover, a novel surrogate function suggested by our theoretical results leads to further improvements over NDCG consistent versions of existing surrogates.
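
For readers unfamiliar with the criterion, NDCG can be sketched as below. The gain $2^{r}-1$ and the discount $1/\log_2(i+1)$ are one common convention, assumed here for illustration; the abstract does not say which variant the paper uses, and the function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1])   # items already sorted by relevance
reversed_ = ndcg([1, 2, 3])  # worst ordering of the same items
```

Because NDCG depends on the ranking only through a sort, it is piecewise constant in the scores, which is why listwise methods optimize a smoother surrogate instead.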

On the Estimation of $\alpha$-Divergences

Tue, 14 Jun 2011 00:00:00 +0000

We propose new nonparametric, consistent Rényi-$\alpha$ and Tsallis-$\alpha$ divergence estimators for continuous distributions. Given two independent and identically distributed samples, a ‘brute force’ approach would be simply to estimate the underlying densities, and plug these densities into the corresponding formulas. However, it is not our goal to consistently estimate these possibly high dimensional densities, and our algorithm avoids estimating them. We will use simple $k$-nearest-neighbor statistics, and interestingly enough, we will still be able to prove that the proposed divergence estimators are consistent under certain conditions. We will also show how to use them for mutual information estimation, and demonstrate their efficiency by some numerical experiments.

Directional Statistics on Permutations

Tue, 14 Jun 2011 00:00:00 +0000

Distributions over permutations arise in applications ranging from multi-object tracking to ranking. The difficulty of dealing with these distributions is caused by the size of their domain, which is factorial in the number of entities $(n!)$. The direct definition of a multinomial distribution over permutation space is impractical for all but a very small $n$. In this work we propose an embedding of all $n!$ permutations for a given $n$ in a surface of a hypersphere defined in $\mathbb{R}^{(n-1)^2}$. As a result, we acquire the ability to define continuous distributions over a hypersphere with all the benefits of directional statistics. We provide polynomial time projections between the continuous hypersphere representation and the $n!$-element permutation space. The framework provides a way to use continuous directional probability densities and the methods developed thereof for establishing densities over permutations. As a demonstration of the benefits of the framework we derive an inference procedure for a state-space model over permutations. We demonstrate the approach with applications and comparisons to existing models.

Faithfulness in Chain Graphs: The Gaussian Case

Tue, 14 Jun 2011 00:00:00 +0000

This paper deals with chain graphs under the classic Lauritzen-Wermuth-Frydenberg interpretation. We prove that almost all the regular Gaussian distributions that factorize with respect to a chain graph are faithful to it. This result has three important consequences. First, chain graphs are more powerful than undirected graphs and acyclic directed graphs for representing regular Gaussian distributions, as some of these distributions can be represented exactly by the former but not by the latter. Second, the moralization and c-separation criteria for reading independencies from a chain graph are complete, in the sense that they identify all the independencies that can be identified from the chain graph alone. Third, some definitions of equivalence in chain graphs coincide and, thus, they have the same graphical characterization.

Generative Modeling for Maximizing Precision and Recall in Information Visualization

Tue, 14 Jun 2011 00:00:00 +0000

Information visualization has recently been formulated as an information retrieval problem, where the goal is to find similar data points based on the visualized nonlinear projection, and the visualization is optimized to maximize a compromise between (smoothed) precision and recall. We turn the visualization into a generative modeling task where a simple user model parameterized by the data coordinates is optimized, neighborhood relations are the observed data, and straightforward maximum likelihood estimation corresponds to Stochastic Neighbor Embedding (SNE). While SNE maximizes pure recall, adding a mixture component that “explains away” misses allows our generative model to focus on maximizing precision as well. The resulting model is a generative solution to maximizing tradeoffs between precision and recall. The model outperforms earlier models in terms of precision and recall and in external validation by unsupervised classification.

The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling

Tue, 14 Jun 2011 00:00:00 +0000

We present the discrete infinite logistic normal distribution (DILN, “Dylan”), a Bayesian nonparametric prior for mixed membership models. DILN is a generalization of the hierarchical Dirichlet process (HDP) that models correlation structure between the weights of the atoms at the group level. We derive a representation of DILN as a normalized collection of gamma-distributed random variables, and study its statistical properties. We consider applications to topic modeling and derive a variational Bayes algorithm for approximate posterior inference. We study the empirical performance of the DILN topic model on four corpora, comparing performance with the HDP and the correlated topic model.

Adaptive Bandits: Towards the best history-dependent strategy

Tue, 14 Jun 2011 00:00:00 +0000

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $\Theta$ of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model $\theta^* \in \Theta$. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in $\Theta$. This allows us to model opponents (case 1) or strategies (case 2) which handle finite memory, periodicity, standard stochastic bandits and other situations. When $\Theta=\{ \theta \}$, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $\tilde{O}(\sqrt{TAC})$, where $C$ is the number of classes of $\theta$. Now, when many models are available, all known algorithms achieving a nice regret $O(\sqrt{T})$ are unfortunately not tractable and scale poorly with the number of models $|\Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2/3}C^{1/3}\log(|\Theta|)^{1/2}$.

Maximum Volume Clustering

Tue, 14 Jun 2011 00:00:00 +0000

The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are preferable, is a useful alternative to the large margin principle. In this paper, we introduce a clustering model based on the large volume principle called maximum volume clustering (MVC), and propose two algorithms to solve it approximately: a soft-label and a hard-label MVC algorithm based on sequential quadratic programming and semi-definite programming, respectively. Our MVC model includes spectral clustering and maximum margin clustering as special cases, and is substantially more general. We also establish the finite sample stability and an error bound for the soft-label MVC method. Experiments show that the proposed MVC approach compares favorably with state-of-the-art clustering algorithms.

Dimensionality Reduction for Spectral Clustering

Tue, 14 Jun 2011 00:00:00 +0000

Spectral clustering is a flexible clustering methodology that is applicable to a variety of data types and has the particular virtue that it makes few assumptions on cluster shapes. It has become popular in a variety of application areas, particularly in computational vision and bioinformatics. The approach appears, however, to be particularly sensitive to irrelevant and noisy dimensions in the data. We thus introduce an approach that automatically learns the relevant dimensions and spectral clustering simultaneously. We pursue an augmented form of spectral clustering in which an explicit projection operator is incorporated in the relaxed optimization functional. We optimize this functional over both the projection and the spectral embedding. Experiments on simulated and real data show that this approach yields significant improvements in the performance of spectral clustering.

TopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked Documents

Tue, 14 Jun 2011 00:00:00 +0000

Popular algorithms for modeling the influence of entities in networked data, such as PageRank, work by analyzing the hyperlink structure, but ignore the contents of documents. However, influence is often topic dependent, e.g., a web page of high influence in politics may be an unknown entity in sports. We design a new model called TopicFlow, which combines ideas from network flow and topic modeling, to learn this notion of topic specific influences of hyperlinked documents in a completely unsupervised fashion. On the task of citation recommendation, which is an instance of capturing influence, the TopicFlow model, when combined with TF-IDF based cosine similarity, outperforms several competitive baselines by as much as 11.8%. Our empirical study of the model's output on the ACL corpus demonstrates its ability to identify topically influential documents. The TopicFlow model is also competitive with the state-of-the-art Relational Topic Models in predicting the likelihood of unseen text on two different data sets. Due to its ability to learn topic-specific flows across each hyperlink, the TopicFlow model can be a powerful visualization tool to track the diffusion of topics across a citation network.

Can matrix coherence be efficiently and accurately estimated?

Tue, 14 Jun 2011 00:00:00 +0000

Matrix coherence has recently been used to characterize the ability to extract global information from a subset of matrix entries in the context of low-rank approximations and other sampling-based algorithms. The significance of these results crucially hinges upon the possibility of efficiently and accurately testing this coherence assumption. This paper precisely addresses this issue. We introduce a novel sampling-based algorithm for estimating coherence, present associated estimation guarantees and report the results of extensive experiments for coherence estimation. The quality of the estimation guarantees we present depends on the coherence value to estimate itself, but this turns out to be an inherent property of sampling-based coherence estimation, as shown by our lower bound. In practice, however, we find that these theoretically unfavorable scenarios rarely appear, as our algorithm efficiently and accurately estimates coherence across a wide range of datasets, and these estimates are excellent predictors of the effectiveness of sampling-based matrix approximation on a case-by-case basis. These results are significant as they reveal the extent to which coherence assumptions made in a number of recent machine learning publications are testable.

Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization

Tue, 14 Jun 2011 00:00:00 +0000

We prove that many mirror descent algorithms for online convex optimization (such as online gradient descent) have an equivalent interpretation as follow-the-regularized-leader (FTRL) algorithms. This observation makes the relationships between many commonly used algorithms explicit, and provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles $L_1$ regularization explicitly, it has been observed that the FTRL-style Regularized Dual Averaging (RDA) algorithm is even more effective at producing sparsity. Our results demonstrate that the key difference between these algorithms is how they handle the cumulative $L_1$ penalty. While FOBOS handles the $L_1$ term exactly on any given update, we show that it is effectively using subgradient approximations to the $L_1$ penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm, which we introduce, can be seen as a hybrid of these two algorithms, and significantly outperforms both on a large, real-world dataset.
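
To make "handles the $L_1$ term exactly on any given update" concrete, here is a minimal sketch of a single FOBOS-style step: a gradient step followed by soft-thresholding, the proximal operator of the $L_1$ penalty. The step size `eta`, the penalty weight `lam`, and the example vectors are assumptions for illustration; RDA and FTRL-Proximal are not shown.

```python
def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrink x toward zero by tau, clipping at zero."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def fobos_l1_step(w, grad, eta, lam):
    """One FOBOS-style update: plain gradient step, then an exact L1 proximal step."""
    return [soft_threshold(wi - eta * gi, eta * lam) for wi, gi in zip(w, grad)]


w = [0.5, -0.02, 0.0]
grad = [1.0, 0.1, -0.05]
new_w = fobos_l1_step(w, grad, eta=0.1, lam=0.5)
```

Coordinates whose magnitude after the gradient step falls below `eta * lam` are set exactly to zero, which is how the update produces sparsity. The threshold applies only to the current round's penalty, in line with the abstract's point that penalties from earlier rounds are not handled in closed form.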

Discussion of “Contextual Bandit Algorithms with Supervised Learning Guarantees”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of Contextual Bandit Algorithms with Supervised Learning Guarantees.

Estimating beta-mixing coefficients

Tue, 14 Jun 2011 00:00:00 +0000

The literature on statistical learning for time series assumes the asymptotic independence or “mixing” of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the beta-mixing rate based on a single stationary sample path and show it is L1-risk consistent.

Online Learning of Structured Predictors with Multiple Kernels

Tue, 14 Jun 2011 00:00:00 +0000

Training structured predictors often requires considerable time selecting features or tweaking the kernel. Multiple kernel learning (MKL) sidesteps this issue by embedding the kernel learning into the training procedure. Despite the recent progress towards efficiency of MKL algorithms, the structured output case remains an open research front. We propose a family of online algorithms able to tackle variants of MKL and group-LASSO, for which we show regret, convergence, and generalization bounds. Experiments on handwriting recognition and dependency parsing attest to the success of the approach.

CAKE: Convex Adaptive Kernel Density Estimation

Tue, 14 Jun 2011 00:00:00 +0000

In this paper we present a generalization of kernel density estimation called Convex Adaptive Kernel Density Estimation (CAKE) that replaces single bandwidth selection by a convex aggregation of kernels at all scales, where the convex aggregation is allowed to vary from one training point to another, treating the fundamental problem of heterogeneous smoothness in a novel way. Learning the CAKE estimator given a training set reduces to solving a single convex quadratic programming problem. We derive rates of convergence of CAKE-like estimators to the true underlying density under smoothness assumptions on the class and show that given a sufficiently large sample the mean squared error of such estimators is optimal in a minimax sense. We also give a risk bound of the CAKE estimator in terms of its empirical risk. We empirically compare CAKE to other density estimators proposed in the statistics literature for handling heterogeneous smoothness on different synthetic and natural distributions.

Learning mixtures of Gaussians with maximum-a-posteriori oracle

Tue, 14 Jun 2011 00:00:00 +0000

We consider the problem of estimating the parameters of a mixture of distributions, where each component distribution is from a given parametric family e.g. exponential, Gaussian etc. We define a learning model in which the learner has access to a “maximum-a-posteriori” oracle which given any sample from a mixture of distributions, tells the learner which component distribution was the most likely to have generated it. We describe a learning algorithm in this setting which accurately estimates the parameters of a mixture of $k$ spherical Gaussians in $\mathbb{R}^d$ assuming the component Gaussians satisfy a mild separation condition. Our algorithm uses only polynomially many (in $d, k$) samples and oracle calls, and our separation condition is much weaker than those required by unsupervised learning algorithms like [Arora 01, Vempala 02].

Hidden-Unit Conditional Random Fields

Tue, 14 Jun 2011 00:00:00 +0000

The paper explores a generalization of conditional random fields (CRFs) in which binary stochastic hidden units appear between the data and the labels. Hidden-unit CRFs are potentially more powerful than standard CRFs because they can represent nonlinear dependencies at each frame. The hidden units in these models also learn to discover latent distributed structure in the data that improves classification. We derive efficient algorithms for inference and learning in these models by observing that the hidden units are conditionally independent given the data and the labels. Finally, we show that hidden-unit CRFs perform well in experiments on a range of tasks, including optical character recognition, text classification, protein structure prediction, and part-of-speech tagging.

Discussion of “Spectral Dimensionality Reduction via Maximum Entropy”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of Spectral Dimensionality Reduction via Maximum Entropy.

Learning Class-relevant Features and Class-irrelevant Features via a Hybrid third-order RBM

Tue, 14 Jun 2011 00:00:00 +0000

Restricted Boltzmann Machines are commonly used in unsupervised learning to extract features from training data. Since these features are learned for regenerating training data, a classifier based on them has to be trained. If only a few of the learned features are discriminative, other non-discriminative features will distract the classifier during the training process and thus waste computing resources for testing. In this paper, we present a hybrid third-order Restricted Boltzmann Machine in which class-relevant features (for recognizing) and class-irrelevant features (for generating only) are learned simultaneously. As the classification task uses only the class-relevant features, the test itself becomes very fast. We show that class-irrelevant features help class-relevant features to focus on the recognition task and introduce useful regularization effects to reduce the norms of class-relevant features. Thus there is no need to use weight-decay for the parameters of this model. Experiments on the MNIST, NORB and Caltech101 Silhouettes datasets show very promising results.

A Fast Algorithm for Recovery of Jointly Sparse Vectors based on the Alternating Direction Methods

Tue, 14 Jun 2011 00:00:00 +0000

Standard compressive sensing (CS) aims to recover a sparse signal from a single measurement vector (SMV), which is known as the SMV model. In this paper, we consider the recovery of jointly sparse signals in the multiple measurement vector (MMV) scenario, where the signal is represented as a matrix and the sparsity of the signal occurs in a common location set. The sparse MMV model can be formulated as a matrix $(2,1)$-norm minimization problem. However, the $(2,1)$-norm minimization problem is much more difficult to solve than $\ell_1$-norm minimization. In this paper, we propose a very fast algorithm, called MMV-ADM, for jointly sparse signal recovery in MMV settings based on the alternating direction method (ADM). MMV-ADM alternately updates the signal matrix, the Lagrangian multiplier and the residue, and all update rules only involve matrix or vector multiplications and summations, so it is simple, easy to implement and much faster than the state-of-the-art method MMVprox. Numerical simulations show that MMV-ADM is at least dozens of times faster than MMVprox with comparable recovery accuracy.
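
The matrix $(2,1)$-norm mentioned above is commonly defined as the sum of the Euclidean norms of the rows. The sketch below assumes that definition and uses made-up matrices; it only illustrates why the norm favors a common support across columns and is not the MMV-ADM algorithm.

```python
import math


def norm_21(matrix):
    """Sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


# Both matrices hold the same nonzero values, 3 and 4, one per column.
shared_support = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]    # both columns use row 0
separate_support = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # columns use different rows

joint = norm_21(shared_support)
scattered = norm_21(separate_support)
```

The entrywise $\ell_1$ norm of the two matrices is identical, but the $(2,1)$-norm is smaller when the nonzeros share a row, so minimizing it encourages jointly sparse solutions.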

Group Orthogonal Matching Pursuit for Logistic Regression

Tue, 14 Jun 2011 00:00:00 +0000

We consider a matching pursuit approach for variable selection and estimation in logistic regression models. Specifically, we propose Logistic Group Orthogonal Matching Pursuit (Logit-GOMP), which extends the Group-OMP procedure originally proposed for linear regression models, to select groups of variables in logistic regression models, given a predefined grouping structure within the explanatory variables. We theoretically characterize the performance of Logit-GOMP in terms of predictive accuracy, and also provide conditions under which Logit-GOMP is able to identify the correct (groups of) variables. Our results are non-asymptotic in contrast to classical consistency results for logistic regression which only apply in the asymptotic limit where the dimensionality is fixed or is restricted to grow slowly with the sample size. We conduct empirical evaluation on simulated data sets and the real world problem of splice site detection in DNA sequences. The results indicate that Logit-GOMP compares favorably to Logistic Group Lasso both in terms of variable selection and prediction accuracy. We also provide a generic version of our algorithm that applies to the wider class of generalized linear models.

Learning Scale Free Networks by Reweighted $\ell_1$ regularization

Tue, 14 Jun 2011 00:00:00 +0000

Methods for $\ell_1$-type regularization have been widely used in Gaussian graphical model selection tasks to encourage sparse structures. However, often we would like to include more structural information than mere sparsity. In this work, we focus on learning so-called “scale-free” models, a common feature that appears in many real-world networks. We replace the $\ell_1$ regularization with a power law regularization and optimize the objective function by a sequence of iteratively reweighted $\ell_1$ regularization problems, where the regularization coefficients of nodes with high degree are reduced, encouraging the appearance of hubs with high degree. Our method can be easily adapted to improve any existing $\ell_1$-based methods, such as graphical lasso, neighborhood selection, and JSRM when the underlying networks are believed to be scale free or have dominating hubs. We demonstrate in simulation that our method significantly outperforms a baseline $\ell_1$ method at learning scale-free networks and hub networks, and also illustrate its behavior on gene expression data.

Bayesian Hierarchical Cross-Clustering

Tue, 14 Jun 2011 00:00:00 +0000

Most clustering algorithms assume that all dimensions of the data can be described by a single structure. Cross-clustering (or multi-view clustering) allows multiple structures, each applying to a subset of the dimensions. We present a novel approach to cross-clustering, based on approximating the solution to a Cross Dirichlet Process mixture (CDPM) model [Shafto et al., 2006, Mansinghka et al., 2009]. Our bottom-up, deterministic approach results in a hierarchical clustering of dimensions, and at each node, a hierarchical clustering of data points. We also present a randomized approximation, based on a truncated hierarchy, that scales linearly in the number of levels. Results on synthetic and real-world data sets demonstrate that the cross-clustering based algorithms perform as well as or better than the clustering based algorithms, our deterministic approaches perform as well as the MCMC-based CDPM, and the randomized approximation provides a remarkable speedup relative to the full deterministic approximation with minimal cost in predictive error.

Confidence Weighted Mean Reversion Strategy for On-Line Portfolio Selection

Tue, 14 Jun 2011 00:00:00 +0000

This paper proposes a novel on-line portfolio selection strategy named “Confidence Weighted Mean Reversion” (CWMR). Inspired by the mean reversion principle and the confidence weighted online learning technique, CWMR models a portfolio vector as a Gaussian distribution, and sequentially updates the distribution by following the mean reversion trading principle. The CWMR strategy is able to effectively exploit the power of mean reversion for on-line portfolio selection. Extensive experiments on various real markets demonstrate the effectiveness of our strategy in comparison to the state of the art.

Spectral Dimensionality Reduction via Maximum Entropy

Tue, 14 Jun 2011 00:00:00 +0000

We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting probabilistic models are based on GRFs. The resulting model is a nonlinear generalization of principal component analysis. We show that parameter fitting in the locally linear embedding is approximate maximum likelihood in these models. We develop new algorithms that directly maximize the likelihood and show that these new algorithms are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set.

The Neural Autoregressive Distribution Estimator

Tue, 14 Jun 2011 00:00:00 +0000

We describe a new approach for modeling the distribution of high-dimensional vectors of discrete variables. This model is inspired by the restricted Boltzmann machine (RBM), which has been shown to be a powerful model of such distributions. However, an RBM typically does not provide a tractable distribution estimator, since evaluating the probability it assigns to some given observation requires the computation of the so-called partition function, which itself is intractable for RBMs of even moderate size. Our model circumvents this difficulty by decomposing the joint distribution of observations into tractable conditional distributions and modeling each conditional using a non-linear function similar to a conditional of an RBM. Our model can also be interpreted as an autoencoder wired such that its output can be used to assign valid probabilities to observations. We show that this new model outperforms other multivariate binary distribution estimators on several datasets and performs similarly to a large (but intractable) RBM.
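
The decomposition into tractable conditionals can be illustrated with a much simpler stand-in: a fully visible autoregressive model in which each bit's conditional is a logistic function of the earlier bits. This is not the NADE architecture (it has no hidden units or weight sharing), and the weights and biases below are arbitrary assumptions. The point is only that a product of valid conditionals yields a normalized distribution with no partition function to compute.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_prob(v, weights, biases):
    """log p(v) = sum_i log p(v_i | v_1..v_{i-1}) for a binary vector v."""
    total = 0.0
    for i, bit in enumerate(v):
        # The conditional for bit i depends only on the bits before it.
        activation = biases[i] + sum(weights[i][j] * v[j] for j in range(i))
        p_one = sigmoid(activation)
        total += math.log(p_one if bit == 1 else 1.0 - p_one)
    return total


# Hypothetical parameters for a 3-bit model; weights[i] has i entries.
weights = [[], [0.5], [-1.0, 2.0]]
biases = [0.1, -0.3, 0.2]

# Summing p(v) over all 2^3 configurations should give 1 by construction.
total_mass = sum(
    math.exp(log_prob(v, weights, biases))
    for v in itertools.product([0, 1], repeat=3)
)
```

Evaluating the probability of one observation costs a single pass over its dimensions, whereas an RBM would need a sum over exponentially many configurations to normalize.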

Robust Bayesian Matrix Factorisation

Tue, 14 Jun 2011 00:00:00 +0000

We analyse the noise arising in collaborative filtering when formalised as a probabilistic matrix factorisation problem. We show empirically that modelling row- and column-specific variances is important, the noise being in general non-Gaussian and heteroscedastic. We also advocate for the use of a Student-t prior for the latent features as the standard Gaussian is included as a special case. We derive several variational inference algorithms and estimate the hyperparameters by type-II maximum likelihood. Experiments on real data show that the predictive performance is significantly improved.

Approximate inference for the loss-calibrated Bayesian

Tue, 14 Jun 2011 00:00:00 +0000

We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task – and associated losses – for which the posterior could be used. We argue that this can be suboptimal and propose instead to loss-calibrate the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.

On Time Varying Undirected Graphs

Tue, 14 Jun 2011 00:00:00 +0000

The time-varying multivariate Gaussian distribution and the undirected graph associated with it, as introduced in Zhou et al. (2008), provide a useful statistical framework for modeling complex dynamic networks. In many application domains, it is of high importance to estimate the graph structure of the model consistently for the purpose of scientific discovery. In this short note, we show that under suitable technical conditions the structure of the undirected graphical model can be consistently estimated in the high dimensional setting, when the dimensionality of the model is allowed to diverge with the sample size. The model selection consistency is shown for the procedure proposed in Zhou et al. (2008) and for the modified neighborhood selection procedure of Meinshausen and Bühlmann (2006).

Convex envelopes of complexity controlling penalties: the case against premature envelopment

Tue, 14 Jun 2011 00:00:00 +0000

Convex envelopes of the cardinality and rank function, $l_1$ and nuclear norm, have gained immense popularity due to their sparsity inducing properties. This gave rise to a natural approach to building objectives with sparse optima whereby such convex penalties are added to another objective. Such a heuristic approach to objective building does not always work. For example, addition of an $L_1$ penalty to the KL-divergence fails to induce any sparsity, as the $L_1$ norm of any vector in a simplex is a constant. However, a convex envelope of KL and a cardinality penalty can be obtained that indeed trades off sparsity and KL-divergence. We consider cases of two composite penalties, elastic net and fused lasso, which combine multiple desiderata. In both of these cases, we show that a hard objective relaxed to obtain penalties can be more tightly approximated. Further, by construction, it is impossible to get a better convex approximation than the ones we derive. Thus, constructing a joint envelope across different parts of the objective provides means to trade off tightness and computational cost.
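
The claim that an $L_1$ penalty cannot induce sparsity on the simplex can be checked directly: every probability vector has $L_1$ norm equal to 1, no matter how many entries are nonzero. The vectors below are arbitrary examples chosen for illustration.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)


def cardinality(v):
    """Number of nonzero entries, the quantity a sparsity penalty should reduce."""
    return sum(1 for x in v if x != 0)


dense = [0.25, 0.25, 0.25, 0.25]  # uniform distribution over 4 outcomes
sparse = [1.0, 0.0, 0.0, 0.0]     # point mass on one outcome

dense_penalty = l1_norm(dense)
sparse_penalty = l1_norm(sparse)
```

Both vectors receive the same $L_1$ penalty although their cardinalities are 4 and 1, so on the simplex the penalty adds only a constant to the objective and cannot change its minimizer.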

Convergent Decomposition Solvers for Tree-reweighted Free Energies

Tue, 14 Jun 2011 00:00:00 +0000

We investigate minimization of tree-reweighted free energies for the purpose of obtaining approximate marginal probabilities and upper bounds on the partition function of cyclic graphical models. The solvers we present for this problem work by directly tightening tree-reweighted upper bounds. As a result, they are particularly efficient for tree-reweighted energies arising from a small number of spanning trees. While this assumption may seem restrictive at first, we show how small sets of trees can be constructed in a principled manner. An appealing property of our algorithms, which results from the problem decomposition, is that they are embarrassingly parallel. In contrast to the original message passing algorithm introduced for this problem, we obtain global convergence guarantees.

On Learning Discrete Graphical Models using Group-Sparse Regularization

Tue, 14 Jun 2011 00:00:00 +0000

We study the problem of learning the graph structure associated with a general discrete graphical model (each variable can take any of $m > 1$ values, the clique factors have maximum size $c \geq 2$) from samples, under high-dimensional scaling where the number of variables $p$ could be larger than the number of samples $n$. We provide a quantitative consistency analysis of a procedure based on node-wise multi-class logistic regression with group-sparse regularization. We first consider general $m$-ary pairwise models – where each factor depends on at most two variables. We show that when the number of samples scales as $n > K(m-1)^2 d^2 \log ((m-1)^2(p-1))$ – where $d$ is the maximum degree and $K$ a fixed constant – the procedure succeeds in recovering the graph with high probability. For general models with $c$-way factors, the natural multi-way extension of the pairwise method quickly becomes very computationally complex. So we studied the effectiveness of using the pairwise method even while the true model has higher order factors. Surprisingly, we show that under slightly more stringent conditions, the pairwise procedure still recovers the graph structure, when the samples scale as $n > K (m-1)^2 d^{\frac{3}{2}c - 1} \log ( (m-1)^c (p-1)^{c-1} )$.

Improved Loss Bounds For Multiple Kernel Learning

Tue, 14 Jun 2011 00:00:00 +0000

We propose two new generalization error bounds for multiple kernel learning (MKL). First, using the bound of Srebro and Ben-David (2006) as a starting point, we derive a new version which uses a simple counting argument for the choice of kernels in order to generate a tighter bound when 1-norm regularization (sparsity) is imposed in the kernel learning problem. The second bound is a Rademacher complexity bound which is additive in the (logarithmic) kernel complexity and margin term. This dependency is superior to all previously published Rademacher bounds for learning a convex combination of kernels, including the recent bound of Cortes et al. (2010), which exhibits a multiplicative interaction. We illustrate the tightness of our bounds with simulations.

Fast $b$-matching via Sufficient Selection Belief Propagation

Tue, 14 Jun 2011 00:00:00 +0000

This article describes scalability enhancements to a previously established belief propagation algorithm that solves bipartite maximum weight $b$-matching. The previous algorithm required $O(|V|+|E|)$ space and $O(|V||E|)$ time, whereas we apply improvements to reduce the space to $O(|V|)$ and the time to $O(|V|^{2.5})$ in the expected case (though worst case time is still $O(|V||E|)$). The space improvement is most significant in cases where edge weights are determined by a function of node descriptors, such as a distance or kernel function. In practice, we demonstrate maximum weight $b$-matchings to be solvable on graphs with hundreds of millions of edges in only a few hours of compute time on a modern personal computer without parallelization, whereas neither the memory nor the time requirement of previously known algorithms would have allowed graphs of this scale.

Optimal Distributed Market-Based Planning for Multi-Agent Systems with Shared Resources

Tue, 14 Jun 2011 00:00:00 +0000

Market-based algorithms have become popular in collaborative multi-agent planning due to their simplicity, distributedness, low communication requirements, and proven success in domains such as task allocation and robotic exploration. Most existing market-based algorithms, however, suffer from two main drawbacks: resource prices must be carefully handcrafted for each problem domain, and there is no guarantee on final solution quality. We present an optimal market-based algorithm, derived from a mixed integer program formulation of planning problems. Our method is based on two well-known techniques for optimization: Dantzig-Wolfe decomposition and Gomory cuts. The former prices resources optimally for a relaxed version of the problem, while the latter introduces new derivative resources to correct pricing imbalances that arise from the relaxation. Our algorithm is applicable to a wide variety of multi-agent planning domains. We provide optimality guarantees and demonstrate the effectiveness of our algorithm in both centralized and distributed settings on synthetic planning problems.

Evolving Cluster Mixed-Membership Blockmodel for Time-Evolving Networks

Tue, 14 Jun 2011 00:00:00 +0000

Time-evolving networks are a natural presentation for dynamic social and biological interactions. While latent space models are gaining popularity in network modeling and analysis, previous works mostly ignore networks with temporal behavior and multi-modal actor roles. Furthermore, prior knowledge, such as division and grouping of social actors or biological specificity of molecular functions, has not been systematically exploited in network modeling. In this paper, we develop a network model featuring a state space mixture prior that tracks complex actor latent role changes through time. We provide a fast variational inference algorithm for learning our model, and validate it with simulations and held-out likelihood comparisons on real-world time-evolving networks. Finally, we demonstrate our model’s utility as a network analysis tool, by applying it to United States Congress voting data.

Multiscale Community Blockmodel for Network Exploration

Tue, 14 Jun 2011 00:00:00 +0000

Real world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric Multiscale Community Blockmodel (MSCB) to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross-community interactions. By using the nested Chinese Restaurant Process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model in real-world datasets such as predator-prey networks and citation networks.

Preface

Tue, 14 Jun 2011 00:00:00 +0000

Preface to the Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, April 11-13, 2011, Fort Lauderdale, FL, USA.

Parallel Gibbs Sampling: From Colored Fields to Thin Junction Trees

Tue, 14 Jun 2011 00:00:00 +0000

We explore the task of constructing a parallel Gibbs sampler, to both improve mixing and the exploration of high likelihood states. Recent work in parallel Gibbs sampling has focused on update schedules which do not guarantee convergence to the intended stationary distribution. In this work, we propose two methods to construct parallel Gibbs samplers guaranteed to draw from the targeted distribution. The first method, called the Chromatic sampler, uses graph coloring to construct a direct parallelization of the classic sequential scan Gibbs sampler. In the case of 2-colorable models we relate the Chromatic sampler to the Synchronous Gibbs sampler (which draws all variables simultaneously in parallel), and reveal new ergodic properties of Synchronous Gibbs chains. Our second method, the Splash sampler, is a complementary strategy which can be used when the variables are tightly coupled. This constructs and samples multiple blocks in parallel, using a novel locking protocol and an iterative junction tree generation algorithm. We further improve the Splash sampler through adaptive tree construction. We demonstrate the benefits of our two sampling algorithms on large synthetic and real-world models using a 32 processor multi-core system.
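
The graph-coloring idea behind the Chromatic sampler can be illustrated on a small 2-colorable model. The sketch below is not the paper's implementation: it assumes an Ising model on an `n` by `n` grid with a hypothetical `coupling` parameter, uses a checkerboard coloring, and simulates the parallel update of each color class sequentially in plain Python.

```python
import math
import random

def grid_neighbors(i, j, n):
    """4-connected neighbors of site (i, j) on an n x n grid."""
    candidates = ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
    return [(a, b) for a, b in candidates if 0 <= a < n and 0 <= b < n]

def color(i, j):
    """Checkerboard 2-coloring: neighboring sites always get different colors."""
    return (i + j) % 2

def chromatic_sweep(state, n, coupling, rng):
    """One sweep: update all sites of color 0, then all sites of color 1."""
    for c in (0, 1):
        block = [(i, j) for i in range(n) for j in range(n) if color(i, j) == c]
        # Sites in `block` share no edges, so each conditional depends only on
        # sites of the other color; these draws could run in parallel.
        new_values = {}
        for (i, j) in block:
            field = coupling * sum(state[a][b] for a, b in grid_neighbors(i, j, n))
            p_up = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(spin = +1 | neighbors)
            new_values[(i, j)] = 1 if rng.random() < p_up else -1
        for (i, j), value in new_values.items():
            state[i][j] = value
    return state

rng = random.Random(0)
n = 6
state = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
for _ in range(20):
    chromatic_sweep(state, n, coupling=0.3, rng=rng)
```

Within a color class the order of updates does not matter, which is what allows the sequential inner loop to be replaced by a parallel one without changing the sampled distribution.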

Deep Sparse Rectifier Neural Networks

Tue, 14 Jun 2011 00:00:00 +0000

While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra-unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training.
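
The "true zeros" property of rectifying units can be seen in a small numerical sketch. The code below is illustrative only: it assumes standard-normal pre-activations (an assumption, not a setting from the paper) and compares the fraction of exactly-zero outputs for a rectifier and for the hyperbolic tangent.

```python
import math
import random

def rectifier(a):
    """Rectifying activation max(0, a); every non-positive input maps to exactly 0."""
    return max(0.0, a)

random.seed(0)
# Hypothetical pre-activations, drawn from a standard normal for illustration.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(1000)]

rectifier_out = [rectifier(a) for a in pre_activations]
tanh_out = [math.tanh(a) for a in pre_activations]

def zero_fraction(values):
    """Fraction of outputs that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)

relu_zero_frac = zero_fraction(rectifier_out)
tanh_zero_frac = zero_fraction(tanh_out)
print("rectifier zeros:", relu_zero_frac, "tanh zeros:", tanh_zero_frac)
```

With symmetric pre-activations roughly half of the rectifier outputs are exactly zero, whereas tanh outputs are zero only when the input itself is zero.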

Learning from positive and unlabeled examples by enforcing statistical significance

Tue, 14 Jun 2011 00:00:00 +0000

Given a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as ‘positive’ with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test. For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution of this problem computed by a one-class SVM applied on a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature-vector shifted by the average feature-vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in Bioinformatics and its results outperform common solutions to this problem.

Block-sparse Solutions using Kernel Block RIP and its Application to Group Lasso

Tue, 14 Jun 2011 00:00:00 +0000

We propose Kernel Block Restricted Isometry Property (KB-RIP) as a generalization of the well-studied RIP and prove a variety of results. First, we present a “sum-of-norms”-minimization based formulation of the sparse recovery problem and prove that under certain conditions on KB-RIP, it recovers the optimal sparse solution exactly. The Group Lasso formulation, widely used as a good heuristic, arises naturally from the Lagrangian relaxation of our formulation. Second, we present an efficient combinatorial algorithm for provable sparse recovery under similar assumptions on KB-RIP. As a side product, this result improves the previous best assumptions on RIP under which a combinatorial algorithm was known. Finally, we provide numerical evidence to illustrate that not only are our sum-of-norms-minimization formulation and combinatorial algorithm significantly faster than Lasso, they also outperform Lasso in terms of recovery.

A Dynamic Relational Infinite Feature Model for Longitudinal Social Networks

Tue, 14 Jun 2011 00:00:00 +0000

Real-world relational data sets, such as social networks, often involve measurements over time. We propose a Bayesian nonparametric latent feature model for such data, where the latent features for each actor in the network evolve according to a Markov process, extending recent work on similar models for static networks. We show how the number of features and their trajectories for each actor can be inferred simultaneously and demonstrate the utility of this model on prediction tasks using both synthetic and real-world data.

Revisiting MAP Estimation, Message Passing and Perfect Graphs

Tue, 14 Jun 2011 00:00:00 +0000

Given a graphical model, one of the most useful queries is to find the most likely configuration of its variables. This task, known as the maximum a posteriori (MAP) problem, can be solved efficiently via message passing techniques when the graph is a tree, but is NP-hard for general graphs. Jebara (2009) shows that the MAP problem can be converted into the stable set problem, which can be solved in polynomial time for a broad class of graphs known as perfect graphs via a linear programming relaxation technique. This is a result of great theoretical interest. However, the article additionally claims that max-product linear programming (MPLP) message passing techniques of Globerson and Jaakkola (2007) are also guaranteed to solve these problems exactly and efficiently. We investigate this claim, show that it does not hold, and repair it with alternative message passing algorithms.

A novel greedy algorithm for Nyström approximation

Tue, 14 Jun 2011 00:00:00 +0000

The Nyström method is an efficient technique for obtaining a low-rank approximation of a large kernel matrix based on a subset of its columns. The quality of the Nyström approximation highly depends on the subset of columns used, which are usually selected using random sampling. This paper presents a novel recursive algorithm for calculating the Nyström approximation, and an effective greedy criterion for column selection. Further, a very efficient variant is proposed for greedy sampling, which works on random partitions of data instances. Experiments on benchmark data sets show that the proposed greedy algorithms achieve significant improvements in approximating kernel matrices, with minimum overhead in run time.
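
For readers unfamiliar with the basic construction, the standard Nyström approximation built from a column subset can be sketched as $\tilde{K} = C W^{-1} C^\top$, where $C$ holds the selected columns of $K$ and $W$ is the block of $K$ at the selected rows and columns. The code below is a minimal plain-Python sketch of that standard formula only. It uses a fixed, hand-picked column set (`cols`) and a hypothetical 1-D RBF kernel, and does not implement the recursive algorithm or the greedy selection criterion proposed in the paper.

```python
import math

def rbf_kernel(xs, gamma=0.5):
    """Dense RBF kernel matrix for 1-D points (illustrative choice of kernel)."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in xs] for a in xs]

def solve(A, B):
    """Solve A X = B for X by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def nystrom(K, cols):
    """Standard Nystrom approximation C W^{-1} C^T from the columns in `cols`."""
    n, m = len(K), len(cols)
    C = [[K[i][j] for j in cols] for i in range(n)]        # n x m selected columns
    W = [[K[i][j] for j in cols] for i in cols]            # m x m intersection block
    Ct = [[C[i][a] for i in range(n)] for a in range(m)]   # m x n transpose of C
    X = solve(W, Ct)                                       # W^{-1} C^T
    return [[sum(C[i][a] * X[a][j] for a in range(m)) for j in range(n)]
            for i in range(n)]

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
K = rbf_kernel(xs)
cols = [0, 3, 6]  # hand-picked columns; the paper selects columns greedily instead
K_hat = nystrom(K, cols)
max_err = max(abs(K[i][j] - K_hat[i][j]) for i in range(len(xs)) for j in range(len(xs)))
print("max abs error:", max_err)
```

The approximation reproduces the selected columns exactly and incurs error only on the remaining entries, which is why the choice of columns drives the overall quality.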

Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities

Tue, 14 Jun 2011 00:00:00 +0000

Hierarchical clustering based on pairwise similarities is a common tool used in a broad range of scientific applications. However, in many problems it may be expensive to obtain or compute similarities between the items to be clustered. This paper investigates the possibility of hierarchical clustering of $N$ items based on a small subset of pairwise similarities, significantly less than the complete set of $N(N-1)/2$ similarities. First, we show that, if the intracluster similarities exceed intercluster similarities, then it is possible to correctly determine the hierarchical clustering from as few as $3N \log N$ similarities. We demonstrate that this order-of-magnitude saving in the number of pairwise similarities necessitates sequentially selecting which similarities to obtain in an adaptive fashion, rather than picking them at random. Finally, we propose an active clustering method that is robust to a limited fraction of anomalous similarities, and show how even in the presence of these noisy similarity values we can resolve the hierarchical clustering using only $O(N \log^2 N)$ pairwise similarities.
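
To give a sense of scale for the two counts quoted above, the short sketch below compares $N(N-1)/2$ with $3N \log N$ for a few values of $N$. The abstract does not state the base of the logarithm; the natural logarithm is assumed here, and the function names are illustrative.

```python
import math

def all_pairs(n):
    """Size of the complete set of pairwise similarities, N(N-1)/2."""
    return n * (n - 1) // 2

def active_budget(n):
    """The 3 N log N count from the abstract (natural log assumed here)."""
    return 3 * n * math.log(n)

for n in (10, 100, 1000, 10000):
    print(n, all_pairs(n), round(active_budget(n)))
```

Under this assumption the quoted count is larger than the number of available pairs for very small $N$, so the saving should be read as a statement about large $N$.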

Bagged Structure Learning of Bayesian Network

Tue, 14 Jun 2011 00:00:00 +0000

We present a novel approach for density estimation using Bayesian networks when faced with scarce and partially observed data. Our approach relies on Efron’s bootstrap framework, and replaces the standard model selection score by a bootstrap aggregation objective aimed at sifting out bad decisions during the learning procedure. Unlike previous bootstrap or MCMC based approaches that are only aimed at recovering specific structural features, we learn a concrete density model that can be used for probabilistic generalization. To make use of our objective when some of the data is missing, we propose a bagged structural EM procedure that does not incur the heavy computational cost typically associated with a bootstrap-based approach. We compare our bagged objective to the Bayesian score and the Bayesian information criterion (BIC), as well as other bootstrap-based model selection objectives, and demonstrate its effectiveness in improving generalization performance for varied real-life datasets.

A conditional game for comparing approximations

Tue, 14 Jun 2011 00:00:00 +0000

We present a “conditional game” to be played between two approximate inference algorithms. We prove that exact inference is an optimal strategy and demonstrate how the game can be used to estimate the relative accuracy of two different approximations in the absence of exact marginals.

Optimal and Robust Price Experimentation: Learning by Lottery

Tue, 14 Jun 2011 00:00:00 +0000

This paper studies optimal price learning for one or more items. We introduce the Schrödinger price experiment (SPE) which superimposes classical price experiments using lotteries, and thereby extracts more information from each customer interaction. If buyers are perfectly rational we show that there exist SPEs that in the limit of infinite superposition learn optimally and exploit optimally. We refer to the new resulting mechanism as the hopeful mechanism (HM) since although it is incentive compatible, buyers can deviate with extreme consequences for the seller at very little cost to themselves. For real-world settings we propose a robust version of the approach which takes the form of a Markov decision process where the actions are functions. We provide approximate policies motivated by the best of sampled set (BOSS) algorithm coupled with approximate Bayesian inference. Numerical studies show that the proposed method significantly increases seller revenue compared to classical price experimentation, even for the single-item case.

A Spike and Slab Restricted Boltzmann Machine

Tue, 14 Jun 2011 00:00:00 +0000

We introduce the spike and slab Restricted Boltzmann Machine, characterized by having both a real-valued vector, the slab, and a binary variable, the spike, associated with each unit in the hidden layer. The model possesses some practical properties such as being amenable to Block Gibbs sampling as well as being capable of generating similar latent representations of the data to the recently introduced mean and covariance Restricted Boltzmann Machine. We illustrate how the spike and slab Restricted Boltzmann Machine achieves competitive performance on the CIFAR-10 object recognition task.

Discussion of “A conditional game for comparing approximations”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of A conditional game for comparing approximations.

Deep Learning for Efficient Discriminative Parsing

Tue, 14 Jun 2011 00:00:00 +0000

We propose a new fast purely discriminative algorithm for natural language parsing, based on a “deep” recurrent convolutional graph transformer network (GTN). Assuming a decomposition of a parse tree into a stack of “levels”, the network predicts a level of the tree taking into account predictions of previous levels. Using only a few basic text features which leverage word representations from Collobert and Weston (2008), we show similar performance (in $F_1$ score) to existing pure discriminative parsers and existing “benchmark” parsers (like Collins parser, probabilistic context-free grammars based), with a huge speed advantage.

An Analysis of Single-Layer Networks in Unsupervised Feature Learning

Tue, 14 Jun 2011 00:00:00 +0000

A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR-10 by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR-10, NORB, and STL datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (“stride”) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance - so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively).

Contextual Bandits with Linear Payoff Functions

Tue, 14 Jun 2011 00:00:00 +0000

In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove an $O\left(\sqrt{Td\ln^3(KT\ln(T)/\delta)}\right)$ regret bound that holds with probability $1-\delta$ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of $\Omega(\sqrt{Td})$ for this setting, matching the upper bound up to logarithmic factors.
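
As background, a generic upper confidence bound rule for linear payoffs can be sketched as follows: keep a ridge-regression estimate $\hat\theta = A^{-1} b$ and pick the action maximizing $x^\top \hat\theta + \alpha \sqrt{x^\top A^{-1} x}$. The code below is a minimal sketch of that generic rule, not necessarily the exact variant analyzed in the paper. The fixed arm features, the noiseless rewards, `true_theta`, and `alpha` are all assumptions made for illustration.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_linear_ucb(arms, true_theta, rounds, alpha=1.0):
    """Run a generic linear UCB rule on fixed arm features with noiseless payoffs."""
    d = len(true_theta)
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # I + sum x x^T
    b = [0.0] * d                                                       # sum reward * x
    pulls = [0] * len(arms)
    for _ in range(rounds):
        theta_hat = solve(A, b)
        # Estimated payoff plus an exploration bonus sqrt(x^T A^{-1} x).
        scores = [dot(x, theta_hat) + alpha * math.sqrt(dot(x, solve(A, x)))
                  for x in arms]
        k = max(range(len(arms)), key=lambda i: scores[i])
        x = arms[k]
        reward = dot(x, true_theta)  # noiseless linear payoff (simplifying assumption)
        for i in range(d):
            b[i] += reward * x[i]
            for j in range(d):
                A[i][j] += x[i] * x[j]
        pulls[k] += 1
    return pulls, solve(A, b)

pulls, theta_hat = run_linear_ucb(
    arms=[[1.0, 0.0], [0.0, 1.0]], true_theta=[0.2, 0.8], rounds=200)
print("pulls per arm:", pulls)
```

The exploration bonus shrinks along directions that have been played often, so the rule concentrates on the higher-payoff arm while still revisiting others when their uncertainty is large relative to the estimated gap.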

Concave Gaussian Variational Approximations for Inference in Large-Scale Bayesian Linear Models

Tue, 14 Jun 2011 00:00:00 +0000

Two popular approaches to forming bounds in approximate Bayesian inference are local variational methods and minimal Kullback-Leibler divergence methods. For a large class of models we explicitly relate the two approaches, showing that the local variational method is equivalent to a weakened form of Kullback-Leibler Gaussian approximation. This gives a strong motivation to develop efficient methods for KL minimisation. An important and previously unproven property of the KL variational Gaussian bound is that it is a concave function in the parameters of the Gaussian for log concave sites. This observation, along with compact concave parametrisations of the covariance, enables us to develop fast scalable optimisation procedures to obtain lower bounds on the marginal likelihood in large scale Bayesian linear models.

Switch-Reset Models : Exact and Approximate Inference

Tue, 14 Jun 2011 00:00:00 +0000

Reset models are constrained switching latent Markov models in which the dynamics either continues according to a standard model, or the latent variable is resampled. We consider exact marginal inference in this class of models and their extension, the switch-reset models. A further convenient class of conjugate-exponential reset models is also discussed. For a length $T$ time-series, exact filtering scales with $T^2$ and smoothing with $T^3$. We discuss approximate filtering and smoothing routines that scale linearly with $T$. Applications are given to change-point models and reset linear dynamical systems.

Relative Entropy Inverse Reinforcement Learning

Tue, 14 Jun 2011 00:00:00 +0000

We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near)-optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using learned MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.

Domain Adaptation with Coupled Subspaces

Tue, 14 Jun 2011 00:00:00 +0000

Domain adaptation algorithms address a key issue in applied machine learning: How can we train a system under a source distribution but achieve high performance under a different target distribution? We tackle this question for divergent distributions where crucial predictive target features may not even have support under the source distribution. In this setting, the key intuition is that if we can link target-specific features to source features, we can learn effectively using only source labeled data. We formalize this intuition, as well as the assumptions under which such coupled learning is possible. This allows us to give finite sample target error bounds (using only source training data) and an algorithm which performs at the state-of-the-art on two natural language processing adaptation tasks which are characterized by novel target features.

Contextual Bandit Algorithms with Supervised Learning Guarantees

Tue, 14 Jun 2011 00:00:00 +0000

We address the problem of competing with any large set of $N$ policies in the non-stochastic bandit setting, where the learner must repeatedly select among $K$ actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of Auer et al., called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT\ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically in a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC-dimension $d$ while incurring regret at most $O(\sqrt{Td\ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.

Deep Learners Benefit More from Out-of-Distribution Examples

Tue, 14 Jun 2011 00:00:00 +0000

Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did beat previously published results and reached human-level performance.

Discussion of “The Neural Autoregressive Distribution Estimator”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of The Neural Autoregressive Distribution Estimator.

Active Diagnosis under Persistent Noise with Unknown Noise Distribution: A Rank-Based Approach

Tue, 14 Jun 2011 00:00:00 +0000

We consider a problem of active diagnosis, where the goal is to efficiently identify an unknown object by sequentially selecting, and observing, the responses to binary valued queries. We assume that query observations are noisy, and further that the noise is persistent, meaning that repeating a query does not change the response. Previous work in this area either assumed the knowledge of the query noise distribution, or that the noise level is sufficiently low so that the unknown object can be identified with high accuracy. We make no such assumptions, and introduce an algorithm that returns a ranked list of objects, such that the expected rank of the true object is optimized. Furthermore, our algorithm does not require knowledge of the query noise distribution.

Tighter Relaxations for MAP-MRF Inference: A Local Primal-Dual Gap based Separation Algorithm

Tue, 14 Jun 2011 00:00:00 +0000

We propose an efficient and adaptive method for MAP-MRF inference that provides increasingly tighter upper and lower bounds on the optimal objective. Similar to Sontag et al. (2008b), our method starts by solving the first-order LOCAL(G) linear programming relaxation. This is followed by an adaptive tightening of the relaxation where we incrementally add higher-order interactions to enforce proper marginalization over groups of variables. Computing the best interaction to add is an NP-hard problem. We show good solutions to this problem can be readily obtained from “local primal-dual gaps” given the current primal solution and a dual reparameterization vector. This is not only extremely efficient, but in contrast to previous approaches, also allows us to search over prohibitively large sets of candidate interactions to add. We demonstrate the superiority of our approach on MAP-MRF inference problems encountered in computer vision.

Unsupervised Supervised Learning II: Margin-Based Classification without Labels

Tue, 14 Jun 2011 00:00:00 +0000

Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing margin-based risk functions. Traditionally, these risk functions are computed based on a labeled dataset. We develop a novel technique for estimating such risks using only unlabeled data and knowledge of $p(y)$. We prove that the proposed risk estimator is consistent on high-dimensional datasets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers using exclusively unlabeled data.

Statistical Optimization of Non-Negative Matrix Factorization

Tue, 14 Jun 2011 00:00:00 +0000

Non-Negative Matrix Factorization (NMF) is a dimensionality reduction method that has been shown to be very useful for a variety of tasks in machine learning and data mining. One of the fastest algorithms for NMF is the Block Principal Pivoting method (BPP) of Kim & Park (2008b), which follows a block coordinate descent approach. The optimization in each iteration involves solving a large number of expensive least squares problems. Taking the view that the design matrix was generated by a stochastic process, and using the asymptotic normality of the least squares estimator, we propose a method for improving the performance of BPP. Our method starts with a small subset of the columns and rows of the original matrix and uses frequentist hypothesis tests to adaptively increase the size of the problem. This achieves two objectives: 1) during the initial phase of the algorithm we solve far fewer, much smaller sized least squares problems and 2) all hypothesis tests failing while using all the data represents a principled, automatic stopping criterion. Our experiments on three real world datasets show that our algorithm significantly improves the performance of the original BPP algorithm.

Dynamic Policy Programming with Function Approximation

Tue, 14 Jun 2011 00:00:00 +0000

In this paper, we consider the problem of planning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative method, called dynamic policy programming (DPP), which updates the parametrized policy by a Bellman-like iteration. For the discrete state-action case, we establish sup-norm loss bounds for the performance of the policy induced by DPP and prove that it asymptotically converges to the optimal policy. Then, we generalize our approach to large-scale (continuous) state-action problems using function approximation techniques. We provide sup-norm performance-loss bounds for approximate DPP and compare these bounds with the standard results from approximate dynamic programming (ADP), showing that approximate DPP results in a tighter asymptotic bound than standard ADP methods. We also numerically compare the performance of DPP to other ADP and RL methods. We observe that approximate DPP asymptotically outperforms other methods on the mountain-car problem.
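
For orientation, the classical Bellman iteration that DPP is described as resembling can be sketched as plain value iteration on a finite MDP. This is not the DPP update itself, which the abstract does not spell out; the function name `value_iteration`, the data layout of `P` and `R`, and the two-state toy problem are all assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Standard value iteration (not DPP) on a finite discounted MDP.

    P[s][a] is a list of (next_state, probability) pairs,
    R[s][a] is the immediate reward, gamma is the discount factor.
    Returns the value function and a greedy policy.
    """
    n_states = len(P)
    V = [0.0] * n_states
    Q = []
    for _ in range(max_iter):
        # Bellman backup: Q(s, a) = r(s, a) + gamma * E[V(s')]
        Q = [[R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
              for a in range(len(P[s]))]
             for s in range(n_states)]
        V_new = [max(q) for q in Q]
        converged = max(abs(v - w) for v, w in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the last action values
    policy = [max(range(len(q)), key=q.__getitem__) for q in Q]
    return V, policy


# Hypothetical two-state toy problem: staying in state 1 pays reward 1.
P = [
    [[(0, 1.0)], [(1, 1.0)]],  # state 0: action 0 stays, action 1 moves to 1
    [[(1, 1.0)], [(0, 1.0)]],  # state 1: action 0 stays, action 1 moves to 0
]
R = [
    [0.0, 0.0],
    [1.0, 0.0],
]
V, policy = value_iteration(P, R, gamma=0.5)
```

In this toy problem the optimal behaviour is to move to state 1 and stay there, so the values converge to $V(1) = 1/(1-\gamma) = 2$ and $V(0) = \gamma V(1) = 1$.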

Polytope samplers for inference in ill-posed inverse problems

Tue, 14 Jun 2011 00:00:00 +0000

We consider linear ill-posed inverse problems $y=Ax$, in which we want to infer many count parameters $x$ from few count observations $y$, where the matrix $A$ is binary and has some unimodularity property. Such problems are typical in applications such as contingency table analysis and network tomography (on which we present testing results). These properties of $A$ have a geometrical implication for the solution space: it is a convex integer polytope. We develop a novel approach to characterize this polytope in terms of its vertices, by taking advantage of the geometrical intuitions behind the Hermite normal form decomposition of the matrix $A$, and of a newly defined pivoting operation to travel across vertices. Next, we use this characterization to develop three (exact) polytope samplers for $x$ with emphasis on uniform distributions. We showcase one of these samplers on simulated and real data.
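
To make the solution space concrete, the following minimal sketch enumerates every non-negative integer $x$ with $Ax = y$ for a tiny binary $A$ and then draws one solution uniformly. It is a brute-force illustration only, not one of the samplers from the paper (which avoid enumeration); the function name `integer_solutions` and the small matrix are assumptions.

```python
import itertools
import random


def integer_solutions(A, y):
    """Enumerate all non-negative integer x with A x = y by brute force.

    Assumes A is binary with no all-zero column, so each x_j is bounded by
    the smallest y_i among the rows i where A[i][j] == 1.
    """
    n_rows, n_cols = len(A), len(A[0])
    upper = [min(y[i] for i in range(n_rows) if A[i][j])
             for j in range(n_cols)]
    solutions = []
    for x in itertools.product(*(range(u + 1) for u in upper)):
        if all(sum(A[i][j] * x[j] for j in range(n_cols)) == y[i]
               for i in range(n_rows)):
            solutions.append(x)
    return solutions


# Hypothetical example: 3 unknown counts, 2 observed counts.
A = [[1, 1, 0],
     [0, 1, 1]]
y = [2, 1]
solutions = integer_solutions(A, y)   # [(1, 1, 0), (2, 0, 1)]

# Exact uniform draw over the enumerated integer points.
sample = random.Random(0).choice(solutions)
```

Enumeration grows exponentially with the number of unknowns, which is why a vertex-based characterization of the polytope is attractive for realistic problem sizes.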

Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text

Tue, 14 Jun 2011 00:00:00 +0000

We present the time-dependent topic-cluster model, a hierarchical approach for combining Latent Dirichlet Allocation and clustering via the Recurrent Chinese Restaurant Process. It inherits the advantages of both of its constituents, namely interpretability and concise representation. We show how it can be applied to streaming collections of objects such as real-world feeds in a news portal. We provide details of a parallel Sequential Monte Carlo algorithm to perform inference in the resulting graphical model, which scales to hundreds of thousands of documents.

Linear-Time Estimators for Propensity Scores

Tue, 14 Jun 2011 00:00:00 +0000

We present linear-time estimators for three popular covariate shift correction and propensity scoring algorithms: logistic regression (LR), kernel mean matching (KMM), and maximum entropy mean matching (MEMM). This allows applications in situations where both treatment and control groups are large. We also show that the last two algorithms differ only in their choice of regularizer ($\ell_2$ of the Radon-Nikodym derivative vs. maximum entropy). Experiments show that all methods scale well.
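
As a rough illustration of the logistic regression route to propensity scores, the sketch below fits $p(t=1 \mid x)$ on one-dimensional covariates and converts the fitted odds into importance weights for the control group. It is a textbook construction under stated assumptions, not the estimator from the paper: the function names, the full-batch gradient descent, the step size, and the synthetic Gaussian data are all hypothetical. Each pass over the data costs time linear in the number of samples.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ts, lr=0.5, epochs=500):
    """Fit p(t=1 | x) = sigmoid(w*x + b) by full-batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, t in zip(xs, ts):
            err = sigmoid(w * x + b) - t   # gradient of the log loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def control_weights(xs_control, w, b, n_treated, n_control):
    """Odds e(x) / (1 - e(x)) = exp(w*x + b), rescaled by the group sizes
    to estimate the density ratio p(x | treated) / p(x | control)."""
    ratio = n_control / n_treated
    return [ratio * math.exp(w * x + b) for x in xs_control]


# Synthetic data (assumption): treated ~ N(1, 1), control ~ N(0, 1).
rng = random.Random(0)
treated = [rng.gauss(1.0, 1.0) for _ in range(200)]
control = [rng.gauss(0.0, 1.0) for _ in range(200)]
w, b = fit_logistic(treated + control, [1] * 200 + [0] * 200)
weights = control_weights(control, w, b, len(treated), len(control))
```

Control points that look like treated points (here, larger $x$) receive larger weights, which is the reweighting that covariate shift correction relies on.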

Generative Kernels for Exponential Families

Tue, 14 Jun 2011 00:00:00 +0000

In this paper, we propose a family of kernels for data distributions belonging to the exponential family. We call these kernels generative kernels because they take into account the generative process of the data. Our proposed method considers the geometry of the data distribution to build a set of efficient closed-form kernels best suited for that distribution. We compare our generative kernels on multinomial data and observe improved empirical performance across the board. Moreover, our generative kernels perform significantly better when the training set is small, an important property of generative models.

Discussion of “Learning Scale Free Networks by Reweighted $\ell_1$ regularization”

Tue, 14 Jun 2011 00:00:00 +0000

Discussion of Learning Scale Free Networks by Reweighted $\ell_1$ regularization.