Proceedings of Machine Learning Research

Greedy Bilateral Sketch, Completion & Smoothing

Mon, 29 Apr 2013 00:00:00 +0000

Recovering a large low-rank matrix from highly corrupted, incomplete, or sparse-outlier-overwhelmed observations is the crux of various intriguing statistical problems. We explore the power of the "greedy bilateral (GreB)" paradigm in reducing both time and sample complexities for solving these problems. GreB models a low-rank variable as a bilateral factorization, and updates the left and right factors in a mutually adaptive and greedy incremental manner. We detail how to model and solve low-rank approximation, matrix completion and robust PCA in GreB’s paradigm. With their MATLAB implementations, approximating a noisy 10000x10000 matrix of rank 500 with SVD accuracy takes 6s; the MovieLens10M matrix of size 69878x10677 can be completed in 10s from 30% of 10^7 ratings with RMSE 0.86 on the remaining 70%; the low-rank background and sparse moving outliers in a 120x160 video of 500 frames are accurately separated in 1s. This brings 30 to 100 times acceleration in solving these popular statistical problems.

Learning Social Infectivity in Sparse Low-rank Networks Using Multi-dimensional Hawkes Processes

Mon, 29 Apr 2013 00:00:00 +0000

How will the behaviors of individuals in a social network be influenced by their neighbors, the authorities and the communities? Such knowledge is often hidden from us and we only observe its manifestation in the form of recurrent and time-stamped events occurring at the individuals involved. It is an important yet challenging problem to infer the network of social influence based on the temporal patterns of these historical events. We propose a convex optimization approach to discover the hidden network of social influence by modeling the recurrent events at different individuals as multi-dimensional Hawkes processes. Furthermore, our estimation procedure, using nuclear and \ell_1 norm regularization simultaneously on the parameters, is able to take into account the prior knowledge of the presence of neighbor interaction, authority influence, and community coordination. To efficiently solve the problem, we also design an algorithm, ADM4, which combines techniques of the alternating direction method of multipliers and majorization minimization. We experimented with both synthetic and real-world data sets, and showed that the proposed method can discover the hidden network more accurately and produce a better predictive model.
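
As a minimal sketch of the model class named in this abstract, the snippet below evaluates the conditional intensity of one dimension of a multi-dimensional Hawkes process. The exponential triggering kernel, the `decay` parameter, and all names (`hawkes_intensity`, `mu`, `A`, `events`) are illustrative assumptions, not taken from the paper; `A[i][j]` plays the role of the infectivity of individual j on individual i, the kind of matrix that the low-rank and sparse regularization would act on.

```python
import math

def hawkes_intensity(t, i, mu, A, events, decay=1.0):
    """Intensity of dimension i at time t (exponential kernel is an assumption).

    lambda_i(t) = mu[i] + sum over past events (t_k, j) of
                  A[i][j] * exp(-decay * (t - t_k))
    """
    rate = mu[i]
    for t_k, j in events:
        if t_k < t:  # only events strictly before t can excite the process
            rate += A[i][j] * math.exp(-decay * (t - t_k))
    return rate

# Hypothetical two-individual example: base rates and infectivity matrix.
mu = [0.1, 0.2]
A = [[0.0, 0.5],
     [0.3, 0.0]]
# Each event is (timestamp, individual at which it occurred).
events = [(1.0, 1), (2.0, 0)]

lam0 = hawkes_intensity(3.0, 0, mu, A, events)
lam1 = hawkes_intensity(3.0, 1, mu, A, events)
```

Here individual 0 is excited only by the earlier event at individual 1, and vice versa, because the diagonal of `A` is zero in this toy setup.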

Dual Decomposition for Joint Discrete-Continuous Optimization

Mon, 29 Apr 2013 00:00:00 +0000

We analyse convex formulations for combined discrete-continuous MAP inference using the dual decomposition method. As a consequence we can provide a more intuitive derivation for the resulting convex relaxation than presented in the literature. Further, we show how to strengthen the relaxation by reparametrizing the potentials, hence convex relaxations for discrete-continuous inference do not share an important feature of LP relaxations for discrete labeling problems: incorporating unary potentials into higher order ones affects the quality of the relaxation. We argue that the convex model for discrete-continuous inference is very general and can be used as an alternative to the alternation-based methods often employed for such joint inference tasks.

Bethe Bounds and Approximating the Global Optimum

Mon, 29 Apr 2013 00:00:00 +0000

Inference in general Markov random fields (MRFs) is NP-hard, though identifying the maximum a posteriori (MAP) configuration of pairwise MRFs with submodular cost functions is efficiently solvable using graph cuts. Marginal inference, however, even for this restricted class, is #P-hard. Restricting to binary pairwise models, we prove new formulations of derivatives of the Bethe free energy, provide bounds on the derivatives and bracket the locations of stationary points. Several results apply whether the model is associative or not. Applying these to discretized pseudo-marginals in the associative case, we present a polynomial time approximation scheme for global optimization of the Bethe free energy provided the maximum degree ∆=O(\log n), where n is the number of variables. Runtime is guaranteed O(ε^{-3/2} n^6 Σ^{3/4} Ω^{3/2}), where Σ=O(∆/n) is the fraction of possible edges present and Ω is a function of MRF parameters. We examine use of the algorithm in practice, demonstrating runtime that is typically much faster, and discuss several extensions.

Block Regularized Lasso for Multivariate Multi-Response Linear Regression

Mon, 29 Apr 2013 00:00:00 +0000

The multivariate multi-response (MVMR) linear regression problem is investigated, in which design matrices can be distributed differently across K linear regressions. The support union of K p-dimensional regression vectors is recovered via block regularized Lasso, which uses the l_1/l_2 norm for regression vectors across K tasks. Sufficient and necessary conditions to guarantee successful recovery of the support union are characterized. More specifically, it is shown that under certain conditions on the distributions of design matrices, if n > c_p1 ψ(B^*,Σ^(1:K))\log(p-s) where c_p1 is a constant and s is the size of the support set, then the l_1/l_2 regularized Lasso correctly recovers the support union; and if n < c_p2 ψ(B^*,Σ^(1:K))\log(p-s) where c_p2 is a constant, then the l_1/l_2 regularized Lasso fails to recover the support union. In particular, ψ(B^*,Σ^(1:K)) captures the sparsity of K regression vectors and the statistical properties of the design matrices. Numerical results are provided to demonstrate the advantages of joint support union recovery using the multi-task Lasso over studying each problem individually.
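
To make the l_1/l_2 block penalty concrete, here is a small sketch of its proximal operator (row-wise block soft-thresholding) and of reading off the support union from a coefficient matrix. This is a standard building block for group-sparse solvers, not the estimator or analysis from the paper; the function names and the layout (rows are features, columns are the K tasks) are assumptions made for illustration.

```python
import math

def block_soft_threshold(B, lam):
    """Row-wise proximal operator of lam * sum_j ||B[j]||_2.

    Each row (one feature across all K tasks) is shrunk toward zero as a
    block, so a feature is either kept for all tasks or dropped for all.
    """
    out = []
    for row in B:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out

def support_union(B, tol=1e-12):
    """Indices of features that are nonzero in at least one task."""
    return [j for j, row in enumerate(B) if any(abs(v) > tol for v in row)]

# Hypothetical 3-feature, 2-task coefficient matrix.
B = [[3.0, 4.0],   # row norm 5.0, survives shrinkage with lam = 1
     [0.3, 0.4],   # row norm 0.5, zeroed out as a block
     [0.0, 0.0]]
shrunk = block_soft_threshold(B, 1.0)
kept = support_union(shrunk)
```

The point of the block structure is visible in the second row: both task coefficients are removed together, which is what ties the K regressions to a shared support.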

Collapsed Variational Bayesian Inference for Hidden Markov Models

Mon, 29 Apr 2013 00:00:00 +0000

Approximate inference for Bayesian models is dominated by two approaches, variational Bayesian inference and Markov Chain Monte Carlo. Both approaches have their own advantages and disadvantages, and they can complement each other. Recently researchers have proposed collapsed variational Bayesian inference to combine the advantages of both. Such inference methods have been successful in several models whose hidden variables are conditionally independent given the parameters. In this paper we propose two collapsed variational Bayesian inference algorithms for hidden Markov models, a popular framework for representing time series data. We validate our algorithms on the natural language processing task of unsupervised part-of-speech induction, showing that they are both more computationally efficient than sampling, and more accurate than standard variational Bayesian inference for HMMs.

Sparse Principal Component Analysis for High Dimensional Multivariate Time Series

Mon, 29 Apr 2013 00:00:00 +0000

We study sparse principal component analysis (sparse PCA) for high dimensional multivariate vector autoregressive (VAR) time series. By treating the transition matrix as a nuisance parameter, we show that sparse PCA can be directly applied to analyzing multivariate time series as if the data were generated i.i.d. Under a double asymptotic framework in which both the length of the sample period T and dimensionality d of the time series can increase (with possibly d≫T), we provide explicit rates of convergence of the angle between the estimated and population leading eigenvectors of the time series covariance matrix. Our results suggest that the spectral norm of the transition matrix plays a pivotal role in determining the final rates of convergence. Implications of such a general result are further illustrated using concrete examples. The results of this paper have implications for different applications, including financial time series, biomedical imaging, and social media.

On the Asymptotic Optimality of Maximum Margin Bayesian Networks

Mon, 29 Apr 2013 00:00:00 +0000

Maximum margin Bayesian networks (MMBNs) are Bayesian networks with discriminatively optimized parameters. They have shown good classification performance in various applications. However, there has not been any theoretic analysis of their asymptotic performance, e.g. their Bayes consistency. For specific classes of MMBNs, i.e. MMBNs with fully connected graphs and discrete-valued nodes, we show Bayes consistency for binary-class problems and a sufficient condition for Bayes consistency in the multi-class case. We provide simple examples showing that MMBNs in their current formulation are not Bayes consistent in general. These examples are especially interesting, as the model used for the MMBNs can represent the assumed true distributions. This indicates that the current formulations of MMBNs may be deficient. Furthermore, experimental results on the generalization performance are presented.

Supervised Sequential Classification Under Budget Constraints

Mon, 29 Apr 2013 00:00:00 +0000

In this paper we develop a framework for sequential decision making under budget constraints for multi-class classification. In many classification systems, such as medical diagnosis and homeland security, sequential decisions are often warranted. For each instance, a sensor is first chosen for acquiring measurements, and then, based on the available information, one decides either to reject (that is, to seek more measurements from a new sensor/modality) or to terminate by classifying the example based on the available information. Different sensors have varying costs for acquisition, and these costs account for delay, throughput or monetary value. Consequently, we seek methods for maximizing performance of the system subject to budget constraints. We formulate a multi-stage multi-class empirical risk objective and learn sequential decision functions from training data. We show that the reject decision at each stage can be posed as supervised binary classification. We derive bounds for the VC dimension of the multi-stage system to quantify the generalization error. We compare our approach to alternative strategies on several multi-class real-world datasets.

Completeness Results for Lifted Variable Elimination

Mon, 29 Apr 2013 00:00:00 +0000

Lifting aims at improving the efficiency of probabilistic inference by exploiting symmetries in the model. Various methods for lifted probabilistic inference have been proposed, but our understanding of these methods and the relationships between them is still limited, compared to their propositional counterparts. The only existing theoretical characterization of lifting is a completeness result for weighted first-order model counting. This paper addresses the question whether the same completeness result holds for other lifted inference algorithms. We answer this question positively for lifted variable elimination (LVE). Our proof relies on introducing a novel inference operator for LVE.

Statistical Tests for Contagion in Observational Social Network Studies

Mon, 29 Apr 2013 00:00:00 +0000

Current tests for contagion in social network studies are vulnerable to the confounding effects of latent homophily (i.e., ties form preferentially between individuals with similar hidden traits). We demonstrate a general method to lower bound the strength of causal effects in observational social network studies, even in the presence of arbitrary, unobserved individual traits. Our tests require no parametric assumptions and each test is associated with an algebraic proof. We demonstrate the effectiveness of our approach by correctly deducing the causal effects for examples previously shown to expose defects in existing methodology. Finally, we discuss preliminary results on data taken from the Framingham Heart Study.

Central Limit Theorems for Conditional Markov Chains

Mon, 29 Apr 2013 00:00:00 +0000

This paper studies Central Limit Theorems for real-valued functionals of Conditional Markov Chains. Using a classical result by Dobrushin (1956) for non-stationary Markov chains, a conditional Central Limit Theorem for fixed sequences of observations is established. The asymptotic variance can be estimated by resampling the latent states conditional on the observations. If the conditional means themselves are asymptotically normally distributed, an unconditional Central Limit Theorem can be obtained. The methodology is used to construct a statistical hypothesis test which is applied to synthetically generated environmental data.

Changepoint Detection over Graphs with the Spectral Scan Statistic

Mon, 29 Apr 2013 00:00:00 +0000

We consider the change-point detection problem of deciding, based on noisy measurements, whether an unknown signal over a given graph is constant or is instead piecewise constant over two induced subgraphs of relatively low cut size. We analyze the corresponding generalized likelihood ratio (GLR) statistic and relate it to the problem of finding a sparsest cut in a graph. We develop a tractable relaxation of the GLR statistic based on the combinatorial Laplacian of the graph, which we call the spectral scan statistic, and analyze its properties. We show how its performance as a testing procedure depends directly on the spectrum of the graph, and use this result to explicitly derive its asymptotic properties on a few graph topologies. Finally, we demonstrate both theoretically and by simulations that the spectral scan statistic can outperform naive testing procedures based on edge thresholding and χ^2 testing.

Detecting Activations over Graphs using Spanning Tree Wavelet Bases

Mon, 29 Apr 2013 00:00:00 +0000

We consider the detection of clusters of activation over graphs under Gaussian noise. This problem appears in many real-world scenarios, such as detecting contamination or seismic activity by sensor networks, viruses in human and computer networks, and groups with anomalous behavior in social and biological networks. Despite the wide applicability of such a detection algorithm, there has been little success in the development of computationally feasible methods with provable theoretical guarantees. To this end, we introduce the spanning tree wavelet basis over a graph, a localized basis that reflects the topology of the graph. We first provide a necessary condition for asymptotic distinguishability of the null and alternative hypotheses. Then we prove that for any spanning tree, we can hope to correctly detect signals in a low signal-to-noise regime using spanning tree wavelets. We propose a randomized test, in which we use a uniform spanning tree in the basis construction. Using electrical network theory, we show that the uniform spanning tree provides strong guarantees that in many cases match our necessary condition. We prove that for edge transitive graphs, k-nearest neighbor graphs, and ε-graphs we obtain nearly optimal performance with the uniform spanning tree wavelet detector.

A recursive estimate for the predictive likelihood in a topic model

Mon, 29 Apr 2013 00:00:00 +0000

We consider the problem of evaluating the predictive log likelihood of a previously unseen document under a topic model. This task arises when cross-validating for a model hyperparameter, when testing a model on a hold-out set, and when comparing the performance of different fitting strategies. Yet it is known to be very challenging, as it is equivalent to estimating a marginal likelihood in Bayesian model selection. We propose a fast algorithm for approximating this likelihood, one whose computational cost is linear both in document length and in the number of topics. The method is a first-order approximation to the algorithm of Carvalho et al. (2010), and can also be interpreted as a one-particle, Rao-Blackwellized version of the "left-to-right" method of Wallach et al. (2009). On our test examples, the proposed method gives similar answers to these other methods, but at lower computational cost.

Localization and Adaptation in Online Learning

Mon, 29 Apr 2013 00:00:00 +0000

We introduce a formalism of localization for online learning problems, which, similarly to statistical learning theory, can be used to obtain fast rates. In particular, we introduce local sequential Rademacher complexities and other local measures. Based on the idea of relaxations for deriving algorithms, we provide a template method that takes advantage of localization. Furthermore, we build a general adaptive method that can take advantage of the suboptimality of the observed sequence. We illustrate the utility of the introduced concepts on several problems. Among them is a novel upper bound on regret in terms of classical Rademacher complexity when the data are i.i.d.

Distribution-Free Distribution Regression

Mon, 29 Apr 2013 00:00:00 +0000

Distribution regression refers to the situation where a response Y depends on a covariate P where P is a probability distribution. The model is Y=f(P) + e where f is an unknown regression function and e is a random error. Typically, we do not observe P directly, but rather, we observe a sample from P. In this paper we develop theory and methods for distribution-free versions of distribution regression. This means that we do not make strong distributional assumptions about the error term e and covariate P. We prove that when the effective dimension is small enough (as measured by the doubling dimension), then the excess prediction risk converges to zero with a polynomial rate.

Random Projections for Support Vector Machines

Mon, 29 Apr 2013 00:00:00 +0000

Let X be a data matrix of rank ρ, representing n points in d-dimensional space. The linear support vector machine constructs a hyperplane separator that maximizes the 1-norm soft margin. We develop a new oblivious dimension reduction technique which is precomputed and can be applied to any input matrix X. We prove that, with high probability, the margin and minimum enclosing ball in the feature space are preserved to within ε-relative error, ensuring comparable generalization as in the original space. We present extensive experiments with real and synthetic data to support our theory.
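
The idea of an oblivious projection (one drawn without looking at X) can be sketched with a dense Gaussian matrix. Note the assumption: the abstract does not say which projection the paper uses, so the Gaussian construction, the 1/sqrt(r) scaling, and the names `gaussian_projection` and `project` below are illustrative choices only.

```python
import math
import random

def gaussian_projection(d, r, seed=0):
    """Draw a d x r matrix with i.i.d. N(0, 1/r) entries, independent of any data."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(r)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(r)] for _ in range(d)]

def project(X, R):
    """Map each d-dimensional row of X to r dimensions via X @ R."""
    r = len(R[0])
    return [[sum(x[k] * R[k][j] for k in range(len(x))) for j in range(r)]
            for x in X]

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Hypothetical data: two points in d = 400 dimensions, reduced to r = 200.
d, r = 400, 200
data_rng = random.Random(1)
X = [[data_rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(2)]

R = gaussian_projection(d, r)   # precomputed once, reusable for any X
Z = project(X, R)

# Ratio of projected to original squared distance; close to 1 in expectation.
ratio = squared_distance(Z[0], Z[1]) / squared_distance(X[0], X[1])
```

Because `R` never depends on `X`, the same matrix can be generated ahead of time and applied to any input, which is the sense of "oblivious" in the abstract.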

Bayesian Structure Learning for Functional Neuroimaging

Mon, 29 Apr 2013 00:00:00 +0000

Predictive modeling of functional neuroimaging data has become an important tool for analyzing cognitive structures in the brain. Brain images are high-dimensional and exhibit large correlations, and imaging experiments provide a limited number of samples. Therefore, capturing the inherent statistical properties of the imaging data is critical for robust inference. Previous methods tackle this problem by exploiting either spatial sparsity or smoothness, which does not fully exploit the structure in the data. Here we develop a flexible, hierarchical model designed to simultaneously capture spatial block sparsity and smoothness in neuroimaging data. We exploit a function domain representation for the high-dimensional small-sample data and develop efficient inference, parameter estimation, and prediction procedures. Empirical results with simulated and real neuroimaging data suggest that simultaneously capturing the block sparsity and smoothness properties can significantly improve structure recovery and predictive modeling performance.

High-dimensional Inference via Lipschitz Sparsity-Yielding Regularizers

Mon, 29 Apr 2013 00:00:00 +0000

Non-convex regularizers are increasingly applied to high-dimensional inference with sparsity prior knowledge. In general, non-convex regularizers are superior to convex ones in inference, but they suffer from the difficulties brought by local optima and massive computation. A "good" regularizer should perform well in both inference and optimization. In this paper, we prove that some non-convex regularizers can be such "good" regularizers. They are a family of sparsity-yielding penalties with proper Lipschitz subgradients. These regularizers keep the superiority of non-convex regularizers in inference. Their estimation conditions based on sparse eigenvalues are weaker than those of the convex regularizers. Meanwhile, if properly tuned, they behave like convex regularizers since standard proximal methods are guaranteed to give stationary solutions. These stationary solutions, if sparse enough, are identical to the global solutions. If the solution sequence provided by proximal methods is along a sparse path, the convergence rate to the global optimum is on the order of 1/k where k is the number of iterations.

Efficient Variational Inference for Gaussian Process Regression Networks

Mon, 29 Apr 2013 00:00:00 +0000

In multi-output regression applications the correlations between the response variables may vary with the input space and can be highly non-linear. Gaussian process regression networks (GPRNs) are flexible and effective models to represent such complex adaptive output dependencies. However, inference in GPRNs is intractable. In this paper we propose two efficient variational inference methods for GPRNs. The first method, GPRN-MF, adopts a mean-field approach with full Gaussians over the GPRN’s parameters as its factorizing distributions. The second method, GPRN-NPV, uses a nonparametric variational inference approach. We derive analytical forms for the evidence lower bound on both methods, which we use to learn the variational parameters and the hyper-parameters of the GPRN model. We obtain closed-form updates for the parameters of GPRN-MF and show that, while having relatively complex approximate posterior distributions, our approximate methods require the estimation of O(N) variational parameters rather than O(N^2) for the parameters’ covariances. Our experiments on real data sets show that GPRN-NPV may give a better approximation to the posterior distribution compared to GPRN-MF, in terms of both predictive performance and stability.

Competing with an Infinite Set of Models in Reinforcement Learning

Mon, 29 Apr 2013 00:00:00 +0000

We consider a reinforcement learning setting where the learner also has to deal with the problem of finding a suitable state-representation function from a given set of models. This has to be done while interacting with the environment in an online fashion (no resets), and the goal is to have small regret with respect to any Markov model in the set. For this setting, recently the BLB algorithm has been proposed, which achieves regret of order T^{2/3}, provided that the given set of models is finite. Our first contribution is to extend this result to a countably infinite set of models. Moreover, the BLB regret bound suffers from an additive term that can be exponential in the diameter of the MDP involved, since the diameter has to be guessed. The algorithm we propose avoids guessing the diameter, thus improving the regret bound.

A Last-Step Regression Algorithm for Non-Stationary Online Learning

Mon, 29 Apr 2013 00:00:00 +0000

The goal of a learner in standard online learning is to maintain an average loss close to the loss of the best-performing single function in some class. In many real-world problems, such as rating or ranking items, there is no single best target function during the runtime of the algorithm; instead, the best (local) target function drifts over time. We develop a novel last-step minmax optimal algorithm in the context of drift. We analyze the algorithm in the worst-case regret framework and show that it maintains an average loss close to that of the best slowly changing sequence of linear functions, as long as the total drift is sublinear. In some situations, our bound improves over existing bounds, and additionally the algorithm suffers logarithmic regret when there is no drift. We also build on the H∞ filter and its bound, and develop and analyze a second algorithm for the drifting setting. Synthetic simulations demonstrate the advantages of our algorithms in a worst-case constant drift setting.

Distributed Learning of Gaussian Graphical Models via Marginal Likelihoods

Mon, 29 Apr 2013 00:00:00 +0000

We consider distributed estimation of the inverse covariance matrix, also called the concentration matrix, in Gaussian graphical models. Traditional centralized estimation often requires iterative and expensive global inference and is therefore difficult in large distributed networks. In this paper, we propose a general framework for distributed estimation based on a maximum marginal likelihood (MML) approach. Each node independently computes a local estimate by maximizing a marginal likelihood defined with respect to data collected from its local neighborhood. Due to the non-convexity of the MML problem, we derive and consider solving a convex relaxation. The local estimates are then combined into a global estimate without the need for iterative message-passing between neighborhoods. We prove that this relaxed MML estimator is asymptotically consistent. Through numerical experiments on several synthetic and real-world data sets, we demonstrate that the two-hop version of the proposed estimator is significantly better than the one-hop version, and nearly closes the gap to the centralized maximum likelihood estimator in many situations.

Thompson Sampling in Switching Environments with Bayesian Online Change Detection

Mon, 29 Apr 2013 00:00:00 +0000

Thompson Sampling has recently been shown to achieve the lower bound on regret in the Bernoulli Multi-Armed Bandit setting. This bandit problem assumes stationary distributions for the rewards. It is often unrealistic to model the real world as a stationary distribution. In this paper we derive and evaluate algorithms using Thompson Sampling for a Switching Multi-Armed Bandit Problem. We propose a Thompson Sampling strategy equipped with a Bayesian change point mechanism to tackle this problem. We develop algorithms for a variety of cases with a constant switching rate: when all arms change at each switch (Global Switching), when switching occurs independently for each arm (Per-Arm Switching), when the switching rate is known, and when it must be inferred from data. This leads to a family of algorithms we collectively term Change-Point Thompson Sampling (CTS). We show empirical results in 4 artificial environments and 2 derived from real-world data (news click-through and foreign exchange data), comparing them to some other bandit algorithms. On the real-world data, CTS is the most effective.
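
For orientation, the following is a minimal sketch of plain Thompson Sampling for the stationary Bernoulli bandit that the abstract starts from. It is not the CTS family described above: there is no change-point mechanism, and the Beta(1, 1) priors, the function name `thompson_bernoulli`, and the example reward probabilities are assumptions chosen for illustration.

```python
import random

def thompson_bernoulli(true_probs, horizon, seed=0):
    """Stationary Bernoulli Thompson Sampling with Beta(1, 1) priors.

    Returns how many times each arm was pulled over `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    wins = [0] * k     # observed rewards of 1 per arm
    losses = [0] * k   # observed rewards of 0 per arm
    pulls = [0] * k
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior.
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        # Simulated environment: Bernoulli reward with the arm's true mean.
        reward = 1 if rng.random() < true_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical two-armed problem with well-separated means.
pulls = thompson_bernoulli([0.1, 0.9], horizon=2000)
```

In a switching environment the posterior counts above would keep reflecting stale data after a change, which is the gap a change-point mechanism is meant to address.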

Estimating the Partition Function of Graphical Models Using Langevin Importance Sampling

Mon, 29 Apr 2013 00:00:00 +0000

Graphical models are powerful in modeling a variety of applications. Computing the partition function of a graphical model is a typical inference problem and is known to be NP-hard for general graphs. A few sampling algorithms, such as MCMC, Simulated Annealing Sampling (SAS), and Annealed Importance Sampling (AIS), have been developed to address this challenging problem. This paper describes a Langevin Importance Sampling (LIS) algorithm to compute the partition function of a graphical model. LIS first performs a random walk in the configuration-temperature space guided by the Langevin equation and then estimates the partition function using all the samples generated during the random walk, as opposed to the other configuration-temperature sampling methods, which use only the samples at a specific temperature. Experimental results show that LIS can obtain a much more accurate partition function than the other methods tested on several different types of graphical models. LIS performs especially well on relatively large graph models or those with a large number of local optima.
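
To show what quantity is being estimated, here is a brute-force computation of the partition function for a tiny pairwise model over ±1 variables. This is not LIS or any of the sampling methods named above; it is an exact enumeration whose cost grows as 2^n, which is why sampling estimators are needed at scale. The Ising-style parameterization and the names `partition_function`, `fields`, and `couplings` are assumptions for illustration.

```python
import itertools
import math

def partition_function(n, fields, couplings):
    """Exact Z = sum over x in {-1, +1}^n of
    exp(sum_i fields[i] * x_i + sum_{(i, j)} J_ij * x_i * x_j).

    Enumerates all 2^n configurations, so it is only usable for small n.
    """
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        score = sum(fields[i] * x[i] for i in range(n))
        score += sum(J * x[i] * x[j] for (i, j), J in couplings.items())
        total += math.exp(score)
    return total

# Two coupled spins with no external field: Z = 2e^{0.5} + 2e^{-0.5}.
Z_pair = partition_function(2, [0.0, 0.0], {(0, 1): 0.5})

# Three independent, unbiased spins: every configuration has weight 1.
Z_free = partition_function(3, [0.0, 0.0, 0.0], {})
```

A sampling-based estimate on a small model like this can be checked against the exact value before moving to graphs where enumeration is infeasible.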

Fast Near-GRID Gaussian Process Regression

Mon, 29 Apr 2013 00:00:00 +0000

Gaussian process regression (GPR) is a powerful non-linear technique for Bayesian inference and prediction. One drawback is its O(N^3) computational complexity for both prediction and hyperparameter estimation for N input points, which has led to much work on sparse GPR methods. When the covariance function is expressible as a tensor product kernel (TPK) and the inputs form a multidimensional grid, it was shown that the costs for exact GPR can be reduced to a sub-quadratic function of N. We extend these exact fast algorithms to sparse GPR and remark on a connection to Gaussian process latent variable models (GPLVMs). In practice, the inputs may also violate the multidimensional grid constraints, so we pose and efficiently solve missing and extra data problems for both exact and sparse grid GPR. We demonstrate our method on synthetic, text scan, and magnetic resonance imaging (MRI) data reconstructions.

Texture Modeling with Convolutional Spike-and-Slab RBMs and Deep Extensions

Mon, 29 Apr 2013 00:00:00 +0000

We apply the spike-and-slab Restricted Boltzmann Machine (ssRBM) to texture modeling. The ssRBM with tiled-convolution weight sharing (TssRBM) achieves or surpasses the state-of-the-art on texture synthesis and inpainting by parametric models. We also develop a novel RBM model with a spike-and-slab visible layer and binary variables in the hidden layer. This model is designed to be stacked on top of the ssRBM. We show the resulting deep belief network (DBN) is a powerful generative model that improves on single-layer models and is capable of modeling not only single high-resolution and challenging textures but also multiple textures with fixed-size filters in the bottom layer.

Learning Markov Networks With Arithmetic Circuits

Mon, 29 Apr 2013 00:00:00 +0000

Markov networks are an effective way to represent complex probability distributions. However, learning their structure and parameters or using them to answer queries is typically intractable. One approach to making learning and inference tractable is to use approximations, such as pseudo-likelihood or approximate inference. An alternate approach is to use a restricted class of models where exact inference is always efficient. Previous work has explored low treewidth models, models with tree-structured features, and latent variable models. In this paper, we introduce ACMN, the first ever method for learning efficient Markov networks with arbitrary conjunctive features. The secret to ACMN’s greater flexibility is its use of arithmetic circuits, a linear-time inference representation that can handle many high treewidth models by exploiting local structure. ACMN uses the size of the corresponding arithmetic circuit as a learning bias, allowing it to trade off accuracy and inference complexity. In experiments on 12 standard datasets, the tractable models learned by ACMN are more accurate than both tractable models learned by other algorithms and approximate inference in intractable models.

Dynamic Scaled Sampling for Deterministic Constraints

Mon, 29 Apr 2013 00:00:00 +0000

Deterministic and near-deterministic relationships among subsets of random variables in multivariate systems are known to cause serious problems for Monte Carlo algorithms. We examine the case in which the relationship Z = f(X_1,...,X_k) holds, where each X_i has a continuous prior pdf and we wish to obtain samples from the conditional distribution P(X_1,...,X_k | Z= s). When f is addition, the problem is NP-hard even when the X_i are independent. In more restricted cases — for example, i.i.d. Boolean or categorical X_i — efficient exact samplers have been obtained previously. For the general continuous case, we propose a dynamic scaling algorithm (DYSC), and prove that it has O(k) expected running time and finite variance. We discuss generalizations of DYSC to functions f described by binary operation trees. We evaluate the algorithm on several examples.

Structure Learning of Mixed Graphical Models

Mon, 29 Apr 2013 00:00:00 +0000

We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme is new and follows naturally from a particular parametrization of the model.

Structural Expectation Propagation (SEP): Bayesian structure learning for networks with latent variables

Mon, 29 Apr 2013 00:00:00 +0000

Learning the structure of discrete Bayesian networks has been the subject of extensive research in machine learning, with most Bayesian approaches focusing on fully observed networks. One of the few methods that can handle networks with latent variables is the "structural EM algorithm", which interleaves greedy structure search with the estimation of latent variables and parameters, maintaining a single best network at each step. We introduce Structural Expectation Propagation (SEP), an extension of EP which can infer the structure of Bayesian networks having latent variables and missing data. SEP performs variational inference in a joint model of structure, latent variables, and parameters, offering two advantages: (i) it accounts for uncertainty in structure and parameter values when making local distribution updates, and (ii) it returns a variational distribution over network structures rather than a single network. We demonstrate the performance of SEP both on synthetic problems and on real-world clinical data.

Exact Learning of Bounded Tree-width Bayesian Networks

Mon, 29 Apr 2013 00:00:00 +0000

Inference in Bayesian networks is known to be NP-hard, but if the network has bounded tree-width, then inference becomes tractable. Not surprisingly, learning networks that closely match the given data and have a bounded tree-width has recently attracted some attention. In this paper we aim to lay groundwork for future research on the topic by studying the exact complexity of this problem. We give the first non-trivial exact algorithm for the NP-hard problem of finding an optimal Bayesian network of tree-width at most w, with running time 3^n n^{w + O(1)}, and provide an implementation of this algorithm. Additionally, we propose a variant of Bayesian network learning with “super-structures”, and show that finding a Bayesian network consistent with a given super-structure is fixed-parameter tractable in the tree-width of the super-structure.

Beyond Sentiment: The Manifold of Human Emotions

Mon, 29 Apr 2013 00:00:00 +0000

Sentiment analysis predicts the presence of positive or negative emotions in a text document. In this paper we consider higher dimensional extensions of the sentiment concept, which represent a richer set of human emotions. Our approach goes beyond previous work in that our model contains a continuous manifold rather than a finite set of human emotions. We investigate the resulting model, compare it to psychological observations, and explore its predictive capabilities. Besides obtaining significant improvements over a baseline without manifold, we are also able to visualize different notions of positive sentiment in different domains.

A Parallel, Block Greedy Method for Sparse Inverse Covariance Estimation for Ultra-high Dimensions

Mon, 29 Apr 2013 00:00:00 +0000

Discovering the graph structure of a Gaussian Markov Random Field is an important problem in application areas such as computational biology and atmospheric sciences. This task, which translates to estimating the sparsity pattern of the inverse covariance matrix, has been extensively studied in the literature. However, the existing approaches are unable to handle ultra-high dimensional datasets and there is a crucial need to develop methods that are both highly scalable and memory-efficient. In this paper, we present GINCO, a blocked greedy method for sparse inverse covariance matrix estimation. We also present a detailed description of a highly-scalable and memory-efficient implementation of GINCO, which is able to operate on both shared- and distributed-memory architectures. Our implementation is able to recover the sparsity pattern of 25,000-vertex random and chain graphs with 87% and 84% accuracy in \le 5 minutes using \le 10GB of memory on a single 8-core machine. Furthermore, our method is statistically consistent in recovering the sparsity pattern of the inverse covariance matrix, which we demonstrate through extensive empirical studies.

Diagonal Orthant Multinomial Probit Models

Mon, 29 Apr 2013 00:00:00 +0000

Bayesian classification commonly relies on probit models, with data augmentation algorithms used for posterior computation. By imputing latent Gaussian variables, one can often trivially adapt computational approaches used in Gaussian models. However, MCMC for multinomial probit (MNP) models can be inefficient in practice due to high posterior dependence between latent variables and parameters, and to difficulties in efficiently sampling latent variables when there are more than two categories. To address these problems, we propose a new class of diagonal orthant (DO) multinomial models. The key characteristics of these models include conditional independence of the latent variables given model parameters, avoidance of arbitrary identifiability restrictions, and simple expressions for category probabilities. We show substantially improved computational efficiency and comparable predictive performance to MNP.

Active Learning for Interactive Visualization

Mon, 29 Apr 2013 00:00:00 +0000

Many automatic visualization methods have been proposed. However, a visualization that is automatically generated may differ from how a user wants to arrange the objects in visualization space. By allowing users to re-locate objects in the embedding space of the visualization, they can adjust the visualization to their preference. We propose an active learning framework for interactive visualization which selects objects for the user to re-locate so that they can obtain their desired visualization by re-locating as few objects as possible. The framework is based on an information theoretic criterion, which favors objects that reduce the uncertainty of the visualization. We present a concrete application of the proposed framework to the Laplacian eigenmap visualization method. We demonstrate experimentally that the proposed framework yields the desired visualization with fewer user interactions than existing methods.

DYNACARE: Dynamic Cardiac Arrest Risk Estimation

Mon, 29 Apr 2013 00:00:00 +0000

Cardiac arrest is a deadly condition caused by a sudden failure of the heart with an in-hospital mortality rate of ∼80%. Therefore, the ability to accurately estimate patients at high risk of cardiac arrest is crucial for improving the survival rate. Existing research generally fails to utilize a patient’s temporal dynamics. In this paper, we present two dynamic cardiac risk estimation models, focusing on different temporal signatures in a patient’s risk trajectory. These models can track a patient’s risk trajectory in real time, allow interpretability and predictability of a cardiac arrest event, provide an intuitive visualization to medical professionals, offer a personalized dynamic hazard function, and estimate the risk for a new patient.

Recursive Karcher Expectation Estimators And Geometric Law of Large Numbers

Mon, 29 Apr 2013 00:00:00 +0000

This paper studies a form of law of large numbers on P_n, the space of n x n symmetric positive-definite matrices equipped with the Fisher-Rao metric. Specifically, we propose a recursive algorithm for estimating the Karcher expectation of an arbitrary distribution defined on P_n, and we show that the estimates computed by the recursive algorithm asymptotically converge in probability to the correct Karcher expectation. The steps in the recursive algorithm mainly consist of making appropriate moves on geodesics in P_n; the algorithm is simple to implement and offers a tremendous gain in computation time, of several orders of magnitude, over existing non-recursive algorithms. We elucidate the connection between the more familiar law of large numbers for real-valued random variables and the asymptotic convergence of the proposed recursive algorithm, and our result provides an example of a new form of law of large numbers for random variables taking values in a Riemannian manifold. From the practical side, the computation of the mean of a collection of symmetric positive-definite (SPD) matrices is a fundamental ingredient in many algorithms in machine learning, computer vision and medical imaging applications. We report an experiment using the proposed recursive algorithm for K-means clustering, demonstrating the algorithm’s efficiency, accuracy and stability.
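
As a minimal illustration of a recursive estimator that moves along geodesics, the sketch below treats the simplest case, 1 x 1 SPD matrices (positive scalars), where the geodesic from a to b is a * (b / a) ** t and the Karcher mean is the geometric mean. The function name and the step size 1 / (k + 1) are assumptions chosen to mirror the running average of real numbers; this is not the authors' implementation.

```python
import math


def recursive_karcher_mean(samples):
    """Recursive Karcher (geometric) mean of positive scalars, i.e. 1x1 SPD matrices.

    Hypothetical helper for illustration only.
    """
    it = iter(samples)
    mean = next(it)
    for k, x in enumerate(it, start=1):
        # Move from the current estimate toward the new sample along the
        # geodesic t -> mean * (x / mean) ** t, with step t = 1 / (k + 1)
        # (assumed step size, analogous to the usual running average).
        mean = mean * (x / mean) ** (1.0 / (k + 1))
    return mean


samples = [1.0, 4.0, 16.0]
estimate = recursive_karcher_mean(samples)

# Batch geometric mean for comparison: exp of the average log.
batch = math.exp(sum(math.log(x) for x in samples) / len(samples))
print(estimate, batch)
```

In this scalar case the recursion reproduces the batch geometric mean after every sample, while touching each sample only once.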

DivMCuts: Faster Training of Structural SVMs with Diverse M-Best Cutting-Planes

Mon, 29 Apr 2013 00:00:00 +0000

Training of Structural SVMs involves solving a large Quadratic Program (QP). One popular method for solving this QP is a cutting-plane approach, where the most violated constraint is iteratively added to a working-set of constraints. Unfortunately, training models with a large number of parameters remains a time consuming process. This paper shows that significant computational savings can be achieved by adding multiple diverse and highly violated constraints at every iteration of the cutting-plane algorithm. We show that generation of such diverse cutting-planes involves extracting diverse M-Best solutions from the loss-augmented score of the training instances. To find these diverse M-Best solutions, we employ a recently proposed algorithm [4]. Our experiments on image segmentation and protein side-chain prediction show that the proposed approach can lead to significant computational savings, e.g., ∼28% reduction in training time.

Clustered Support Vector Machines

Mon, 29 Apr 2013 00:00:00 +0000

In many problems of machine learning, the data are distributed nonlinearly. One way to address this kind of data is training a nonlinear classifier such as kernel support vector machine (kernel SVM). However, the computational burden of kernel SVM limits its application to large scale datasets. In this paper, we propose a Clustered Support Vector Machine (CSVM), which tackles the data in a divide and conquer manner. More specifically, CSVM groups the data into several clusters, after which it trains a linear support vector machine in each cluster to separate the data locally. Meanwhile, CSVM has an additional global regularization, which requires the weight vector of each local linear SVM to align with a global weight vector. The global regularization leverages the information from one cluster to another, and avoids over-fitting in each cluster. We derive a data-dependent generalization error bound for CSVM, which explains the advantage of CSVM over linear SVM. Experiments on several benchmark datasets show that the proposed method outperforms linear SVM and some other related locally linear classifiers. It is also comparable to a fine-tuned kernel SVM in terms of prediction performance, while it is more efficient than kernel SVM.

Unsupervised Link Selection in Networks

Mon, 29 Apr 2013 00:00:00 +0000

Real-world networks are often noisy, and the existing linkage structure may not be reliable. For example, a link which connects nodes from different communities may affect the group assignment of nodes in a negative way. In this paper, we study a new problem called link selection, which can be seen as the network equivalent of the traditional feature selection problem in machine learning. More specifically, we investigate unsupervised link selection as follows: given a network, it selects a subset of informative links from the original network which enhance the quality of community structures. To achieve this goal, we use Ratio Cut size of a network as the quality measure. The resulting link selection approach can be formulated as a semi-definite programming problem. In order to solve it efficiently, we propose a backward elimination algorithm using sequential optimization. Experiments on benchmark network datasets illustrate the effectiveness of our method.
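
To make the quality measure concrete, the sketch below computes the Ratio Cut of a partition in its common form: for each community, the number of edges leaving it divided by its size, summed over communities (some texts add a factor of 1/2). The toy graph, the known community assignment and the function name are assumptions for illustration; the paper's setting is unsupervised and its optimization procedure is not reproduced here.

```python
def ratio_cut(edges, communities):
    """Sum over communities of (edges leaving the community) / (community size)."""
    label = {node: i for i, group in enumerate(communities) for node in group}
    leaving = [0] * len(communities)
    for u, v in edges:
        if label[u] != label[v]:
            # A cross-community edge leaves both of its endpoint communities.
            leaving[label[u]] += 1
            leaving[label[v]] += 1
    return sum(cut / len(group) for cut, group in zip(leaving, communities))


# Two triangles joined by a single "noisy" link (2, 3).
communities = [{0, 1, 2}, {3, 4, 5}]
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

before = ratio_cut(edges, communities)
after = ratio_cut([e for e in edges if e != (2, 3)], communities)
print(before, after)
```

Dropping the single cross-community link takes the Ratio Cut from 2/3 to 0, which is the kind of improvement a link selection procedure would aim for.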

Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction

Mon, 29 Apr 2013 00:00:00 +0000

We introduce mixed LICORS, an algorithm for learning nonlinear, high-dimensional dynamics from spatio-temporal data, suitable for both prediction and simulation. Mixed LICORS extends the recent LICORS algorithm (Goerg and Shalizi, 2012) from hard clustering of predictive distributions to a non-parametric, EM-like soft clustering. This retains the asymptotic predictive optimality of LICORS, but, as we show in simulations, greatly improves out-of-sample forecasts with limited data. The new method is implemented in the publicly-available R package LICORS.

A unifying representation for a class of dependent random measures

Mon, 29 Apr 2013 00:00:00 +0000

We present a general construction for dependent random measures based on thinning Poisson processes on an augmented space. The framework is not restricted to dependent versions of a specific nonparametric model, but can be applied to all models that can be represented using completely random measures. Several existing dependent random measures can be seen as specific cases of this framework. Interesting properties of the resulting measures are derived and the efficacy of the framework is demonstrated by constructing a covariate-dependent latent feature model and topic model that obtain superior predictive performance.

Predictive Correlation Screening: Application to Two-stage Predictor Design in High Dimension

Mon, 29 Apr 2013 00:00:00 +0000

We introduce a new approach to variable selection, called Predictive Correlation Screening, for predictor design. Predictive Correlation Screening (PCS) implements false positive control on the selected variables, is well suited to small sample sizes, and is scalable to high dimensions. We establish asymptotic bounds for Familywise Error Rate (FWER), and resultant mean square error of a linear predictor on the selected variables. We apply Predictive Correlation Screening to the following two-stage predictor design problem. An experimenter wants to learn a multivariate predictor of gene expressions based on successive biological samples assayed on mRNA arrays. She assays the whole genome on a few samples and from these assays she selects a small number of variables using Predictive Correlation Screening. To reduce assay cost, she subsequently assays only the selected variables on the remaining samples, to learn the predictor coefficients. We show superiority of Predictive Correlation Screening relative to LASSO and correlation learning (sometimes popularly referred to in the literature as marginal regression or simple thresholding) in terms of performance and computational complexity.

Learning to Top-K Search using Pairwise Comparisons

Mon, 29 Apr 2013 00:00:00 +0000

Given a collection of N items with some unknown underlying ranking, we examine how to use pairwise comparisons to determine the top ranked items in the set. Resolving the top items from pairwise comparisons has application in diverse fields ranging from recommender systems to image-based search to protein structure analysis. In this paper we introduce techniques to resolve the top ranked items using significantly fewer than all the possible pairwise comparisons using both random and adaptive sampling methodologies. Using randomly-chosen comparisons, a graph-based technique is shown to efficiently resolve the top O(\log N) items when there are no comparison errors. In terms of adaptively-chosen comparisons, we show how the top O(\log N) items can be found, even in the presence of corrupted observations, using a voting methodology that only requires O(N \log^2 N) pairwise comparisons.
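
The voting idea can be sketched in a simplified, non-adaptive form: each item is compared against a limited number of randomly chosen opponents, and items are ranked by the number of comparisons they win. The function name, its parameters and the sampling scheme are hypothetical; this is not the adaptive procedure analysed in the paper.

```python
import random


def top_k_by_voting(items, is_better, k, comparisons_per_item, seed=0):
    """Rank items by wins against randomly sampled opponents (illustrative only)."""
    rng = random.Random(seed)
    wins = {}
    for item in items:
        others = [o for o in items if o != item]
        # Compare against a random subset of opponents, not against everyone.
        opponents = rng.sample(others, min(comparisons_per_item, len(others)))
        wins[item] = sum(1 for o in opponents if is_better(item, o))
    return sorted(items, key=lambda item: wins[item], reverse=True)[:k]


items = list(range(20))
# With every opponent sampled and no comparison errors, the vote count of an
# item equals the number of items it beats, so the top-k set is exact.
top = top_k_by_voting(items, lambda a, b: a > b, k=3, comparisons_per_item=19)
print(top)
```

The total budget is N times `comparisons_per_item`, so lowering that parameter trades accuracy for fewer comparisons.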

Data-driven covariate selection for nonparametric estimation of causal effects

Mon, 29 Apr 2013 00:00:00 +0000

The estimation of causal effects from non-experimental data is a fundamental problem in many fields of science. One of the main obstacles concerns confounding by observed or latent covariates, an issue which is typically tackled by adjusting for some set of observed covariates. In this contribution, we analyze the problem of inferring whether a given variable has a causal effect on another and, if it does, inferring an adjustment set of covariates that yields a consistent and unbiased estimator of this effect, based on the (conditional) independence and dependence relationships among the observed variables. We provide two elementary rules that we show to be both sound and complete for this task, and compare the performance of a straightforward application of these rules with standard alternative procedures for selecting adjustment sets.

Dynamic Copula Networks for Modeling Real-valued Time Series

Mon, 29 Apr 2013 00:00:00 +0000

Probabilistic modeling of temporal phenomena is of central importance in a variety of fields ranging from neuroscience to economics to speech recognition. While the task has received extensive attention in recent decades, learning temporal models for multivariate real-valued data that is non-Gaussian is still a formidable challenge. Recently, the power of copulas, a framework for representing complex multi-modal and heavy-tailed distributions, was fused with the formalism of Bayesian networks to allow for flexible modeling of high-dimensional distributions. In this work we introduce Dynamic Copula Bayesian Networks, a generalization aimed at capturing the distribution of rich temporal sequences. We apply our model to three markedly different real-life domains and demonstrate substantial quantitative and qualitative advantage.

Stochastic blockmodeling of relational event dynamics

Mon, 29 Apr 2013 00:00:00 +0000

Several approaches have recently been proposed for modeling of continuous-time network data via dyadic event rates conditioned on the observed history of events and nodal or dyadic covariates. In many cases, however, interaction propensities – and even the underlying mechanisms of interaction – vary systematically across subgroups whose identities are unobserved. For static networks such heterogeneity has been treated via methods such as stochastic blockmodeling, which operate by assuming latent groups of individuals with similar tendencies in their group-wise interactions. Here we combine ideas from stochastic blockmodeling and continuous-time network models by positing a latent partition of the node set such that event dynamics within and between subsets evolve in potentially distinct ways. We illustrate the use of our model family by application to several forms of dyadic interaction data, including email communication and Twitter direct messages. Parameter estimates from the fitted models clearly reveal heterogeneity in the dynamics among groups of individuals. We also find that the fitted models have better predictive accuracy than both baseline models and relational event models that lack latent structure.

Uncover Topic-Sensitive Information Diffusion Networks

Mon, 29 Apr 2013 00:00:00 +0000

Analyzing the spreading patterns of memes with respect to their topic distributions and the underlying diffusion network structures is an important task in social network analysis. This task in many cases becomes very challenging since the underlying diffusion networks are often hidden, and the topic specific transmission rates are also unknown. In this paper, we propose a continuous time model, TopicCascade, for topic-sensitive information diffusion networks, and infer the hidden diffusion networks and the topic dependent transmission rates from the observed time stamps and contents of cascades. One attractive property of the model is that its parameters can be estimated via a convex optimization which we solve with an efficient proximal gradient based block coordinate descent (BCD) algorithm. In both synthetic and real-world data, we show that our method significantly improves over the previous state-of-the-art models in terms of both recovering the hidden diffusion networks and predicting the transmission times of memes.

ODE parameter inference using adaptive gradient matching with Gaussian processes

Mon, 29 Apr 2013 00:00:00 +0000

Parameter inference in mechanistic models based on systems of coupled differential equations is a topical yet computationally challenging problem, due to the need to follow each parameter adaptation with a numerical integration of the differential equations. Techniques based on gradient matching, which aim to minimize the discrepancy between the slope of a data interpolant and the derivatives predicted from the differential equations, offer a computationally appealing shortcut to the inference problem. The present paper discusses a method based on nonparametric Bayesian statistics with Gaussian processes due to Calderhead et al. (2008), and shows how inference in this model can be substantially improved by consistently sampling from the joint distribution of the ODE parameters and GP hyperparameters. We demonstrate the efficiency of our adaptive gradient matching technique on three benchmark systems, and perform a detailed comparison with the method in Calderhead et al. (2008) and the explicit ODE integration approach, both in terms of parameter inference accuracy and in terms of computational efficiency.
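
The gradient matching idea can be sketched on a one-parameter ODE, dx/dt = -theta * x. In the sketch below a central finite difference stands in for the slope of the data interpolant (an assumption; the paper uses Gaussian processes), and theta is chosen by least squares so that the ODE's predicted derivative matches those slopes. No numerical integration of the ODE is needed. All names are hypothetical.

```python
import math


def gradient_matching_decay_rate(times, values):
    """Estimate theta in dx/dt = -theta * x by matching slopes (illustrative only)."""
    num = 0.0
    den = 0.0
    for i in range(1, len(times) - 1):
        # Central finite difference: a crude stand-in for the interpolant's slope.
        slope = (values[i + 1] - values[i - 1]) / (times[i + 1] - times[i - 1])
        # Minimising sum_i (slope_i + theta * x_i)^2 over theta gives
        # theta = -sum(slope_i * x_i) / sum(x_i^2).
        num += -slope * values[i]
        den += values[i] ** 2
    return num / den


true_theta = 0.5
times = [0.1 * i for i in range(51)]
values = [math.exp(-true_theta * t) for t in times]  # noise-free observations

theta_hat = gradient_matching_decay_rate(times, values)
print(theta_hat)
```

On this noise-free data the estimate lands close to the true value of 0.5; with noisy observations the quality of the slope estimates becomes the limiting factor, which is where a smoother interpolant helps.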

Deep Gaussian Processes

Mon, 29 Apr 2013 00:00:00 +0000

In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.

Permutation estimation and minimax rates of identifiability

Mon, 29 Apr 2013 00:00:00 +0000

The problem of matching two sets of features appears in various tasks of computer vision and can often be formalized as a problem of permutation estimation. We address this problem from a statistical point of view and provide a theoretical analysis of the accuracy of several natural estimators. To this end, the notion of the minimax matching threshold is introduced and its expression is obtained as a function of the sample size, noise level and dimensionality. We consider the cases of homoscedastic and heteroscedastic noise and derive, in each case, upper bounds on the matching threshold of several estimators. These upper bounds are shown to be unimprovable in the homoscedastic setting. We also discuss the computational aspects of the estimators and provide some empirical evidence of their consistency on synthetic data-sets.

A simple sketching algorithm for entropy estimation over streaming data

Mon, 29 Apr 2013 00:00:00 +0000

We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Renyi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Renyi entropy by taking α sufficiently close to 1. However, practical guidelines for parameter calibration with respect to α are lacking. We avoid this problem by showing that the random variables used in estimating the Renyi entropy can be transformed to have a proper distributional limit as α approaches 1: the maximally skewed, strictly stable distribution with α = 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ > 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ = 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
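
The relationship between the two entropies can be checked numerically. The sketch below computes both exactly for a small, made-up distribution (no sketching or streaming involved) and shows the Renyi entropy approaching the Shannon entropy as α approaches 1; it uses the standard definition H_α(p) = log(sum_i p_i^α) / (1 - α).

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1) in nats."""
    if alpha == 1:
        raise ValueError("alpha = 1 is the Shannon limit; use shannon_entropy")
    return math.log(sum(q ** alpha for q in p if q > 0)) / (1.0 - alpha)


p = [0.5, 0.25, 0.125, 0.125]  # illustrative empirical distribution
h = shannon_entropy(p)

# Absolute gap between Renyi and Shannon entropy for alpha closer and closer to 1.
gaps = [abs(renyi_entropy(p, a) - h) for a in (1.5, 1.1, 1.01, 1.001)]
print(h, gaps)
```

The gap shrinks steadily as α approaches 1, but the formula divides by 1 - α, which hints at why calibrating α in an estimator is delicate.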

Scoring anomalies: a M-estimation formulation

Mon, 29 Apr 2013 00:00:00 +0000

It is the purpose of this paper to formulate the issue of scoring multivariate observations depending on their degree of abnormality/novelty as an unsupervised learning task. In the 1-d situation, this problem can be dealt with by means of tail estimation techniques, observations being viewed as all the more “abnormal” as they are located far in the tail(s) of the underlying probability distribution. In a wide variety of applications, however, it is desirable to have at one's disposal a scalar-valued “scoring” function that allows the degree of abnormality of multivariate observations to be compared. Here we formulate the issue of scoring anomalies as an M-estimation problem. A (functional) performance criterion is proposed, whose optimal elements are, as expected, nondecreasing transforms of the density. The question of empirical estimation of this criterion is tackled, and preliminary statistical results related to the accuracy of partition-based techniques for optimizing empirical estimates of the empirical performance measure are established.

Why Steiner-tree type algorithms work for community detection

Mon, 29 Apr 2013 00:00:00 +0000

We consider the problem of reconstructing a specific connected community S ⊂ V in a graph G = (V, E), where each node v is associated with a signal whose strength grows with the likelihood that v belongs to S. This problem appears in social or protein interaction networks, the latter also referred to as the signaling pathway reconstruction problem. We study this community reconstruction problem under several natural generative models, and make the following two contributions. First, in the context of social networks, where the signals are modeled as bounded-supported random variables, we design an efficient algorithm for recovering most members in S with well-controlled false positive overhead, by utilizing the network structure for a large family of “homogeneous” generative models. This positive result is complemented by an information theoretic lower bound for the case where the network structure is unknown or the network is heterogeneous. Second, for the case in which the graph represents a protein interaction network, where it is customary to consider signals that have unbounded support, we generalize our first contribution to give the first theoretical justification of why existing Steiner-tree type heuristics work well in practice.

Evidence Estimation for Bayesian Partially Observed MRFs

Mon, 29 Apr 2013 00:00:00 +0000

Bayesian estimation in Markov random fields is very hard due to the intractability of the partition function. The introduction of hidden units makes the situation even worse due to the presence of potentially very many modes in the posterior distribution. For the first time we propose a comprehensive procedure to address one of the Bayesian estimation problems, approximating the evidence of partially observed MRFs based on the Laplace approximation. We also introduce a number of approximate MCMC-based methods for comparison but find that the Laplace approximation significantly outperforms these.

A simple criterion for controlling selection bias

Mon, 29 Apr 2013 00:00:00 +0000

Controlling selection bias, a statistical error caused by preferential sampling of data, is a fundamental problem in machine learning and statistical inference. This paper presents a simple criterion for controlling selection bias in the odds ratio, a widely used measure for association between variables, that connects the nature of selection bias with the graph modeling the selection mechanism. If the graph contains certain paths, we show that the odds ratio cannot be expressed using data with selection bias. Otherwise, we show that a d-separability test can determine whether the odds ratio can be recovered, and when the answer is affirmative, output an unbiased estimand of the odds ratio. The criterion can be tested in linear time and enhances the power of the estimand.
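
A classical special case shows why the odds ratio is a natural target: when selection depends only on one of the two variables, the odds ratio computed from the selected data equals the population odds ratio. The sketch below checks this on made-up counts; the table, the selection probabilities and the function name are illustrative assumptions, and the paper's graphical criterion is not implemented here.

```python
def odds_ratio(table):
    """Odds ratio for binary X and Y; table[x][y] is the count for X = x, Y = y."""
    return (table[1][1] * table[0][0]) / (table[1][0] * table[0][1])


# Made-up population counts: rows are X = 0, 1 and columns are Y = 0, 1.
population = [[400.0, 100.0], [200.0, 300.0]]

# Preferential sampling that depends only on Y: units with Y = 1 are far
# more likely to be included than units with Y = 0.
keep = {0: 0.1, 1: 0.8}
selected = [[population[x][y] * keep[y] for y in (0, 1)] for x in (0, 1)]

print(odds_ratio(population), odds_ratio(selected))
```

The selection probabilities cancel between numerator and denominator, so both tables give the same odds ratio even though the marginal frequencies in the selected data are heavily distorted.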

Computing the M Most Probable Modes of a Graphical Model

Mon, 29 Apr 2013 00:00:00 +0000

We introduce the M-modes problem for graphical models: predicting the M label configurations of highest probability that are at the same time local maxima of the probability landscape. M-modes have multiple possible applications: because they are intrinsically diverse, they provide a principled alternative to non-maximum suppression techniques for structured prediction, they can act as codebook vectors for quantizing the configuration space, or they can form component centers for mixture model approximation. We present two algorithms for solving the M-modes problem. The first algorithm solves the problem in polynomial time when the underlying graphical model is a simple chain. The second algorithm solves the problem for junction chains. On synthetic and real datasets, we demonstrate how M-modes can improve the performance of prediction. We also use the generated modes as a tool to understand the topography of the probability distribution of configurations, for example with relation to the training set size and amount of noise in the data.
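
The problem statement can be made concrete with a brute-force sketch on a tiny chain. It assumes that "local maximum" means no single-label change yields an equal or higher score (a Hamming-radius-1 neighbourhood; the paper may use a different neighbourhood), and it enumerates all configurations, so it only illustrates the definition and is not the polynomial-time algorithm. Names and potentials are made up.

```python
from itertools import product


def m_modes_brute_force(unary, pairwise, m):
    """Top-m local maxima of a chain model by exhaustive search (illustrative only).

    unary[i][l] and pairwise[a][b] are log-potentials; the score of a
    configuration is their sum along the chain.
    """
    n = len(unary)
    labels = range(len(unary[0]))

    def score(x):
        s = sum(unary[i][x[i]] for i in range(n))
        return s + sum(pairwise[x[i]][x[i + 1]] for i in range(n - 1))

    def is_mode(x):
        sx = score(x)
        for i in range(n):
            for l in labels:
                # Reject x if changing a single label does at least as well.
                if l != x[i] and score(x[:i] + (l,) + x[i + 1:]) >= sx:
                    return False
        return True

    modes = [x for x in product(labels, repeat=n) if is_mode(x)]
    return sorted(modes, key=score, reverse=True)[:m]


unary = [[0.5, 0.0], [1.0, 0.0], [0.0, 0.0]]  # three binary variables
pairwise = [[1.0, 0.0], [0.0, 1.0]]           # neighbours prefer to agree
modes = m_modes_brute_force(unary, pairwise, m=2)
print(modes)
```

In this example the second-highest-scoring configuration overall is (0, 0, 1), a one-label variation of the best one, whereas the second mode is (1, 1, 1). This is the built-in diversity the abstract refers to.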

Efficiently Sampling Probabilistic Programs via Program Analysis

Mon, 29 Apr 2013 00:00:00 +0000

Probabilistic programs are intuitive and succinct representations of complex probability distributions. A natural approach to performing inference over these programs is to execute them and compute statistics over the resulting samples. Indeed, this approach has been taken before in a number of probabilistic programming tools. In this paper, we address two key challenges of this paradigm: (i) ensuring samples are well distributed in the combinatorial space of the program, and (ii) efficiently generating samples with minimal rejection. We present a new sampling algorithm Qi that addresses these challenges using concepts from the field of program analysis. To solve the first challenge (getting diverse samples), we use a technique called symbolic execution to systematically explore all the paths in a program. In the case of programs with loops, we systematically explore all paths up to a given depth, and present theorems on error bounds on the estimates as a function of the path bounds used. To solve the second challenge (efficient samples with minimal rejection), we propagate observations backward through the program using the notion of Dijkstra’s weakest preconditions and hoist these propagated conditions to condition elementary distributions during sampling. We present theorems explaining the mathematical properties of Qi, as well as empirical results from an implementation of the algorithm.
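
The benefit of pushing an observation back into an elementary distribution can be seen on a one-line toy program: draw x from Uniform(0, 1) and observe x > 0.9. The sketch below compares running the program forward with rejection against sampling directly from the conditioned distribution Uniform(0.9, 1). Both samplers are hypothetical illustrations of the idea; neither is the Qi algorithm.

```python
import random


def naive_sample(rng):
    """Run the program forward and reject runs that violate the observation."""
    attempts = 0
    while True:
        attempts += 1
        x = rng.random()      # x ~ Uniform(0, 1)
        if x > 0.9:           # observe(x > 0.9)
            return x, attempts


def hoisted_sample(rng):
    """Sample from the elementary distribution already conditioned on x > 0.9."""
    return 0.9 + 0.1 * rng.random(), 1


rng = random.Random(0)
naive = [naive_sample(rng) for _ in range(1000)]
hoisted = [hoisted_sample(rng) for _ in range(1000)]

naive_draws = sum(attempts for _, attempts in naive)
hoisted_draws = sum(attempts for _, attempts in hoisted)
print(naive_draws, hoisted_draws)
```

The acceptance probability of the naive sampler is 0.1, so it needs about ten draws per accepted sample on average, while the conditioned sampler needs exactly one.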

Convex Collective Matrix Factorization

Mon, 29 Apr 2013 00:00:00 +0000

In many applications, multiple interlinked sources of data are available and they cannot be represented by a single adjacency matrix, to which large scale factorization methods could be applied. Collective matrix factorization is a simple yet powerful approach to jointly factorize multiple matrices, each of which represents a relation between two entity types. Existing algorithms to estimate parameters of collective matrix factorization models are based on non-convex formulations of the problem; in this paper, a convex formulation of this approach is proposed. This enables the derivation of large scale algorithms to estimate the parameters, including an iterative eigenvalue thresholding algorithm. Numerical experiments illustrate the benefits of this new approach.

Meta-Transportability of Causal Effects: A Formal Approach

Mon, 29 Apr 2013 00:00:00 +0000

This paper considers the problem of transferring experimental findings learned from multiple heterogeneous domains to a different environment, in which only passive observations can be collected. Pearl and Bareinboim (2011) established a complete characterization for such transfer between two domains, a source and a target, and this paper generalizes their results to multiple heterogeneous domains. It establishes a necessary and sufficient condition for deciding when effects in the target domain are estimable from both statistical and causal information transferred from the experiments in the source domains. The paper further provides a complete algorithm for computing the transport formula, that is, a way of fusing observational and experimental information to synthesize an unbiased estimate of the desired effects.

Bayesian learning of joint distributions of objects

Mon, 29 Apr 2013 00:00:00 +0000

There is increasing interest in broad application areas in defining flexible joint models for data having a variety of measurement scales, while also allowing data of complex types, such as functions, images and documents. We consider a general framework for nonparametric Bayes joint modeling through mixture models that incorporate dependence across data types through a joint mixing measure. The mixing measure is assigned a novel infinite tensor factorization (ITF) prior that allows flexible dependence in cluster allocation across data types. The ITF prior is formulated as a tensor product of stick-breaking processes. Focusing on a convenient special case corresponding to a Parafac factorization, we provide basic theory justifying the flexibility of the proposed prior. Focusing on ITF mixtures of product kernels, we develop a new Gibbs sampling algorithm for routine implementation relying on slice sampling. The methods are compared with alternative joint mixture models based on Dirichlet processes and related approaches through simulations and real data applications.

Ultrahigh Dimensional Feature Screening via RKHS Embeddings

Mon, 29 Apr 2013 00:00:00 +0000

Feature screening is a key step in handling ultrahigh dimensional data sets that are ubiquitous in modern statistical problems. Over the last decade, convex relaxation based approaches (e.g., Lasso/sparse additive model) have been extensively developed and analyzed for feature selection in the high dimensional regime. But in the ultrahigh dimensional regime, these approaches suffer from several problems, both computationally and statistically. To overcome these issues, in this paper, we propose a novel Hilbert space embedding based approach to independence screening for ultrahigh dimensional data sets. The proposed approach is model-free (i.e., no model assumption is made between response and predictors) and can handle non-standard (e.g., graphs) and multivariate outputs directly. We establish the sure screening property of the proposed approach in the ultrahigh dimensional regime, and experimentally demonstrate its advantages over other approaches on several synthetic and real data sets.

Consensus Ranking with Signed Permutations

Mon, 29 Apr 2013 00:00:00 +0000

Signed permutations (also known as the hyperoctahedral group) are used in modeling genome rearrangements. The algorithmic problems they raise are computationally demanding when not NP-hard. This paper presents a tractable algorithm for learning a consensus ranking between signed permutations under the inversion distance. This can be extended to estimate a natural class of exponential models over the group of signed permutations. We investigate experimentally the efficiency of our algorithm for modeling data generated by random reversals.
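
For readers unfamiliar with the objects involved, the sketch below shows a signed permutation as a list of nonzero integers and the reversal operation from genome rearrangement, which reverses a segment and flips the sign of each element in it. The function names and the way random reversals are drawn are assumptions for illustration; the paper's own data generation procedure is not specified in the abstract.

```python
import random


def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and negate its elements."""
    segment = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + segment + perm[j + 1:]


def random_reversals(perm, n_steps, seed=0):
    """Apply n_steps reversals with uniformly chosen endpoints (hypothetical scheme)."""
    rng = random.Random(seed)
    for _ in range(n_steps):
        i = rng.randrange(len(perm))
        j = rng.randrange(i, len(perm))
        perm = reversal(perm, i, j)
    return perm


if __name__ == "__main__":
    print(reversal([1, 2, 3, 4], 1, 2))
    print(random_reversals([1, 2, 3, 4, 5], n_steps=3))
```

Applying the same reversal twice restores the original signed permutation, which is a quick sanity check on any implementation.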

Distributed and Adaptive Darting Monte Carlo through Regenerations

Mon, 29 Apr 2013 00:00:00 +0000

Darting Monte Carlo (DMC) is an MCMC procedure designed to effectively mix between multiple modes of a probability distribution. We propose an adaptive and distributed version of this method by using regenerations. This allows us to run multiple chains in parallel and adapt the shape of the jump regions as well as all other aspects of the Markov chain on the fly. We show that this significantly improves the performance of DMC because 1) a population of chains has a higher chance of finding the modes in the distribution, 2) jumping between modes becomes easier due to the adaptation of their shapes, and 3) computation is much more efficient due to parallelization across multiple processors. While the curse of dimensionality is a challenge for both DMC and regeneration, we find that their combination ameliorates this issue slightly.

Further Optimal Regret Bounds for Thompson Sampling

Mon, 29 Apr 2013 00:00:00 +0000

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated its empirical performance to be comparable to or better than that of state-of-the-art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that proves the first near-optimal problem-independent bound of $O(\sqrt{NT\ln T})$ on the expected regret of this algorithm. Our novel martingale-based analysis techniques are conceptually simple, and easily extend to distributions other than the Beta distribution. For the version of Thompson Sampling that uses Gaussian priors, we prove a problem-independent bound of $O(\sqrt{NT\ln N})$ on the expected regret, and demonstrate the optimality of this bound by providing a matching lower bound. This lower bound of $\Omega(\sqrt{NT\ln N})$ is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of $\Omega(\sqrt{NT})$ for the multi-armed bandit problem. Our near-optimal problem-independent bounds for Thompson Sampling solve a COLT 2012 open problem of Chapelle and Li. Additionally, our techniques simultaneously provide the optimal problem-dependent bound of $(1+\epsilon)\sum_i \frac{\ln T}{d(\mu_i, \mu_1)} + O\left(\frac{N}{\epsilon^2}\right)$ on the expected regret. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [2012].
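
The Beta-prior version of the algorithm analyzed above can be sketched in a few lines for Bernoulli rewards. This is a minimal illustration under stated assumptions, not the authors' code: the uniform Beta(1, 1) priors, the arm means, the horizon, and the function name are all hypothetical choices.

```python
import numpy as np


def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Run Thompson Sampling with Beta(1, 1) priors on a Bernoulli bandit.

    Returns the cumulative pseudo-regret and the number of pulls per arm.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its current Beta posterior.
        theta = rng.beta(successes + 1.0, failures + 1.0)
        arm = int(np.argmax(theta))          # play the arm with the largest sample
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - true_means[arm]
    return regret, successes + failures


if __name__ == "__main__":
    regret, pulls = thompson_sampling_bernoulli([0.2, 0.5, 0.8], horizon=2000)
    print(regret, pulls)
```

Here $N$ is the number of arms and $T$ the horizon, matching the symbols in the bounds; the only randomness in the arm choice comes from the posterior samples, so no exploration parameter needs tuning.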

Nyström Approximation for Large-Scale Determinantal Processes

Mon, 29 Apr 2013 00:00:00 +0000

Determinantal point processes (DPPs) are appealing models for subset selection problems where diversity is desired. They offer surprisingly efficient inference, including sampling in $O(N^3)$ time and $O(N^2)$ space, where $N$ is the number of base items. However, in some applications, $N$ may grow so large that sampling from a DPP becomes computationally infeasible. This is especially true in settings where the DPP kernel matrix cannot be represented by a linear decomposition of low-dimensional feature vectors. In these cases, we propose applying the Nyström approximation to project the kernel matrix into a low-dimensional space. While theoretical guarantees for the Nyström approximation in terms of standard matrix norms have been previously established, we are concerned with probabilistic measures, like total variation distance between the DPP generated by a kernel matrix and the one generated by its Nyström approximation, that behave quite differently. In this paper we derive new error bounds for the Nyström-approximated DPP and present empirical results to corroborate them. We then demonstrate the Nyström-approximated DPP by applying it to a motion capture summarization task.
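
The standard Nyström construction referred to above builds a low-rank surrogate of a positive semidefinite kernel from a subset of its columns. The sketch below is a generic illustration, not the paper's implementation: the function name, the tolerance, and the way landmark indices are chosen are assumptions.

```python
import numpy as np


def nystrom_approximation(L, landmarks, tol=1e-10):
    """Approximate a PSD kernel L by C W^+ C^T, where C holds the landmark
    columns of L and W is the landmark-by-landmark block."""
    C = L[:, landmarks]
    W = L[np.ix_(landmarks, landmarks)]
    # Pseudo-inverse of W through its eigendecomposition, dropping
    # eigenvalues that are numerically zero.
    w, V = np.linalg.eigh(0.5 * (W + W.T))
    keep = w > tol * w.max()
    W_pinv = (V[:, keep] / w[keep]) @ V[:, keep].T
    return C @ W_pinv @ C.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    L = X @ X.T                      # rank-3 PSD kernel
    L_hat = nystrom_approximation(L, np.arange(10))
    print(np.abs(L - L_hat).max())
```

The approximation has rank at most the number of landmarks, so downstream computations can work in that smaller dimension. When the landmark block captures the full rank of the kernel, as in the toy example, the reconstruction is exact up to round-off.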

Reconstructing ecological networks with hierarchical Bayesian regression and Mondrian processes

Mon, 29 Apr 2013 00:00:00 +0000

Ecological systems consist of complex sets of interactions among species and their environment, the understanding of which has implications for predicting environmental response to perturbations such as invading species and climate change. However, the revelation of these interactions is not straightforward, nor are the interactions necessarily stable across space. Machine learning can enable the recovery of such complex, spatially varying interactions from relatively easily obtained species abundance data. Here, we describe a novel Bayesian regression and Mondrian process model (BRAMP) for reconstructing species interaction networks from observed field data. BRAMP enables robust inference of species interactions considering autocorrelation in species abundances and allowing for variation in the interactions across space. We evaluate the model on spatially explicit simulated data, produced using a trophic niche model combined with stochastic population dynamics. We compare the model’s performance against L1-penalized sparse regression (LASSO) and non-linear Bayesian networks with the BDe scoring scheme. Finally, we apply BRAMP to real ecological data.

Clustering Oligarchies

Mon, 29 Apr 2013 00:00:00 +0000

We investigate the extent to which clustering algorithms are robust to the addition of a small, potentially adversarial, set of points. Our analysis reveals radical differences in the robustness of popular clustering methods. k-means and several related techniques are robust when data is clusterable, and we provide a quantitative analysis capturing the precise relationship between clusterability and robustness. In contrast, common linkage-based algorithms and several standard objective-function-based clustering methods can be highly sensitive to the addition of a small set of points even when the data is highly clusterable. We call such sets of points oligarchies. Lastly, we show that the behavior with respect to oligarchies of the popular Lloyd’s method changes radically with the initialization technique.

A Competitive Test for Uniformity of Monotone Distributions

Mon, 29 Apr 2013 00:00:00 +0000

We propose a test that takes random samples drawn from a monotone distribution and decides whether or not the distribution is uniform. The test is nearly optimal in that it uses at most $O(n\sqrt{\log n})$ samples, where $n$ is the number of samples that a genie who knew all but one bit about the underlying distribution would need for the same task. Furthermore, we show that any such test would require $\Omega(n\sqrt{\log n})$ samples for some distributions.