Proceedings of Machine Learning Research

On Convergence of Model Parallel Proximal Gradient Algorithm for Stale Synchronous Parallel System

Mon, 02 May 2016 00:00:00 +0000

With ever growing data volume and model size, an error-tolerant, communication efficient, yet versatile parallel algorithm has become a vital part for the success of many large-scale applications. In this work we propose mspg, an extension of the flexible proximal gradient algorithm to the model parallel and stale synchronous setting. The worker machines of mspg operate asynchronously as long as they are not too far apart, and they communicate efficiently through a dedicated parameter server. Theoretically, we provide a rigorous analysis of the various convergence properties of mspg, and a salient feature of our analysis is its seamless generality that allows both nonsmooth and nonconvex functions. Under mild conditions, we prove the whole iterate sequence of mspg converges to a critical point (which is optimal under convexity assumptions). We further provide an economical implementation of mspg, completely bypassing the need of keeping a local full model. We confirm our theoretical findings through numerical experiments.

Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation

Mon, 02 May 2016 00:00:00 +0000

It has been observed in a variety of contexts that gradient descent methods have great success in solving low-rank matrix factorization problems, despite the relevant problem formulation being non-convex. We tackle a particular instance of this scenario, where we seek the d-dimensional subspace spanned by a streaming data matrix. We apply the natural first order incremental gradient descent method, constraining the gradient method to the Grassmannian. In this paper, we propose an adaptive step size scheme that is greedy for the noiseless case, that maximizes the improvement of our metric of convergence at each data index t, and yields an expected improvement for the noisy case. We show that, with noise-free data, this method converges from any random initialization to the global minimum of the problem. For noisy data, we provide the expected convergence rate of the proposed algorithm per iteration.

Stochastic Variational Inference for the HDP-HMM

Mon, 02 May 2016 00:00:00 +0000

We derive a variational inference algorithm for the HDP-HMM based on the two-level stick breaking construction. This construction has previously been applied to the hierarchical Dirichlet processes (HDP) for mixed membership models, allowing for efficient handling of the coupled weight parameters. However, the same algorithm is not directly applicable to HDP-based infinite hidden Markov models (HDP-HMM) because of extra sequential dependencies in the Markov chain. In this paper we provide a solution to this problem by deriving a variational inference algorithm for the HDP-HMM, as well as its stochastic extension, for which all parameter updates are in closed form. We apply our algorithm to sequential text analysis and audio signal analysis, comparing our results with the beam-sampled iHMM, the parametric HMM, and other variational inference approximations.

Online (and Offline) Robust PCA: Novel Algorithms and Performance Guarantees

Mon, 02 May 2016 00:00:00 +0000

In this work we develop and study a novel online robust principal components’ analysis (RPCA) algorithm based on the recently introduced ReProCS framework. Our algorithm significantly improves upon the original ReProCS algorithm and it also returns even more accurate offline estimates. The key contribution of this work is a correctness result for this algorithm under relatively mild assumptions. By using extra (but usually valid) assumptions we are able to remove one important limitation of batch RPCA results and two important limitations of a recent result for ReProCS for online RPCA. To the best of our knowledge, this work is among the first correctness results for online RPCA.

Exponential Stochastic Cellular Automata for Massively Parallel Inference

Mon, 02 May 2016 00:00:00 +0000

We propose an embarrassingly parallel, memory efficient inference algorithm for latent variable models in which the complete data likelihood is in the exponential family. The algorithm is a stochastic cellular automaton and converges to a valid maximum a posteriori fixed point. Applied to latent Dirichlet allocation we find that our algorithm is over an order or magnitude faster than the fastest current approaches. A simple C++/MPI implementation on a 20-node Amazon EC2 cluster samples at more than 1 billion tokens per second. We process 3 billion documents and achieve predictive power competitive with collapsed Gibbs sampling and variational inference.

A Convex Surrogate Operator for General Non-Modular Loss Functions

Mon, 02 May 2016 00:00:00 +0000

Empirical risk minimization frequently employs convex surrogates to underlying discrete loss functions in order to achieve computational tractability during optimization. However, classical convex surrogates can only tightly bound modular loss functions, submodular functions or supermodular functions separately while maintaining polynomial time computation. In this work, a novel generic convex surrogate for general non-modular loss functions is introduced, which provides for the first time a tractable solution for loss functions that are neither supermodular nor submodular. This convex surrogate is based on a submodular-supermodular decomposition for which the existence and uniqueness is proven in this paper. It takes the sum of two convex surrogates that separately bound the supermodular component and the submodular component using slack-rescaling and the Lovasz hinge, respectively. It is further proven that this surrogate is convex, piecewise linear, an extension of the loss function, and for which subgradient computation is polynomial time. Empirical results are reported on a non-submodular loss based on the Sorensen-Dice difference function, and a real-world face track dataset with tens of thousands of frames, demonstrating the improved performance, efficiency, and scalability of the novel convex surrogate.

Scalable Exemplar Clustering and Facility Location via Augmented Block Coordinate Descent with Column Generation

Mon, 02 May 2016 00:00:00 +0000

In recent years exemplar clustering has become a popular tool for applications in document and video summarization, active learning, and clustering with general similarity, where cluster centroids are required to be a subset of the data samples rather than their linear combinations. The problem is also well-known as facility location in the operations research literature. While the problem has well-developed convex relaxation with approximation and recovery guarantees, it has number of variables grows quadratically with the number of samples. Therefore, state-of-the-art methods can hardly handle more than 10^4 samples (i.e. 10^8 variables). In this work, we propose an Augmented-Lagrangian with Block Coordinate Descent (AL-BCD) algorithm that utilizes problem structure to obtain closed-form solution for each block sub-problem, and exploits low-rank representation of the dissimilarity matrix to search active columns without computing the entire matrix. Experiments show our approach to be orders of magnitude faster than existing approaches and can handle problems of up to 10^6 samples. We also demonstrate successful applications of the algorithm on world-scale facility location, document summarization and active learning.

Relationship between PreTraining and Maximum Likelihood Estimation in Deep Boltzmann Machines

Mon, 02 May 2016 00:00:00 +0000

A pretraining algorithm, which is a layer-by-layer greedy learning algorithm, for a deep Boltzmann machine (DBM) is presented in this paper. By considering the deep belief net type of pretraining for the DBM, which is a simplified version of the original pretraining of the DBM, two interesting theoretical facts about pretraining can be obtained. (1) By applying two different types of approximation, a replacing approximation by using a Bayesian network and a Bethe type of approximation based on the cluster variation method, to two different parts of the true log-likelihood function of the DBM, pretraining can be derived from a variational approximation of the original maximum likelihood estimation. (2) It can be ensured that the pretraining improves the variational bound of the true log-likelihood function of the DBM. These two theoretical results will help deepen our understanding of deep learning. Moreover, on the basis of the theoretical results, we discuss the original pretraining of the DBM in the latter part of this paper.

Deep Kernel Learning

Mon, 02 May 2016 00:00:00 +0000

We introduce scalable deep kernels, which combine the structural properties of deep learning architectures with the non-parametric flexibility of kernel methods. Specifically, we transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation. These closed-form kernels can be used as drop-in replacements for standard kernels, with benefits in expressive power and scalability. We jointly learn the properties of these kernels through the marginal likelihood of a Gaussian process. Inference and learning cost O(n) for n training points, and predictions cost O(1) per test point. On a large and diverse collection of applications, including a dataset with 2 million examples, we show improved performance over scalable Gaussian processes with flexible kernel learning models, and stand-alone deep architectures.

Tightness of LP Relaxations for Almost Balanced Models

Mon, 02 May 2016 00:00:00 +0000

Linear programming (LP) relaxations are widely used to attempt to identify a most likely conﬁguration of a discrete graphical model. In some cases, the LP relaxation attains an optimum vertex at an integral location and thus guarantees an exact solution to the original optimization problem. When this occurs, we say that the LP relaxation is tight. Here we consider binary pairwise models and derive sufﬁcient conditions for guaranteed tightness of (i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP relaxation on the triplet-consistent polytope LP+TRI (the next level in the Sherali-Adams hierarchy). We provide simple new proofs of earlier results and derive significant novel results including that LP+TRI is tight for any model where each block is balanced or almost balanced, and a decomposition theorem that may be used to break apart complex models into smaller pieces. An almost balanced (sub-)model is one that contains no frustrated cycles except through one privileged variable.

Clamping Improves TRW and Mean Field Approximations

Mon, 02 May 2016 00:00:00 +0000

We examine the effect of clamping variables for approximate inference in undirected graphical models with pairwise relationships and discrete variables. For any number of variable labels, we demonstrate that clamping and summing approximate sub-partition functions can lead only to a decrease in the partition function estimate for TRW, and an increase for the naive mean ﬁeld method, in each case guaranteeing an improvement in the approximation and bound. We next focus on binary variables, add the Bethe approximation to consideration and examine ways to choose good variables to clamp, introducing new methods. We show the importance of identifying highly frustrated cycles, and of checking the singleton entropy of a variable. We explore the value of our methods by empirical analysis and draw lessons to guide practitioners.

Unsupervised Feature Selection by Preserving Stochastic Neighbors

Mon, 02 May 2016 00:00:00 +0000

Feature selection is an important technique for alleviating the curse of dimensionality. Unsupervised feature selection is more challenging than its supervised counterpart due to the lack of labels. In this paper, we present an effective method, Stochastic Neighbor-preserving Feature Selection (SNFS), for selecting discriminative features in unsupervised setting. We employ the concept of stochastic neighbors and select the features that can best preserve such stochastic neighbors by minimizing the Kullback-Leibler (KL) Divergence between neighborhood distributions. The proposed approach measures feature utility jointly in a non-linear way and discriminative features can be selected due to its ’push-pull’ property. We develop an efficient algorithm for optimizing the objective function based on projected quasi-Newton method. Moreover, few existing methods provide ways for determining the optimal number of selected features and this hampers their utility in practice. Our approach is equipped with a guideline for choosing the number of features, which provides nearly optimal performance in our experiments. Experimental results show that the proposed method outperforms state-of-the-art methods significantly on several real-world datasets.

Inference for High-dimensional Exponential Family Graphical Models

Mon, 02 May 2016 00:00:00 +0000

Probabilistic graphical models have been widely used to model complex systems and aid scientific discoveries. Most existing work on high-dimensional estimation of exponential family graphical models, including Gaussian graphical models and Ising models, is focused on consistent model selection. However, these results do not characterize uncertainty in the estimated structure and are of limited value to scientists who worry whether their findings will be reproducible and if the estimated edges are present in the model due to random chance. In this paper, we propose a novel estimator for edge parameters in an exponential family graphical models. We prove that the estimator is \sqrtn-consistent and asymptotically Normal. This result allows us to construct confidence intervals for edge parameters, as well as, hypothesis tests. We establish our results under conditions that are typically assumed in the literature for consistent estimation. However, we do not require that the estimator consistently recovers the graph structure. In particular, we prove that the asymptotic distribution of the estimator is robust to model selection mistakes and uniformly valid for a large number of data-generating processes. We illustrate validity of our estimator through extensive simulation studies.

Optimization as Estimation with Gaussian Processes in Bandit Settings

Mon, 02 May 2016 00:00:00 +0000

Recently, there has been rising interest in Bayesian optimization – the optimization of an unknown function with assumptions usually expressed by a Gaussian Process (GP) prior. We study an optimization strategy that directly uses an estimate of the argmax of the function. This strategy offers both practical and theoretical advantages: no tradeoff parameter needs to be selected, and, moreover, we establish close connections to the popular GP-UCB and GP-PI strategies. Our approach can be understood as automatically and adaptively trading off exploration and exploitation in GP-UCB and GP-PI. We illustrate the effects of this adaptive tuning via bounds on the regret as well as an extensive empirical evaluation on robotics and vision tasks, demonstrating the robustness of this strategy for a range of performance criteria.

Scalable geometric density estimation

Mon, 02 May 2016 00:00:00 +0000

It is standard to assume a low-dimensional structure in estimating a high-dimensional density. However, popular methods, such as probabilistic principal component analysis, scale poorly computationally. We introduce a novel empirical Bayes method that we term geometric density estimation (GEODE) and show that, with mild conditions and among all d-dimensional linear subspaces, the span of the d leading principal axes of the data maximizes the model posterior. With these axes pre-computed using fast singular value decomposition, GEODE easily scales to high dimensional problems while providing uncertainty characterization. The model is also capable of imputing missing data and dynamically deleting redundant dimensions. Finally, we generalize GEODE by mixing it across a dyadic clustering tree. Both simulation studies and real world data applications show superior performance of GEODE in terms of robustness and computational efficiency.

Distributed Multi-Task Learning

Mon, 02 May 2016 00:00:00 +0000

We consider the problem of distributed multi-task learning, where each machine learns a separate, but related, task. Specifically, each machine learns a linear predictor in high-dimensional space, where all tasks share the same small support. We present a communication-efficient estimator based on the debiased lasso and show that it is comparable with the optimal centralized method.

Sequential Inference for Deep Gaussian Process

Mon, 02 May 2016 00:00:00 +0000

A deep Gaussian process (DGP) is a deep network in which each layer is modelled with a Gaussian process (GP). It is a flexible model that can capture highly-nonlinear functions for complex data sets. However, the network structure of DGP often makes inference computationally expensive. In this paper, we propose an efficient sequential inference framework for DGP, where the data is processed sequentially. We also propose two DGP extensions to handle heteroscedasticity and multi-task learning. Our experimental evaluation shows the effectiveness of our sequential inference framework on a number of important learning tasks.

Graph Connectivity in Noisy Sparse Subspace Clustering

Mon, 02 May 2016 00:00:00 +0000

Subspace clustering is the problem of clustering data points into a union of low-dimensional linear/affine subspaces. It is the mathematical abstraction of many important problems in computer vision, image processing and machine learning. A line of recent work [4, 19, 24, 20] provided strong theoretical guarantee for sparse subspace cluster- ing [4], the state-of-the-art algorithm for sub- space clustering, on both noiseless and noisy data sets. It was shown that under mild conditions, with high probability no two points from different subspaces are clustered together. Such guarantee, however, is not sufficient for the clustering to be correct, due to the notorious “graph connectivity problem” [15]. In this paper, we investigate the graph connectivity problem for noisy sparse sub-space clustering and show that a simple post-processing procedure is capable of delivering consistent clustering under certain “general position” or “restricted eigenvalue” assumptions. We also show that our condition is almost tight with adversarial noise perturbation by constructing a counter-example. These results provide the first exact clustering guarantee of noisy SSC for subspaces of dimension greater then 3.

Precision Matrix Estimation in High Dimensional Gaussian Graphical Models with Faster Rates

Mon, 02 May 2016 00:00:00 +0000

In this paper, we present a new estimator for precision matrix in high dimensional Gaussian graphical models. At the core of the proposed estimator is a collection of node-wise linear regression with nonconvex penalty. In contrast to existing estimators for Gaussian graphical models with O(s\sqrt\log d/n) estimation error bound in terms of spectral norm, where s is the maximum degree of a graph, the proposed estimator could attain O(s/\sqrtn+\sqrt\log d/n) spectral norm based convergence rate in the best case, and it is no worse than exiting estimators in general. In addition, our proposed estimator enjoys the oracle property under a milder condition than existing estimators. We show through extensive experiments on both synthetic and real datasets that our estimator outperforms the state-of-the art estimators.

On Searching for Generalized Instrumental Variables

Mon, 02 May 2016 00:00:00 +0000

Instrumental Variables are a popular way to identify the direct causal effect of a random variable X on a variable Y. Often no single instrumental variable exists, although it is still possible to find a set of generalized instrumental variables (GIVs) and identify the causal effect of all these variables at once. Till now it was not known how to find GIVs systematically or even test efficiently, if given variables satisfy GIV conditions. We provide fast algorithms for searching and testing restricted cases of GIVs. However, we prove that in the most general case it is NP-hard to verify if given variables fulfill the conditions of a general instrumental sets.

Black-Box Policy Search with Probabilistic Programs

Mon, 02 May 2016 00:00:00 +0000

In this work we show how to represent policies as programs: that is, as stochastic simulators with tunable parameters. To learn the parameters of such policies we develop connections between black box variational inference and existing policy search approaches. We then explain how such learning can be implemented in a probabilistic programming system. Using our own novel implementation of such a system we demonstrate both conciseness of policy representation and automatic policy parameter learning for a set of canonical reinforcement learning problems.

Learning Sparse Additive Models with Interactions in High Dimensions

Mon, 02 May 2016 00:00:00 +0000

A function f: \mathbbR^d →\mathbbR is referred to as a Sparse Additive Model (SPAM), if it is of the form f(x) = \sum_l ∈S \phi_l(x_l), where S ⊂[d], |S| ≪d. Assuming \phi_l’s and S to be unknown, the problem of estimating f from its samples has been studied extensively. In this work, we consider a generalized SPAM, allowing for second order interaction terms. For some S_1 ⊂[d], S_2 ⊂[d] \choose 2, the function f is assumed to be of the form: f(x) = \sum_p ∈S_1 \phi_p (x_p) + \sum_(l,l’) ∈S_2 \phi_l,l’ (x_l, x_l’). Assuming \phi_p, \phi_(l,l’), S_1 and S_2 to be unknown, we provide a randomized algorithm that queries f and exactly recovers S_1,S_2. Consequently, this also enables us to estimate the underlying \phi_p, \phi_l,l’. We derive sample complexity bounds for our scheme and also extend our analysis to include the situation where the queries are corrupted with noise – either stochastic, or arbitrary but bounded. Lastly, we provide simulation results on synthetic data, that validate our theoretical findings.

Learning Probabilistic Submodular Diversity Models Via Noise Contrastive Estimation

Mon, 02 May 2016 00:00:00 +0000

Modeling diversity of sets of items is important in many applications such as product recommendation and data summarization. Probabilistic submodular models, a family of models including the determinantal point process, form a natural class of distributions, encouraging effects such as diversity, repulsion and coverage. Current models, however, are limited to small and medium number of items due to the high time complexity for learning and inference. In this paper, we propose FLID, a novel log-submodular diversity model that scales to large numbers of items and can be efficiently learned using noise contrastive estimation. We show that our model achieves state of the art performance in terms of model fit, but can be also learned orders of magnitude faster. We demonstrate the wide applicability of our model using several experiments.

Spectral M-estimation with Applications to Hidden Markov Models

Mon, 02 May 2016 00:00:00 +0000

Method of moment estimators exhibit appealing statistical properties, such as asymptotic unbiasedness, for nonconvex problems. However, they typically require a large number of samples and are extremely sensitive to model misspecification. In this paper, we apply the framework of M-estimation to develop both a generalized method of moments procedure and a principled method for regularization. Our proposed M-estimator obtains optimal sample efficiency rates (in the class of moment-based estimators) and the same well-known rates on prediction accuracy as other spectral estimators. It also makes it straightforward to incorporate regularization into the sample moment conditions. We demonstrate empirically the gains in sample efficiency from our approach on hidden Markov models.

Towards Stability and Optimality in Stochastic Gradient Descent

Mon, 02 May 2016 00:00:00 +0000

Iterative procedures for parameter estimation based on stochastic gradient descent (SGD) allow the estimation to scale to massive data sets. However, they typically suffer from numerical instability, while estimators based on SGD are statistically inefficient as they do not use all the information in the data set. To address these two issues we propose an iterative estimation procedure termed averaged implicit SGD (AI-SGD). For statistical efficiency AI-SGD employs averaging of the iterates, which achieves the Cramer-Rao bound under strong convexity, i.e., it is asymptotically an optimal unbiased estimator of the true parameter value. For numerical stability AI-SGD employs an implicit update at each iteration, which is similar to updates performed by proximal operators in optimization. In practice, AI-SGD achieves competitive performance with state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal updates, and is simple to implement as it requires fewer tunable hyperparameters than procedures that do employ proximal updates.

Generalized Ideal Parent (GIP): Discovering non-Gaussian Hidden Variables

Mon, 02 May 2016 00:00:00 +0000

A formidable challenge in uncertainty modeling in general, and when learning Bayesian networks in particular, is the discovery of unknown hidden variables. Few works that tackle this task are typically limited to discrete or Gaussian domains, or to tree structures. We propose a novel general purpose approach for discovering hidden variables in flexible non-Gaussian domains using the powerful class of Gaussian copula networks. Briefly, we define the concept of a hypothetically optimal predictor of variable and show it can be used to discover useful hidden variables in the expressive framework of copula networks. Our approach leads to performance and compactness advantages over competitors in a variety of domains.

Multiresolution Matrix Compression

Mon, 02 May 2016 00:00:00 +0000

Multiresolution Matrix Factorization (MMF) is a recently introduced method for finding multiscale structure and defining wavelets on graphs and matrices. MMF can also be used for matrix compression (sketching). However, the original MMF algorithm of (Kondor et al., 2014) scales with n^3 or worse (where n is the number of rows/columns in the matrix to be factorized) making it infeasible for large scale problems. In this paper we describe pMMF, a fast parallel MMF algorithm, which can scale to n in the range of millions. Our experimental results show that when used for matrix compression, pMMF often achieves much lower error than competing algorithms (especially on network data), yet for sparse matrices its running time scales close to linearly with n.

On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice

Mon, 02 May 2016 00:00:00 +0000

We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. Kumar and Kannan (2010) showed that running k-SVD followed by a constant approximation k-means algorithm, and then Lloyd’s algorithm, will correctly cluster nearly all of the dataset with respect to the optimal clustering, provided the dataset satisfies a deterministic clusterability assumption. This method is viewed as the "Swiss Army knife" for clustering problems, subsuming popular generative models such as Gaussian mixtures. However, it is tailored to high dimensional data, i.e., when d ≫k . We analyze Lloyd’s algorithm for general d without using the spectral projection, which leads to a weaker assumption in the case d < k. Surprisingly, we show that a simple and scalable heuristic that combines random sampling with Single-Linkage serves as a good seeding algorithm for Lloyd’s algorithm under this assumption. We then study stopping criteria for Lloyd’s algorithm under the lens of clusterability, accompanied by controlled simulations.

Bethe Learning of Graphical Models via MAP Decoding

Mon, 02 May 2016 00:00:00 +0000

Many machine learning tasks require fitting probabilistic models over structured objects, such as pixel grids, matchings, and graph edges. Maximum likelihood estimation (MLE) for such domains is challenging due to the intractability of computing partition functions. One can resort to approximate marginal inference in conjunction with gradient descent, but such algorithms require careful tuning. Alternatively, in frameworks such as the structured support vector machine (SVM-Struct), discriminative functions are learned by iteratively applying efficient maximum a posteriori (MAP) decoders. We introduce MLE-Struct, a method for learning discrete exponential family models using the Bethe approximation to the partition function. Remarkably, this problem can also be reduced to iterative (MAP) decoding. This connection emerges by combining the Bethe approximation with the Frank-Wolfe (FW) algorithm on a convex dual objective, which circumvents the intractable partition function. Our method can learn both generative and conditional models and is substantially faster and easier to implement than existing MLE approaches while still relying on the same black-box interface to MAP decoding as SVM-Struct. We perform competitively on problems in denoising, segmentation, matching, and new datasets of roommate assignments and news and financial time series.

Computationally Efficient Bayesian Learning of Gaussian Process State Space Models

Mon, 02 May 2016 00:00:00 +0000

Gaussian processes allow for flexible specification of prior assumptions of unknown dynamics in state space models. We present a procedure for efficient Bayesian learning in Gaussian process state space models, where the representation is formed by projecting the problem onto a set of approximate eigenfunctions derived from the prior covariance structure. Learning under this family of models can be conducted using a carefully crafted particle MCMC algorithm. This scheme is computationally efficient and yet allows for a fully Bayesian treatment of the problem. Compared to conventional system identification tools or existing learning methods, we show competitive performance and reliable quantification of uncertainties in the model.

Survey Propagation beyond Constraint Satisfaction Problems

Mon, 02 May 2016 00:00:00 +0000

Survey propagation (SP) is a message passing procedure that attempts to model all the fixed points of Belief Propagation (BP), thereby improving BP’s approximation in loopy graphs where BP’s assumptions do not hold. For this, SP messages represent distributions over BP messages. Unfortunately this requirement makes SP intractable beyond constraint satisfaction problems because, to perform general SP updates, one has to operate on distributions over a continuous domain. We propose an approximation scheme to efficiently extend the application of SP to marginalization in binary pairwise graphical models. Our approximate SP has O(DK\log(DK)t) complexity per iteration, where t is the complexity of BP per iteration, D is the maximum node degree and K is a resolution constant controlling the approximation’s fidelity. Our experiments show that this method can track many BP fixed points, achieving a high marginalization accuracy within a few iterations, in difficult settings where BP is often non-convergent and inaccurate.

AdaDelay: Delay Adaptive Distributed Stochastic Optimization

Mon, 02 May 2016 00:00:00 +0000

We develop distributed stochastic convex optimization algorithms under a delayed gradient model in which server nodes update parameters and worker nodes compute stochastic (sub)gradients. Our setup is motivated by the behavior of real-world distributed computation systems; in particular, we analyze a setting wherein worker nodes can be differently slow at different times. In contrast to existing approaches, we do not impose a worst-case bound on the delays experienced but rather allow the updates to be sensitive to the actual delays experienced. This sensitivity allows use of larger stepsizes, which can help speed up initial convergence without having to wait too long for slower machines; the global convergence rate is still preserved. We experiment with different delay patterns, and obtain noticeable improvements for large-scale real datasets with billions of examples and features.

Learning Sigmoid Belief Networks via Monte Carlo Expectation Maximization

Mon, 02 May 2016 00:00:00 +0000

Belief networks are commonly used generative models of data, but require expensive posterior estimation to train and test the model. Learning typically proceeds by posterior sampling, variational approximations, or recognition networks, combined with stochastic optimization. We propose using an online Monte Carlo expectation-maximization (MCEM) algorithm to learn the maximum a posteriori (MAP) estimator of the generative model or optimize the variational lower bound of a recognition network. The E-step in this algorithm requires posterior samples, which are already generated in current learning schema. For the M-step, we augment with Polya-Gamma (PG) random variables to give an analytic updating scheme. We show relationships to standard learning approaches by deriving stochastic gradient ascent in the MCEM framework. We apply the proposed methods to both binary and count data. Experimental results show that MCEM improves the convergence speed and often improves hold-out performance over existing learning methods. Our approach is readily generalized to other recognition networks.

A Fixed-Point Operator for Inference in Variational Bayesian Latent Gaussian Models

Mon, 02 May 2016 00:00:00 +0000

Latent Gaussian Models (LGM) provide a rich modeling framework with general inference procedures. The variational approximation offers an effective solution for such models and has attracted a significant amount of interest. Recent work proposed a fixed-point (FP) update procedure to optimize the covariance matrix in the variational solution and demonstrated its efficacy in specific models. The paper makes three contributions. First, it shows that the same approach can be used more generally in extensions of LGM. Second, it provides an analysis identifying conditions for the convergence of the FP method. Third, it provides an extensive experimental evaluation in Gaussian processes, sparse Gaussian processes, and generalized linear models, with several non-conjugate observation likelihoods, showing wide applicability of the FP method and a significant advantage over gradient based optimization.

Learning Structured Low-Rank Representation via Matrix Factorization

Mon, 02 May 2016 00:00:00 +0000

A vast body of recent works in the literature has shown that exploring structures beyond data low-rankness can boost the performance of subspace clustering methods such as Low-Rank Representation (LRR). It has also been well recognized that the matrix factorization framework might offer more flexibility on pursuing underlying structures of the data. In this paper, we propose to learn structured LRR by factorizing the nuclear norm regularized matrix, which leads to our proposed non-convex formulation NLRR. Interestingly, this formulation of NLRR provides a general framework for unifying a variety of popular algorithms including LRR, dictionary learning, robust principal component analysis, sparse subspace clustering, etc. Several variants of NLRR are also proposed, for example, to promote sparsity while preserving low-rankness. We design a practical algorithm for NLRR and its variants, and establish theoretical guarantee for the stability of the solution and the convergence of the algorithm. Perhaps surprisingly, the computational and memory cost of NLRR can be reduced by roughly one order of magnitude compared to the cost of LRR. Experiments on extensive simulations and real datasets confirm the robustness of efficiency of NLRR and the variants.

Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

Mon, 02 May 2016 00:00:00 +0000

The Schatten quasi-norm was introduced to bridge the gap between the trace norm and rank function. However, existing algorithms are too slow or even impractical for large-scale problems. Motivated by the equivalence relation between the trace norm and its bilinear spectral penalty, we define two tractable Schatten norms, i.e. the bi-trace and tri-trace norms, and prove that they are in essence the Schatten-1/2 and 1/3 quasi-norms, respectively. By applying the two defined Schatten quasi-norms to various rank minimization problems such as MC and RPCA, we only need to solve much smaller factor matrices. We design two efficient linearized alternating minimization algorithms to solve our problems and establish that each bounded sequence generated by our algorithms converges to a critical point. We also provide the restricted strong convexity (RSC) based and MC error bounds for our algorithms. Our experimental results verified both the efficiency and effectiveness of our algorithms compared with the state-of-the-art methods.

Unbounded Bayesian Optimization via Regularization

Mon, 02 May 2016 00:00:00 +0000

Bayesian optimization has recently emerged as a powerful and flexible tool in machine learning for hyperparameter tuning and more generally for the efficient global optimization of expensive black box functions. The established practice requires a user-defined bounded domain, which is assumed to contain the global optimizer. However, when little is known about the probed objective function, it can be difficult to prescribe such a domain. In this work, we modify the standard Bayesian optimization framework in a principled way to allow for unconstrained exploration of the search space. We introduce a new alternative method and compare it to a volume doubling baseline on two common synthetic benchmarking test functions. Finally, we apply our proposed methods on the task of tuning the stochastic gradient descent optimizer for both a multi-layered perceptron and a convolutional neural network on the MNIST dataset.

Provable Tensor Methods for Learning Mixtures of Generalized Linear Models

Mon, 02 May 2016 00:00:00 +0000

We consider the problem of learning mixtures of generalized linear models (GLM) which arise in classification and regression problems. Typical learning approaches such as expectation maximization (EM) or variational Bayes can get stuck in spurious local optima. In contrast, we present a tensor decomposition method which is guaranteed to correctly recover the parameters. The key insight is to employ certain feature transformations of the input, which depend on the input generative model. Specifically, we employ score function tensors of the input and compute their cross-correlation with the response variable. We establish that the decomposition of this tensor consistently recovers the parameters, under mild non-degeneracy conditions. We demonstrate that the computational and sample complexity of our method is a low order polynomial of the input and the latent dimensions.

Probability Inequalities for Kernel Embeddings in Sampling without Replacement

Mon, 02 May 2016 00:00:00 +0000

The \emphkernel embedding of distributions is a popular machine learning technique to manipulate probability distributions and an integral part of numerous applications. Its empirical counterpart is an estimate from a finite dataset of samples from the distribution under consideration. However, for large-scale learning problems the empirical kernel embedding becomes infeasible to compute and approximate, constant time, solutions are necessary. Instead of the full dataset, a random subset of smaller size can be used to calculate the empirical kernel embedding, known as \emphsampling without replacement. In this work we generalize the results of (Serfling 1974) to quantify the difference between this two estimates. We derive probability inequalities for the kernel embedding and more general inequalities for Banach space valued martingales in the setting of sampling without replacement.

Limits on Sparse Support Recovery via Linear Sketching with Random Expander Matrices

Mon, 02 May 2016 00:00:00 +0000

Linear sketching is a powerful tool for the problem of sparse signal recovery, having numerous applications such as compressive sensing, data stream computing, graph sketching, and routing. Motivated by applications where the \emphpositions of the non-zero entries in a sparse vector are of primary interest, we consider the problem of \emphsupport recovery from a linear sketch taking the form \mathbfY = \mathbfXβ+ \mathbfZ. We focus on a widely-used expander-based construction in the columns of the measurement matrix \mathbfX ∈\mathbbR^n \times p are random permutations of a sparse binary vector containing d ≪n ones and n-d zeros. We provide a sharp characterization of the number of measurements required for an information-theoretically optimal decoder, thus permitting a precise comparison to the i.i.d. Gaussian construction. Our findings reveal both positive and negative results, showing that the performance nearly matches the Gaussian construction at moderate-to-high noise levels, while being worse by an arbitrarily large factor at low noise levels.

Chained Gaussian Processes

Mon, 02 May 2016 00:00:00 +0000

Gaussian process models are flexible, Bayesian non-parametric approaches to regression. Properties of multivariate Gaussians mean that they can be combined linearly in the manner of additive models and via a link function (like in generalized linear models) to handle non-Gaussian data. However, the link function formalism is restrictive, link functions are always invertible and must convert a parameter of interest to an linear combination of the underlying processes. There are many likelihoods and models where a non-linear combination is more appropriate. We term these more general models "Chained Gaussian Processes": the transformation of the GPs to the likelihood parameters will not generally be invertible, and that implies that linearisation would only be possible with multiple (localized) links, i.e a chain. We develop an approximate inference procedure for Chained GPs that is scalable and applicable to any factorized likelihood. We demonstrate the approximation on a range of likelihood functions.

Non-Gaussian Component Analysis with Log-Density Gradient Estimation

Mon, 02 May 2016 00:00:00 +0000

Non-Gaussian component analysis (NGCA) is aimed at identifying a linear subspace such that the projected data follows a non-Gaussian distribution. In this paper, we propose a novel NGCA algorithm based on log-density gradient estimation. Unlike existing methods, the proposed NGCA algorithm identifies the linear subspace by using the eigenvalue decomposition without any iterative procedures, and thus is computationally reasonable. Furthermore, through theoretical analysis, we prove that the identified subspace converges to the true subspace at the optimal parametric rate. Finally, the practical performance of the proposed algorithm is demonstrated on both artificial and benchmark datasets.

Model-based Co-clustering for High Dimensional Sparse Data

Mon, 02 May 2016 00:00:00 +0000

We propose a novel model based on the von Mises-Fisher (vMF) distribution for co-clustering high dimensional sparse matrices. While existing vMF-based models are only suitable for clustering along one dimension, our model acts simultaneously on both dimensions of a data matrix. Thereby it has the advantage of exploiting the inherent duality between rows and columns. Setting our model under the maximum likelihood (ML) approach and the classification ML (CML) approach, we derive two novel, hard and soft, co-clustering algorithms. Empirical results on numerous synthetic and real-world text datasets, demonstrate the effectiveness of our approach, for modelling high dimensional sparse data and co-clustering. Furthermore, thanks to our formulation, that performs an implicitly adaptive dimensionality reduction at each stage, our model alleviates the problem of high concentration parameters kappa’s, a well known difficulty in the classical vMF-based models.

Graph Sparsification Approaches for Laplacian Smoothing

Mon, 02 May 2016 00:00:00 +0000

Given a statistical estimation problem where regularization is performed according to the structure of a large, dense graph G, we consider fitting the statistical estimate using a \it sparsified surrogate graph \mathbfG, which shares the vertices of G but has far fewer edges, and is thus more tractable to work with computationally. We examine three types of sparsification: spectral sparsification, which can be seen as the result of sampling edges from the graph with probabilities proportional to their effective resistances, and two simpler sparsifiers, which sample edges uniformly from the graph, either globally or locally. We provide strong theoretical and experimental results, demonstrating that sparsification before estimation can give statistically sensible solutions, with significant computational savings.

Controlling Bias in Adaptive Data Analysis Using Information Theory

Mon, 02 May 2016 00:00:00 +0000

Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds applies in more general settings. We illustrate our results with simple simulations.

A Column Generation Bound Minimization Approach with PAC-Bayesian Generalization Guarantees

Mon, 02 May 2016 00:00:00 +0000

The C-bound, introduced in Lacasse et al (2006), gives a tight upper bound on the risk of the majority vote classifier. Laviolette et al. (2011) designed a learning algorithm named MinCq that outputs a dense distribution on a finite set of base classifiers by minimizing the C-bound, together with a PAC-Bayesian generalization guarantee. In this work, we design a column generation algorithm that we call CqBoost, that optimizes the C-bound and outputs a sparse distribution on a possibly infinite set of voters. We also propose a PAC-Bayesian bound for CqBoost that holds for finite and two cases of continuous sets of base classifiers. Finally, compare the accuracy and the sparsity of CqBoost with MinCq and other state-of-the-art boosting algorithms.

Discriminative Structure Learning of Arithmetic Circuits

Mon, 02 May 2016 00:00:00 +0000

The biggest limitation of probabilistic graphical models is the complexity of inference, which is often intractable. An appealing alternative is to use tractable probabilistic models, such as arithmetic circuits (ACs) and sum-product networks (SPNs), in which marginal and conditional queries can be answered efficiently. In this paper, we present the first discriminative structure learning algorithm for ACs, DACLearn (Discriminative AC Learner). Like previous work on generative structure learning, DACLearn finds a log-linear model with conjunctive features, using the size of an equivalent AC representation as a learning bias. Unlike previous work, DACLearn optimizes conditional likelihood, resulting in a more accurate conditional distribution. DACLearn also learns much more compact ACs than generative methods, since it does not need to represent a consistent distribution over the evidence variables. To ensure efficiency, DACLearn uses novel initialization and search heuristics to drastically reduce the number of feature evaluations required to learn an accurate model. In experiments on 20 benchmark domains, we find that our DACLearn learns models that are more accurate and compact than other tractable generative and discriminative methods.

Fast Dictionary Learning with a Smoothed Wasserstein Loss

Mon, 02 May 2016 00:00:00 +0000

We consider in this paper the dictionary learning problem when the observations are normalized histograms of features. This problem can be tackled using non-negative matrix factorization approaches, using typically Euclidean or Kullback-Leibler fitting errors. Because these fitting errors are separable and treat each feature on equal footing, they are blind to any similarity the features may share. We assume in this work that we have prior knowledge on these features. To leverage this side-information, we propose to use the Wasserstein (\textita.k.a. earth mover’s or optimal transport) distance as the fitting error between each original point and its reconstruction, and we propose scalable algorithms to to so. Our methods build upon Fenchel duality and entropic regularization of Wasserstein distances, which improves not only speed but also computational stability. We apply these techniques on face images and text documents. We show in particular that we can learn dictionaries (topics) for bag-of-word representations of texts using words that may not have appeared in the original texts, or even words that come from a different language than that used in the texts.

Determinantal Regularization for Ensemble Variable Selection

Mon, 02 May 2016 00:00:00 +0000

Recent years have seen growing interest in deterministic search approaches to spike-and-slab Bayesian variable selection. Such methods have focused on the goal of finding a global mode to identify a “best model”. However, the report of a single model will be a misleading reflection of the model uncertainty inherent in a highly multimodal posterior. Motivated by non-parametric variational Bayes strategies, we move beyond this limitation by proposing an ensemble optimization approach to identify a collection of representative posterior modes. By deploying determinantal penalty functions as diversity regularizers, our approach performs regularization over multiple locations of the posterior. The key driver of these determinantal penalties is a kernel function that induces repulsion in the latent model space domain.

C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching

Mon, 02 May 2016 00:00:00 +0000

Lightweight, source-to-source transformation approaches to implementing MCMC for probabilistic programming languages are popular for their simplicity, support of existing deterministic code, and ability to execute on existing fast runtimes. However, they are also inefficient, requiring a complete re-execution of the program on every Metropolis Hastings proposal. We present a new extension to the lightweight approach, C3, which enables efficient, incrementalized re-execution of MH proposals. C3 is based on two core ideas: transforming probabilistic programs into continuation passing style (CPS), and caching the results of function calls. It is particularly effective at speeding up recursive programs with many local latent variables. We show that on several common models, C3 reduces proposal runtime by 20-100x, in some cases reducing runtime complexity from linear in model size to constant. We also demonstrate nearly an order of magnitude speedup on a complex inverse procedural modeling application.

Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Mon, 02 May 2016 00:00:00 +0000

Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node thereby achieving a space reduction from O(m) to O(\log m), under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.

Stochastic Neural Networks with Monotonic Activation Functions

Mon, 02 May 2016 00:00:00 +0000

We propose a Laplace approximation that creates a stochastic unit from any smooth monotonic activation function, using only Gaussian noise. This paper investigates the application of this stochastic approximation in training a family of Restricted Boltzmann Machines (RBM) that are closely linked to Bregman divergences. This family, that we call exponential family RBM (Exp-RBM), is a subset of the exponential family Harmoniums that expresses family members through a choice of smooth monotonic non-linearity for each neuron. Using contrastive divergence along with our Gaussian approximation, we show that Exp-RBM can learn useful representations using novel stochastic units.

Online and Distributed Bayesian Moment Matching for Parameter Learning in Sum-Product Networks

Mon, 02 May 2016 00:00:00 +0000

Probabilistic graphical models provide a general and flexible framework for reasoning about complex dependencies in noisy domains with many variables. Among the various types of probabilistic graphical models, sum-product networks (SPNs) have recently generated some interest because exact inference can always be done in linear time with respect to the size of the network. This is particularly attractive since it means that learning an SPN from data always yields a tractable model for inference. However, existing parameter learning algorithms for SPNs operate in batch mode and do not scale easily to large datasets. In this work, we explore online algorithms to ensure that parameter learning can also be done tractably with respect to the amount of data. More specifically, we propose a new Bayesian moment matching (BMM) algorithm that operates naturally in an online fashion and that can be easily distributed. We demonstrate the effectiveness and scalability of BMM in comparison to online extensions of gradient descent, exponentiated gradient and expectation maximization on 20 classic benchmarks and 4 large scale datasets.

Low-Rank Approximation of Weighted Tree Automata

Mon, 02 May 2016 00:00:00 +0000

We describe a technique to minimize weighted tree automata (WTA), a powerful formalisms that subsumes probabilistic context-free grammars (PCFGs) and latent-variable PCFGs. Our method relies on a singular value decomposition of the underlying Hankel matrix defined by the WTA. Our main theoretical result is an efficient algorithm for computing the SVD of an infinite Hankel matrix implicitly represented as a WTA. We evaluate our method on real-world data originating in newswire treebank. We show that the model achieves lower perplexity than previous methods for PCFG minimization, and also is much more stable due to the absence of local optima.

Back to the Future: Radial Basis Function Networks Revisited

Mon, 02 May 2016 00:00:00 +0000

Radial Basis Function (RBF) networks are a classical family of algorithms for supervised learning. The most popular approach for training RBF networks has relied on kernel methods using regularization based on a norm in a Reproducing Kernel Hilbert Space (RKHS), which is a principled and empirically successful framework. In this paper we aim to revisit some of the older approaches to training the RBF networks from a more modern perspective. Specifically, we analyze two common regularization procedures, one based on the square norm of the coefficients in the network and another on using centers obtained by k-means clustering. We show that both of these RBF methods can be recast as certain data-dependent kernels. We provide a theoretical analysis of these methods as well as a number of experimental results, pointing out very competitive experimental performance as well as certain advantages over the standard kernel methods in terms of both flexibility (incorporating of unlabeled data) and computational complexity. Finally, our results shed light on some impressive recent successes of using soft k-means features for image recognition and other tasks.

A Deep Generative Deconvolutional Image Model

Mon, 02 May 2016 00:00:00 +0000

A deep generative model is developed for representation and analysis of images, based on a hierarchical convolutional dictionary-learning framework. Stochastic unpooling is employed to link consecutive layers in the model, yielding top-down image generation. A Bayesian support vector machine is linked to the top-layer features, yielding max-margin discrimination. Deep deconvolutional inference is employed when testing, to infer the latent features, and the top-layer features are connected with the max-margin classifier for discrimination tasks. The algorithm is efficiently trained via Monte Carlo expectation-maximization (MCEM), with implementation on graphical processor units (GPUs) for efficient large-scale learning, and fast testing. Excellent results are obtained on several benchmark datasets, including ImageNet, demonstrating that the proposed model achieves results that are highly competitive with similarly sized convolutional neural networks.

On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games

Mon, 02 May 2016 00:00:00 +0000

The main contribution of this paper consists in extending several non-stationary Reinforcement Learning (RL) algorithms and their theoretical guarantees to the case of γ-discounted zero-sum Markov Games (MGs). As in the case of Markov Decision Processes (MDPs), non-stationary algorithms are shown to exhibit better performance bounds compared to their stationary counterparts. The obtained bounds are generically composed of three terms: 1) a dependency on γ(discount factor), 2) a concentrability coefficient and 3) a propagation error term. This error, depending on the algorithm, can be caused by a regression step, a policy evaluation step or a best-response evaluation step. As a second contribution, we empirically demonstrate, on generic MGs (called Garnets), that non-stationary algorithms outperform their stationary counterparts. In addition, it is shown that their performance mostly depends on the nature of the propagation error. Indeed, algorithms where the error is due to the evaluation of a best-response are penalized (even if they exhibit better concentrability coefficients and dependencies on γ) compared to those suffering from a regression error.

K2-ABC: Approximate Bayesian Computation with Kernel Embeddings

Mon, 02 May 2016 00:00:00 +0000

Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will leak information, which leads to ABC algorithms yielding samples from an incorrect posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) to construct a dissimilarity measure between the observed and simulated data. The embedding of an empirical distribution of the data into a reproducing kernel Hilbert space plays a role of the summary statistic and is sufficient whenever the corresponding kernels are characteristic. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm.

Bayesian Nonparametric Kernel-Learning

Mon, 02 May 2016 00:00:00 +0000

Kernel methods are ubiquitous tools in machine learning. They have proven to be effective in many domains and tasks. Yet, kernel methods often require the user to select a predefined kernel to build an estimator with. However, there is often little reason for the common practice of selecting a kernel a priori. Even if a universal approximating kernel is selected, the quality of the finite sample estimator may be greatly affected by the choice of kernel. Furthermore, when directly applying kernel methods, one typically needs to compute a N \times N Gram matrix of pairwise kernel evaluations to work with a dataset of N instances. The computation of this Gram matrix precludes the direct application of kernel methods on large datasets, and makes kernel learning especially difficult. In this paper we introduce Bayesian nonparmetric kernel-learning (BaNK), a generic, data-driven framework for scalable learning of kernels. BaNK places a nonparametric prior on the spectral distribution of random frequencies allowing it to both learn kernels and scale to large datasets. We show that this framework can be used for large scale regression and classification tasks. Furthermore, we show that BaNK outperforms several other scalable approaches for kernel learning on a variety of real world datasets.

Control Functionals for Quasi-Monte Carlo Integration

Mon, 02 May 2016 00:00:00 +0000

Quasi-Monte Carlo (QMC) methods are being adopted in statistical applications due to the increasingly challenging nature of numerical integrals that are now routinely encountered. For integrands with d-dimensions and derivatives of order α, an optimal QMC rule converges at a best-possible rate O(N^-α/d). However, in applications the value of αcan be unknown and/or a rate-optimal QMC rule can be unavailable. Standard practice is to employ \alpha_L-optimal QMC where the lower bound \alpha_L ≤αis known, but in general this does not exploit the full power of QMC. One solution is to trade-off numerical integration with functional approximation. This strategy is explored herein and shown to be well-suited to modern statistical computation. A challenging application to robotic arm data demonstrates a substantial variance reduction in predictions for mechanical torques.

Accelerated Stochastic Gradient Descent for Minimizing Finite Sums

Mon, 02 May 2016 00:00:00 +0000

We propose an optimization method for minimizing the finite sums of smooth convex functions. Our method incorporates an accelerated gradient descent (AGD) and a stochastic variance reduction gradient (SVRG) in a mini-batch setting. An important feature of the method is that it can be directly applied to general convex and semi-strongly convex problems that is a weaker condition than strong convexity. We show that our method achieves a better overall complexity for the general convex problems and linear convergence for optimal strongly convex problems. Moreover we prove the fast iteration complexity of our method. Our experiments show the effectiveness of our method.

The Nonparametric Kernel Bayes Smoother

Mon, 02 May 2016 00:00:00 +0000

Recently, significant progress has been made developing kernel mean expressions for Bayesian inference. An important success in this domain is the nonparametric kernel Bayes’ filter (nKB-filter), which can be used for sequential inference in state space models. We expand upon this work by introducing a smoothing algorithm, the nonparametric kernel Bayes’ smoother (nKB-smoother) which relies on kernel Bayesian inference through the kernel sum rule and kernel Bayes’ rule. We derive the smoothing equations, analyze the computational cost, and show smoothing consistency. We summarize the algorithm, which is simple to implement, requiring only matrix multiplications and the output of the nKB-filter. Finally, we report experimental results that compare the nKB-smoother to previous parametric and nonparametric approaches to Bayesian filtering and smoothing. In the supplementary materials, we show that the combination of the nKB-filter and the nKB-smoother allows marginal kernel mean computation, which gives an alternative to kernel belief propagation.

No Regret Bound for Extreme Bandits

Mon, 02 May 2016 00:00:00 +0000

Algorithms for hyperparameter optimization abound, all of which work well under different and often unverifiable assumptions. Motivated by the general challenge of sequentially choosing which algorithm to use, we study the more specific task of choosing among distributions to use for random hyperparameter optimization. This work is naturally framed in the extreme bandit setting, which deals with sequentially choosing which distribution from a collection to sample in order to minimize (maximize) the single best cost (reward). Whereas the distributions in the standard bandit setting are primarily characterized by their means, a number of subtleties arise when we care about the minimal cost as opposed to the average cost. For example, there may not be a well-defined “best” distribution as there is in the standard bandit setting. The best distribution depends on the rewards that have been obtained and on the remaining time horizon. Whereas in the standard bandit setting, it is sensible to compare policies with an oracle which plays the single best arm, in the extreme bandit setting, there are multiple sensible oracle models. We define a sensible notion of “extreme regret” in the extreme bandit setting, which parallels the concept of regret in the standard bandit setting. We then prove that no policy can asymptotically achieve no extreme regret.

Optimal Statistical and Computational Rates for One Bit Matrix Completion

Mon, 02 May 2016 00:00:00 +0000

We consider one bit matrix completion under rank constraint. We present an estimator based on rank constrained maximum likelihood estimation, and an efficient greedy algorithm to solve it approximately based on an extension of conditional gradient descent. The output of the proposed algorithm converges at a linear rate to the underlying true low-rank matrix up to the optimal statistical estimation error rate, i.e., O(\sqrtrn\log(n)/|Ω|), where n is the dimension of the underlying matrix and |Ω| is the number of observed entries. Our work establishes the first computationally efficient approach with provable guarantee for optimal estimation in one bit matrix completion. Our theory is supported by thorough numerical results.

Approximate Inference Using DC Programming For Collective Graphical Models

Mon, 02 May 2016 00:00:00 +0000

Collective graphical models (CGMs) provide a framework for reasoning about a population of independent and identically distributed individuals when only noisy and aggregate observations are given. Previous approaches for inference in CGMs work on a junction-tree representation, thereby highly limiting their scalability. To remedy this, we show how the Bethe entropy approximation naturally arises for the inference problem in CGMs. We reformulate the resulting optimization problem as a difference-of-convex functions program that can capture different types of CGM noise models. Using the concave-convex procedure, we then develop a scalable message-passing algorithm. Empirically, our approach is highly scalable and accurate for large graphs, more than an order-of-magnitude faster than a generic optimization solver, and is guaranteed to converge unlike the previous message-passing approach NLBP that fails in several loopy graphs.

New Resistance Distances with Global Information on Large Graphs

Mon, 02 May 2016 00:00:00 +0000

We consider the problem that on large random geometric graphs, random walk-based distances between nodes do not carry global information such as cluster structure. Instead, as the graphs become larger, the distances contain mainly the obsolete information of local density of the nodes. Many distances or similarity measures between nodes on a graph have been proposed but none are both proved to overcome this problem or computationally feasible even for small graphs. We propose new distance functions between nodes for this problem. The idea is to use electrical flows with different energy functions. Our proposed distances are proved analytically to be metrics in L^p spaces, to keep global information, avoiding the problem, and can be computed efficiently for large graphs. Our experiments with synthetic and real data confirmed the theoretical properties and practical performances of our proposed distances.

Pseudo-Marginal Slice Sampling

Mon, 02 May 2016 00:00:00 +0000

Markov chain Monte Carlo (MCMC) methods asymptotically sample from complex probability distributions. The pseudo-marginal MCMC framework only requires an unbiased estimator of the unnormalized probability distribution function to construct a Markov chain. However, the resulting chains are harder to tune to a target distribution than conventional MCMC, and the types of updates available are limited. We describe a general way to clamp and update the random numbers used in a pseudo-marginal method’s unbiased estimator. In this framework we can use slice sampling and other adaptive methods. We obtain more robust Markov chains, which often mix more quickly.

A Linearly-Convergent Stochastic L-BFGS Algorithm

Mon, 02 May 2016 00:00:00 +0000

We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed in Byrd et al. (2014) as well as a recent approach to variance reduction for stochastic gradient descent from Johnson and Zhang (2013). We demonstrate experimentally that our algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving the optimization problems to high levels of precision. Furthermore, we show that our algorithm performs well for a wide-range of step sizes, often differing by several orders of magnitude.

Accelerating Online Convex Optimization via Adaptive Prediction

Mon, 02 May 2016 00:00:00 +0000

We present a powerful general framework for designing data-dependent online convex optimization algorithms, building upon and unifying recent techniques in adaptive regularization, optimistic gradient predictions, and problem-dependent randomization. We first present a series of new regret guarantees that hold at any time and under very minimal assumptions, and then show how different relaxations recover existing algorithms, both basic as well as more recent sophisticated ones. Finally, we show how combining adaptivity, optimism, and problem-dependent randomization can guide the design of algorithms that benefit from more favorable guarantees than recent state-of-the-art methods.

NuC-MKL: A Convex Approach to Non Linear Multiple Kernel Learning

Mon, 02 May 2016 00:00:00 +0000

Multiple Kernel Learning (MKL) methods are known for their effectiveness in solving classification and regression problems involving multi-modal data. Many MKL approaches use linear combination of base kernels, resulting in somewhat limited feature representations. Several non-linear MKL formulations were proposed recently. They provide much higher dimensional feature spaces, and, therefore, richer representations. However, these methods often lead to non-convex optimization and to intractable number of optimization parameters. In this paper, we propose a new non-linear MKL method that utilizes nuclear norm regularization and leads to convex optimization problem. The proposed Nuclear-norm-Constrained MKL (NuC-MKL) algorithm converges faster, and requires smaller number of calls to an SVM solver, as compared to other competing methods. Moreover, the number of the model support vectors in our approach is usually much smaller, as compared to other methods. This suggests that our algorithm is more resilient to overfitting. We test our algorithm on several known benchmarks, and show that it equals or outperforms the state-of-the-art MKL methods on all these data sets.

On the Reducibility of Submodular Functions

Mon, 02 May 2016 00:00:00 +0000

The scalability of submodular optimization methods is critical for their usability in practice. In this paper, we study the reducibility of submodular functions, a property that enables us to reduce the solution space of submodular optimization problems without performance loss. We introduce the concept of reducibility using marginal gains. Then we show that by adding perturbation, we can endow irreducible functions with reducibility, based on which we propose the perturbation-reduction optimization framework. Our theoretical analysis proves that given the perturbation scales, the reducibility gain could be computed, and the performance loss has additive upper bounds. We further conduct empirical studies and the results demonstrate that our proposed framework significantly accelerates existing optimization methods for irreducible submodular functions with a cost of only small performance losses.

Fitting Spectral Decay with the k-Support Norm

Mon, 02 May 2016 00:00:00 +0000

The spectral k-support norm enjoys good estimation properties in low rank matrix learning problems, empirically outperforming the trace norm. Its unit ball is the convex hull of rank k matrices with unit Frobenius norm. In this paper we generalize the norm to the spectral (k,p)-support norm, whose additional parameter p can be used to tailor the norm to the decay of the spectrum of the underlying model. We characterize the unit ball and we explicitly compute the norm. We further provide a conditional gradient method to solve regularization problems with the norm, and we derive an efficient algorithm to compute the Euclidean projection on the unit ball in the case p=∞. In numerical experiments, we show that allowing p to vary significantly improves performance over the spectral k-support norm on various matrix completion benchmarks, and better captures the spectral decay of the underlying model.

Large-Scale Optimization Algorithms for Sparse Conditional Gaussian Graphical Models

Mon, 02 May 2016 00:00:00 +0000

This paper addresses the problem of scalable optimization for L1-regularized conditional Gaussian graphical models. Conditional Gaussian graphical models generalize the well-known Gaussian graphical models to conditional distributions to model the output network influenced by conditioning input variables. While highly scalable optimization methods exist for sparse Gaussian graphical model estimation, state-of-the-art methods for conditional Gaussian graphical models are not efficient enough and more importantly, fail due to memory constraints for very large problems. In this paper, we propose a new optimization procedure based on a Newton method that efficiently iterates over two sub-problems, leading to drastic improvement in computation time compared to the previous methods. We then extend our method to scale to large problems under memory constraints, using block coordinate descent to limit memory usage while achieving fast convergence. Using synthetic and genomic data, we show that our methods can solve problems with millions of variables and tens of billions of parameters to high accuracy on a single machine.

On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes

Mon, 02 May 2016 00:00:00 +0000

The variational framework for learning inducing variables (Titsias, 2009) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.

Globally Sparse Probabilistic PCA

Mon, 02 May 2016 00:00:00 +0000

With the flourishing development of high-dimensional data, sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables may be difficult since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure that allows to obtain several sparse components with the same sparsity pattern. To this end, using Roweis’ probabilistic interpretation of PCA and an isotropic Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. In order to avoid the drawbacks of discrete model selection, we propose a simple relaxation of our framework which allows to find a path of models using a variational expectation-maximization algorithm. The exact marginal likelihood can eventually be maximized over this path, relying on Occam’s razor to select the relevant variables. Since the sparsity pattern is common to all components, we call this approach globally sparse probabilistic PCA (GSPPCA). Its usefulness is illustrated on synthetic data sets and on several real unsupervised feature selection problems.

Variational Tempering

Mon, 02 May 2016 00:00:00 +0000

Variational inference (VI) combined with data subsampling enables approximate posterior inference with large data sets for otherwise intractable models, but suffers from poor local optima. We first formulate a deterministic annealing approach for the generic class of conditionally conjugate exponential family models. This algorithm uses a temperature parameter that deterministically deforms the objective and reduces this parameter over the course of the optimization. A well-known drawback in annealing is the choice of the annealing schedule. We therefore introduce variational tempering, a variational algorithm that introduces a temperature latent variable to the model. In contrast to related work in the Markov chain Monte Carlo literature, this algorithm results in adaptive annealing schedules. Lastly, we develop local variational tempering, which assigns a latent temperature to each data point; this allows for dynamic annealing that varies across data. Compared to the traditional VI, all proposed approaches find improved predictive likelihoods on held-out data.

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Mon, 02 May 2016 00:00:00 +0000

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.

Latent Point Process Allocation

Mon, 02 May 2016 00:00:00 +0000

We introduce a probabilistic model for the factorisation of continuous Poisson process rate functions. Our model can be thought of as a topic model for Poisson point processes in which each point is assigned to one of a set of latent rate functions that are shared across multiple outputs. We show that the model brings a means of incorporating structure in point process inference beyond the state-of-the-art. We derive an efficient variational inference scheme for the model based on sparse Gaussian processes that scales linearly in the number of data points. Finally, we demonstrate, using examples from spatial and temporal statistics, how the model can be used for discovering hidden structure with greater precision than standard frequentist approaches.

Semi-Supervised Learning with Adaptive Spectral Transform

Mon, 02 May 2016 00:00:00 +0000

This paper proposes a novel nonparametric framework for semi-supervised learning and for optimizing the Laplacian spectrum of the data manifold simultaneously. Our formulation leads to a convex optimization problem that can be efficiently solved via the bundle method, and can be interpreted as to asymptotically minimize the generalization error bound of semi-supervised learning with respect to the graph spectrum. Experiments over benchmark datasets in various domains show advantageous performance of the proposed method over strong baselines.

Efficient Bregman Projections onto the Permutahedron and Related Polytopes

Mon, 02 May 2016 00:00:00 +0000

The problem of projecting onto the permutahedron \mathbfPH(c) – the convex hull of all permutations of a fixed vector c – under a uniformly separable Bregman divergence is shown to be reducible to the Isotonic Optimization problem. This allows us to employ known fast algorithms to improve on several recent results on Bregman projections onto permutahedra. In addition, we present a new algorithm \rm mergepool that have better complexity when the number of distinct entries d in the vector c is small, the simplex being one such example, with c=(1,0,0,\dotsc,0)^T and d=2. \rm mergepool runs in O(n \log d) for certain popular Bregman divergence measures and requires O((n \log d) \log \fracU ε) to find ε-close solutions for general uniformly separable Bregman divergences, where U is a bound on the width of the interval containing the dual solution components. These estimates matches or improves best known bounds for all Bregman projection problems onto various permutahedra, including recent results for projection onto the simplex. The same complexity bounds apply to signed permutahedra, a class that includes the \ell_1-ball as a special case. In summary, this work describes a fast unified approach to this well-known class of problems.

One Scan 1-Bit Compressed Sensing

Mon, 02 May 2016 00:00:00 +0000

Based on α-stable random projections with small α, we develop a simple algorithm for compressed sensing (sparse signal recovery) by utilizing only the signs (i.e., 1-bit) of the measurements. Using only 1-bit information of the measurements results in substantial cost reduction in collection, storage, communication, and decoding for compressed sensing. The proposed algorithm is efficient in that the decoding procedure requires only one scan of the coordinates. Our analysis can precisely show that, for a K-sparse signal of length N, 12.3K\log N/δmeasurements (where δis the confidence) would be sufficient for recovering the support and the signs of the signal. While the method is highly robust against typical measurement noises, we also provide the analysis of the scheme under random flipping of the signs of measurements. Compared to the well-known work on 1-bit marginal regression (which can also be viewed as a one-scan method), the proposed algorithm requires orders of magnitude fewer measurements. Compared to 1-bit Iterative Hard Thresholding (IHT) (which is not a one-scan algorithm), our method is still significantly more accurate. Furthermore, the proposed method is reasonably robust against random sign flipping while IHT is known to be very sensitive to this type of noise.

Efficient Sampling for k-Determinantal Point Processes

Mon, 02 May 2016 00:00:00 +0000

Determinantal Point Processes (DPPs) are elegant probabilistic models of repulsion and diversity over discrete sets of items. But their applicability to large sets is hindered by expensive cubic-complexity matrix operations for basic tasks such as sampling. In light of this, we propose a new method for approximate sampling from discrete k-DPPs. Our method takes advantage of the diversity property of subsets sampled from a DPP, and proceeds in two stages: first it constructs coresets for the ground set of items; thereafter, it efficiently samples subsets based on the constructed coresets. As opposed to previous approaches, our algorithm aims to minimize the total variation distance to the original distribution. Experiments on both synthetic and real datasets indicate that our sampling algorithm works efficiently on large data sets, and yields more accurate samples than previous approaches.

High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models

Mon, 02 May 2016 00:00:00 +0000

Bayesian Optimization (BO) is commonly used to optimize blackbox objective functions which are expensive to evaluate. A common approach is based on using Gaussian Process (GP) to model the objective function. Applying GP to higher dimensional settings is generally difficult due to the curse of dimensionality for nonparametric regression. Existing works makes strong assumptions such as the function is low-dimensional embedding (Wang et al., 2013) or is axis-aligned additive (Kandasamy et al., 2015). In this pa- per, we generalize the existing assumption to a projected-additive assumption. Our generalization provides the benefits of i) greatly increasing the space of functions that can be modeled by our approach, which covers the previous works (Wang et al., 2013; Kandasamy et al., 2015) as special cases, and ii) efficiently handling the learning in a larger model space. We prove that the regret for projected-additive functions has only linear dependence on the number of dimensions in this general setting. Directly using projected-additive GP (Gilboa et al., 2013) to BO results in a non-box constraint, which is not easy to optimize. We tackle this problem by proposing a restricted-projection-pursuit GP for BO. We conduct experiments on synthetic examples and scientific and hyper-parameter tuning tasks in many cases. Our method outperforms existing approaches even when the function does not meet the projected additive assumption. Last, we study the validity of the additive and projected-additive assumption in practice.

Scalable MCMC for Mixed Membership Stochastic Blockmodels

Mon, 02 May 2016 00:00:00 +0000

We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in mixed-membership stochastic blockmodels (MMSB). Our algorithm is based on the stochastic gradient Riemannian Langevin sampler and achieves both faster speed and higher accuracy at every iteration than the current state-of-the-art algorithm based on stochastic variational inference. In addition we develop an approximation that can handle models that entertain a very large number of communities. The experimental results show that SG-MCMC strictly dominates competing algorithms in all cases.

An Improved Convergence Analysis of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization

Mon, 02 May 2016 00:00:00 +0000

The cyclic block coordinate descent-type (CBCD-type) methods have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that the CBCD-type methods attain iteration complexity of O(p⋅\log(1/ε)), where εis a pre-specified accuracy of the objective value, and p is the number of blocks. However, such iteration complexity explicitly depends on p, and therefore is at least p times worse than those of gradient descent methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity of the CBCD-type methods matches that of the GD methods in term of dependency on p (up to a \log^2 p factor). Thus our complexity bounds are sharper than the existing bounds by at least a factor of p/\log^2p. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a \log^2 p factor) if the largest and smallest eigenvalues of the Hessian matrix do not scale with p. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones.

Rivalry of Two Families of Algorithms for Memory-Restricted Streaming PCA

Mon, 02 May 2016 00:00:00 +0000

We study the problem of recovering the subspace spanned by the first k principal components of d-dimensional data under the streaming setting, with a memory bound of O(kd). Two families of algorithms are known for this problem. The first family is based on the framework of stochastic gradient descent. Nevertheless, the convergence rate of the family can be seriously affected by the learning rate of the descent steps and deserves more serious study. The second family is based on the power method over blocks of data, but setting the block size for its existing algorithms is not an easy task. In this paper, we analyze the convergence rate of a representative algorithm with decayed learning rate (Oja and Karhunen, 1985) in the first family for the general k>1 case. Moreover, we propose a novel algorithm for the second family that sets the block sizes automatically and dynamically with faster convergence rate. We then conduct empirical studies that fairly compare the two families on real-world data. The studies reveal the advantages and disadvantages of these two families.

A Lasso-based Sparse Knowledge Gradient Policy for Sequential Optimal Learning

Mon, 02 May 2016 00:00:00 +0000

We propose a sequential learning policy for noisy discrete global optimization and ranking and selection (R&S) problems with high dimensional sparse belief functions, where there are hundreds or even thousands of features, but only a small portion of these features contain explanatory power. Our problem setting, motivated by the experimental sciences, arises where we have to choose which experiment to run next. Here the experiments are time-consuming and expensive. We derive a sparse knowledge gradient (SpKG) decision policy based on the \ell_1-penalized regression Lasso to identify the sparsity pattern before our budget is exhausted. This policy is a unique and novel hybrid of Bayesian R&S with a frequentist learning approach. Theoretically, we provide the error bound of the posterior mean estimate, which has shown to be at the minimax optimal \sqrts\log p/n rate. Controlled experiments on both synthetic data and real application for automatically designing experiments to identify the structure of an RNA molecule show that the algorithm efficiently learns the correct set of nonzero parameters. It also outperforms several other learning policies.

Fast Saddle-Point Algorithm for Generalized Dantzig Selector and FDR Control with Ordered L1-Norm

Mon, 02 May 2016 00:00:00 +0000

In this paper we propose a primal-dual proximal extragradient algorithm to solve the generalized Dantzig selector (GDS) estimation problem, based on a new convex-concave saddle-point (SP) reformulation. Our new formulation makes it possible to adopt recent developments in saddle-point optimization, to achieve the optimal O(1/k) rate of convergence. Compared to the optimal non-SP algorithms, ours do not require specification of sensitive parameters that affect algorithm performance or solution quality. We also provide a new analysis showing a possibility of local acceleration to achieve the rate of O(1/k^2) in special cases even without strong convexity or strong smoothness. As an application, we propose a GDS equipped with the ordered \ell_1-norm, showing its false discovery rate control properties in variable selection. Algorithm performance is compared between ours and other alternatives, including the linearized ADMM, Nesterov’s smoothing, Nemirovski’s mirror-prox, and the accelerated hybrid proximal extragradient techniques.

Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree

Mon, 02 May 2016 00:00:00 +0000

We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.

Nonparametric Budgeted Stochastic Gradient Descent

Mon, 02 May 2016 00:00:00 +0000

One of the most challenging problems in kernel online learning is to bound the model size. Budgeted kernel online learning addresses this issue by bounding the model size to a predefined budget. However, determining an appropriate value for such predefined budget is arduous. In this paper, we propose the Nonparametric Budgeted Stochastic Gradient Descent that allows the model size to automatically grow with data in a principled way. We provide theoretical analysis to show that our framework is guaranteed to converge for a large collection of loss functions (e.g. Hinge, Logistic, L2, L1, and \varepsilon-insensitive) which enables the proposed algorithm to perform both classification and regression tasks without hurting the ideal convergence rate O\left(\frac1T\right) of the standard Stochastic Gradient Descent. We validate our algorithm on the real-world datasets to consolidate the theoretical claims.

Cut Pursuit: Fast Algorithms to Learn Piecewise Constant Functions

Mon, 02 May 2016 00:00:00 +0000

We propose working-set/greedy algorithms to efficiently solve problems penalized respectively by the total variation and the Mumford Shah boundary size when the piecewise constant solutions has a small number of levelsets. Our algorithms exploit this structure by recursively splitting the level-sets using graph cuts. We obtain significant speed up on images that can be approximated with few levelsets compared to state-of-the-art algorithms.

Mondrian Forests for Large-Scale Regression when Uncertainty Matters

Mon, 02 May 2016 00:00:00 +0000

Many real-world regression problems demand a measure of the uncertainty associated with each prediction. Standard decision forests deliver efficient state-of-the-art predictive performance, but high-quality uncertainty estimates are lacking. Gaussian processes (GPs) deliver uncertainty estimates, but scaling GPs to large-scale data sets comes at the cost of approximating the uncertainty estimates. We extend Mondrian forests, first proposed by Lakshminarayanan et al. (2014) for classification problems, to the large-scale nonparametric regression setting. Using a novel hierarchical Gaussian prior that dovetails with the Mondrian forest framework, we obtain principled uncertainty estimates, while still retaining the computational advantages of decision forests. Through a combination of illustrative examples, real-world large-scale datasets and Bayesian optimization benchmarks, we demonstrate that Mondrian forests outperform approximate GPs on large-scale regression tasks and deliver better-calibrated uncertainty assessments than decision-forest-based methods.

Convex Block-sparse Linear Regression with Expanders – Provably

Mon, 02 May 2016 00:00:00 +0000

Sparse matrices are favorable objects in machine learning and optimization. When such matrices are used, in place of dense ones, the overall complexity requirements in optimization can be significantly reduced in practice, both in terms of space and run-time. Prompted by this observation, we study a convex optimization scheme for block-sparse recovery from linear measurements. To obtain linear sketches, we use expander matrices, i.e., sparse matrices containing only few non-zeros per column. Hitherto, to the best of our knowledge, such algorithmic solutions have been only studied from a non-convex perspective. Our aim here is to theoretically characterize the performance of convex approaches under such setting. Our key novelty is the expression of the recovery error in terms of the model-based norm, while assuring that solution lives in the model. To achieve this, we show that sparse model-based matrices satisfy a group version of the null-space property. Our experimental findings on synthetic and real applications support our claims for faster recovery in the convex setting – as opposed to using dense sensing matrices, while showing a competitive recovery performance.

Private Causal Inference

Mon, 02 May 2016 00:00:00 +0000

Causal inference deals with identifying which random variables ”cause” or control other random variables. Recent advances on the topic of causal inference based on tools from statistical estimation and machine learning have resulted in practical algorithms for causal inference. Causal inference has the potential to have significant impact on medical research, prevention and control of diseases, and identifying factors that impact economic changes to name just a few. However, these promising applications for causal inference are often ones that involve sensitive or personal data of users that need to be kept private (e.g., medical records, personal finances, etc). Therefore, there is a need for the development of causal inference methods that preserve data privacy. We study the problem of inferring causality using the current, popular causal inference framework, the additive noise model (ANM) while simultaneously ensuring privacy of the users. Our framework provides differential privacy guarantees for a variety of ANM variants. We run extensive experiments, and demonstrate that our techniques are practical and easy to implement.

Online Learning with Noisy Side Observations

Mon, 02 May 2016 00:00:00 +0000

We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of O(\sqrt(α^* T) after T rounds, where α^* is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of alpha^*. For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor & Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.

Bayesian Markov Blanket Estimation

Mon, 02 May 2016 00:00:00 +0000

This paper considers a Bayesian view for estimating the Markov blanket of a set of query variables, where the set of potential neighbours here is big. We factorize the posterior such that the Markov blanket is conditionally independent of the network of the potential neighbours. By exploiting this blockwise decoupling, we derive analytic expressions for posterior conditionals. Subsequently, we develop an inference scheme, which makes use of the factorization. As a result, estimation of a sub-network is possible without inferring an entire network. Since the resulting Gibbs sampler scales linearly with the number of variables, it can handle relatively large neighbourhoods. The proposed scheme results in faster convergence and superior mixing of the Markov chain than existing Bayesian network estimation techniques.

How to Learn a Graph from Smooth Signals

Mon, 02 May 2016 00:00:00 +0000

We propose a framework to learn the graph structure underlying a set of smooth signals. Given X∈\mathbbR^m\times n whose rows reside on the vertices of an unknown graph, we learn the edge weights w∈\mathbbR_+^m(m-1)/2 under the smoothness assumption that \rm trX^⊤LX is small, where L is the graph Laplacian. We show that the problem is a weighted \ell-1 minimization that leads to naturally sparse solutions. We prove that the standard graph construction with Gaussian weights w_ij = \exp(-\frac1σ^2\|x_i-x_j\|^2) and the previous state of the art are special cases of our framework. We propose a new model and present efficient, scalable primal-dual based algorithms both for this and the previous state of the art, to evaluate their performance on artificial and real data. The new model performs best in most settings.

Parallel Majorization Minimization with Dynamically Restricted Domains for Nonconvex Optimization

Mon, 02 May 2016 00:00:00 +0000

We propose an optimization framework for nonconvex problems based on majorization-minimization that is particularity well-suited for parallel computing. It reduces the optimization of a high dimensional nonconvex objective function to successive optimizations of locally tight and convex upper bounds which are additively separable into low dimensional objectives. The original problem is then broken into simpler and parallel tasks, while guaranteeing the monotonic reduction of the original objective function and convergence to a local minimum. This framework also allows one to restrict the upper bound to a local dynamic convex domain, so that the bound is better matched to the local curvature of the objective function, resulting in accelerated convergence. We test the proposed framework on a nonconvex support vector machine based on a sigmoid loss function and on nonconvex penalized logistic regression.

Top Arm Identification in Multi-Armed Bandits with Batch Arm Pulls

Mon, 02 May 2016 00:00:00 +0000

We introduce a new multi-armed bandit (MAB) problem in which arms must be sampled in batches, rather than one at a time. This is motivated by applications in social media monitoring and biological experimentation where such batch constraints naturally arise. This paper develops and analyzes algorithms for batch MABs and top arm identification, for both fixed confidence and fixed budget settings. Our main theoretical results show that the batch constraint does not significantly affect the sample complexity of top arm identification compared to unconstrained MAB algorithms. Alternatively, if one views a batch as the fundamental sampling unit, then the results can be interpreted as showing that the sample complexity of batch MABs can be significantly less than traditional MABs. We demonstrate the new batch MAB algorithms with simulations and in two interesting real-world applications: (i) microwell array experiments for identifying genes that are important in virus replication and (ii) finding the most active users in Twitter on a specific topic.

Non-stochastic Best Arm Identification and Hyperparameter Optimization

Mon, 02 May 2016 00:00:00 +0000

Motivated by the task of hyperparameter optimization, we introduce the \em non-stochastic best-arm identification problem. We identify an attractive algorithm for this setting that makes no assumptions on the convergence behavior of the arms’ losses, has no free-parameters to adjust, provably outperforms the uniform allocation baseline in favorable conditions, and performs comparably (up to \log factors) otherwise. Next, by leveraging the iterative nature of many learning algorithms, we cast hyperparameter optimization as an instance of non-stochastic best-arm identification. Our empirical results show that, by allocating more resources to promising hyperparameter settings, our approach achieves comparable test accuracies an order of magnitude faster than the uniform strategy. The robustness and simplicity of our approach makes it well-suited to ultimately replace the uniform strategy currently used in most machine learning software packages.

Unsupervised Ensemble Learning with Dependent Classifiers

Mon, 02 May 2016 00:00:00 +0000

In unsupervised ensemble learning, one obtains predictions from multiple sources or classifiers, yet without knowing the reliability and expertise of each source, and with no labeled data to assess it. The task is to combine these possibly conﬂicting predictions into an accurate meta-learner. Most works to date assumed perfect diversity between the different sources, a property known as conditional independence. In realistic scenarios, however, this assumption is often violated, and ensemble learners based on it can be severely sub-optimal. The key challenges we address in this paper are: (i) how to detect, in an unsupervised manner, strong violations of conditional independence; and (ii) construct a suitable meta-learner. To this end we introduce a statistical model that allows for dependencies between classifiers. Based on this model, we develop novel unsupervised methods to detect strongly dependent classifiers, better estimate their accuracies, and construct an improved meta-learner. Using both artificial and real datasets, we showcase the importance of taking classifier dependencies into account and the competitive performance of our approach.

Topic-Based Embeddings for Learning from Large Knowledge Graphs

Mon, 02 May 2016 00:00:00 +0000

We present a scalable probabilistic framework for learning from multi-relational data given in form of entity-relation-entity triplets, with a potentially massive number of entities and relations (e.g., in multi-relational networks, knowledge bases, etc.). We define each triplet via a relation-specific bilinear function of the embeddings of entities associated with it (these embeddings correspond to “topics”). To handle massive number of relations and the data sparsity problem (very few observations per relation), we also extend this model to allow sharing of parameters across relations, which leads to a substantial reduction in the number of parameters to be learned. In addition to yielding excellent predictive performance (e.g., for knowledge base completion tasks), the interpretability of our topic-based embedding framework enables easy qualitative analyses. Computational cost of our models scales in the number of positive triplets, which makes it easy to scale to massive real-world multi-relational data sets, which are usually extremely sparse. We develop simple-to-implement batch as well as online Gibbs sampling algorithms and demonstrate the effectiveness of our models on tasks such as multi-relational link-prediction, and learning from large knowledge bases.

Non-negative Matrix Factorization for Discrete Data with Hierarchical Side-Information

Mon, 02 May 2016 00:00:00 +0000

We present a probabilistic framework for efficient non-negative matrix factorization of discrete (count/binary) data with side-information. The side-information is given as a multi-level structure, taxonomy, or ontology, with nodes at each level being categorical-valued observations. For example, when modeling documents with a two-level side-information (documents being at level-zero), level-one may represent (one or more) authors associated with each document and level-two may represent affiliations of each author. The model easily generalizes to more than two levels (or taxonomy/ontology of arbitrary depth). Our model can learn embeddings of entities present at each level in the data/side-information hierarchy (e.g., documents, authors, affiliations, in the previous example), with appropriate sharing of information across levels. The model also enjoys full local conjugacy, facilitating efficient Gibbs sampling for model inference. Inference cost scales in the number of non-zero entries in the data matrix, which is especially appealing for real-world massive but sparse matrices. We demonstrate the effectiveness of the model on several real-world data sets.

(Bandit) Convex Optimization with Biased Noisy Gradient Oracles

Mon, 02 May 2016 00:00:00 +0000

A popular class of algorithms for convex optimization and online learning with bandit feedback rely on constructing noisy gradient estimates, which are then used in place of the actual gradients in appropriately adjusted first-order algorithms. Depending on the properties of the function to be optimized and the nature of “noise” in the bandit feedback, the bias and variance of gradient estimates exhibit various tradeoffs. In this paper we propose a novel framework that replaces the specific gradient estimation methods with an abstract oracle model. With the help of the new framework we unify previous works, reproducing their results in a clean and concise fashion, while, perhaps more importantly, the framework also allows us to formally show that to achieve the optimal root-n rate either the algorithms that use existing gradient estimators, or the proofs have to go beyond what exists today.

Bayes-Optimal Effort Allocation in Crowdsourcing: Bounds and Index Policies

Mon, 02 May 2016 00:00:00 +0000

We consider effort allocation in crowdsourcing, where we wish to assign labeling tasks to imperfect homogeneous crowd workers to maximize overall accuracy in a continuous-time Bayesian setting, subject to budget and time constraints. The Bayes-optimal policy for this problem is the solution to a partially observable Markov decision process, but the curse of dimensionality renders the computation infeasible. Following a similar approach to the Lagrangian Relaxation technique in Adelman and Mersereau (2008), we provide a computationally tractable instance-specific upper bound on the value of this Bayes-optimal policy, which can in turn be used to bound the optimality gap of any other sub-optimal policy. In an approach similar in spirit to the Whittle index for restless multi-armed bandits, we provide an index policy for effort allocation in crowdsourcing and demonstrate numerically that it outperforms other state-of-the-art policies and performs close to optimal.

Tight Variational Bounds via Random Projections and I-Projections

Mon, 02 May 2016 00:00:00 +0000

Information projections are the key building block of variational inference algorithms and are used to approximate a target probabilistic model by projecting it onto a family of tractable distributions. In general, there is no guarantee on the quality of the approximation obtained. To overcome this issue, we introduce a new class of random projections to reduce the dimensionality and hence the complexity of the original model. In the spirit of random projections, the projection preserves (with high probability) key properties of the target distribution. We show that information projections can be combined with random projections to obtain provable guarantees on the quality of the approximation obtained, regardless of the complexity of the original model. We demonstrate empirically that augmenting mean field with a random projection step dramatically improves partition function and marginal probability estimates, both on synthetic and real world data.

Scalable Gaussian Process Classification via Expectation Propagation

Mon, 02 May 2016 00:00:00 +0000

Variational methods have been recently considered for scaling the training process of Gaussian process classifiers to large datasets. As an alternative, we describe here how to train these classifiers efficiently using expectation propagation (EP). The proposed EP method allows to train Gaussian process classifiers on very large datasets, with millions of instances, that were out of the reach of previous implementations of EP. More precisely, it can be used for (i) training in a distributed fashion where the data instances are sent to different nodes in which the required computations are carried out, and for (ii) maximizing an estimate of the marginal likelihood using a stochastic approximation of the gradient. Several experiments involving large datasets show that the method described is competitive with the variational approach.

Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics

Mon, 02 May 2016 00:00:00 +0000

Inverse Reinforcement Learning (IRL) describes the problem of learning an unknown reward function of a Markov Decision Process (MDP) from observed behavior of an agent. Since the agent’s behavior originates in its policy and MDP policies depend on both the stochastic system dynamics as well as the reward function, the solution of the inverse problem is significantly influenced by both. Current IRL approaches assume that if the transition model is unknown, additional samples from the system’s dynamics are accessible, or the observed behavior provides enough samples of the system’s dynamics to solve the inverse problem accurately. These assumptions are often not satisfied. To overcome this, we present a gradient-based IRL approach that simultaneously estimates the system’s dynamics. By solving the combined optimization problem, our approach takes into account the bias of the demonstrations, which stems from the generating policy. The evaluation on a synthetic MDP and a transfer learning task shows improvements regarding the sample efficiency as well as the accuracy of the estimated reward functions and transition models.

Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces

Mon, 02 May 2016 00:00:00 +0000

We present a scalable Gaussian process model for identifying and characterizing smooth multidimensional changepoints, and automatically learning changes in expressive covariance structure. We use Random Kitchen Sink features to flexibly define a change surface in combination with expressive spectral mixture kernels to capture the complex statistical structure. Finally, through the use of novel methods for additive non-separable kernels, we can scale the model to large datasets. We demonstrate the model on numerical and real world data, including a large spatio-temporal disease dataset where we identify previously unknown heterogeneous changes in space and time.

DUAL-LOCO: Distributing Statistical Estimation Using Random Projections

Mon, 02 May 2016 00:00:00 +0000

We present DUAL-LOCO, a communication-efficient algorithm for distributed statistical estimation. DUAL-LOCO assumes that the data is distributed across workers according to the features rather than the samples. It requires only a single round of communication where low-dimensional random projections are used to approximate the dependencies between features available to different workers. We show that DUAL-LOCO has bounded approximation error which only depends weakly on the number of workers. We compare DUAL-LOCO against a state-of-the-art distributed optimization method on a variety of real world datasets and show that it obtains better speedups while retaining good accuracy. In particular, DUAL-LOCO allows for fast cross validation as only part of the algorithm depends on the regularization parameter.

Non-Stationary Gaussian Process Regression with Hamiltonian Monte Carlo

Mon, 02 May 2016 00:00:00 +0000

We present a novel approach for non-stationary Gaussian process regression (GPR), where the three key parameters – noise variance, signal variance and lengthscale – can be simultaneously input-dependent. We develop gradient-based inference methods to learn the unknown function and the non-stationary model parameters, without requiring any model approximations. For inferring the full posterior distribution we use Hamiltonian Monte Carlo (HMC), which conveniently extends the analytical gradient-based GPR learning by guiding the sampling with the gradients. The MAP solution can also be learned with gradient ascent. In experiments on several synthetic datasets and in modelling of temporal gene expression, the non-stationary GPR is shown to give major improvement when modeling realistic input-dependent dynamics.

Improper Deep Kernels

Mon, 02 May 2016 00:00:00 +0000

Neural networks have recently re-emerged as a powerful hypothesis class, yielding impressive classification accuracy in multiple domains. However, their training is a non convex optimization problem. Here we address this difficulty by turning to "improper learning" of neural nets. In other words, we learn a classifier that is not a neural net but is competitive with the best neural net model given a sufficient number of training examples. Our approach relies on a novel kernel which integrates over the set of possible neural models. It turns out that the corresponding integral can be evaluated in closed form via a simple recursion. The learning problem is then an SVM with this kernel, and a global optimum can thus be found efficiently. We also provide sample complexity results which depend on the stability of the optimal neural net.

Dreaming More Data: Class-dependent Distributions over Diffeomorphisms for Learned Data Augmentation

Mon, 02 May 2016 00:00:00 +0000

Data augmentation is a key element in training high-dimensional models. In this approach, one synthesizes new observations by applying pre-specified transformations to the original training data; e.g. new images are formed by rotating old ones. Current augmentation schemes, however, rely on manual specification of the applied transformations, making data augmentation an implicit form of feature engineering. With an eye towards true end-to-end learning, we suggest learning the applied transformations on a per-class basis. Particularly, we align image pairs within each class under the assumption that the spatial transformation between images belongs to a large class of diffeomorphisms. We then learn a class-specific probabilistic generative models of the transformations in a Riemannian submanifold of the Lie group of diffeomorphisms. We demonstrate significant performance improvements in training deep neural nets over manually-specified augmentation schemes. Our code and augmented datasets are available online.

Variational Gaussian Copula Inference

Mon, 02 May 2016 00:00:00 +0000

We utilize copulas to constitute a unified framework for constructing and optimizing variational proposals in hierarchical Bayesian models. For models with continuous and non-Gaussian hidden variables, we propose a semiparametric and automated variational Gaussian copula approach, in which the parametric Gaussian copula family is able to preserve multivariate posterior dependence, and the nonparametric transformations based on Bernstein polynomials provide ample flexibility in characterizing the univariate marginal posteriors.

A PAC RL Algorithm for Episodic POMDPs

Mon, 02 May 2016 00:00:00 +0000

Many interesting real world domains involve reinforcement learning (RL) in partially observable environments. Efficient learning in such domains is important, but existing sample complexity bounds for partially observable RL are at least exponential in the episode length. We give, to our knowledge, the first partially observable RL algorithm with a polynomial bound on the number of episodes on which the algorithm may not achieve near-optimal performance. Our algorithm is suitable for an important class of episodic POMDPs. Our approach builds on recent advances in the method of moments for latent variable model estimation.

Quantization based Fast Inner Product Search

Mon, 02 May 2016 00:00:00 +0000

We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). Each database vector is quantized in multiple subspaces via a set of codebooks, learned directly by minimizing the inner product quantization error. Then, the inner product of a query to a database vector is approximated as the sum of inner products with the subspace quantizers. Different from recently proposed LSH approaches to MIPS, the database vectors and queries do not need to be augmented in a higher dimensional feature space. We also provide a theoretical analysis of the proposed approach, consisting of the concentration results under mild assumptions. Furthermore, if a small set of held-out queries is given at the training time, we propose a modified codebook learning procedure which further improves the accuracy. Experimental results on a variety of datasets including those arising from deep neural networks show that the proposed approach significantly outperforms the existing state-of-the-art.

Universal Models of Multivariate Temporal Point Processes

Mon, 02 May 2016 00:00:00 +0000

With the rapidly increasing availability of event stream data there is growing interest in multivariate temporal point process models to capture both qualitative and quantitative features of this type of data. Recent research on multivariate point processes have focused in inference and estimation problems for restricted classes of models such as continuous time Bayesian networks, Markov jump processes, Gaussian Cox processes, and Hawkes Processes. In this paper, we study the expressive power and learnability of Graphical Event Models (GEMs) – the analogue of directed graphical models for multivariate temporal point processes. In particular, we describe a set of Graphical Event Models (GEMs) and show that this class can universally approximate any smooth multivariate temporal point process. We also describe a universal learning algorithm for this class of GEMs and show, under a mild set of assumptions, learnability results for both the dependency structures and distributions in this class. Our consistency results demonstrate the possibility of learning about both qualitative and quantitative dependencies from rich event stream data.

Low-Rank and Sparse Structure Pursuit via Alternating Minimization

Mon, 02 May 2016 00:00:00 +0000

In this paper, we present a nonconvex alternating minimization optimization algorithm for low-rank and sparse structure pursuit. Compared with convex relaxation based methods, the proposed algorithm is computationally more efficient for large scale problems. In our study, we define a notion of bounded difference of gradients, based on which we rigorously prove that with suitable initialization, the proposed nonconvex optimization algorithm enjoys linear convergence to the global optima and exactly recovers the underlying low rank and sparse matrices under standard conditions such as incoherence and sparsity conditions. For a wide range of statistical models such as multi-task learning and robust principal component analysis (RPCA), our algorithm provides a principled approach to learning the low rank and sparse structures with provable guarantee. Thorough experiments on both synthetic and real datasets backup our theory.

Nearly Optimal Classification for Semimetrics

Mon, 02 May 2016 00:00:00 +0000

We initiate the rigorous study of classification in semimetric spaces, which are point sets with a distance function that is non-negative and symmetric, but need not satisfy the triangle inequality. We define the \em density dimension \dens and discover that it plays a central role in the statistical and algorithmic feasibility of learning in semimetric spaces. We compute this quantity for several widely used semimetrics and present nearly optimal sample compression algorithms, which are then used to obtain generalization guarantees, including fast rates. Our claim of near-optimality holds in both computational and statistical senses. When the sample has radius R and margin γ, we show that it can be compressed down to roughly d=(R/γ)^\dens points, and further that finding a significantly better compression is algorithmically intractable unless P=NP. This compression implies generalization via standard Occam-type arguments, to which we provide a nearly matching lower bound.

GLASSES: Relieving The Myopia Of Bayesian Optimisation

Mon, 02 May 2016 00:00:00 +0000

We present GLASSES: Global optimisation with Look-Ahead through Stochastic Simulation and Expected-loss Search. The majority of global optimisation approaches in use are myopic, in only considering the impact of the next function value; the non-myopic approaches that do exist are able to consider only a handful of future evaluations. Our novel algorithm, GLASSES, permits the consideration of dozens of evaluations into the future. This is done by approximating the ideal look-ahead loss function, which is expensive to evaluate, by a cheaper alternative in which the future steps of the algorithm are simulated beforehand. An Expectation Propagation algorithm is used to compute the expected value of the loss. We show that the far-horizon planning thus enabled leads to substantive performance gains in empirical tests.

Batch Bayesian Optimization via Local Penalization

Mon, 02 May 2016 00:00:00 +0000

The popularity of Bayesian optimization methods for efficient exploration of parameter spaces has lead to a series of papers applying Gaussian processes as surrogates in the optimization of functions. However, most proposed approaches only allow the exploration of the parameter space to occur sequentially. Often, it is desirable to simultaneously propose batches of parameter values to explore. This is particularly the case when large parallel processing facilities are available. These could either be computational or physical facets of the process being optimized. Batch methods, however, require the modeling of the interaction between the different evaluations in the batch, which can be expensive in complex scenarios. We investigate this issue and propose a highly effective heuristic based on an estimate of the function’s Lipschitz constant that captures the most important aspect of this interaction–local repulsion–at negligible computational overhead. A penalized acquisition function is used to collect batches of points minimizing the non-parallelizable computational effort. The resulting algorithm compares very well, in run-time, with much more elaborate alternatives.

Unwrapping ADMM: Efficient Distributed Computing via Transpose Reduction

Mon, 02 May 2016 00:00:00 +0000

Recent approaches to distributed model fitting rely heavily on consensus ADMM, where each node solves small sub-problems using only local data. We propose iterative methods that solve global sub-problems over an entire distributed dataset. This is possible using transpose reduction strategies that allow a single node to solve least-squares over massive datasets without putting all the data in one place. This results in simple iterative methods that avoid the expensive inner loops required for consensus methods. We analyze the convergence rates of the proposed schemes and demonstrate the efficiency of this approach by fitting linear classifiers and sparse linear models to large datasets using a distributed implementation with up to 20,000 cores in far less time than previous approaches.

Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking

Mon, 02 May 2016 00:00:00 +0000

Extremes play a special role in Anomaly Detection. Beyond inference and simulation purposes, probabilistic tools borrowed from Extreme Value Theory (EVT), such as the \textitangular measure, can also be used to design novel statistical learning methods for Anomaly Detection/ranking. This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space with respect to their degree of ‘abnormality’. The procedure relies on an original dimension-reduction technique in the extreme domain that possibly produces a sparse representation of multivariate extremes and allows to gain insight into the dependence structure thereof, escaping the curse of dimensionality. The representation output by the unsupervised methodology we propose here can be combined with any Anomaly Detection technique tailored to non-extreme data. As it performs linearly with the dimension and almost linearly in the data (in O(d n \log n)), it fits to large scale problems. The approach in this paper is novel in that EVT has never been used in its multivariate version in the field of Anomaly Detection. Illustrative experimental results provide strong empirical evidence of the relevance of our approach.

Streaming Kernel Principal Component Analysis

Mon, 02 May 2016 00:00:00 +0000

Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full n x n kernel matrix is constructed over n data points, but this requires too much space and time for large values of n. Techniques such as the Nystrom method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and also improves the dependence on the error parameter. Our technique combines together random feature maps with recent advances in matrix sketching, it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.

CRAFT: ClusteR-specific Assorted Feature selecTion

Mon, 02 May 2016 00:00:00 +0000

We present a hierarchical Bayesian framework for clustering with cluster-specific feature selection. We derive a simplified model, CRAFT, by analyzing the asymptotic behavior of the log posterior formulations in a nonparametric MAP-based clustering setting in this framework. CRAFT handles assorted data, i.e., both numeric and categorical data, and the underlying objective functions are intuitively appealing. The resulting algorithm is simple to implement and scales nicely, requires minimal parameter tuning, obviates the need to specify the number of clusters a priori, and compares favorably with other state-of-the-art methods on real datasets. We also provide empirical evidence on carefully designed synthetic data sets to highlight the robustness of the algorithm to recover the underlying feature subspaces, even when the average dimensionality of the features across clusters is misspecified. Besides, the framework seamlessly allows for multiple views of clustering by interpolating between the two extremes of cluster-specific feature selection and global selection, and recovers the DP-means objective under the degenerate setting of clustering without feature selection.

Improved Learning Complexity in Combinatorial Pure Exploration Bandits

Mon, 02 May 2016 00:00:00 +0000

We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. We first construct a new measure of complexity that provably characterizes the learning performance of the algorithms we propose for the fixed confidence and the fixed budget setting. We show that this complexity is never higher than the one in existing work and illustrate a number of configurations in which it can be significantly smaller. While in general this improvement comes at the cost of increased computational complexity, we provide a series of examples, including a planning problem, where this extra cost is not significant.

Bayesian Generalised Ensemble Markov Chain Monte Carlo

Mon, 02 May 2016 00:00:00 +0000

Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimate the partition function and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and making the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.

Loss Bounds and Time Complexity for Speed Priors

Mon, 02 May 2016 00:00:00 +0000

This paper establishes for the first time the predictive performance of speed priors and their computational complexity. A speed prior is essentially a probability distribution that puts low probability on strings that are not efficiently computable. We propose a variant to the original speed prior (Schmidhuber, 2002), and show that our prior can predict sequences drawn from probability measures that are estimable in polynomial time. Our speed prior is computable in doubly-exponential time, but not in polynomial time. On a polynomial time computable sequence our speed prior is computable in exponential time. We show that Schmidhuber’s speed prior has better complexity under the same conditions; however, the question of its predictive properties remains open.

Ordered Weighted L1 Regularized Regression with Strongly Correlated Covariates: Theoretical Aspects

Mon, 02 May 2016 00:00:00 +0000

This paper studies the ordered weighted L1 (OWL) family of regularizers for sparse linear regression with strongly correlated covariates. We prove sufficient conditions for clustering correlated covariates, extending and qualitatively strengthening previous results for a particular member of the OWL family: OSCAR (octagonal shrinkage and clustering algorithm for regression). We derive error bounds for OWL with correlated Gaussian covariates: for cases in which clusters of covariates are strongly (even perfectly) correlated, but covariates in different clusters are uncorrelated, we show that if the true p-dimensional signal involves only s clusters, then O(s \log p) samples suffice to accurately estimate it, regardless of the number of coefficients within the clusters. Since the estimation of s-sparse signals with completely independent covariates also requires O(s \log p) measurements, this shows that by using OWL regularization, we pay no price (in the number of measurements) for the presence of strongly correlated covariates.

Random Forest for the Contextual Bandit Problem

Mon, 02 May 2016 00:00:00 +0000

To address the contextual bandit problem, we propose an online random forest algorithm. The analysis of the proposed algorithm is based on the sample complexity needed to find the optimal decision stump. Then, the decision stumps are recursively stacked in a random collection of decision trees, BANDIT FOREST. We show that the proposed algorithm is optimal up to logarithmic factors. The dependence of the sample complexity upon the number of contextual variables is logarithmic. The computational cost of the proposed algorithm with respect to the time horizon is linear. These analytical results allow the proposed algorithm to be efficient in real applications , where the number of events to process is huge, and where we expect that some contextual variables, chosen from a large set, have potentially non-linear dependencies with the rewards. In the experiments done to illustrate the theoretical analysis, BANDIT FOREST obtain promising results in comparison with state-of-the-art algorithms.

Early Stopping as Nonparametric Variational Inference

Mon, 02 May 2016 00:00:00 +0000

We show that unconverged stochastic gradient descent can be interpreted as sampling from a nonparametric approximate posterior distribution. This distribution is implicitly defined by the transformation of an initial distribution by a sequence of optimization steps. By tracking the change in entropy of this distribution during optimization, we give a scalable, unbiased estimate of a variational lower bound on the log marginal likelihood. This bound can be used to optimize hyperparameters instead of cross-validation. This Bayesian interpretation of SGD also suggests new overfitting-resistant optimization procedures, and gives a theoretical foundation for early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.

Maximum Likelihood for Variance Estimation in High-Dimensional Linear Models

Mon, 02 May 2016 00:00:00 +0000

We study maximum likelihood estimators (MLEs) for the residual variance, the signal-to-noise ratio, and other variance parameters in high-dimensional linear models. These parameters are essential in many statistical applications involving regression diagnostics, inference, tuning parameter selection for high-dimensional regression, and other applications, including genetics. The estimators that we study are not new, and have been widely used for variance component estimation in linear random-effects models. However, our analysis is new and it implies that the MLEs, which were devised for random-effects models, may also perform very well in high-dimensional linear models with fixed-effects, which are more commonly studied in some areas of high-dimensional statistics. The MLEs are shown to be consistent and asymptotically normal in fixed-effects models with random design, in asymptotic settings where the number of predictors ($p$) is proportional to the number of observations ($n$). Moreover, the estimators’ asymptotic variance can be given explicitly in terms moments of the Marcenko-Pastur distribution. A variety of analytical and empirical results show that the MLEs outperform other, previously proposed estimators for variance parameters in high-dimensional linear models with fixed-effects. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications.

Active Learning Algorithms for Graphical Model Selection

Mon, 02 May 2016 00:00:00 +0000

The problem of learning the structure of a high dimensional graphical model from data has received considerable attention in recent years. In many applications such as sensor networks and proteomics it is often expensive to obtain samples from all the variables involved simultaneously. For instance, this might involve the synchronization of a large number of sensors or the tagging of a large number of proteins. To address this important issue, we initiate the study of a novel graphical model selection problem, where the goal is to optimize the total number of scalar samples obtained by allowing the collection of samples from only subsets of the variables. We propose a general paradigm for graphical model selection where feedback is used to guide the sampling to high degree vertices, while obtaining only few samples from the ones with the low degrees. We instantiate this framework with two specific active learning algorithms, one of which makes mild assumptions but is computationally expensive, while the other is computationally more efficient but requires stronger (nevertheless standard) assumptions. Whereas the sample complexity of passive algorithms is typically a function of the maximum degree of the graph, we show that the sample complexity of our algorithms is provably smaller and that it depends on a novel local complexity measure that is akin to the average degree of the graph. We finally demonstrate the efficacy of our framework via simulations.

Provable Bayesian Inference via Particle Mirror Descent

Mon, 02 May 2016 00:00:00 +0000

Bayesian methods are appealing in their flexibility in modeling complex data and ability in capturing uncertainty in parameters. However, when Bayes’ rule does not result in tractable closed-form, most approximate inference algorithms lack either scalability or rigorous guarantees. To tackle this challenge, we propose a simple yet provable algorithm, Particle Mirror Descent (PMD), to iteratively approximate the posterior density. PMD is inspired by stochastic functional mirror descent where one descends in the density space using a small batch of data points at each iteration, and by particle filtering where one uses samples to approximate a function. We prove result of the first kind that, with m particles, PMD provides a posterior density estimator that converges in terms of KL-divergence to the true posterior in rate O(1/\sqrtm). We demonstrate competitive empirical performances of PMD compared to several approximate inference algorithms in mixture models, logistic regression, sparse Gaussian processes and latent Dirichlet allocation on large scale datasets.

Simple and Scalable Constrained Clustering: a Generalized Spectral Method

Mon, 02 May 2016 00:00:00 +0000

We present a simple spectral approach to the well-studied constrained clustering problem. It captures constrained clustering as a generalized eigenvalue problem with graph Laplacians. The algorithm works in nearly-linear time and provides concrete guarantees for the quality of the clusters, at least for the case of 2-way partitioning. In practice this translates to a very fast implementation that consistently outperforms existing spectral approaches both in speed and quality.

Score Permutation Based Finite Sample Inference for Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) Models

Mon, 02 May 2016 00:00:00 +0000

A standard model of (conditional) heteroscedasticity, i.e., the phenomenon that the variance of a process changes over time, is the Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) model, which is especially important for economics and finance. GARCH models are typically estimated by the Quasi-Maximum Likelihood (QML) method, which works under mild statistical assumptions. Here, we suggest a finite sample approach, called ScoPe, to construct distribution-free confidence regions around the QML estimate, which have exact coverage probabilities, despite no additional assumptions about moments are made. ScoPe is inspired by the recently developed Sign-Perturbed Sums (SPS) method, which however cannot be applied in the GARCH case. ScoPe works by perturbing the score function using randomly permuted residuals. This produces alternative samples which lead to exact confidence regions. Experiments on simulated and stock market data are also presented, and ScoPe is compared with the asymptotic theory and bootstrap approaches.

Fast and Scalable Structural SVM with Slack Rescaling

Mon, 02 May 2016 00:00:00 +0000

We present an efficient method for training slack-rescaled structural SVM. Although finding the most violating label in a margin-rescaled formulation is often easy since the target function decomposes with respect to the structure, this is not the case for a slack-rescaled formulation, and finding the most violated label might be very difficult. Our core contribution is an efficient method for finding the most-violating-label in a slack-rescaled formulation, given an oracle that returns the most-violating-label in a (slightly modified) margin-rescaled formulation. We show that our method enables accurate and scalable training for slack-rescaled SVMs, reducing runtime by an order of magnitude compared to previous approaches to slack-rescaled SVMs.

Scalable and Sound Low-Rank Tensor Learning

Mon, 02 May 2016 00:00:00 +0000

Many real-world data arise naturally as tensors. Equipped with a low rank prior, learning algorithms can benefit from exploiting the rich dependency encoded in a tensor. Despite its prevalence in low-rank matrix learning, trace norm ceases to be tractable in tensors and therefore most existing works resort to matrix unfolding. Although some theoretical guarantees are available, these approaches may lose valuable structure information and are not scalable in general. To address this problem, we propose directly optimizing the tensor trace norm by approximating its dual spectral norm, and we show that the approximation bounds can be efficiently converted to the original problem via the generalized conditional gradient algorithm. The resulting approach is scalable to large datasets, and matches state-of-the-art recovery guarantees. Experimental results on tensor completion and multitask learning confirm the superiority of the proposed method.

Communication Efficient Distributed Agnostic Boosting

Mon, 02 May 2016 00:00:00 +0000

We consider the problem of learning from distributed data in the agnostic setting, i.e., in the presence of arbitrary forms of noise. Our main contribution is a general distributed boosting-based procedure for learning an arbitrary concept space, that is simultaneously noise tolerant, communication efficient, and computationally efficient. This improves significantly over prior works that were either communication efficient only in noise-free scenarios or computationally prohibitive. Empirical results on large synthetic and real-world datasets demonstrate the effectiveness and scalability of the proposed approach.

Robust Covariate Shift Regression

Mon, 02 May 2016 00:00:00 +0000

In many learning settings, the source data available to train a regression model differs from the target data it encounters when making predictions due to input distribution shift. Appropriately dealing with this situation remains an important challenge. Existing methods attempt to “reweight” the source data samples to better represent the target domain, but this introduces strong inductive biases that are highly extrapolative and can often err greatly in practice. We propose a robust approach for regression under covariate shift that embraces the uncertainty resulting from sample selection bias by producing regression models that are explicitly robust to it. We demonstrate the benefits of our approach on a number of regression tasks.

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Mon, 02 May 2016 00:00:00 +0000

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied. We explore this relationship by applying simulated annealing to an SG-MCMC algorithm. Furthermore, we extend recent SG-MCMC methods with two key components: i) adaptive preconditioners (as in ADAgrad or RMSprop), and ii) adaptive element-wise momentum weights. The zero-temperature limit gives a novel stochastic optimization method with adaptive element-wise momentum weights, while conventional optimization methods only have a shared, static momentum weight. Under certain assumptions, our theoretical analysis suggests the proposed simulated annealing approach converges close to the global optima. Experiments on several deep neural network models show state-of-the-art results compared to related stochastic optimization algorithms.

Enumerating Equivalence Classes of Bayesian Networks using EC Graphs

Mon, 02 May 2016 00:00:00 +0000

We consider the problem of learning Bayesian network structures from complete data. In particular, we consider the enumeration of their k-best equivalence classes. We propose a new search space for A* search, called the EC graph, that facilitates the enumeration of equivalence classes, by representing the space of completed, partially directed acyclic graphs. We also propose a canonization of this search space, called the EC tree, which further improves the efficiency of enumeration. Empirically, our approach is orders of magnitude more efficient than the state-of-the-art at enumerating equivalence classes.

Online Relative Entropy Policy Search using Reproducing Kernel Hilbert Space Embeddings

Mon, 02 May 2016 00:00:00 +0000

Kernel methods have been successfully applied to reinforcement learning problems to address some challenges such as high dimensional and continuous states, value function approximation and state transition probability modeling. In this paper, we develop an online policy search algorithm based on a recent state-of-the-art algorithm REPS-RKHS that uses conditional kernel embeddings. Our online algorithm inherits the advantages of REPS-RKHS, including the ability to learn non-parametric control policies for infinite horizon continuous MDPs with high- dimensional sensory representations. Different from the original REPS-RKHS algorithm which is based on batch learning, the proposed online algorithm updates the model in an online fashion and thus is able to capture and respond to rapid changes in the system dynamics. In addition, the online update operation takes constant time (i.e., independent of the sample size n), which is much more efficient computationally and allows the policy to be continuously revised. Experiments on different domains are conducted and results show that our online algorithm outperforms the original algorithm.

Online Learning to Rank with Feedback at the Top

Mon, 02 May 2016 00:00:00 +0000

We consider an online learning to rank setting in which, at each round, an oblivious adversary generates a list of m documents, pertaining to a query, and the learner ranks the documents according to assigned scores. The adversary then generates a relevance vector and the learner updates its ranker according to the feedback received. We consider the setting where the feedback is restricted to be the relevance levels of only the top k documents in the ranked list, for k ≪m. However, the performance of learner is judged based on the unrevealed full relevance vectors, using an appropriate learning to rank loss function. We develop efficient algorithms for well known losses in the pointwise, pairwise and listwise families. We also prove that no online algorithm can have sublinear regret, with top 1 feedback, for any loss that is calibrated with respect to NDCG. We apply our algorithms on benchmark datasets demonstrating efficient online learning of a ranking function from highly restricted feedback.

A Robust-Equitable Copula Dependence Measure for Feature Selection

Mon, 02 May 2016 00:00:00 +0000

Feature selection aims to select relevant features to improve the performance of predictors. Many feature selection methods depend on the choice of dependence measures. To select features that have complex nonlinear relationships with the response variable, the dependence measure should be equitable: treating linear and nonlinear relationships equally. In this paper we introduce the concept of robust-equitability and a robust-equitable dependence measure copula correlation (Ccor). This measure has the following advantages compared to existing dependence measures: it is robust to different relationship forms and robust to unequal sample sizes of different features. In contrast, existing dependence measures cannot take these factors into account simultaneously. Experiments on synthetic and real-world datasets confirm our theoretical analysis, and illustrates its advantage in feature selection.

Multi-Level Cause-Effect Systems

Mon, 02 May 2016 00:00:00 +0000

We present a domain-general account of causation that applies to settings in which macro-level causal relations between two systems are of interest, but the relevant causal features are poorly understood and have to be aggregated from vast arrays of micro-measurements. Our approach generalizes that of Chalupka et. al. (2015) to the setting in which the macro-level effect is not specified. We formalize the connection between micro- and macro-variables in such situations and provide a coherent framework describing causal relations at multiple levels of analysis. We present an algorithm that discovers macro-variable causes and effects from micro-level measurements obtained from an experiment. We further show how to design experiments to discover macro-variables from observational micro-variable data. Finally, we show that under specific conditions, one can identify multiple levels of causal structure. Throughout the article, we use a simulated neuroscience multi-unit recording experiment to illustrate the ideas and the algorithms.

Learning Relationships between Data Obtained Independently

Mon, 02 May 2016 00:00:00 +0000

The aim of this paper is to provide a new method for learning the relationships between data that have been obtained independently. Unlike existing methods like matching, the proposed technique does not require any contextual information, provided that the dependency between the variables of interest is monotone. It can therefore be easily combined with matching in order to exploit the advantages of both methods. This technique can be described as a mix between quantile matching, and deconvolution. We provide for it a theoretical and an empirical validation.

Revealing Graph Bandits for Maximizing Local Influence

Mon, 02 May 2016 00:00:00 +0000

We study a graph bandit setting where the objective of the learner is to detect the most influential node of a graph by requesting as little information from the graph as possible. One of the relevant applications for this setting is marketing in social networks, where the marketer aims at finding and taking advantage of the most influential customers. The existing approaches for bandit problems on graphs require either partial or complete knowledge of the graph. In this paper, we do not assume any knowledge of the graph, but we consider a setting where it can be gradually discovered in a sequential and active way. At each round, the learner chooses a node of the graph and the only information it receives is a stochastic set of the nodes that the chosen node is currently influencing. To address this setting, we propose BARE, a bandit strategy for which we prove a regret guarantee that scales with the detectable dimension, a problem dependent quantity that is often much smaller than the number of nodes.

NYTRO: When Subsampling Meets Early Stopping

Mon, 02 May 2016 00:00:00 +0000

Early stopping is a well known approach to reduce the time complexity for performing training and model selection of large scale learning machines. On the other hand, memory/space (rather than time) complexity is the main constraint in many applications, and randomized subsampling techniques have been proposed to tackle this issue. In this paper we ask whether early stopping and subsampling ideas can be combined in a fruitful way. We consider the question in a least squares regression setting and propose a form of randomized iterative regularization based on early stopping and subsampling. In this context, we analyze the statistical and computational properties of the proposed method. Theoretical results are complemented and validated by a thorough experimental analysis.

Fast Convergence of Online Pairwise Learning Algorithms

Mon, 02 May 2016 00:00:00 +0000

Pairwise learning usually refers to a learning task which involves a loss function depending on pairs of examples, among which most notable ones are bipartite ranking, metric learning and AUC maximization. In this paper, we focus on online learning algorithms for pairwise learning problems without strong convexity, for which all previously known algorithms achieve a convergence rate of \mathcalO(1/\sqrtT) after T iterations. In particular, we study an online learning algorithm for pairwise learning with a least-square loss function in an unconstrained setting. We prove that the convergence of its last iterate can converge to the desired minimizer at a rate arbitrarily close to \mathcalO(1/T) up to logarithmic factor. The rates for this algorithm are established in high probability under the assumptions of polynomially decaying step sizes.

Time-Varying Gaussian Process Bandit Optimization

Mon, 02 May 2016 00:00:00 +0000

We consider the sequential Bayesian optimization problem with bandit feedback, adopting a formulation that allows for the reward function to vary with time. We model the reward function using a Gaussian process whose evolution obeys a simple Markov model. We introduce two natural extensions of the classical Gaussian process upper confidence bound (GP-UCB) algorithm. The first, R-GP-UCB, resets GP-UCB at regular intervals. The second, TV-GP-UCB, instead forgets about old data in a smooth fashion. Our main contribution comprises of novel regret bounds for these algorithms, providing an explicit characterization of the trade-off between the time horizon and the rate at which the function varies. We illustrate the performance of the algorithms on both synthetic and real data, and we find the gradual forgetting of TV-GP-UCB to perform favorably compared to the sharp resetting of R-GP-UCB. Moreover, both algorithms significantly outperform classical GP-UCB, since it treats stale and fresh data equally.

Supervised Neighborhoods for Distributed Nonparametric Regression

Mon, 02 May 2016 00:00:00 +0000

Techniques for nonparametric regression based on fitting small-scale local models at prediction time have long been studied in statistics and pattern recognition, but have received less attention in modern large-scale machine learning applications. In practice, such methods are generally applied to low-dimensional problems, but may falter with high-dimensional predictors if they use a Euclidean distance-based kernel. We propose a new method, SILO, for fitting prediction-time local models that uses supervised neighborhoods that adapt to the local shape of the regression surface. To learn such neighborhoods, we use a weight function between points derived from random forests. We prove the consistency of SILO, and demonstrate through simulations and real data that our method works well in both the serial and distributed settings. In the latter case, SILO learns the weighting function in a divide-and-conquer manner, entirely avoiding communication at training time.

Geometry Aware Mappings for High Dimensional Sparse Factors

Mon, 02 May 2016 00:00:00 +0000

While matrix factorisation models are ubiquitous in large scale recommendation and search, real time application of such models requires inner product computations over an intractably large set of item factors. In this manuscript we present a novel framework that exploits structural properties of sparse vectors, using the inverted index representation, to significantly reduce the run time computational cost of factorisation models. We develop techniques that use geometry aware permutation maps on a tessellated unit sphere to obtain high dimensional sparse embeddings for latent factors with sparsity patterns related to angular closeness of the original latent factors. We also design several efficient and deterministic realisations within this framework and demonstrate with experiments that our techniques lead to faster run time operation with minimal loss of accuracy.

Consistently Estimating Markov Chains with Noisy Aggregate Data

Mon, 02 May 2016 00:00:00 +0000

We address the problem of estimating the parameters of a time-homogeneous Markov chain given only noisy, aggregate data. This arises when a population of individuals behave independently according to a Markov chain, but individual sample paths cannot be observed due to limitations of the observation process or the need to protect privacy. Instead, only population-level counts of the number of individuals in each state at each time step are available. When these counts are exact, a conditional least squares (CLS) estimator is known to be consistent and asymptotically normal. We initiate the study of method of moments estimators for this problem to handle the more realistic case when observations are additionally corrupted by noise. We show that CLS can be interpreted as a simple “plug-in” method of moments estimator. However, when observations are noisy, it is not consistent because it fails to account for additional variance introduced by the noise. We develop a new, simpler method of moments estimator that bypasses this problem and is consistent under noisy observations.

PAC-Bayesian Bounds based on the Rényi Divergence

Mon, 02 May 2016 00:00:00 +0000

We propose a simplified proof process for PAC-Bayesian generalization bounds, that allows to divide the proof in four successive inequalities, easing the "customization" of PAC-Bayesian theorems. We also propose a family of PAC-Bayesian bounds based on the Rényi divergence between the prior and posterior distributions, whereas most PAC-Bayesian bounds are based on the Kullback-Leibler divergence. Finally, we present an empirical evaluation of the tightness of each inequality of the simplified proof, for both the classical PAC-Bayesian bounds and those based on the Rényi divergence.

Randomization and The Pernicious Effects of Limited Budgets on Auction Experiments

Mon, 02 May 2016 00:00:00 +0000

Buyers (e.g., advertisers) often have limited financial or processing resources, and so their participation in auctions is throttled. Changes to auctions may affect bids or throttling, and any change may affect what buyers pay. This paper shows that if an A/B experiment affects only bids, then the observed treatment effect is an unbiased estimator when all the bidders in the same auction are randomly assigned to A or B but the observed treatment effect can be severely biased otherwise, even in the absence of throttling. Experiments that affect throttling algorithms can also be badly biased, but the bias can be much reduced if separate budgets are maintained for the A and B arms of the experiment.

Parallel Markov Chain Monte Carlo via Spectral Clustering

Mon, 02 May 2016 00:00:00 +0000

As it has become common to use many computer cores in routine applications, finding good ways to parallelize popular algorithms has become increasingly important. In this paper, we present a parallelization scheme for Markov chain Monte Carlo (MCMC) methods based on spectral clustering of the underlying state space, generalizing earlier work on parallelization of MCMC methods by state space partitioning. We show empirically that this approach speeds up MCMC sampling for multimodal distributions and that it can be usefully applied in greater generality than several related algorithms. Our algorithm converges under reasonable conditions to an ‘optimal’ MCMC algorithm. We also show that our approach can be asymptotically far more efficient than naive parallelization, even in situations such as completely flat target distributions where no unique optimal algorithm exists. Finally, we combine theoretical and empirical bounds to provide practical guidance on the choice of tuning parameters.

Breaking Sticks and Ambiguities with Adaptive Skip-gram

Mon, 02 May 2016 00:00:00 +0000

The recently proposed Skip-gram model is a powerful method for learning high-dimensional word representations that capture rich semantic relationships between words. However, Skip-gram as well as most prior work on learning word representations does not take into account word ambiguity and maintain only single representation per word. Although a number of Skip-gram modifications were proposed to overcome this limitation and learn multi-prototype word representations, they either require a known number of word meanings or learn them using greedy heuristic approaches. In this paper we propose the Adaptive Skip-gram model which is a nonparametric Bayesian extension of Skip-gram capable to automatically learn the required number of representations for all words at desired semantic resolution. We derive efficient online variational learning algorithm for the model and empirically demonstrate its efficiency on word-sense induction task.

Probabilistic Approximate Least-Squares

Mon, 02 May 2016 00:00:00 +0000

Least-squares and kernel-ridge / Gaussian process regression are among the foundational algorithms of statistics and machine learning. Famously, the worst-case cost of exact nonparametric regression grows cubically with the data-set size; but a growing number of approximations have been developed that estimate good solutions at lower cost. These algorithms typically return point estimators, without measures of uncertainty. Leveraging recent results casting elementary linear algebra operations as probabilistic inference, we propose a new approximate method for nonparametric least-squares that affords a probabilistic uncertainty estimate over the error between the approximate and exact least-squares solution (this is not the same as the posterior variance of the associated Gaussian process regressor). This allows estimating the error of the least-squares solution on a subset of the data relative to the full-data solution. The uncertainty can be used to control the computational effort invested in the approximation. Our algorithm has linear cost in the data-set size, and a simple formal form, so that it can be implemented with a few lines of code in programming languages with linear algebra functionality.

Pareto Front Identification from Stochastic Bandit Feedback

Mon, 02 May 2016 00:00:00 +0000

We consider the problem of identifying the Pareto front for multiple objectives from a finite set of operating points. Sampling an operating point gives a random vector where each coordinate corresponds to the value of one of the objectives. The Pareto front is the set of operating points that are not dominated by any other operating point in respect to all objectives (considering the mean of their objective values). We propose a confidence bound algorithm to approximate the Pareto front, and prove problem specific lower and upper bounds, showing that the sample complexity is characterized by some natural geometric properties of the operating points. Experiments confirm the reliability of our algorithm. For the problem of finding a sparse cover of the Pareto front, we propose an asymmetric covering algorithm of independent interest.

Bipartite Correlation Clustering: Maximizing Agreements

Mon, 02 May 2016 00:00:00 +0000

In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph G with ’+’ and ’-’ edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all ’+’ edges within clusters plus all ’-’ edges cut across clusters. BCC is known to be NP-hard [5]. We present a novel approximation algorithm for k-BCC, a variant of BCC with an upper bound k on the number of clusters. Our algorithm outputs a k-clustering that provably achieves a number of agreements within a multiplicative (1-δ)-factor from the optimal, for any desired accuracy δ. It relies on solving a combinatorially constrained bilinear maximization on the bi-adjacency matrix of G. It runs in time exponential in k and 1/δ, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, an (1-δ)-approximation can be achieved by O(1/δ) clusters regardless of the size of the graph. In turn, our k-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.

Tensor vs. Matrix Methods: Robust Tensor Decomposition under Block Sparse Perturbations

Mon, 02 May 2016 00:00:00 +0000

Robust tensor CP decomposition involves decomposing a tensor into low rank and sparse components. We propose a novel non-convex iterative algorithm with guaranteed recovery. It alternates between low-rank CP decomposition through gradient ascent (a variant of the tensor power method), and hard thresholding of the residual. We prove convergence to the globally optimal solution under natural incoherence conditions on the low rank component, and bounded level of sparse perturbations. We compare our method with natural baselines, which apply robust matrix PCA either to the \em flattened tensor, or to the matrix slices of the tensor. Our method can provably handle a far greater level of perturbation when the sparse tensor is block-structured. This naturally occurs in many applications such as the foreground-background separation task in videos. Our experiments validate these findings. Thus, we establish that tensor methods can tolerate a higher level of gross corruptions compared to matrix methods.

Sketching, Embedding and Dimensionality Reduction in Information Theoretic Spaces

Mon, 02 May 2016 00:00:00 +0000

In this paper we show how to embed information distances like the χ^2 and Jensen-Shannon divergences efficiently in low dimensional spaces while preserving all pairwise distances. We then prove a dimensionality reduction result for the Hellinger, Jensen–Shannon, and χ^2 divergences that preserves the information geometry of the distributions, specifically, by retaining the simplex structure of the space. While our first result already implies these divergences can be explicitly embedded in the Euclidean space, retaining the simplex structure is important because it allows us to do inferences in the reduced space. We also show that these divergences can be sketched efficiently (i.e., up to a multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model.

A Fast and Reliable Policy Improvement Algorithm

Mon, 02 May 2016 00:00:00 +0000

We introduce a simple, efficient method that improves stochastic policies for Markov decision processes. The computational complexity is the same as that of the value estimation problem. We prove that when the value estimation error is small, this method gives an improvement in performance that increases with certain variance properties of the initial policy and transition dynamics. Performance in numerical experiments compares favorably with previous policy improvement algorithms.