Proceedings of Machine Learning Research

Learning Theory for Conditional Risk Minimization

Mon, 10 Apr 2017 00:00:00 +0000

In this work we study the learnability of stochastic processes with respect to the conditional risk, i.e. the existence of a learning algorithm that improves its next-step performance with the amount of observed data. We introduce a notion of pairwise discrepancy between conditional distributions at different times steps and show how certain properties of these discrepancies can be used to construct a successful learning algorithm. Our main results are two theorems that establish criteria for learnability for many classes of stochastic processes, including all special cases studied previously in the literature.

Learning Nonparametric Forest Graphical Models with Prior Information

Mon, 10 Apr 2017 00:00:00 +0000

We present a framework for incorporating prior information into nonparametric estimation of graphical models. To avoid distributional assumptions, we restrict the graph to be a forest and build on the work of forest density estimation (FDE). We reformulate the FDE approach from a Bayesian perspective, and introduce prior distributions on the graphs. As two concrete examples, we apply this framework to estimating scale-free graphs and learning multiple graphs with similar structures. The resulting algorithms are equivalent to finding a maximum spanning tree of a weighted graph with a penalty term on the connectivity pattern of the graph. We solve the optimization problem via a minorize-maximization procedure with Kruskal’s algorithm. Simulations show that the proposed methods outperform competing parametric methods, and are robust to the true data distribution. They also lead to improvement in predictive power and interpretability in two real data sets.

Scaling Submodular Maximization via Pruned Submodularity Graphs

Mon, 10 Apr 2017 00:00:00 +0000

We propose a new random pruning method (called “submodular sparsification (SS)”) to reduce the cost of submodular maximization. The pruning is applied via a “submodularity graph” over the $n$ ground elements, where each directed edge is associated with a pairwise dependency defined by the submodular function. In each step, SS prunes a $1-1/\sqrt{c}$ (for $c>1$) fraction of the nodes using weights on edges computed based on only a small number ($O(\log n)$) of randomly sampled nodes. The algorithm requires $\log_\sqrt{c}n$ steps with a small and highly parallelizable per-step computation. An accuracy-speed tradeoff parameter $c$, set as $c = 8$, leads to a fast shrink rate $\sqrt{2}/4$ and small iteration complexity $\log_{2\sqrt{2}}n$. Analysis shows that w.h.p., the greedy algorithm on the pruned set of size $O(\log^2 n)$ can achieve a guarantee similar to that of processing the original dataset. In news and video summarization tasks, SS is able to substantially reduce both computational costs and memory usage, while maintaining (or even slightly exceeding) the quality of the original (and much more costly) greedy algorithm.

Fast Classification with Binary Prototypes

Mon, 10 Apr 2017 00:00:00 +0000

In this work, we propose a new technique for \emphfast k-nearest neighbor (k-NN) classification in which the original database is represented via a small set of learned binary prototypes. The training phase simultaneously learns a hash function which maps the data points to binary codes, and a set of representative binary prototypes. In the prediction phase, we first hash the query into a binary code and then do the k-NN classification using the binary prototypes as the database. Our approach speeds up k-NN classification in two aspects. First, we compress the database into a smaller set of prototypes such that k-NN search only goes through a smaller set rather than the whole dataset. Second, we reduce the original space to a compact binary embedding, where the Hamming distance between two binary codes is very efficient to compute. We propose a formulation to learn the hash function and prototypes such that the classification error is minimized. We also provide a novel theoretical analysis of the proposed technique in terms of Bayes error consistency. Empirically, our method is much faster than the state-of-the-art k-NN compression methods with comparable accuracy.

Online Nonnegative Matrix Factorization with General Divergences

Mon, 10 Apr 2017 00:00:00 +0000

We develop a unified and systematic framework for performing online nonnegative matrix factorization under a wide variety of important divergences. The online nature of our algorithms makes them particularly amenable to large-scale data. We prove that the sequence of learned dictionaries converges almost surely to the set of critical points of the expected loss function. Experimental results demonstrate the computational efficiency and outstanding performances of our algorithms on several real-life applications, including topic modeling, document clustering and foreground-background separation.

Scalable Convex Multiple Sequence Alignment via Entropy-Regularized Dual Decomposition

Mon, 10 Apr 2017 00:00:00 +0000

Multiple Sequence Alignment (MSA) is one of the fundamental tasks in biological sequence analysis that underlies applications such as phylogenetic trees, profiles, and structure prediction. The task, however, is NP-hard, and the current practice resorts to heuristic and local-search methods. Recently, a convex optimization approach for MSA was proposed based on the concept of atomic norm, which demonstrates significant improvement over existing methods in the quality of alignments. However, the convex program is challenging to solve due to the constraint given by the intersection of two atomic-norm balls, for which the existing algorithm can only handle sequences of length up to 50, with an iteration complexity subject to constants of unknown relation to the natural parameters of MSA. In this work, we propose an accelerated dual decomposition algorithm that exploits entropy regularization to induce closed-form solutions for each atomic-norm-constrained subproblem, giving a single-loop algorithm of iteration complexity linear to the problem size (total length of all sequences). The proposed algorithm gives significantly better alignments than existing methods on sequences of length up to hundreds, where the existing convex programming method fails to converge in one day.

On the Learnability of Fully-Connected Neural Networks

Mon, 10 Apr 2017 00:00:00 +0000

Despite the empirical success of deep neural networks, there is limited theoretical understanding on the learnability of these models using a polynomial-time algorithm. In this paper, we characterize the learnability of fully-connected neural networks via both positive and negative results. We focus on $\ell_1$-regularized networks, where the $\ell_1$-norm of the incoming weights of every neuron is assumed to be bounded by a constant $B > 0$. Our first result shows that such networks are properly learnable in $\text{poly}(n,d,\exp(1/ε^2))$ time, where $n$ and $d$ are the sample size and the input dimension, and $ε> 0$ is the gap to optimality. The bound is achieved by repeatedly sampling over a low-dimensional manifold so as to ensure approximate optimality, but avoids the $\exp(d)$ cost of exhaustively searching over the parameter space. We also establish a hardness result showing that the exponential dependence on $1/ε$ is unavoidable unless $\bf RP = \bf NP$. Our second result shows that the exponential dependence on $1/ε$ can be avoided by exploiting the underlying structure of the data distribution. In particular, if the positive and negative examples can be separated with margin $γ> 0$ by an unknown neural network, then the network can be learned in $\text{poly}(n,d,1/ε)$ time. The bound is achieved by an ensemble method which uses the first algorithm as a weak learner. We further show that the separability assumption can be weakened to tolerate noisy labels. Finally, we show that the exponential dependence on $1/γ$ is unimprovable under a certain cryptographic assumption.

Minimax Approach to Variable Fidelity Data Interpolation

Mon, 10 Apr 2017 00:00:00 +0000

Engineering problems often involve data sources of variable fidelity with different costs of obtaining an observation. In particular, one can use both a cheap low fidelity function (e.g. a computational experiment with a CFD code) and an expensive high fidelity function (e.g. a wind tunnel experiment) to generate a data sample in order to construct a regression model of a high fidelity function. The key question in this setting is how the sizes of the high and low fidelity data samples should be selected in order to stay within a given computational budget and maximize accuracy of the regression model prior to committing resources on data acquisition. In this paper we obtain minimax interpolation errors for single and variable fidelity scenarios for a multivariate Gaussian process regression. Evaluation of the minimax errors allows us to identify cases when the variable fidelity data provides better interpolation accuracy than the exclusively high fidelity data for the same computational budget. These results allow us to calculate the optimal shares of variable fidelity data samples under the given computational budget constraint. Real and synthetic data experiments suggest that using the obtained optimal shares often outperforms natural heuristics in terms of the regression accuracy.

Fairness Constraints: Mechanisms for Fair Classification

Mon, 10 Apr 2017 00:00:00 +0000

Algorithmic decision making systems are ubiquitous across a wide variety of online as well as offline services. These systems rely on complex learning methods and vast amounts of data to optimize the service functionality, satisfaction of the end user and profitability. However, there is a growing concern that these automated decisions can lead, even in the absence of intent, to a lack of fairness, i.e., their outcomes can disproportionately hurt (or, benefit) particular groups of people sharing one or more sensitive attributes (e.g., race, sex). In this paper, we introduce a flexible mechanism to design fair classifiers by leveraging a novel intuitive measure of decision boundary (un)fairness. We instantiate this mechanism with two well-known classifiers, logistic regression and support vector machines, and show on real-world data that our mechanism allows for a fine-grained control on the degree of fairness, often at a small cost in terms of accuracy.

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

Mon, 10 Apr 2017 00:00:00 +0000

This paper concerns a fundamental class of convex matrix optimization problems. It presents the first algorithm that uses optimal storage and provably computes a low-rank approximation of a solution. In particular, when all solutions have low rank, the algorithm converges to a solution. This algorithm, SketchyCGM, modifies a standard convex optimization scheme, the conditional gradient method, to store only a small randomized sketch of the matrix variable. After the optimization terminates, the algorithm extracts a low-rank approximation of the solution from the sketch. In contrast to nonconvex heuristics, the guarantees for SketchyCGM do not rely on statistical models for the problem data. Numerical work demonstrates the benefits of SketchyCGM over heuristics.

Localized Lasso for High-Dimensional Regression

Mon, 10 Apr 2017 00:00:00 +0000

We introduce the localized Lasso, which learns models that both are interpretable and have a high predictive power in problems with high dimensionality d and small sample size n. More specifically, we consider a function defined by local sparse models, one at each data point. We introduce sample-wise network regularization to borrow strength across the models, and sample-wise exclusive group sparsity (a.k.a., l12 norm) to introduce diversity into the choice of feature sets in the local models. The local models are interpretable in terms of similarity of their sparsity patterns. The cost function is convex, and thus has a globally optimal solution. Moreover, we propose a simple yet efficient iterative least-squares based optimization procedure for the localized Lasso, which does not need a tuning parameter, and is guaranteed to converge to a globally optimal solution. The solution is empirically shown to outperform alternatives for both simulated and genomic personalized/precision medicine data.

Efficient Algorithm for Sparse Tensor-variate Gaussian Graphical Models via Gradient Descent

Mon, 10 Apr 2017 00:00:00 +0000

We study the sparse tensor-variate Gaussian graphical model (STGGM), where each way of the tensor follows a multivariate normal distribution whose precision matrix has sparse structures. In order to estimate the precision matrices, we propose a sparsity constrained maximum likelihood estimator. However, due to the complex structure of the tensor-variate GGMs, the likelihood based estimator is non-convex, which poses great challenges for both computation and theoretical analysis. In order to address these challenges, we propose an efficient alternating gradient descent algorithm to solve this estimator, and prove that, under certain conditions on the initial estimator, our algorithm is guaranteed to linearly converge to the unknown precision matrices up to the optimal statistical error. Experiments on both synthetic data and real world brain imaging data corroborate our theory.

Adaptive ADMM with Spectral Penalty Parameter Selection

Mon, 10 Apr 2017 00:00:00 +0000

The alternating direction method of multipliers (ADMM) is a versatile tool for solving a wide range of constrained optimization problems. However, its performance is highly sensitive to a penalty parameter, making ADMM often unreliable and hard to automate for a non-expert user. We tackle this weakness of ADMM by proposing a method that adaptively tunes the penalty parameter to achieve fast convergence. The resulting adaptive ADMM (AADMM) algorithm, inspired by the successful Barzilai-Borwein spectral method for gradient descent, yields fast convergence and relative insensitivity to the initial stepsize and problem scaling.

Diverse Neural Network Learns True Target Functions

Mon, 10 Apr 2017 00:00:00 +0000

Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties? In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the “extended feature matrix” is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization ability.

Markov Chain Truncation for Doubly-Intractable Inference

Mon, 10 Apr 2017 00:00:00 +0000

Computing partition functions, the normalizing constants of probability distributions, is often hard. Variants of importance sampling give unbiased estimates of a normalizer Z, however, unbiased estimates of the reciprocal 1/Z are harder to obtain. Unbiased estimates of 1/Z allow Markov chain Monte Carlo sampling of “doubly-intractable” distributions, such as the parameter posterior for Markov Random Fields or Exponential Random Graphs. We demonstrate how to construct unbiased estimates for 1/Z given access to black-box importance sampling estimators for Z. We adapt recent work on random series truncation and Markov chain coupling, producing estimators with lower variance and a higher percentage of positive estimates than before. Our debiasing algorithms are simple to implement, and have some theoretical and empirical advantages over existing methods.

Linking Micro Event History to Macro Prediction in Point Process Models

Mon, 10 Apr 2017 00:00:00 +0000

User behaviors in social networks are microscopic with fine grained temporal information. Predicting a macroscopic quantity based on users’ collective behaviors is an important problem. However, existing works are mainly problem-specific models for the microscopic behaviors and typically design approximation or heuristic algorithms to compute the macroscopic quantity. In this paper, we propose a unifying framework with a jump stochastic differential equation model that systematically links the microscopic event data and macroscopic inference, and the theory to approximate its probability distribution. We showed that our method can better predict the user behaviors in real-world applications.

A Fast and Scalable Joint Estimator for Learning Multiple Related Sparse Gaussian Graphical Models

Mon, 10 Apr 2017 00:00:00 +0000

Estimating multiple sparse Gaussian Graphical Models (sGGMs) jointly for many related tasks (large $K$) under a high-dimensional (large $p$) situation is an important task. Most previous studies for the joint estimation of multiple sGGMs rely on penalized log-likelihood estimators that involve expensive and difficult non-smooth optimizations. We propose a novel approach, FASJEM for \underlinefast and \underlinescalable \underlinejoint structure-\underlineestimation of \underlinemultiple sGGMs at a large scale. As the first study of joint sGGM using the M-estimator framework, our work has three major contributions: (1) We solve FASJEM through an entry-wise manner which is parallelizable. (2) We choose a proximal algorithm to optimize FASJEM. This improves the computational efficiency from $O(Kp^3)$ to $O(Kp^2)$ and reduces the memory requirement from $O(Kp^2)$ to $O(K)$. (3) We theoretically prove that FASJEM achieves a consistent estimation with a convergence rate of $O(\log(Kp)/n_tot)$. On several synthetic and four real-world datasets, FASJEM shows significant improvements over baselines on accuracy, computational complexity and memory costs.

Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

Mon, 10 Apr 2017 00:00:00 +0000

Sketching techniques scale up machine learning algorithms by reducing the sample size or dimensionality of massive data sets, without sacrificing their statistical properties. In this paper, we study sketching from an optimization point of view. We first show that the iterative Hessian sketch is an optimization process with \emphpreconditioning and develop an \emphaccelerated version using this insight together with conjugate gradient descent. Next, we establish a primal-dual connection between the Hessian sketch and dual random projection, which allows us to develop an \emphaccelerated iterative dual random projection method by applying the preconditioned conjugate gradient descent on the dual problem. Finally, we tackle the problems of large sample size and high-dimensionality in massive data sets by developing the \emphprimal-dual sketch. The primal-dual sketch iteratively sketches the primal and dual formulations and requires only a logarithmic number of calls to solvers of small sub-problems to recover the optimum of the original problem up to \empharbitrary precision. Extensive experiments on synthetic and real data sets complement our theoretical results.

Tracking Objects with Higher Order Interactions via Delayed Column Generation

Mon, 10 Apr 2017 00:00:00 +0000

We study the problem of multi-target tracking and data association in video. We formulate this in terms of selecting a subset of high-quality tracks subject to the constraint that no pair of selected tracks is associated with a common detection (of an object). This objective is equivalent to the classic NP-hard problem of finding a maximum-weight set packing (MWSP) where tracks correspond to sets and is made further difficult since the number of candidate tracks grows exponentially in the number of detections. We present a relaxation of this combinatorial problem that uses a column generation formulation where the pricing problem is solved via dynamic programming to efficiently explore the space of tracks. We employ row generation to tighten the bound in such a way as to preserve efficient inference in the pricing problem. We show the practical utility of this algorithm for pedestrian and particle tracking.

A Unified Computational and Statistical Framework for Nonconvex Low-rank Matrix Estimation

Mon, 10 Apr 2017 00:00:00 +0000

We propose a unified framework for estimating low-rank matrices through nonconvex optimization based on gradient descent algorithm. Our framework is quite general and can be applied to both noisy and noiseless observations. In the general case with noisy observations, we show that our algorithm is guaranteed to linearly converge to the unknown low-rank matrix up to a minimax optimal statistical error, provided an appropriate initial estimator. While in the generic noiseless setting, our algorithm converges to the unknown low-rank matrix at a linear rate and enables exact recovery with optimal sample complexity. In addition, we develop a new initialization algorithm to provide the desired initial estimator, which outperforms existing initialization algorithms for nonconvex low-rank matrix estimation. We illustrate the superiority of our framework through three examples: matrix regression, matrix completion, and one-bit matrix completion. We also corroborate our theory through extensive experiments on synthetic data.

Tensor Decompositions via Two-Mode Higher-Order SVD (HOSVD)

Mon, 10 Apr 2017 00:00:00 +0000

Tensor decompositions have rich applications in statistics and machine learning, and developing efficient, accurate algorithms for the problem has received much attention recently. Here, we present a new method built on Kruskal’s uniqueness theorem to decompose symmetric, nearly orthogonally decomposable tensors. Unlike the classical higher-order singular value decomposition which unfolds a tensor along a single mode, we consider unfoldings along two modes and use rank-1 constraints to characterize the underlying components. This tensor decomposition method provably handles a greater level of noise compared to previous methods and achieves a high estimation accuracy. Numerical results demonstrate that our algorithm is robust to various noise distributions and that it performs especially favorably as the order increases.

Prediction Performance After Learning in Gaussian Process Regression

Mon, 10 Apr 2017 00:00:00 +0000

This paper considers the quantification of the prediction performance in Gaussian process regression. The standard approach is to base the prediction error bars on the theoretical predictive variance, which is a lower bound on the mean square-error (MSE). This approach, however, does not take into account that the statistical model is learned from the data. We show that this omission leads to a systematic underestimation of the prediction errors. Starting from a generalization of the Cramér-Rao bound, we derive a more accurate MSE bound which provides a measure of uncertainty for prediction of Gaussian processes. The improved bound is easily computed and we illustrate it using synthetic and real data examples.

Fast column generation for atomic norm regularization

Mon, 10 Apr 2017 00:00:00 +0000

We consider optimization problems that consist in minimizing a quadratic function under an atomic norm regularization or constraint. In the line of work on conditional gradient algorithms, we show that the fully corrective Frank-Wolfe (FCFW) algorithm — which is most naturally reformulated as a column generation algorithm in the regularized case — can be made particularly efficient for difficult problems in this family by solving the simplicial or conical subproblems produced by FCFW using a special instance of a classical active set algorithm for quadratic programming that generalizes the min-norm point algorithm.

Horde of Bandits using Gaussian Markov Random Fields

Mon, 10 Apr 2017 00:00:00 +0000

The gang of bandits (GOB) model [7] is a recent contextual bandits framework that shares information between a set of bandit problems, related by a known (possibly noisy) graph. This model is useful in problems like recommender systems where the large number of users makes it vital to transfer information between users. Despite its effectiveness, the existing GOB model can only be applied to small problems due to its quadratic time-dependence on the number of nodes. Existing solutions to combat the scalability issue require an often-unrealistic clustering assumption. By exploiting a connection to Gaussian Markov random fields (GMRFs), we show that the GOB model can be made to scale to much larger graphs without additional assumptions. In addition, we propose a Thompson sampling algorithm which uses the recent GMRF sampling-by-perturbation technique, allowing it to scale to even larger problems (leading to a “horde” of bandits). We give regret bounds and experimental results for GOB with Thompson sampling and epoch-greedy algorithms, indicating that these methods are as good as or significantly better than ignoring the graph or adopting a clustering-based approach. Finally, when an existing graph is not available, we propose a heuristic for learning it on the fly and show promising results.

Decentralized Collaborative Learning of Personalized Models over Networks

Mon, 10 Apr 2017 00:00:00 +0000

We consider a set of learning agents in a collaborative peer-to-peer network, where each agent learns a personalized model according to its own learning objective. The question addressed in this paper is: how can agents improve upon their locally trained model by communicating with other agents that have similar objectives? We introduce and analyze two asynchronous gossip algorithms running in a fully decentralized manner. Our first approach, inspired from label propagation, aims to smooth pre-trained local models over the network while accounting for the confidence that each agent has in its initial model. In our second approach, agents jointly learn and propagate their model by making iterative updates based on both their local dataset and the behavior of their neighbors. To optimize this challenging objective, our decentralized algorithm is based on ADMM.

Communication-efficient Distributed Sparse Linear Discriminant Analysis

Mon, 10 Apr 2017 00:00:00 +0000

We propose a communication-efficient distributed estimation method for sparse linear discriminant analysis (LDA) in the high dimensional regime. Our method distributes the data of size N into m machines, and estimates a local sparse LDA estimator on each machine using the data subset of size N/m. After the distributed estimation, our method aggregates the debiased local estimators from m machines, and sparsifies the aggregated estimator. We show that the aggregated estimator attains the same statistical rate as the centralized estimation method, as long as the number of machines m is chosen appropriately. Moreover, we prove that our method can attain the model selection consistency under a milder condition than the centralized method. Experiments on both synthetic and real datasets corroborate our theory.

Anomaly Detection in Extreme Regions via Empirical MV-sets on the Sphere

Mon, 10 Apr 2017 00:00:00 +0000

Extreme regions in the feature space are of particular concern for anomaly detection: anomalies are likely to be located in the tails, whereas data scarcity in such regions makes it difficult to distinguish between large normal instances and anomalies. This paper presents an unsupervised algorithm for anomaly detection in extreme regions. We propose a Minimum Volume set (MV-set) approach relying on multivariate extreme value theory. This framework includes a canonical pre-processing step, which addresses the issue of output sensitivity to standardization choices. The resulting data representation on the sphere highlights the dependence structure of the extremal observations. Anomaly detection is then cast as a MV-set estimation problem on the sphere, where volume is measured by the spherical measure and mass refers to the angular measure. An anomaly then corresponds to an unusual observation given that one of its variables is large. A preliminary rate bound analysis is carried out for the learning method we introduce and its computational advantages are discussed and illustrated by numerical experiments.

Spatial Decompositions for Large Scale SVMs

Mon, 10 Apr 2017 00:00:00 +0000

Although support vector machines (SVMs) are theoretically well understood, their underlying optimization problem becomes very expensive if, for example, hundreds of thousands of samples and a non-linear kernel are considered. Several approaches have been proposed in the past to address this serious limitation. In this work we investigate a decomposition strategy that learns on small, spatially defined data chunks. Our contributions are two fold: On the theoretical side we establish an oracle inequality for the overall learning method using the hinge loss, and show that the resulting rates match those known for SVMs solving the complete optimization problem with Gaussian kernels. On the practical result we compare our approach to learning SVMs on small, randomly chosen chunks. Here it turns out that for comparable training times our approach is significantly faster during testing and also reduces the test error in most cases significantly. Furthermore, we show that our approach easily scales up to 10 million training samples: including hyper-parameter selection using cross validation, the entire training only takes a few hours on a single machine. Finally, we report an experiment on 32 million training samples. All experiments used liquidSVM (Steinwart and Thomann, 2017)

Convergence Rate of Stochastic k-means

Mon, 10 Apr 2017 00:00:00 +0000

We analyze online (Bottou & Bengio, 1994) and mini-batch (Sculley, 2010) k-means variants. Both scale up the widely used Lloyd’s algorithm via stochastic approximation, and have become popular for large-scale clustering and unsupervised feature learning. We show, for the first time, that they have global convergence towards “local optima” at rate $O(1/t)$ under general conditions. In addition, we show that if the dataset is clusterable, stochastic k-means with suitable initialization converges to an optimal k-means solution at rate $O(1/t)$ with high probability. The k-means objective is non-convex and non-differentiable; we exploit ideas from non-convex gradient-based optimization by providing a novel characterization of the trajectory of the k-means algorithm on its solution space, and circumvent its non-differentiability via geometric insights about the k-means update.

Initialization and Coordinate Optimization for Multi-way Matching

Mon, 10 Apr 2017 00:00:00 +0000

We consider the problem of consistently matching multiple sets of elements to each other, which is a common task in fields such as computer vision. To solve the underlying NP-hard objective, existing methods often relax or approximate it, but end up with unsatisfying empirical performance due to a misaligned objective. We propose a coordinate update algorithm that directly optimizes the target objective. By using pairwise alignment information to build an undirected graph and initializing the permutation matrices along the edges of its Maximum Spanning Tree, our algorithm successfully avoids bad local optima. Theoretically, with high probability our algorithm guarantees an optimal solution under reasonable noise assumptions. Empirically, our algorithm consistently and significantly outperforms existing methods on several benchmark tasks on real datasets.

Learning Structured Weight Uncertainty in Bayesian Neural Networks

Mon, 10 Apr 2017 00:00:00 +0000

Deep neural networks (DNNs) are increasingly popular in modern machine learning. Bayesian learning affords the opportunity to quantify posterior uncertainty on DNN model parameters. Most existing work adopts independent Gaussian priors on the model weights, ignoring possible structural information. In this paper, we consider the matrix variate Gaussian (MVG) distribution to model structured correlations within the weights of a DNN. To make posterior inference feasible, a reparametrization is proposed for the MVG prior, simplifying the complex MVG-based model to an equivalent yet simpler model with independent Gaussian priors on the transformed weights. Consequently, we develop a scalable Bayesian online inference algorithm by adopting the recently proposed probabilistic backpropagation framework. Experiments on several synthetic and real datasets indicate the superiority of our model, achieving competitive performance in terms of model likelihood and predictive root mean square error. Importantly, it also yields faster convergence speed compared to related Bayesian DNN models.

Scalable Variational Inference for Super Resolution Microscopy

Mon, 10 Apr 2017 00:00:00 +0000

Super-resolution microscopy methods have become essential tools in biology, opening up a variety of new questions that were previously inaccessible with standard light microscopy methods. In this paper we develop new Bayesian image processing methods that extend the reach of super-resolution microscopy even further. Our method couples variational inference techniques with a data summarization based on Laplace approximation to ensure computational scalability. Our formulation makes it straightforward to incorporate prior information about the underlying sample to further improve accuracy. The proposed method obtains dramatic resolution improvements over previous methods while retaining computational tractability.

Tensor-Dictionary Learning with Deep Kruskal-Factor Analysis

Mon, 10 Apr 2017 00:00:00 +0000

A multi-way factor analysis model is introduced for tensor-variate data of any order. Each data item is represented as a (sparse) sum of Kruskal decompositions, a Kruskal- factor analysis (KFA). KFA is nonparametric and can infer both the tensor-rank of each dictionary atom and the number of dictionary atoms. The model is adapted for online learning, which allows dictionary learning on large data sets. After KFA is introduced, the model is extended to a deep convolutional tensor-factor analysis, supervised by a Bayesian SVM. The experiments section demonstrates the improvement of KFA over vectorized approaches (e.g., BPFA), tensor decompositions, and convolutional neural networks (CNN) in multi-way denoising, blind inpainting, and image classification. The improvement in PSNR for the inpainting results over other methods exceeds 1dB in several cases and we achieve state of the art results on Caltech101 image classification.

Performance Bounds for Graphical Record Linkage

Mon, 10 Apr 2017 00:00:00 +0000

Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance.

Generalization Error of Invariant Classifiers

Mon, 10 Apr 2017 00:00:00 +0000

This paper studies the generalization error of invariant classifiers. In particular, we consider the common scenario where the classification task is invariant to certain transformations of the input, and that the classifier is constructed (or learned) to be invariant to these transformations. Our approach relies on factoring the input space into a product of a base space and a set of transformations. We show that whereas the generalization error of a non-invariant classifier is proportional to the complexity of the input space, the generalization error of an invariant classifier is proportional to the complexity of the base space. We also derive a set of sufficient conditions on the geometry of the base space and the set of transformations that ensure that the complexity of the base space is much smaller than the complexity of the input space. Our analysis applies to general classifiers such as convolutional neural networks. We demonstrate the implications of the developed theory for such classifiers with experiments on the MNIST and CIFAR-10 datasets.

Compressed Least Squares Regression revisited

Mon, 10 Apr 2017 00:00:00 +0000

We revisit compressed least squares (CLS) regression as originally analyzed in Maillard and Munos (2009) and later on in Kaban (2014) with some refinements. Given a set of high-dimensional inputs, CLS applies a random projection and then performs least squares regression based on the projected inputs of lower dimension. This approach can be beneficial with regard to both computation (yielding a smaller least squares problem) and statistical performance (reducing the estimation error). We will argue below that the outcome of previous analysis of the procedure is not meaningful in typical situations, yielding a bound on the prediction error that is inferior to ordinary least squares while requiring the dimension of the projected data to be of the same order as the original dimension. As a fix, we subsequently present a modified analysis with meaningful implications that much better reflects empirical results with simulated and real data.

Sparse Randomized Partition Trees for Nearest Neighbor Search

Mon, 10 Apr 2017 00:00:00 +0000

Randomized partition trees have recently been shown to be very effective in solving nearest neighbor search problem. In spite of enjoying strong theoretical guarantee, it suffers from high space complexity, since each internal node of the tree needs to store a $d$ dimensional projection direction leading to a $O(nd)$ space complexity for a dataset of size $n$. Inspired by the fast Johnson-Lindenstrauss transform, in this paper, we propose a sparse version of randomized partition tree where each internal node needs to store only a few non-zero entries, as opposed to all $d$ entries, leading to significant space savings without sacrificing much in terms of nearest neighbor search accuracy. As a by product of this, query time of our proposed method is slightly better than that of its non-sparse counterpart for large dataset size. Our theoretical results indicate that our proposed method enjoys the same theoretical guarantee as that of the original non-sparse RP-tree. Experimental evaluations on four real world dataset strongly suggest that nearest neighbor search performance of our proposed sparse RP-tree is very similar to that of its non-sparse counterpart in terms of accuracy and number of retrieved points.

Regret Bounds for Transfer Learning in Bayesian Optimisation

Mon, 10 Apr 2017 00:00:00 +0000

This paper studies the regret bound of two transfer learning algorithms in Bayesian optimisation. The first algorithm models any difference between the source and target functions as a noise process. The second algorithm proposes a new way to model the difference between the source and target as a Gaussian process which is then used to adapt the source data. We show that in both cases the regret bounds are tighter than in the no transfer case. We also experimentally compare the performance of these algorithms relative to no transfer learning and demonstrate benefits of transfer learning.

Contextual Bandits with Latent Confounders: An NMF Approach

Mon, 10 Apr 2017 00:00:00 +0000

Motivated by online recommendation and advertising systems, we consider a causal model for stochastic contextual bandits with a latent low-dimensional confounder. In our model, there are $L$ observed contexts and $K$ arms of the bandit. The observed context influences the reward obtained through a latent confounder variable with cardinality $m$ ($m ≪L,K$). The arm choice and the latent confounder causally determines the reward while the observed context is correlated with the confounder. Under this model, the $L \times K$ mean reward matrix $\mathbfU$ (for each context in $[L]$ and each arm in $[K]$) factorizes into non-negative factors $\mathbfA$ ($L \times m$) and $\mathbfW$ ($m \times K$). This insight enables us to propose an $ε$-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms), that achieves a balance between learning this low-dimensional structure and selecting the best arm to minimize regret. Our algorithm achieves a regret of $\mathcalO\left(L\mathrmpoly(m, \log K) \log T \right)$ at time $T$, as compared to $\mathcalO(LK\log T)$ for conventional contextual bandits, assuming a constant gap between the best arm and the rest for each context. These guarantees are obtained under mild sufficiency conditions on the factors that are weaker versions of the well-known Statistical RIP condition. We further propose a class of generative models that satisfy our sufficient conditions, and derive a lower bound of $\mathcalO\left(Km\log T\right)$. These are the first regret guarantees for online matrix completion with bandit feedback, when the rank is greater than one. We further compare the performance of our algorithm with the state of the art, on synthetic and real world data-sets.

Lower Bounds on Active Learning for Graphical Model Selection

Mon, 10 Apr 2017 00:00:00 +0000

We consider the problem of estimating the underlying graph associated with a Markov random field, with the added twist that the decoding algorithm can iteratively choose which subsets of nodes to sample based on the previous samples, resulting in an active learning setting. Considering both Ising and Gaussian models, we provide algorithm-independent lower bounds for high-probability recovery within the class of degree-bounded graphs. Our main results are minimax lower bounds for the active setting that match the best known lower bounds for the passive setting, which in turn are known to be tight in several cases of interest. Our analysis is based on Fano’s inequality, along with novel mutual information bounds for the active learning setting, and the application of restricted graph ensembles. While we consider ensembles that are similar or identical to those used in the passive setting, we require different analysis techniques, with a key challenge being bounding a mutual information quantity associated with observed subsets of nodes, as opposed to full observations.

Estimating Density Ridges by Direct Estimation of Density-Derivative-Ratios

Mon, 10 Apr 2017 00:00:00 +0000

Estimation of \emphdensity ridges has been gathering a great deal of attention since it enables us to reveal lower-dimensional structures hidden in data. Recently, \emphsubspace constrained mean shift (SCMS) was proposed as a practical algorithm for density ridge estimation. A key technical ingredient in SCMS is to accurately estimate the ratios of the density derivatives to the density. SCMS takes a three-step approach for this purpose — first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator and division by an estimated density could significantly magnify the estimation error. To overcome these problems, we propose a novel method that directly estimates the ratios without going through density estimation and division. Our proposed estimator has an analytic-form solution and it can be computed efficiently. We further establish a non-parametric convergence bound for the proposed ratio estimator. Finally, based on this direct ratio estimator, we develop a practical algorithm for density ridge estimation and experimentally demonstrate its usefulness on a variety of datasets.

Identifying Groups of Strongly Correlated Variables through Smoothed Ordered Weighted $L_1$-norms

Mon, 10 Apr 2017 00:00:00 +0000

The failure of LASSO to identify groups of correlated predictors in linear regression has sparked significant research interest. Recently, various norms were proposed, which can be best described as instances of ordered weighted $\ell_1$ norms (OWL), as an alternative to $\ell_1$ regularization used in LASSO. OWL can identify groups of correlated variables but it forces the model to be constant within a group. This artifact induces unnecessary bias in the model estimation. In this paper we take a submodular perspective and show that OWL can be posed as the Lovász extension of a suitably defined submodular function. The submodular perspective not only explains the group-wise constant behavior of OWL, but also suggests alternatives. The main contribution of this paper is smoothed OWL (SOWL), a new family of norms, which not only identifies the groups but also allows the model to be flexible inside a group. We establish several algorithmic and theoretical properties of SOWL including group identification and model consistency. We also provide algorithmic tools to compute the SOWL norm and its proximal operator, whose computational complexity $O(d\log d)$ is significantly better than that of general purpose solvers in $O(d^2\log d)$. In our experiments, SOWL compares favorably with respect to OWL in the regimes of interest.

Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes

Mon, 10 Apr 2017 00:00:00 +0000

Datasets with hundreds of variables and many missing values are commonplace. In this setting, it is both statistically and computationally challenging to detect true predictive relationships between variables and also to suppress false positives. This paper proposes an approach that combines probabilistic programming, information theory, and non-parametric Bayes. It shows how to use Bayesian non-parametric modeling to (i) build an ensemble of joint probability models for all the variables; (ii) efficiently detect marginal independencies; and (iii) estimate the conditional mutual information between arbitrary subsets of variables, subject to a broad class of constraints. Users can access these capabilities using BayesDB, a probabilistic programming platform for probabilistic data analysis, by writing queries in a simple, SQL-like language. This paper demonstrates empirically that the method can (i) detect context-specific (in)dependencies on challenging synthetic problems and (ii) yield improved sensitivity and specificity over baselines from statistics and machine learning, on a real-world database of over 300 sparsely observed indicators of macroeconomic development and public health.

A Lower Bound on the Partition Function of Attractive Graphical Models in the Continuous Case

Mon, 10 Apr 2017 00:00:00 +0000

Computing the partition function of an arbitrary graphical model is generally intractable. As a result, approximate inference techniques such as loopy belief propagation and expectation propagation are used to compute an approximation to the true partition function. However, due to general issues of intractability in the continuous case, our understanding of these approximations is relatively limited. In particular, a number of theoretical results known for these approximations in the discrete case are missing in the continuous case. In this work, we use graph covers to extend several such results from the discrete case to the continuous case. Specifically, we provide a graph cover based upper bound for continuous graphical models, and we use this characterization (along with a continuous analog of a discrete correlation-type inequality) to show that the Bethe partition function also provides a lower bound on the true partition function of attractive graphical models in the continuous case.

Conditions beyond treewidth for tightness of higher-order LP relaxations

Mon, 10 Apr 2017 00:00:00 +0000

Linear programming (LP) relaxations are a popular method to attempt to find a most likely configuration of a discrete graphical model. If a solution to the relaxed problem is obtained at an integral vertex then the solution is guaranteed to be exact and we say that the relaxation is tight. We consider binary pairwise models and introduce new methods which allow us to demonstrate refined conditions for tightness of LP relaxations in the Sherali-Adams hierarchy. Our results include showing that for higher order LP relaxations, treewidth is not precisely the right way to characterize tightness. This work is primarily theoretical, with insights that can improve efficiency in practice.

A New Class of Private Chi-Square Hypothesis Tests

Mon, 10 Apr 2017 00:00:00 +0000

In this paper, we develop new test statistics for hypothesis testing over differentially private data. These statistics are designed specifically so that their asymptotic distributions, after accounting for privacy noise, match the asymptotics of the non-private chi-square tests for testing if the multinomial data parameters lie in lower dimensional manifolds (examples include goodness-of-fit and independence testing). Empirically, these new test statistics outperform prior work, which focused on noisy versions of existing statistics.

Efficient Online Multiclass Prediction on Graphs via Surrogate Losses

Mon, 10 Apr 2017 00:00:00 +0000

We develop computationally efficient algorithms for online multi-class prediction. Our construction is based on carefully-chosen data-dependent surrogate loss functions, and the new methods enjoy strong mistake bound guarantees. To illustrate the technique, we study the combinatorial problem of node classification and develop a prediction strategy that is linear-time per round. In contrast, the offline benchmark is NP-hard to compute in general. We demonstrate the empirical performance of the method on several datasets.

Local Group Invariant Representations via Orbit Embeddings

Mon, 10 Apr 2017 00:00:00 +0000

Invariance to nuisance transformations is one of the desirable properties of effective representations. We consider transformations that form a group and propose an approach based on kernel methods to derive local group invariant representations. Locality is achieved by defining a suitable probability distribution over the group which in turn induces distributions in the input feature space. We learn a decision function over these distributions by appealing to the powerful framework of kernel methods and generate local invariant random feature maps via kernel approximations. We show uniform convergence bounds for kernel approximation and provide generalization bounds for learning with these features. We evaluate our method on three real datasets, including Rotated MNIST and CIFAR-10, and observe that it outperforms competing kernel based approaches. The proposed method also outperforms deep CNN on Rotated MNIST and performs comparably to the recently proposed group-equivariant CNN.

A Maximum Matching Algorithm for Basis Selection in Spectral Learning

Mon, 10 Apr 2017 00:00:00 +0000

We present a solution to scale spectral algorithms for learning sequence functions. We are interested in the case where these functions are sparse (that is, for most sequences they return 0). Spectral algorithms reduce the learning problem to the task of computing an SVD decomposition over a special type of matrix called the Hankel matrix. This matrix is designed to capture the relevant statistics of the training sequences. What is crucial is that to capture long range dependencies we must consider very large Hankel matrices. Thus the computation of the SVD becomes a critical bottleneck. Our solution finds a subset of rows and columns of the Hankel that realizes a compact and informative Hankel submatrix. The novelty lies in the way that this subset is selected: we exploit a maximal bipartite matching combinatorial algorithm to look for a sub-block with full structural rank, and show how computation of this sub-block can be further improved by exploiting the specific structure of Hankel matrices.

Learning with Feature Feedback: from Theory to Practice

Mon, 10 Apr 2017 00:00:00 +0000

In supervised learning, a human annotator only needs to assign each data point (document, image, etc.) its correct label. But in many situations, the human can also provide richer feedback at essentially no extra cost. In this paper, we examine a particular type of feature feedback that has been used, with some success, in information retrieval and in computer vision. We formalize two models of feature feedback, give learning algorithms for them, and quantify their usefulness in the learning process. Our experiments also show the efficacy of these methods.

Belief Propagation in Conditional RBMs for Structured Prediction

Mon, 10 Apr 2017 00:00:00 +0000

Restricted Boltzmann machines (RBMs) and conditional RBMs (CRBMs) are popular models for a wide range of applications. In previous work, learning on such models has been dominated by contrastive divergence (CD) and its variants. Belief propagation (BP) algorithms are believed to be slow for structured prediction on conditional RBMs (e.g., Mnih et al. [2011]), and not as good as CD when applied in learning (e.g., Larochelle et al. [2012]). In this work, we present a matrix-based implementation of belief propagation algorithms on CRBMs, which is easily scalable to tens of thousands of visible and hidden units. We demonstrate that, in both maximum likelihood and max-margin learning, training conditional RBMs with BP as the inference routine can provide significantly better results than current state-of-the-art CD methods on structured prediction problems. We also include practical guidelines on training CRBMs with BP, and some insights on the interaction of learning and inference algorithms for CRBMs.

Random Consensus Robust PCA

Mon, 10 Apr 2017 00:00:00 +0000

This paper presents R2PCA, a random consensus method for robust principal component analysis. R2PCA takes RANSAC’s principle of using as little data as possible one step further. It iteratively selects small subsets of the data to identify pieces of the principal components, to then stitch them together. We show that if the principal components are in general position and the errors are sufficiently sparse, R2PCA will exactly recover the principal components with probability 1, in lieu of assumptions on coherence or the distribution of the sparse errors, and even under adversarial settings. R2PCA enjoys many advantages: it works well under noise, its computational complexity scales linearly in the ambient dimension, it is easily parallelizable, and due to its low sample complexity, it can be used in settings where data is so large it cannot even be stored in memory. We complement our theoretical findings with synthetic and real data experiments showing that r2pca outperforms state-of-the-art methods in a broad range of settings.

Optimistic Planning for the Stochastic Knapsack Problem

Mon, 10 Apr 2017 00:00:00 +0000

The stochastic knapsack problem is a stochastic resource allocation problem that arises frequently and yet is exceptionally hard to solve. We derive and study an optimistic planning algorithm specifically designed for the stochastic knapsack problem. Unlike other optimistic planning algorithms for MDPs, our algorithm, OpStoK, avoids the use of discounting and is adaptive to the amount of resources available. We achieve this behavior by means of a concentration inequality that simultaneously applies to capacity and reward estimates. Crucially, we are able to guarantee that the aforementioned confidence regions hold collectively over all time steps by an application of Doob’s inequality. We demonstrate that the method returns an $ε$-optimal solution to the stochastic knapsack problem with high probability. To the best of our knowledge, our algorithm is the first which provides such guarantees for the stochastic knapsack problem. Furthermore, our algorithm is an anytime algorithm and will return a good solution even if stopped prematurely. This is particularly important given the difficulty of the problem. We also provide theoretical conditions to guarantee OpStoK does not expand all policies and demonstrate favorable performance in a simple experimental setting.

On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior

Mon, 10 Apr 2017 00:00:00 +0000

The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter. We argue that the previous default choices are dubious due to their tendency to favor solutions with more unshrunk coefficients than we typically expect a priori. This can lead to bad results if this parameter is not strongly identified by data. We derive the relationship between the global parameter and the effective number of nonzeros in the coefficient vector, and show an easy and intuitive way of setting up the prior for the global parameter based on our prior beliefs about the number of nonzero coefficients in the model. The results on real world data show that one can benefit greatly – in terms of improved parameter estimates, prediction accuracy, and reduced computation time – from transforming even a crude guess for the number of nonzero coefficients into the prior for the global parameter using our framework.

Learning Nash Equilibrium for General-Sum Markov Games from Batch Data

Mon, 10 Apr 2017 00:00:00 +0000

This paper addresses the problem of learning a Nash equilibrium in $γ$-discounted multiplayer general-sum Markov Games (MGs) in a batch setting. As the number of players increases in MG, the agents may either collaborate or team apart to increase their final rewards. One solution to address this problem is to look for a Nash equilibrium. Although, several techniques were found for the subcase of two-player zero-sum MGs, those techniques fail to find a Nash equilibrium in general-sum Markov Games. In this paper, we introduce a new definition of $ε$-Nash equilibrium in MGs which grasps the strategy’s quality for multiplayer games. We prove that minimizing the norm of two Bellman-like residuals implies to learn such an $ε$-Nash equilibrium. Then, we show that minimizing an empirical estimate of the $L_p$ norm of these Bellman-like residuals allows learning for general-sum games within the batch setting. Finally, we introduce a neural network architecture that successfully learns a Nash equilibrium in generic multiplayer general-sum turn-based MGs.

Faster Coordinate Descent via Adaptive Importance Sampling

Mon, 10 Apr 2017 00:00:00 +0000

Coordinate descent methods employ random partial updates of decision variables in order to solve huge-scale convex optimization problems. In this work, we introduce new adaptive rules for the random selection of their updates. By adaptive, we mean that our selection rules are based on the dual residual or the primal-dual gap estimates and can change at each iteration. We theoretically characterize the performance of our selection rules and demonstrate improvements over the state-of-the-art, and extend our theory and algorithms to general convex objectives. Numerical evidence with hinge-loss support vector machines and Lasso confirm that the practice follows the theory.

Learning the Network Structure of Heterogeneous Data via Pairwise Exponential Markov Random Fields

Mon, 10 Apr 2017 00:00:00 +0000

Markov random fields (MRFs) are a useful tool for modeling relationships present in large and high-dimensional data. Often, this data comes from various sources and can have diverse distributions, for example a combination of numerical, binary, and categorical variables. Here, we define the pairwise exponential Markov random field (PE-MRF), an approach capable of modeling exponential family distributions in heterogeneous domains. We develop a scalable method of learning the graphical structure across the variables by solving a regularized approximated maximum likelihood problem. Specifically, we first derive a tractable upper bound on the log-partition function. We then use this upper bound to derive the group graphical lasso, a generalization of the classic graphical lasso problem to heterogeneous domains. To solve this problem, we develop a fast algorithm based on the alternating direction method of multipliers (ADMM). We also prove that our estimator is sparsistent, with guaranteed recovery of the true underlying graphical structure, and that it has a polynomially faster runtime than the current state-of-the-art method for learning such distributions. Experiments on synthetic and real-world examples demonstrate that our approach is both efficient and accurate at uncovering the structure of heterogeneous data.

DP-EM: Differentially Private Expectation Maximization

Mon, 10 Apr 2017 00:00:00 +0000

The iterative nature of the expectation maximization (EM) algorithm presents a challenge for privacy-preserving estimation, as each iteration increases the amount of noise needed. We propose a practical private EM algorithm that overcomes this challenge using two innovations: (1) a novel moment perturbation formulation for differentially private EM (DP-EM), and (2) the use of two recently developed composition methods to bound the privacy “cost” of multiple EM iterations: the moments accountant (MA) and zero-mean concentrated differential privacy (zCDP). Both MA and zCDP bound the moment generating function of the privacy loss random variable and achieve a refined tail bound, which effectively decrease the amount of additive noise. We present empirical results showing the benefits of our approach, as well as similar performance between these two composition methods in the DP-EM setting for Gaussian mixture models. Our approach can be readily extended to many iterative learning algorithms, opening up various exciting future directions.

Rapid Mixing Swendsen-Wang Sampler for Stochastic Partitioned Attractive Models

Mon, 10 Apr 2017 00:00:00 +0000

The Gibbs sampler is the most popular Markov chain used for learning and inference problems in Graphical Models (GM). These tasks are computationally intractable in general, and the Gibbs sampler often suffers from slow mixing. In this paper, we study the Swendsen-Wang dynamics which is a more sophisticated Markov chain designed to overcome bottlenecks that impede Gibbs sampler. We prove O(log n) mixing time for attractive binary pairwise GMs (i.e., ferromagnetic Ising models) on stochastic partitioned graphs having n vertices, under some mild conditions including low temperature regions where the Gibbs sampler provably mixes exponentially slow. Our experiments also confirm that the Swendsen-Wang sampler significantly outperforms the Gibbs sampler for learning parameters of attractive GMs.

Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach

Mon, 10 Apr 2017 00:00:00 +0000

We consider the non-square matrix sensing problem, under restricted isometry property (RIP) assumptions. We focus on the non-convex formulation, where any rank-r matrix $X ∈R^m x n$ is represented as $UV^T$, where $U ∈R^m x r$ and $V ∈R^n x r$. In this paper, we complement recent findings on the non-convex geometry of the analogous PSD setting [5], and show that matrix factorization does not introduce any spurious local minima, under RIP.

Stochastic Difference of Convex Algorithm and its Application to Training Deep Boltzmann Machines

Mon, 10 Apr 2017 00:00:00 +0000

Difference of convex functions (DC) programming is an important approach to nonconvex optimization problems because these structures can be encountered in several fields. Effective optimization methods, called DC algorithms, have been developed in deterministic optimization literature. In machine learning, a lot of important learning problems such as the Boltzmann machines (BMs) can be formulated as DC programming. However, there is no DC-like algorithm guaranteed by convergence rate analysis for stochastic problems that are more suitable settings for machine learning tasks. In this paper, we propose a stochastic variant of DC algorithm and give computational complexities to converge to a stationary point under several situations. Moreover, we show our method includes expectation-maximization (EM) and Monte Carlo EM (MCEM) algorithm as special cases on training BMs. In other words, we extend EM/MCEM algorithm to more effective methods from DC viewpoint with theoretical convergence guarantees. Experimental results indicate that our method performs well for training binary restricted Boltzmann machines and deep Boltzmann machines without pre-training.

Label Filters for Large Scale Multilabel Classification

Mon, 10 Apr 2017 00:00:00 +0000

When assigning labels to a test instance, most multilabel and multiclass classifiers systematically evaluate every single label to decide whether it is relevant or not. This linear scan over labels becomes prohibitive when the number of labels is very large. To alleviate this problem we propose a two step approach where computationally efficient label filters pre-select a small set of candidate labels before the base multiclass or multilabel classifier is applied. The label filters select candidate labels by projecting a test instance on a filtering line, and retaining only the labels that have training instances in the vicinity of this projection. The filter parameters are learned directly from data by solving a constraint optimization problem, and are independent of the base multilabel classifier. The proposed label filters can be used in conjunction with any multiclass or multilabel classifier that requires a linear scan over the labels, and speed up prediction by orders of magnitude without significant impact on performance.

A Sub-Quadratic Exact Medoid Algorithm

Mon, 10 Apr 2017 00:00:00 +0000

We present a new algorithm, ‘trimed’ for obtaining the medoid of a set, that is the element of the set which minimises the mean distance to all other elements. The algorithm is shown to have, under certain assumptions, expected run time $O(N^(3/2))$ in $R^d$ where N is the set size, making it the first sub-quadratic exact medoid algorithm for $d > 1$. Experiments show that it performs very well on spatial network data, frequently requiring two orders of magnitude fewer distance calculations than state-of-the-art approximate algorithms. As an application, we show how trimed can be used as a component in an accelerated K-medoids algorithm, and then how it can be relaxed to obtain further computational gains with only a minor loss in cluster quality.

Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms

Mon, 10 Apr 2017 00:00:00 +0000

Variational inference using the reparameterization trick has enabled large-scale approximate Bayesian inference in complex probabilistic models, leveraging stochastic optimization to sidestep intractable expectations. The reparameterization trick is applicable when we can simulate a random variable by applying a differentiable deterministic function on an auxiliary random variable whose distribution is fixed. For many distributions of interest (such as the gamma or Dirichlet), simulation of random variables relies on acceptance-rejection sampling. The discontinuity introduced by the accept-reject step means that standard reparameterization tricks are not applicable. We propose a new method that lets us leverage reparameterization gradients even when variables are outputs of a acceptance-rejection sampling algorithm. Our approach enables reparameterization on a larger class of variational distributions. In several studies of real and synthetic data, we show that the variance of the estimator of the gradient is significantly lower than other state-of-the-art methods. This leads to faster convergence of stochastic gradient variational inference.

Learning Optimal Interventions

Mon, 10 Apr 2017 00:00:00 +0000

Our goal is to identify beneficial interventions from observational data. We consider interventions that are narrowly focused (impacting few covariates) and may be tailored to each individual or globally enacted over a population. For applications where harmful intervention is drastically worse than proposing no change, we propose a conservative definition of the optimal intervention. Assuming the underlying relationship remains invariant under intervention, we develop efficient algorithms to identify the optimal intervention policy from limited data and provide theoretical guarantees for our approach in a Gaussian Process setting. Although our methods assume covariates can be precisely adjusted, they remain capable of improving outcomes in misspecified settings with unintentional downstream effects. Empirically, our approach identifies good interventions in two practical applications: gene perturbation and writing improvement.

Co-Occurring Directions Sketching for Approximate Matrix Multiply

Mon, 10 Apr 2017 00:00:00 +0000

We introduce co-occurring directions sketching, a deterministic algorithm for approximate matrix product (AMM), in the streaming model. We show that co-occurring directions achieves a better error bound for AMM than other randomized and deterministic approaches for AMM. Co-occurring directions gives a (1 + epsilon) - approximation of the optimal low rank approximation of a matrix product. Empirically our algorithm outperforms competing methods for AMM, for a small sketch size. We validate empirically our theoretical findings and algorithms.

Minimax-optimal semi-supervised regression on unknown manifolds

Mon, 10 Apr 2017 00:00:00 +0000

We consider semi-supervised regression when the predictor variables are drawn from an unknown manifold. A simple two step approach to this problem is to: (i) estimate the manifold geodesic distance between any pair of points using both the labeled and unlabeled instances; and (ii) apply a k nearest neighbor regressor based on these distance estimates. We prove that given sufficiently many unlabeled points, this simple method of geodesic kNN regression achieves the optimal finite-sample minimax bound on the mean squared error, as if the manifold were known. Furthermore, we show how this approach can be efficiently implemented, requiring only O(k N log N) operations to estimate the regression function at all N labeled and unlabeled points. We illustrate this approach on two datasets with a manifold structure: indoor localization using WiFi fingerprints and facial pose estimation. In both cases, geodesic kNN is more accurate and much faster than the popular Laplacian eigenvector regressor.

Signal-based Bayesian Seismic Monitoring

Mon, 10 Apr 2017 00:00:00 +0000

Detecting weak seismic events from noisy sensors is a difficult perceptual task. We formulate this task as Bayesian inference and propose a generative model of seismic events and signals across a network of spatially distributed stations. Our system, SIGVISA, is the first to directly model seismic waveforms, allowing it to incorporate a rich representation of the physics underlying the signal generation process. We use Gaussian processes over wavelet parameters to predict detailed waveform fluctuations based on historical events, while degrading smoothly to simple parametric envelopes in regions with no historical seismicity. Evaluating on data from the western US, we recover three times as many events as previous work, and reduce mean location errors by a factor of four while greatly increasing sensitivity to low-magnitude events.

Fast rates with high probability in exp-concave statistical learning

Mon, 10 Apr 2017 00:00:00 +0000

We present an algorithm for the statistical learning setting with a bounded exp-concave loss in d dimensions that obtains excess risk $O(d \log(1/δ)/n)$ with probability $1 - δ$. The core technique is to boost the confidence of recent in-expectation O(d/n) excess risk bounds for empirical risk minimization (ERM), without sacrificing the rate, by leveraging a Bernstein condition which holds due to exp-concavity. We also show that a regret bound for any online learner in this setting translates to a high probability excess risk bound for the corresponding online-to-batch conversion of the online learner. Lastly, we present high probability bounds for the exp-concave model selection aggregation problem that are quantile-adaptive in a certain sense. One bound obtains a nearly optimal rate without requiring the loss to be Lipschitz continuous, and another requires Lipschitz continuity but obtains the optimal rate.

Communication-Efficient Learning of Deep Networks from Decentralized Data

Mon, 10 Apr 2017 00:00:00 +0000

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

Minimax Density Estimation for Growing Dimension

Mon, 10 Apr 2017 00:00:00 +0000

This paper presents minimax rates for density estimation when the data dimension $d$ is allowed to grow with the number of observations $n$ rather than remaining fixed as in previous analyses. We prove a non-asymptotic lower bound which gives the worst-case rate over standard classes of smooth densities, and we show that kernel density estimators achieve this rate. We also give oracle choices for the bandwidth and derive the fastest rate $d$ can grow with $n$ to maintain estimation consistency.

Sequential Multiple Hypothesis Testing with Type I Error Control

Mon, 10 Apr 2017 00:00:00 +0000

This work studies multiple hypothesis testing in the setting when we obtain data sequentially and may choose when to stop sampling. We summarize the notion of a sequential p-value (one that can be continually updated and still maintain a type I error guarantee) and provide several examples from the literature. This tool allows us to convert step-up or step-down multiple hypothesis testing procedures in the fixed-horizon setting (which includes Benjamini-Hochberg, Holm, and Bonferroni) into sequential versions that allow the statistician to reject a hypothesis as soon as the sequential p-value reaches a threshold. We show that if the original procedure has a type I error guarantee in a certain family (including FDR and FWER), then the sequential conversion inherits an analogous guarantee. The conversion also allows for allocating samples in a data-dependent way, and we provide simulated experiments demonstrating an increased number of rejections when compared to the fixed-horizon setting.

Relativistic Monte Carlo

Mon, 10 Apr 2017 00:00:00 +0000

Hamiltonian Monte Carlo (HMC) is a popular Markov chain Monte Carlo (MCMC) algorithm that generates proposals for a Metropolis-Hastings algorithm by simulating the dynamics of a Hamiltonian system. However, HMC is sensitive to large time discretizations and performs poorly if there is a mismatch between the spatial geometry of the target distribution and the scales of the momentum distribution. In particular the mass matrix of HMC is hard to tune well. In order to alleviate these problems we propose relativistic Hamiltonian Monte Carlo, a version of HMC based on relativistic dynamics that introduces a maximum velocity on particles. We also derive stochastic gradient versions of the algorithm and show that the resulting algorithms bear interesting relationships to gradient clipping, RMSprop, Adagrad and Adam, popular optimisation methods in deep learning. Based on this, we develop relativistic stochastic gradient descent by taking the zero-temperature limit of relativistic stochastic gradient Hamiltonian Monte Carlo. In experiments we show that the relativistic algorithms perform better than classical Newtonian variants and Adam.

A Stochastic Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization

Mon, 10 Apr 2017 00:00:00 +0000

Symmetric nonnegative matrix factorization (SymNMF) plays an important role in applications of many data analytics problems such as community detection, document clustering and image segmentation. In this paper, we consider a stochastic SymNMF problem in which the observation matrix is generated in a random and sequential manner. We propose a stochastic nonconvex splitting method, which not only guarantees convergence to the set of stationary points of the problem (in the mean-square sense), but further achieves a sublinear convergence rate. Numerical results show that for clustering problems over both synthetic and real world datasets, the proposed algorithm converges quickly to the set of stationary points.

A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe

Mon, 10 Apr 2017 00:00:00 +0000

Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms. In this paper we take a unified view on both classes of methods, leading to the first explicit convergence rates of matching pursuit methods in an optimization sense, for general sets of atoms. We derive sublinear (1/t) convergence for both classes on general smooth objectives, and linear convergence on strongly convex objectives, as well as a clear correspondence of algorithm variants. Our presented algorithms and rates are affine invariant, and do not need any incoherence or sparsity assumptions.

Black-box Importance Sampling

Mon, 10 Apr 2017 00:00:00 +0000

Importance sampling is widely used in machine learning and statistics, but its power is limited by the restriction of using simple proposals for which the importance weights can be tractably calculated. We address this problem by studying black-box importance sampling methods that calculate importance weights for samples generated from any unknown proposal or black-box mechanism. Our method allows us to use better and richer proposals to solve difficult problems, and (somewhat counter-intuitively) also has the additional benefit of improving the estimation accuracy beyond typical importance sampling. Both theoretical and empirical analyses are provided.

Attributing Hacks

Mon, 10 Apr 2017 00:00:00 +0000

In this paper, we describe an algorithm for estimating the provenance of hacks on websites. That is, given properties of sites and the temporal occurrence of attacks, we are able to attribute individual attacks to joint causes and vulnerabilities, as well as estimating the evolution of these vulnerabilities over time. Specifically, we use hazard regression with a time-varying additive hazard function parameterized in a generalized linear form. The activation coefficients on each feature are continuous-time functions over time. We formulate the problem of learning these functions as a constrained variational maximum likelihood estimation problem with total variation penalty and show that the optimal solution is a 0th order spline (a piecewise constant function) with a finite number of adaptively chosen knots. This allows the inference problem to be solved efficiently and at scale by solving a finite dimensional optimization problem. Extensive experiments on real data sets show that our method significantly outperforms Cox’s proportional hazard model. We also conduct case studies and verify that the fitted functions are indeed recovering vulnerable features.

Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems

Mon, 10 Apr 2017 00:00:00 +0000

Many natural systems, such as neurons firing in the brain or basketball teams traversing a court, give rise to time series data with complex, nonlinear dynamics. We can gain insight into these systems by decomposing the data into segments that are each explained by simpler dynamic units. Building on switching linear dynamical systems (SLDS), we develop a model class and Bayesian inference algorithms that not only discover these dynamical units but also, by learning how transition probabilities depend on observations or continuous latent states, explain their switching behavior. Our key innovation is to design these recurrent SLDS models to enable recent Pólya-gamma auxiliary variable techniques and thus make approximate Bayesian learning and inference in these models easy, fast, and scalable.

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Mon, 10 Apr 2017 00:00:00 +0000

The stochastic composition optimization proposed recently by Wang et al. [2014] minimizes the objective with the composite expectation form: $\min_x (\mathbbE_iF_i ∘\mathbbE_j G_j)(x).$ It summarizes many important applications in machine learning, statistics, and finance. In this paper, we consider the finite-sum scenario for composition optimization: $\min_x f (x) := \frac1n \sum_i = 1^n F_i \left( \frac1m \sum_j = 1^m G_j (x) \right)$. In this paper, two algorithms are proposed to solve this problem by combining the stochastic compositional gradient descent (SCGD) and the stochastic variance reduced gradient (SVRG) technique. A constant linear convergence rate is proved for strongly convex optimization, which substantially improves the sublinear rate $O(K^-0.8)$ of the best known algorithm.

Binary and Multi-Bit Coding for Stable Random Projections

Mon, 10 Apr 2017 00:00:00 +0000

The recent work [17] developed a 1-bit compressed sensing (CS) algorithm based on $α$-stable random projections. Although the work in [17] showed that the method is a strong competitor of other existing 1-bit algorithms, the procedure requires knowing $K$, the sparsity. Note that $K$ is the $l_0$ norm of the signal. Other existing 1-bit CS algorithms require the $l_2$ norm of the signal. In this paper, we develop an estimation procedure for the $l_α$ norm of the signal, where $0<α\leq2$ from binary or multi-bit measurements. We demonstrate that using a simple closed-form estimator with merely 1-bit information does not result in a significant loss of accuracy if the parameter is chosen appropriately. Theoretical tail bounds are also provided. Using 2 or more bits per measurement reduces the variance and importantly, stabilizes the estimate so that the variance is not too sensitive to chosen parameters.

Efficient Rank Aggregation via Lehmer Codes

Mon, 10 Apr 2017 00:00:00 +0000

We propose a novel rank aggregation method based on converting permutations into their corresponding Lehmer codes or other subdiagonal images. Lehmer codes, also known as inversion vectors, are vector representations of permutations in which each coordinate can take values not restricted by the values of other coordinates. This transformation allows for decoupling of the coordinates and for performing aggregation via simple scalar median or mode computations. We present simulation results illustrating the performance of this completely parallelizable approach and analytically prove that both the mode and median aggregation procedure recover the correct centroid aggregate with small sample complexity when the permutations are drawn according to the well-known Mallows models. The proposed Lehmer code approach may also be used on partial rankings, with similar performance guarantees.

Minimax Gaussian Classification & Clustering

Mon, 10 Apr 2017 00:00:00 +0000

We present minimax bounds for classification and clustering error in the setting where covariates are drawn from a mixture of two isotropic Gaussian distributions. Here, we define clustering error in a discriminative fashion, demonstrating fundamental connections between classification (supervised) and clustering (unsupervised). For both classification and clustering, our lower bounds show that without enough samples, the best any classifier or clustering rule can do is close to random guessing. For classification, as part of our upper bound analysis, we show that Fisher's linear discriminant achieves a fast minimax rate $\Theta(1/n)$ with enough samples $n$. For clustering, as part of our upper bound analysis, we show that a clustering rule constructed using principal component analysis achieves the minimax rate with enough samples. We also provide lower and upper bounds for the high-dimensional sparse setting where the dimensionality of the covariates $p$ is potentially larger than the number of samples $n$, but where the difference between the Gaussian means is sparse.

Less than a Single Pass: Stochastically Controlled Stochastic Gradient

Mon, 10 Apr 2017 00:00:00 +0000

We develop and analyze a procedure for gradient-based optimization that we refer to as stochastically controlled stochastic gradient (SCSG). As a member of the SVRG family of algorithms, SCSG makes use of gradient estimates at two scales. Unlike most existing algorithms in this family, both the computation cost and the communication cost of SCSG do not necessarily scale linearly with the sample size n; indeed, these costs are independent of n when the target accuracy is small. An experimental evaluation of SCSG on the MNIST dataset shows that it can yield accurate results on this dataset on a single commodity machine with a memory footprint of only 2.6MB and only eight disk accesses.

Hierarchically-partitioned Gaussian Process Approximation

Mon, 10 Apr 2017 00:00:00 +0000

The Gaussian process (GP) is a simple yet powerful probabilistic framework for various machine learning tasks. However, exact algorithms for learning and prediction are prohibitive to be applied to large datasets due to inherent computational complexity. To overcome this main limitation, various techniques have been proposed, and in particular, local GP algorithms that scales “truly linearly” with respect to the dataset size. In this paper, we introduce a hierarchical model based on local GP for large-scale datasets, which stacks inducing points over inducing points in layers. By using different kernels in each layer, the overall model becomes multi-scale and is able to capture both long- and short-range dependencies. We demonstrate the effectiveness of our model by speed-accuracy performance on challenging real-world datasets.

ASAGA: Asynchronous Parallel SAGA

Mon, 10 Apr 2017 00:00:00 +0000

We describe ASAGA, an asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced “perturbed iterate” framework that resolves it. We thereby prove that ASAGA can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead.

Inference Compilation and Universal Probabilistic Programming

Mon, 10 Apr 2017 00:00:00 +0000

We introduce a method for using deep neural networks to amortize the cost of inference in models from the family induced by universal probabilistic programming languages, establishing a framework that combines the strengths of probabilistic programming and deep learning methods. We call what we do “compilation of inference” because our method transforms a denotational specification of an inference problem in the form of a probabilistic program written in a universal programming language into a trained neural network denoted in a neural network specification language. When at test time this neural network is fed observational data and executed, it performs approximate inference in the original model specified by the probabilistic program. Our training objective and learning procedure are designed to allow the trained neural network to be used as a proposal distribution in a sequential importance sampling inference engine. We illustrate our method on mixture models and Captcha solving and show significant speedups in the efficiency of inference.

The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

Mon, 10 Apr 2017 00:00:00 +0000

Stochastic linear bandits are a natural and simple generalisation of finite-armed bandits with numerous practical applications. Current approaches focus on generalising existing techniques for finite-armed bandits, notably the otimism principle and Thompson sampling. Prior analysis has mostly focussed on the worst-case setting. We analyse the asymptotic regret and show matching upper and lower bounds on what is achievable. Surprisingly, our results show that no algorithm based on optimism or Thompson sampling will ever achieve the optimal rate. In fact, they can be arbitrarily far from optimal, even in very simple cases. This is a disturbing result because these techniques are standard tools that are widely used for sequential optimisation, for example, generalised linear bandits and reinforcement learning.

Learning Cost-Effective and Interpretable Treatment Regimes

Mon, 10 Apr 2017 00:00:00 +0000

Decision makers, such as doctors and judges, make crucial decisions such as recommending treatments to patients, and granting bails to defendants on a daily basis. Such decisions typically involve weighting the potential benefits of taking an action against the costs involved. In this work, we aim to automate this task of learning cost-effective, interpretable and actionable treatment regimes. We formulate this as a problem of learning a decision list – a sequence of if-then-else rules – which maps characteristics of subjects (eg., diagnostic test results of patients) to treatments. We propose a novel objective to construct a decision list which maximizes outcomes for the population, and minimizes overall costs. Since we do not observe the outcomes corresponding to counterfactual scenarios, we use techniques from causal inference literature to infer them. We model the problem of learning the decision list as a Markov Decision Process (MDP) and employ a variant of the Upper Confidence Bound for Trees (UCT) strategy which leverages customized checks for pruning the search space effectively. Experimental results on real world observational data capturing judicial bail decisions and treatment recommendations for asthma patients demonstrate the effectiveness of our approach.

Online Learning and Blackwell Approachability with Partial Monitoring: Optimal Convergence Rates

Mon, 10 Apr 2017 00:00:00 +0000

Blackwell approachability is an online learning setup generalizing the classical problem of regret minimization by allowing for instance multi-criteria optimization, global (online) optimization of a convex loss, or online linear optimization under some cumulative constraint. We consider partial monitoring where the decision maker does not necessarily observe the outcomes of his decision (unlike the traditional regret/bandit literature). Instead, he receives a random signal correlated to the decision–outcome pair, or only to the outcome. We construct, for the first time, approachability algorithms with convergence rate of order $O(T^-1/2)$ when the signal is independent of the decision and of order $O(T^-1/3)$ in the case of general signals. Those rates are optimal in the sense that they cannot be improved without further assumption on the structure of the objectives and/or the signals.

Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers

Mon, 10 Apr 2017 00:00:00 +0000

For optimal decision making under variable class distributions and misclassification costs a classifier needs to produce well-calibrated estimates of the posterior probability. Isotonic calibration is a powerful non-parametric method that is however prone to overfitting on smaller datasets; hence a parametric method based on the logistic curve is commonly used. While logistic calibration is designed for normally distributed per-class scores, we demonstrate experimentally that many classifiers including Naive Bayes and Adaboost suffer from a particular distortion where these score distributions are heavily skewed. In such cases logistic calibration can easily yield probability estimates that are worse than the original scores. Moreover, the logistic curve family does not include the identity function, and hence logistic calibration can easily uncalibrate a perfectly calibrated classifier. In this paper we solve all these problems with a richer class of calibration maps based on the beta distribution. We derive the method from first principles and show that fitting it is as easy as fitting a logistic curve. Extensive experiments show that beta calibration is superior to logistic calibration for Naive Bayes and Adaboost.

Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning

Mon, 10 Apr 2017 00:00:00 +0000

Density-ratio estimation (i.e. estimating $f = f_Q/f_P$ for two unknown distributions Q and P) has proved useful in many Machine Learning tasks, e.g., risk-calibration in transfer-learning, two-sample tests, and also useful in common techniques such importance sampling and bias correction. While there are many important analyses of this estimation problem, the present paper derives convergence rates in other practical settings that are less understood, namely, extensions of traditional Lipschitz smoothness conditions, and common high-dimensional settings with structured data (e.g. manifold data, sparse data). Various interesting facts, which hold in earlier settings, are shown to extend to these settings. Namely, (1) optimal rates depend only on the smoothness of the ratio f, and not on the densities $f_Q$, $f_P$, supporting the belief that plugging in estimates for $f_Q$, $f_P$ is suboptimal; (2) optimal rates depend only on the intrinsic dimension of data, i.e. this problem – unlike density estimation – escapes the curse of dimension. We further show that near-optimal rates are attainable by estimators tuned from data alone, i.e. with no prior distributional information. This last fact is of special interest in unsupervised settings such as this one, where only oracle rates seem to be known, i.e., rates which assume critical distributional information usually unavailable in practice.

A Learning Theory of Ranking Aggregation

Mon, 10 Apr 2017 00:00:00 +0000

Originally formulated in Social Choice theory, Ranking Aggregation, also referred to as Consensus Ranking, has motivated the development of numerous statistical models since the middle of the 20th century. Recently, the analysis of ranking/preference data has been the subject of a renewed interest in machine-learning, boosted by modern applications such as meta-search engines, giving rise to the design of various scalable algorithmic approaches for approximately computing ranking medians, viewed as solutions of a discrete (generally NP-hard) minimization problem. This paper develops a statistical learning theory for ranking aggregation in a general probabilistic setting (avoiding any rigid ranking model assumptions), assessing the generalization ability of empirical ranking medians. Universal rate bounds are established and the situations where convergence occurs at an exponential rate are fully characterized. Minimax lower bounds are also proved, showing that the rate bounds we obtain are optimal.

Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets

Mon, 10 Apr 2017 00:00:00 +0000

Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. Despite its success, for large datasets, training and validating a single configuration often takes hours, days, or even weeks, which limits the achievable performance. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods or the recently proposed bandit strategy Hyperband.

Scalable Greedy Feature Selection via Weak Submodularity

Mon, 10 Apr 2017 00:00:00 +0000

Greedy algorithms are widely used for problems in machine learning such as feature selection and set function optimization. Unfortunately, for large datasets, the running time of even greedy algorithms can be quite high. This is because for each greedy step we need to refit a model or calculate a function using the previously selected choices and the new candidate. Two algorithms that are faster approximations to the greedy forward selection were introduced recently [Mirzasoleiman et al., 2013, 2015]. They achieve better performance by exploiting stochastic evaluation and distributed computation respectively. Both algorithms have provable performance guarantees for submodular functions. In this paper we show that divergent from previously held opinion, submodularity is not required to obtain approximation guarantees for these two algorithms. Specifically, we show that a generalized concept of weak submodularity suffices to give multiplicative approximation guarantees. Our result extends the applicability of these algorithms to a larger class of functions. Furthermore, we show that a bounded submodularity ratio can be used to provide data dependent bounds that can sometimes be tighter also for submodular functions. We empirically validate our work by showing superior performance of fast greedy approximations versus several established baselines on artificial and real datasets.

Information Projection and Approximate Inference for Structured Sparse Variables

Mon, 10 Apr 2017 00:00:00 +0000

Approximate inference via information projection has been recently introduced as a general-purpose technique for efficient probabilistic inference given sparse variables. This manuscript goes beyond classical sparsity by proposing efficient algorithms for approximate inference via information projection that are applicable to any structure on the set of variables that admits enumeration using matroid or knapsack constraints. Further, leveraging recent advances in submodular optimization, we provide an efficient greedy algorithm with strong optimization-theoretic guarantees. The class of probabilistic models that can be expressed in this way is quite broad and, as we show, includes group sparse regression, group sparse principal components analysis and sparse collective matrix factorization, among others. Empirical results on simulated data and high dimensional neuroimaging data highlight the superior performance of the information projection approach as compared to established baselines for a range of probabilistic models.

Conjugate-Computation Variational Inference : Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models

Mon, 10 Apr 2017 00:00:00 +0000

Variational inference is computationally challenging in models that contain both conjugate and non-conjugate terms. Methods specifically designed for conjugate models, even though computationally efficient, find it difficult to deal with non-conjugate terms. On the other hand, stochastic-gradient methods can handle the non-conjugate terms but they usually ignore the conjugate structure of the model which might result in slow convergence. In this paper, we propose a new algorithm called Conjugate-computation Variational Inference (CVI) which brings the best of the two worlds together – it uses conjugate computations for the conjugate terms and employs stochastic gradients for the rest. We derive this algorithm by using a stochastic mirror-descent method in the mean-parameter space, and then expressing each gradient step as a variational inference in a conjugate model. We demonstrate our algorithm’s applicability to a large class of models and establish its convergence. Our experimental results show that our method converges much faster than the methods that ignore the conjugate structure of the model.

Stochastic Rank-1 Bandits

Mon, 10 Apr 2017 00:00:00 +0000

We propose stochastic rank-1 bandits, a class of online learning problems where at each step a learning agent chooses a pair of row and column arms, and receives the product of their values as a reward. The main challenge of the problem is that the individual values of the row and column are unobserved. We assume that these values are stochastic and drawn independently. We propose a computationally-efficient algorithm for solving our problem, which we call Rank1Elim. We derive a O((K + L) (1 / Delta) log n) upper bound on its n-step regret, where K is the number of rows, L is the number of columns, and Delta is the minimum of the row and column gaps; under the assumption that the mean row and column rewards are bounded away from zero. To the best of our knowledge, we present the first bandit algorithm that finds the maximum entry of a rank-1 matrix whose regret is linear in K + L, 1 / Delta, and log n. We also derive a nearly matching lower bound. Finally, we evaluate Rank1Elim empirically on multiple problems. We observe that it leverages the structure of our problems and can learn near-optimal solutions even if our modeling assumptions are mildly violated.

A Framework for Optimal Matching for Causal Inference

Mon, 10 Apr 2017 00:00:00 +0000

We propose a novel framework for matching estimators for causal effect from observational data that is based on minimizing the dual norm of estimation error when expressed as an operator. We show that many popular matching estimators can be expressed as optimal in this framework, including nearest-neighbor matching, coarsened exact matching, and mean-matched sampling. This reveals their motivation and aptness as structural priors formulated by embedding the effect in a particular functional space. This also gives rise to a range of new, kernel-based matching estimators that arise when one embeds the effect in a reproducing kernel Hilbert space. Depending on the case, these estimators can be found using either quadratic optimization or integer optimization. We show that estimators based on universal kernels are universally consistent without model specification. In empirical results using both synthetic and real data, the new, kernel-based estimators outperform all standard causal estimators in estimation error.

Sequential Graph Matching with Sequential Monte Carlo

Mon, 10 Apr 2017 00:00:00 +0000

We develop a novel probabilistic model for graph matchings and develop practical inference methods for supervised and unsupervised learning of the parameters of this model. The framework we develop admits joint inference on the parameters and the matching. Furthermore, our framework generalizes naturally to $K$-partite hypergraph matchings or set packing problems. The sequential formulation of the graph matching process naturally leads to sequential Monte Carlo algorithms which can be combined with various parameter inference methods. We apply our method to image matching problems, document ranking, and our own novel quadripartite matching problem arising from the field of computational forestry.

Improved Strongly Adaptive Online Learning using Coin Betting

Mon, 10 Apr 2017 00:00:00 +0000

This paper describes a new parameter-free online learning algorithm for changing environments. In comparing against algorithms with the same time complexity as ours, we obtain a strongly adaptive regret bound that is a factor of at least $\sqrt\log(T)$ better, where $T$ is the time horizon. Empirical results show that our algorithm outperforms state-of-the-art methods in learning with expert advice and metric learning scenarios.

Modal-set estimation with an application to clustering

Mon, 10 Apr 2017 00:00:00 +0000

We present a procedure that can estimate – with statistical consistency guarantees – any local-maxima of a density, under benign distributional conditions. The procedure estimates all such local maxima, or modal-sets, of any bounded shape or dimension, including usual point-modes. In practice, modal-sets can arise as dense low-dimensional structures in noisy data, and more generally serve to better model the rich variety of locally dense structures in data. The procedure is then shown to be competitive on clustering applications, and moreover is quite stable to a wide range of settings of its tuning parameter.

Robust and Efficient Computation of Eigenvectors in a Generalized Spectral Method for Constrained Clustering

Mon, 10 Apr 2017 00:00:00 +0000

FAST-GE is a generalized spectral method for constrained clustering [Cucuringu et al., AISTATS 2016]. It incorporates the must-link and cannot-link constraints into two Laplacian matrices and then minimizes a Rayleigh quotient via solving a generalized eigenproblem, and is considered to be simple and scalable. However, there are two unsolved issues. Theoretically, since both Laplacian matrices are positive semi-definite and the corresponding pencil is singular, it is not proven whether the minimum of the Rayleigh quotient exists and is equivalent to an eigenproblem. Computationally, the locally optimal block preconditioned conjugate gradient (LOBPCG) method is not designed for solving the eigenproblem of a singular pencil. In fact, to the best of our knowledge, there is no existing eigensolver that is immediately applicable. In this paper, we provide solutions to these two critical issues. We prove a generalization of Courant-Fischer variational principle for the Laplacian singular pencil. We propose a regularization for the pencil so that LOBPCG is applicable. We demonstrate the robustness and efficiency of proposed solutions for constrained image segmentation. The proposed theoretical and computational solutions can be applied to eigenproblems of positive semi-definite pencils arising in other machine learning algorithms, such as generalized linear discriminant analysis in dimension reduction and multisurface classification via eigenvectors.

Combinatorial Topic Models using Small-Variance Asymptotics

Mon, 10 Apr 2017 00:00:00 +0000

Modern topic models typically have a probabilistic formulation, and derive their inference algorithms based on Latent Dirichlet Allocation (LDA) and its variants. In contrast, we approach topic modeling via combinatorial optimization, and take a small-variance limit of LDA to derive a new objective function. We minimize this objective by using ideas from combinatorial optimization, obtaining a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are not only significantly better than traditional SVA algorithms, but also truly competitive with popular LDA-based approaches; we also discuss the (dis)similarities between our approach and its probabilistic counterparts.

Dynamic Collaborative Filtering With Compound Poisson Factorization

Mon, 10 Apr 2017 00:00:00 +0000

Model-based collaborative filtering (CF) analyzes user–item interactions to infer latent factors that represent user preferences and item characteristics in order to predict future interactions. Most CF approaches assume that these latent factors are static; however, in most CF data, user preferences and item perceptions drift over time. Here, we propose a new conjugate and numerically stable dynamic matrix factorization (DCPF) based on hierarchical Poisson factorization that models the smoothly drifting latent factors using gamma-Markov chains. We propose a conjugate gamma chain construction that is numerically stable within our compound-Poisson framework. We then derive a scalable stochastic variational inference approach to estimate the parameters of our model. We apply our model to time-stamped ratings data sets from Netflix, Yelp, and Last.fm. We empirically demonstrate that DCPF achieves a higher predictive accuracy than state-of-the-art static and dynamic factorization algorithms.

Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot

Mon, 10 Apr 2017 00:00:00 +0000

While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates, for many problems of interest. In this paper, we take the first step in getting the best of both worlds – establishing global convergence and obtaining a good rate of convergence for the problem of computing squareroot of a positive semidefinite (PSD) matrix, which is a widely studied problem in numerical linear algebra with applications in machine learning and statistics among others. Given a PSD matrix M and a PSD starting point $U_0$, we show that gradient descent with appropriately chosen step-size finds an epsilon-accurate squareroot of M in $O(α\log(||M-U_0||_F^2 / ε))$ iterations, where $α= (\max{||U_0||^2, ||M||} / \min{\sigma_min^2(U_0), \sigma_min(M)} )^3/2$. Our result is the first to establish global convergence for this problem and that it is robust to errors in each iteration. A key contribution of our work is the general proof technique which we believe should further excite research in understanding deterministic and stochastic variants of simple non-convex gradient descent algorithms with good global convergence rates for other problems in machine learning and numerical linear algebra.

Large-Scale Data-Dependent Kernel Approximation

Mon, 10 Apr 2017 00:00:00 +0000

Learning a computationally efficient kernel from data is an important machine learning problem. The majority of kernels in the literature do not leverage the geometry of the data, and those that do are computationally infeasible for contemporary datasets. Recent advances in approximation techniques have expanded the applicability of the kernel methodology to scale linearly with the data size. Data-dependent kernels, which could leverage this computational advantage, have however not yet seen the benefit. Here we derive an approximate large-scale learning procedure for data-dependent kernels that is efficient and performs well in practice. We provide a Lemma that can be used to derive the asymptotic convergence of the approximation in the limit of infinite random features, and, under certain conditions, an estimate of the convergence speed. We empirically prove that our construction represents a valid, yet efficient approximation of the data-dependent kernel. For large-scale datasets of millions of datapoints, where the proposed method is now applicable for the first time, we notice a significant performance boost over both baselines consisting of data independent kernels and of kernel approximations, at comparable computational cost.

Nonlinear ICA of Temporally Dependent Stationary Sources

Mon, 10 Apr 2017 00:00:00 +0000

We develop a nonlinear generalization of independent component analysis (ICA) or blind source separation, based on temporal dependencies (e.g. autocorrelations). We introduce a nonlinear generative model where the independent sources are assumed to be temporally dependent, non-Gaussian, and stationary, and we observe arbitrarily nonlinear mixtures of them. We develop a method for estimating the model (i.e. separating the sources) based on logistic regression in a neural network which learns to discriminate between a short temporal window of the data vs. a temporal window of temporally permuted data. We prove that the method estimates the sources for general smooth mixing nonlinearities, assuming the sources have sufficiently strong temporal dependencies, and these dependencies are in a certain way different from dependencies found in Gaussian processes. For Gaussian (and similar) sources, the method estimates the nonlinear part of the mixing. We thus provide the first rigorous and general proof of identifiability of nonlinear ICA for temporally dependent sources, together with a practical method for its estimation.

Quantifying the accuracy of approximate diffusions and Markov chains

Mon, 10 Apr 2017 00:00:00 +0000

Markov chains and diffusion processes are indispensable tools in machine learning and statistics that are used for inference, sampling, and modeling. With the growth of large-scale datasets, the computational cost associated with simulating these stochastic processes can be considerable, and many algorithms have been proposed to approximate the underlying Markov chain or diffusion. A fundamental question is how the computational savings trade off against the statistical error incurred due to approximations. This paper develops general results that address this question. We bound the Wasserstein distance between the equilibrium distributions of two diffusions as a function of their mixing rates and the deviation in their drifts. We show that this error bound is tight in simple Gaussian settings. Our general result on continuous diffusions can be discretized to provide insights into the computational–statistical trade-off of Markov chains. As an illustration, we apply our framework to derive finite-sample error bounds of approximate unadjusted Langevin dynamics. We characterize computation-constrained settings where, by using fast-to-compute approximate gradients in the Langevin dynamics, we obtain more accurate samples compared to using the exact gradients. Finally, as an additional application of our approach, we quantify the accuracy of approximate zig-zag sampling. Our theoretical analyses are supported by simulation experiments.

Greedy Direction Method of Multiplier for MAP Inference of Large Output Domain

Mon, 10 Apr 2017 00:00:00 +0000

Maximum-a-Posteriori (MAP) inference lies at the heart of Graphical Models and Structured Prediction. Despite the intractability of exact MAP inference, approximated methods based on LP relaxations have exhibited superior performance across a wide range of applications. Yet for problems involving large output domains (i.e., the state space for each variable is large), standard LP relaxations can easily give rise to a large number of variables and constraints which are beyond the limit of existing optimization algorithms. In this paper, we introduce an effective MAP inference method for problems with large output domains. The method builds upon alternating minimization of an Augmented Lagrangian that exploits the sparsity of messages through greedy optimization techniques. A key feature of our greedy approach is to introduce variables in an on-demand manner with a pre-built data structure over local factors. This results in a single-loop algorithm of sublinear cost per iteration and O(log(1/epsilon))-type iteration complexity to achieve epsilon sub-optimality. In addition, we introduce a variant of GDMM for binary MAP inference problems with a large number of factors. Empirically, the proposed algorithms demonstrate orders of magnitude speedup over state-of-the-art MAP inference techniques on MAP inference problems including Segmentation, Alignment, Protein Folding, Graph Matching, and Pairwise-Interacted Multilabel Prediction.

Gradient Boosting on Stochastic Data Streams

Mon, 10 Apr 2017 00:00:00 +0000

Boosting is a popular ensemble algorithm that generates more powerful learners by linearly combining base models from a simpler hypothesis class. In this work, we investigate the problem of adapting batch gradient boosting for minimizing convex loss functions to online setting where the loss at each iteration is i.i.d sampled from an unknown distribution. To generalize from batch to online, we first introduce the definition of online weak learning edge with which for strongly convex and smooth loss functions, we present an algorithm, Streaming Gradient Boosting (SGB) with exponential shrinkage guarantees in the number of weak learners. We further present an adaptation of SGB to optimize non-smooth loss functions, for which we derive a $O(\ln(N)/N)$ convergence rate. We also show that our analysis can extend to adversarial online learning setting under a stronger assumption that the online weak learning edge will hold in adversarial setting. We finally demonstrate experimental results showing that in practice our algorithms can achieve competitive results as classic gradient boosting while using less computation.

ConvNets with Smooth Adaptive Activation Functions for Regression

Mon, 10 Apr 2017 00:00:00 +0000

Within Neural Networks (NN), the parameters of Adaptive Activation Functions (AAF) control the shapes of activation functions. These parameters are trained along with other parameters in the NN. AAFs have improved performance of Convolutional Neural Networks (CNN) in multiple classification tasks. In this paper, we propose and apply AAFs on CNNs for regression tasks. We argue that applying AAFs in the regression (second-to-last) layer of a NN can significantly decrease the bias of the regression NN. However, using existing AAFs may lead to overfitting. To address this problem, we propose a Smooth Adaptive Activation Function (SAAF) with a piecewise polynomial form which can approximate any continuous function to arbitrary degree of error, while having a bounded Lipschitz constant for given bounded model parameters. As a result, NNs with SAAF can avoid overfitting by simply regularizing model parameters. We empirically evaluated CNNs with SAAFs and achieved state-of-the-art results on age and pose estimation datasets.

Regression Uncertainty on the Grassmannian

Mon, 10 Apr 2017 00:00:00 +0000

Trends in longitudinal or cross-sectional studies over time are often captured through regression models. In their simplest manifestation, these regression models are formulated in $R^n$. However, in the context of imaging studies, the objects of interest which are to be regressed are frequently best modeled as elements of a Riemannian manifold. Regression on such spaces can be accomplished through geodesic regression. This paper develops an approach to compute confidence intervals for geodesic regression models. The approach is general, but illustrated and specifically developed for the Grassmann manifold, which allows us, e.g., to regress shapes or linear dynamical systems. Extensions to other manifolds can be obtained in a similar manner. We demonstrate our approach for regression with 2D/3D shapes using synthetic and real data.

High-dimensional Time Series Clustering via Cross-Predictability

Mon, 10 Apr 2017 00:00:00 +0000

The key to time series clustering is how to characterize the similarity between any two time series. In this paper, we explore a new similarity metric called “cross-predictability”: the degree to which a future value in each time series is predicted by past values of the others. However, it is challenging to estimate such cross-predictability among time series in the high-dimensional regime, where the number of time series is much larger than the length of each time series. We address this challenge with a sparsity assumption: only time series in the same cluster have significant cross-predictability with each other. We demonstrate that this approach is computationally attractive, and provide a theoretical proof that the proposed algorithm will identify the correct clustering structure with high probability under certain conditions. To the best of our knowledge, this is the first practical high-dimensional time series clustering algorithm with a provable guarantee. We evaluate with experiments on both synthetic data and real-world data, and results indicate that our method can achieve more than 80% clustering accuracy on real-world data, which is 20% higher than the state-of-art baselines.

Consistent and Efficient Nonparametric Different-Feature Selection

Mon, 10 Apr 2017 00:00:00 +0000

Two-sample feature selection is a ubiquitous problem in both scientific and engineering studies. We propose a feature selection method to find features that describe a difference in two probability distributions. The proposed method is nonparametric and does not assume any specific parametric models on data distributions. We show that the proposed method is computationally efficient and does not require any extra computation for model selection. Moreover, we prove that the proposed method provides a consistent estimator of features under mild conditions. Our experimental results show that the proposed method outperforms the current method with regard to both accuracy and computation time.

Unsupervised Sequential Sensor Acquisition

Mon, 10 Apr 2017 00:00:00 +0000

In many security and healthcare systems a sequence of sensors/tests are used for detection and diagnosis. Each test outputs a prediction of the latent state, and carries with it inherent costs. Our objective is to learn strategies for selecting tests to optimize accuracy and costs. Unfortunately it is often impossible to acquire in-situ ground truth annotations and we are left with the problem of unsupervised sensor selection (USS). We pose USS as a version of stochastic partial monitoring problem with an unusual reward structure (even noisy annotations are unavailable). Unsurprisingly no learner can achieve sublinear regret without further assumptions. To this end we propose the notion of weak-dominance. This is a condition on the joint probability distribution of test outputs and latent state and says that whenever a test is accurate on an example, a later test in the sequence is likely to be accurate as well.

Comparison-Based Nearest Neighbor Search

Mon, 10 Apr 2017 00:00:00 +0000

We consider machine learning in a comparison-based setting where we are given a set of points in a metric space, but we have no access to the actual distances between the points. Instead, we can only ask an oracle whether the distance between two points i and j is smaller than the distance between the points i and k. We are concerned with data structures and algorithms to find nearest neighbors based on such comparisons. We focus on a simple yet effective algorithm that recursively splits the space by first selecting two random pivot points and then assigning all other points to the closer of the two (comparison tree). We prove that if the metric space satisfies certain expansion conditions, then with high probability the height of the comparison tree is logarithmic in the number of points, leading to efficient search performance. We also provide an upper bound for the failure probability to return the true nearest neighbor. Experiments show that the comparison tree is competitive with algorithms that have access to the actual distance values, and needs less triplet comparisons than other competitors.

Asymptotically exact inference in differentiable generative models

Mon, 10 Apr 2017 00:00:00 +0000

Many generative models can be expressed as a differentiable function of random inputs drawn from some simple probability density. This framework includes both deep generative architectures such as Variational Autoencoders and a large class of procedurally defined simulator models. We present a method for performing efficient MCMC inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where Approximate Bayesian Computation might otherwise be employed. We use the intuition that inference corresponds to integrating a density across the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to coherently move between inputs exactly consistent with observations. We validate the method by performing inference tasks in a diverse set of models.

Linear Convergence of Stochastic Frank Wolfe Variants

Mon, 10 Apr 2017 00:00:00 +0000

In this paper, we show that the Away-step Stochastic Frank-Wolfe (ASFW) and Pairwise Stochastic Frank-Wolfe (PSFW) algorithms converge linearly in expectation. We also show that if an algorithm convergences linearly in expectation then it converges linearly almost surely. In order to prove these results, we develop a novel proof technique based on concepts of empirical processes and concentration inequalities. As far as we know, this technique has not been used previously to derive the convergence rates of stochastic optimization algorithms. In large- scale numerical experiments, ASFW and PSFW perform as well as or better than their stochastic competitors in actual CPU time.

Frank-Wolfe Algorithms for Saddle Point Problems

Mon, 10 Apr 2017 00:00:00 +0000

We extend the Frank-Wolfe (FW) optimization algorithm to solve constrained smooth convex-concave saddle point (SP) problems. Remarkably, the method only requires access to linear minimization oracles. Leveraging recent advances in FW optimization, we provide the first proof of convergence of a FW-type saddle point solver over polytopes, thereby partially answering a 30 year-old conjecture. We also survey other convergence results and highlight gaps in the theoretical underpinnings of FW-style algorithms. Motivating applications without known efficient alternatives are explored through structured prediction with combinatorial penalties as well as games over matching polytopes involving an exponential number of constraints.

Learning Graphical Games from Behavioral Data: Sufficient and Necessary Conditions

Mon, 10 Apr 2017 00:00:00 +0000

In this paper we obtain sufficient and necessary conditions on the number of samples required for exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions. We consider sparse linear influence games — a parametric class of graphical games with linear payoffs, and represented by directed graphs of n nodes (players) and in-degree of at most k. We show that one can efficiently recover the PSNE set of a linear influence game with $O(k^2 \log n)$ samples, under very general observation models. On the other hand, we show that $Ω(k \log n)$ samples are necessary for any procedure to recover the PSNE set from observations of joint actions.

Information-theoretic limits of Bayesian network structure learning

Mon, 10 Apr 2017 00:00:00 +0000

In this paper, we study the information-theoretic limits of learning the structure of Bayesian networks (BNs), on discrete as well as continuous random variables, from a finite number of samples. We show that the minimum number of samples required by any procedure to recover the correct structure grows as $Ω(m)$ and $Ω(k \log m + (k^2)/m)$ for non-sparse and sparse BNs respectively, where m is the number of variables and k is the maximum number of parents per node. We provide a simple recipe, based on an extension of the Fano’s inequality, to obtain information-theoretic limits of structure recovery for any exponential family BN. We instantiate our result for specific conditional distributions in the exponential family to characterize the fundamental limits of learning various commonly used BNs, such as conditional probability table based networks, Gaussian BNs, noisy-OR networks, and logistic regression networks. En route to obtaining our main results, we obtain tight bounds on the number of sparse and non-sparse essential-DAGs. Finally, as a byproduct, we recover the information-theoretic limits of sparse variable selection for logistic regression.

Discovering and Exploiting Additive Structure for Bayesian Optimization

Mon, 10 Apr 2017 00:00:00 +0000

Bayesian optimization has proven invaluable for black-box optimization of expensive functions. Its main limitation is its exponential complexity with respect to the dimensionality of the search space using typical kernels. Luckily, many objective functions can be decomposed into additive subproblems, which can be optimized independently. We investigate how to automatically discover such (typically unknown) additive structure while simultaneously exploiting it through Bayesian optimization. We propose an efficient algorithm based on Metropolis-Hastings sampling and demonstrate its efficacy empirically on synthetic and real-world data sets. Throughout all our experiments we reliably discover hidden additive structure whenever it exists and exploit it to yield significantly faster convergence.

On the Interpretability of Conditional Probability Estimates in the Agnostic Setting

Mon, 10 Apr 2017 00:00:00 +0000

We study the interpretability of conditional probability estimates for binary classification under the agnostic setting or scenario. Under the agnostic setting, conditional probability estimates do not necessarily reflect the true conditional probabilities. Instead, they have a certain calibration property: among all data points that the classifier has predicted $P(Y = 1|X) = p$, $p$ portion of them actually have label $Y = 1$. For cost-sensitive decision problems, this calibration property provides adequate support for us to use Bayes Decision Theory. In this paper, we define a novel measure for the calibration property together with its empirical counterpart, and prove an uniform convergence result between them. This new measure enables us to formally justify the calibration property of conditional probability estimations, and provides new insights on the problem of estimating and calibrating conditional probabilities.

Gray-box Inference for Structured Gaussian Process Models

Mon, 10 Apr 2017 00:00:00 +0000

We develop an automated variational inference method for Bayesian structured prediction problems with Gaussian process (GP) priors and linear-chain likelihoods. Our approach does not need to know the details of the structured likelihood model and can scale up to a large number of observations. Furthermore, we show that the required expected likelihood term and its gradients in the variational objective (ELBO) can be estimated efficiently by using expectations over very low-dimensional Gaussian distributions. Optimization of the ELBO is fully parallelizable over sequences and amenable to stochastic optimization, which we use along with control variate techniques to make our framework useful in practice. Results on a set of natural language processing tasks show that our method can be as good as (and sometimes better than, in particular with respect to expected log-likelihood) hard-coded approaches including SVM-struct and CRF, and overcomes the scalability limitations of previous inference algorithms based on sampling. Overall, this is a fundamental step to developing automated inference methods for Bayesian structured prediction.

Sparse Accelerated Exponential Weights

Mon, 10 Apr 2017 00:00:00 +0000

We consider the stochastic optimization problem where a convex function is minimized observing recursively the gradients. We introduce SAEW, a new procedure that accelerates exponential weights procedures with the slow rate $1/\sqrtT$ to procedures achieving the fast rate $1/T$. Under the strong convexity of the risk, we achieve the optimal rate of convergence for approximating sparse parameters in $R^d$. The acceleration is achieved by using successive averaging steps in an online fashion. The procedure also produces sparse estimators thanks to additional hard threshold steps.

CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC

Mon, 10 Apr 2017 00:00:00 +0000

In recent years, stochastic gradient Markov Chain Monte Carlo (SG-MCMC) methods have been raised to process large-scale dataset by iterative learning from small minibatches. However, the high variance caused by naive subsampling usually slows down the convergence to the desired posterior distribution. In this paper, we propose an effective subsampling strategy to reduce the variance based on a failed attempt to do importance sampling. In particular, before sampling, we partition the dataset with k-means clustering algorithm in a preprocessing step and use the fixed clustering throughout the entire MCMC simulation. Then during simulation, we approximate the gradient of log-posterior via summing the estimated gradient of each cluster. The resulting procedure is surprisingly simple without enhancing the complexity of the original algorithm during sampling procedure. We apply our Clustering-based Preprocessing strategy on stochastic gradient Langevin dynamics, stochastic gradient Hamilton Monte Carlo and stochastic gradient Riemann Langevin dynamics. Empirically, we provide thorough numerical results to back up the effectiveness and efficiency of our approach.

Exploration-Exploitation in MDPs with Options

Mon, 10 Apr 2017 00:00:00 +0000

While a large body of empirical results show that temporally-extended actions and options may significantly affect the learning performance of an agent, the theoretical understanding of how and when options can be beneficial in online reinforcement learning is relatively limited. In this paper, we derive an upper and lower bound on the regret of a variant of UCRL using options. While we first analyze the algorithm in the general case of semi-Markov decision processes (SMDPs), we show how these results can be translated to the specific case of MDPs with options and we illustrate simple scenarios in which the regret of learning with options can be provably much smaller than the regret suffered when learning with primitive actions.

Poisson intensity estimation with reproducing kernels

Mon, 10 Apr 2017 00:00:00 +0000

Despite the fundamental nature of the Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. Cox process), few tractable nonparametric modeling approaches exist, especially in high dimensional settings. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. The modeling challenge is that the usual representer theorem arguments no longer apply due to the form of the inhomogeneous Poisson process likelihood. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.

Removing Phase Transitions from Gibbs Measures

Mon, 10 Apr 2017 00:00:00 +0000

Gibbs measures are a fundamental class of distributions for the analysis of high dimensional data. Phase transitions, which are also known as degeneracy in the network science literature, are an emergent property of these models that well describe many physical systems. However, the reach of the Gibbs measure is now far outside the realm of physical systems, and in many of these domains multiphase behavior is a nuisance. This nuisance often makes distribution fitting impossible due to failure of the MCMC sampler, and even when an MLE fit is possible, if the solution is near a phase transition point, the plausibility of the fit can be highly questionable. We introduce a modification to the Gibbs distribution that reduces the effects of phase transitions, and with properly chosen hyper-parameters, provably removes all multiphase behavior. We show that this new distribution is just as easy to fit via MCMCMLE as the Gibbs measure, and provide examples in the Ising model from statistical physics and ERGMs from network science.

Optimal Recovery of Tensor Slices

Mon, 10 Apr 2017 00:00:00 +0000

We consider the problem of large scale matrix recovery given side information in the form of additional matrices of conforming dimension. This is a parsimonious model that captures a number of interesting problems including context and location aware recommendations, personalized ‘tag’ learning, demand learning with side information, etc. Viewing the matrix we seek to recover and the side information we have as slices of a tensor, we consider the problem of Slice Recovery, which is to recover specific slices of a tensor from noisy observations of the tensor. We provide an efficient algorithm to recover slices of structurally ’simple’ tensors given noisy observations of the tensor’s entries; our definition of simplicity subsumes low-rank tensors for a variety of definitions of tensor rank. Our algorithm is practical for large datasets and provides a significant performance improvement over state of the art incumbent approaches to tensor recovery. We establish theoretical recovery guarantees that under reasonable assumptions are minimax optimal for slice recovery. These guarantees also imply the first minimax optimal guarantees for recovering tensors of low Tucker rank and general noise. Experiments on data from a music streaming service demonstrate the performance and scalability of our algorithm.

Value-Aware Loss Function for Model-based Reinforcement Learning

Mon, 10 Apr 2017 00:00:00 +0000

We consider the problem of estimating the transition probability kernel to be used by a model-based reinforcement learning (RL) algorithm. We argue that estimating a generative model that minimizes a probabilistic loss, such as the log-loss, is an overkill because it does not take into account the underlying structure of decision problem and the RL algorithm that intends to solve it. We introduce a loss function that takes the structure of the value function into account. We provide a finite-sample upper bound for the loss function showing the dependence of the error on model approximation error, number of samples, and the complexity of the model space. We also empirically compare the method with the maximum likelihood estimator on a simple problem.

On the Troll-Trust Model for Edge Sign Prediction in Social Networks

Mon, 10 Apr 2017 00:00:00 +0000

In the problem of edge sign prediction, we are given a directed graph (representing a social network), and our task is to predict the binary labels of the edges (i.e., the positive or negative nature of the social relationships). Many successful heuristics for this problem are based on the troll-trust features, estimating at each node the fraction of outgoing and incoming positive/negative edges. We show that these heuristics can be understood, and rigorously analyzed, as approximators to the Bayes optimal classifier for a simple probabilistic model of the edge labels. We then show that the maximum likelihood estimator for this model approximately corresponds to the predictions of a Label Propagation algorithm run on a transformed version of the original social graph. Extensive experiments on a number of real-world datasets show that this algorithm is competitive against state-of-the-art classifiers in terms of both accuracy and scalability. Finally, we show that troll-trust features can also be used to derive online learning algorithms which have theoretical guarantees even when edges are adversarially labeled.

Annular Augmentation Sampling

Mon, 10 Apr 2017 00:00:00 +0000

The exponentially large sample space of general binary probabilistic models renders intractable standard operations such as exact marginalization, inference, and normalization. Typically, researchers deal with these distributions via deterministic approximations, the class of belief propagation methods being a prominent example. Comparatively, Markov Chain Monte Carlo methods have been significantly less used in this domain. In this work, we introduce an auxiliary variable MCMC scheme that samples from an annular augmented space, translating to a great circle path around the hypercube of the binary sample space. This annular augmentation sampler explores the sample space more effectively than coordinate-wise samplers and has no tunable parameters, leading to substantial performance gains in estimating quantities of interest in large binary models. We extend the method to incorporate into the sampler any existing mean-field approximation (such as from belief propagation), leading to further performance improvements. Empirically, we consider a range of large Ising models and an application to risk factors for heart disease.

Encrypted Accelerated Least Squares Regression

Mon, 10 Apr 2017 00:00:00 +0000

Information that is stored in an encrypted format is, by definition, usually not amenable to statistical analysis or machine learning methods. In this paper we present detailed analysis of coordinate and accelerated gradient descent algorithms which are capable of fitting least squares and penalised ridge regression models, using data encrypted under a fully homomorphic encryption scheme. Gradient descent is shown to dominate in terms of encrypted computational speed, and theoretical results are proven to give parameter bounds which ensure correctness of decryption. The characteristics of encrypted computation are empirically shown to favour a non-standard acceleration technique. This demonstrates the possibility of approximating conventional statistical regression methods using encrypted data without compromising privacy.

Trading off Rewards and Errors in Multi-Armed Bandits

Mon, 10 Apr 2017 00:00:00 +0000

In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.

Scalable Learning of Non-Decomposable Objectives

Mon, 10 Apr 2017 00:00:00 +0000

Modern retrieval systems are often driven by an underlying machine learning model. The goal of such systems is to identify and possibly rank the few most relevant items for a given query or context. Thus, such systems are typically evaluated using a ranking-based performance metric such as the area under the precision-recall curve, the F score, precision at fixed recall, etc. Obviously, it is desirable to train such systems to optimize the metric of interest. In practice, due to the scalability limitations of existing approaches for optimizing such objectives, large-scale retrieval systems are instead trained to maximize classification accuracy, in the hope that performance as measured via the true objective will also be favorable. In this work we present a unified framework that, using straightforward building block bounds, allows for highly scalable optimization of a wide range of ranking-based objectives. We demonstrate the advantage of our approach on several real-life retrieval problems that are significantly larger than those considered in the literature, while achieving substantial improvement in performance over the accuracy-objective baseline.

Data Driven Resource Allocation for Distributed Learning

Mon, 10 Apr 2017 00:00:00 +0000

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be “locally simple but globally complex” (Vapnik and Bottou, 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.

Automated Inference with Adaptive Batches

Mon, 10 Apr 2017 00:00:00 +0000

Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients makes it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative “big batch” SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting methods have similar convergence rates to classical SGD, and do not require convexity of the objective. The high fidelity gradients enable automated learning rate selection and do not require stepsize decay. Big batch methods are thus easily automated and can run with little or no oversight.

Learning from Conditional Distributions via Dual Embeddings

Mon, 10 Apr 2017 00:00:00 +0000

Many machine learning tasks, such as learning with invariance and policy evaluation in reinforcement learning, can be characterized as problems of learning from conditional distributions. In such problems, each sample x itself is associated with a conditional distribution $p(z|x)$ represented by samples $z_i_i=1^M$, and the goal is to learn a function f that links these conditional distributions to target values y. These problems become very challenging when we only have limited samples or in the extreme case only one sample from each conditional distribution. Commonly used approaches either assume that z is independent of x, or require an overwhelmingly large set of samples from each conditional distribution. To address these challenges, we propose a novel approach which employs a new min-max reformulation of the learning from conditional distribution problem. With such new reformulation, we only need to deal with the joint distribution p(z,x). We also design an efficient learning algorithm, Embedding-SGD, and establish theoretical sample complexity for such problems. Finally, our numerical experiments, on both synthetic and real-world datasets, show that the proposed approach can significantly improve over existing algorithms.

Distance Covariance Analysis

Mon, 10 Apr 2017 00:00:00 +0000

We propose a dimensionality reduction method to identify linear projections that capture interactions between two or more sets of variables. The method, distance covariance analysis (DCA), can detect both linear and nonlinear relationships, and can take dependent variables into account. On previous testbeds and a new testbed that systematically assesses the ability to detect both linear and nonlinear interactions, DCA performs better than or comparable to existing methods, while being one of the fastest methods. To showcase the versatility of DCA, we also applied it to three different neurophysiological datasets.

Online Optimization of Smoothed Piecewise Constant Functions

Mon, 10 Apr 2017 00:00:00 +0000

We study online optimization of smoothed piecewise constant functions over the domain [0, 1). This is motivated by the problem of adaptively picking parameters of learning algorithms as in the recently introduced framework by Gupta and Roughgarden (2016). Majority of the machine learning literature has focused on Lipschitz-continuous functions or functions with bounded gradients. This is with good reason-any learning algorithm suffers linear regret even against piecewise constant functions that are chosen adversarially, arguably the simplest of non-Lipschitz continuous functions. The smoothed setting we consider is inspired by the seminal work of Spielman and Teng (2004) and the recent work of Gupta and Roughgarden (2016)-in this setting, the sequence of functions may be chosen by an adversary, however, with some uncertainty in the location of discontinuities. We give algorithms that achieve sublinear regret in the full information and bandit settings.

Rank Aggregation and Prediction with Item Features

Mon, 10 Apr 2017 00:00:00 +0000

We study the problem of rank aggregation with features, where both pairwise comparisons and item features are available to help the rank aggregation task. Observing that traditional rank aggregation methods disregard features, while models adapted from learning-to-rank task are sensitive to feature noise, we propose a general model to learn a total ranking by balancing between comparisons and feature information jointly. As a result, our proposed model takes advantage of item features and is also robust to noise. More importantly, we study the effectiveness of item features in our model and show that given sufficiently informative features, the sample complexity of our model can be asymptotically lower than models based only on comparisons for deriving an $ε$-accurate ranking. The results theoretically justify that our model can achieve efficient learning by leveraging item feature information. In addition, we show that the proposed model can also be extended to two other related problems—online rank aggregation and rank prediction of new items. Finally, experiments show that our model is more effective and robust compared to existing methods on both synthetic and real datasets.

Near-optimal Bayesian Active Learning with Correlated and Noisy Tests

Mon, 10 Apr 2017 00:00:00 +0000

We consider the Bayesian active learning and experimental design problem, where the goal is to learn the value of some unknown target variable through a sequence of informative, noisy tests. In contrast to prior work, we focus on the challenging, yet practically relevant setting where test outcomes can be conditionally dependent given the hidden target variable. Under such assumptions, common heuristics, such as greedily performing tests that maximize the reduction in uncertainty of the target, often perform poorly. We propose ECED, a novel, efficient active learning algorithm, and prove strong theoretical guarantees that hold with correlated, noisy tests. Rather than directly optimizing the prediction error, at each step, ECED picks the test that maximizes the gain in a surrogate objective, which takes into account the dependencies between tests. Our analysis relies on an information-theoretic auxiliary function to track the progress of ECED, and utilizes adaptive submodularity to attain the approximation bound. We demonstrate strong empirical performance of ECED on two problem instances, including a Bayesian experimental design task intended to distinguish among economic theories of how people make risky decisions, and an active preference learning task via pairwise comparisons.

Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection

Mon, 10 Apr 2017 00:00:00 +0000

In the Best-k-Arm problem, we are given n stochastic bandit arms, each associated with an unknown reward distribution. We are required to identify the k arms with the largest means by taking as few samples as possible. In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-k-Arm problem. On the lower bound side, we obtain a novel complexity term to measure the sample complexity that every Best-k-Arm instance requires. This is derived by an interesting and nontrivial reduction from the Best-1-Arm problem. We also provide an elimination-based algorithm that matches the instance-wise lower bound within doubly-logarithmic factors. The sample complexity of our algorithm strictly dominates the state-of-the-art for Best-k-Arm (module constant factors).

Clustering from Multiple Uncertain Experts

Mon, 10 Apr 2017 00:00:00 +0000

Utilizing expert input often improves clustering performance. However in a knowledge discovery problem, ground truth is unknown even to an expert. Thus, instead of one expert, we solicit the opinion from multiple experts. The key question motivating this work is: which experts should be assigned higher weights when there is disagreement on whether to put a pair of samples in the same group? To model the uncertainty in constraints from different experts, we build a probabilistic model for pairwise constraints through jointly modeling each expert’s accuracy and the mapping from features to latent cluster assignments. After learning our probabilistic discriminative clustering model and accuracies of different experts, 1) samples that were not annotated by any expert can be clustered using the discriminative clustering model; and 2) experts with higher accuracies are automatically assigned higher weights in determining the latent cluster assignments. Experimental results on UCI benchmark datasets and a real-world disease subtyping dataset demonstrate that our proposed approach outperforms competing alternatives, including semi-crowdsourced clustering, semi-supervised clustering with constraints from majority voting, and consensus clustering.

Distributed Adaptive Sampling for Kernel Matrix Approximation

Mon, 10 Apr 2017 00:00:00 +0000

Most kernel-based methods, such as kernel regression, kernel PCA, ICA, or $k$-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix $K_n$ requires at least $O(n^2)$ time and space for $n$ samples. Recent works (Alaoui 2014, Musco 2016) show that sampling points with replacement according to their ridge leverage scores (RLS) generates small dictionaries of relevant points with strong spectral approximation guarantees for $K_n$. The drawback of RLS-based methods is that computing exact RLS requires constructing and storing the whole kernel matrix. In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that \emphsequentially processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_eff(γ)$ of the dataset. Moreover since all the RLS estimations are efficiently performed using only the small dictionary, SQUEAK never constructs the whole matrix $K_n$, runs in linear time $\widetildeO(nd_eff(γ)^3)$ w.r.t. $n$, and requires only a single pass over the dataset. We also propose a parallel and distributed version of SQUEAK achieving similar accuracy in as little as $\widetildeO(\log(n)d_eff(γ)^3)$ time.

Robust Causal Estimation in the Large-Sample Limit without Strict Faithfulness

Mon, 10 Apr 2017 00:00:00 +0000

Causal effect estimation from observational data is an important and much studied research topic. The instrumental variable (IV) and local causal discovery (LCD) patterns are canonical examples of settings where a closed-form expression exists for the causal effect of one variable on another, given the presence of a third variable. Both rely on faithfulness to infer that the latter only influences the target effect via the cause variable. In reality, it is likely that this assumption only holds approximately and that there will be at least some form of weak interaction. This brings about the paradoxical situation that, in the large-sample limit, no predictions are made, as detecting the weak edge invalidates the setting. We introduce an alternative approach by replacing strict faithfulness with a prior that reflects the existence of many ’weak’ (irrelevant) and ’strong’ interactions. We obtain a posterior distribution over the target causal effect estimator which shows that, in many cases, we can still make good estimates. We demonstrate the approach in an application on a simple linear-Gaussian setting, using the MultiNest sampling algorithm, and compare it with established techniques to show our method is robust even when strict faithfulness is violated.

Bayesian Hybrid Matrix Factorisation for Data Integration

Mon, 10 Apr 2017 00:00:00 +0000

We introduce a novel Bayesian hybrid matrix factorisation model (HMF) for data integration, based on combining multiple matrix factorisation methods, that can be used for in- and out-of-matrix prediction of missing values. The model is very general and can be used to integrate many datasets across different entity types, including repeated experiments, similarity matrices, and very sparse datasets. We apply our method on two biological applications, and extensively compare it to state-of-the-art machine learning and matrix factorisation models. For in-matrix predictions on drug sensitivity datasets we obtain consistently better performances than existing methods. This is especially the case when we increase the sparsity of the datasets. Furthermore, we perform out-of-matrix predictions on methylation and gene expression datasets, and obtain the best results on two of the three datasets, especially when the predictivity of datasets is high.

Complementary Sum Sampling for Likelihood Approximation in Large Scale Classification

Mon, 10 Apr 2017 00:00:00 +0000

We consider training probabilistic classifiers in the case that the number of classes is too large to perform exact normalisation over all classes. We show that the source of high variance in standard sampling approximations is due to simply not including the correct class of the datapoint into the approximation. To account for this we explicitly sum over a subset of classes and sample the remaining. We show that this simple approach is competitive with recently introduced non likelihood-based approximations.

Structured adaptive and random spinners for fast machine learning computations

Mon, 10 Apr 2017 00:00:00 +0000

We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms,new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the HD 3 HD 2 HD 1 structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications.

Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains

Mon, 10 Apr 2017 00:00:00 +0000

Submodular continuous functions are a category of (generally) non-convex/non-concave functions with a wide spectrum of applications. We characterize these functions and demonstrate that they can be maximized efficiently with approximation guarantees. Specifically, i) We introduce the weak DR property that gives a unified characterization of submodularity for all set, integer-lattice and continuous functions; ii) for maximizing monotone DR-submodular continuous functions under general down-closed convex constraints, we propose a Frank-Wolfe variant with (1-1/e) approximation guarantee, and sub-linear convergence rate; iii) for maximizing general non-monotone submodular continuous functions subject to box constraints, we propose a DoubleGreedy algorithm with 1/3 approximation guarantee. Submodular continuous functions naturally find applications in various real-world settings, including influence and revenue maximization with continuous assignments, sensor energy management, facility location, etc. Experimental results show that the proposed algorithms efficiently generate superior solutions compared to baseline algorithms.

Frequency Domain Predictive Modelling with Aggregated Data

Mon, 10 Apr 2017 00:00:00 +0000

Existing work in spatio-temporal data analysis invariably assumes data available as individual measurements with localised estimates. However, for many applications like econometrics, financial forecasting and climate science, data is often obtained as aggregates. Data aggregation presents severe mathematical challenges to learning and inference, and application of standard techniques is susceptible to ecological fallacy. In this manuscript we investigate the problem of predictive linear modelling in the scenario where data is aggregated in a non-uniform manner across targets and features. We introduce a novel formulation of the problem in the frequency domain, and develop algorithmic techniques that exploit the duality properties of Fourier analysis to bypass the inherent structural challenges of this setting. We provide theoretical guarantees for generalisation error for our estimation procedure and extend our analysis to capture approximation effects arising from aliasing. Finally, we perform empirical evaluation to demonstrate the efficacy of our algorithmic aproach in predictive modelling on synthetic data, and on three real datasets from agricultural studies, ecological surveys and climate science. approximation effects arising from aliasing. Finally, we perform empirical evaluation to demonstrate the efficacy of our algorithmic aproach in predictive modelling on synthetic data, and on three real datasets from agricultural studies, ecological surveys and climate science.

Active Positive Semidefinite Matrix Completion: Algorithms, Theory and Applications

Mon, 10 Apr 2017 00:00:00 +0000

In this paper we provide simple, computationally efficient, active algorithms for completion of symmetric positive semidefinite matrices. Our proposed algorithms are based on adaptive Nyström sampling, and are allowed to actively query any element in the matrix, and obtain a possibly noisy estimate of the queried element. We establish sample complexity guarantees on the recovery of the matrix in the max-norm and in the process establish new theoretical results, potentially of independent interest, on adaptive Nyström sampling. We demonstrate the efficacy of our algorithms on problems in multi-armed bandits and kernel dimensionality reduction.

Distribution of Gaussian Process Arc Lengths

Mon, 10 Apr 2017 00:00:00 +0000

We present the first treatment of the arc length of the GP with more than a single output dimension. GPs are commonly used for tasks such as trajectory modelling, where path length is a crucial quantity of interest. Previously, only paths in one dimension have been considered, with no theoretical consideration of higher dimensional problems. We fill the gap in the existing literature by deriving the moments of the arc length for a stationary GP with multiple output dimensions. A new method is used to derive the mean of a one-dimensional GP over a finite interval, by considering the distribution of the arc length integrand. This technique is used to derive an approximate distribution over the arc length of a vector valued GP in $\mathbbR^n$ by moment matching the distribution. Numerical simulations confirm our theoretical derivations.

Local Perturb-and-MAP for Structured Prediction

Mon, 10 Apr 2017 00:00:00 +0000

Conditional random fields (CRFs) provide a powerful tool for structured prediction, but cast significant challenges in both the learning and inference steps. Approximation techniques are widely used in both steps, which should be considered jointly to guarantee good performance (a.k.a. “inferning"). Perturb-and-MAP models provide a promising alternative to CRFs, but require global combinatorial optimization and hence they are usable only on specific models. In this work, we present a new Local Perturb-and-MAP (locPMAP) framework that replaces the global optimization with a local optimization by exploiting our observed connection between locPMAP and the pseudolikelihood of the original CRF model. We test our approach on three different vision tasks and show that our method achieves consistently improved performance over other approximate inference techniques optimized to a pseudolikelihood objective. Additionally, we demonstrate that we can integrate our method in the fully convolutional network framework to increase our model’s complexity. Finally, our observed connection between locPMAP and the pseudolikelihood leads to a novel perspective for understanding and using pseudolikelihood.

Random projection design for scalable implicit smoothing of randomly observed stochastic processes

Mon, 10 Apr 2017 00:00:00 +0000

Sampling at random timestamps, long range dependencies, and scale hamper standard meth- ods for multivariate time series analysis. In this paper we present a novel estimator for cross-covariance of randomly observed time series which unravels the dynamics of an unobserved stochastic process. We analyze the statistical properties of our estimator without needing the assumption that observation timestamps are independent from the process of interest and show that our solution is not hindered by the issues affecting standard estimators for cross-covariance. We implement and evaluate our statistically sound and scalable approach in the distributed setting using Apache Spark and demonstrate its ability to unravel causal dynamics on both simulations and high-frequency financial trading data.

Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation

Mon, 10 Apr 2017 00:00:00 +0000

We propose a flexible convex relaxation for the phase retrieval problem that operates in the natural domain of the signal. Therefore, we avoid the prohibitive computational cost associated with “lifting” and semidefinite programming (SDP) in methods such as PhaseLift and compete with recently developed non-convex techniques for phase retrieval. We relax the quadratic equations for phaseless measurements to inequality constraints each of which representing a symmetric “slab”. Through a simple convex program, our proposed estimator finds an extreme point of the intersection of these slabs that is best aligned with a given anchor vector. We characterize geometric conditions that certify success of the proposed estimator. Furthermore, using classic results in statistical learning theory, we show that for random measurements the geometric certificates hold with high probability at an optimal sample complexity. Phase transition of our estimator is evaluated through simulations. Our numerical experiments also suggest that the proposed method can solve phase retrieval problems with coded diffraction measurements as well.

Least-Squares Log-Density Gradient Clustering for Riemannian Manifolds

Mon, 10 Apr 2017 00:00:00 +0000

Mean shift is a mode-seeking clustering algorithm that has been successfully used in a wide range of applications such as image segmentation and object tracking. To further improve the clustering performance, mean shift has been extended to various directions, including generalization to handle data on Riemannian manifolds and extension to directly estimating the density gradient without density estimation. In this paper, we combine these ideas and propose a novel mode-seeking algorithm for Riemannian manifolds with direct density-gradient estimation. Although the idea of combining the two extensions is rather straightforward, directly estimating the density gradient on Riemannian manifolds is mathematically challenging. We will provide a mathematically sound algorithm and demonstrate its usefulness through experiments.

Spectral Methods for Correlated Topic Models

Mon, 10 Apr 2017 00:00:00 +0000

In this paper we propose guaranteed spectral methods for learning a broad range of topic models, which generalize the popular Latent Dirichlet Allocation (LDA). We overcome the limitation of LDA to incorporate arbitrary topic correlations, by assuming that the hidden topic proportions are drawn from a flexible class of Normalized Infinitely Divisible (NID) distributions. NID distributions are generated by normalizing a family of independent Infinitely Divisible (ID) random variables. The Dirichlet distribution is a special case obtained by normalizing a set of Gamma random variables. We prove that this flexible topic model class can be learnt via spectral methods using only moments up to the third order, with (low order) polynomial sample and computational complexity. The proof is based on a key new technique derived here that allows us to diagonalize the moments of the NID distribution through an efficient procedure that requires evaluating only univariate integrals, despite the fact that we are handling high dimensional multivariate moments. In order to assess the performance of our proposed Latent NID topic model, we use two real datasets of articles collected from New York Times and Pubmed. Our experiments yield improved perplexity on both datasets compared with the baseline.

Non-Count Symmetries in Boolean & Multi-Valued Prob. Graphical Models

Mon, 10 Apr 2017 00:00:00 +0000

Lifted inference algorithms commonly exploit symmetries in a probabilistic graphical model (PGM) for efficient inference. However, existing algorithms for Boolean-valued domains can identify only those pairs of states as symmetric, in which the number of ones and zeros match exactly (count symmetries). Moreover, algorithms for lifted inference in multi-valued domains also compute a multi-valued extension of count symmetries only. These algorithms miss many symmetries in a domain. In this paper, we present first algorithms to compute non-count symmetries in both Boolean-valued and multi-valued domains. Our methods can also find symmetries between multi-valued variables that have different domain cardinalities. The key insight in the algorithms is that they change the unit of symmetry computation from a variable to a variable-value (VV) pair. Our experiments find that exploiting these symmetries in MCMC can obtain substantial computational gains over existing algorithms.

Regret Bounds for Lifelong Learning

Mon, 10 Apr 2017 00:00:00 +0000

We consider the problem of transfer learning in an online setting. Different tasks are presented sequentially and processed by a within-task algorithm. We propose a lifelong learning strategy which refines the underlying data representation used by the within-task algorithm, thereby transferring information from one task to the next. We show that when the within-task algorithm comes with some regret bound, our strategy inherits this good property. Our bounds are in expectation for a general loss function, and uniform for a convex loss. We discuss applications to dictionary learning and finite set of predictors. In the latter case, we improve previous $O(1/\sqrt{m})$ bounds to $O(1/m)$, where $m$ is the per task sample size.

Generalized Pseudolikelihood Methods for Inverse Covariance Estimation

Mon, 10 Apr 2017 00:00:00 +0000

We introduce PseudoNet, a new pseudolikelihood-based estimator of the inverse covariance matrix, that has a number of useful statistical and computational properties. We show, through detailed experiments with synthetic and also real-world finance as well as wind power data, that PseudoNet outperforms related methods in terms of estimation error and support recovery, making it well-suited for use in a downstream application, where obtaining low estimation error can be important. We also show, under regularity conditions, that PseudoNet is consistent. Our proof assumes the existence of accurate estimates of the diagonal entries of the underlying inverse covariance matrix; we additionally provide a two-step method to obtain these estimates, even in a high-dimensional setting, going beyond the proofs for related methods. Unlike other pseudolikelihood-based methods, we also show that PseudoNet does not saturate, i.e., in high dimensions, there is no hard limit on the number of nonzero entries in the PseudoNet estimate. We present a fast algorithm as well as screening rules that make computing the PseudoNet estimate over a range of tuning parameters tractable.

An Information-Theoretic Route from Generalization in Expectation to Generalization in Probability

Mon, 10 Apr 2017 00:00:00 +0000

One fundamental goal in any learning algorithm is to mitigate its risk for overfitting. Mathematically, this requires that the learning algorithm enjoys a small generalization risk, which is defined either in expectation or in probability. Both types of generalization are commonly used in the literature. For instance, generalization in expectation has been used to analyze algorithms, such as ridge regression and SGD, whereas generalization in probability is used in the VC theory, among others. Recently, a third notion of generalization has been studied, called uniform generalization, which requires that the generalization risk vanishes uniformly in expectation across all bounded parametric losses. It has been shown that uniform generalization is, in fact, equivalent to an information-theoretic stability constraint, and that it recovers classical results in learning theory. It is achievable under various settings, such as sample compression schemes, finite hypothesis spaces, finite domains, and differential privacy. However, the relationship between uniform generalization and concentration remained unknown. In this paper, we answer this question by proving that, while a generalization in expectation does not imply a generalization in probability, a uniform generalization in expectation does imply concentration. We establish a chain rule for the uniform generalization risk of the composition of hypotheses and use it to derive a large deviation bound. Finally, we prove that the bound is tight.

Learning Time Series Detection Models from Temporally Imprecise Labels

Mon, 10 Apr 2017 00:00:00 +0000

In this paper, we consider a new low-quality label learning problem: learning time series detection models from temporally imprecise labels. In this problem, the data consist of a set of input time series, and supervision is provided by a sequence of noisy time stamps corresponding to the occurrence of positive class events. Such temporally imprecise labels commonly occur in areas like mobile health research where human annotators are tasked with labeling the occurrence of very short duration events. We propose a general learning framework for this problem that can accommodate different base classifiers and noise models. We present results on real mobile health data showing that the proposed framework significantly outperforms a number of alternatives including assuming that the label time stamps are noise-free, transforming the problem into the multiple instance learning framework, and learning on labels that were manually re-aligned.

Thompson Sampling for Linear-Quadratic Control Problems

Mon, 10 Apr 2017 00:00:00 +0000

We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics is linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS) (a.k.a. posterior-sampling for reinforcement learning) in the frequentist setting, i.e., when the parameters characterizing the LQ dynamics are fixed. Despite the empirical and theoretical success in a wide range of problems from multi-armed bandit to linear bandit, we show that when studying the frequentist regret TS in control problems, we need to trade-off the frequency of sampling optimistic parameters and the frequency of switches in the control policy. This results in an overall regret of $O(T^2/3)$, which is significantly worse than the regret $O(\sqrtT)$ achieved by the optimism-in-face-of-uncertainty algorithm in LQ control problems.

Linear Thompson Sampling Revisited

Mon, 10 Apr 2017 00:00:00 +0000

We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $O(d^3/2\sqrtT)$ as in previous results, the proof sheds new light on the functioning of the TS. We leverage on the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated to \textitoptimistic parameters does control it. Thus we show that TS can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrtd$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.

Hit-and-Run for Sampling and Planning in Non-Convex Spaces

Mon, 10 Apr 2017 00:00:00 +0000

We propose the Hit-and-Run algorithm for planning and sampling problems in non- convex spaces. For sampling, we show the first analysis of the Hit-and-Run algorithm in non-convex spaces and show that it mixes fast as long as certain smoothness conditions are satisfied. In particular, our analysis reveals an intriguing connection between fast mixing and the existence of smooth measure-preserving mappings from a convex space to the non-convex space. For planning, we show advantages of Hit-and- Run compared to state-of-the-art planning methods such as Rapidly-Exploring Random Trees.