Proceedings of Machine Learning Research

Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics
Held in Playa Blanca, Lanzarote, Canary Islands on 09-11 April 2018
Published as Volume 84 by the Proceedings of Machine Learning Research on 31 March 2018.
Volume Edited by: Amos Storkey, Fernando Perez-Cruz
Series Editors: Neil D. Lawrence, Mark Reid
https://proceedings.mlr.press/v84/
Wed, 08 Feb 2023 10:44:08 +0000

Bayesian Multi-label Learning with Sparse Features and Labels, and Label Co-occurrences
We present a probabilistic, fully Bayesian framework for multi-label learning. Our framework is based on the idea of learning a joint low-rank embedding of the label matrix and the label co-occurrence matrix. The proposed framework has the following appealing aspects: (1) it leverages the sparsity in the label matrix and the feature matrix, which results in very efficient inference, especially for the sparse datasets commonly encountered in multi-label learning problems, and (2) by effectively utilizing the label co-occurrence information, the model yields improved prediction accuracies, especially when the amount of training data is low and/or the label matrix has a significant fraction of missing labels. Our framework enjoys full local conjugacy and admits a simple inference procedure via a scalable Gibbs sampler. We report experimental results on a number of benchmark datasets, on which it outperforms several state-of-the-art multi-label learning models.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zhao18b.html

Stochastic Three-Composite Convex Minimization with a Linear Operator
We develop a primal-dual convex minimization framework to solve a class of stochastic convex three-composite problems with a linear operator.
We consider both the convex and the strongly convex cases and analyze the convergence of the proposed algorithm in each. In addition, we extend the proposed framework to deal with additional constraint sets and multiple non-smooth terms. We provide numerical evidence on graph-guided sparse logistic regression, fused lasso and overlapped group lasso, to demonstrate the superiority of our approach to the state-of-the-art.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zhao18a.html

Learning Structural Weight Uncertainty for Sequential Decision-Making
Learning probability distributions on the weights of neural networks (NNs) has recently proven beneficial in many applications. Bayesian methods, such as Stein variational gradient descent (SVGD), offer an elegant framework to reason about NN model uncertainty. However, by assuming independent Gaussian priors for the individual NN weights (as often applied), SVGD does not impose prior knowledge that there is often structural information (dependence) among weights. We propose efficient posterior learning of structural weight uncertainty, within an SVGD framework, by employing matrix variate Gaussian priors on NN parameters. We further investigate the learned structural uncertainty in sequential decision-making problems, including contextual bandits and reinforcement learning. Experiments on several synthetic and real datasets indicate the superiority of our model, compared with state-of-the-art methods.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zhang18d.html

A Unified Framework for Nonconvex Low-Rank plus Sparse Matrix Recovery
We propose a unified framework to solve general low-rank plus sparse matrix recovery problems based on matrix factorization, which covers a broad family of objective functions satisfying the restricted strong convexity and smoothness conditions. Based on projected gradient descent and the double thresholding operator, our proposed generic algorithm is guaranteed to converge to the unknown low-rank and sparse matrices at a locally linear rate, while matching the best-known robustness guarantee (i.e., tolerance for sparsity). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm and, we believe, is of independent interest for proving fast rates for general superposition-structured models. We illustrate the application of our framework through two concrete examples: robust matrix sensing and robust PCA. Empirical experiments corroborate our theory.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zhang18c.html

Transfer Learning on fMRI Datasets
We explore transfer learning between fMRI datasets. A method is introduced to improve prediction accuracy on a primary fMRI dataset by jointly learning a model using other secondary fMRI datasets. We assume the secondary datasets are directly or indirectly linked to the primary dataset through sets of partially shared subjects. This method is particularly useful when the primary dataset is small. Using six fMRI datasets linked by various subsets of shared subjects, we show that the method yields improved performance in various predictive tasks.
Our tests are performed on a variety of regions of interest in the brain and across various stimuli.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zhang18b.html

Nonlinear Structured Signal Estimation in High Dimensions via Iterative Hard Thresholding
We study the high-dimensional signal estimation problem with nonlinear measurements, where the signal of interest is either sparse or low-rank. In both settings, our estimator is formulated as the minimizer of the nonlinear least-squares loss function under a combinatorial constraint, which is obtained efficiently by the iterative hard thresholding (IHT) algorithm. Although the loss function is non-convex due to the nonlinearity of the statistical model, the IHT algorithm is shown to converge linearly to a point with optimal statistical accuracy using arbitrary initialization. Moreover, our analysis only hinges on conditions similar to those required in the linear case. Detailed numerical experiments are included to corroborate the theoretical results.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zhang18a.html

Finding Global Optima in Nonconvex Stochastic Semidefinite Optimization with Variance Reduction
There has been a recent surge of interest in nonconvex reformulations, via low-rank factorization, of stochastic convex semidefinite optimization problems for the purpose of efficiency and scalability. Compared with the original convex formulations, the nonconvex ones typically involve far fewer variables, allowing them to scale to scenarios with millions of variables. However, despite their empirical success in applications, it remains an open question under what conditions nonconvex stochastic algorithms can find the global optima effectively.
In this paper, we provide an answer: a stochastic gradient descent method with variance reduction can be adapted to solve the nonconvex reformulation of the original convex problem with global linear convergence, i.e., converging to a global optimum exponentially fast, given a proper initialization in the restricted strongly convex case. Experimental studies on ordinal embedding, in both simulated and real-world applications, are provided to show the effectiveness of the proposed algorithms.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/zeng18a.html

Graphical Models for Non-Negative Data Using Generalized Score Matching
A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach becomes computationally intensive. In contrast, the score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions over $\mathbb{R}^m$. Hyvärinen (2007) extended the approach to distributions supported on the non-negative orthant $\mathbb{R}_+^m$. In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. We also generalize the regularized score matching method of Lin et al. (2016) for non-negative Gaussian graphical models, with improved theoretical guarantees.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yu18b.html

Tensor Regression Meets Gaussian Processes
Low-rank tensor regression, a new model class that learns high-order correlation from data, has recently received considerable attention. At the same time, Gaussian processes (GP) are well-studied machine learning models for structure learning.
In this paper, we demonstrate interesting connections between the two, especially for multi-way data analysis. We show that low-rank tensor regression is essentially learning a multi-linear kernel in Gaussian processes, and the low-rank assumption translates to the constrained Bayesian inference problem. We prove the oracle inequality and derive the average case learning curve for the equivalent GP model. Our finding implies that low-rank tensor regression, though empirically successful, is highly dependent on the eigenvalues of covariance functions as well as variable correlations.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yu18a.html

Gradient Diversity: a Key Ingredient for Scalable Distributed Learning
It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity that measures the dissimilarity between concurrent gradient updates, and show its key role in the convergence and generalization performance of mini-batch SGD. We also establish that heuristics similar to DropConnect, Langevin dynamics, and quantization are provably diversity-inducing mechanisms, and provide experimental evidence indicating that these mechanisms can indeed enable the use of larger batches without sacrificing accuracy and lead to faster training in distributed learning. For example, in one of our experiments, for a convolutional neural network to reach 95% training accuracy on MNIST, using the diversity-inducing mechanism can reduce the training time by 30% in the distributed setting.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yin18a.html

HONES: A Fast and Tuning-free Homotopy Method For Online Newton Step
In this article, we develop and analyze a homotopy continuation method, referred to as HONES, for solving the sequential generalized projections in Online Newton Step (Hazan et al., 2006b), as well as the generalized problem known as sequential standard quadratic programming. HONES is fast, tuning-free, error-free (up to machine error) and adaptive to the solution sparsity. This is confirmed by both careful theoretical analysis and extensive experiments on both synthetic and real data.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ye18a.html

Dimensionality Reduced $\ell^{0}$-Sparse Subspace Clustering
Subspace clustering partitions the data that lie on a union of subspaces. $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC), which belongs to the subspace clustering methods with sparsity prior, guarantees the correctness of subspace clustering under less restrictive assumptions than its $\ell^{1}$ counterparts, such as Sparse Subspace Clustering (SSC, Elhamifar et al., 2013), with demonstrated effectiveness in practice. In this paper, we present Dimensionality Reduced $\ell^{0}$-Sparse Subspace Clustering (DR-$\ell^{0}$-SSC). DR-$\ell^{0}$-SSC first projects the data onto a lower dimensional space by linear transformation, then performs $\ell^{0}$-SSC on the dimensionality reduced data. The correctness of DR-$\ell^{0}$-SSC in terms of the subspace detection property is proved; therefore, DR-$\ell^{0}$-SSC recovers the underlying subspace structure in the original data from the dimensionality reduced data. Experimental results demonstrate the effectiveness of DR-$\ell^{0}$-SSC.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yang18c.html

A Simple Analysis for Exp-concave Empirical Minimization with Arbitrary Convex Regularizer
In this paper, we present a simple analysis of fast rates with high probability of empirical minimization for stochastic composite optimization over a finite-dimensional bounded convex set with exponential concave loss functions and an arbitrary convex regularization. To the best of our knowledge, this result is the first of its kind. As a byproduct, we can directly obtain the fast rate with high probability for exponential concave empirical risk minimization with and without any convex regularization, which not only extends existing results of empirical risk minimization but also provides a unified framework for analyzing exponential concave empirical risk minimization with and without any convex regularization. Our proof is very simple, exploiting only the covering number of a finite-dimensional bounded set and a concentration inequality of random vectors.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yang18b.html

Optimal Cooperative Inference
Cooperative transmission of data fosters rapid accumulation of knowledge by efficiently combining experiences across learners. Although well studied in human learning and increasingly in machine learning, we lack formal frameworks through which we may reason about the benefits and limitations of cooperative inference. We present such a framework. We introduce novel indices for measuring the effectiveness of probabilistic and cooperative information transmission. We relate our indices to the well-known Teaching Dimension in deterministic settings.
We prove conditions under which optimal cooperative inference can be achieved, including a representation theorem that constrains the form of inductive biases for learners optimized for cooperative inference. We conclude by demonstrating how these principles may inform the design of machine learning algorithms and discuss implications for human and machine learning.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yang18a.html

Provable Estimation of the Number of Blocks in Block Models
Community detection is a fundamental unsupervised learning problem for unlabeled networks which has a broad range of applications. Many community detection algorithms assume that the number of clusters $r$ is known a priori. In this paper, we propose an approach based on semi-definite relaxations, which does not require prior knowledge of model parameters like many existing convex relaxation methods and recovers the number of clusters and the clustering matrix exactly under a broad parameter regime, with probability tending to one. On a variety of simulated and real data experiments, we show that the proposed method often outperforms state-of-the-art techniques for estimating the number of clusters.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yan18a.html

Post Selection Inference with Kernels
Finding a set of statistically significant features from complex data (e.g., nonlinear and/or multi-dimensional output data) is important for scientific discovery and has a number of practical applications including biomarker discovery. In this paper, we propose a kernel-based post-selection inference (PSI) algorithm that can find a set of statistically significant features from non-linearly related data. Specifically, our PSI algorithm is based on independence measures, and we call it the Hilbert-Schmidt Independence Criterion (HSIC)-based PSI algorithm (hsicInf).
The novelty of hsicInf is that it can handle non-linearity and/or multi-variate/multi-class outputs through kernels. Through synthetic experiments, we show that hsicInf can find a set of statistically significant features for both regression and classification problems. We applied hsicInf to real-world datasets and show that it can successfully identify important features.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/yamada18a.html

Achieving the time of 1-NN, but the accuracy of k-NN
We propose a simple approach which, given distributed computing resources, can nearly achieve the accuracy of k-NN prediction, while matching (or improving) the faster prediction time of 1-NN. The approach consists of aggregating denoised 1-NN predictors over a small number of distributed subsamples. We show, both theoretically and experimentally, that small subsample sizes suffice to attain performance similar to that of k-NN, without sacrificing the computational efficiency of 1-NN.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xue18a.html

On the Statistical Efficiency of Compositional Nonparametric Prediction
In this paper, we propose a compositional nonparametric method in which a model is expressed as a labeled binary tree of $2k+1$ nodes, where each node is either a summation, a multiplication, or the application of one of the $q$ basis functions to one of the $p$ covariates. We show that in order to recover a labeled binary tree from a given dataset, the sufficient number of samples is $O(k\log(pq)+\log(k!))$, and the necessary number of samples is $\Omega(k\log(pq)-\log(k!))$. We further propose a greedy algorithm for regression in order to validate our theoretical findings through synthetic experiments.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xu18f.html

Accelerated Stochastic Mirror Descent: From Continuous-time Dynamics to Discrete-time Algorithms
We present a new framework to analyze accelerated stochastic mirror descent through the lens of continuous-time stochastic dynamic systems. It enables us to design new algorithms, and perform a unified and simple analysis of the convergence rates of these algorithms. More specifically, under this framework, we provide a Lyapunov function based analysis for the continuous-time stochastic dynamics, as well as several new discrete-time algorithms derived from the continuous-time dynamics. We show that for general convex objective functions, the derived discrete-time algorithms attain the optimal convergence rate. Empirical experiments corroborate our theory.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xu18e.html

A fully adaptive algorithm for pure exploration in linear bandits
We propose the first fully-adaptive algorithm for pure exploration in linear bandits, the task of finding the arm with the largest expected reward, which depends linearly on an unknown parameter. While existing methods partially or entirely fix sequences of arm selections before observing rewards, our method adaptively changes the arm selection strategy based on past observations at each round. We show our sample complexity matches the achievable lower bound up to a constant factor in an extreme case. Furthermore, we evaluate the performance of the methods by simulations based on both synthetic settings and real-world data, in which our method shows vast improvement over existing ones.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xu18d.html

Benefits from Superposed Hawkes Processes
The superposition of temporal point processes has been studied for many years, although the usefulness of such models for practical applications has not been fully developed. We investigate superposed Hawkes processes as an important class of such models, with properties studied in the framework of least squares estimation. The superposition of Hawkes processes is demonstrated to be beneficial for tightening the upper bound of excess risk under certain conditions, and we show the feasibility of the benefit in typical situations. The usefulness of superposed Hawkes processes is verified on synthetic data, and its potential to solve the cold-start problem of recommendation systems is demonstrated on real-world data.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xu18c.html

On Truly Block Eigensolvers via Riemannian Optimization
We study theoretical properties of block solvers for the eigenvalue problem. Despite a recent surge of interest in such eigensolver analysis, truly block solvers have received relatively less attention, in contrast to the majority of studies concentrating on vector versions and non-truly block versions that rely on the deflation strategy. In fact, truly block solvers are more widely deployed in practice by virtue of their simplicity without compromise on accuracy. However, the corresponding theoretical analysis remains inadequate for first-order solvers, as only local and k-th gap-dependent rates of convergence have been established thus far. This paper is devoted to revealing significantly better or as-yet-unknown theoretical properties of such solvers.
We present a novel convergence analysis in a unified framework for three types of first-order Riemannian solvers, i.e., deterministic, vanilla stochastic, and stochastic with variance reduction, for finding the top-k eigenvectors of a real symmetric matrix, in full generality. In particular, the issue of zero gaps between eigenvalues is, to the best of our knowledge, explicitly considered for these solvers for the first time, which brings new understanding, e.g., the dependence of convergence on gaps other than the k-th one. We thus propose the concept of generalized k-th gap. The three types of solvers are proved to converge to a globally optimal solution at a global, generalized k-th gap-dependent, and linear or sub-linear rate.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xu18b.html

Accelerated Stochastic Power Iteration
Principal component analysis (PCA) is one of the most powerful tools for analyzing matrices in machine learning. In this paper, we study methods to accelerate power iteration in the stochastic setting by adding a momentum term. While in the deterministic setting, power iteration with momentum has optimal iteration complexity, we show that naively adding momentum to a stochastic method does not always result in acceleration. We perform a novel, tight variance analysis that reveals a "breaking-point variance" beyond which this acceleration does not occur. Combining this insight with modern variance reduction techniques yields a simple version of power iteration with momentum that achieves the optimal iteration complexities in both the online and offline settings. Our methods are embarrassingly parallel and can produce wall-clock-time speedups. Our approach is very general and applies to many non-convex optimization problems that can now be accelerated using the same technique.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xu18a.html

Towards Memory-Friendly Deterministic Incremental Gradient Method
Incremental Gradient (IG) methods are classical strategies for solving finite-sum minimization problems. Deterministic IG methods are particularly favorable for handling massive-scale problems due to their memory-friendly data access pattern. In this paper, we propose a new deterministic variant of the IG method SVRG that blends a periodically updated full gradient with a component function gradient selected in a cyclic order. Our method uses only $O(1)$ extra gradient storage without compromising the linear convergence. Empirical results demonstrate that the proposed method is advantageous over existing incremental gradient algorithms, especially on problems that do not fit into physical memory.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/xie18a.html

Random Warping Series: A Random Features Method for Time-Series Embedding
Time series data analytics has been a problem of substantial interest for decades, and Dynamic Time Warping (DTW) has been the most widely adopted technique to measure dissimilarity between time series. A number of global-alignment kernels have since been proposed in the spirit of DTW to extend its use to kernel-based estimation methods such as support vector machines. However, those kernels suffer from diagonal dominance of the Gram matrix and a quadratic complexity w.r.t. the sample size. In this work, we study a family of alignment-aware positive definite (p.d.) kernels, whose feature embedding is given by a distribution of Random Warping Series (RWS).
The proposed kernel does not suffer from the issue of diagonal dominance while naturally enjoying a Random Features (RF) approximation, which reduces the computational complexity of existing DTW-based techniques from quadratic to linear in terms of both the number and the length of time-series. We also study the convergence of the RF approximation for the domain of time series of unbounded length. Our extensive experiments on 16 benchmark datasets demonstrate that RWS outperforms or matches state-of-the-art classification and clustering methods in both accuracy and computational time.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wu18b.html

Iterative Spectral Method for Alternative Clustering
Given a dataset and an existing clustering as input, alternative clustering aims to find an alternative partition. One of the state-of-the-art approaches is Kernel Dimension Alternative Clustering (KDAC). We propose a novel Iterative Spectral Method (ISM) that greatly improves the scalability of KDAC. Our algorithm is intuitive, relies on easily implementable spectral decompositions, and comes with theoretical guarantees. Its computation time improves upon existing implementations of KDAC by as much as 5 orders of magnitude.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wu18a.html

Fast and Scalable Learning of Sparse Changes in High-Dimensional Gaussian Graphical Model Structure
We focus on the problem of estimating the change in the dependency structures of two $p$-dimensional Gaussian Graphical models (GGMs). Previous studies for sparse change estimation in GGMs involve expensive and difficult non-smooth optimization. We propose a novel method, DIFFEE, for estimating DIFFerential networks via an Elementary Estimator under a high-dimensional situation.
DIFFEE is solved through a faster, closed-form solution that enables it to work in large-scale settings. We conduct a rigorous statistical analysis showing that, surprisingly, DIFFEE achieves the same asymptotic convergence rates as the state-of-the-art estimators that are much more difficult to compute. Our experimental results on multiple synthetic datasets and one real-world dataset on brain connectivity show strong performance improvements over baselines, as well as significant computational benefits.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wang18f.html

Stochastic Zeroth-order Optimization in High Dimensions
We consider the problem of optimizing a high-dimensional convex function using stochastic zeroth-order queries. Under sparsity assumptions on the gradients or function values, we present two algorithms: a successive component/feature selection algorithm and a noisy mirror descent algorithm using Lasso gradient estimates, and show that both algorithms have convergence rates that depend only logarithmically on the ambient dimension of the problem. Empirical results confirm our theoretical findings and show that the algorithms we design outperform classical zeroth-order optimization methods in the high-dimensional setting.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wang18e.html

A Stochastic Differential Equation Framework for Guiding Online User Activities in Closed Loop
Recently, there has been a surge of interest in using point processes to model continuous-time user activities. This framework has resulted in novel models and improved performance in diverse applications. However, most previous works focus on the “open loop” setting where learned models are used for predictive tasks.
Typically, we are interested in the “closed loop” setting where a policy needs to be learned to incorporate user feedback and guide user activities to desirable states. Although point processes have good predictive performance, it is not clear how to use them for the challenging closed loop activity guiding task. In this paper, we propose a framework to reformulate point processes into stochastic differential equations, which allows us to extend methods from stochastic optimal control to address the activity guiding problem. We also design an efficient algorithm, and show that our method guides user activities to desired states more effectively than state-of-the-art methods.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wang18d.html

Batched Large-scale Bayesian Optimization in High-dimensional Spaces
Bayesian optimization (BO) has become an effective approach for black-box function optimization problems when function evaluations are expensive and the optimum can be achieved within a relatively small number of queries. However, many cases, such as the ones with high-dimensional inputs, may require a much larger number of observations for optimization. Despite an abundance of observations thanks to parallel experiments, current BO techniques have been limited to merely a few thousand observations. In this paper, we propose ensemble Bayesian optimization (EBO) to address three current challenges in BO simultaneously: (1) large-scale observations; (2) high dimensional input spaces; and (3) selections of batch queries that balance quality and diversity. The key idea of EBO is to operate on an ensemble of additive Gaussian process models, each of which possesses a randomized strategy to divide and conquer. We show unprecedented, previously impossible results of scaling up BO to tens of thousands of observations within minutes of computation.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wang18c.html

Regional Multi-Armed Bandits
We consider a variant of the classic multi-armed bandit problem where the expected reward of each arm is a function of an unknown parameter. The arms are divided into different groups, each of which has a common parameter. Therefore, when the player selects an arm at each time slot, information about other arms in the same group is also revealed. This regional bandit model naturally bridges the non-informative bandit setting, where the player can only learn about the chosen arm, and the global bandit model, where sampling one arm reveals information about all arms. We propose an efficient algorithm, UCB-g, that solves the regional bandit problem by combining the Upper Confidence Bound (UCB) and greedy principles. Both parameter-dependent and parameter-free regret upper bounds are derived. We also establish a matching lower bound, which proves the order-optimality of UCB-g. Moreover, we propose SW-UCB-g, which is an extension of UCB-g for a non-stationary environment where the parameters slowly vary over time.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wang18b.html

Topic Compositional Neural Language Model
We propose a Topic Compositional Neural Language Model (TCNLM), a novel method designed to simultaneously capture both the global semantic meaning and the local word-ordering structure in a document. The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence.
In order to train the MoE model efficiently, a matrix factorization method is applied, by extending each weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. The degree to which each member of the ensemble is used is tied to the document-dependent probability of the corresponding topics. Experimental results on several corpora show that the proposed approach outperforms both a pure RNN-based model and other topic-guided language models. Further, our model yields sensible topics, and also has the capacity to generate meaningful sentences conditioned on given topics. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/wang18a.html https://proceedings.mlr.press/v84/wang18a.html Intersection-Validation: A Method for Evaluating Structure Learning without Ground Truth To compare learning algorithms that differ by the adopted statistical paradigm, model class, or search heuristic, it is common to evaluate the performance on training data of varying size. Measuring the performance is straightforward if the data are generated from a known model, the ground truth. However, when the study concerns real-world data, the current methodology is limited to estimating predictive performance, typically by cross-validation. This work introduces a method to compare algorithms’ ability to learn the model structure, assuming no ground truth is given. The idea is to identify a partial structure on which the algorithms agree, and measure the performance in relation to that structure on subsamples of the data. The method is instantiated to structure learning in Bayesian networks, measuring the performance by the structural Hamming distance. It is tested using benchmark ground truth networks and algorithms that maximize various scoring functions. The results show that the method can produce evaluation outcomes that are close to those one would obtain if the ground truth was available. 
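As a rough illustration of the structural Hamming distance mentioned in the Intersection-Validation abstract above, the following minimal sketch compares two directed graphs given as adjacency matrices. It is not the paper's evaluation procedure: the function name is hypothetical, and the convention that a missing, extra, or reversed edge each counts as one unit of distance is an assumption (other conventions exist, for example computing the distance on equivalence-class representations).

```python
import numpy as np


def structural_hamming_distance(a, b):
    """Count node pairs whose edge status differs between two directed graphs.

    a, b: square 0/1 adjacency matrices where entry [i, j] = 1 means i -> j.
    Assumed convention: a missing, extra, or reversed edge costs 1 each.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    diff = a != b
    # A pair {i, j} differs if either direction differs; count each pair once.
    pair_differs = diff | diff.T
    return int(np.triu(pair_differs, k=1).sum())


# Toy graphs: g1 has edges 0->1 and 1->2; g2 has only the reversed edge 1->0.
g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
g2 = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
shd = structural_hamming_distance(g1, g2)  # one reversal plus one missing edge
```

Counting each unordered node pair once keeps a reversed edge from being charged twice, which is one common choice but not the only one.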
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/viinikka18a.html https://proceedings.mlr.press/v84/viinikka18a.html Growth-Optimal Portfolio Selection under CVaR Constraints Online portfolio selection research has so far focused mainly on minimizing regret defined in terms of wealth growth. Practical financial decision making, however, is deeply concerned with both wealth and risk. We consider online learning of portfolios of stocks whose prices are governed by arbitrary (unknown) stationary and ergodic processes, where the goal is to maximize wealth while keeping the conditional value at risk (CVaR) below a desired threshold. We characterize the asymptotically optimal risk-adjusted performance and present an investment strategy whose portfolios are guaranteed to achieve the asymptotically optimal solution while fulfilling the desired risk constraint. We also numerically demonstrate and validate the viability of our method on standard datasets. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/uziel18a.html https://proceedings.mlr.press/v84/uziel18a.html Variational inference for the multi-armed contextual bandit In many biomedical, science, and engineering problems, one must sequentially decide which action to take next so as to maximize rewards. One general class of algorithms for optimizing interactions with the world, while simultaneously learning how the world operates, is the multi-armed bandit setting and, in particular, the contextual bandit case. In this setting, for each executed action, one observes rewards that are dependent on a given ‘context’, available at each interaction with the world. The Thompson sampling algorithm has recently been shown to enjoy provable optimality properties for this set of problems, and to perform well in real-world settings. It facilitates generative and interpretable modeling of the problem at hand.
Nevertheless, the design and complexity of the model limit its application, since one must both sample from the distributions modeled and calculate their expected rewards. We here show how these limitations can be overcome using variational inference to approximate complex models, applying to the reinforcement learning case advances developed for the inference case in the machine learning community over the past two decades. We consider contextual multi-armed bandit applications where the true reward distribution is unknown and complex, which we approximate with a mixture model whose parameters are inferred via variational inference. We show how the proposed variational Thompson sampling approach is accurate in approximating the true distribution, and attains reduced regrets even with complex reward distributions. The proposed algorithm is valuable for practical scenarios where restrictive modeling assumptions are undesirable. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/urteaga18a.html https://proceedings.mlr.press/v84/urteaga18a.html Multi-objective Contextual Bandit Problem with Similarity Information In this paper we propose the multi-objective contextual bandit problem with similarity information. This problem extends the classical contextual bandit problem with similarity information by introducing multiple and possibly conflicting objectives. Since the best arm in each objective can be different given the context, learning the best arm based on a single objective can jeopardize the rewards obtained from the other objectives. To handle this issue, we define a new performance metric, called the contextual Pareto regret, to evaluate the performance of the learner. Essentially, the contextual Pareto regret is the sum of the distances of the arms chosen by the learner to the context dependent Pareto front. 
For this problem, we develop a new online learning algorithm called Pareto Contextual Zooming (PCZ), which exploits the idea of contextual zooming to learn the arms that are close to the Pareto front for each observed context by adaptively partitioning the joint context-arm set according to the observed rewards and locations of the context-arm pairs selected in the past. Then, we prove that PCZ achieves $\tilde O (T^{(1+d_p)/(2+d_p)})$ Pareto regret where $d_p$ is the Pareto zooming dimension that depends on the size of the set of near-optimal context-arm pairs. Moreover, we show that this regret bound is nearly optimal by providing an almost matching $Ω(T^{(1+d_p)/(2+d_p)})$ lower bound. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/turgay18a.html https://proceedings.mlr.press/v84/turgay18a.html VAE with a VampPrior Many different methods to train deep generative models have been introduced in the past. In this paper, we propose to extend the variational auto-encoder (VAE) framework with a new type of prior which we call "Variational Mixture of Posteriors" prior, or VampPrior for short. The VampPrior consists of a mixture distribution (e.g., a mixture of Gaussians) with components given by variational posteriors conditioned on learnable pseudo-inputs. We further extend this prior to a two layer hierarchical model and show that this architecture with a coupled prior and posterior, learns significantly better models. The model also avoids the usual local optima issues related to useless latent dimensions that plague VAEs. We provide empirical studies on six datasets, namely, static and binary MNIST, OMNIGLOT, Caltech 101 Silhouettes, Frey Faces and Histopathology patches, and show that applying the hierarchical VampPrior delivers state-of-the-art results on all datasets in the unsupervised permutation invariant setting and the best results or comparable to SOTA methods for the approach with convolutional networks. 
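To make the VampPrior construction described above more concrete, here is a minimal sketch of evaluating a prior defined as an equally weighted mixture of variational posteriors, log p(z) = log((1/K) Σ_k q(z | u_k)), with diagonal-Gaussian components. The linear `toy_encoder`, its random weights, and the fixed pseudo-inputs are hypothetical stand-ins; in the approach the abstract describes, the pseudo-inputs are learnable and the encoder is a neural network.

```python
import numpy as np


def log_gaussian_diag(z, mean, log_var):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mean) ** 2 / np.exp(log_var),
        axis=-1,
    )


def vamp_prior_log_prob(z, pseudo_inputs, encoder):
    """Evaluate log p(z) for an equally weighted mixture of q(z | u_k)."""
    means, log_vars = encoder(pseudo_inputs)  # each of shape (K, D)
    # Broadcast N latent points against K components to get shape (N, K).
    log_q = log_gaussian_diag(z[:, None, :], means[None, :, :], log_vars[None, :, :])
    # Numerically stable log-sum-exp over the K components.
    max_log_q = log_q.max(axis=1, keepdims=True)
    log_sum = max_log_q[:, 0] + np.log(np.exp(log_q - max_log_q).sum(axis=1))
    return log_sum - np.log(pseudo_inputs.shape[0])


# Hypothetical linear "encoder" used only for illustration.
rng = np.random.default_rng(0)
W_mean = rng.normal(size=(4, 2))
W_log_var = 0.1 * rng.normal(size=(4, 2))


def toy_encoder(u):
    return u @ W_mean, u @ W_log_var


pseudo_inputs = rng.normal(size=(3, 4))  # K = 3 pseudo-inputs of dimension 4
z = rng.normal(size=(5, 2))              # N = 5 latent points of dimension 2
log_p = vamp_prior_log_prob(z, pseudo_inputs, toy_encoder)
```

With a single pseudo-input the mixture reduces to one Gaussian component, which gives a simple sanity check for the implementation.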
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/tomczak18a.html https://proceedings.mlr.press/v84/tomczak18a.html Nested CRP with Hawkes-Gaussian Processes There has been growing interest in learning social structure underlying interaction data, especially when such data consist of both temporal and textual information. In this paper, we propose a novel nonparametric Bayesian model that incorporates senders and receivers of messages into a hierarchical structure that governs the content and reciprocity of communications. We bring the nested Chinese restaurant process from nonparametric Bayesian statistics to Hawkes process models of point pattern data. By modeling senders and receivers in such a hierarchical framework, we are better able to make inferences about the authorship and audience of communications, as well as individual behavior such as favorite collaborators and top-pick words. Empirical results with our nonparametric Bayesian point process model show that our formulation has improved predictions about event times and clusters. In addition, the latent structure revealed by our model provides a useful qualitative understanding of the data, facilitating interesting exploratory analyses. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/tan18a.html https://proceedings.mlr.press/v84/tan18a.html Independently Interpretable Lasso: A New Regularizer for Sparse Regression with Uncorrelated Variables Sparse regularization such as l1 regularization is a quite powerful and widely used strategy for high dimensional learning problems. The effectiveness of sparse regularization has been supported practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. 
Ordinary l1 regularization can select variables correlated with each other, which results in deterioration of not only its generalization error but also interpretability. In this paper, we propose a new regularization method, “Independently Interpretable Lasso” (IILasso). Our proposed regularizer suppresses selecting correlated variables, and thus each active variable independently affects the objective variable in the model. Hence, we can interpret regression coefficients intuitively and also improve the performance by avoiding overfitting. We analyze the theoretical properties of IILasso and show that the proposed method is advantageous for sign recovery and achieves an almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of IILasso. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/takada18a.html https://proceedings.mlr.press/v84/takada18a.html Fast generalization error bound of deep learning from a kernel perspective We develop a new theoretical framework to analyze the generalization error of deep learning, and derive a new fast learning rate for two representative algorithms: empirical risk minimization and Bayesian deep learning. The series of theoretical analyses of deep learning has revealed its high expressive power and universal approximation capability. Our point of view is to deal with the ordinary finite dimensional deep neural network as a finite approximation of the infinite dimensional one. Our formulation of the infinite dimensional model naturally defines a reproducing kernel Hilbert space corresponding to each layer. The approximation error is evaluated by the degree of freedom of the reproducing kernel Hilbert space in each layer. We derive the generalization error bound of both empirical risk minimization and Bayesian deep learning, and show that a bias-variance trade-off appears in terms of the number of parameters of the finite dimensional approximation.
We show that the optimal width of the internal layers can be determined through the degree of freedom and derive the optimal convergence rate that is faster than $O(1/\sqrt{n})$ rate which has been shown in the existing studies. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/suzuki18a.html https://proceedings.mlr.press/v84/suzuki18a.html Efficient and principled score estimation with Nyström kernel exponential families We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite dimensional. The model is learned by fitting the derivative of the log density, the score, thus avoiding the need to compute a normalization constant. We improved the computational efficiency of an earlier solution with a low-rank, Nyström-like solution. The new solution retains the consistency and convergence rates of the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. We evaluate the method in experiments on density estimation and in the construction of an adaptive Hamiltonian Monte Carlo sampler. Compared to an existing score learning approach using a denoising autoencoder, our estimator is empirically more data-efficient when estimating the score, runs faster, and has fewer parameters (which can be tuned in a principled and interpretable way), in addition to providing statistical guarantees. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sutherland18a.html https://proceedings.mlr.press/v84/sutherland18a.html Random Subspace with Trees for Feature Selection Under Memory Constraints Dealing with datasets of very high dimension is a major challenge in machine learning. In this paper, we consider the problem of feature selection in applications where the memory is not large enough to contain all features. 
In this setting, we propose a novel tree-based feature selection approach that builds a sequence of randomized trees on small subsamples of variables mixing both variables already identified as relevant by previous models and variables randomly selected among the other variables. As our main contribution, we provide an in-depth theoretical analysis of this method in the infinite-sample setting. In particular, we study its soundness with respect to common definitions of feature relevance and its convergence speed under various variable dependence scenarios. We also provide some preliminary empirical results highlighting the potential of the approach. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sutera18a.html https://proceedings.mlr.press/v84/sutera18a.html Learning Hidden Quantum Markov Models Hidden Quantum Markov Models (HQMMs) can be thought of as quantum probabilistic graphical models that can model sequential data. We extend previous work on HQMMs with three contributions: (1) we show how classical hidden Markov models (HMMs) can be simulated on a quantum circuit, (2) we reformulate HQMMs by relaxing the constraints for modeling HMMs on quantum circuits, and (3) we present a learning algorithm to estimate the parameters of an HQMM from data. While our algorithm requires further optimization to handle larger datasets, we are able to evaluate our algorithm using several synthetic datasets generated by valid HQMMs. We show that our algorithm learns HQMMs with the same number of hidden states and predictive accuracy as the HQMMs that generated the data, while HMMs learned with the Baum-Welch algorithm require more states to match the predictive accuracy.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/srinivasan18a.html https://proceedings.mlr.press/v84/srinivasan18a.html Towards Provable Learning of Polynomial Neural Networks Using Low-Rank Matrix Estimation We study the problem of (provably) learning the weights of a two-layer neural network with quadratic activations. In particular, we focus on the under-parametrized regime where the number of neurons in the hidden layer is (much) smaller than the dimension of the input. Our approach uses a lifting trick, which enables us to borrow algorithmic ideas from low-rank matrix estimation. In this context, we propose two novel, non-convex training algorithms which do not need any extra tuning parameters other than the number of hidden neurons. We support our algorithms with rigorous theoretical analysis, and show that the proposed algorithms enjoy linear convergence, fast running time per iteration, and near-optimal sample complexity. Finally, we complement our theoretical results with several numerical experiments. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/soltani18a.html https://proceedings.mlr.press/v84/soltani18a.html A Provable Algorithm for Learning Interpretable Scoring Systems Score learning aims at taking advantage of supervised learning to produce interpretable models which facilitate decision making. Scoring systems are simple classification models that let users quickly perform stratification. Ideally, a scoring system is based on simple arithmetic operations, is sparse, and can be easily explained by human experts. In this contribution, we introduce an original methodology to simultaneously learn interpretable binning mapped to a class variable, and the weights associated with these bins contributing to the score. We develop and show the theoretical guarantees for the proposed method. 
We demonstrate by numerical experiments on benchmark data sets that our approach is competitive compared to the state-of-the-art methods. We illustrate by a real medical problem of type 2 diabetes remission prediction that a scoring system learned automatically purely from data is comparable to one manually constructed by clinicians. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sokolovska18a.html https://proceedings.mlr.press/v84/sokolovska18a.html Differentially Private Regression with Gaussian Processes A major challenge for machine learning is increasing the availability of data while respecting the privacy of individuals. Here we combine the provable privacy guarantees of the differential privacy framework with the flexibility of Gaussian processes (GPs). We propose a method using GPs to provide differentially private (DP) regression. We then improve this method by crafting the DP noise covariance structure to efficiently protect the training data, while minimising the scale of the added noise. We find that this cloaking method achieves the greatest accuracy, while still providing privacy guarantees, and offers practical DP for regression over multi-dimensional inputs. Together these methods provide a starter toolkit for combining differential privacy and GPs. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/smith18a.html https://proceedings.mlr.press/v84/smith18a.html Non-parametric estimation of Jensen-Shannon Divergence in Generative Adversarial Network training Generative Adversarial Networks (GANs) have become a widely popular framework for generative modelling of high-dimensional datasets. However, their training is well known to be difficult. This work presents a rigorous statistical analysis of GANs providing straightforward explanations for common training pathologies such as vanishing gradients.
Furthermore, it proposes a new training objective, Kernel GANs, and demonstrates its practical effectiveness on large-scale real-world data sets. A key element in the analysis is the distinction between training with respect to the (unknown) data distribution, and its empirical counterpart. To overcome issues in GAN training, we pursue the idea of smoothing the Jensen-Shannon Divergence (JSD) by incorporating noise in the input distributions of the discriminator. As we show, this effectively leads to an empirical version of the JSD in which the true and the generator densities are replaced by kernel density estimates, which leads to Kernel GANs. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sinn18a.html https://proceedings.mlr.press/v84/sinn18a.html Minimax Reconstruction Risk of Convolutional Sparse Dictionary Learning Sparse dictionary learning (SDL) has become a popular method for learning parsimonious representations of data, a fundamental problem in machine learning and signal processing. While most work on SDL assumes a training dataset of independent and identically distributed (IID) samples, a variant known as convolutional sparse dictionary learning (CSDL) relaxes this assumption to allow dependent, non-stationary sequential data sources. Recent work has explored statistical properties of IID SDL; however, the statistical properties of CSDL remain largely unstudied. This paper identifies minimax rates of CSDL in terms of reconstruction risk, providing both lower and upper bounds in a variety of settings. Our results make minimal assumptions, allowing arbitrary dictionaries and showing that CSDL is robust to dependent noise. We compare our results to similar results for IID SDL and verify our theory with synthetic experiments.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/singh18a.html https://proceedings.mlr.press/v84/singh18a.html Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/silander18a.html https://proceedings.mlr.press/v84/silander18a.html Matrix-normal models for fMRI analysis Multivariate analysis of fMRI data has benefited substantially from advances in machine learning. Most recently, a range of probabilistic latent variable models applied to fMRI data have been successful in a variety of tasks, including identifying similarity patterns in neural data, combining multi-subject datasets, and mapping between brain and behavior. Although these methods share some underpinnings, they have been developed as distinct methods, with distinct algorithms and software tools. We show how the matrix-variate normal (MN) formalism can unify some of these methods into a single framework. In doing so, we gain the ability to reuse noise modeling assumptions, algorithms, and code across models. Our primary theoretical contribution shows how some of these methods can be written as instantiations of the same model, allowing us to generalize them to flexibly model structured noise covariances.
Our formalism permits novel model variants and improved estimation strategies for SRM and RSA using substantially fewer parameters. We empirically demonstrate advantages of our two new methods: for MN-RSA, we show up to 10x improvement in runtime, up to 6x improvement in RMSE, and more conservative behavior under the null. For MN-SRM, our method grants a modest improvement to out-of-sample reconstruction while relaxing the orthonormality constraint of SRM. We also provide a software prototyping tool for MN models that can flexibly reuse noise covariance assumptions and algorithms across models. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/shvartsman18a.html https://proceedings.mlr.press/v84/shvartsman18a.html Online Ensemble Multi-kernel Learning Adaptive to Non-stationary and Adversarial Environments Kernel-based methods exhibit well-documented performance in various nonlinear learning tasks. Most of them rely on a preselected kernel, whose prudent choice presumes task-specific prior information. To cope with this limitation, multi-kernel learning has gained popularity thanks to its flexibility in choosing kernels from a prescribed kernel dictionary. Leveraging the random feature approximation and its recent orthogonality-promoting variant, the present contribution develops an online multi-kernel learning scheme to infer the intended nonlinear function ‘on the fly.’ To further boost performance in non-stationary environments, an adaptive multi-kernel learning scheme is developed with affordable computation and memory complexity. Performance is analyzed in terms of both static and dynamic regret. To the best of our knowledge, AdaRaker is the first algorithm that can optimally track nonlinear functions in non-stationary settings with strong theoretical guarantees. Numerical tests on real datasets are carried out to showcase the effectiveness of the proposed algorithms.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/shen18a.html https://proceedings.mlr.press/v84/shen18a.html Sum-Product-Quotient Networks We present a novel tractable generative model that extends Sum-Product Networks (SPNs) and significantly boosts their power. We call it Sum-Product-Quotient Networks (SPQNs), whose core concept is to incorporate conditional distributions into the model by direct computation using quotient nodes, e.g. $P(A|B) = \frac{P(A,B)}{P(B)}$. We provide sufficient conditions for the tractability of SPQNs that generalize and relax the decomposable and complete tractability conditions of SPNs. These relaxed conditions give rise to an exponential boost to the expressive efficiency of our model, i.e. we prove that there are distributions which SPQNs can compute efficiently but require SPNs to be of exponential size. Thus, we narrow the gap in expressivity between tractable graphical models and other Neural Network-based generative models. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sharir18a.html https://proceedings.mlr.press/v84/sharir18a.html Guaranteed Sufficient Decrease for Stochastic Variance Reduced Gradient Optimization In this paper, we propose a novel sufficient decrease technique for stochastic variance reduced gradient descent methods such as SVRG and SAGA. In order to make sufficient decrease for stochastic optimization, we design a new sufficient decrease criterion, which yields sufficient decrease versions of stochastic variance reduction algorithms such as SVRG-SD and SAGA-SD as a byproduct. We introduce a coefficient to scale current iterate and to satisfy the sufficient decrease property, which takes the decisions to shrink, expand or even move in the opposite direction, and then give two specific update rules of the coefficient for Lasso and ridge regression. 
Moreover, we analyze the convergence properties of our algorithms for strongly convex problems, which show that our algorithms attain linear convergence rates. We also provide the convergence guarantees of our algorithms for non-strongly convex problems. Our experimental results further verify that our algorithms achieve significantly better performance than their counterparts. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/shang18a.html https://proceedings.mlr.press/v84/shang18a.html Reducing Crowdsourcing to Graphon Estimation, Statistically Inferring the correct answers to binary tasks based on multiple noisy answers in an unsupervised manner has emerged as the canonical question for micro-task crowdsourcing or more generally aggregating opinions. In graphon estimation, one is interested in estimating edge intensities or probabilities between nodes using a single snapshot of a graph realization. In the recent literature, there has been exciting development within both of these topics. In the context of crowdsourcing, the key intellectual challenge is to understand whether a given task can be more accurately denoised by aggregating answers collected from other different tasks. In the context of graphon estimation, precise information limits and estimation algorithms remain of interest. In this paper, we utilize a statistical reduction from crowdsourcing to graphon estimation to advance the state of the art for both of these challenges. We use concepts from graphon estimation to design an algorithm that achieves better performance than the majority voting scheme for a setup that goes beyond the rank one models considered in the literature. We use known lower bounds for crowdsourcing to derive lower bounds for graphon estimation.
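For reference, the majority voting baseline that the crowdsourcing abstract above compares against can be sketched in a few lines. This shows only the baseline, not the graphon-based algorithm of the paper; the encoding of answers as +1 / -1 with 0 for a missing answer, and the tie-breaking toward +1, are assumptions made for this illustration.

```python
import numpy as np


def majority_vote(answers):
    """Aggregate binary answers per task by simple majority.

    answers: array of shape (workers, tasks) with entries +1, -1,
    or 0 when a worker did not answer a task (assumed encoding).
    Ties are broken toward +1 (an arbitrary choice for this sketch).
    """
    totals = np.asarray(answers).sum(axis=0)
    return np.where(totals >= 0, 1, -1)


# Three workers, three tasks; the last worker skipped the third task.
answers = np.array([
    [1, -1, 1],
    [1, -1, -1],
    [-1, -1, 0],
])
estimates = majority_vote(answers)
```

Majority voting treats every worker as equally reliable and every task independently, which is exactly the limitation that motivates aggregating information across tasks.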
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/shah18a.html https://proceedings.mlr.press/v84/shah18a.html Contextual Bandits with Stochastic Experts We consider the problem of contextual bandits with stochastic experts, which is a variation of the traditional stochastic contextual bandit with experts problem. In our problem setting, we assume access to a class of stochastic experts, where each expert is a conditional distribution over the arms given a context. We propose upper-confidence bound (UCB) algorithms for this problem, which employ two different importance sampling based estimators for the mean reward for each expert. Both these estimators leverage information leakage among the experts, thus using samples collected under all the experts to estimate the mean reward of any given expert. This leads to instance dependent regret bounds of $\mathcal{O}\left(λ(\pmb{μ})\mathcal{M}\log T/∆\right)$, where $λ(\pmb{μ})$ is a term that depends on the mean rewards of the experts, $∆$ is the smallest gap between the mean reward of the optimal expert and the rest, and $\mathcal{M}$ quantifies the information leakage among the experts. We show that under some assumptions $λ(\pmb{μ})$ is typically $\mathcal{O}(\log N)$. We implement our algorithm with stochastic experts generated from cost-sensitive classification oracles and show superior empirical performance on real-world datasets, when compared to other state of the art contextual bandit algorithms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sen18a.html https://proceedings.mlr.press/v84/sen18a.html Human Interaction with Recommendation Systems Many recommendation algorithms rely on user data to generate recommendations. However, these recommendations also affect the data obtained from future users. This work aims to understand the effects of this dynamic interaction. We propose a simple model where users with heterogeneous preferences arrive over time. 
Based on this model, we prove that naive estimators, i.e. those which ignore this feedback loop, are not consistent. We show that consistent estimators are efficient in the presence of myopic agents. Our results are validated using extensive simulations. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/schmit18a.html https://proceedings.mlr.press/v84/schmit18a.html Combinatorial Semi-Bandits with Knapsacks We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, support it with several motivating examples, and design an algorithm for it. Our regret bounds are comparable with those for BwK and combinatorial semi-bandits. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sankararaman18a.html https://proceedings.mlr.press/v84/sankararaman18a.html Solving lp-norm regularization with tensor kernels In this paper, we discuss how a suitable family of tensor kernels can be used to efficiently solve nonparametric extensions of lp regularized learning methods. Our main contribution is proposing a fast dual algorithm, and showing that it solves the problem efficiently. Our results contrast recent findings suggesting kernel methods cannot be extended beyond the Hilbert setting. Numerical experiments confirm the effectiveness of the method. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/salzo18a.html https://proceedings.mlr.press/v84/salzo18a.html Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models The natural gradient method has been used effectively in conjugate Gaussian process models, but the non-conjugate case has been largely unexplored.
We examine how natural gradients can be used in non-conjugate stochastic settings, together with hyperparameter learning. We conclude that the natural gradient can significantly improve performance in terms of wall-clock time. For ill-conditioned posteriors the benefit of the natural gradient method is especially pronounced, and we demonstrate a practical setting where ordinary gradients are unusable. We show how natural gradients can be computed efficiently and automatically in any parameterization, using automatic differentiation. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/salimbeni18a.html https://proceedings.mlr.press/v84/salimbeni18a.html Efficient Bandit Combinatorial Optimization Algorithm with Zero-suppressed Binary Decision Diagrams We consider bandit combinatorial optimization (BCO) problems. A BCO instance generally has a huge set of all feasible solutions, which we call the action set. To avoid dealing with such huge action sets directly, we propose an algorithm that takes advantage of zero-suppressed binary decision diagrams, which encode action sets as compact graphs. The proposed algorithm achieves either $O(T^{2/3})$ regret with high probability or $O(\sqrt{T})$ expected regret at any $T$-th round. Typically, our algorithm works efficiently for BCO problems defined on networks. Experiments show that our algorithm is applicable to various large BCO instances including adaptive routing problems on real-world networks. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/sakaue18a.html https://proceedings.mlr.press/v84/sakaue18a.html Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models Professional-grade software applications are powerful but complicated – expert users can achieve impressive results, but novices often struggle to complete even basic tasks. 
Photo editing is a prime example: after loading a photo, the user is confronted with an array of cryptic sliders like "clarity", "temp", and "highlights". An automatically generated suggestion could help, but there is no single "correct" edit for a given image – different experts may make very different aesthetic decisions when faced with the same image, and a single expert may make different choices depending on the intended use of the image (or on a whim). We therefore want a system that can propose multiple diverse, high-quality edits while also learning from and adapting to a user’s aesthetic preferences. In this work, we develop a statistical model that meets these objectives. Our model builds on recent advances in neural network generative modeling and scalable inference, and uses hierarchical structure to learn editing patterns across many diverse users. Empirically, we find that our model outperforms other approaches on this challenging multimodal prediction task. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/saeedi18a.html https://proceedings.mlr.press/v84/saeedi18a.html Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. The method is appropriate for jointly modeling hundreds of time series with widely varying, non-stationary dynamics. Given a collection of $N$ time series, the Bayesian model first partitions them into independent clusters using a Chinese restaurant process prior. Within a cluster, all time series are modeled jointly using a novel “temporally-reweighted” extension of the Chinese restaurant process mixture. Markov chain Monte Carlo techniques are used to obtain samples from the posterior distribution, which are then used to form predictive inferences. 
We apply the technique to challenging forecasting and imputation tasks using seasonal flu data from the US Centers for Disease Control and Prevention, demonstrating superior forecasting accuracy and competitive imputation accuracy as compared to multiple widely used baselines. We further show that the model discovers interpretable clusters in datasets with hundreds of time series, using macroeconomic data from the Gapminder Foundation. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/saad18a.html https://proceedings.mlr.press/v84/saad18a.html Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information Conditional independence testing is a fundamental problem underlying causal discovery and a particularly challenging task in the presence of nonlinear dependencies. Here a fully non-parametric test for continuous data based on conditional mutual information combined with a local permutation scheme is presented. Numerical experiments covering sample sizes from $50$ to $2,000$ and dimensions up to $10$ demonstrate that the test reliably generates the null distribution. For smooth nonlinear dependencies, the test has higher power than kernel-based tests in lower dimensions and similar or slightly lower power in higher dimensions. For highly non-smooth densities the data-adaptive nearest neighbor approach is particularly well-suited while kernel methods yield much lower power. The experiments also show that kernel methods utilizing an analytical approximation of the null distribution are not well-calibrated for sample sizes below $1,000$. Combining the local permutation scheme with these kernel tests leads to better calibration but lower power. For smaller sample sizes and lower dimensions, the proposed test is faster than random Fourier feature-based kernel tests if (embarrassingly) parallelized, but the runtime increases more sharply with sample size and dimensionality. 
Thus, more theoretical research to analytically approximate the null distribution and speed up the estimation is desirable. As illustrated on real data here, the test is ideally suited in combination with causal discovery algorithms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/runge18a.html https://proceedings.mlr.press/v84/runge18a.html Direct Learning to Rank And Rerank Learning-to-rank techniques have proven to be extremely useful for prioritization problems, where we rank items in order of their estimated probabilities, and dedicate our limited resources to the top-ranked items. This work exposes a serious problem with the state of learning-to-rank algorithms, which is that they are based on convex proxies that lead to poor approximations. We then discuss the possibility of "exact" reranking algorithms based on mathematical programming. We prove that a relaxed version of the "exact" problem has the same optimal solution, and provide an empirical analysis. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/rudin18a.html https://proceedings.mlr.press/v84/rudin18a.html An Analysis of Categorical Distributional Reinforcement Learning Distributional approaches to value-based reinforcement learning model the entire distribution of returns, rather than just their expected values, and have recently been shown to yield state-of-the-art empirical performance. This was demonstrated by the recently proposed C51 algorithm, based on categorical distributional reinforcement learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties of CDRL algorithms are not yet well understood. In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator in distributional RL, draw fundamental connections between CDRL and the Cramer distance, and give a proof of convergence for sample-based categorical distributional reinforcement learning algorithms. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/rowland18a.html https://proceedings.mlr.press/v84/rowland18a.html Discriminative Learning of Prediction Intervals In this work we consider the task of constructing prediction intervals in an inductive batch setting. We present a discriminative learning framework which optimizes the expected error rate under a budget constraint on the interval sizes. Most current methods for constructing prediction intervals offer guarantees for a single new test point. Applying these methods to multiple test points can result in a high computational overhead and degraded statistical guarantees. By focusing on expected errors, our method allows for variability in the per-example conditional error rates. As we demonstrate both analytically and empirically, this flexibility can increase the overall accuracy, or alternatively, reduce the average interval size. While the problem we consider is of a regressive flavor, the loss we use is combinatorial. This allows us to provide PAC-style, finite-sample guarantees. Computationally, we show that our original objective is NP-hard, and suggest a tractable convex surrogate. We conclude with a series of experimental evaluations. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/rosenfeld18b.html https://proceedings.mlr.press/v84/rosenfeld18b.html Semi-Supervised Learning with Competitive Infection Models The goal in semi-supervised learning is to effectively combine labeled and unlabeled data. One way to do this is by encouraging smoothness across edges in a graph whose nodes correspond to input examples. In many graph-based methods, labels can be thought of as propagating over the graph, where the underlying propagation mechanism is based on random walks or on averaging dynamics. While theoretically elegant, these dynamics suffer from several drawbacks which can hurt predictive performance. 
Our goal in this work is to explore alternative mechanisms for propagating labels. In particular, we propose a method based on dynamic infection processes, where unlabeled nodes can be "infected" with the label of their already infected neighbors. Our algorithm is efficient and scalable, and an analysis of the underlying optimization objective reveals a surprising relation to other Laplacian approaches. We conclude with a thorough set of experiments across multiple benchmarks and various learning settings. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/rosenfeld18a.html https://proceedings.mlr.press/v84/rosenfeld18a.html High-Dimensional Bayesian Optimization via Additive Models with Overlapping Groups Bayesian optimization (BO) is a popular technique for sequential black-box function optimization, with applications including parameter tuning, robotics, environmental monitoring, and more. One of the most important challenges in BO is the development of algorithms that scale to high dimensions, which remains a key open problem despite recent progress. In this paper, we consider the approach of Kandasamy et al. (2015), in which the high-dimensional function decomposes as a sum of lower-dimensional functions on subsets of the underlying variables. In particular, we significantly generalize this approach by lifting the assumption that the subsets are disjoint, and consider additive models with arbitrary overlap among the subsets. By representing the dependencies via a graph, we deduce an efficient message passing algorithm for optimizing the acquisition function. In addition, we provide an algorithm for learning the graph from samples based on Gibbs sampling. We empirically demonstrate the effectiveness of our methods on both synthetic and real-world data. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/rolland18a.html https://proceedings.mlr.press/v84/rolland18a.html A Generic Approach for Escaping Saddle points A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive to the state-of-the-art. Empirical results suggest that our strategy also enjoys a good practical performance. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/reddi18a.html https://proceedings.mlr.press/v84/reddi18a.html On how complexity affects the stability of a predictor Given a finite random sample from a Markov chain environment, we select a predictor that minimizes a criterion function and refer to it as being calibrated to its environment. If its prediction error is not bounded by its criterion value, we say that the criterion fails. We define the predictor’s complexity to be the amount of uncertainty in detecting that the criterion fails given that it fails. We define a predictor’s stability to be the discrepancy between the average number of prediction errors that it makes on two random samples. We show that complexity is inversely proportional to the level of adaptivity of the calibrated predictor to its random environment. 
The calibrated predictor becomes less stable as its complexity increases or as its level of adaptivity decreases. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ratsaby18a.html https://proceedings.mlr.press/v84/ratsaby18a.html Iterative Supervised Principal Components In high-dimensional prediction problems, where the number of features may greatly exceed the number of training instances, a fully Bayesian approach with a sparsifying prior is known to produce good results but is computationally challenging. To alleviate this computational burden, we propose to use a preprocessing step where we first apply a dimension reduction to the original data to reduce the number of features to a size that Bayesian methods can handle conveniently. To do this, we propose a new dimension reduction technique, called iterative supervised principal components (ISPCs), which combines variable screening and dimension reduction and can be considered as an extension to the existing technique of supervised principal components (SPCs). Our empirical evaluations confirm that, although not foolproof, the proposed approach provides very good results on several microarray benchmark datasets with very affordable computation time, and it can also be very useful for visualizing high-dimensional data. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/piironen18a.html https://proceedings.mlr.press/v84/piironen18a.html Fast Threshold Tests for Detecting Discrimination Threshold tests have recently been proposed as a useful method for detecting bias in lending, hiring, and policing decisions. For example, in the case of credit extensions, these tests aim to estimate the bar for granting loans to white and minority applicants, with a higher inferred threshold for minorities indicative of discrimination. This technique, however, requires fitting a complex Bayesian latent variable model for which inference is often computationally challenging. 
Here we develop a method for fitting threshold tests that is two orders of magnitude faster than the existing approach, reducing computation from hours to minutes. To achieve these performance gains, we introduce and analyze a flexible family of probability distributions on the interval [0, 1] – which we call discriminant distributions – that is computationally efficient to work with. We demonstrate our technique by analyzing 2.7 million police stops of pedestrians in New York City. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/pierson18a.html https://proceedings.mlr.press/v84/pierson18a.html Structured Factored Inference for Probabilistic Programming Probabilistic reasoning on complex real-world models is computationally challenging. Inference algorithms have been developed that work well on specific models or on parts of general models, but they require significant hand-engineering to apply to full-scale problems. Probabilistic programming (PP) enables the expression of rich probabilistic models, but inference remains a bottleneck in many applications. Factored inference is one of the main approaches to inference in graphical models, but has trouble scaling up to some hard problems expressible as probabilistic programs. We present structured factored inference (SFI), a framework that enables factored inference algorithms to scale to significantly more complex programs. Using models encoded in a PP language, SFI provides a sound means to decompose a model into submodels, apply an algorithm to each submodel, and combine results to answer a query. Our results show that SFI successfully reasons on models where standard factored inference algorithms fail due to computational complexity. SFI is nearly as accurate as exact inference and is as fast as approximate inference methods. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/pfeffer18a.html https://proceedings.mlr.press/v84/pfeffer18a.html Actor-Critic Fictitious Play in Simultaneous Move Multistage Games Fictitious play is a game theoretic iterative procedure meant to learn an equilibrium in normal form games. However, this algorithm requires that each player has full knowledge of other players’ strategies. Using an architecture inspired by actor-critic algorithms, we build a stochastic approximation of the fictitious play process. This procedure is on-line, decentralized (an agent has no information of others’ strategies and rewards) and applies to multistage games (a generalization of normal form games). In addition, we prove convergence of our method towards a Nash equilibrium in both the cases of zero-sum two-player multistage games and cooperative multistage games. We also provide empirical evidence of the soundness of our approach on the game of Alesia with and without function approximation. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/perolat18a.html https://proceedings.mlr.press/v84/perolat18a.html The emergence of spectral universality in deep networks Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network’s input-output Jacobian around one at initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is important to build a full theoretical understanding of the spectra of Jacobians at initialization. To this end, we leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network’s Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. 
For a variety of nonlinearities, our work reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/pennington18a.html https://proceedings.mlr.press/v84/pennington18a.html On Statistical Optimality of Variational Bayes The article addresses a long-standing open problem on the justification of using variational Bayes methods for parameter estimation. We provide general conditions for obtaining optimal risk bounds for point estimates acquired from mean-field variational Bayesian inference. The conditions pertain to the existence of certain test functions for the distance metric on the parameter space and minimal assumptions on the prior. A general recipe for verification of the conditions is outlined which is broadly applicable to existing Bayesian models with or without latent variables. As illustrations, specific applications to Latent Dirichlet Allocation and Gaussian mixture models are discussed. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/pati18a.html https://proceedings.mlr.press/v84/pati18a.html Catalyst for Gradient-based Nonconvex Optimization We introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. Even though these methods may originally require convexity to operate, the proposed approach allows one to use them without assuming any knowledge about the convexity of the objective. In general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of Nesterov and achieves near-optimal convergence rate in function values. 
We conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/paquette18a.html https://proceedings.mlr.press/v84/paquette18a.html Optimal Submodular Extensions for Marginal Estimation Submodular extensions of an energy function can be used to efficiently compute approximate marginals via variational inference. The accuracy of the marginals depends crucially on the quality of the submodular extension. To identify the best possible extension, we show an equivalence between the submodular extensions of the energy and the objective functions of linear programming (LP) relaxations for the corresponding MAP estimation problem. This allows us to (i) establish the optimality of the submodular extension for the Potts model used in the literature; (ii) identify the optimal submodular extension for the more general class of metric labeling; and (iii) efficiently compute the marginals for the widely used dense CRF model using a recently proposed Gaussian filtering method. Using both synthetic and real data, we show that our approach provides comparable upper bounds on the log-partition function to those obtained using tree-reweighted message passing (TRW) in cases where the latter is computationally feasible. Importantly, unlike TRW, our approach provides the first practical algorithm to compute an upper bound on the dense CRF model. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/pansari18a.html https://proceedings.mlr.press/v84/pansari18a.html Probability–Revealing Samples In the most popular distribution testing and parameter estimation model, one can obtain information about an underlying distribution D via independent samples from D. We introduce a model in which every sample comes with the information about the probability of selecting it. 
In this setting, we give algorithms for problems such as testing if two distributions are (approximately) identical, estimating the total variation distance between distributions, and estimating the support size. The sample complexity of all of our algorithms is optimal up to a constant factor for sufficiently large support size. The running times of our algorithms are near-linear in the number of samples collected. Additionally, our algorithms are robust to small multiplicative errors in probability estimates. The complexity of our model lies strictly between the complexity of the model where only independent samples are provided and the complexity of the model where additionally arbitrary probability queries are allowed. Our model finds applications where once a given element is sampled, it is easier to estimate its probability. We describe two scenarios in which all occurrences of each element are easy to explore once at least one copy of the element is detected. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/onak18a.html https://proceedings.mlr.press/v84/onak18a.html Spectral Algorithms for Computing Fair Support Vector Machines Classifiers and rating scores are prone to implicitly codifying biases, which may be present in the training data, against protected classes (e.g., age, gender, or race). So it is important to understand how to design classifiers and scores that prevent discrimination in predictions. This paper develops computationally tractable algorithms for designing accurate but fair support vector machines (SVMs). Our approach imposes a constraint on the covariance matrices conditioned on each protected class, which leads to a nonconvex quadratic constraint in the SVM formulation. We develop iterative algorithms to compute fair linear and kernel SVMs, which solve a sequence of relaxations constructed using a spectral decomposition of the nonconvex constraint. 
The effectiveness of these algorithms in achieving high prediction accuracy while ensuring fairness is shown through numerical experiments on several data sets. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/olfat18a.html https://proceedings.mlr.press/v84/olfat18a.html Scalable Hash-Based Estimation of Divergence Measures We propose a scalable divergence estimation method based on hashing. Consider two continuous random variables $X$ and $Y$ whose densities have bounded support. We consider a particular locality-sensitive random hashing, and consider the ratio of samples in each hash bin having non-zero numbers of $Y$ samples. We prove that the weighted average of these ratios over all of the hash bins converges to f-divergences between the two sample sets. We derive the MSE rates for two families of smooth functions: the Hölder smoothness class and differentiable functions. In particular, it is proved that if the density functions have bounded derivatives up to the order $d$, where $d$ is the dimension of samples, the optimal parametric MSE rate of $O(1/N)$ can be achieved. The computational complexity is shown to be $O(N)$, which is optimal. To the best of our knowledge, this is the first empirical divergence estimator that has optimal computational complexity and can achieve the optimal parametric MSE estimation rate of $O(1/N)$. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/noshad18a.html https://proceedings.mlr.press/v84/noshad18a.html Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models We propose a new technique that boosts the convergence of training generative adversarial networks. Generally, the rate of training deep models reduces severely after multiple iterations. A key reason for this phenomenon is that a deep network is expressed using a highly non-convex finite-dimensional model, and thus the parameter gets stuck in a local optimum. 
Because of this, methods often suffer not only from degeneration of the convergence speed but also from limitations in the representational power of the trained network. To overcome this issue, we propose an additional layer called the gradient layer to seek a descent direction in an infinite-dimensional space. Because the layer is constructed in the infinite-dimensional space, we are not restricted by the specific model structure of finite-dimensional models. As a result, we can get out of the local optima in finite-dimensional models and move towards the global optimal function more directly. In this paper, this phenomenon is explained from the functional gradient method perspective of the gradient layer. Interestingly, the optimization procedure using the gradient layer naturally constructs the deep structure of the network. Moreover, we demonstrate that this procedure can be regarded as a discretization method of the gradient flow that naturally reduces the objective function. Finally, the method is tested using several numerical experiments, which show its fast convergence. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/nitanda18a.html https://proceedings.mlr.press/v84/nitanda18a.html Why Adaptively Collected Data Have Negative Bias and How to Correct for It From scientific experiments to online A/B testing, the previously observed data often affects how future experiments are performed, which in turn affects which data will be collected. Such adaptivity introduces complex correlations between the data and the collection procedure. In this paper, we prove that when the data collection procedure satisfies natural conditions, then sample means of the data have systematic negative biases. As an example, consider an adaptive clinical trial where additional data points are more likely to be tested for treatments that show initial promise. 
Our surprising result implies that the average observed treatment effects would underestimate the true effects of each treatment. We quantitatively analyze the magnitude and behavior of this negative bias in a variety of settings. We also propose a novel debiasing algorithm based on selective inference techniques. In experiments, our method can effectively reduce bias and estimation error. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/nie18a.html https://proceedings.mlr.press/v84/nie18a.html Learning with Complex Loss Functions and Constraints We develop a general approach for solving constrained classification problems, where the loss and constraints are defined in terms of a general function of the confusion matrix. We are able to handle complex, non-linear loss functions such as the F-measure, G-mean or H-mean, and constraints ranging from budget limits, to constraints for fairness, to bounds on complex evaluation metrics. Our approach builds on the framework of Narasimhan et al. (2015) for unconstrained classification with complex losses, and reduces the constrained learning problem to a sequence of cost-sensitive learning tasks. We provide algorithms for two broad families of problems, involving convex and fractional-convex losses, subject to convex constraints. Our algorithms are statistically consistent, generalize an existing approach for fair classification, and readily apply to multiclass problems. Experiments on a variety of tasks demonstrate the efficacy of our methods. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/narasimhan18a.html https://proceedings.mlr.press/v84/narasimhan18a.html Learning Priors for Invariance Informative priors are often difficult, if not impossible, to elicit for modern large-scale Bayesian models. Yet, often, some prior knowledge is known, and this information is incorporated via engineering tricks or methods less principled than a Bayesian prior. 
However, employing these tricks is difficult to reconcile with principled probabilistic inference. For instance, in the case of data set augmentation, the posterior is conditioned on artificial data and not on what is actually observed. In this paper, we address the problem of how to specify an informative prior when the problem of interest is known to exhibit invariance properties. The proposed method is akin to posterior variational inference: we choose a parametric family and optimize to find the member of the family that makes the model robust to a given transformation. We demonstrate the method’s utility for dropout and rotation transformations, showing that the use of these priors results in performance competitive to that of non-Bayesian methods. Furthermore, our approach does not depend on the data being labeled and thus can be used in semi-supervised settings. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/nalisnick18a.html https://proceedings.mlr.press/v84/nalisnick18a.html Variational Sequential Monte Carlo Many recent advances in large scale probabilistic inference rely on variational methods. The success of variational approaches depends on (i) formulating a flexible parametric family of distributions, and (ii) optimizing the parameters to find the member of this family that most closely approximates the exact posterior. In this paper we present a new approximating family of distributions, the variational sequential Monte Carlo (VSMC) family, and show how to optimize it in variational inference. VSMC melds variational inference (VI) and sequential Monte Carlo (SMC), providing practitioners with flexible, accurate, and powerful Bayesian inference. The VSMC family is a variational family that can approximate the posterior arbitrarily well, while still allowing for efficient optimization of its parameters. 
We demonstrate its utility on state space models, stochastic volatility models for financial data, and deep Markov models of brain neural circuits. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/naesseth18a.html https://proceedings.mlr.press/v84/naesseth18a.html Generalized Binary Search For Split-Neighborly Problems In sequential hypothesis testing, Generalized Binary Search (GBS) greedily chooses the test with the highest information gain at each step. It is known that GBS obtains the gold standard query cost of O(log n) for problems satisfying the k-neighborly condition, which requires any two tests to be connected by a sequence of tests where neighboring tests disagree on at most k hypotheses. In this paper, we introduce a weaker condition, split-neighborly, which requires that for the set of hypotheses two neighbors disagree on, any subset is splittable by some test. For four problems that are not k-neighborly for any constant k, we prove that they are split-neighborly, which allows us to obtain the optimal O(log n) worst-case query cost. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mussmann18a.html https://proceedings.mlr.press/v84/mussmann18a.html Delayed Sampling and Automatic Rao-Blackwellization of Probabilistic Programs We introduce a dynamic mechanism for the solution of analytically-tractable substructure in probabilistic programs, using conjugate priors and affine transformations to reduce variance in Monte Carlo estimators. For inference with Sequential Monte Carlo, this automatically yields improvements such as locally-optimal proposals and Rao–Blackwellization. The mechanism maintains a directed graph alongside the running program that evolves dynamically as operations are triggered upon it. Nodes of the graph represent random variables, edges the analytically-tractable relationships between them. 
Random variables remain in the graph for as long as possible, to be sampled only when they are used by the program in a way that cannot be resolved analytically. In the meantime, they are conditioned on as many observations as possible. We demonstrate the mechanism with a few pedagogical examples, as well as a linear-nonlinear state-space model with simulated data, and an epidemiological model with real data of a dengue outbreak in Micronesia. In all cases one or more variables are automatically marginalized out to significantly reduce variance in estimates of the marginal likelihood, in the final case facilitating a random-weight or pseudo-marginal-type importance sampler for parameter estimation. We have implemented the approach in Anglican and a new probabilistic programming language called Birch. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/murray18a.html https://proceedings.mlr.press/v84/murray18a.html Combinatorial Preconditioners for Proximal Algorithms on Graphs We present a novel preconditioning technique for proximal optimization methods that relies on graph algorithms to construct effective preconditioners. Such combinatorial preconditioners arise from partitioning the graph into forests. We prove that certain decompositions lead to a theoretically optimal condition number. We also show how ideal decompositions can be realized using matroid partitioning and propose efficient greedy variants thereof for large-scale problems. Coupled with specialized solvers for the resulting scaled proximal subproblems, the preconditioned algorithm achieves competitive performance in machine learning and vision applications. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mollenhoff18a.html https://proceedings.mlr.press/v84/mollenhoff18a.html Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap In this paper, we study the problem of constrained and stochastic continuous submodular maximization. 
Even though the objective function is not concave (nor convex) and is defined in terms of an expectation, we develop a variant of the conditional gradient method, called Stochastic Continuous Greedy, which achieves a tight approximation guarantee. More precisely, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that Stochastic Continuous Greedy achieves a $[(1-1/e)\text{OPT} - \epsilon]$ guarantee (in expectation) with $\mathcal{O}(1/\epsilon^3)$ stochastic gradient computations. This guarantee matches the known hardness results and closes the gap between deterministic and stochastic continuous submodular maximization. By using stochastic continuous optimization as an interface, we also provide the first $(1-1/e)$ tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a general matroid constraint. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mokhtari18a.html https://proceedings.mlr.press/v84/mokhtari18a.html Competing with Automata-based Expert Sequences We consider a general framework of online learning with expert advice where regret is defined with respect to sequences of experts accepted by a weighted automaton. Our framework covers several problems previously studied, including competing against k-shifting experts. We give a series of algorithms for this problem, including an automata-based algorithm extending weighted-majority and more efficient algorithms based on the notion of failure transitions. We further present efficient algorithms based on an approximation of the competitor automaton, in particular n-gram models obtained by minimizing the $\infty$-Rényi divergence, and present an extensive study of the approximation properties of such models. Finally, we also extend our algorithms and results to the framework of sleeping experts.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mohri18a.html https://proceedings.mlr.press/v84/mohri18a.html Learning to Round for Discrete Labeling Problems Discrete labeling problems are often solved by formulating them as an integer program, and relaxing the integrality constraint to a continuous domain. While the continuous relaxation is closely related to the original integer program, its optimal solution is often fractional. Thus, the success of a relaxation depends crucially on the availability of an accurate rounding procedure. The problem of identifying an accurate rounding procedure has mainly been tackled in the theoretical computer science community through mathematical analysis of the worst-case. However, this approach is both onerous and ignores the distribution of the data encountered in practice. We present a novel interpretation of rounding procedures as sampling from a latent variable model, which opens the door to the use of powerful machine learning formulations in their design. Inspired by the recent success of deep latent variable models we parameterize rounding procedures as a neural network, which lends itself to efficient optimization via back propagation. By minimizing the expected value of the objective of the discrete labeling problem over training samples, we learn a rounding procedure that is more suited to the task at hand. Using both synthetic and real world data sets, we demonstrate that our approach can outperform the state-of-the-art hand-designed rounding procedures. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mohapatra18a.html https://proceedings.mlr.press/v84/mohapatra18a.html Submodularity on Hypergraphs: From Sets to Sequences In a nutshell, submodular functions encode an intuitive notion of diminishing returns. As a result, submodularity appears in many important machine learning tasks such as feature selection and data summarization. 
Although there has been a large volume of work devoted to the study of submodular functions in recent years, the vast majority of this work has been focused on algorithms that output sets, not sequences. However, in many settings, the order in which we output items can be just as important as the items themselves. To extend the notion of submodularity to sequences, we use a directed graph on the items where the edges encode the additional value of selecting items in a particular order. Existing theory is limited to the case where this underlying graph is a directed acyclic graph. In this paper, we introduce two new algorithms that provably give constant factor approximations for general graphs and hypergraphs having bounded in or out degrees. Furthermore, we show the utility of our new algorithms for real-world applications in movie recommendation, online link prediction, and the design of course sequences for MOOCs. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mitrovic18a.html https://proceedings.mlr.press/v84/mitrovic18a.html The Power Mean Laplacian for Multilayer Graph Clustering Multilayer graphs encode different kinds of interactions between the same set of entities. When one wants to cluster such a multilayer graph, the natural question arises of how one should merge the information from different layers. In this paper, we introduce a one-parameter family of matrix power means for merging the Laplacians from different layers and analyze it in expectation in the stochastic block model. We show that this family allows one to recover ground truth clusters under different settings and verify this on real-world data. While the matrix power mean is expensive to compute, we introduce a scalable numerical scheme that efficiently computes the eigenvectors of the matrix power mean of large sparse graphs.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/mercado18a.html https://proceedings.mlr.press/v84/mercado18a.html Generalized Concomitant Multi-Task Lasso for Sparse Multimodal Regression In high dimension, it is customary to consider Lasso-type estimators to enforce sparsity. For standard Lasso theory to hold, the regularization parameter should be proportional to the noise level, which is often unknown in practice. A remedy is to consider estimators such as the Concomitant Lasso, which jointly optimize over the regression coefficients and the noise level. However, when data from different sources are pooled to increase sample size, noise levels differ and new dedicated estimators are needed. We provide new statistical and computational solutions to perform heteroscedastic regression, with an emphasis on functional brain imaging with magneto- and electroencephalography (M/EEG). When instantiated to de-correlated noise, our framework leads to an efficient algorithm whose computational cost is not higher than for the Lasso, but addresses more complex noise structures. Experiments demonstrate improved prediction and support identification with correct estimation of noise levels. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/massias18a.html https://proceedings.mlr.press/v84/massias18a.html Practical Bayesian optimization in the presence of outliers Inference in the presence of outliers is an important field of research as outliers are ubiquitous and may arise across a variety of problems and domains. Bayesian optimization is a method that heavily relies on probabilistic inference. This allows outstanding sample efficiency because the probabilistic machinery provides a memory of the whole optimization process. However, that virtue becomes a disadvantage when the memory is populated with outliers, inducing bias in the estimation.
In this paper, we present an empirical evaluation of Bayesian optimization methods in the presence of outliers. The empirical evidence shows that Bayesian optimization with robust regression often produces suboptimal results. We then propose a new algorithm which combines robust regression (a Gaussian process with Student-t likelihood) with outlier diagnostics to classify data points as outliers or inliers. By using a scheduler for the classification of outliers, our method is more efficient and converges better than standard robust regression. Furthermore, we show that even in controlled situations with no expected outliers, our method is able to produce better results. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/martinez-cantin18a.html https://proceedings.mlr.press/v84/martinez-cantin18a.html A Fast Algorithm for Separated Sparsity via Perturbed Lagrangians Sparsity-based methods are widely used in machine learning, statistics, and signal processing. There is now a rich class of structured sparsity approaches that expand the modeling power of the sparsity paradigm and incorporate constraints such as group sparsity, graph sparsity, or hierarchical sparsity. While these sparsity models offer improved sample complexity and better interpretability, the improvements come at a computational cost: it is often challenging to optimize over the (non-convex) constraint sets that capture various sparsity structures. In this paper, we make progress in this direction in the context of separated sparsity – a fundamental sparsity notion that captures exclusion constraints in linearly ordered data such as time series. While prior algorithms for computing a projection onto this constraint set required quadratic time, we provide a perturbed Lagrangian relaxation approach that computes a provably exact projection in only nearly-linear time.
Although the sparsity constraint is nonconvex, our perturbed Lagrangian approach is still guaranteed to find a globally optimal solution. In experiments, our new algorithms already offer a 10x speed-up on moderately sized inputs. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/madry18a.html https://proceedings.mlr.press/v84/madry18a.html Teacher Improves Learning by Selecting a Training Subset We call a learner super-teachable if a teacher can trim down an iid training set while making the learner learn even better. We provide sharp super-teaching guarantees on two learners: the maximum likelihood estimator for the mean of a Gaussian, and the large margin classifier in 1D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super-teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ma18a.html https://proceedings.mlr.press/v84/ma18a.html Boosting Variational Inference: an Optimization Perspective Variational inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, boosting variational inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. However, as is the case with many other variational inference algorithms, its theoretical properties have not been studied. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. Our analysis yields novel theoretical insights regarding the sufficient conditions for convergence, explicit rates, and algorithmic simplifications.
Since a lot of focus in previous works for variational inference has been on tractability, our work is especially important as a much-needed attempt to bridge the gap between probabilistic models and their corresponding theoretical properties. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/locatello18a.html https://proceedings.mlr.press/v84/locatello18a.html Zeroth-Order Online Alternating Direction Method of Multipliers: Convergence Analysis and Applications In this paper, we design and analyze a new zeroth-order online algorithm, namely, the zeroth-order online alternating direction method of multipliers (ZOO-ADMM), which enjoys the dual advantages of gradient-free operation and of employing the ADMM to accommodate complex structured regularizers. Compared to the first-order gradient-based online algorithm, we show that ZOO-ADMM requires $\sqrt{m}$ times more iterations, leading to a convergence rate of $O(\sqrt{m}/\sqrt{T})$, where $m$ is the number of optimization variables, and $T$ is the number of iterations. To accelerate ZOO-ADMM, we propose two minibatch strategies: gradient sample averaging and observation averaging, resulting in an improved convergence rate of $O(\sqrt{1+q^{-1}m}/\sqrt{T})$, where $q$ is the minibatch size. In addition to convergence analysis, we also demonstrate applications of ZOO-ADMM in signal processing, statistics, and machine learning. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/liu18a.html https://proceedings.mlr.press/v84/liu18a.html Reparameterizing the Birkhoff Polytope for Variational Permutation Inference Many matching, tracking, sorting, and ranking problems require probabilistic reasoning about possible permutations, a set that grows factorially with dimension. Combinatorial optimization algorithms may enable efficient point estimation, but fully Bayesian inference poses a severe challenge in this high-dimensional, discrete space.
To surmount this challenge, we start by relaxing the discrete set of permutation matrices to its convex hull, the Birkhoff polytope, which is the set of doubly-stochastic matrices. We then introduce two novel transformations: an invertible and differentiable stick-breaking procedure that maps unconstrained space to the Birkhoff polytope, and a map that rounds points toward the vertices of the polytope. Both transformations include a temperature parameter that, in the limit, concentrates the densities on permutation matrices. We exploit these transformations and reparameterization gradients to introduce variational inference over permutation matrices, and we demonstrate its utility in a series of experiments. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/linderman18a.html https://proceedings.mlr.press/v84/linderman18a.html Labeled Graph Clustering via Projected Gradient Descent Advances in recovering low-rank matrices from noisy observations have led to tractable algorithms for clustering from general pairwise labels with provable performance guarantees. Based on convex relaxation, it has been shown that the ground truth clusters can be recovered with high probability under a generalized stochastic block model by solving a semidefinite program. Although tractable, the algorithm is typically too slow for sufficiently large problems in practice. Inspired by recent advances in non-convex approaches to low-rank recovery problems, we propose an algorithm based on projected gradient descent that enjoys provable guarantees similar to those of its convex counterpart, but can be orders of magnitude faster. Our theoretical results are further supported by encouraging empirical results. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/lim18b.html https://proceedings.mlr.press/v84/lim18b.html Multi-scale Nystrom Method Kernel methods are powerful tools for modeling nonlinear data.
However, the amount of computation and memory required for kernel methods becomes the bottleneck when dealing with large-scale problems. In this paper, we propose the Nested Nystrom Method (NNM), which achieves a delicate balance between approximation accuracy and computational efficiency by exploiting the multilayer structure and multiple compressions. Even when the size of the kernel matrix is very large, NNM consistently decomposes very small matrices to update the eigen-decomposition of the kernel matrix. We theoretically show that NNM implicitly updates the principal subspace through the multiple layers, and also prove that its corresponding errors of rank-k PSD matrix approximation and kernel PCA (KPCA) are decreased by using additional sublayers before the final layer. Finally, we empirically demonstrate the decreasing property of errors of NNM with the additional sublayers through experiments on kernel matrices constructed from real data sets, and show that NNM effectively controls the efficiency both for rank-k PSD matrix approximation and KPCA. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/lim18a.html https://proceedings.mlr.press/v84/lim18a.html Stochastic Multi-armed Bandits in Constant Space We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all $K$ arms. We give an algorithm using $O(1)$ words of space with regret $\sum_{i=1}^{K}\frac{1}{\Delta_i}\log \frac{\Delta_i}{\Delta}\log T$ where $\Delta_i$ is the gap between the best arm and arm $i$ and $\Delta$ is the gap between the best and the second-best arms. If the rewards are bounded away from $0$ and $1$, this is within an $O(\log (1/\Delta))$ factor of the optimum regret possible without space constraints.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/liau18a.html https://proceedings.mlr.press/v84/liau18a.html Nonlinear Weighted Finite Automata Weighted finite automata (WFA) can expressively model functions defined over strings but are inherently linear models. Given the recent successes of non-linear models in machine learning, it is natural to wonder whether extending WFA to the non-linear setting would be beneficial. In this paper, we propose a novel neural-network-based nonlinear WFA model (NL-WFA) along with a learning algorithm. Our learning algorithm is inspired by the spectral learning algorithm for WFA and relies on a non-linear decomposition of the so-called Hankel matrix, by means of an auto-encoder network. The expressive power of NL-WFA and the proposed learning algorithm are assessed on both synthetic and real world data, showing that NL-WFA can infer complex grammatical structures from data. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/li18a.html https://proceedings.mlr.press/v84/li18a.html Multiphase MCMC Sampling for Parameter Inference in Nonlinear Ordinary Differential Equations Traditionally, ODE parameter inference relies on solving the system of ODEs and assessing fit of the estimated signal with the observations. However, nonlinear ODEs often do not permit closed form solutions. Using numerical methods to solve the equations results in prohibitive computational costs, particularly when one adopts a Bayesian approach in sampling parameters from a posterior distribution. With the introduction of gradient matching, we can abandon the need to numerically solve the system of equations. Inherent in these efficient procedures is an introduction of bias to the learning problem as we no longer sample based on the exact likelihood function. This paper presents a multiphase MCMC approach that attempts to close the gap between efficiency and accuracy.
By sampling using a surrogate likelihood, we accelerate convergence to the stationary distribution before sampling using the exact likelihood. We demonstrate that this method combines the efficiency of gradient matching and the accuracy of the exact likelihood scheme. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/lazarus18a.html https://proceedings.mlr.press/v84/lazarus18a.html Bayesian Approaches to Distribution Regression Distribution regression has recently attracted much interest as a generic solution to the problem of supervised learning where labels are available at the group level, rather than at the individual level. Current approaches, however, do not propagate the uncertainty in observations due to sampling variability in the groups. This effectively assumes that small and large groups are estimated equally well, and should have equal weight in the final regression. We account for this uncertainty with a Bayesian distribution regression formalism, improving the robustness and performance of the model when group sizes vary. We frame our models in a neural network style, allowing for simple MAP inference using backpropagation to learn the parameters, as well as MCMC-based inference which can fully propagate uncertainty. We demonstrate our approach on illustrative toy datasets, as well as on a challenging problem of predicting age from images. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/law18a.html https://proceedings.mlr.press/v84/law18a.html A Nonconvex Proximal Splitting Algorithm under Moreau-Yosida Regularization We tackle highly nonconvex, nonsmooth composite optimization problems whose objectives comprise a Moreau-Yosida regularized term. Classical nonconvex proximal splitting algorithms, such as nonconvex ADMM, suffer from lack of convergence for such a problem class. 
To overcome this difficulty, in this work we consider a lifted variant of the Moreau-Yosida regularized model and propose a novel multiblock primal-dual algorithm that intrinsically stabilizes the dual block. We provide a complete convergence analysis of our algorithm and identify respective optimality qualifications under which stationarity of the original model is retrieved at convergence. Numerically, we demonstrate the relevance of Moreau-Yosida regularized models and the efficiency of our algorithm on robust regression as well as joint feature selection and semi-supervised learning. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/laude18a.html https://proceedings.mlr.press/v84/laude18a.html Optimality of Approximate Inference Algorithms on Stable Instances Approximate algorithms for structured prediction problems—such as LP relaxations and the popular α-expansion algorithm (Boykov et al. 2001)—typically far exceed their theoretical performance guarantees on real-world instances. These algorithms often find solutions that are very close to optimal. The goal of this paper is to partially explain the performance of α-expansion and an LP relaxation algorithm on MAP inference in Ferromagnetic Potts models (FPMs). Our main results give stability conditions under which these two algorithms provably recover the optimal MAP solution. These theoretical results complement numerous empirical observations of good performance. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/lang18a.html https://proceedings.mlr.press/v84/lang18a.html Linear Stochastic Approximation: How Far Does Constant Step-Size and Iterate Averaging Go? Temporal difference learning algorithms such as TD(0) and GTD in reinforcement learning (RL) and the stochastic gradient descent (SGD) for linear prediction are linear stochastic approximation (LSA) algorithms. These algorithms make only $O(d)$ ($d$ is parameter dimension) computations per iteration. 
In the design of LSA algorithms, step-size choice is critical, and is often tuned in an ad-hoc manner. In this paper, we study a constant step-size averaged linear stochastic approximation (CALSA) algorithm, and for a given class of problems, we ask whether properties of $i)$ a universal constant step-size and $ii)$ a uniform fast rate of $\frac{C}{t}$ for the mean square-error hold for all instances of the class, where the constant $C>0$ does not depend on the problem instance. We show that the answer to these questions, in general, is no. However, we show that the TD(0) and CAGTD algorithms, with a problem-independent universal constant step-size and iterate averaging, achieve an asymptotic fast rate of $O(\frac{1}{t})$. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/lakshminarayanan18a.html https://proceedings.mlr.press/v84/lakshminarayanan18a.html Convex Optimization over Intersection of Simple Sets: improved Convergence Rate Guarantees via an Exact Penalty Approach We consider the problem of minimizing a convex function over the intersection of finitely many simple sets which are easy to project onto. This is an important problem arising in various domains such as machine learning. The main difficulty lies in finding the projection of a point onto the intersection of many sets. Existing approaches yield an infeasible point with an iteration-complexity of $O(1/ε^2)$ for nonsmooth problems with no guarantees on the infeasibility. By reformulating the problem through exact penalty functions, we derive first-order algorithms which not only guarantee that the distance to the intersection is small but also improve the complexity to $O(1/ε)$ and $O(1/\sqrt{ε})$ for smooth functions. For composite and smooth problems, this is achieved through a saddle-point reformulation where the proximal operators required by the primal-dual algorithms can be computed in closed form.
We illustrate the benefits of our approach on a graph transduction problem and on graph matching. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kundu18a.html https://proceedings.mlr.press/v84/kundu18a.html On the challenges of learning with inference networks on sparse, high-dimensional data We study parameter estimation in Nonlinear Factor Analysis (NFA) where the generative model is parameterized by a deep neural network. Recent work has focused on learning such models using inference (or recognition) networks; we identify a crucial problem when modeling large, sparse, high-dimensional datasets – underfitting. We study the extent of underfitting, highlighting that its severity increases with the sparsity of the data. We propose methods to tackle it via iterative optimization inspired by stochastic variational inference (Hoffman et al., 2013) and improvements in the data representation used for inference. The proposed techniques drastically improve the ability of these powerful models to fit sparse data, achieving state-of-the-art results on a benchmark text-count dataset and excellent results on the task of top-N recommendation. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/krishnan18a.html https://proceedings.mlr.press/v84/krishnan18a.html Robust Active Label Correction Active label correction addresses the problem of learning from input data for which noisy labels are available (e.g., from imprecise measurements or crowd-sourcing) and each true label can be obtained at a significant cost (e.g., through additional measurements or human experts). To minimize these costs, we are interested in identifying training patterns for which knowing the true labels maximally improves the learning performance. We approximate the true label noise by a model that learns the aspects of the noise that are class-conditional (i.e., independent of the input given the observed label). 
To select labels for correction, we adopt the active learning strategy of maximizing the expected model change. We consider the change in regularized empirical risk functionals that use different pointwise loss functions for patterns with noisy and true labels, respectively. Different loss functions for the noisy data lead to different active label correction algorithms. If loss functions consider the label noise rates, these rates are estimated during learning, where importance weighting compensates for the sampling bias. We show empirically that viewing the true label as a latent variable and computing the maximum likelihood estimate of the model parameters performs well across all considered problems. A maximum a posteriori estimate of the model parameters was beneficial in most test cases. An image classification experiment using convolutional neural networks demonstrates that the class-conditional noise model, which can be learned efficiently, can guide re-labeling in real-world applications. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kremer18a.html https://proceedings.mlr.press/v84/kremer18a.html Outlier Detection and Robust Estimation in Nonparametric Regression This paper studies outlier detection and robust estimation for nonparametric regression problems. We propose to include a subject-specific mean shift parameter for each data point such that a nonzero parameter will identify its corresponding data point as an outlier. We adopt a regularization approach by imposing a roughness penalty on the regression function and a shrinkage penalty on the mean shift parameter. An efficient algorithm has been proposed to solve the double penalized regression problem. We discuss a data-driven simultaneous choice of two regularization parameters based on a combination of generalized cross validation and modified Bayesian information criterion. We show that the proposed method can consistently detect the outliers. 
In addition, we obtain minimax-optimal convergence rates for both the regression function and the mean shift parameter under regularity conditions. The estimation procedure is shown to enjoy the oracle property in the sense that the convergence rates agree with the minimax-optimal rates when the outliers (or regression function) are known in advance. Numerical results demonstrate that the proposed method has desired performance in identifying outliers under different scenarios. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kong18a.html https://proceedings.mlr.press/v84/kong18a.html Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation Across a variety of scientific disciplines, sparse inverse covariance estimation is a popular tool for capturing the underlying dependency relationships in multivariate data. Unfortunately, most estimators are not scalable enough to handle the sizes of modern high-dimensional data sets (often on the order of terabytes), and assume Gaussian samples. To address these deficiencies, we introduce HP-CONCORD, a highly scalable optimization method for estimating a sparse inverse covariance matrix based on a regularized pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal gradient method uses a novel communication-avoiding linear algebra algorithm and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving parallel scalability on problems with up to ≈819 billion parameters (1.28 million dimensions); even on a single node, HP-CONCORD demonstrates scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to estimate the underlying dependency structure of the brain from fMRI data, and use the result to identify functional regions automatically. The results show good agreement with a clustering from the neuroscience literature. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/koanantakool18a.html https://proceedings.mlr.press/v84/koanantakool18a.html Nonparametric Sharpe Ratio Function Estimation in Heteroscedastic Regression Models via Convex Optimization We consider maximum likelihood estimation (MLE) of heteroscedastic regression models based on a new “parametrization” of the likelihood in terms of the Sharpe ratio function, or the ratio of the mean and volatility functions. While with a standard parametrization the MLE problem is not convex and hence hard to solve globally, our parametrization leads to a functional that is jointly convex in the Sharpe ratio and inverse volatility functions. The major difficulty with the resulting infinite-dimensional convex program is the shape constraint on the inverse volatility function. We propose to solve the problem by solving a sequence of finite-dimensional convex programs with increasing dimensions, which can be done globally and efficiently. We demonstrate that, when the goal is to estimate the Sharpe ratio function directly, the finite-sample performance of the proposed estimation method is superior to existing methods that estimate the mean and variance functions separately. When applied to a financial dataset, our method captures a well-known covariate-dependent effect on the Sharpe ratio. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kim18b.html https://proceedings.mlr.press/v84/kim18b.html Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes Automating statistical modelling is a challenging problem in artificial intelligence. The Automatic Statistician employs a kernel search algorithm using Gaussian Processes (GP) to provide interpretable statistical models for regression problems. However, this does not scale due to its O(N^3) running time for the model selection. 
We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm that extends the Automatic Statistician to bigger data sets. In doing so, we derive a cheap upper bound on the GP marginal likelihood that is used in SKC with the variational lower bound to sandwich the marginal likelihood. We show that the upper bound is significantly tighter than the lower bound and useful for model selection. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kim18a.html https://proceedings.mlr.press/v84/kim18a.html IHT dies hard: Provable accelerated Iterative Hard Thresholding We study – both in theory and practice – the use of momentum motions in classic iterative hard thresholding (IHT) methods. By simply modifying plain IHT, we investigate its convergence behavior on convex optimization criteria with non-convex constraints, under standard assumptions. In diverse scenarios, we observe that acceleration in IHT leads to significant improvements, compared to state-of-the-art projected gradient descent and Frank-Wolfe variants. As a byproduct of our inspection, we study the impact of selecting the momentum parameter: similar to convex settings, two modes of behavior are observed – “rippling” and linear – depending on the level of momentum. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/khanna18a.html https://proceedings.mlr.press/v84/khanna18a.html Comparison Based Learning from Weak Oracles There is increasing interest in learning algorithms that involve interaction between human and machine. Comparison-based queries are among the most natural ways to get feedback from humans. A challenge in designing comparison-based interactive learning algorithms is coping with noisy answers. The most common fix is to submit a query several times, but this is not applicable in many situations due to its prohibitive cost and due to the unrealistic assumption of independent noise in different repetitions of the same query. 
In this paper, we introduce a new weak oracle model, where a non-malicious user responds to a pairwise comparison query only when she is quite sure about the answer. This model is able to mimic the behavior of a human in noise-prone regions. We also consider the application of this weak oracle model to the problem of content search (a variant of the nearest neighbor search problem) through comparisons. More specifically, we aim at devising efficient algorithms to locate a target object in a database equipped with a dissimilarity metric via invocation of the weak comparison oracle. We propose two algorithms termed Worcs-I and Worcs-II (Weak-Oracle Comparison-based Search), which provably locate the target object in a number of comparisons close to the entropy of the target distribution. While Worcs-I provides better theoretical guarantees, Worcs-II is applicable to more technically challenging scenarios where the algorithm has limited access to the ranking dissimilarity between objects. A series of experiments validate the performance of our proposed algorithms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kazemi18a.html https://proceedings.mlr.press/v84/kazemi18a.html Nonparametric Preference Completion We consider the task of collaborative preference completion: given a pool of items, a pool of users and a partially observed item-user rating matrix, the goal is to recover the personalized ranking of each user over all of the items. Our approach is nonparametric: we assume that each item i and each user u have unobserved features x_i and y_u, and that the associated rating is given by $g_u(f(x_i,y_u))$ where f is Lipschitz and g_u is a monotonic transformation that depends on the user. We propose a k-nearest neighbors-like algorithm and prove that it is consistent. To the best of our knowledge, this is the first consistency result for the collaborative preference completion problem in a nonparametric setting. 
Finally, we demonstrate the performance of our algorithm with experiments on the Netflix and Movielens datasets. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/katz-samuels18a.html https://proceedings.mlr.press/v84/katz-samuels18a.html Adaptive Sampling for Coarse Ranking We consider the problem of active coarse ranking, where the goal is to sort items according to their means into clusters of pre-specified sizes, by adaptively sampling from their reward distributions. This setting is useful in many social science applications involving human raters and the approximate rank of every item is desired. Approximate or coarse ranking can significantly reduce the number of ratings required in comparison to the number needed to find an exact ranking. We propose a computationally efficient PAC algorithm LUCBRank for coarse ranking, and derive an upper bound on its sample complexity. We also derive a nearly matching distribution-dependent lower bound. Experiments on synthetic as well as real-world data show that LUCBRank performs better than state-of-the-art baseline methods, even when these methods have the advantage of knowing the underlying parametric model. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/katariya18a.html https://proceedings.mlr.press/v84/katariya18a.html Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite number of loss functions. The present paper proposes a Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR). The key challenges of averaging, adding, and subtracting multiple gradients are addressed with notions of retraction and vector transport. We present convergence analyses of R-SQN-VR on both non-convex and retraction-convex functions under retraction and vector transport operators. 
The proposed algorithm is evaluated on the Karcher mean computation on the symmetric positive-definite manifold and the low-rank matrix completion on the Grassmann manifold. In all cases, the proposed algorithm outperforms the state-of-the-art Riemannian batch and stochastic gradient algorithms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kasai18a.html https://proceedings.mlr.press/v84/kasai18a.html Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems First-order optimization methods comprise two important primitives: i) the computation of gradient information and ii) the computation of the update that leads to the next iterate. In practice there is often a wide mismatch between the time required for the two steps, leading to underutilization of resources. In this work, we propose a new framework, Approx Composite Minimization (ACM) that uses approximate update steps to ensure balance between the two operations. The accuracy is adaptively chosen in an online fashion to take advantage of changing conditions. Our unified analysis for approximate composite minimization generalizes and extends previous work to new settings. Numerical experiments on Lasso regression and SVMs demonstrate the effectiveness of the novel scheme. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/karimireddy18a.html https://proceedings.mlr.press/v84/karimireddy18a.html Factor Analysis on a Graph A graph is a common way to represent relationships among a set of objects in a variety of application areas of machine learning. We consider the case that the input data is not only a graph but also numerical features in which one of the given features corresponds to a node in the graph. Then, the primary importance is often in understanding interactions on the graph nodes that affect the covariance structure of the numerical features. 
We propose a Gaussian based analysis which is a combination of graph constrained covariance matrix estimation and factor analysis (FA). We show that this approach, called graph FA, has desirable interpretability. In particular, we prove the connection between graph FA and a graph node clustering based on a perspective of kernel method. This connection indicates that graph FA is effective not only on the conventional noise-reduction explanation of the observation by FA but also on identifying important subgraphs. The experiments on synthetic and real-world datasets demonstrate the effectiveness of the approach. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/karasuyama18a.html https://proceedings.mlr.press/v84/karasuyama18a.html Parallelised Bayesian Optimisation via Thompson Sampling We design and analyse variations of the classical Thompson sampling (TS) procedure for Bayesian optimisation (BO) in settings where function evaluations are expensive but can be performed in parallel. Our theoretical analysis shows that a direct application of the sequential Thompson sampling algorithm in either synchronous or asynchronous parallel settings yields a surprisingly powerful result: making $n$ evaluations distributed among $M$ workers is essentially equivalent to performing $n$ evaluations in sequence. Further, by modelling the time taken to complete a function evaluation, we show that, under a time constraint, asynchronous parallel TS achieves asymptotically lower regret than both the synchronous and sequential versions. These results are complemented by an experimental analysis, showing that asynchronous TS outperforms a suite of existing parallel BO algorithms in simulations and in an application involving tuning hyper-parameters of a convolutional neural network. In addition to these, the proposed procedure is conceptually much simpler than existing work for parallel BO. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kandasamy18a.html https://proceedings.mlr.press/v84/kandasamy18a.html Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control Trial-and-error based reinforcement learning (RL) has seen rapid advancements in recent times, especially with the advent of deep neural networks. However, the majority of autonomous RL algorithms require a large number of interactions with the environment. A large number of interactions may be impractical in many real-world applications, such as robotics, and many practical systems have to obey limitations in the form of state space or control constraints. To reduce the number of system interactions while simultaneously handling constraints, we propose a model-based RL framework based on probabilistic Model Predictive Control (MPC). In particular, we propose to learn a probabilistic transition model using Gaussian Processes (GPs) to incorporate model uncertainty into long-term predictions, thereby, reducing the impact of model errors. We then use MPC to find a control sequence that minimises the expected long-term cost. We provide theoretical guarantees for first-order optimality in the GP-based transition models with deterministic approximate inference for long-term planning. We demonstrate that our approach does not only achieve state-of-the-art data efficiency, but also is a principled way for RL in constrained environments. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kamthe18a.html https://proceedings.mlr.press/v84/kamthe18a.html Policy Evaluation and Optimization with Continuous Treatments We study the problem of policy evaluation and learning from batched contextual bandit data when treatments are continuous, going beyond previous work on discrete treatments. 
Previous work for discrete treatment/action spaces focuses on inverse probability weighting (IPW) and doubly robust (DR) methods that use a rejection sampling approach for evaluation and the equivalent weighted classification problem for learning. In the continuous setting, this reduction fails as we would almost surely reject all observations. To tackle the case of continuous treatments, we extend the IPW and DR approaches to the continuous setting using a kernel function that leverages treatment proximity to attenuate discrete rejection. Our policy estimator is consistent and we characterize the optimal bandwidth. The resulting continuous policy optimizer (CPO) approach using our estimator achieves convergent regret and approaches the best-in-class policy for learnable policy classes. We demonstrate that the estimator performs well and, in particular, outperforms a discretization-based benchmark. We further study the performance of our policy optimizer in a case study on personalized dosing based on a dataset of Warfarin patients, their covariates, and final therapeutic doses. Our learned policy outperforms benchmarks and nears the oracle-best linear policy. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kallus18a.html https://proceedings.mlr.press/v84/kallus18a.html Nonparametric Bayesian sparse graph linear dynamical systems A nonparametric Bayesian sparse graph linear dynamical system (SGLDS) is proposed to model sequentially observed multivariate data. SGLDS uses the Bernoulli-Poisson link together with a gamma process to generate an infinite dimensional sparse random graph to model state transitions. Depending on the sparsity pattern of the corresponding row and column of the graph affinity matrix, a latent state of SGLDS can be categorized as either a non-dynamic state or a dynamic one. 
A normal-gamma construction is used to shrink the energy captured by the non-dynamic states, while the dynamic states can be further categorized into live, absorbing, or noise-injection states, which capture different types of dynamical components of the underlying time series. The state-of-the-art performance of SGLDS is demonstrated with experiments on both synthetic and real data. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/kalantari18a.html https://proceedings.mlr.press/v84/kalantari18a.html Online Boosting Algorithms for Multi-label Ranking We consider the multi-label ranking approach to multi-label learning. Boosting is a natural method for multi-label ranking as it aggregates weak predictions through majority votes, which can be directly used as scores to produce a ranking of the labels. We design online boosting algorithms with provable loss bounds for multi-label ranking. We show that our first algorithm is optimal in terms of the number of learners required to attain a desired accuracy, but it requires knowledge of the edge of the weak learners. We also design an adaptive algorithm that does not require this knowledge and is hence more practical. Experimental results on real data sets demonstrate that our algorithms are at least as good as existing batch boosting algorithms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/jung18a.html https://proceedings.mlr.press/v84/jung18a.html Efficient Bayesian Methods for Counting Processes in Partially Observable Environments When sensors that count events are unreliable, the data sets that result cannot be trusted. We address this common problem by developing practical Bayesian estimators for a partially observable Poisson process (POPP). 
Unlike Bayesian estimation for a fully observable Poisson process (FOPP) this is non-trivial, since there is no conjugate density for a POPP and the posterior has a number of elements that grow exponentially in the number of observed intervals. We present two tractable approximations, which we combine in a switching filter. This switching filter enables efficient and accurate estimation of the posterior. We perform a detailed empirical analysis, using both simulated and real-world data. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/jovan18a.html https://proceedings.mlr.press/v84/jovan18a.html Approximate Bayesian Computation with Kullback-Leibler Divergence as Data Discrepancy Complex simulator-based models usually have intractable likelihood functions, rendering the likelihood-based inference methods inapplicable. Approximate Bayesian Computation (ABC) emerges as an alternative framework of likelihood-free inference methods. It identifies a quasi-posterior distribution by finding values of parameter that simulate the synthetic data resembling the observed data. A major ingredient of ABC is the discrepancy measure between the observed and the simulated data, which conventionally involves a fundamental difficulty of constructing effective summary statistics. To bypass this difficulty, we adopt a Kullback-Leibler divergence estimator to assess the data discrepancy. Our method enjoys the asymptotic consistency and linearithmic time complexity as the data size increases. In experiments on five benchmark models, this method achieves a comparable or higher quasi-posterior quality, compared to the existing methods using other discrepancy measures. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/jiang18a.html https://proceedings.mlr.press/v84/jiang18a.html Scalable Generalized Dynamic Topic Models Dynamic topic models (DTMs) model the evolution of prevalent themes in literature, online media, and other forms of text over time. 
DTMs assume that word co-occurrence statistics change continuously and therefore impose continuous stochastic process priors on their model parameters. These dynamical priors make inference much harder than in regular topic models, and also limit scalability. In this paper, we present several new results around DTMs. First, we extend the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs). This allows us to explore topics that develop smoothly over time, that have a long-term memory or are temporally concentrated (for event detection). Second, we show how to perform scalable approximate inference in these models based on ideas around stochastic variational inference and sparse Gaussian processes. This way we can train a rich family of DTMs to massive data. Our experiments on several large-scale datasets show that our generalized model allows us to find interesting patterns that were not accessible by previous approaches. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/jahnichen18a.html https://proceedings.mlr.press/v84/jahnichen18a.html Scalable Gaussian Processes with Billions of Inducing Inputs via Tensor Train Decomposition We propose a method (TT-GP) for approximate inference in Gaussian Process (GP) models. We build on previous scalable GP research including stochastic variational inference based on inducing inputs, kernel interpolation, and structure exploiting algebra. The key idea of our method is to use Tensor Train decomposition for variational parameters, which allows us to train GPs with billions of inducing inputs and achieve state-of-the-art results on several benchmarks. Further, our approach allows for training kernels based on deep neural networks without any modifications to the underlying GP model. A neural network learns a multidimensional embedding for the data, which is used by the GP to make the final prediction. 
We train GP and neural network parameters end-to-end without pretraining, through maximization of GP marginal likelihood. We show the efficiency of the proposed approach on several regression and classification benchmark datasets including MNIST, CIFAR-10, and Airline. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/izmailov18a.html https://proceedings.mlr.press/v84/izmailov18a.html Online Regression with Partial Information: Generalization and Linear Projection We investigate an online regression problem in which the learner makes predictions sequentially while only the limited information on features is observable. In this paper, we propose a general setting for the limitation of the available information, where the observed information is determined by a function chosen from a given set of observation functions. Our problem setting is a generalization of the online sparse linear regression problem, which has been actively studied. For our general problem, we present an algorithm by combining multi-armed bandit algorithms and online learning methods. This algorithm admits a sublinear regret bound when the number of observation functions is constant. We also show that the dependency on the number of observation functions is inevitable unless additional assumptions are adopted. To mitigate this inefficiency, we focus on a special case of practical importance, in which the observed information is expressed through linear combinations of the original features. We propose efficient algorithms for this special case. Finally, we also demonstrate the efficiency of the proposed algorithms by simulation studies using both artificial and real data. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ito18a.html https://proceedings.mlr.press/v84/ito18a.html Statistically Efficient Estimation for Non-Smooth Probability Densities We investigate statistical efficiency of estimators for non-smooth density functions. 
The density estimation problem appears in various situations, and it is intensively used in statistics and machine learning. The statistical efficiencies of estimators, i.e., their convergence rates, play a central role in advanced statistical analysis. Although estimators and their convergence rates for smooth density functions are well investigated in the literature, those for non-smooth density functions remain elusive despite their importance in application fields. In this paper, we propose new estimators for non-smooth density functions by employing the notion of Szemeredi partitions from graph theory. We derive convergence rates of the proposed estimators. One of them has the optimal convergence rate in minimax sense, and the other has slightly worse convergence rate but runs in polynomial time. Experimental results support the theoretical performance of our estimators. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/imaizumi18a.html https://proceedings.mlr.press/v84/imaizumi18a.html Multi-view Metric Learning in Vector-valued Kernel Spaces We consider the problem of metric learning for multi-view data and present a novel method for learning within-view as well as between-view metrics in vector-valued kernel spaces, as a way to capture multi-modal structure of the data. We formulate two convex optimization problems to jointly learn the metric and the classifier or regressor in kernel feature spaces. An iterative three-step multi-view metric learning algorithm is derived from the optimization problems. In order to scale the computation to large training sets, a block-wise Nyström approximation of the multi-view kernel matrix is introduced. We justify our approach theoretically and experimentally, and show its performance on real-world datasets against relevant state-of-the-art methods. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/huusari18a.html https://proceedings.mlr.press/v84/huusari18a.html Semi-Supervised Prediction-Constrained Topic Models Supervisory signals can help topic models discover low-dimensional data representations which are useful for a specific prediction task. We propose a framework for training supervised latent Dirichlet allocation that balances two goals: faithful generative explanations of high-dimensional data and accurate prediction of associated class labels. Existing approaches fail to balance these goals by not properly handling a fundamental asymmetry: the intended application is always predicting labels from data, not data from labels. Our new prediction-constrained objective for training generative models coherently integrates supervisory signals even when only a small fraction of training examples are labeled. We demonstrate improved prediction quality compared to previous supervised topic models, achieving results competitive with high-dimensional logistic regression on text analysis and electronic health records tasks while simultaneously learning interpretable topics. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/hughes18a.html https://proceedings.mlr.press/v84/hughes18a.html A Unified Dynamic Approach to Sparse Model Selection Sparse model selection is ubiquitous from linear regression to graphical models where regularization paths, as a family of estimators upon the regularization parameter varying, are computed when the regularization parameter is unknown or decided data-adaptively. Traditional computational methods rely on solving a set of optimization problems where the regularization parameters are fixed on a grid that might be inefficient. In this paper, we introduce a simple iterative regularization path, which follows the dynamics of a sparse Mirror Descent algorithm or a generalization of Linearized Bregman Iterations with nonlinear loss. 
Its performance is competitive to glmnet with a further bias reduction. A path consistency theory is presented that under the Restricted Strong Convexity and the Irrepresentable Condition, the path will first evolve in a subspace with no false positives and reach an estimator that is sign-consistent or of minimax optimal $\ell_2$ error rate. Early stopping regularization is required to prevent overfitting. Application examples are given in sparse logistic regression and Ising models for NIPS coauthorship. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/huang18a.html https://proceedings.mlr.press/v84/huang18a.html SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning We propose an efficient dual augmented Lagrangian formulation to learn conditional random fields (CRF). Our algorithm, which can be interpreted as an inexact gradient descent algorithm on the multipliers, does not require to perform global inference iteratively and requires only a fixed number of stochastic clique-wise updates at each epoch to obtain a sufficiently good estimate of the gradient w.r.t. the Lagrange multipliers. We prove that the proposed algorithm enjoys global linear convergence for both the primal and the dual objectives. Our experiments show that the proposed algorithm outperforms state-of-the-art baselines in terms of the speed of convergence. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/hu18a.html https://proceedings.mlr.press/v84/hu18a.html Cheap Checking for Cloud Computing: Statistical Analysis via Annotated Data Streams As the popularity of outsourced computation increases, questions of accuracy and trust between the client and the cloud computing services become ever more relevant. Our work aims to provide fast and practical methods to verify analysis of large data sets, where the client’s computation and memory costs are kept to a minimum. 
Our verification protocols are based on defining “proofs” which are easy to create and check. These add only a small overhead to reporting the result of the computation itself. We build up a series of protocols for elementary statistical methods, to create more complex protocols for Ordinary Least Squares, Principal Component Analysis and Linear Discriminant Analysis, and show them to be very efficient in practice. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/hickey18a.html https://proceedings.mlr.press/v84/hickey18a.html Gaussian Process Subset Scanning for Anomalous Pattern Detection in Non-iid Data Identifying anomalous patterns in real-world data is essential for understanding where, when, and how systems deviate from their expected dynamics. Yet methods that separately consider the anomalousness of each individual data point have low detection power for subtle, emerging irregularities. Additionally, recent detection techniques based on subset scanning make strong independence assumptions and suffer degraded performance in correlated data. We introduce methods for identifying anomalous patterns in non-iid data by combining Gaussian processes with novel log-likelihood ratio statistic and subset scanning techniques. Our approaches are powerful, interpretable, and can integrate information across multiple streams. We illustrate their performance on numeric simulations and three open source spatiotemporal datasets of opioid overdose deaths, 311 calls, and storm reports. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/herlands18a.html https://proceedings.mlr.press/v84/herlands18a.html Approximate Ranking from Pairwise Comparisons A common problem in machine learning is to rank a set of n items based on pairwise comparison. Here, ranking refers to partitioning the items into sets of pre-specified sizes according to their scores, which includes identification of the top-k items as the most prominent special case. 
The score of a given item is defined as the probability that it beats a randomly chosen other item. In practice, in particular when n is large, finding an exact ranking typically requires a prohibitively large number of comparisons. What comes to our rescue here is that in practice, one is usually content with finding an approximate ranking. In this paper we consider the problem of finding approximate rankings from pairwise comparisons. We analyze an active ranking algorithm that counts the number of comparisons won, and decides whether to stop or which pair of items to compare next, based on confidence intervals computed from the data collected in previous steps. We show that this algorithm succeeds in recovering approximate rankings using a number of comparisons that is close to optimal up to logarithmic factors. We also present numerical results, showing that in practice, approximation can drastically reduce the number of comparisons required to estimate a ranking. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/heckel18a.html https://proceedings.mlr.press/v84/heckel18a.html Derivative Free Optimization Via Repeated Classification We develop a procedure for minimizing a function using $n$ batched function value measurements at each of $T$ rounds by using classifiers to identify a function’s sublevel set. We show that sufficiently accurate classifiers can achieve linear convergence rates, and show that the convergence rate is tied to the difficulty of active learning sublevel sets. Further, we show that the bootstrap is a computationally efficient approximation to the necessary classification scheme. The end result is a computationally efficient method requiring no tuning that consistently outperforms other methods on simulations, standard benchmarks, real-world DNA binding optimization, and airfoil design problems where batched function queries are natural. 
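The idea in the abstract just above (batched function queries at each round, with a classifier used to identify a sublevel set) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function name `minimize_by_classification`, the below-median labeling rule, and the use of an axis-aligned bounding box as the "classifier" are hypothetical stand-ins, not the classifiers or the bootstrap scheme from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def minimize_by_classification(
    f: Callable[[Sequence[float]], float],
    box: Box,
    n: int = 40,
    rounds: int = 8,
    seed: int = 0,
):
    """Toy batched minimizer: each round labels the better half of a batch
    as 'inside the sublevel set' and shrinks the search region to the
    smallest axis-aligned box containing those points (an assumed,
    deliberately crude classifier)."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    widths = []
    for _ in range(rounds):
        # Query a batch of n points drawn uniformly from the current region.
        batch = [[rng.uniform(lo, hi) for lo, hi in box] for _ in range(n)]
        values = [f(x) for x in batch]
        # Remember the best point seen so far.
        for x, v in zip(batch, values):
            if v < best_val:
                best_x, best_val = x, v
        # Label points at or below the batch median as positives.
        threshold = sorted(values)[n // 2]
        positives = [x for x, v in zip(batch, values) if v <= threshold]
        # "Fit" the box classifier: bounding box of the positive points.
        box = [
            (min(p[d] for p in positives), max(p[d] for p in positives))
            for d in range(len(box))
        ]
        widths.append(max(hi - lo for lo, hi in box))
    return best_x, best_val, widths


def sphere(x: Sequence[float]) -> float:
    # Hypothetical test objective with its minimum at (0.3, -0.2).
    return (x[0] - 0.3) ** 2 + (x[1] + 0.2) ** 2


best_x, best_val, widths = minimize_by_classification(
    sphere, [(-1.0, 1.0), (-1.0, 1.0)]
)
print(best_x, best_val, widths)
```

Because each new box is built from points sampled inside the previous one, the search region can only shrink from round to round; a richer classifier would allow non-box sublevel sets at the cost of a harder sampling step.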
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/hashimoto18a.html https://proceedings.mlr.press/v84/hashimoto18a.html Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach Tree ensembles, such as random forests, are renowned for their high prediction performance. However, their interpretability is critically limited due to the enormous complexity. In this study, we propose a method to make a complex tree ensemble interpretable by simplifying the model. Specifically, we formalize the simplification of tree ensembles as a model selection problem. Given a complex tree ensemble, we aim at obtaining the simplest representation that is essentially equivalent to the original one. To this end, we derive a Bayesian model selection algorithm that optimizes the simplified model while maintaining the prediction performance. Our numerical experiments on several datasets showed that complicated tree ensembles were approximated interpretably. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/hara18a.html https://proceedings.mlr.press/v84/hara18a.html Exploiting Strategy-Space Diversity for Batch Bayesian Optimization This paper proposes a novel approach to batch Bayesian optimisation using a multi-objective optimisation framework with exploitation and exploration forming two objectives. The key advantage of this approach is that it uses a suite of strategies to balance exploration and exploitation and thus can efficiently handle the optimisation of a variety of functions with small to large number of local extrema. Another advantage is that it automatically determines the batch size within a specified budget avoiding unnecessary function evaluations. Theoretical analysis shows that the regret not only reduces sub-linearly but also by an additional reduction factor determined by the batch size. 
We demonstrate the efficiency of our algorithm by optimising a variety of benchmark functions, performing hyperparameter tuning of support vector regression and classification, and finally optimising the heat treatment process of an Al-Sc alloy. Comparisons with recent baseline algorithms confirm the usefulness of our algorithm. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gupta18a.html https://proceedings.mlr.press/v84/gupta18a.html Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond For Markov chain Monte Carlo methods, one of the greatest discrepancies between theory and system is the scan order: while most theoretical development on the mixing time analysis deals with random updates, real-world systems are implemented with systematic scans. We bridge this gap for models that exhibit a bipartite structure, including, most notably, the Restricted/Deep Boltzmann Machine. The de facto implementation for these models scans variables in a layer-wise fashion. We show that the Gibbs sampler with a layerwise alternating scan order has its relaxation time (in terms of epochs) no larger than that of a random-update Gibbs sampler (in terms of variable updates). We also construct examples to show that this bound is asymptotically tight. Through standard inequalities, our result also implies a comparison on the mixing times. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/guo18a.html https://proceedings.mlr.press/v84/guo18a.html Asynchronous Doubly Stochastic Group Regularized Learning Group regularized learning problems (such as group Lasso) are important in machine learning. Asynchronous parallel stochastic optimization algorithms have recently received considerable attention for handling large-scale problems. However, existing asynchronous stochastic algorithms for solving the group regularized learning problems are not scalable enough simultaneously in sample size and feature dimensionality.
To address this challenging problem, in this paper, we propose a novel asynchronous doubly stochastic proximal gradient algorithm with variance reduction (AsyDSPG+). To the best of our knowledge, AsyDSPG+ is the first asynchronous doubly stochastic proximal gradient algorithm, which can scale well with the large sample size and high feature dimensionality simultaneously. More importantly, we provide a comprehensive convergence guarantee to AsyDSPG+. The experimental results on various large-scale real-world datasets not only confirm the fast convergence of our new method, but also show that AsyDSPG+ scales better than the existing algorithms with the sample size and dimension simultaneously. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gu18a.html https://proceedings.mlr.press/v84/gu18a.html Plug-in Estimators for Conditional Expectations and Probabilities We study plug-in estimators of conditional expectations and probabilities, and we provide a systematic analysis of their rates of convergence. The plug-in approach is particularly useful in this setting since it introduces a natural link to VC- and empirical process theory. We make use of this link to derive rates of convergence that hold uniformly over large classes of functions and sets, and under various conditions. For instance, we demonstrate that elementary conditional probabilities are estimated by these plug-in estimators with a rate of $n^{α-1/2}$ if one conditions with a VC-class of sets and where $α∈[0,1/2)$ controls a lower bound on the size of sets we can estimate given n samples. We gain similar results for Kolmogorov’s conditional expectation and probability which generalize the elementary forms of conditioning. Due to their simplicity, plug-in estimators can be evaluated in linear time and there is no up-front cost for inference. 
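For the elementary case described in the abstract just above, a plug-in estimator replaces the probabilities in P(A | B) = P(A ∩ B) / P(B) with empirical frequencies, which takes a single pass over the data. The sketch below is only an illustration of that idea; the function names and the predicate-based interface are assumptions, not the paper's notation.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def plug_in_conditional_probability(
    samples: Iterable[T],
    in_a: Callable[[T], bool],
    in_b: Callable[[T], bool],
) -> float:
    """Estimate P(A | B) as (#samples in A and B) / (#samples in B)."""
    count_b = 0
    count_ab = 0
    for x in samples:  # one linear pass, no fitting step beforehand
        if in_b(x):
            count_b += 1
            if in_a(x):
                count_ab += 1
    if count_b == 0:
        raise ValueError("no sample falls in the conditioning set B")
    return count_ab / count_b


def plug_in_conditional_expectation(
    samples: Iterable[T],
    f: Callable[[T], float],
    in_b: Callable[[T], bool],
) -> float:
    """Estimate E[f(X) | X in B] as the average of f over samples in B."""
    total = 0.0
    count_b = 0
    for x in samples:
        if in_b(x):
            total += f(x)
            count_b += 1
    if count_b == 0:
        raise ValueError("no sample falls in the conditioning set B")
    return total / count_b


data = list(range(1, 11))  # toy sample: the integers 1..10
p = plug_in_conditional_probability(data, lambda x: x % 2 == 0, lambda x: x > 4)
e = plug_in_conditional_expectation(data, float, lambda x: x > 4)
print(p, e)  # 0.5 7.5
```

The guard against an empty conditioning set reflects the abstract's remark that the achievable rate depends on a lower bound on the size of the sets one conditions on.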
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/grunewalder18a.html https://proceedings.mlr.press/v84/grunewalder18a.html Factorial HMMs with Collapsed Gibbs Sampling for Optimizing Long-term HIV Therapy Combined antiretroviral therapies can successfully suppress HIV in the serum and bring its viral load below detection rate. However, drug resistance remains a major challenge. As resistance patterns vary between patients, therapy personalization is required. Automatic systems for therapy personalization exist and were shown to better predict therapy outcome than HIV experts in some settings. However, these systems focus only on selecting the therapy most likely to suppress the virus for several weeks, a choice that may be suboptimal over the longer term due to evolution of drug resistance. We present a novel generative model for HIV drug resistance evolution. This model is based on factorial HMMs, applying a novel collapsed Gibbs Sampling algorithm for approximate learning. Using the suggested model, we obtain better therapy outcome predictions than existing methods and recommend therapies that may be more effective in the long term. We demonstrate our results using simulated data and using real data from the EuResist dataset. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gruber18a.html https://proceedings.mlr.press/v84/gruber18a.html Best arm identification in multi-armed bandits with delayed feedback In this paper, we propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedbacks. The delay in feedbacks increases the effective sample complexity of the algorithm, but can be offset by partial feedbacks received before a pull is completed. 
We propose a general modeling framework to capture structure in the partial feedbacks, and as a special case we introduce efficient algorithms for best arm identification in settings where the partial feedbacks are biased or unbiased estimators of the final outcome of the pull. Additionally, we propose a novel extension of the algorithms to the parallel MAB setting where an agent can control a batch of arms. Experiments on simulated as well as real world datasets of policy search for charging chemical batteries and hyperparameter optimization for mixed integer programming demonstrate that exploiting the structure of partial and delayed feedbacks can lead to significant improvements over baselines on both sequential and parallel MAB. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/grover18b.html https://proceedings.mlr.press/v84/grover18b.html Variational Rejection Sampling Learning latent variable models with stochastic variational inference is challenging when the approximate posterior is far from the true posterior, due to high variance in the gradient estimates. We propose a novel rejection sampling step that discards samples from the variational posterior which are assigned low likelihoods by the model. Our approach provides an arbitrarily accurate approximation of the true posterior at the expense of extra computation. Using a new gradient estimator for the resulting unnormalized proposal distribution, we achieve average improvements of 3.71 nats and 0.31 nats over state-of-the-art single-sample and multi-sample alternatives respectively for estimating marginal log-likelihoods using sigmoid belief networks on the MNIST dataset. We show both theoretically and empirically how explicitly rejecting samples, while seemingly challenging to analyze due to the implicit nature of the resulting unnormalized proposal distribution, can have benefits over the competing state-of-the-art alternatives based on importance weighting.
We demonstrate the effectiveness of the proposed approach via experiments on synthetic data and a benchmark density estimation task with sigmoid belief networks over the MNIST dataset. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/grover18a.html https://proceedings.mlr.press/v84/grover18a.html Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods Our goal is to improve variance reducing stochastic methods through better control variates. We first propose a modification of SVRG which uses the Hessian to track gradients over time, rather than to recondition, increasing the correlation of the control variates and leading to faster theoretical convergence close to the optimum. We then propose accurate and computationally efficient approximations to the Hessian, using both a diagonal and a low-rank matrix. Finally, we demonstrate the effectiveness of our method on a wide range of problems. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gower18a.html https://proceedings.mlr.press/v84/gower18a.html Weighted Tensor Decomposition for Learning Latent Variables with Partial Data Tensor decomposition methods are popular tools for learning latent variables given only lower-order moments of the data. However, the standard assumption is that we have sufficient data to estimate these moments to high accuracy. In this work, we consider the case in which certain dimensions of the data are not always observed (common in applied settings, where not all measurements may be taken for all observations), resulting in moment estimates of varying quality. We derive a weighted tensor decomposition approach that is computationally as efficient as the non-weighted approach, and demonstrate that it outperforms methods that do not appropriately leverage these less-observed dimensions.
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gottesman18a.html https://proceedings.mlr.press/v84/gottesman18a.html Frank-Wolfe Splitting via Augmented Lagrangian Method Minimizing a function over an intersection of convex sets is an important task in optimization that is often much more challenging than minimizing it over each individual constraint set. While traditional methods such as Frank-Wolfe (FW) or proximal gradient descent assume access to a linear or quadratic oracle on the intersection, splitting techniques take advantage of the structure of each set, and only require access to the oracle on the individual constraints. In this work, we develop and analyze the Frank-Wolfe Augmented Lagrangian (FW-AL) algorithm, a method for minimizing a smooth function over convex compact sets related by a “linear consistency” constraint that only requires access to a linear minimization oracle over the individual constraints. It is based on the Augmented Lagrangian Method (ALM), also known as Method of Multipliers, but unlike most existing splitting methods, it only requires access to linear (instead of quadratic) minimization oracles. We use recent advances in the analysis of Frank-Wolfe and the alternating direction method of multipliers algorithms to prove a sublinear convergence rate for FW-AL over general convex compact sets and a linear convergence rate for polytopes. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gidel18a.html https://proceedings.mlr.press/v84/gidel18a.html Learning Sparse Polymatrix Games in Polynomial Time and Sample Complexity We consider the problem of learning sparse polymatrix games from observations of strategic interactions.
We show that a polynomial time method based on $\ell_{1,2}$-group regularized logistic regression recovers a game, whose Nash equilibria are the $ε$-Nash equilibria of the game from which the data was generated (true game), in $O(m^4 d^4 \log (pd))$ samples of strategy profiles — where $m$ is the maximum number of pure strategies of a player, $p$ is the number of players, and $d$ is the maximum degree of the game graph. Under slightly more stringent separability conditions on the payoff matrices of the true game, we show that our method learns a game with the exact same Nash equilibria as the true game. We also show that $Ω(d \log (pm))$ samples are necessary for any method to consistently recover a game, with the same Nash-equilibria as the true game, from observations of strategic interactions. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ghoshal18b.html https://proceedings.mlr.press/v84/ghoshal18b.html Learning linear structural equation models in polynomial time and sample complexity The problem of learning structural equation models (SEMs) from data is a fundamental problem in causal inference. We develop a new algorithm — which is computationally and statistically efficient and works in the high-dimensional regime — for learning linear SEMs from purely observational data with arbitrary noise distribution. We consider three aspects of the problem: identifiability, computational efficiency, and statistical efficiency. We show that when data is generated from a linear SEM over p nodes and maximum Markov blanket size d, our algorithm recovers the directed acyclic graph (DAG) structure of the SEM under an identifiability condition that is more general than those considered in the literature, and without faithfulness assumptions. In the population setting, our algorithm recovers the DAG structure in $O(p(d + \log p))$ operations. 
In the finite sample setting, if the estimated precision matrix is sparse, our algorithm has a smoothed complexity of $\tilde{O}(p^3 + pd^{4})$, while if the estimated precision matrix is dense, our algorithm has a smoothed complexity of $\tilde{O}(p^5)$. For sub-Gaussian and bounded ($4m$-th, $m$ being a positive integer) moment noise, our algorithm has a sample complexity of $\mathcal{O}(\frac{d^4}{\varepsilon^2} \log (\frac{p}{\sqrt{δ}}))$ and $\mathcal{O}(\frac{d^4}{\varepsilon^2} (\frac{p^2}{δ})^{1/m})$ respectively, to achieve $\varepsilon$ element-wise additive error with respect to the true autoregression matrix with probability at least $1 - δ$. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ghoshal18a.html https://proceedings.mlr.press/v84/ghoshal18a.html Learning Generative Models with Sinkhorn Divergences The ability to compare two degenerate probability distributions, that is two distributions supported on low-dimensional manifolds in much higher-dimensional spaces, is a crucial factor in the estimation of generative models. It is therefore no surprise that optimal transport (OT) metrics and their ability to handle measures with non-overlapping supports have emerged as a promising tool. Yet, training generative machines using OT raises formidable computational and statistical challenges, because of (i) the computational burden of evaluating OT losses, (ii) their instability and lack of smoothness, (iii) the difficulty to estimate them, as well as their gradients, in high dimension. This paper presents the first tractable method to train large scale generative models using an OT-based loss called Sinkhorn loss which tackles these three issues by relying on two key ideas: (a) entropic smoothing, which turns the original OT loss into a differentiable and more robust quantity that can be computed using Sinkhorn fixed point iterations; (b) algorithmic (automatic) differentiation of these iterations with seamless GPU execution.
Additionally, entropic smoothing generates a family of losses interpolating between Wasserstein (OT) and Energy distance/Maximum Mean Discrepancy (MMD) losses, thus making it possible to find a sweet spot leveraging the geometry of OT on the one hand, and the favorable high-dimensional sample complexity of MMD, which comes with unbiased gradient estimates, on the other. The resulting computational architecture nicely complements standard deep network generative models by a stack of extra layers implementing the loss function. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/genevay18a.html https://proceedings.mlr.press/v84/genevay18a.html Turing: A Language for Flexible Probabilistic Inference Probabilistic programming promises to simplify and democratize probabilistic machine learning, but successful probabilistic programming systems require flexible, generic and efficient inference engines. In this work, we present a system called Turing for building MCMC algorithms for probabilistic programming inference. Turing has a very simple syntax and makes full use of the numerical capabilities in the Julia programming language, including all implemented probability distributions, and automatic differentiation. Turing supports a wide range of popular Monte Carlo algorithms, including Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling, sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers. Most importantly, Turing inference is composable: it combines MCMC operations on subsets of variables, for example using a combination of an HMC engine and a particle Gibbs (PG) engine. We explore several combinations of inference methods with the aim of finding approaches that are both efficient and universal, i.e. applicable to arbitrary probabilistic models. NUTS, a popular variant of HMC that adapts the Hamiltonian simulation path length automatically, is quite powerful for exploring differentiable target distributions but is not universal.
We identify some failure modes for the NUTS engine, and demonstrate that composition of PG (for discrete variables) and NUTS (for continuous variables) can be useful when the NUTS engine is either not applicable, or simply does not work well. Our aim is to present Turing and its composable inference engines to the world and encourage other researchers to build on this system to help advance the field of probabilistic machine learning. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ge18b.html https://proceedings.mlr.press/v84/ge18b.html Minimax-Optimal Privacy-Preserving Sparse PCA in Distributed Systems This paper proposes a distributed privacy-preserving sparse PCA (DPS-PCA) algorithm that generates a minimax-optimal sparse PCA estimator under differential privacy constraints. In a distributed optimization framework, data providers can use this algorithm to collaboratively analyze the union of their data sets while limiting the disclosure of their private information. DPS-PCA can recover the leading eigenspace of the population covariance at a geometric convergence rate, and simultaneously achieves the optimal minimax statistical error for high-dimensional data. Our algorithm provides fine-tuned control over the tradeoff between estimation accuracy and privacy preservation. Numerical simulations demonstrate that DPS-PCA significantly outperforms other privacy-preserving PCA methods in terms of estimation accuracy and computational efficiency. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ge18a.html https://proceedings.mlr.press/v84/ge18a.html Product Kernel Interpolation for Scalable Gaussian Processes Recent work shows that inference for Gaussian processes can be performed efficiently using iterative methods that rely only on matrix-vector multiplications (MVMs). Structured Kernel Interpolation (SKI) exploits these techniques by deriving approximate kernels with very fast MVMs. 
Unfortunately, such strategies suffer badly from the curse of dimensionality. We develop a new technique for MVM-based learning that exploits product kernel structure. We demonstrate that this technique is broadly applicable, resulting in linear rather than exponential runtime with dimension for SKI, as well as state-of-the-art asymptotic complexity for multi-task GPs. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gardner18a.html https://proceedings.mlr.press/v84/gardner18a.html Online Learning with Non-Convex Losses and Non-Stationary Regret In this paper, we consider online learning with non-convex loss functions. Similar to Besbes et al. [2015], we apply non-stationary regret as the performance metric. In particular, we study the regret bounds under different assumptions on the information available regarding the loss functions. When the gradient of the loss function at the decision point is available, we propose an online normalized gradient descent algorithm (ONGD) to solve the online learning problem. In another situation, when only the value of the loss function is available, we propose a bandit online normalized gradient descent algorithm (BONGD). Under a condition to be called weak pseudo-convexity (WPC), we show that both algorithms achieve a cumulative regret bound of $O(\sqrt{T+V_T T})$, where $V_T$ is the total temporal variations of the loss functions, thus establishing a sublinear regret bound for online learning with non-convex loss functions and non-stationary regret measure. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/gao18a.html https://proceedings.mlr.press/v84/gao18a.html Variational Inference based on Robust Divergences Robustness to outliers is a central issue in real-world machine learning applications. While replacing a model with a heavy-tailed one (e.g., from Gaussian to Student-t) is a standard approach for robustification, it can only be applied to simple models.
In this paper, based on Zellner’s optimization and variational formulation of Bayesian inference, we propose an outlier-robust pseudo-Bayesian variational method by replacing the Kullback-Leibler divergence used for data fitting with a robust divergence such as the beta- and gamma-divergences. An advantage of our approach is that superior but complex models such as deep networks can also be handled. We theoretically prove that, for deep networks with ReLU activation functions, the influence function in our proposed method is bounded, while it is unbounded in the ordinary variational inference. This implies that our proposed method is robust to both input and output outliers, while the ordinary variational method is not. We experimentally demonstrate that our robust variational method outperforms ordinary variational inference in regression and classification with deep networks. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/futami18a.html https://proceedings.mlr.press/v84/futami18a.html Robustness of classifiers to uniform $\ell_p$ and Gaussian noise We study the robustness of classifiers to various kinds of random noise models. In particular, we consider noise drawn uniformly from the $\ell_p$ ball for $p ∈ [1, ∞]$ and Gaussian noise with an arbitrary covariance matrix. We characterize this robustness to random noise in terms of the distance to the decision boundary of the classifier. This analysis applies to linear classifiers as well as classifiers with locally approximately flat decision boundaries, a condition which is satisfied by state-of-the-art deep neural networks. The predicted robustness is verified experimentally. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/franceschi18a.html https://proceedings.mlr.press/v84/franceschi18a.html Mixed Membership Word Embeddings for Computational Social Science Word embeddings improve the performance of NLP systems by revealing the hidden structural relationships between words.
Despite their success in many applications, word embeddings have seen very little use in computational social science NLP tasks, presumably due to their reliance on big data, and to a lack of interpretability. I propose a probabilistic model-based word embedding method which can recover interpretable embeddings, without big data. The key insight is to leverage mixed membership modeling, in which global representations are shared, but individual entities (i.e. dictionary words) are free to use these representations to uniquely differing degrees. I show how to train the model using a combination of state-of-the-art training techniques for word embeddings and topic models. The experimental results show an improvement in predictive language modeling of up to 63% in MRR over the skip-gram, and demonstrate that the representations are beneficial for supervised learning. I illustrate the interpretability of the models with computational social science case studies on State of the Union addresses and NIPS articles. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/foulds18a.html https://proceedings.mlr.press/v84/foulds18a.html Inference in Sparse Graphs with Pairwise Measurements and Side Information We consider the statistical problem of recovering a hidden "ground truth" binary labeling for the vertices of a graph up to low Hamming error from noisy edge and vertex measurements. We present new algorithms and a sharp finite-sample analysis for this problem on trees and sparse graphs with poor expansion properties such as hypergrids and ring lattices. Our method generalizes and improves over that of Globerson et al. (2015), who introduced the problem for two-dimensional grid lattices. For trees we provide a simple, efficient algorithm that infers the ground truth with optimal Hamming error, has optimal sample complexity, and implies recovery results for all connected graphs.
Here, the presence of side information is critical to obtain a non-trivial recovery rate. We then show how to adapt this algorithm to tree decompositions of edge-subgraphs of certain graph families such as lattices, resulting in optimal recovery error rates that can be obtained efficiently. The thrust of our analysis is to 1) use the tree decomposition along with edge measurements to produce a small class of viable vertex labelings and 2) apply an analysis influenced by statistical learning theory to show that we can infer the ground truth from this class using vertex measurements. We show the power of our method in several examples including hypergrids, ring lattices, and the Newman-Watts model for small world graphs. For two-dimensional grids, our results improve over Globerson et al. (2015) by obtaining optimal recovery in the constant-height regime. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/foster18a.html https://proceedings.mlr.press/v84/foster18a.html Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and k-means One iteration of standard k-means (i.e., Lloyd’s algorithm) or standard EM for Gaussian mixture models (GMMs) scales linearly with the number of clusters C, data points N, and data dimensionality D. In this study, we explore whether one iteration of k-means or EM for GMMs can scale sublinearly with C at run-time, while remaining effective at improving the clustering objective. The tool we apply for complexity reduction is variational EM, which is typically used to make training of generative models with exponentially many hidden states tractable. Here, we apply novel theoretical results on truncated variational EM to make tractable clustering algorithms more efficient. The basic idea is to use a partial variational E-step which reduces the linear complexity of O(NCD) required for a full E-step to a sublinear complexity.
Our main observation is that the linear dependency on C can be reduced to a dependency on a much smaller parameter G which relates to cluster neighborhood relations. We focus on two versions of partial variational EM for clustering: variational GMM, scaling with O(NG²D), and variational k-means, scaling with O(NGD) per iteration. Empirical results show that these algorithms still require comparable numbers of iterations to improve the clustering objective to the same values as k-means. For data with many clusters, we consequently observe reductions of net computational demands between two and three orders of magnitude. More generally, our results provide substantial empirical evidence in favor of clustering to scale sublinearly with C. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/forster18a.html https://proceedings.mlr.press/v84/forster18a.html Efficient Weight Learning in High-Dimensional Untied MLNs Existing techniques for improving scalability of weight learning in Markov Logic Networks (MLNs) are typically effective when the parameters of the MLN are tied, i.e., several ground formulas in the MLN share the same weight. However, to improve accuracy in real-world problems, we typically need to learn separate weights for different groundings of the MLN. In this paper, we present an approach to perform efficient weight learning in MLNs containing high-dimensional, untied formulas. The fundamental idea in our approach is to help the learning algorithm navigate the parameter search-space more efficiently by a) tying together groundings of untied formulas that are likely to have similar weights, and b) setting good initial values for the parameters. To do this, we follow a hierarchical approach, where we first learn the parameters that are to be tied using a non-relational learner. We then use a relational learner to learn the tied-parameter MLN with initial values derived from parameters learned by the non-relational learner.
We illustrate the promise of our approach on three different real-world problems and show that our approach yields much more scalable and accurate results compared to existing state-of-the-art relational learning systems. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/farabi18a.html https://proceedings.mlr.press/v84/farabi18a.html The Binary Space Partitioning-Tree Process The Mondrian process represents an elegant and powerful approach for space partition modelling. However, as it restricts the partitions to be axis-aligned, its modelling flexibility is limited. In this work, we propose a self-consistent Binary Space Partitioning (BSP)-Tree process to generalize the Mondrian process. The BSP-Tree process is an almost surely right continuous Markov jump process that allows uniformly distributed oblique cuts in a two-dimensional convex polygon. The BSP-Tree process can also be extended using a non-uniform probability measure to generate direction differentiated cuts. The process is also self-consistent, maintaining distributional invariance under a restricted subdomain. We use Conditional-Sequential Monte Carlo for inference using the tree structure as the high-dimensional variable. The BSP-Tree process’s performance on synthetic data partitioning and relational modelling demonstrates clear inferential improvements over the standard Mondrian process and other related methods. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/fan18b.html https://proceedings.mlr.press/v84/fan18b.html Statistical Sparse Online Regression: A Diffusion Approximation Perspective In this paper, we propose to adopt the diffusion approximation techniques to study online regression. The diffusion approximation techniques allow us to characterize the exact dynamics of the online regression process. As a consequence, we obtain the optimal statistical rate of convergence up to a logarithmic factor of the streaming sample size. 
Using the idea of trajectory averaging, we further improve the rate of convergence by eliminating the logarithmic factor. Lastly, we propose a two-step algorithm for sparse online regression: a burn-in step using offline learning and a refinement step using a variant of truncated stochastic gradient descent. Under appropriate assumptions, we show the proposed algorithm produces near optimal sparse estimators. Numerical experiments lend further support to our obtained theory. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/fan18a.html https://proceedings.mlr.press/v84/fan18a.html Combinatorial Penalties: Which structures are preserved by convex relaxations? We consider the homogeneous and the non-homogeneous convex relaxations for combinatorial penalty functions defined on support sets. Our study identifies key differences in the tightness of the resulting relaxations through the notion of the lower combinatorial envelope of a set-function along with new necessary conditions for support identification. We then propose a general adaptive estimator for convex monotone regularizers, and derive new sufficient conditions for support recovery in the asymptotic setting. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/el-halabi18a.html https://proceedings.mlr.press/v84/el-halabi18a.html Large Scale Empirical Risk Minimization via Truncated Adaptive Newton Method Most second order methods are inapplicable to large scale empirical risk minimization (ERM) problems because both the number of samples N and the number of parameters p are large. Large N makes it costly to evaluate Hessians and large p makes it costly to invert Hessians. This paper proposes a novel adaptive sample size second-order method, which reduces the cost of computing the Hessian by solving a sequence of ERM problems corresponding to a subset of samples and lowers the cost of computing the Hessian inverse using a truncated eigenvalue decomposition. 
Although the sample size is grown at a geometric rate, it is shown that it is sufficient to run a single iteration in each growth stage to track the optimal classifier to within its statistical accuracy. This results in convergence to the optimal classifier associated with the whole set in a number of iterations that scales with $\log(N)$. The use of a truncated eigenvalue decomposition results in the cost of each iteration being of order $p^2$. Theoretical performance gains manifest in practical implementations. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/eisen18a.html https://proceedings.mlr.press/v84/eisen18a.html Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty in our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between stragglers and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions, and a novel learning rate schedule to compensate for gradient staleness. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/dutta18a.html https://proceedings.mlr.press/v84/dutta18a.html Learning Determinantal Point Processes in Sublinear Time We propose a new class of determinantal point processes (DPPs) which can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. 
This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and DPPs defined on exponentially many items. We apply this new class to modelling text documents as sampling a DPP of sentences, and propose a conditional maximum likelihood formulation to model topic proportions, which is made possible with no approximation for our class of DPPs. We present an application to document summarization with a DPP on $2^{500}$ items. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/dupuy18a.html https://proceedings.mlr.press/v84/dupuy18a.html Bayesian Nonparametric Poisson-Process Allocation for Time-Sequence Modeling Analyzing the underlying structure of multiple time-sequences provides insights into the understanding of social networks and human activities. In this work, we present the Bayesian nonparametric Poisson process allocation (BaNPPA), a latent-function model for time-sequences, which automatically infers the number of latent functions. We model the intensity of each sequence as an infinite mixture of latent functions, each of which is obtained using a function drawn from a Gaussian process. We show that a technical challenge for the inference of such mixture models is the unidentifiability of the weights of the latent functions. We propose to cope with the issue by regulating the volume of each latent function within a variational inference algorithm. Our algorithm is computationally efficient and scales well to large data sets. We demonstrate the usefulness of our proposed model through experiments on both synthetic and real-world data sets. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ding18a.html https://proceedings.mlr.press/v84/ding18a.html Sketching for Kronecker Product Regression and P-splines TensorSketch is an oblivious linear sketch introduced in (Pagh, 2013) and later used in (Pham and Pagh, 2013) in the context of SVMs for polynomial kernels. It was shown in (Avron et al., 2014) that TensorSketch provides a subspace embedding, and therefore can be used for canonical correlation analysis, low rank approximation, and principal component regression for the polynomial kernel. We take TensorSketch outside of the context of polynomial kernels, and show its utility in applications in which the underlying design matrix is a Kronecker product of smaller matrices. This allows us to solve Kronecker product regression and non-negative Kronecker product regression, as well as regularized spline regression. Our main technical result is then in extending TensorSketch to other norms. That is, TensorSketch only provides input sparsity time for Kronecker product regression with respect to the 2-norm. We show how to solve Kronecker product regression with respect to the 1-norm in time sublinear in the time required for computing the Kronecker product, as well as for more general p-norms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/diao18a.html https://proceedings.mlr.press/v84/diao18a.html Batch-Expansion Training: An Efficient Optimization Framework We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset. As opposed to stochastic approaches, batches do not need to be resampled i.i.d. at every iteration, thus making BET more resource efficient in a distributed setting, and when disk-access is constrained. Moreover, BET can be easily paired with most batch optimizers, does not require any parameter-tuning, and compares favorably to existing stochastic and batch methods. 
We show that when the batch size grows exponentially with the number of outer iterations, BET achieves the optimal $O(1/\epsilon)$ data-access convergence rate for strongly convex objectives. Experiments in parallel and distributed settings show that BET performs better than standard batch and stochastic approaches. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/derezinski18b.html https://proceedings.mlr.press/v84/derezinski18b.html Subsampling for Ridge Regression via Regularized Volume Sampling Given n vectors $x_i ∈\mathbb{R}^d$, we want to fit a linear regression model for noisy labels $y_i ∈\mathbb{R}$. The ridge estimator is a classical solution to this problem. However, when labels are expensive, we are forced to select only a small subset of vectors $x_i$ for which we obtain the labels $y_i$. We propose a new procedure for selecting the subset of vectors, such that the ridge estimator obtained from that subset offers strong statistical guarantees in terms of the mean squared prediction error over the entire dataset of n labeled vectors. The number of labels needed is proportional to the statistical dimension of the problem which is often much smaller than d. Our method is an extension of a joint subsampling procedure called volume sampling. A second major contribution is that we speed up volume sampling so that it is essentially as efficient as leverage scores, which is the main i.i.d. subsampling procedure for this task. Finally, we show theoretically and experimentally that volume sampling has a clear advantage over any i.i.d. sampling when labels are expensive. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/derezinski18a.html https://proceedings.mlr.press/v84/derezinski18a.html Bootstrapping EM via Power EM and Convergence in the Naive Bayes Model We study the convergence properties of the Expectation-Maximization algorithm in the Naive Bayes model. 
We show that EM can get stuck in regions of slow convergence, even when the features are binary and i.i.d. conditioned on the class label, and even under random (i.e., non-worst-case) initialization. In turn, we show that EM can be bootstrapped in a pre-training step that computes a good initialization. From this initialization we show theoretically and experimentally that EM converges exponentially fast to the true model parameters. Our bootstrapping method amounts to running the EM algorithm on appropriately centered iterates of small magnitude, which as we show corresponds to effectively performing power iteration on the covariance matrix of the mixture model, although power iteration is performed under the hood by EM itself. As such, we call our bootstrapping approach “power EM.” Specifically for the case of two binary features, we show global exponentially fast convergence of EM, even without bootstrapping. Finally, as the Naive Bayes model is quite expressive, we show as corollaries of our convergence results that the EM algorithm globally converges to the true model parameters for mixtures of two Gaussians, recovering recent results of Xu et al. (2016) and Daskalakis et al. (2017). Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/daskalakis18a.html https://proceedings.mlr.press/v84/daskalakis18a.html On denoising modulo 1 samples of a function Consider an unknown smooth function $f: [0,1] →\mathbb{R}$, and say we are given $n$ noisy mod 1 samples of $f$, i.e., $y_i = (f(x_i) + \eta_i) \bmod 1$ for $x_i ∈[0,1]$, where $\eta_i$ denotes noise. Given the samples $(x_i,y_i)_{i=1}^{n}$ our goal is to recover smooth, robust estimates of the clean samples $f(x_i) \bmod 1$. We formulate a natural approach for solving this problem which works with representations of mod 1 values over the unit circle. This amounts to solving a quadratically constrained quadratic program (QCQP) with non-convex constraints involving points lying on the unit circle. 
Our proposed approach is based on solving its relaxation which is a trust region subproblem, and hence solvable efficiently. We demonstrate its robustness to noise via extensive simulations on several synthetic examples, and provide a detailed theoretical analysis. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/cucuringu18a.html https://proceedings.mlr.press/v84/cucuringu18a.html Beating Monte Carlo Integration: a Nonasymptotic Study of Kernel Smoothing Methods Evaluating integrals is a ubiquitous issue and Monte Carlo methods, exploiting advances in random number generation over the last decades, offer a popular and powerful alternative to deterministic integration techniques, unsuited in particular when the domain of integration is complex. This paper is devoted to the study of a kernel smoothing based competitor built from a sequence of $n\geq 1$ i.i.d. random vectors with arbitrary continuous probability distribution $f(x)dx$, originally proposed in Delyon et al. (2016), from a nonasymptotic perspective. We establish a probability bound showing that the method under study, though biased, produces an estimate approximating the target integral $\int_{x\in\mathbb{R}^d}\varphi(x)dx$ with an error bound of order $o(1/\sqrt{n})$ uniformly over a class $\Phi$ of functions $\varphi$, under weak complexity/smoothness assumptions related to the class $\Phi$, outperforming Monte-Carlo procedures. This striking result is shown to derive from an appropriate decomposition of the maximal deviation between the target integrals and their estimates, highlighting the remarkable benefit to averaging strongly dependent terms regarding statistical accuracy in this situation. The theoretical analysis then rests on sharp probability inequalities for degenerate $U$-statistics. It is illustrated by numerical results in the context of covariate shift regression, providing empirical evidence of the relevance of the approach. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/clemencon18a.html https://proceedings.mlr.press/v84/clemencon18a.html Parallel and Distributed MCMC via Shepherding Distributions In this paper, we present a general algorithmic framework for developing easily parallelizable/distributable Markov Chain Monte Carlo (MCMC) algorithms. Our framework relies on the introduction of an auxiliary distribution called a ‘shepherding distribution’ (SD) that is used to control several MCMC chains that run in parallel. The SD is an introduced prior on one or more key parameters (or hyperparameters) of the target distribution. The shepherded chains then collectively explore the space of samples, communicating via the shepherding distribution, to reach high likelihood regions faster. The method of SDs is simple, and it is often easy to develop a shepherded sampler for a particular problem. Other advantages include wide applicability: the method can easily be used to draw samples from discrete distributions, or distributions on the simplex. Further, the method is asymptotically correct, since the method of SDs trivially maintains detailed balance. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chowdhury18a.html https://proceedings.mlr.press/v84/chowdhury18a.html The Geometry of Random Features We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use i.i.d. sampling under weak conditions for tails of the associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal (in terms of mean squared error) kernel estimators. 
We provide the first theoretical results which explain why orthogonal random features outperform unstructured ones on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/choromanski18a.html https://proceedings.mlr.press/v84/choromanski18a.html Community Detection in Hypergraphs: Optimal Statistical Limit and Efficient Algorithms In this paper, community detection in hypergraphs is explored. Under a generative hypergraph model called "d-wise hypergraph stochastic block model" (d-hSBM) which naturally extends the Stochastic Block Model (SBM) from graphs to d-uniform hypergraphs, the fundamental limit on the asymptotic minimax misclassified ratio is characterized. For proving the achievability, we propose a two-step polynomial time algorithm that provably achieves the fundamental limit in the sparse hypergraph regime. For proving the optimality, the lower bound of the minimax risk is set by finding a smaller parameter space which contains the most dominant error events, inspired by the analysis in the achievability part. It turns out that the minimax risk decays exponentially fast to zero as the number of nodes tends to infinity, and the rate function is a weighted combination of several divergence terms, each of which is the Rényi divergence of order 1/2 between two Bernoulli distributions. The Bernoulli distributions involved in the characterization of the rate function are those governing the random instantiation of hyperedges in d-hSBM. Experimental results on both synthetic and real-world data validate our theoretical finding. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chien18a.html https://proceedings.mlr.press/v84/chien18a.html Convergence of Value Aggregation for Imitation Learning Value aggregation is a general framework for solving imitation learning problems. Based on the idea of data aggregation, it generates a policy sequence by iteratively interleaving policy optimization and evaluation in an online learning setting. While the existence of a good policy in the policy sequence can be guaranteed non-asymptotically, little is known about the convergence of the sequence or the performance of the last policy. In this paper, we debunk the common belief that value aggregation always produces a convergent policy sequence with improving performance. Moreover, we identify a critical stability condition for convergence and provide a tight non-asymptotic bound on the performance of the last policy. These new theoretical insights let us stabilize problems with regularization, which removes the inconvenient process of identifying the best policy in the policy sequence in stochastic problems. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/cheng18c.html https://proceedings.mlr.press/v84/cheng18c.html FLAG n’ FLARE: Fast Linearly-Coupled Adaptive Gradient Methods We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. 
We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/cheng18b.html https://proceedings.mlr.press/v84/cheng18b.html Matrix completability analysis via graph k-connectivity The problem of low-rank matrix completion is continually attracting attention for its applicability to many real-world problems. Still, the large size, extreme sparsity, and non-uniformity of these matrices pose a challenge. In this paper, we make the observation that even when the observed matrix is too sparse for accurate completion, there may be portions of the data where completion is still possible. We propose the completeID algorithm, which exploits the non-uniformity of the observation, to analyze the completability of the input instead of blindly applying completion. Balancing statistical accuracy with computational efficiency, we relate completability to edge-connectivity of the graph associated with the input partially-observed matrix. We develop the MaxKCD algorithm for finding maximally k-edge-connected components efficiently. Experiments across datasets from a variety of applications demonstrate not only the success of completeID but also the importance of completability analysis. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/cheng18a.html https://proceedings.mlr.press/v84/cheng18a.html Near-Optimal Machine Teaching via Explanatory Teaching Sets Modern applications of machine teaching for humans often involve domain-specific, non-trivial target hypothesis classes. To facilitate understanding of the target hypothesis, it is crucial for the teaching algorithm to use examples which are interpretable to the human learner. 
In this paper, we propose NOTES, a principled framework for constructing interpretable teaching sets, utilizing explanations to accelerate the teaching process. Our algorithm is built upon a natural stochastic model of learners and a novel submodular surrogate objective function which greedily selects interpretable teaching examples. We prove that NOTES is competitive with the optimal explanation-based teaching strategy. We further instantiate NOTES with a specific hypothesis class, which can be viewed as an interpretable approximation of any hypothesis class, allowing us to handle complex hypotheses in practice. We demonstrate the effectiveness of NOTES on several image classification tasks, for both simulated and real human learners. Our experimental results suggest that by leveraging explanations, one can significantly speed up teaching. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18g.html https://proceedings.mlr.press/v84/chen18g.html Online Continuous Submodular Maximization In this paper, we consider an online optimization process, where the objective functions are not convex (nor concave) but instead belong to a broad class of continuous submodular functions. We first propose a variant of the Frank-Wolfe algorithm that has access to the full gradient of the objective functions. We show that it achieves a regret bound of $O(\sqrt{T})$ (where $T$ is the horizon of the online optimization problem) against a $(1-1/e)$-approximation to the best feasible solution in hindsight. However, in many scenarios, only an unbiased estimate of the gradients is available. For such settings, we then propose an online stochastic gradient ascent algorithm that also achieves a regret bound of $O(\sqrt{T})$, albeit against a weaker $1/2$-approximation to the best feasible solution in hindsight. We also generalize our results to $γ$-weakly submodular functions and prove the same sublinear regret bounds. 
Finally, we demonstrate the efficiency of our algorithms on a few problem instances, including non-convex/non-concave quadratic programs, multilinear extensions of submodular set functions, and D-optimal design. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18f.html https://proceedings.mlr.press/v84/chen18f.html Metrics for Deep Generative Models Neural samplers such as variational autoencoders (VAEs) or generative adversarial networks (GANs) approximate distributions by transforming samples from a simple random source—the latent space—to samples from a more complex distribution represented by a dataset. While the manifold hypothesis implies that a dataset contains large regions of low density, the training criteria of VAEs and GANs will make the latent space densely covered. Consequently, points that are separated by low-density regions in observation space will be pushed together in latent space, making stationary distances poor proxies for similarity. We transfer ideas from Riemannian geometry to this setting, letting the distance between two points be the shortest path on a Riemannian manifold induced by the transformation. The method yields a principled distance measure, provides a tool for visual inspection of deep generative models, and an alternative to linear interpolation in latent space. In addition, it can be applied for robot movement generalization using previously learned skills. The method is evaluated on a synthetic dataset with known ground truth; on a simulated robot arm dataset; on human motion capture data; and on a generative model of handwritten digits. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18e.html https://proceedings.mlr.press/v84/chen18e.html Sparse Linear Isotonic Models In machine learning and data mining, linear models have been widely used to model the response as parametric linear functions of the predictors. 
To relax such stringent assumptions made by parametric linear models, additive models consider the response to be a summation of unknown transformations applied on the predictors; in particular, additive isotonic models (AIMs) assume the unknown transformations to be monotone. In this paper, we introduce sparse linear isotonic models (SLIMs) for high-dimensional problems by hybridizing ideas in parametric sparse linear models and AIMs, which enjoy a few appealing advantages over both. In the high-dimensional setting, a two-step algorithm is proposed for estimating the sparse parameters as well as the monotone functions over predictors. Under mild statistical assumptions, we show that the algorithm can accurately estimate the parameters. Promising preliminary experiments are presented to support the theoretical results. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18d.html https://proceedings.mlr.press/v84/chen18d.html Crowdclustering with Partition Labels Crowdclustering is a practical way to incorporate domain knowledge into clustering, by combining opinions from multiple domain experts. Existing crowdclustering methods analyze binary pairwise similarity labels. However, in some applications, experts might provide partition labels. If we convert partition labels into pairwise similarity, then it would be difficult to understand the relationships between clustering solutions from different experts. In this paper, we propose a crowdclustering model that directly analyzes partition labels. The proposed model adopts a novel approach based on a modified multinomial logistic regression model, which simultaneously learns the number of clusters and determines hyper-planes that partition samples into clusters. The proposed model also learns a mapping between the latent clusters and expert labels, revealing the agreements and disagreements between experts. 
Experiments on benchmark data demonstrate that the proposed model simultaneously learns the number of clusters and discovers the clustering structure. An experiment on a disease subtyping problem illustrates that the proposed model helps us understand the agreement and disagreement between experts. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18c.html https://proceedings.mlr.press/v84/chen18c.html Symmetric Variational Autoencoder and Connections to Adversarial Learning A new form of the variational autoencoder (VAE) is proposed, based on the symmetric Kullback-Leibler divergence. It is demonstrated that learning of the resulting symmetric VAE (sVAE) has close connections to previously developed adversarial-learning methods. This relationship helps unify the previously distinct techniques of VAE and adversarial learning, and provides insights that allow us to ameliorate shortcomings with some previously developed adversarial methods. In addition to an analysis that motivates and explains the sVAE, an extensive set of experiments validates the utility of the approach. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18b.html https://proceedings.mlr.press/v84/chen18b.html An Optimization Approach to Learning Falling Rule Lists A falling rule list is a probabilistic decision list for binary classification, consisting of a series of if-then rules with antecedents in the if clauses and probabilities of the desired outcome ("1") in the then clauses. Just as in a regular decision list, the order of rules in a falling rule list is important – each example is classified by the first rule whose antecedent it satisfies. Unlike a regular decision list, a falling rule list requires the probabilities of the desired outcome ("1") to be monotonically decreasing down the list. 
We propose an optimization approach to learning falling rule lists and "softly" falling rule lists, along with Monte-Carlo search algorithms that use bounds on the optimal solution to prune the search space. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chen18a.html https://proceedings.mlr.press/v84/chen18a.html Convergence diagnostics for stochastic gradient descent with constant learning rate Many iterative procedures in stochastic optimization exhibit a transient phase followed by a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in that region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect such a phase transition in the context of stochastic gradient descent with constant learning rate. We present theory and experiments suggesting that the region where the proposed diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing such a region. Finally, we suggest an application to speed up convergence of stochastic gradient descent by halving the learning rate each time stationarity is detected. This leads to a new variant of stochastic gradient descent, which in many settings is comparable to the state of the art. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/chee18a.html https://proceedings.mlr.press/v84/chee18a.html Dropout as a Low-Rank Regularizer for Matrix Factorization Regularization for matrix factorization (MF) and approximation problems has been carried out in many different ways. Due to its popularity in deep learning, dropout has been applied also for this class of problems. Despite its solid empirical performance, the theoretical properties of dropout as a regularizer remain quite elusive for this class of problems. 
In this paper, we present a theoretical analysis of dropout for MF, where Bernoulli random variables are used to drop columns of the factors. We demonstrate the equivalence between dropout and a fully deterministic model for MF in which the factors are regularized by the sum of the product of squared Euclidean norms of the columns. Additionally, we inspect the case of a variable-sized factorization and we prove that dropout achieves the global minimum of a convex approximation problem with (squared) nuclear norm regularization. As a result, we conclude that dropout can be used as a low-rank regularizer with data-dependent singular-value thresholding. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/cavazza18a.html https://proceedings.mlr.press/v84/cavazza18a.html Nearly second-order optimality of online joint detection and estimation via one-sample update schemes Sequential hypothesis testing and change-point detection when the distribution parameters are unknown is a fundamental problem in statistics and machine learning. We show that for such problems, detection procedures based on sequential likelihood ratios with simple one-sample update estimates such as online mirror descent are nearly second-order optimal. This means that the upper bound for the algorithm performance meets the lower bound asymptotically up to a log-log factor in the false-alarm rate when it tends to zero. This is a blessing, since although the generalized likelihood ratio (GLR) statistics are optimal theoretically, they cannot be computed recursively, and their exact computation usually requires infinite memory of historical data. We prove the nearly second-order optimality by making a connection between sequential change-point detection and online convex optimization and leveraging the logarithmic regret bound property of the online mirror descent algorithm. Numerical examples validate our theory. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/cao18a.html https://proceedings.mlr.press/v84/cao18a.html Robust Maximization of Non-Submodular Objectives We study the problem of maximizing a monotone set function subject to a cardinality constraint $k$ in the setting where some number of elements $τ$ is deleted from the returned set. The focus of this work is on the worst-case adversarial setting. While there exist constant-factor guarantees when the function is submodular, there are no guarantees for non-submodular objectives. In this work, we present a new algorithm OBLIVIOUS-GREEDY and prove the first constant-factor approximation guarantees for a wider class of non-submodular objectives. The obtained theoretical bounds are the first constant-factor bounds that also hold in the linear regime, i.e. when the number of deletions $τ$ is linear in $k$. Our bounds depend on established parameters such as the submodularity ratio and some novel ones such as the inverse curvature. We bound these parameters for two important objectives including support selection and variance reduction. Finally, we numerically demonstrate the robust performance of OBLIVIOUS-GREEDY for these two objectives on various datasets. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bogunovic18a.html https://proceedings.mlr.press/v84/bogunovic18a.html Smooth and Sparse Optimal Transport Entropic regularization is quickly emerging as a new standard in optimal transport (OT). It allows the OT computation to be cast as a differentiable and unconstrained convex optimization problem, which can be efficiently solved using the Sinkhorn algorithm. However, entropy keeps the transportation plan strictly positive and therefore completely dense, unlike unregularized OT. This lack of sparsity can be problematic in applications where the transportation plan itself is of interest. 
In this paper, we explore regularizing the primal and dual OT formulations with a strongly convex term, which corresponds to relaxing the dual and primal constraints with smooth approximations. We show how to incorporate squared $2$-norm and group lasso regularizations within that framework, leading to sparse and group-sparse transportation plans. On the theoretical side, we bound the approximation error introduced by regularizing the primal and dual formulations. Our results suggest that, for the regularized primal, the approximation error can often be smaller with squared $2$-norm than with entropic regularization. We showcase our proposed framework on the task of color transfer. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/blondel18a.html https://proceedings.mlr.press/v84/blondel18a.html Cause-Effect Inference by Comparing Regression Errors We address the problem of inferring the causal relation between two variables by comparing the least-squares errors of the predictions in both possible causal directions. Under the assumption of independence between the function relating cause and effect, the conditional noise distribution, and the distribution of the cause, we show that the errors are smaller in the causal direction if both variables are equally scaled and the causal relation is close to deterministic. Based on this, we provide an easily applicable method that only requires a regression in both possible causal directions. The performance of this method is compared with different related causal inference methods in various artificial and real-world data sets. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bloebaum18a.html https://proceedings.mlr.press/v84/bloebaum18a.html Group Invariance Principles for Causal Generative Models The postulate of independence of cause and mechanism (ICM) has recently led to several new causal discovery algorithms. 
The interpretation of independence and the way it is utilized, however, varies across these methods. Our aim in this paper is to propose a group theoretic framework for ICM to unify and generalize these approaches. In our setting, the cause-mechanism relationship is assessed by perturbing it with random group transformations. We show that the group theoretic view encompasses previous ICM approaches and provides a very general tool to study the structure of data generating mechanisms with direct applications to machine learning. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/besserve18a.html https://proceedings.mlr.press/v84/besserve18a.html Tree-based Bayesian Mixture Model for Competing Risks Many chronic diseases possess a shared biology. Therapies designed for patients at risk of multiple diseases need to account for the shared impact they may have on related diseases to ensure maximum overall well-being. Learning from data in this setting differs from classical survival analysis methods since the incidence of an event of interest may be obscured by other related competing events. We develop a semi-parametric Bayesian regression model for survival analysis with competing risks, which can be used for jointly assessing a patient’s risk of multiple (competing) adverse outcomes. We construct a Hierarchical Bayesian Mixture (HBM) model to describe survival paths in which a patient’s covariates influence both the estimation of the type of adverse event and the subsequent survival trajectory through Multivariate Random Forests. In addition, variable importance measures, which are essential for clinical interpretability, are induced naturally by our model. With this setting, we aim to provide not only accurate individual estimates but also interpretable conclusions for use as a clinical decision support tool. We compare our method with various state-of-the-art benchmarks on both synthetic and clinical data. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bellot18a.html https://proceedings.mlr.press/v84/bellot18a.html Factorized Recurrent Neural Architectures for Longer Range Dependence The ability to capture Long Range Dependence (LRD) in a stochastic process is of prime importance in the context of predictive models. A sequential model with a longer-term memory is better able to contextualize recent observations. In this article, we apply the theory of LRD stochastic processes to modern recurrent architectures, such as LSTMs and GRUs, and prove they do not provide LRD under assumptions sufficient for gradients to vanish. Motivated by an information-theoretic analysis, we provide a modified recurrent neural architecture that mitigates the issue of faulty memory through redundancy while keeping the compute time constant. Experimental results on a synthetic copy task, the Youtube-8m video classification task and a recommender system show that we enable better memorization and longer-term memory. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/belletti18a.html https://proceedings.mlr.press/v84/belletti18a.html Personalized and Private Peer-to-Peer Machine Learning The rise of connected personal devices together with privacy concerns calls for machine learning algorithms capable of leveraging the data of a large number of agents to learn personalized models under strong privacy requirements. In this paper, we introduce an efficient algorithm to address the above problem in a fully decentralized (peer-to-peer) and asynchronous fashion, with provable convergence rate. We show how to make the algorithm differentially private to protect against the disclosure of information about the personal datasets, and formally analyze the trade-off between utility and privacy. 
Our experiments show that our approach dramatically outperforms previous work in the non-private case, and that under privacy constraints, we can significantly improve over models learned in isolation. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bellet18a.html https://proceedings.mlr.press/v84/bellet18a.html Few-shot Generative Modelling with Generative Matching Networks Despite recent advances, the remaining bottlenecks in deep generative models are the necessity of extensive training and difficulties with generalization from a small number of training examples. We develop a new generative model called Generative Matching Network which is inspired by the recently proposed matching networks for one-shot learning in discriminative tasks. By conditioning on the additional input dataset, our model can instantly learn new concepts that were not available in the training data but conform to a similar generative process. The proposed framework does not explicitly restrict diversity of the conditioning data and also does not require an extensive inference procedure for training or adaptation. Our experiments on the Omniglot dataset demonstrate that Generative Matching Networks significantly improve predictive performance on the fly as more additional data becomes available and outperform existing state-of-the-art conditional generative models. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bartunov18a.html https://proceedings.mlr.press/v84/bartunov18a.html Robust Locally-Linear Controllable Embedding Embed-to-control (E2C) is a model for solving high-dimensional optimal control problems by combining variational auto-encoders with locally-optimal controllers. 
However, the E2C model suffers from two major drawbacks: 1) its objective function does not correspond to the likelihood of the data sequence and 2) the variational encoder used for embedding typically has large variational approximation error, especially when there is noise in the system dynamics. In this paper, we present a new model for learning robust locally-linear controllable embedding (RCE). Our model directly estimates the predictive conditional density of the future observation given the current one, while introducing the bottleneck between the current and future observations. Although the bottleneck provides a natural embedding candidate for control, our RCE model introduces additional specific structures in the generative graphical model so that the model dynamics can be robustly linearized. We also propose a principled variational approximation of the embedding posterior that takes the future observation into account, and thus, makes the variational approximation more robust against the noise. Experimental results show that RCE outperforms the E2C model, and does so significantly when the underlying dynamics is noisy. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/banijamali18a.html https://proceedings.mlr.press/v84/banijamali18a.html Medoids in Almost-Linear Time via Multi-Armed Bandits Computing the medoid of a large number of points in high-dimensional space is an increasingly common operation in many data science problems. We present an algorithm Med-dit to compute the medoid with high probability, which uses $O(n\log n)$ distance evaluations. Med-dit is based on a connection with the Multi-Armed Bandit problem. We evaluate the performance of Med-dit empirically on the Netflix-prize and single-cell RNA-seq datasets, containing hundreds of thousands of points living in tens of thousands of dimensions, and observe a $5$-$10$x improvement in performance over the current state of the art. 
We have released the code of Med-dit and our empirical results at https://github.com/bagavi/Meddit. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bagaria18a.html https://proceedings.mlr.press/v84/bagaria18a.html One-shot Coresets: The Case of k-Clustering Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, were proposed. While these techniques provide provably good and small summaries, they are inherently problem dependent - the practitioner has to commit to a fixed clustering objective before even exploring the data. However, can one construct small data summaries for a wide range of clustering problems simultaneously? In this work, we affirmatively answer this question by proposing an efficient algorithm that constructs such one-shot summaries for k-clustering problems while retaining strong theoretical guarantees. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/bachem18a.html https://proceedings.mlr.press/v84/bachem18a.html Robust Vertex Enumeration for Convex Hulls in High Dimensions We design a fast and robust algorithm named All Vertex Triangle Algorithm (AVTA) for detecting the vertices of the convex hull of a set of points in high dimensions. Our proposed algorithm is very general and works for arbitrary convex hulls. In addition to being a fundamental problem in computational geometry and linear programming, vertex enumeration in high dimensions has numerous applications in machine learning. In particular, we apply AVTA to design new practical algorithms for topic models and non-negative matrix factorization. For topic models, our new algorithm leads to significantly better reconstruction of the topic-word matrix than state-of-the-art approaches. Additionally, we provide a robust analysis of AVTA and empirically demonstrate that it can handle larger amounts of noise than existing methods. 
For non-negative matrix factorization, we show that AVTA is competitive with existing methods that are specialized for this task. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/awasthi18a.html https://proceedings.mlr.press/v84/awasthi18a.html Kernel Conditional Exponential Family A nonparametric family of conditional distributions is introduced, which generalizes conditional exponential families using functional parameters in a suitable RKHS. An algorithm is provided for learning the generalized natural parameter, and consistency of the estimator is established in the well-specified case. In experiments, the new method generally outperforms a competing approach with consistency guarantees, and is competitive with a deep conditional density model on datasets that exhibit abrupt transitions and heteroscedasticity. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/arbel18a.html https://proceedings.mlr.press/v84/arbel18a.html Bayesian Structure Learning for Dynamic Brain Connectivity Human brain activity as measured by fMRI exhibits strong correlations between brain regions which are believed to vary over time. Importantly, dynamic connectivity has been linked to individual differences in physiology, psychology and behavior, and has shown promise as a biomarker for disease. The state of the art in computational neuroimaging is to estimate the brain networks as relatively short sliding window covariance matrices, which leads to high variance estimates, thereby resulting in high overall error. This manuscript proposes a novel Bayesian model for dynamic brain connectivity. Motivated by the underlying neuroscience, the model estimates covariances which vary smoothly over time, with an instantaneous decomposition into a collection of spatially sparse components – resulting in parsimonious and highly interpretable estimates of dynamic brain connectivity. 
Simulated results are presented to illustrate the performance of the model even when it is mis-specified. For real brain imaging data with unknown ground truth, in addition to qualitative evaluation, we devise a simple classification task which suggests that the estimated brain networks better capture the underlying structure. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/andersen18a.html https://proceedings.mlr.press/v84/andersen18a.html Integral Transforms from Finite Data: An Application of Gaussian Process Regression to Fourier Analysis Computing accurate estimates of the Fourier transform of analog signals from discrete data points is important in many fields of science and engineering. The conventional approach of performing the discrete Fourier transform of the data implicitly assumes periodicity and bandlimitedness of the signal. In this paper, we use Gaussian process regression to estimate the Fourier transform (or any other integral transform) without making these assumptions. This is possible because the posterior expectation of Gaussian process regression maps a finite set of samples to a function defined on the whole real line, expressed as a linear combination of covariance functions. We estimate the covariance function from the data using an appropriately designed gradient ascent method that constrains the solution to a linear combination of tractable kernel functions. This procedure results in a posterior expectation of the analog signal whose Fourier transform can be obtained analytically by exploiting linearity. Our simulations show that the new method leads to sharper and more precise estimation of the spectral density both in noise-free and noise-corrupted signals. We further validate the method in two real-world applications: the analysis of the yearly fluctuation in atmospheric CO2 level and the analysis of the spectral content of brain signals. 
Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ambrogioni18a.html https://proceedings.mlr.press/v84/ambrogioni18a.html Structured Optimal Transport Optimal Transport has recently gained interest in machine learning for applications ranging from domain adaptation to sentence similarities or deep learning. Yet, its ability to capture frequently occurring structure beyond the "ground metric" is limited. In this work, we develop a nonlinear generalization of (discrete) optimal transport that is able to reflect much additional structure. We demonstrate how to leverage the geometry of this new model for fast algorithms, and explore connections and properties. Illustrative experiments highlight the benefit of the induced structured couplings for tasks in domain adaptation and natural language processing. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/alvarez-melis18a.html https://proceedings.mlr.press/v84/alvarez-melis18a.html Proximity Variational Inference Variational inference is a powerful approach for approximate posterior inference. However, it is sensitive to initialization and can be subject to poor local optima. In this paper, we develop proximity variational inference (PVI). PVI is a new method for optimizing the variational objective that constrains subsequent iterates of the variational parameters to robustify the optimization path. Consequently, PVI is less sensitive to initialization and optimization quirks and finds better local optima. We demonstrate our method on four proximity statistics. We study PVI on a Bernoulli factor model and sigmoid belief network fit to real and synthetic data and compare to deterministic annealing (Katahira et al., 2008). We highlight the flexibility of PVI by designing a proximity statistic for Bayesian deep learning models such as the variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014) and show that it gives better performance by reducing overpruning. 
PVI also yields improved predictions in a deep generative model of text. Empirically, we show that PVI consistently finds better local optima and gives better predictive performance. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/altosaar18a.html https://proceedings.mlr.press/v84/altosaar18a.html Gauged Mini-Bucket Elimination for Approximate Inference Computing the partition function Z of a discrete graphical model is a fundamental inference challenge. Since this is computationally intractable, variational approximations are often used in practice. Recently, so-called gauge transformations were used to improve variational lower bounds on Z. In this paper, we propose a new gauge-variational approach, termed WMBE-G, which combines gauge transformations with the weighted mini-bucket elimination (WMBE) method. WMBE-G can provide both upper and lower bounds on Z, and is easier to optimize than the prior gauge-variational algorithm. We show that WMBE-G strictly improves the earlier WMBE approximation for symmetric models including Ising models with no magnetic field. Our experimental results demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric models. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/ahn18a.html https://proceedings.mlr.press/v84/ahn18a.html Stochastic algorithms for entropy-regularized optimal transport problems Optimal transport (OT) distances are finding ever more applications in machine learning and computer vision, but their widespread use in larger-scale problems is impeded by their high computational cost. In this work we develop a family of fast and practical stochastic algorithms for solving the optimal transport problem with an entropic penalization. This work extends the recently developed Greenkhorn algorithm, in the sense that the Greenkhorn algorithm is a limiting case of this family. 
We also provide a simple and general convergence theorem for all algorithms in the class, with rates that match the best known rates of the Greenkhorn and Sinkhorn algorithms, and conclude with numerical experiments that show under what regime of penalization the new stochastic methods are faster than the aforementioned methods. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/abid18a.html https://proceedings.mlr.press/v84/abid18a.html AdaGeo: Adaptive Geometric Learning for Optimization and Sampling Gradient-based optimization and Markov Chain Monte Carlo sampling can be found at the heart of several machine learning methods. In high-dimensional settings, well-known issues such as slow mixing, non-convexity and correlations can hinder the algorithms’ efficiency. In order to overcome these difficulties, we propose AdaGeo, a preconditioning framework for adaptively learning the geometry of the parameter space during optimization or sampling. In particular, we use the Gaussian process latent variable model (GP-LVM) to represent a lower-dimensional embedding of the parameters, identifying the underlying Riemannian manifold on which the optimization or sampling is taking place. Samples or optimization steps are consequently proposed based on the geometry of the manifold. We apply our framework to stochastic gradient descent, stochastic gradient Langevin dynamics, and stochastic gradient Riemannian Langevin dynamics, and show performance improvements for both optimization and sampling. Sat, 31 Mar 2018 00:00:00 +0000 https://proceedings.mlr.press/v84/abbati18a.html https://proceedings.mlr.press/v84/abbati18a.html