Proceedings of Machine Learning Research

The Power of Batching in Multiple Hypothesis Testing

Wed, 03 Jun 2020 00:00:00 +0000

One important partition of algorithms for controlling the false discovery rate (FDR) in multiple testing is into offline and online algorithms. The first generally achieve significantly higher power of discovery, while the latter allow making decisions sequentially as well as adaptively formulating hypotheses based on past observations. Using existing methodology, it is unclear how one could trade off the benefits of these two broad families of algorithms, all the while preserving their formal FDR guarantees. To this end, we introduce Batch-BH and Batch-St-BH, algorithms for controlling the FDR when a possibly infinite sequence of batches of hypotheses is tested by repeated application of one of the most widely used offline algorithms, the Benjamini-Hochberg (BH) method or Storey’s improvement of the BH method. We show that our algorithms interpolate between existing online and offline methodology, thus trading off the best of both worlds.

An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays

Wed, 03 Jun 2020 00:00:00 +0000

We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $\mathcal{O}(\sqrt{kn}+\sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves $\mathcal{O}(\sqrt{kn}+\min_{S}(|S|+\sqrt{D_{\bar S}\log(k)}))$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar S = [n]\setminus S$ are the counted rounds, and $D_{\bar S}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).

Adaptive Discretization for Evaluation of Probabilistic Cost Functions

Wed, 03 Jun 2020 00:00:00 +0000

In many real-world planning applications, e.g. dynamic design of experiments, autonomous driving and robot manipulation, it is necessary to evaluate candidate movement paths with respect to a safety cost function. Here, the continuous candidate paths need to be discretized first and, subsequently, evaluated onthe discretization points. The resulting quality of planned paths, thus, highly depends on the definition of the safety cost functions, and the resolution of the discretization. In this paper, we propose an approach for evaluating continuous candidate paths by employing an adaptive discretization scheme, with a probabilistic cost function learned from observations. The obtained path is then guaranteed to be epsilon-safe, i.e. the remaining risk of still finding an unsafe point on the trajectory is smaller than epsilon. The proposed approach is investigated theoretically, as well as empirically validated on several robotic path planning scenarios.

Communication-Efficient Asynchronous Stochastic Frank-Wolfe over Nuclear-norm Balls

Wed, 03 Jun 2020 00:00:00 +0000

Large-scale machine learning training suffers from two prior challenges, specifically for nuclear-norm constrained problems with distributed systems: the synchronization slowdown due to the straggling workers, and high communication costs. In this work, we propose an asynchronous Stochastic Frank Wolfe (SFW-asyn) method, which, for the first time, solves the two problems simultaneously, while successfully maintaining the same convergence rate as the vanilla SFW. We implement our algorithm in python (with MPI) to run on Amazon EC2, and demonstrate that SFW-asyn yields speed-ups almost linear to the number of machines compared to the vanilla SFW.

Federated Heavy Hitters Discovery with Differential Privacy

Wed, 03 Jun 2020 00:00:00 +0000

The discovery of heavy hitters (most frequent items) in user-generated data streams drives improvements in the app and web ecosystems, but can incur substantial privacy risks if not done with care. To address these risks, we propose a distributed and privacy-preserving algorithm for discovering the heavy hitters in a population of user-generated data streams. We leverage the sampling and thresholding properties of our distributed algorithm to prove that it is inherently differentially private, without requiring additional noise. We also examine the trade-off between privacy and utility, and show that our algorithm provides excellent utility while also achieving strong privacy guarantees. A significant advantage of this approach is that it eliminates the need to centralize raw data while also avoiding the significant loss in utility incurred by local differential privacy. We validate our findings both theoretically, using worst-case analyses, and practically, using a Twitter dataset with 1.6M tweets and over 650k users. Finally, we carefully compare our approach to Apple’s local differential privacy method for discovering heavy hitters.

Accelerated Factored Gradient Descent for Low-Rank Matrix Factorization

Wed, 03 Jun 2020 00:00:00 +0000

We study the low-rank matrix estimation problem, where the objective function $\mathcal{L}(\Mb)$ is defined over the space of positive semidefinite matrices with rank less than or equal to $r$. A fast approach to solve this problem is matrix factorization, which reparameterizes $\mathbf{M}$ as the product of two smaller matrix such that $\mathbf{M} =\mathbf{U}\mathbf{U}^\top$ and then performs gradient descent on $\mathbf{U}$ directly, a.k.a., factored gradient descent. Since the resulting problem is nonconvex, whether Nesterov’s acceleration scheme can be adapted to it remains a long-standing question. In this paper, we answer this question affirmatively by proposing a novel and practical accelerated factored gradient descent method motivated by Nesterov’s accelerated gradient descent. The proposed method enjoys better iteration complexity and computational complexity than the state-of-the-art algorithms in a wide regime. The key idea of our algorithm is to restrict all its iterates onto a special convex set, which enables the acceleration. Experimental results demonstrate the faster convergence of our algorithm and corroborate our theory.

Stochastic Recursive Variance-Reduced Cubic Regularization Methods

Wed, 03 Jun 2020 00:00:00 +0000

Stochastic Variance-Reduced Cubic regularization (SVRC) algorithms have received increasing attention due to its improved gradient/Hessian complexities (i.e., number of queries to stochastic gradient/Hessian oracles) to find local minima for nonconvex finite-sum optimization. However, it is unclear whether existing SVRC algorithms can be further improved. Moreover, the semi-stochastic Hessian estimator adopted in existing SVRC algorithms prevents the use of Hessian-vector product-based fast cubic subproblem solvers, which makes SVRC algorithms computationally intractable for high-dimensional problems. In this paper, we first present a Stochastic Recursive Variance-Reduced Cubic regularization method (SRVRC) using a recursively updated semi-stochastic gradient and Hessian estimators. It enjoys improved gradient and Hessian complexities to find an $(\epsilon, \sqrt{\epsilon})$-approximate local minimum, and outperforms the state-of-the-art SVRC algorithms. Built upon SRVRC, we further propose a Hessian-free SRVRC algorithm, namely SRVRC$_{\text{free}}$, which only needs $\tilde O(n\epsilon^{-2} \land \epsilon^{-3})$ stochastic gradient and Hessian-vector product computations, where $n$ is the number of component functions in the finite-sum objective and $\epsilon$ is the optimization precision. This outperforms the best-known result $\tilde O(\epsilon^{-3.5})$ achieved by stochastic cubic regularization algorithm proposed in \cite{tripuraneni2018stochastic}.

Learning Sparse Nonparametric DAGs

Wed, 03 Jun 2020 00:00:00 +0000

We develop a framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data. Our approach is based on a recent algebraic characterization of DAGs that led to the first fully continuous optimization for score-based learning of DAG models parametrized by a linear structural equation model (SEM). We extend this algebraic characterization to nonparametric SEM by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that can be applied to a variety of nonparametric and semiparametric models including GLMs, additive noise models, and index models as special cases. Unlike existing approaches that require specific modeling choices, loss functions, or algorithms, we present a completely general framework that can be applied to general nonlinear models (e.g. without additive noise), general differentiable loss functions, and generic black-box optimization routines.

A Framework for Sample Efficient Interval Estimation with Control Variates

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of estimating confidence intervals for the mean of a random variable, where the goal is to produce the smallest possible interval for a given number of samples. While minimax optimal algorithms are known for this problem in the general case, improved performance is possible under additional assumptions. In particular, we design an estimation algorithm to take advantage of side information in the form of a control variate, leveraging order statistics. Under certain conditions on the quality of the control variates, we show improved asymptotic efficiency compared to existing estimation algorithms. Empirically, we demonstrate superior performance on several real world surveying and estimation tasks where we use regression models as control variates.

Persistence Enhanced Graph Neural Network

Wed, 03 Jun 2020 00:00:00 +0000

Local structural information can increase the adaptability of graph convolutional networks to large graphs with heterogeneous topology. Existing methods only use relatively simplistic topological information, such as node degrees.We present a novel approach leveraging advanced topological information, i.e., persistent homology, which measures the information flow efficiency at different parts of the graph. To fully exploit such structural information in real world graphs, we propose a new network architecture which learns to use persistent homology information to reweight messages passed between graph nodes during convolution. For node classification tasks, our network outperforms existing ones on a broad spectrum of graph benchmarks.

Variational Autoencoders for Sparse and Overdispersed Discrete Data

Wed, 03 Jun 2020 00:00:00 +0000

Many applications, such as text modelling, high-throughput sequencing, and recommender systems, require analysing sparse, high-dimensional, and overdispersed discrete (count or binary) data. Recent deep probabilistic models based on variational autoencoders (VAE) have shown promising results on discrete data but may have inferior modelling performance due to the insufficient capability in modelling overdispersion and model misspecification. To address these issues, we develop a VAE-based framework using the negative binomial distribution as the data distribution. We also provide an analysis of its properties vis-à-vis other models. We conduct extensive experiments on three problems from discrete data analysis: text analysis/topic modelling, collaborative filtering, and multi-label learning. Our models outperform state-of-the-art approaches on these problems, while also capturing the phenomenon of overdispersion more effectively.

Bandit Convex Optimization in Non-stationary Environments

Wed, 03 Jun 2020 00:00:00 +0000

Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the $\Omega(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length $P_T$ ahead of time, which is generally unknown.

A Simple Approach for Non-stationary Linear Bandits

Wed, 03 Jun 2020 00:00:00 +0000

This paper investigates the problem of non-stationary linear bandits, where the unknown regression parameter is evolving over time. Previous studies have adopted sophisticated mechanisms, such as sliding window or weighted penalty to achieve near-optimal dynamic regret. In this paper, we demonstrate that a simple restarted strategy is sufficient to attain the same regret guarantee. Specifically, we design an UCB-type algorithm to balance exploitation and exploration, and restart it periodically to handle the drift of unknown parameters. Let $T$ be the time horizon, $d$ be the dimension, and $P_T$ be the path-length that measures the fluctuation of the evolving unknown parameter, our approach enjoys an $\tilde{O}(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ dynamic regret, which is nearly optimal, matching the $\Omega(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ minimax lower bound up to logarithmic factors. Empirical studies also validate the efficacy of our approach.

One Sample Stochastic Frank-Wolfe

Wed, 03 Jun 2020 00:00:00 +0000

One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its wide-spread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing the efficiency. In this paper, we propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyper parameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-\epsilon$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.

Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models

Wed, 03 Jun 2020 00:00:00 +0000

Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under L2 perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models.

Quantized Frank-Wolfe: Faster Optimization, Lower Communication, and Projection Free

Wed, 03 Jun 2020 00:00:00 +0000

How can we efficiently mitigate the overhead of gradient communications in distributed optimization? This problem is at the heart of training scalable machine learning models and has been mainly studied in the unconstrained setting. In this paper, we propose Quantised Frank-Wolfe (QFW), the first projection free and communication-efficient algorithm for solving constrained optimization problems at scale. We consider both convex and non-convex objective functions, expressed as a finite-sum or more generally a stochastic optimization problem, and provide strong theoretical guarantees on the convergence rate of QFW. This is accomplished by proposing novel quantization schemes that efficiently compress gradients while controlling the noise variance intduced during this process. Finally, we empirically validate the efficiency of QFW in terms of communication and the quality of returned solution against natural baselines.

Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning

Wed, 03 Jun 2020 00:00:00 +0000

An essential problem in automated machine learning (AutoML) is that of model selection. A unique challenge in the sequential setting is the fact that the optimal model itself may vary over time, depending on the distribution of features and labels available up to each point in time. In this paper, we propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting. This is accomplished by treating the performance at each time step as its own black-box function. In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions using deep kernel learning (DKL). To the best of our knowledge, we are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose. Using multiple real-world datasets, we verify that our proposed method outperforms both standard BO and multi-objective BO algorithms on a variety of sequence prediction tasks.

AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic Gradient MCMC

Wed, 03 Jun 2020 00:00:00 +0000

Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is an efficient method for sampling from continuous distributions. It is a faster alternative to HMC: instead of using the whole dataset at each iteration, SGHMC uses only a subsample. This improves performance, but introduces bias that can cause SGHMC to converge to the wrong distribution. One can prevent this using a step size that decays to zero, but such a step size schedule can drastically slow down convergence. To address this tension, we propose a novel second-order SG-MCMC algorithm—AMAGOLD—that infrequently uses Metropolis-Hastings (M-H) corrections to remove bias. The infrequency of corrections amortizes their cost. We prove AMAGOLD converges to the target distribution with a fixed, rather than a diminishing, step size, and that its convergence rate is at most a constant factor slower than a full-batch baseline. We empirically demonstrate AMAGOLD’s effectiveness on synthetic distributions, Bayesian logistic regression, and Bayesian neural networks.

Stochastic Particle-Optimization Sampling and the Non-Asymptotic Convergence Theory

Wed, 03 Jun 2020 00:00:00 +0000

Particle-optimization-based sampling (POS) is a recently developed effective sampling technique that interactively updates a set of particles. A representative algorithm is the Stein variational gradient descent (SVGD). We prove, under certain conditions, SVGD experiences a theoretical pitfall, {\it i.e.}, particles tend to collapse. As a remedy, we generalize POS to a stochastic setting by injecting random noise into particle updates, thus termed stochastic particle-optimization sampling (SPOS). Notably, for the first time, we develop non-asymptotic convergence theory for the SPOS framework (related to SVGD), characterizing algorithm convergence in terms of the 1-Wasserstein distance w.r.t. the numbers of particles and iterations. Somewhat surprisingly, with the same number of updates (not too large) for each particle, our theory suggests adopting more particles does not necessarily lead to a better approximation of a target distribution, due to limited computational budget and numerical errors. This phenomenon is also observed in SVGD and verified via a synthetic experiment. Extensive experimental results verify our theory and demonstrate the effectiveness of our proposed framework.

Learning Overlapping Representations for the Estimation of Individualized Treatment Effects

Wed, 03 Jun 2020 00:00:00 +0000

The choice of making an intervention depends on its potential benefit or harm in comparison to alternatives. Estimating the likely outcome of alternatives from observational data is a challenging problem as all outcomes are never observed, and selection bias precludes the direct comparison of differently intervened groups. Despite their empirical success, we show that algorithms that learn domain-invariant representations of inputs (on which to make predictions) are often inappropriate, and develop generalization bounds that demonstrate the dependence on domain overlap and highlight the need for invertible latent maps. Based on these results, we develop a deep kernel regression algorithm and posterior regularization framework that substantially outperforms the state-of-the-art on a variety of benchmarks data sets.

Nested-Wasserstein Self-Imitation Learning for Sequence Generation

Wed, 03 Jun 2020 00:00:00 +0000

Reinforcement learning (RL) has been widely studied for improving sequence-generation models. However, the conventional rewards used for RL training typically cannot capture sufficient semantic information and therefore render model bias. Further, the sparse and delayed rewards make RL exploration inefficient. To alleviate these issues, we propose the concept of nested-Wasserstein distance for distributional semantic matching. To further exploit it, a novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences for enhanced exploration and better semantic matching. Our solution can be understood as approximately executing proximal policy optimization with Wasserstein trust-regions. Experiments on a variety of unconditional and conditional sequence-generation tasks demonstrate the proposed approach consistently leads to improved performance.

Minimizing Dynamic Regret and Adaptive Regret Simultaneously

Wed, 03 Jun 2020 00:00:00 +0000

Regret minimization is treated as the golden rule in the traditional study of online learning. However, regret minimization algorithms tend to converge to the static optimum, thus being suboptimal for changing environments. To address this limitation, new performance measures, including dynamic regret and adaptive regret have been proposed to guide the design of online algorithms. The former one aims to minimize the global regret with respect to a sequence of changing comparators, and the latter one attempts to minimize every local regret with respect to a fixed comparator. Existing algorithms for dynamic regret and adaptive regret are developed independently, and only target one performance measure. In this paper, we bridge this gap by proposing novel online algorithms that are able to minimize the dynamic regret and adaptive regret simultaneously. In fact, our theoretical guarantee is even stronger in the sense that one algorithm is able to minimize the dynamic regret over any interval.

AsyncQVI: Asynchronous-Parallel Q-Value Iteration for Discounted Markov Decision Processes with Near-Optimal Sample Complexity

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we propose AsyncQVI, an asynchronous-parallel Q-value iteration for discounted Markov decision processes whose transition and reward can only be sampled through a generative model. AsyncQVI is also the first asynchronous-parallel algorithm for discounted Markov decision processes that has a sample complexity, which nearly matches the theoretical lower bound. The relatively low memory footprint and parallel ability make AsyncQVI suitable for large-scale applications. In numerical tests, we compare AsyncQVI with four sample-based value iteration methods. The results show that our algorithm is highly efficient and achieves linear parallel speedup.

Fully Decentralized Joint Learning of Personalized Models and Collaboration Graphs

Wed, 03 Jun 2020 00:00:00 +0000

We consider the fully decentralized machine learning scenario where many users with personal datasets collaborate to learn models through local peer-to-peer exchanges, without a central coordinator. We propose to train personalized models that leverage a collaboration graph describing the relationships between user personal tasks, which we learn jointly with the models. Our fully decentralized optimization procedure alternates between training nonlinear models given the graph in a greedy boosting manner, and updating the collaboration graph (with controlled sparsity) given the models. Throughout the process, users exchange messages only with a small number of peers (their direct neighbors when updating the models, and a few random users when updating the graph), ensuring that the procedure naturally scales with the number of users. Overall, our approach is communication-efficient and avoids exchanging personal data. We provide an extensive analysis of the convergence rate, memory and communication complexity of our approach, and demonstrate its benefits compared to competing techniques on synthetic and real datasets.

Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Wed, 03 Jun 2020 00:00:00 +0000

We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are unfeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$ where $ d $ are the feature dimension, $ H $ is the horizon, and $ T $ is the total number of steps. To the best of our knowledge, this is the first frequentist regret analysis for randomized exploration with function approximation.

Scaling up Kernel Ridge Regression via Locality Sensitive Hashing

Wed, 03 Jun 2020 00:00:00 +0000

Random binning features, introduced in the seminal paper of Rahimi and Recht ’07, are an efficient method for approximating a kernel matrix using locality sensitive hashing. Random binning features provide a very simple and efficient way to approximate the Laplace kernel but unfortunately do not apply to many important classes of kernels, notably ones that generate smooth Gaussian processes, such as the Gaussian kernel and Matern kernel. In this paper we introduce a simple weighted version of random binning features, and show that the corresponding kernel function generates Gaussian processes of any desired smoothness. We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression. Experiments on large scale regression datasets show that our method outperforms the accuracy of random Fourier features method.

Why Non-myopic Bayesian Optimization is Promising and How Far Should We Look-ahead? A Study via Rollout

Wed, 03 Jun 2020 00:00:00 +0000

Lookahead, also known as non-myopic, Bayesian optimization (BO) aims to find optimal sampling policies through solving a dynamic programming (DP) formulation that maximizes a long-term reward over a rolling horizon. Though promising, lookahead BO faces the risk of error propagation through its increased dependence on a possibly mis-specified model. In this work we focus on the rollout approximation for solving the intractable DP. We first prove the improving nature of rollout in tackling lookahead BO and provide a sufficient condition for the used heuristic to be rollout improving. We then provide both a theoretical and practical guideline to decide on the rolling horizon stagewise. This guideline is built on quantifying the negative effect of a mis-specified model. To illustrate our idea, we provide case studies on both single and multi-information source BO. Empirical results show the advantageous properties of our method over several myopic and non-myopic BO algorithms.

Discrete Action On-Policy Learning with Action-Value Critic

Wed, 03 Jun 2020 00:00:00 +0000

Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension, making it challenging to apply existing on-policy gradient based deep RL algorithms efficiently. To effectively operate in multidimensional discrete action spaces, we construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation. We follow rigorous statistical analysis to design how to generate and combine these correlated actions, and how to sparsify the gradients by shutting down the contributions from certain dimensions. These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques. We demonstrate these properties on OpenAI Gym benchmark tasks, and illustrate how discretizing the action space could benefit the exploration phase and hence facilitate convergence to a better local optimal solution thanks to the flexibility of discrete policy.

Learning Entangled Single-Sample Distributions via Iterative Trimming

Wed, 03 Jun 2020 00:00:00 +0000

In the setting of entangled single-sample distributions, the goal is to estimate some common parameter shared by a family of distributions, given one \emph{single} sample from each distribution. We study mean estimation and linear regression under general conditions, and analyze a simple and computationally efficient method based on iteratively trimming samples and re-estimating the parameter on the trimmed sample set. We show that the method in logarithmic iterations outputs an estimation whose error only depends on the noise level of the $\lceil \alpha n \rceil$-th noisiest data point where $\alpha$ is a constant and $n$ is the sample size. This means it can tolerate a constant fraction of high-noise points. These are the first such results under our general conditions with computationally efficient estimators. It also justifies the wide application and empirical success of iterative trimming in practice. Our theoretical results are complemented by experiments on synthetic data.

Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon $H$, which motivates a splurge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/ n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.

A Theoretical Case Study of Structured Variational Inference for Community Detection

Wed, 03 Jun 2020 00:00:00 +0000

Mean-field variational inference (MFVI) has been widely applied in large scale Bayesian inference. However, MFVI assumes independent distribution on the latent variables, which often leads to objective functions with many local optima, making optimization algorithms sensitive to initialization. In this paper, we study the advantage of structured variational inference in the context of the two-class Stochastic Blockmodel. To facilitate theoretical analysis, the variational distribution is constructed to have a simple pairwise dependency structure on the nodes of the network. We prove that, in a broad density regime and for general random initializations, unlike MFVI, the estimated class labels by structured VI converge to the ground truth with high probability, when the model parameters are known, estimated within a reasonable range or jointly optimized with the variational parameters. In addition, empirically we demonstrate structured VI is more robust compared with MFVI when the graph is sparse and the signal to noise ratio is low. The paper takes a first step towards understanding the importance of dependency structure in variational inference for community detection.

Fast and Accurate Ranking Regression

Wed, 03 Jun 2020 00:00:00 +0000

We consider a ranking regression problem in which we use a dataset of ranked choices to learn Plackett-Luce scores as functions of sample features. We solve the maximum likelihood estimation problem by using the Alternating Directions Method of Multipliers (ADMM), effectively separating the learning of scores and model parameters. This separation allows us to express scores as the stationary distribution of a continuous-time Markov Chain. Using this equivalence, we propose two spectral algorithms for ranking regression that learn model parameters up to 579 times faster than the Newton’s method.

Optimization of Graph Total Variation via Active-Set-based Combinatorial Reconditioning

Wed, 03 Jun 2020 00:00:00 +0000

Structured convex optimization on weighted graphs finds numerous applications in machine learning and computer vision. In this work, we propose a novel adaptive preconditioning strategy for proximal algorithms on this problem class. Our preconditioner is driven by a sharp analysis of the local linear convergence rate depending on the "active set" at the current iterate. We show that nested-forest decomposition of the inactive edges yields a guaranteed local linear convergence rate. Further, we propose a practical greedy heuristic which realizes such nested decompositions and show in several numerical experiments that our reconditioning strategy, when applied to proximal gradient or primal-dual hybrid gradient algorithm, achieves competitive performances. Our results suggest that local convergence analysis can serve as a guideline for selecting variable metrics in proximal algorithms.

“Bring Your Own Greedy”+Max: Near-Optimal 1/2-Approximations for Submodular Knapsack

Wed, 03 Jun 2020 00:00:00 +0000

The problem of selecting a small-size representative summary of a large dataset is a cornerstone of machine learning, optimization and data science. Motivated by applications to recommendation systems and other scenarios with query-limited access to vast amounts of data, we propose a new rigorous algorithmic framework for a standard formulation of this problem as a submodular maximization subject to a linear (knapsack) constraint. Our framework is based on augmenting all partial Greedy solutions with the best additional item. It can be instantiated with negligible overhead in any model of computation, which allows the classic greedy algorithm and its variants to be implemented. We give such instantiations in the offline Gready+Max, multi-pass streaming Sieve+Max and distributed Distributed Sieve+Max settings. Our algorithms give ($1/2-\eps$)-approximation with most other key parameters of interest being near-optimal. Our analysis is based on a new set of first-order linear differential inequalities and their robust approximate versions. Experiments on typical datasets (movie recommendations, influence maximization) confirm scalability and high quality of solutions obtained via our framework. Instance-specific approximations are typically in the 0.6-0.7 range and frequently beat even the $(1-1/e) \approx 0.63$ worst-case barrier for polynomial-time algorithms.

Laplacian-Regularized Graph Bandits: Algorithms and Theoretical Analysis

Wed, 03 Jun 2020 00:00:00 +0000

We consider a stochastic linear bandit problem with multiple users, where the relationship between users is captured by an underlying graph and user preferences are represented as smooth signals on the graph. We introduce a novel bandit algorithm where the smoothness prior is imposed via the random-walk graph Laplacian, which leads to a single-user cumulative regret scaling as $\Tilde{\mathcal{O}}(\Psi d \sqrt{T})$ with time horizon $T$, feature dimensionality $d$, and the scalar parameter $\Psi \in (0,1)$ that depends on the graph connectivity. This is an improvement over $\Tilde{\mathcal{O}}(d \sqrt{T})$ in \algo{LinUCB} \Ccite{li2010contextual}, where user relationship is not taken into account.In terms of network regret (sum of cumulative regret over $n$ users), the proposed algorithm leads to a scaling as $\Tilde{\mathcal{O}}(\Psi d\sqrt{nT})$, which is a significant improvement over $\Tilde{\mathcal{O}}(nd\sqrt{T})$ in the state-of-the-art algorithm \algo{Gob.Lin} \Ccite{cesa2013gang}. To improve scalability, we further propose a simplified algorithm with a linear computational complexity with respect to the number of users, while maintaining the same regret. Finally, we present a finite-time analysis on the proposed algorithms, and demonstrate their advantage in comparison with state-of-the-art graph-based bandit algorithms on both synthetic and real-world data.

Robustness for Non-Parametric Classification: A Generic Attack and Defense

Wed, 03 Jun 2020 00:00:00 +0000

Adversarially robust machine learning has received much recent attention. However, prior attacks and defenses for non-parametric classifiers have been developed in an ad-hoc or classifier-specific basis. In this work, we take a holistic look at adversarial examples for non-parametric classifiers, including nearest neighbors, decision trees, and random forests. We provide a general defense method, adversarial pruning, that works by preprocessing the dataset to become well-separated. To test our defense, we provide a novel attack that applies to a wide range of non-parametric classifiers. Theoretically, we derive an optimally robust classifier, which is analogous to the Bayes Optimal. We show that adversarial pruning can be viewed as a finite sample approximation to this optimal classifier. We empirically show that our defense and attack are either better than or competitive with prior work on non-parametric classifiers. Overall, our results provide a strong and broadly-applicable baseline for future work on robust non-parametrics.

Adaptive Online Kernel Sampling for Vertex Classification

Wed, 03 Jun 2020 00:00:00 +0000

This paper studies online kernel learning (OKL) for graph classification problem, since the large approximation space provided by reproducing kernel Hilbert spaces often contains an accurate function. Nonetheless, optimizing over this space is computationally expensive. To address this issue, approximate OKL is introduced to reduce the complexity either by limiting the support vector (SV) used by the predictor, or by avoiding the kernelization process altogether using embedding. Nonetheless, as long as the size of the approximation space or the number of SV does not grow over time, an adversarial environment can always exploit the approximation process. In this paper, we introduce an online kernel sampling (OKS) technique, a new second-order OKL method that slightly improve the bound from $O(d \log(T))$ down to $O(r \log(T))$ where $r$ is the rank of the learned data and is usually much smaller than d. To reduce the computational complexity of second-order methods, we introduce a randomized sampling algorithm for sketching kernel matrix $K_t$ and show that our method is effective to reduce the time and space complexity significantly while maintaining comparable performance. Empirical experimental results demonstrate that the proposed model is highly effective on real-world graph datasets.

Amortized Inference of Variational Bounds for Learning Noisy-OR

Wed, 03 Jun 2020 00:00:00 +0000

Classical approaches for approximate inference depend on cleverly designed variational distributions and bounds. Modern approaches employ amortized variational inference, which uses a neural network to approximate any posterior without leveraging the structures of the generative models. In this paper, we propose Amortized Conjugate Posterior (ACP), a hybrid approach taking advantages of both types of approaches. Specifically, we use the classical methods to derive specific forms of posterior distributions and then learn the variational parameters using amortized inference. We study the effectiveness of the proposed approach on the Noisy-OR model and compare to both the classical and the modern approaches for approximate inference and parameter learning. Our results show that the proposed method outperforms or are at par with other approaches.

A Linear-time Independence Criterion Based on a Finite Basis Approximation

Wed, 03 Jun 2020 00:00:00 +0000

Detection of statistical dependence between random variables is an essential component in many machine learning algorithms. We propose a novel independence criterion for two random variables with linear-time complexity. We establish that our independence criterion is an upper bound of the Hirschfeld-Gebelein-Rényi maximum correlation coefficient between tested variables. A finite set of basis functions is employed to approximate the mapping functions that can achieve the maximal correlation. Using classic benchmark experiments based on independent component analysis, we demonstrate that our independence criterion performs comparably with the state-of-the-art quadratic-time kernel dependence measures like the Hilbert-Schmidt Independence Criterion, while being more efficient in computation. The experimental results also show that our independence criterion outperforms another contemporary linear-time kernel dependence measure, the Finite Set Independence Criterion. The potential application of our criterion in deep neural networks is validated experimentally.

Auditing ML Models for Individual Bias and Unfairness

Wed, 03 Jun 2020 00:00:00 +0000

We consider the task of auditing ML models for individual bias/unfairness. We formalize the task in an optimization problem and develop a suite of inferential tools for the optimal value. Our tools permit us to obtain asymptotic confidence intervals and hypothesis tests that cover the target/control the Type I error rate exactly. To demonstrate the utility of our tools, we use them to reveal the gender and racial biases in Northpointe’s COMPAS recidivism prediction instrument.

Thresholding Bandit Problem with Both Duels and Pulls

Wed, 03 Jun 2020 00:00:00 +0000

The Thresholding Bandit Problem (TBP) aims to find the set of arms with mean rewards greater than a given threshold. We consider a new setting of TBP, where in addition to pulling arms, one can also duel two arms and get the arm with a greater mean. In our motivating application from crowdsourcing, dueling two arms can be more cost-effective and time-efficient than direct pulls. We refer to this problem as TBP with Dueling Choices (TBP-DC). This paper provides an algorithm called Rank-Search (RS) for solving TBP-DC by alternating between ranking and binary search. We prove theoretical guarantees for RS, and also give lower bounds to show the optimality of it. Experiments show that RS outperforms previous baseline algorithms that only use pulls or duels.

Accelerated Primal-Dual Algorithms for Distributed Smooth Convex Optimization over Networks

Wed, 03 Jun 2020 00:00:00 +0000

This paper proposes a novel family of primal-dual-based distributed algorithms for smooth, convex, multi-agent optimization over networks that uses only gradient information and gossip communications. The algorithms can also employ acceleration on the computation and communications. We provide a unified analysis of their convergence rate, measured in terms of the Bregman distance associated to the saddle point reformation of the distributed optimization problem. When acceleration is employed, the rate is shown to be optimal, in the sense that it matches (under the proposed metric) existing complexity lower bounds of distributed algorithms applicable to such a class of problem and using only gradient information and gossip communications. Preliminary numerical results on distributed least-square regression problems show that the proposed algorithm compares favorably on existing distributed schemes.

A Stein Goodness-of-fit Test for Directional Distributions

Wed, 03 Jun 2020 00:00:00 +0000

In many fields, data appears in the form of direction (unit vector) and usual statistical procedures are not applicable to such directional data. In this study, we propose nonparametric goodness-of-fit testing procedures for general directional distributions based on kernel Stein discrepancy. Our method is based on Stein’s operator on spheres, which is derived by using Stokes’ theorem. Notably, the proposed method is applicable to distributions with an intractable normalization constant, which commonly appear in directional statistics. Experimental results demonstrate that the proposed methods control type-I error well and have larger power than existing tests, including the test based on the maximum mean discrepancy.

Linear Convergence of Adaptive Stochastic Gradient Descent

Wed, 03 Jun 2020 00:00:00 +0000

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak Lojasiewicz (PL) inequality. The paper introduces the notion of Restricted Uniform Inequality of Gradients (RUIG)—which is a measure of the balanced-ness of the stochastic gradient norms—to depict the landscape of a function. RUIG plays a key role in proving the robustness of AdaGrad-Norm to its hyper-parameter tuning in the stochastic setting. On top of RUIG, we develop a two-stage framework to prove the linear convergence of AdaGrad-Norm without knowing the parameters of the objective functions. This framework can likely be extended to other adaptive stepsize algorithms. The numerical experiments validate the theory and suggest future directions for improvement.

On Minimax Optimality of GANs for Robust Mean Estimation

Wed, 03 Jun 2020 00:00:00 +0000

Generative adversarial networks (GANs) have become one of the most popular generative modeling techniques in machine learning. In this work, we study the statistical and robust properties of GANs for Gaussian mean estimation under Huber’s contamination model, where an epsilon proportion of training data may be arbitrarily corrupted. We prove that f-GAN, when equipped with appropriate discriminators, achieve optimal minimax rate, hence extending the recent result of Gao et al. (2019a). In contrast, we show that other GAN variants such as MMD-GAN (with Gaussian kernel) and W-GAN may fail to achieve minimax optimality. We further adapt f-GAN to the sparse and the unknown covariance settings. We perform numerical simulations to confirm our theoretical findings and reveal new insights on the importance of discriminators.

Stochastic Linear Contextual Bandits with Diverse Contexts

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we investigate the impact of context diversity on stochastic linear contextual bandits. As opposed to the previous view that contexts lead to more difficult bandit learning, we show that when the contexts are sufficiently diverse, the learner is able to utilize the information obtained during exploitation to shorten the exploration process, thus achieving reduced regret. We design the LinUCB-d algorithm, and propose a novel approach to analyze its regret performance. The main theoretical result is that under the diverse context assumption, the cumulative expected regret of LinUCB-d is bounded by a constant. As a by-product, our results improve the previous understanding of LinUCB and strengthen its performance guarantee.

Causal Mosaic: Cause-Effect Inference via Nonlinear ICA and Ensemble Method

Wed, 03 Jun 2020 00:00:00 +0000

We address the problem of distinguishing cause from effect in bivariate setting. Based on recent developments in nonlinear independent component analysis (ICA), we train general nonlinear causal models that are implemented by neural networks and allow non-additive noise. Further, we build an ensemble framework, namely Causal Mosaic, which models a causal pair by a mixture of nonlinear models. We compare this method with other recent methods on artificial and real world benchmark datasets, and our method shows state-of-the-art performance.

Graph DNA: Deep Neighborhood Aware Graph Encoding for Collaborative Filtering

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we consider recommender systems with side information in the form of graphs. Existing collaborative filtering algorithms mainly utilize only immediate neighborhood information and do not efficiently take advantage of deeper neighborhoods beyond 1-2 hops. The main issue with exploiting deeper graph information is the rapidly growing time and space complexity when incorporating information from these neighborhoods. In this paper, we propose using Graph DNA, a novel Deep Neighborhood Aware graph encoding algorithm, for exploiting multi-hop neighborhood information. DNA encoding computes approximate deep neighborhood information in linear time using Bloom filters, and results in a per-node encoding whose dimension is logarithmic in the number of nodes in the graph. It can be used in conjunction with both feature-based and graph-regularization-based collaborative filtering algorithms. Graph DNA has the advantages of being memory and time efficient and providing additional regularization when compared to directly using higher order graph information. We provide theoretical performance bounds for graph DNA encoding, and experimentally show that graph DNA can be used with 4 popular collaborative filtering algorithms to consistently boost their performances with little computational and memory overhead.

Minimax Testing of Identity to a Reference Ergodic Markov Chain

Wed, 03 Jun 2020 00:00:00 +0000

We exhibit an efficient procedure for testing, based on a single long state sequence, whether an unknown Markov chain is identical to or e-far from a given reference chain. We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. Perhaps surprisingly, we discover that the sample complexity depends solely on the properties of the known reference chain and does not involve the unknown chain at all, which is not even assumed to be ergodic.

Approximate Cross-validation: Guarantees for Model Assessment and Selection

Wed, 03 Jun 2020 00:00:00 +0000

Cross-validation (CV) is a popular approach for assessing and selecting predictive models. However, when the number of folds is large, CV suffers from a need to repeatedly refit a learning procedure on a large number of training datasets. Recent work in empirical risk minimization (ERM) approximates the expensive refitting with a single Newton step warm-started from the full training set optimizer. While this can greatly reduce runtime, several open questions remain including whether these approximations lead to faithful model selection and whether they are suitable for non-smooth objectives. We address these questions with three main contributions: (i) we provide uniform non-asymptotic, deterministic model assessment guarantees for approximate CV; (ii) we show that (roughly) the same conditions also guarantee model selection performance comparable to CV; (iii) we provide a proximal Newton extension of the approximate CV framework for non-smooth prediction problems and develop improved assessment guarantees for problems such as L1-regularized ERM.

Non-Parametric Calibration for Classification

Wed, 03 Jun 2020 00:00:00 +0000

Many applications of classification methods not only require high accuracy but also reliable estimation of predictive uncertainty. However, while many current classification frameworks, in particular deep neural networks, achieve high accuracy, they tend to incorrectly estimate uncertainty. In this paper, we propose a method that adjusts the confidence estimates of a general classifier such that they approach the probability of classifying correctly. In contrast to existing approaches, our calibration method employs a non-parametric representation using a latent Gaussian process, and is specifically designed for multi-class classification. It can be applied to any classifier that outputs confidence estimates and is not limited to neural networks. We also provide a theoretical analysis regarding the over- and underconfidence of a classifier and its relationship to calibration, as well as an empirical outlook for calibrated active learning. In experiments we show the universally strong performance of our method across different classifiers and benchmark data sets, in particular for state-of-the art neural network architectures.

An Empirical Study of Stochastic Gradient Descent with Structured Covariance Noise

Wed, 03 Jun 2020 00:00:00 +0000

The choice of batch-size in a stochastic optimization algorithm plays a substantial role for both optimization and generalization. Increasing the batch-size used typically improves optimization but degrades generalization. To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. We demonstrate that the learning performance of our method is more accurately captured by the structure of the covariance matrix of the noise rather than by the variance of gradients. Moreover, over the convex-quadratic, we prove in theory that it can be characterized by the Frobenius norm of the noise matrix. Our empirical studies with standard deep learning model-architectures and datasets shows that our method not only improves generalization performance in large-batch training, but furthermore, does so in a way where the optimization performance remains desirable and the training duration is not elongated.

Structured Conditional Continuous Normalizing Flows for Efficient Amortized Inference in Graphical Models

Wed, 03 Jun 2020 00:00:00 +0000

We exploit minimally faithful inversion of graphical model structures to specify sparse continuous normalizing flows (CNFs) for amortized inference. We find that the sparsity of this factorization can be exploited to reduce the numbers of parameters in the neural network, adaptive integration steps of the flow, and consequently FLOPs at both training and inference time without decreasing performance in comparison to unconstrained flows. By expressing the structure inversion as a compilation pass in a probabilistic programming language, we are able to apply it in a novel way to models as complex as convolutional neural networks. Furthermore, we extend the training objective for CNFs in the context of inference amortization to the symmetric Kullback-Leibler divergence, and demonstrate its theoretical and practical advantages.

Optimized Score Transformation for Fair Classification

Wed, 03 Jun 2020 00:00:00 +0000

This paper considers fair probabilistic classification where the outputs of primary interest are predicted probabilities, commonly referred to as scores. We formulate the problem of transforming scores to satisfy fairness constraints while minimizing the loss in utility. The formulation can be applied either to post-process classifier outputs or to pre-process training data, thus allowing maximum freedom in selecting a classification algorithm. We derive a closed-form expression for the optimal transformed scores and a convex optimization problem for the transformation parameters. In the population limit, the transformed score function is the fairness-constrained minimizer of cross-entropy with respect to the optimal unconstrained scores. In the finite sample setting, we propose to approach this solution using a combination of standard probabilistic classifiers and ADMM. Comprehensive experiments comparing to 10 existing methods show that the proposed FairScoreTransformer has advantages for score-based metrics such as Brier score and AUC while remaining competitive for binary label-based metrics such as accuracy.

Neighborhood Growth Determines Geometric Priors for Relational Representation Learning

Wed, 03 Jun 2020 00:00:00 +0000

The problem of identifying geometric structure in heterogeneous, high-dimensional data is a cornerstone of representation learning. While there exists a large body of literature on the embeddability of canonical graphs, such as lattices or trees, the heterogeneity of the relational data typically encountered in practice limits the applicability of these classical methods. In this paper, we propose a combinatorial approach to evaluating embeddability, i.e., to decide whether a data set is best represented in Euclidean, Hyperbolic or Spherical space. Our method analyzes nearest-neighbor structures and local neighborhood growth rates to identify the geometric priors of suitable embedding spaces. For canonical graphs, the algorithm’s prediction provably matches classical results. As for large, heterogeneous graphs, we introduce an efficiently computable statistic that approximates the algorithm’s decision rule. We validate our method over a range of benchmark data sets and compare with recently published optimization-based embeddability methods.

Coping With Simulators That Don’t Always Return

Wed, 03 Jun 2020 00:00:00 +0000

Deterministic models are approximations of reality that are easy to interpret and often easier to build than stochastic alternatives. Unfortunately, as nature is capricious, observational data can never be fully explained by deterministic models in practice. Observation and process noise need to be added to adapt deterministic models to behave stochastically, such that they are capable of explaining and extrapolating from noisy data. We investigate and address computational inefficiencies that arise from adding process noise to deterministic simulators that fail to return for certain inputs; a property we describe as ’brittle’. We show how to train a conditional normalizing flow to propose perturbations such that the simulator succeeds with high probability, increasing computational efficiency.

Black-Box Inference for Non-Linear Latent Force Models

Wed, 03 Jun 2020 00:00:00 +0000

Latent force models are systems whereby there is a mechanistic model describing the dynamics of the system state, with some unknown forcing term that is approximated with a Gaussian process. If such dynamics are non-linear, it can be difficult to estimate the posterior state and forcing term jointly, particularly when there are system parameters that also need estimating. This paper uses black-box variational inference to jointly estimate the posterior, designing a multivariate extension to local inverse autoregressive flows as a flexible approximator of the system. We compare estimates on systems where the posterior is known, demonstrating the effectiveness of the approximation, and apply to problems with non-linear dynamics, multi-output systems and models with non-Gaussian likelihoods.

Optimal Algorithms for Multiplayer Multi-Armed Bandits

Wed, 03 Jun 2020 00:00:00 +0000

The paper addresses various Multiplayer Multi-Armed Bandit (MMAB) problems, where M decision-makers, or players, collaborate to maximize their cumulative reward. We first investigate the MMAB problem where players selecting the same arms experience a collision (and are aware of it) and do not collect any reward. For this problem, we present DPE1 (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same asymptotic regret as that obtained by an optimal centralized algorithm. DPE1 is simpler than the state-of-the-art algorithm SIC-MMAB Boursier and Perchet (2019), and yet offers better performance guarantees. We then study the MMAB problem without collision, where players may select the same arm. Players sit on vertices of a graph, and in each round, they are able to send a message to their neighbours in the graph. We present DPE2, a simple and asymptotically optimal algorithm that outperforms the state-of-the-art algorithm DD- UCB Martinez-Rubio et al. (2019). Besides, under DPE2, the expected number of bits transmitted by the players in the graph is finite.

Learning Dynamic Hierarchical Topic Graph with Graph Convolutional Network for Document Classification

Wed, 03 Jun 2020 00:00:00 +0000

Constructing a graph with graph convolutional network (GCN) to explore the relational structure of the data has attracted lots of interests in various tasks. However, for document classification, existing graph based methods often focus on the straightforward word-word and word-document relations, ignoring the hierarchical semantics. Besides, the graph construction is often independent from the task-specific GCN learning. To address these constrains, we integrate a probabilistic deep topic model into graph construction, and propose a novel trainable hierarchical topic graph (HTG), including word-level, hierarchical topic-level and document-level nodes, exhibiting semantic variation from fine-grained to coarse. Regarding the document classification as a document-node label generation task, HTG can be dynamically evolved with GCN by performing variational inference, which leads to an end-to-end document classification method, named dynamic HTG (DHTG). Besides achieving state-of-the-art classification results, our model learns an interpretable document graph with meaningful node embeddings and semantic edges.

Online Batch Decision-Making with High-Dimensional Covariates

Wed, 03 Jun 2020 00:00:00 +0000

We propose and investigate a class of new algorithms for sequential decision making that interacts with a batch of users simultaneously instead of a user at each decision epoch. This type of batch models is motivated by interactive marketing and clinical trial, where a group of people are treated simultaneously and the outcomes of the whole group are collected before the next stage of decision. In such a scenario, our goal is to allocate a batch of treatments to maximize treatment efficacy based on observed high-dimensional user covariates. We deliver a solution, named Teamwork LASSO Bandit algorithm, that resolves a batch version of explore-exploit dilemma via switching between teamwork stage and selfish stage during the whole decision process. This is made possible based on statistical properties of LASSO estimate of treatment efficacy that adapts to a sequence of batch observations. In general, a rate of optimal allocation condition is proposed to delineate the exploration and exploitation trade-off on the data collection scheme, which is sufficient for LASSO to identify the optimal treatment for observed user covariates. An upper bound on expected cumulative regret of the proposed algorithm is provided.

A Wasserstein Minimum Velocity Approach to Learning Unnormalized Models

Wed, 03 Jun 2020 00:00:00 +0000

Score matching provides an effective approach to learning flexible unnormalized models, but its scalability is limited by the need to evaluate a second-order derivative. In this paper, we present a scalable approximation to a general family of learning objectives including score matching, by observing a new connection between these objectives and Wasserstein gradient flows. We present applications with promise in learning neural density estimators on manifolds, and training implicit variational and Wasserstein auto-encoders with a manifold-valued prior.

Causal inference in degenerate systems: An impossibility result

Wed, 03 Jun 2020 00:00:00 +0000

Causal relationships among variables are commonly represented via directed acyclic graphs. There are many methods in the literature to quantify the strength of arrows in a causal acyclic graph. These methods, however, have undesirable properties when the causal system represented by a directed acyclic graph is degenerate. In this paper, we characterize a degenerate causal system using multiplicity of Markov boundaries. We show that in this case, it is impossible to find an identifiable quantitative measure of causal effects that satisfy a set of natural criteria. To supplement the impossibility result, we also develop algorithms to identify degenerate causal systems from observed data. Performance of our algorithms is investigated through synthetic data analysis.

Finite-Time Error Bounds for Biased Stochastic Approximation with Applications to Q-Learning

Wed, 03 Jun 2020 00:00:00 +0000

Inspired by the widespread use of Q-learning algorithms in reinforcement learning (RL), this present paper studies a class of biased stochastic approximation (SA) procedures under an ‘ergodic-like’ assumption on the underlying stochastic noise sequence. Leveraging a \emph{multistep Lyapunov function} that looks ahead to several future updates to accommodate the gradient bias, we prove a general result on the convergence of the iterates, and use it to derive finite-time bounds on the mean-square error in the case of constant stepsizes. This novel viewpoint renders the finite-time analysis of \emph{biased SA} algorithms under a broad family of stochastic perturbations possible. For direct comparison with past works, we also demonstrate these bounds by applying them to Q-learning with linear function approximation, under the realistic Markov chain observation model. The resultant finite-time error bound for Q-learning is \emph{the first of its kind}, in the sense that it holds: i) for the unmodified version (i.e., without making any modifications to the updates), and ii), for Markov chains starting from any initial distribution, at least one of which has to be violated for existing results to be applicable.

Learning High-dimensional Gaussian Graphical Models under Total Positivity without Adjustment of Tuning Parameters

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of estimating an undirected Gaussian graphical model when the underlying distribution is multivariate totally positive of order 2 (MTP2), a strong form of positive dependence. Such distributions are relevant for example for portfolio selection, since assets are usually positively dependent. A large body of methods have been proposed for learning undirected graphical models without the MTP2 constraint. A major limitation of these methods is that their structure recovery guarantees in the high-dimensional setting usually require a particular choice of a tuning parameter, which is unknown a priori in real world applications. We here propose a new method to estimate the underlying undirected graphical model under MTP2 and show that it is provably consistent in structure recovery without adjusting the tuning parameters. This is achieved by a constraint-based estimator that infers the structure of the underlying graphical model by testing the signs of the empirical partial correlation coefficients. We evaluate the performance of our estimator in simulations and on financial data.

Assessing Local Generalization Capability in Deep Models

Wed, 03 Jun 2020 00:00:00 +0000

While it has not yet been proven, empirical evidence suggests that model generalization is related to local properties of the optima, which can be described via the Hessian. We connect model generalization with the local property of a solution under the PAC-Bayes paradigm. In particular, we prove that model generalization ability is related to the Hessian, the higher-order “smoothness" terms characterized by the Lipschitz constant of the Hessian, and the scales of the parameters. Guided by the proof, we propose a metric to score the generalization capability of a model, as well as an algorithm that optimizes the perturbed model accordingly.

Deontological Ethics By Monotonicity Shape Constraints

Wed, 03 Jun 2020 00:00:00 +0000

We demonstrate how easy it is for modern machine-learned systems to violate common deontological ethical principles and social norms such as “favor the less fortunate,” and “do not penalize good attributes.” We propose that in some cases such ethical principles can be incorporated into a machine-learned model by adding shape constraints that constrain the model to respond only positively to relevant inputs. We analyze the relationship between these deontological constraints that act on individuals and the consequentialist group-based fairness goals of one-sided statistical parity and equal opportunity. This strategy works with sensitive attributes that are Boolean or real-valued such as income and age, and can help produce more responsible and trustworthy AI.

The Sylvester Graphical Lasso (SyGlasso)

Wed, 03 Jun 2020 00:00:00 +0000

This paper introduces the Sylvester graphical lasso (SyGlasso) that captures multiway dependencies present in tensor-valued data. The model is based on the Sylvester equation that defines a generative model. The proposed model complements the tensor graphical lasso (Greenewald et al., 2019) that imposes a Kronecker sum model for the inverse covariance matrix, by providing an alternative Kronecker sum model that is generative and interpretable. A nodewise regression approach is adopted for estimating the conditional independence relationships among variables. The statistical convergence of the method is established, and empirical studies are provided to demonstrate the recovery of meaningful conditional dependency graphs. We apply the SyGlasso to an electroencephalography (EEG) study to compare the brain connectivity of alcoholic and nonalcoholic subjects. We demonstrate that our model can simultaneously estimate both the brain connectivity and its temporal dependencies.

Neural Topic Model with Attention for Supervised Learning

Wed, 03 Jun 2020 00:00:00 +0000

Topic modeling utilizing neural variational inference has shown promising results recently. Unlike traditional Bayesian topic models, neural topic models use deep neural network to approximate the intractable marginal distribution and thus gain strong generalisation ability. However, neural topic models are unsupervised model. Directly using the document-specific topic proportions in downstream prediction tasks could lead to sub-optimal performance. This paper presents Topic Attention Model (TAM), a supervised neural topic model that integrates an attention recurrent neural network (RNN) model. We design a novel way to utilize document-specific topic proportions and global topic vectors learned from neural topic model in the attention mechanism. We also develop backpropagation inference method that allows for joint model optimisation. Experimental results on three public datasets show that TAM not only significantly improves supervised learning tasks, including classification and regression, but also achieves lower perplexity for the document modeling.

Uncertainty Quantification for Sparse Deep Learning

Wed, 03 Jun 2020 00:00:00 +0000

Deep learning methods continue to have a decided impact on machine learning, both in theory and in practice. Statistical theoretical developments have been mostly concerned with approximability or rates of estimation when recovering infinite dimensional objects (curves or densities). Despite the impressive array of available theoretical results, the literature has been largely silent about uncertainty quantification for deep learning. This paper takes a step forward in this important direction by taking a Bayesian point of view. We study Gaussian approximability of certain aspects of posterior distributions of sparse deep ReLU architectures in non-parametric regression. Building on tools from Bayesian non-parametrics, we provide semi-parametric Bernstein-von Mises theorems for linear and quadratic functionals, which guarantee that implied Bayesian credible regions have valid frequentist coverage. Our results provide new theoretical justifications for (Bayesian) deep learning with ReLU activation functions, highlighting their inferential potential.

Stretching the Effectiveness of MLE from Accuracy to Bias for Pairwise Comparisons

Wed, 03 Jun 2020 00:00:00 +0000

A number of applications (e.g., AI bot tournaments, sports, peer grading, crowdsourcing) use pairwise comparison data and the Bradley-Terry-Luce (BTL) model to evaluate a given collection of items (e.g., bots, teams, students, search results). Past work has shown that under the BTL model, the widely-used maximum-likelihood estimator (MLE) is minimax-optimal in estimating the item parameters, in terms of the mean squared error. However, another important desideratum for designing estimators is fairness. In this work, we consider one specific type of fairness, which is the notion of bias in statistics. We show that the MLE incurs a suboptimal rate in terms of bias. We then propose a simple modification to the MLE, which "stretches" the bounding box of the maximum-likelihood optimizer by a small constant factor from the underlying ground truth domain. We show that this simple modification leads to an improved rate in bias, while maintaining minimax-optimality in the mean squared error. In this manner, our proposed class of estimators provably improves fairness in the sense of bias without loss in accuracy.

Convergence Rates of Gradient Descent and MM Algorithms for Bradley-Terry Models

Wed, 03 Jun 2020 00:00:00 +0000

We present tight convergence rate bounds for gradient descent and MM algorithms for maximum likelihood (ML) estimation and maximum a posteriori probability (MAP) estimation of a popular Bayesian inference method, for Bradley-Terry models of ranking data. Our results show that MM algorithms have the same convergence rate, up to a constant factor, as gradient descent algorithms with optimal constant step size. For the ML estimation objective, the convergence is linear with the rate crucially determined by the algebraic connectivity of the matrix of item pair co-occurrences in observed comparison data. For the MAP estimation objective, we show that the convergence rate is also linear, with the rate determined by a parameter of the prior distribution in a way that can make convergence arbitrarily slow for small values of this parameter. The limit of small values of this parameter corresponds to a flat, non-informative prior distribution.

A Multiclass Classification Approach to Label Ranking

Wed, 03 Jun 2020 00:00:00 +0000

In multiclass classification, the goal is to learn how to predict a random label $Y$, valued in $\mathcal{Y}=\{1,; \ldots,;{K} \}$ with $K\geq 3$, based upon observing a r.v. $X$, taking its values in $\mathbb{R}^q$ with $q\geq 1$ say, by means of a classification rule $g:\mathbb{R}^q\to \mathcal{Y}$ with minimum probability of error $\mathbb{P}\{Yeq g(X) \}$. However, in a wide variety of situations, the task targeted may be more ambitious, consisting in sorting all the possible label values $y$ that may be assigned to $X$ by decreasing order of the posterior probability $\eta_y(X)=\mathbb{P}\{Y=y \mid X \}$. This article is devoted to the analysis of this statistical learning problem, halfway between multiclass classification and posterior probability estimation (regression) and referred to as \textit{label ranking} here. We highlight the fact that it can be viewed as a specific variant of \textit{ranking median regression} (RMR), where, rather than observing a random permutation $\Sigma$ assigned to the input vector $X$ and drawn from a Bradley-Terry-Luce-Plackett model with conditional preference vector $(\eta_1(X),; \ldots,; \eta_K(X))$, the sole information available for training a label ranking rule is the label $Y$ ranked on top, namely $\Sigma^{-1}(1)$. Inspired by recent results in RMR, we prove that under appropriate noise conditions, the One-Versus-One (OVO) approach to multiclassification yields, as a by-product, an optimal ranking of the labels with overwhelming probability. Beyond theoretical guarantees, the relevance of the approach to label ranking promoted in this article is supported by experimental results.

Dynamic content based ranking

Wed, 03 Jun 2020 00:00:00 +0000

We introduce a novel state space model for a set of sequentially time-stamped partial rankings of items and textual descriptions for the items. Based on the data, the model infers text-based themes that are predictive of the rankings enabling forecasting tasks and performing trend analysis. We propose a scaled Gamma process based prior for capturing the underlying dynamics. Based on two challenging and contemporary real data collections, we show the model infers meaningful and useful textual themes as well as performs better than existing related dynamic models.

Momentum in Reinforcement Learning

Wed, 03 Jun 2020 00:00:00 +0000

We adapt the optimization’s concept of momentum to reinforcement learning. Seeing the state-action value functions as an anlog to the gradients in optimization, we interpret momentum as an average of consecutive $q$-functions. We derive Momentum Value Iteration (MoVI), a variation of Value iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations. We show that the proposed approach can be readily extended to deep learning. Specifically,we propose a simple improvement on DQN based on MoVI, and experiment it on Atari games.

Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

Wed, 03 Jun 2020 00:00:00 +0000

We propose RandUCB, a bandit strategy that uses theoretically derived confidence intervals similar to upper confidence bound (UCB) algorithms, but akin to Thompson sampling (TS), uses randomization to trade off exploration and exploitation. In the $K$-armed bandit setting, we show that there are infinitely many variants of RandUCB, all of which achieve the minimax-optimal $\widetilde{O}(\sqrt{K T})$ regret after $T$ rounds. Moreover, in a specific multi-armed bandit setting, we show that both UCB and TS can be recovered as special cases of RandUCB. For structured bandits, where each arm is associated with a $d$-dimensional feature vector and rewards are distributed according to a linear or generalized linear model, we prove that RandUCB achieves the minimax-optimal $\widetilde{O}(d \sqrt{T})$ regret even in the case of infinite arms. We demonstrate the practical effectiveness of RandUCB with experiments in both multi-armed and structured bandit settings. We show that RandUCB matches the empirical performance of TS while matching the theoretically optimal bounds of UCB algorithms, thus achieving the best of both worlds.

On the optimality of kernels for high-dimensional clustering

Wed, 03 Jun 2020 00:00:00 +0000

This paper studies the optimality of kernel methods in high dimensional data clustering. Recent works have studied the large sample performance of kernel clustering in the high dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of $\sqrt{2}$. It also exactly matches the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for the kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the less known polynomial concentration results for random variables with exponentially decaying tails in the higher-order analysis of kernel methods.

Online Convex Optimization with Perturbed Constraints: Optimal Rates against Stronger Benchmarks

Wed, 03 Jun 2020 00:00:00 +0000

This paper studies Online Convex Optimization (OCO) problems where the constraints have additive perturbations that (i) vary over time and (ii) are not known at the time to make a decision. Perturbations may not be i.i.d. generated and can be used, for example, to model a time-varying budget or time-varying requests in resource allocation problems. Our goal is to design a policy that obtains sublinear regret and satisfies the constraints in the long-term. To this end, we present an online primal-dual proximal gradient algorithm that has $O(T^\epsilon \vee T^{1-\epsilon})$ regret and $O(T^\epsilon)$ constraint violation, where $\epsilon \in [0,1)$ is a parameter in the learning rate. The proposed algorithm obtains optimal rates when $\epsilon = 1/2$, and can compare against a stronger comparator (the set of fixed decisions in hindsight) than previous work.

Revisiting the Landscape of Matrix Factorization

Wed, 03 Jun 2020 00:00:00 +0000

Prior work has shown that low-rank matrix factorization has infinitely many critical points, each of which is either a global minimum or a (strict) saddle point. We revisit this problem and provide simple, intuitive proofs of a set of extended results for low-rank and general-rank problems. We couple our investigation with a known invariant manifold M0 of gradient flow. This restriction admits a uniform negative upper bound on the least eigenvalue of the Hessian map at all strict saddles in M0. The bound depends on the size of the nonzero singular values and the separation between distinct singular values of the matrix to be factorized.

Nonparametric Sequential Prediction While Deep Learning the Kernel

Wed, 03 Jun 2020 00:00:00 +0000

The research on online learning under stationary and ergodic processes has been mainly focused on achieving asymptotic guarantees. Although all the methods pursue the same asymptotic goal, their performance varies when handling finite sample datasets and depends heavily on which predefined density estimation method is chosen. In this paper, therefore, we propose a novel algorithm that simultaneously satisfies a short-term goal, to perform as good as the best choice in hindsight of a data-adaptive kernel, learned using a deep neural network, and a long-term goal, to achieve the same theoretical asymptotic guarantee. We present theoretical proofs for our algorithms and demonstrate the validity of our method on the online portfolio selection problem.

Long-and Short-Term Forecasting for Portfolio Selection with Transaction Costs

Wed, 03 Jun 2020 00:00:00 +0000

In this paper we focus on the problem of online portfolio selection with transaction costs. We tackle this problem using a novel approach for combining the predictions of long-term experts with those of short-term experts so as to effectively reduce transaction costs. We prove that the new strategy maintains bounded regret relative to the performance of the best possible combination (switching times) of the long-and short-term experts. We empirically validate our approach on several standard benchmark datasets. These studies indicate that the proposed approach achieves state-of-the-art performance.

Monotonic Gaussian Process Flows

Wed, 03 Jun 2020 00:00:00 +0000

We propose a new framework for imposing monotonicity constraints in a Bayesian non-parametric setting based on numerical solutions of stochastic differential equations. We derive a nonparametric model of monotonic functions that allows for interpretable priors and principled quantification of hierarchical uncertainty. We demonstrate the efficacy of the proposed model by providing competitive results to other probabilistic monotonic models on a number of benchmark functions. In addition, we consider the utility of a monotonic random process as a part of a hierarchical probabilistic model; we examine the task of temporal alignment of time-series data where it is beneficial to use a monotonic random process in order to preserve the uncertainty in the temporal warpings.

Imputation estimators for unnormalized models with missing data

Wed, 03 Jun 2020 00:00:00 +0000

Several statistical models are given in the form of unnormalized densities and calculation of the normalization constant is intractable. We propose estimation methods for such unnormalized models with missing data. The key concept is to combine imputation techniques with estimators for unnormalized models including noise contrastive estimation and score matching. Further, we derive asymptotic distributions of the proposed estimators and construct confidence intervals. Simulation results with truncated Gaussian graphical models and the application to real data of wind direction demonstrate that the proposed methods enable statistical inference from missing data properly.

A Unified Statistically Efficient Estimation Framework for Unnormalized Models

Wed, 03 Jun 2020 00:00:00 +0000

The parameter estimation of unnormalized models is a challenging problem. The maximum likelihood estimation (MLE) is computationally infeasible for these models since normalizing constants are not explicitly calculated. Although some consistent estimators have been proposed earlier, the problem of statistical efficiency remains. In this study, we propose a unified, statistically efficient estimation framework for unnormalized models and several efficient estimators, whose asymptotic variance is the same as the MLE. The computational cost of these estimators is also reasonable and they can be employed whether the sample space is discrete or continuous. The loss functions of the proposed estimators are derived by combining the following two methods: (1) density-ratio matching using Bregman divergence, and (2) plugging-in nonparametric estimators. We also analyze the properties of the proposed estimators when the unnormalized models are misspecified. The experimental results demonstrate the advantages of our method over existing approaches.

Deep Structured Mixtures of Gaussian Processes

Wed, 03 Jun 2020 00:00:00 +0000

Gaussian Processes (GPs) are powerful non-parametric Bayesian regression models that allow exact posterior inference, but exhibit high computational and memory costs. In order to improve scalability of GPs, approximate posterior inference is frequently employed, where a prominent class of approximation techniques is based on local GP experts. However, local-expert techniques proposed so far are either not well-principled, come with limited approximation guarantees, or lead to intractable models. In this paper, we introduce deep structured mixtures of GP experts, a stochastic process model which i) allows exact posterior inference, ii) has attractive computational and memory costs, and iii) when used as GP approximation, captures predictive uncertainties consistently better than previous expert-based approximations. In a variety of experiments, we show that deep structured mixtures have a low approximation error and often perform competitive or outperform prior work.

Taxonomy of Dual Block-Coordinate Ascent Methods for Discrete Energy Minimization

Wed, 03 Jun 2020 00:00:00 +0000

We consider the maximum-a-posteriori inference problem in discrete graphical models and study solvers based on the dual block-coordinate ascent rule. We map all existing solvers in a single framework, allowing for a better understanding of their design principles. We theoretically show that some block-optimizing updates are sub-optimal and how to strictly improve them. On a wide range of problem instances of varying graph connectivity, we study the performance of existingsolvers as well as new variants that can be obtained within the framework. As a result of this exploration we build a new state-of-the art solver, performing uniformly better on the whole range of test instances.

Diameter-based Interactive Structure Discovery

Wed, 03 Jun 2020 00:00:00 +0000

We introduce interactive structure discovery, a generic framework that encompasses many interactive learning settings, including active learning, top-k item identification, interactive drug discovery, and others. We adapt a recently developed active learning algorithm of Tosh and Dasgupta for interactive structure discovery, and show that the new algorithm can be made noise-tolerant and enjoys favorable query complexity bounds.

A Nonparametric Off-Policy Policy Gradient

Wed, 03 Jun 2020 00:00:00 +0000

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient algorithms that perform updates using on-policy samples. The priceof such inefficiency becomes evident in real world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. With nonparametric regression and density estimation methods we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function, and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate to show that it is consistent under mild smoothness assumptions and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.

A Novel Confidence-Based Algorithm for Structured Bandits

Wed, 03 Jun 2020 00:00:00 +0000

We study finite-armed stochastic bandits where the rewards of each arm might be correlated to those of other arms. We introduce a novel phased algorithm that exploits the given structure to build confidence sets over the parameters of the true bandit problem and rapidly discard all sub-optimal arms. In particular, unlike standard bandit algorithms with no structure, we show that the number of times a suboptimal arm is selected may actually be reduced thanks to the information collected by pulling other arms. Furthermore, we show that, in some structures, the regret of an anytime extension of our algorithm is uniformly bounded over time. For these constant-regret structures, we also derive a matching lower bound. Finally, we demonstrate numerically that our approach better exploits certain structures than existing methods.

On the interplay between noise and curvature and its effect on optimization and generalization

Wed, 03 Jun 2020 00:00:00 +0000

The speed at which one can minimize an expected loss using stochastic methods depends on two properties: the curvature of the loss and the variance of the gradients. While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. Further, as the ultimate goal is good generalization performance, we clarify how both curvature and noise are relevant to properly estimate the generalization gap. Realizing that the limitations of some existing works stems from a confusion between these matrices, we also clarify the distinction between the Fisher matrix, the Hessian, and the covariance matrix of the gradients.

Asynchronous Gibbs Sampling

Wed, 03 Jun 2020 00:00:00 +0000

Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method often used in Bayesian learning. MCMC methods can be difficult to deploy on parallel and distributed systems due to their inherently sequential nature. We study asynchronous Gibbs sampling, which achieves parallelism by simply ignoring sequential requirements. This method has been shown to produce good empirical results for some hierarchical models, and is popular in the topic modeling community, but was also shown to diverge for other targets. We introduce a theoretical framework for analyzing asynchronous Gibbs sampling and other extensions of MCMC that do not possess the Markov property. We prove that asynchronous Gibbs can be modified so that it converges under appropriate regularity conditions - we call this the exact asynchronous Gibbs algorithm. We study asynchronous Gibbs on a set of examples by comparing the exact and approximate algorithms, including two where it works well, and one where it fails dramatically. We conclude with a set of heuristics to describe settings where the algorithm can be effectively used.

Variational Optimization on Lie Groups, with Examples of Leading (Generalized) Eigenvalue Problems

Wed, 03 Jun 2020 00:00:00 +0000

The article considers smooth optimization of functions on Lie groups. By generalizing NAG variational principle in vector space (Wibisono et al., 2016) to general Lie groups, continuous Lie-NAG dynamics which are guaranteed to converge to local optimum are obtained. They correspond to momentum versions of gradient flow on Lie groups. A particular case of $\SO(n)$ is then studied in details, with objective functions corresponding to leading Generalized EigenValue problems: the Lie-NAG dynamics are first made explicit in coordinates, and then discretized in structure preserving fashions, resulting in optimization algorithms with faithful energy behavior (due to conformal symplecticity) and exactly remaining on the Lie group. Stochastic gradient versions are also investigated. Numerical experiments on both synthetic data and practical problem (LDA for MNIST) demonstrate the effectiveness of the proposed methods as optimization algorithms (\emph{not} as a classification method).

Variance Reduction for Evolution Strategies via Structured Control Variates

Wed, 03 Jun 2020 00:00:00 +0000

Evolution Strategies (ES) are a powerful class of blackbox optimization techniques that recently became a competitive alternative to state-of-the-art policy gradient (PG) algorithms for reinforcement learning (RL). We propose a new method for improving accuracy of the ES algorithms, that as opposed to recent approaches utilizing only Monte Carlo structure of the gradient estimator, takes advantage of the underlying MDP structure to reduce the variance. We observe that the gradient estimator of the ES objective can be alternatively computed using reparametrization and PG estimators, which leads to new control variate techniques for gradient estimation in ES optimization. We provide theoretical insights and show through extensive experiments that this RL-specific variance reduction approach outperforms general purpose variance reduction methods.

Learning Fair Representations for Kernel Models

Wed, 03 Jun 2020 00:00:00 +0000

Fair representations are a powerful tool for establishing criteria like statistical parity, proxy non-discrimination, and equality of opportunity in learned models. Existing techniques for learning these representations are typically model-agnostic, as they preprocess the original data such that the output satisfies some fairness criterion, and can be used with arbitrary learning methods. In contrast, we demonstrate the promise of learning a model-aware fair representation, focusing on kernel-based models. We leverage the classical Sufficient Dimension Reduction (SDR) framework to construct representations as subspaces of the reproducing kernel Hilbert space (RKHS), whose member functions are guaranteed to satisfy fairness. Our method supports several fairness criteria, continuous and discrete data, and multiple protected attributes. We further show how to calibrate the accuracy tradeoff by characterizing it in terms of the principal angles between subspaces of the RKHS. Finally, we apply our approach to obtain the first Fair Gaussian Process (FGP) prior for fair Bayesian learning, and show that it is competitive with, and in some cases outperforms, state-of-the-art methods on real data.

Sharp Asymptotics and Optimal Performance for Inference in Binary Models

Wed, 03 Jun 2020 00:00:00 +0000

We study convex empirical risk minimization for high-dimensional inference in binary models. Our first result sharply predicts the statistical performance of such estimators in the linear asymptotic regime under isotropic Gaussian features. Importantly, the predictions hold for a wide class of convex loss functions, which we exploit in order to prove a bound on the best achievable performance among them. Notably, we show that the proposed bound is tight for popular binary models (such as Signed, Logistic or Probit), by constructing appropriate loss functions that achieve it. More interestingly, for binary linear classification under the Logistic and Probit models, we prove that the performance of least-squares is no worse than 0.997 and 0.98 times the optimal one. Numerical simulations corroborate our theoretical findings and suggest they are accurate even for relatively small problem dimensions.

Finite-Time Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

Wed, 03 Jun 2020 00:00:00 +0000

Motivated by the emerging use of multi-agent reinforcement learning (MARL) in engineering applications such as networked robotics, swarming drones, and sensor networks, we investigate the policy evaluation problem in a fully decentralized setting, using temporal-difference (TD) learning with linear function approximation to handle large state spaces in practice. The goal of the group of agents is to collaboratively learn the value function of a given policy from locally private rewards observed in a shared environment, through exchanging local estimates with neighbors. Despite their simplicity and widespread use, our theoretical understanding of such decentralized TD learning algorithms remains limited. Existing results were obtained based on i.i.d. data samples, or by imposing an ‘additional’ projection step to control the ‘gradient’ bias incurred by the Markovian observations. In this paper, we provide a finite-time analysis of the fully decentralized TD(0) learning under both i.i.d. as well as Markovian samples, and prove that all local estimates converge linearly to a small neighborhood of the optimum. The resultant error bounds are the first of its type—in the sense that they hold under the most practical assumptions—which is made possible by means of a novel multi-step Lyapunov approach.

Multiplicative Gaussian Particle Filter

Wed, 03 Jun 2020 00:00:00 +0000

We propose a new sampling-based approach for approximate inference in filtering problems. Instead of approximating conditional distributions with a finite set of states, as done in particle filters, our approach approximates the distribution with a weighted sum of functions from a set of continuous functions. Central to the approach is the use of sampling to approximate multiplications in the Bayes filter. We provide theoretical analysis, giving conditions for sampling to give good approximation. We next specialize to the case of weighted sums of Gaussians, and show how properties of Gaussians enable closed-form transition and efficient multiplication. Lastly, we conduct preliminary experiments on a robot localization problem and compare performance with the particle filter, to demonstrate the potential of the proposed method.

Independent Subspace Analysis for Unsupervised Learning of Disentangled Representations

Wed, 03 Jun 2020 00:00:00 +0000

Recently there has been an increased interest in unsupervised learning of disentangled representations using the Variational Autoencoder (VAE) framework. Most of the existing work has focused largely on modifying the variational cost function to achieve this goal. We first show that these modifications, e.g. beta-VAE, simplify the tendency of variational inference to underfit, causing pathological over-pruning and over-orthogonalization of learned components. Second, we propose a complementary approach: to modify the probabilistic model with a structured latent prior. This prior discovers latent variable representations that are structured into a hierarchy of independent vector spaces. The proposed prior has three major advantages: First, in contrast to the standard VAE normal prior, the proposed prior is not rotationally invariant. This feature of our approach resolves the problem of unidentifiability of the standard VAE normal prior. Second, we demonstrate that the proposed prior encourages a disentangled latent representation which facilitates learning of disentangled representations. Third, extensive quantitative experiments demonstrate that the prior significantly mitigates the trade-off between reconstruction loss and disentanglement over the state of the art.

Gain with no Pain: Efficiency of Kernel-PCA by Nyström Sampling

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we analyze a Nyström based approach to efficient large scale kernel principal component analysis (PCA). The latter is a natural nonlinear extension of classical PCA based on considering a nonlinear feature map or the corresponding kernel. Like other kernel approaches, kernel PCA enjoys good mathematical and statistical properties but, numerically, it scales poorly with the sample size. Our analysis shows that Nyström sampling greatly improves computational efficiency without incurring any loss of statistical accuracy. While similar effects have been observed in supervised learning, this is the first such result for PCA. Our theoretical findings are based on a combination of analytic and concentration of measure techniques. Our study is more broadly motivated by the question of understanding the interplay between statistical and computational requirements for learning.

Approximate Cross-Validation in High Dimensions with Guarantees

Wed, 03 Jun 2020 00:00:00 +0000

Leave-one-out cross-validation (LOOCV) can be particularly accurate among cross-validation (CV) variants for machine learning assessment tasks – e.g., assessing methods’ error or variability. But it is expensive to re-fit a model $N$ times for a dataset of size $N$. Previous work has shown that approximations to LOOCV can be both fast and accurate – when the unknown parameter is of small, fixed dimension. But these approximations incur a running time roughly cubic in dimension – and we show that, besides computational issues, their accuracy dramatically deteriorates in high dimensions. Authors have suggested many potential and seemingly intuitive solutions, but these methods have not yet been systematically evaluated or compared. We find that all but one perform so poorly as to be unusable for approximating LOOCV. Crucially, though, we are able to show, both empirically and theoretically, that one approximation can perform well in high dimensions – in cases where the high-dimensional parameter exhibits sparsity. Under interpretable assumptions, our theory demonstrates that the problem can be reduced to working within an empirically recovered (small) support. This procedure is straightforward to implement, and we prove that its running time and error depend on the (small) support size even when the full parameter dimension is large.

The Area of the Convex Hull of Sampled Curves: a Robust Functional Statistical Depth measure

Wed, 03 Jun 2020 00:00:00 +0000

With the ubiquity of sensors in the IoT era, statistical observations are becoming increasingly available in the form of massive (multivariate) time-series. Formulated as unsupervised anomaly detection tasks, an abundance of applications like aviation safety management, the health monitoring of complex infrastructures or fraud detection can now rely on such functional data, acquired and stored with an ever finer granularity. The concept of \textit{statistical depth}, which reflects centrality of an arbitrary observation w.r.t. a statistical population may play a crucial role in this regard, anomalies corresponding to observations with ’small’ depth. Supported by sound theoretical and computational developments in the recent decades, it has proven to be extremely useful, in particular in functional spaces. However, most approaches documented in the literature consist in evaluating independently the centrality of each point forming the time series and consequently exhibit a certain insensitivity to possible shape changes.In this paper, we propose a novel notion of functional depth based on the area of the convex hull of sampled curves, capturing gradual departures from centrality, even beyond the envelope of the data, in a natural fashion.We discuss practical relevance of commonly imposed axioms on functional depths and investigate which of them are satisfied by the notion of depth we promote here. Estimation and computational issues are also adressed and various numerical experiments provide empirical evidence of the relevance of the approach proposed.

Sample complexity bounds for localized sketching

Wed, 03 Jun 2020 00:00:00 +0000

We consider sketched approximate matrix multiplication and ridge regression in the novel setting of localized sketching, where at any given point, only part of the data matrix is available. This corresponds to a block diagonal structure on the sketching matrix. We show that, under mild conditions, block diagonal sketching matrices require only $O(\sr / \epsilon^2)$ and $O(\sd_{\lambda}/\epsilon)$ total sample complexity for matrix multiplication and ridge regression, respectively. This matches the state-of-the-art bounds that are obtained using global sketching matrices. The localized nature of sketching considered allows for different parts of the data matrix to be sketched independently and hence is more amenable to computation in distributed and streaming settings and results in a smaller memory and computational footprint.

DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence guarantees. Our algorithm is communication-efficient in the sense that at every iteration the master node and workers communicate vectors of size $O(p)$, where $p$ is the dimension of the decision variable. The proposed method is based on a distributed asynchronous averaging scheme of decision vectors and gradients in a way to effectively capture the local Hessian information of the objective function. Our convergence theory supports asynchronous computations subject to both bounded delays and unbounded delays with a bounded time-average. Unlike in the majority of asynchronous optimization literature, we do not require choosing smaller stepsize when delays are huge. We provide numerical experiments that match our theoretical results and showcase significant improvement comparing to state-of-the-art distributed algorithms.

Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation

Wed, 03 Jun 2020 00:00:00 +0000

Autoregressive neural sequence generative models trained by Maximum Likelihood Estimation suffer the exposure bias problem in practical finite sample scenarios. The crux is that the number of training samples for Maximum Likelihood Estimation is usually limited and the input data distributions are different at training and inference stages. Many methods have been proposed to solve the above problem, which relies on sampling from the non-stationary model distribution and suffers from high variance or biased estimations. In this paper, we propose $\psi$-MLE, a new training scheme for autoregressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation. We derive our algorithm from a new perspective of self-augmentation and introduce bias correction with density ratio estimation. Extensive experimental results on synthetic data and real-world text generation tasks demonstrate that our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.

Balanced Off-Policy Evaluation in General Action Spaces

Wed, 03 Jun 2020 00:00:00 +0000

Estimation of importance sampling weights for off-policy evaluation of contextual bandits often results in imbalance—a mismatch between the desired and the actual distribution of state-action pairs after weighting. In this work we present balanced off-policy evaluation (B-OPE), a generic method for estimating weights which minimize this imbalance. Estimation of these weights reduces to a binary classification problem regardless of action type. We show that minimizing the risk of the classifier implies minimization of imbalance to the desired counterfactual distribution. In turn, this is tied to the error of the off-policy estimate, allowing for easy tuning of hyperparameters. We provide experimental evidence that B-OPE improves weighting-based approaches for offline policy evaluation in both discrete and continuous action spaces.

Rep the Set: Neural Networks for Learning Set Representations

Wed, 03 Jun 2020 00:00:00 +0000

In several domains, data objects can be decomposed into sets of simpler objects. It is then natural to represent each object as the set of its components or parts. Many conventional machine learning algorithms are unable to process this kind of representations, since sets may vary in cardinality and elements lack a meaningful ordering. In this paper, we present a new neural network architecture, called RepSet, that can handle examples that are represented as sets of vectors. The proposed model computes the correspondences between an input set and some hidden sets by solving a series of network flow problems. This representation is then fed to a standard neural network architecture to produce the output. The architecture allows end-to-end gradient-based learning. We demonstrate RepSet on classification tasks, including text categorization, and graph classification, and we show that the proposed neural network achieves performance better or comparable to state-of-the-art algorithms.

Context Mover’s Distance & Barycenters: Optimal Transport of Contexts for Building Representations

Wed, 03 Jun 2020 00:00:00 +0000

We present a framework for building unsupervised representations of entities and their compositions, where each entity is viewed as a probability distribution rather than a vector embedding. In particular, this distribution is supported over the contexts which co-occur with the entity and are embedded in a suitable low-dimensional space. This enables us to consider representation learning from the perspective of Optimal Transport and take advantage of its tools such as Wasserstein distance and barycenters. We elaborate how the method can be applied for obtaining unsupervised representations of text and illustrate the performance (quantitatively as well as qualitatively) on tasks such as measuring sentence similarity, word entailment and similarity, where we empirically observe significant gains (e.g., 4.1% relative improvement over Sent2vec, GenSen).The key benefits of the proposed approach include: (a) capturing uncertainty and polysemy via modeling the entities as distributions, (b) utilizing the underlying geometry of the particular task (with the ground cost), (c) simultaneously providing interpretability with the notion of optimal transport between contexts and (d) easy applicability on top of existing point embedding methods. The code, as well as pre-built histograms, are available under https://github.com/context-mover/.

Optimization Methods for Interpretable Differentiable Decision Trees Applied to Reinforcement Learning

Wed, 03 Jun 2020 00:00:00 +0000

Decision trees are ubiquitous in machine learning for their ease of use and interpretability. Yet, these models are not typically employed in reinforcement learning as they cannot be updated online via stochastic gradient descent. We overcome this limitation by allowing for a gradient update over the entire tree that improves sample complexity affords interpretable policy extraction. First, we include theoretical motivation on the need for policy-gradient learning by examining the properties of gradient descent over differentiable decision trees. Second, we demonstrate that our approach equals or outperforms a neural network on all domains and can learn discrete decision trees online with average rewards up to 7x higher than a batch-trained decision tree. Third, we conduct a user study to quantify the interpretability of a decision tree, rule list, and a neural network with statistically significant results (p < 0.001).

Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity

Wed, 03 Jun 2020 00:00:00 +0000

In this paper we settle the sampling complexity of solving discounted two-player turn-based zero-sum stochastic games up to polylogarithmic factors. Given a stochastic game with discount factor $\gamma\in(0,1)$ we provide an algorithm that computes an $\epsilon$-optimal strategy with high-probability given $\tilde{O}((1 - \gamma)^{-3} \epsilon^{-2})$ samples from the transition function for each state-action-pair. Our algorithm runs in time nearly linear in the number of samples and uses space nearly linear in the number of state-action pairs. As stochastic games generalize Markov decision processes (MDPs) our runtime and sample complexities are optimal due to \cite{azar2013minimax}. We achieve our results by showing how to generalize a near-optimal Q-learning based algorithms for MDP, in particular \cite{sidford2018near}, to two-player strategy computation algorithms. This overcomes limitations of standard Q-learning and strategy iteration or alternating minimization based approaches and we hope will pave the way for future reinforcement learning results by facilitating the extension of MDP results to multi-agent settings with little loss.

Deep Active Learning: Unified and Principled Method for Query and Training

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we are proposing a unified and principled method for both the querying and training processes in deep batch active learning. We are providing theoretical insights from the intuition of modeling the interactive procedure in active learning as distribution matching, by adopting the Wasserstein distance. As a consequence, we derived a new training loss from the theoretical analysis, which is decomposed into optimizing deep neural network parameters and batch query selection through alternative optimization. In addition, the loss for training a deep neural network is naturally formulated as a min-max optimization problem through leveraging the unlabeled data information. Moreover, the proposed principles also indicate an explicit uncertainty-diversity trade-off in the query batch selection. Finally, we evaluate our proposed method on different benchmarks, consistently showing better empirical performances and a better time-efficient query strategy compared to the baselines.

Accelerated Bayesian Optimisation through Weight-Prior Tuning

Wed, 03 Jun 2020 00:00:00 +0000

Bayesian optimization (BO) is a widely-used method for optimizing expensive (to evaluate) problems. At the core of most BO methods is the modeling of the objective function using a Gaussian Process (GP) whose covariance is selected from a set of standard covariance functions. From a weight-space view, this models the objective as a linear function in a feature space implied by the given covariance $K$, with an arbitrary Gaussian weight prior ${\bf w} \sim ormdist ({\bf 0},{\bf I})$. In many practical applications there is data available that has a similar (covariance) structure to the objective, but which, having different form, cannot be used directly in standard transfer learning. In this paper we show how such auxiliary data may be used to construct a GP covariance corresponding to a more appropriate weight prior for the objective function. Building on this, we show that we may accelerate BO by modeling the objective function using this (learned) weight prior, which we demonstrate on both test functions and a practical application to short-polymer fibre manufacture.

Sparse Orthogonal Variational Inference for Gaussian Processes

Wed, 03 Jun 2020 00:00:00 +0000

We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.

Decentralized Multi-player Multi-armed Bandits with No Collision Information

Wed, 03 Jun 2020 00:00:00 +0000

The decentralized stochastic multi-player multi-armed bandit (MP-MAB) problem, where the collision information is not available to the players, is studied in this paper. Building on the seminal work of Boursier and Perchet (2019), we propose error correction synchronization involving communication (EC-SIC), whose regret is shown to approach that of the centralized stochastic MP-MAB with collision information. By recognizing that the communication phase without collision information corresponds to the Z-channel model in information theory, the proposed EC-SIC algorithm applies optimal error correction coding for the communication of reward statistics. A fixed message length, as opposed to the logarithmically growing one in Boursier and Perchet (2019), also plays a crucial role in controlling the communication loss. Experiments with practical Z-channel codes, such as repetition code, flip code and modified Hamming code, demonstrate the superiority of EC-SIC in both synthetic and real-world datasets.

Differentiable Feature Selection by Discrete Relaxation

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we introduce Differentiable Feature Selection, a gradient-based search algorithm for feature selection. Our approach extends a recent result on the estimation of learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e. in mini-batches) and in linear time and space with respect to both the number of features D and the sample size N. This, along with a discrete-to-continuous relaxation of the search domain, allows for an efficient, gradient-based search algorithm among feature subsets for very large datasets. Our algorithm utilizes higher-order correlations between features and targets for both the N>D and N

General Identification of Dynamic Treatment Regimes Under Interference

Wed, 03 Jun 2020 00:00:00 +0000

In many applied fields, researchers are ofteninterested in tailoring treatments to unit-levelcharacteristics in order to optimize an outcomeof interest. Methods for identifying andestimating treatment policies are the subjectof the dynamic treatment regime literature. Separately, in many settings the assumptionthat data are independent and identically distributeddoes not hold due to inter-subjectdependence. The phenomenon where a subject’s outcome is dependent on his neighbor’s exposure is known as interference. These areasintersect in myriad real-world settings. Inthis paper we consider the problem of identifyingoptimal treatment policies in the presenceof interference. Using a general representationof interference, via Lauritzen-Wermuth-Freydenburg chain graphs (Lauritzen andRichardson, 2002), we formalize a variety ofpolicy interventions under interference andextend existing identification theory (Tian,2008; Sherman and Shpitser, 2018). Finally, we illustrate the efficacy of policy maximization under interference in a simulation study.

Learning spectrograms with convolutional spectral kernels

Wed, 03 Jun 2020 00:00:00 +0000

We introduce the convolutional spectral kernel (CSK), a novel family of non-stationary, nonparametric covariance kernels for Gaussian process (GP) models, derived from the convolution between two imaginary radial basis functions. We present a principled framework to interpret CSK, as well as other deep probabilistic models, using approximated Fourier transform, yielding a concise representation of input-frequency spectrogram. Observing through the lens of the spectrogram, we provide insight on the interpretability of deep models. We then infer the functional hyperparameters using scalable variational and MCMC methods. On small- and medium-sized spatiotemporal datasets, we demonstrate improved generalization of GP models when equipped with CSK, and their capability to extract non-stationary periodic patterns.

Private k-Means Clustering with Stability Assumptions

Wed, 03 Jun 2020 00:00:00 +0000

We study the problem of differentially private clustering under input-stability assumptions. Despite the ever-growing volume of works on differential privacy in general and differentially private clustering in particular, only three works (Nissim et al., 2007; Wang et al., 2015; Huang and Liu, 2018) looked at the problem of privately clustering "nice" k-means instances, all three relying on the sample-and-aggregate framework and all three measuring utility in terms of Wasserstein distance between the true cluster centers and the centers returned by the private algorithm. In this work we improve upon this line of works on multiple axes. We present a simpler algorithm for clustering stable inputs (not relying on the sample-and-aggregate framework), and analyze its utility in both the Wasserstein distance and the k-means cost. Moreover, our algorithm has straight-forward analogues for "nice" k-median instances and for the local-model of differential privacy.

A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option

Wed, 03 Jun 2020 00:00:00 +0000

We consider a sequential decision-making problem where an agent can take one action at a time and each action has a stochastic temporal extent, i.e., a new action cannot be taken until the previous one is finished. Upon completion, the chosen action yields a stochastic reward. The agent seeks to maximize its cumulative reward over a finite time budget, with the option of "giving up" on a current action — hence forfeiting any reward – in order to choose another action. We cast this problem as a variant of the stochastic multi-armed bandits problem with stochastic consumption of resource. For this problem, we first establish that the optimal arm is the one that maximizes the ratio of the expected reward of the arm to the expected waiting time before the agent sees the reward due to pulling that arm. Using a novel upper confidence bound on this ratio, we then introduce an upper confidence based-algorithm, WAIT-UCB, for which we establish logarithmic, problem-dependent regret bound which has an improved dependence on problem parameters compared to previous works. Simulations on various problem configurations comparing WAIT-UCB against the state-of-the-art algorithms are also presented.

Learning piecewise Lipschitz functions in changing environments

Wed, 03 Jun 2020 00:00:00 +0000

Optimization in the presence of sharp (non-Lipschitz), unpredictable (w.r.t. time and amount) changes is a challenging and largely unexplored problem of great significance. We consider the class of piecewise Lipschitz functions, which is the most general online setting considered in the literature for the problem, and arises naturally in various combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of ‘static’ regret minimizes the gap between the payoff accumulated and that of the best fixed point for the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to $s$ environment {\it shifts}. In this work we provide an $O(\sqrt{sdT\log T}+sT^{1-\beta})$ regret bound for $\beta$-dispersed functions, where $\beta$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $\beta\ge1/2$ in problems of practical interest \cite{2019arXiv190409014B,balcan2018dispersion}). We also present a lower bound tight up to sub-logarithmic factors. We further obtain improved bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems on popular benchmarks.

Fixed-confidence guarantees for Bayesian best-arm identification

Wed, 03 Jun 2020 00:00:00 +0000

We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which disposes of the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T3C when coupled with a very natural Bayesian stopping rule, for bandits with Gaussian rewards, solving one of the open questions raised by Russo (2016). We also provide new posterior convergence results for TTTS under two models that are commonly used in practice: bandits with Gaussian and Bernoulli rewards and conjugate priors.

Choosing the Sample with Lowest Loss makes SGD Robust

Wed, 03 Jun 2020 00:00:00 +0000

The presence of outliers can potentially significantly skew the parameters of machine learning models trained via stochastic gradient descent (SGD). In this paper we propose a simple variant of the simple SGD method: in each step, first choose a set of k samples, then from these choose the one with the smallest current loss, and do an SGD-like update with this chosen sample. Vanilla SGD corresponds to $k=1$, i.e. no choice; $k>=2$ represents a new algorithm that is however effectively minimizing a non-convex surrogate loss. Our main contribution is a theoretical analysis of the robustness properties of this idea for ML problems which are sums of convex losses; these are backed up with synthetic and small-scale neural network experiments.

A single algorithm for both restless and rested rotting bandits

Wed, 03 Jun 2020 00:00:00 +0000

In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated to the available actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decrease over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandit perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandit, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.

Mixed Strategies for Robust Optimization of Unknown Objectives

Wed, 03 Jun 2020 00:00:00 +0000

We consider robust optimization problems, where the goal is to optimize an unknown objective function against the worst-case realization of an uncertain parameter. For this setting, we design a novel sample-efficient algorithm GP-MRO, which sequentially learns about the unknown objective from noisy point evaluations. GP-MRO seeks to discover a robust and randomized mixed strategy, that maximizes the worst-case expected objective value. To achieve this, it combines techniques from online learning with nonparametric confidence bounds from Gaussian processes. Our theoretical results characterize the number of samples required by GP-MRO to discover a robust near-optimal mixed strategy for different GP kernels of interest. We experimentally demonstrate the performance of our algorithm on synthetic datasets and on human-assisted trajectory planning tasks for autonomous vehicles. In our simulations, we show that robust deterministic strategies can be overly conservative, while the mixed strategies found by GP-MRO significantly improve the overall performance.

Kernel Conditional Density Operators

Wed, 03 Jun 2020 00:00:00 +0000

We introduce a novel conditional density estimationmodel termed the conditional densityoperator (CDO). It naturally captures multivariate,multimodal output densities andshows performance that is competitive withrecent neural conditional density models andGaussian processes. The proposed model isbased on a novel approach to the reconstructionof probability densities from their kernelmean embeddings by drawing connections toestimation of Radon-Nikodym derivatives inthe reproducing kernel Hilbert space (RKHS).We prove finite sample bounds for the estimationerror in a standard density reconstructionscenario, independent of problem dimensionality.Interestingly, when a kernel is used thatis also a probability density, the CDO allowsus to both evaluate and sample the outputdensity efficiently. We demonstrate the versatilityand performance of the proposed modelon both synthetic and real-world data.

Thompson Sampling for Linearly Constrained Bandits

Wed, 03 Jun 2020 00:00:00 +0000

We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O( sqrt( T ) ) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O( log ( T ) ). We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.

Unconditional Coresets for Regularized Loss Minimization

Wed, 03 Jun 2020 00:00:00 +0000

We design and mathematically analyze sampling-based algorithms for regularized loss minimization problems that are implementable in popular computational models for large data, in which the access to the data is restricted in some way. Our main result is that if the regularizer’s effect does not become negligible as the norm of the hypothesis scales, and as the data scales, then a uniform sample of modest size is with high probability a coreset. In the case that the loss function is either logistic regression or soft-margin support vector machines, and the regularizer is one of the common recommended choices, this result implies that a uniform sample of size $O(d \sqrt{n})$ is with high probability a coreset of $n$ points in $\Re^d$. We contrast this upper bound with two lower bounds. The first lower bound shows that our analysis of uniform sampling is tight; that is, a smaller uniform sample will likely not be a core set. The second lower bound shows that in some sense uniform sampling is close to optimal, as significantly smaller core sets do not generally exist.

Minimax Rank-$1$ Matrix Factorization

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of recovering a rank-one matrix when a perturbed subset of its entries is revealed. We propose a method based on least squares in the log-space and show its performance matches the lower bounds that we derive for this problem in the small-perturbation regime, which are related to the spectral gap of a graph representing the revealed entries. Unfortunately, we show that for larger disturbances, potentially exponentially growing errors are unavoidable for any consistent recovery method. We then propose a second algorithm relying on encoding the matrix factorization in the stationary distribution of a certain Markov chain. We show that, under the stronger assumption of known upper and lower bounds on the entries of the true matrix, this second method does not have exponential error growth for large disturbances. Both algorithms can be implemented in nearly linear time.

On Maximization of Weakly Modular Functions: Guarantees of Multi-stage Algorithms, Tractability, and Hardness

Wed, 03 Jun 2020 00:00:00 +0000

Maximization of {\it non-submodular} functions appears in various scenarios, and many previous works studied it based on some measures that quantify the closeness to being submodular. On the other hand, some practical non-submodular functions are actually close to being {\it modular}, which has been utilized in few studies. In this paper, we study cardinality-constrained maximization of {\it weakly modular} functions, whose closeness to being modular is measured by {\it submodularity} and {\it supermodularity ratios}, and reveal what we can and cannot do by using the weak modularity. We first show that guarantees of multi-stage algorithms can be proved with the weak modularity, which generalize and improve some existing results, and experiments confirm their effectiveness. We then show that weakly modular maximization is {\it fixed-parameter tractable} under certain conditions; as a byproduct, we provide a new time–accuracy trade-off for $\ell_0$-constrained minimization. We finally prove that, even if objective functions are weakly modular, no polynomial-time algorithms can improve the existing approximation guarantee achieved by the greedy algorithm in general.

Guarantees of Stochastic Greedy Algorithms for Non-monotone Submodular Maximization with Cardinality Constraint

Wed, 03 Jun 2020 00:00:00 +0000

Submodular maximization with a cardinality constraint can model various problems, and those problems are often very large in practice. For the case where objective functions are monotone, many fast approximation algorithms have been developed. The stochastic greedy algorithm (SG) is one such algorithm, which is widely used thanks to its simplicity, efficiency, and high empirical performance. However, its approximation guarantee has been proved only for monotone objective functions. When it comes to non-monotone objective functions, existing approximation algorithms are inefficient relative to the fast algorithms developed for the case of monotone objectives. In this paper, we prove that SG (with slight modification) can achieve almost $1/4$-approximation guarantees in expectation in linear time even if objective functions are non-monotone. Our result provides a constant-factor approximation algorithm with the fewest oracle queries for non-monotone submodular maximization with a cardinality constraint. Experiments validate the performance of (modified) SG.

Variational Integrator Networks for Physically Structured Embeddings

Wed, 03 Jun 2020 00:00:00 +0000

Learning workable representations of dynamical systems is becoming an increasingly important problem in a number of application areas. By leveraging recent work connecting deep neural networks to systems of differential equations, we propose \emph{variational integrator networks}, a class of neural network architectures designed to preserve the geometric structure of physical systems. This class of network architectures facilitates accurate long-term prediction, interpretability, and data-efficient learning, while still remaining highly flexible and capable of modeling complex behavior. We demonstrate that they can accurately learn dynamical systems from both noisy observations in phase space and from image pixels within which the unknown dynamics are embedded.

Online Continuous DR-Submodular Maximization with Long-Term Budget Constraints

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we study a class of online optimization problems with long-term budget constraints where the objective functions are not necessarily concave (nor convex), but they instead satisfy the Diminishing Returns (DR) property. In this online setting, a sequence of monotone DR-submodular objective functions and linear budget functions arrive over time and assuming a limited total budget, the goal is to take actions at each time, before observing the utility and budget function arriving at that round, to achieve sub-linear regret bound while the total budget violation is sub-linear as well. Prior work has shown that achieving sub-linear regret and total budget violation simultaneously is impossible if the utility and budget functions are chosen adversarially. Therefore, we modify the notion of regret by comparing the agent against the best fixed decision in hindsight which satisfies the budget constraint proportionally over any window of length $W$. We propose the Online Saddle Point Hybrid Gradient (OSPHG) algorithm to solve this class of online problems. For $W=T$, we recover the aforementioned impossibility result. However, if $W$ is sub-linear in $T$, we show that it is possible to obtain sub-linear bounds for both the regret and the total budget violation.

The Fast Loaded Dice Roller: A Near-Optimal Exact Sampler for Discrete Probability Distributions

Wed, 03 Jun 2020 00:00:00 +0000

This paper introduces a new algorithm for the fundamental problem of generating a random integer from a discrete probability distribution using a source of independent and unbiased random coin flips. We prove that this algorithm, which we call the Fast Loaded Dice Roller (FLDR), is highly efficient in both space and time: (i) the size of the sampler is guaranteed to be linear in the number of bits needed to encode the input distribution; and (ii) the expected number of bits of entropy it consumes per sample is at most 6 bits more than the information-theoretically optimal rate. We present fast implementations of the linear-time preprocessing and near-optimal sampling algorithms using unsigned integer arithmetic. Empirical evaluations on a broad set of probability distributions establish that FLDR is 2x-10x faster in both preprocessing and sampling than multiple baseline algorithms, including the widely-used alias and interval samplers. It also uses up to 10000x less space than the information-theoretically optimal sampler, at the expense of less than 1.5x runtime overhead.

An Asymptotic Rate for the LASSO Loss

Wed, 03 Jun 2020 00:00:00 +0000

The LASSO is a well-studied method for use in high-dimensional linear regression where one wishes to recover a sparse vector b from noisy observations y measured through a n-by-p matrix X with the model y = Xb + w where w is a vector of independent, mean-zero noise. We study the linear asymptotic regime where the under sampling ratio, n/p, approaches a constant greater than 0 in the limit.Using a carefully constructed approximate message passing (AMP) algorithm that converges to the LASSO estimator and recent finite sample theoretical performance guarantees for AMP, we provide large deviations bounds between various measures of LASSO loss and their concentrating values predicted by the AMP state evolution that shows exponentially fast convergence (in n) when the measurement matrix X is i.i.d. Gaussian. This work refines previous asymptotic analysis of LASSO loss in [Bayati and Montanari, 2012].

Conditional Importance Sampling for Off-Policy Learning

Wed, 03 Jun 2020 00:00:00 +0000

The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.

Adaptive Trade-Offs in Off-Policy Learning

Wed, 03 Jun 2020 00:00:00 +0000

A great variety of off-policy learning algorithms exist in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms, and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate. This leads to new perspectives on existing methods, and also naturally yields novel algorithms for off-policy evaluation and control. We develop one such algorithm, C-trace, demonstrating that it is able to more efficiently make these trade-offs than existing methods in use, and that it can be scaled to yield state-of-the-art performance in large-scale environments.

Optimal Approximation of Doubly Stochastic Matrices

Wed, 03 Jun 2020 00:00:00 +0000

We consider the least-squares approximation of a matrix C in the set of doubly stochastic matrices with the same sparsity pattern as C. Our approach is based on applying the well-known Alternating Direction Method of Multipliers (ADMM) to a reformulation of the original problem. Our resulting algorithm requires an initial Cholesky factorization of a positive definite matrix that has the same sparsity pattern as C + I followed by simple iterations whose complexity is linear in the number of nonzeros in C, thus ensuring excellent scalability and speed. We demonstrate the advantages of our approach in a series of experiments on problems with up to 82 million nonzeros; these include normalizing large scale matrices arising from the 3D structure of the human genome, clustering applications, and the SuiteSparse matrix library. Overall, our experiments illustrate the outstanding scalability of our algorithm; matrices with millions of nonzeros can be approximated in a few seconds on modest desktop computing hardware.

Post-Estimation Smoothing: A Simple Baseline for Learning with Side Information

Wed, 03 Jun 2020 00:00:00 +0000

Observational data are often accompanied by natural structural indices, such as time stamps or geographic locations, which are meaningful to prediction tasks but are often discarded. We leverage semantically meaningful indexing data while ensuring robustness to potentially uninformative or misleading indices. We propose a post-estimation smoothing operator as a fast and effective method for incorporating structural index data into prediction. Because the smoothing step is separate from the original predictor, it applies to a broad class of machine learning tasks, with no need to retrain models. Our theoretical analysis details simple conditions under which post-estimation smoothing will improve accuracy over that of the original predictor. Our experiments on large scale spatial and temporal datasets highlight the speed and accuracy of post-estimation smoothing in practice. Together, these results illuminate a novel way to consider and incorporate the natural structure of index variables in machine learning.

Guaranteed Validity for Empirical Approaches to Adaptive Data Analysis

Wed, 03 Jun 2020 00:00:00 +0000

We design a general framework for answering adaptive statistical queries that focuses on providing explicit confidence intervals along with point estimates. Prior work in this area has either focused on providing tight confidence intervals for specific analyses, or providing general worst-case bounds for point estimates. Unfortunately, as we observe, these worst-case bounds are loose in many settings — often not even beating simple baselines like sample splitting. Our main contribution is to design a framework for providing valid, instance-specific confidence intervals for point estimates that can be generated by heuristics. When paired with good heuristics, this method gives guarantees that are orders of magnitude better than the best worst-case bounds. We provide a Python library implementing our method.

Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness

Wed, 03 Jun 2020 00:00:00 +0000

The exploding and vanishing gradient problem has been the major conceptual principle behind most architecture and training improvements in recurrent neural networks (RNNs) during the last decade. In this paper, we argue that this principle, while powerful, might need some refinement to explain recent developments. We refine the concept of exploding gradients by reformulating the problem in terms of the cost function smoothness, which gives insight into higher-order derivatives and the existence of regions with many close local minima. We also clarify the distinction between vanishing gradients and the need for the RNN to learn attractors to fully use its expressive power. Through the lens of these refinements, we shed new light on recent developments in the RNN field, namely stable RNN and unitary (or orthogonal) RNNs.

Prediction Focused Topic Models via Feature Selection

Wed, 03 Jun 2020 00:00:00 +0000

Supervised topic models are often sought to balance prediction quality and interpretability. However, when models are (inevitably) misspecified, standard approaches rarely deliver on both. We introduce a novel approach, the prediction-focused topic model, that uses the supervisory signal to retain only vocabulary terms that improve, or at least do not hinder, prediction performance. By removing terms with irrelevant signal, the topic model is able to learn task-relevant, coherent topics. We demonstrate on several data sets that compared to existing approaches, prediction-focused topic models learn much more coherent topics while maintaining competitive predictions.

FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization

Wed, 03 Jun 2020 00:00:00 +0000

Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systems-oriented challenges which include (i) communication bottleneck since a large number of devices upload their local updates to a parameter server, and (ii) scalability as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation where only a fraction of devices participate in each round of the training; and (3) quantized message-passing where the edge nodes quantize their updates before uploading to the parameter server. These features address the communications and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions and empirically demonstrate the communication-computation tradeoff provided by our method.

Truly Batch Model-Free Inverse Reinforcement Learning about Multiple Intentions

Wed, 03 Jun 2020 00:00:00 +0000

We consider Inverse Reinforcement Learning (IRL) about multiple intentions, \ie the problem of estimating the unknown reward functions optimized by a group of experts that demonstrate optimal behaviors. Most of the existing algorithms either require access to a model of the environment or need to repeatedly compute the optimal policies for the hypothesized rewards. However, these requirements are rarely met in real-world applications, in which interacting with the environment can be expensive or even dangerous. In this paper, we address the IRL about multiple intentions in a fully model-free and batch setting. We first cast the single IRL problem as a constrained likelihood maximization and then we use this formulation to cluster agents based on the likelihood of the assignment. In this way, we can efficiently solve, without interactions with the environment, both the IRL and the clustering problem. Finally, we evaluate the proposed methodology on simulated domains and on a real-world social-network application.

Tensorized Random Projections

Wed, 03 Jun 2020 00:00:00 +0000

We introduce a novel random projection technique for efficiently reducing the dimension of very high-dimensional tensors. Building upon classical results on Gaussian random projections and Johnson-Lindenstrauss transforms (JLT), we propose two tensorized random projection maps relying on the tensor train (TT) and CP decomposition format, respectively. The two maps offer very low memory requirements and can be applied efficiently when the inputs are low rank tensors given in the CP or TT format.Our theoretical analysis shows that the dense Gaussian matrix in JLT can be replaced by a low-rank tensor implicitly represented in compressed form with random factors, while still approximately preserving the Euclidean distance of the projected inputs. In addition, our results reveal that the TT format is substantially superior to CP in terms of the size of the random projection needed to achieve the same distortion ratio. Experiments on synthetic data validate our theoretical analysis and demonstrate the superiority of the TT decomposition.

Importance Sampling via Local Sensitivity

Wed, 03 Jun 2020 00:00:00 +0000

Given a loss function $F:\mathcal{X} \rightarrow \R^+$ that can be written as the sum of losses over a large set of inputs $a_1,\ldots, a_n$, it is often desirable to approximate $F$ by subsampling the input points. Strong theoretical guarantees require taking into account the importance of each point, measured by how much its individual loss contributes to $F(x)$. Maximizing this importance over all $x \in \mathcal{X}$ yields the \emph{sensitivity score} of $a_i$. Sampling with probabilities proportional to these scores gives strong guarantees, allowing one to approximately minimize of $F$ using just the subsampled points.Unfortunately, sensitivity sampling is difficult to apply since (1) it is unclear how to efficiently compute the sensitivity scores and (2) the sample size required is often impractically large. To overcome both obstacles we introduce \emph{local sensitivity}, which measures data point importance in a ball around some center $x_0$. We show that the local sensitivity can be efficiently estimated using the \emph{leverage scores} of a quadratic approximation to $F$ and that the sample size required to approximate $F$ around $x_0$ can be bounded. We propose employing local sensitivity sampling in an iterative optimization method and analyze its convergence when $F$ is smooth and convex.

How fine can fine-tuning be? Learning efficient language models

Wed, 03 Jun 2020 00:00:00 +0000

State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning only introduces \emph{small} differences from the pre-trained model in the parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it suffices to fine-tune only the most critical layers. Further, we find that there are surprisingly many \emph{good} solutions in the set of sparsified versions of the pre-trained model. As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero, saving both task-specific parameter storage and computational cost.

Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions

Wed, 03 Jun 2020 00:00:00 +0000

We study the problem of out-of-sample risk estimation in the high dimensional regime where both the sample size $n$ and number of features $p$ are large, and $n/p$ can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as $n,p \rightarrow \infty$, even when the dimension $p$ of the feature vectors is comparable with or greater than the sample size $n$. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.

Learning Dynamic and Personalized Comorbidity Networks from Event Data using Deep Diffusion Processes

Wed, 03 Jun 2020 00:00:00 +0000

Comorbid diseases co-occur and progress via complex temporal patterns that vary among individuals. In electronic medical records, we only observe onsets of diseases, but not their triggering comorbidities — i.e., the mechanisms underlying temporal relations between diseases need to be inferred. Learning such temporal patterns from event data is crucial for understanding disease pathology and predicting prognoses. To this end, we develop deep diffusion processes (DDP) to model ’dynamic comorbidity networks’, i.e., the temporal relationships between comorbid disease onsets expressed through a dynamic graph. A DDP comprises events modelled as a multi-dimensional point process, with an intensity function parameterized by the edges of a dynamic weighted graph. The graph structure is modulated by a neural network that maps patient history to edge weights, enabling rich temporal representations for disease trajectories. The DDP parameters decouple into clinically meaningful components, which enables serving the dual purpose of accurate risk prediction and intelligible representation of disease pathology. We illustrate these features in experiments using cancer registry data.

A Robust Univariate Mean Estimator is All You Need

Wed, 03 Jun 2020 00:00:00 +0000

We study the problem of designing estimators when the data has heavy-tails and is corrupted by outliers. In such an adversarial setup, we aim to design statistically optimal estimators for flexible non-parametric distribution classes such as distributions with bounded-2k moments and symmetric distributions. Our primary workhorse is a conceptually simple reduction from multivariate estimation to univariate estimation. Using this reduction, we design estimators which are optimal in both heavy-tailed and contaminated settings. Our estimators achieve an optimal dimension independent bias in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample complexity. These results provide some of the first such estimators for a broad range of problems including Mean Estimation, Sparse Mean Estimation, Covariance Estimation, Sparse Covariance Estimation and Sparse PCA.

GAIT: A Geometric Approach to Information Theory

Wed, 03 Jun 2020 00:00:00 +0000

We advocate the use of a notion of entropy that reflects the relative abundances of the symbols in an alphabet, as well as the similarities between them. This concept was originally introduced in theoretical ecology to study the diversity of ecosystems. Based on this notion of entropy, we introduce geometry-aware counterparts for several concepts and theorems in information theory. Notably, our proposed divergence exhibits performance on par with state-of-the-art methods based on the Wasserstein distance, but enjoys a closed-form expression that can be computed efficiently. We demonstrate the versatility of our method via experiments on a broad range of domains: training generative models, computing image barycenters, approximating empirical measures and counting modes.

Adversarial Robustness of Flow-Based Generative Models

Wed, 03 Jun 2020 00:00:00 +0000

Flow-based generative models leverage invertible generator functions to fit a distribution to the training data using maximum likelihood. Despite their use in several application domains, robustness of these models to adversarial attacks has hardly been explored. In this paper, we study adversarial robustness of flow-based generative models both theoretically (for some simple models) and empirically (for more complex ones). First, we consider a linear flow-based generative model and compute optimal sample-specific and universal adversarial perturbations that maximally decrease the likelihood scores. Using this result, we study the robustness of the well-known adversarial training procedure, where we characterize the fundamental trade-off between model robustness and accuracy. Next, we empirically study the robustness of two prominent deep, non-linear, flow-based generative models, namely GLOW and RealNVP. We design two types of adversarial attacks; one that minimizes the likelihood scores of in-distribution samples, while the other that maximizes the likelihood scores of out-of-distribution ones. We find that GLOW and RealNVP are extremely sensitive to both types of attacks. Finally, using a hybrid adversarial training procedure, we significantly boost the robustness of these generative models.

A principled approach for generating adversarial images under non-smooth dissimilarity metrics

Wed, 03 Jun 2020 00:00:00 +0000

Deep neural networks perform well on real world data but are prone to adversarial perturbations: small changes in the input easily lead to misclassification. In this work, we propose an attack methodology not only for cases where the perturbations are measured by Lp norms, but in fact any adversarial dissimilarity metric with a closed proximal form. This includes, but is not limited to, L1, L2, and L-infinity perturbations; the L0 counting "norm" (i.e. true sparseness); and the total variation seminorm, which is a (Lp) convolutional dissimilarity measuring local pixel changes. Our approach is a natural extension of a recent adversarial attack method, and eliminates the differentiability requirement of the metric. We demonstrate our algorithm, ProxLogBarrier, on the MNIST, CIFAR10, and ImageNet-1k datasets. We consider undefended and defended models, and show that our algorithm easily transfers to various datasets. We observe that ProxLogBarrier outperforms a host of modern adversarial attacks specialized for the L0 case. Moreover, by altering images in the total variation seminorm, we shed light on a new class of perturbations that exploit neighboring pixel information.

Deterministic Decoding for Discrete Data in Variational Autoencoders

Wed, 03 Jun 2020 00:00:00 +0000

Variational autoencoders are prominent generative models for modeling discrete data. However, with flexible decoders, they tend to ignore the latent codes. In this paper, we study a VAE model with a deterministic decoder (DD-VAE) for sequential data that selects the highest-scoring tokens instead of sampling. Deterministic decoding solely relies on latent codes as the only way to produce diverse objects, which improves the structure of the learned manifold. To implement DD-VAE, we propose a new class of bounded support proposal distributions and derive Kullback-Leibler divergence for Gaussian and uniform priors. We also study a continuous relaxation of deterministic decoding objective function and analyze the relation of reconstruction accuracy and relaxation parameters. We demonstrate the performance of DD-VAE on multiple datasets, including molecular generation and optimization problems.

Sparse Hilbert-Schmidt Independence Criterion Regression

Wed, 03 Jun 2020 00:00:00 +0000

Feature selection is a fundamental problem for machine learning and statistics, and it has been widely studied over the past decades. However, the majority of feature selection algorithms are based on linear models, and the nonlinear feature selection problem has not been well studied compared to linear models, in particular for the high-dimensional case. In this paper, we propose the sparse Hilbert–Schmidt Independence Criterion (SpHSIC) regression, which is a versatile nonlinear feature selection algorithm based on the HSIC and is a continuous optimization variant of the well-known minimum redundancy maximum relevance (mRMR) feature selection algorithm. More specifically, the SpHSIC consists of two parts: the convex HSIC loss function on the one hand and the regularization term on the other hand, where we consider the Lasso, Bridge, MCP, and SCAD penalties. We prove that the sparsity based HSIC regression estimator satisfies the oracle property; that is, the sparsity-based estimator recovers the true underlying sparse model and is asymptotically normally distributed. On the basis of synthetic and real-world experiments, we illustrate this theoretical property and highlight the fact that the proposed algorithm performs well in the high-dimensional setting.

A Deep Generative Model for Fragment-Based Molecule Generation

Wed, 03 Jun 2020 00:00:00 +0000

Molecule generation is a challenging open problem in cheminformatics. Currently, deep generative approaches addressing the challenge belong to two broad categories, differing in how molecules are represented. One approach encodes molecular graphs as strings of text, and learns their corresponding character-based language model. Another, more expressive, approach operates directly on the molecular graph. In this work, we address two limitations of the former: generation of invalid and duplicate molecules. To improve validity rates, we develop a language model for small molecular substructures called fragments, loosely inspired by the well-known paradigm of Fragment-Based Drug Design. In other words, we generate molecules fragment by fragment, instead of atom by atom. To improve uniqueness rates, we present a frequency-based masking strategy that helps generate molecules with infrequent fragments. We show experimentally that our model largely outperforms other language model-based competitors, reaching state-of-the-art performances typical of graph-based approaches. Moreover, generated molecules display molecular properties similar to those in the training sample, even in absence of explicit task-specific supervision.

Hamiltonian Monte Carlo Swindles

Wed, 03 Jun 2020 00:00:00 +0000

Hamiltonian Monte Carlo (HMC) is a powerful Markov chain Monte Carlo (MCMC) algorithm for estimating expectations with respect to continuous un-normalized probability distributions. MCMC estimators typically have higher variance than classical Monte Carlo with i.i.d. samples due to autocorrelations; most MCMC research tries to reduce these autocorrelations. In this work, we explore a complementary approach to variance reduction based on two classical Monte Carlo ’swindles’: first, running an auxiliary coupled chain targeting a tractable approximation to the target distribution, and using the auxiliary samples as control variates; and second, generating anti-correlated ("antithetic") samples by running two chains with flipped randomness. Both ideas have been explored previously in the context of Gibbs samplers and random-walk Metropolis algorithms, but we argue that they are ripe for adaptation to HMC in light of recent coupling results from the HMC theory literature. For many posterior distributions, we find that these swindles generate effective sample sizes orders of magnitude larger than plain HMC, as well as being more efficient than analogous swindles for Metropolis-adjusted Langevin algorithm and random-walk Metropolis.

Statistical Estimation of the Poincaré constant and Application to Sampling Multimodal Distributions

Wed, 03 Jun 2020 00:00:00 +0000

Poincaré inequalities are ubiquitous in probability and analysis and have various applications in statistics (concentration of measure, rate of convergence of Markov chains). The Poincaré constant, for which the inequality is tight, is related to the typical convergence rate of diffusions to their equilibrium measure. In this paper, we show both theoretically and experimentally that, given sufficiently many samples of a measure, we can estimate its Poincaré constant. As a by-product of the estimation of the Poincaré constant, we derive an algorithm that captures a low dimensional representation of the data by finding directions which are difficult to sample. These directions are of crucial importance for sampling or in fields like molecular dynamics, where they are called reaction coordinates. Their knowledge can leverage, with a simple conditioning step, computational bottlenecks by using importance sampling techniques.

A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning

Wed, 03 Jun 2020 00:00:00 +0000

We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with another biased one, an adapted SARAH estimator for policy optimization. The hybrid policy gradient estimator is shown to be biased, but has variance reduced property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem that allows us to handle constraints or regularizers on the policy parameters. We first propose a single-looped algorithm then introduce a more practical restarting variant. We prove that both algorithms can achieve the best-known trajectory complexity to attain a first-order stationary point for the composite problem which is better than existing REINFORCE/GPOMDP and SVRPG in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite settings indeed have some advantages compared to the non-composite ones on certain problems.

A PTAS for the Bayesian Thresholding Bandit Problem

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we study the Bayesian thresholding bandit problem (BTBP), where the goal is to adaptively make a budget of $Q$ queries to $n$ stochastic arms and determine the label for each arm (whether its mean reward is closer to $0$ or $1$). We present a polynomial-time approximation scheme for the BTBP with runtime $O(f(\epsilon) + Q)$ that achieves expected labeling accuracy at least $(\opt(Q) - \epsilon)$, where $f(\cdot)$ is a function that only depends on $\epsilon$ and $\opt(Q)$ is the optimal expected accuracy achieved by any algorithm. For any fixed $\epsilon > 0$, our algorithm runs in time linear with $Q$. The main algorithmic ideas we use include discretization employed in the PTASs for many dynamic programming algorithms (such as Knapsack), as well as many problem specific techniques such as proving an upper bound on the number of query numbers for any arm made by an almost optimal policy, and establishing the smoothness property of the $\opt(\cdot)$ curve, etc.

Stable behaviour of infinitely wide deep neural networks

Wed, 03 Jun 2020 00:00:00 +0000

We consider fully connected feed-forward deep neural networks (NNs) where weights and biases are independent and identically distributed as symmetric centered stable distributions. Then, we show that the infinite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite wide limits of NNs (Matthews at al., 2018b). Parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to expand recent lines of research that rely on Gaussian infinite wide limits.

Infinitely deep neural networks as diffusion processes

Wed, 03 Jun 2020 00:00:00 +0000

When the parameters are independently and identically distributed (initialized) neural networks exhibit undesirable properties that emerge as the number of layers increases, e.g. a vanishing dependency on the input and a concentration on restrictive families of functions including constant functions. We consider parameter distributions that shrink as the number of layers increases in order to recover well-behaved stochastic processes in the limit of infinite depth. This leads to set forth a link between infinitely deep residual networks and solutions to stochastic differential equations, i.e. diffusion processes. We show that these limiting processes do not suffer from the aforementioned issues and investigate their properties.

Linearly Convergent Frank-Wolfe with Backtracking Line-Search

Wed, 03 Jun 2020 00:00:00 +0000

Structured constraints in Machine Learning have recently brought the Frank-Wolfe (FW) family of algorithms back in the spotlight. While the classical FW algorithm has poor local convergence properties, the Away-steps and Pairwise FW variants have emerged as improved variants with faster convergence. However, these improved variants suffer from two practical limitations: they require at each iteration to solve a 1-dimensional minimization problem to set the step-size and also require the Frank-Wolfe linear subproblems to be solved exactly. In this paper we propose variants of Away-steps and Pairwise FW that lift both restrictions simultaneously. The proposed methods set the step-size based on a sufficient decrease condition, and do not require prior knowledge of the objective. Furthermore, they inherit all the favorable convergence properties of the exact line-search version, including linear convergence for strongly convex functions over polytopes. Benchmarks on different machine learning problems illustrate large performance gains of the proposed variants.

Uncertainty in Neural Networks: Approximately Bayesian Ensembling

Wed, 03 Jun 2020 00:00:00 +0000

Understanding the uncertainty of a neural network’s (NN) predictions is essential for many purposes. The Bayesian framework provides a principled approach to this, however applying it to NNs is challenging due to large numbers of parameters and data. Ensembling NNs provides an easily implementable, scalable method for uncertainty quantification, however, it has been criticised for not being Bayesian. This work proposes one modification to the usual process that we argue does result in approximate Bayesian inference; regularising parameters about values drawn from a distribution which can be set equal to the prior. A theoretical analysis of the procedure in a simplified setting suggests the recovered posterior is centred correctly but tends to have an underestimated marginal variance, and overestimated correlation. However, two conditions can lead to exact recovery. We argue that these conditions are partially present in NNs. Empirical evaluations demonstrate it has an advantage over standard ensembling, and is competitive with variational methods.

Regularity as Regularization: Smooth and Strongly Convex Brenier Potentials in Optimal Transport

Wed, 03 Jun 2020 00:00:00 +0000

Estimating Wasserstein distances between two high-dimensional densities suffers from the curse of dimensionality: one needs an exponential (wrt dimension) number of samples to ensure that the distance between two empirical measures is comparable to the distance between the original densities. Therefore, optimal transport (OT) can only be used in machine learning if it is substantially regularized. On the other hand, one of the greatest achievements of the OT literature in recent years lies in regularity theory: Caffarelli showed that the OT map between two well behaved measures is Lipschitz, or equivalently when considering 2-Wasserstein distances, that Brenier convex potentials (whose gradient yields an optimal map) are smooth. We propose in this work to draw inspiration from this theory and use regularity as a regularization tool. We give algorithms operating on two discrete measures that can recover nearly optimal transport maps with small distortion, or equivalently, nearly optimal Brenier potentials that are strongly convex and smooth. The problem boils down to solving alternatively a convex QCQP and a discrete OT problem, granting access to the values and gradients of the Brenier potential not only on sampled points, but also out of sample at the cost of solving a simpler QCQP for each evaluation. We propose algorithms to estimate and evaluate transport maps with desired regularity properties, benchmark their statistical performance, apply them to domain adaptation and visualize their action on a color transfer task.

Calibrated Prediction with Covariate Shift via Unsupervised Domain Adaptation

Wed, 03 Jun 2020 00:00:00 +0000

Reliable uncertainty estimates are an important tool for helping autonomous agents or human decision makers understand and lever-age predictive models. However, existing approaches to estimating uncertainty largely ignore the possibility of covariate shift—i.e.,where the real-world data distribution may differ from the training distribution. As a consequence, existing algorithms can overestimate certainty, possibly yielding a false sense of confidence in the predictive model. We pro-pose an algorithm for calibrating predictions that accounts for the possibility of covariate shift, given labeled examples from the train-ing distribution and unlabeled examples from the real-world distribution. Our algorithm uses importance weighting to correct for the shift from the training to the real-world distribution. However, importance weighting relies on the training and real-world distributions to be sufficiently close. Building on ideas from domain adaptation, we additionally learn a feature map that tries to equalize these two distributions. In an empirical evaluation, we show that our proposed approach outperforms existing approaches to calibrated prediction when there is covariate shift.

Unsupervised Neural Universal Denoiser for Finite-Input General-Output Noisy Channel

Wed, 03 Jun 2020 00:00:00 +0000

We devise a novel neural network-based universal denoiser for the finite-input, general-output (FIGO) channel. Based on the assumption of known noisy channel densities, which is realistic in many practical scenarios, we train the network such that it can denoise as well as the best sliding window denoiser for any given underlying clean source data. Our algorithm, dubbed as Generalized CUDE (Gen-CUDE), enjoys several desirable properties; it can be trained in an unsupervised manner (solely based on the noisy observation data), has much smaller computational complexity compared to the previously developed universal denoiser for the same setting, and has much tighter upper bound on the denoising performance, which is obtained by a theoretical analysis. In our experiments, we show such tighter upper bound is also realized in practice by showing that Gen-CUDE achieves much better denoising results compared to other strong baselines for both synthetic and real underlying clean sequences.

Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration

Wed, 03 Jun 2020 00:00:00 +0000

In many Reinforcement Learning (RL) applications, the goal is to find an optimal deterministic policy. However, most RL algorithms require the policy to be stochastic in order to avoid instabilities and perform a sufficient amount of exploration. Adjusting the level of stochasticity during the learning process is non-trivial, as it is difficult to assess whether the costs of random exploration will be repaid in the long run, and to contain the risk of instability.We study this problem in the context of policy gradients (PG) with Gaussian policies. Using tools from the safe PG literature, we design a surrogate objective for the policy variance that captures the effects this parameter has on the learning speed and on the quality of the final solution. Furthermore, we provide a way to optimize this objective that guarantees stable improvement of the original performance measure. We evaluate the proposed methods on simulated continuous control tasks.

Scalable Nonparametric Factorization for High-Order Interaction Events

Wed, 03 Jun 2020 00:00:00 +0000

Interaction events among multiple entities are ubiquitous in real-world applications. Although these interactions can be naturally represented by tensors and analyzed by tensor decomposition, most existing approaches are limited to multilinear decomposition forms, and cannot estimate complex, nonlinear relationships in data. More importantly, the existing approaches severely underexploit the time stamps information. They either drop/discretize the time stamps or set a local window to ignore the long-term dependency between the events. To address these issues, we propose a Bayesian nonparametric factorization model for high-order interaction events, which can flexibly estimate/embed the static, nonlinear relationships and capture various long-term and short-term excitations effects, encoding these effects and their decaying patterns into the latent factors. Specifically, we use the latent factors to construct a set of mutually excited Hawkes processes, where we place a Gaussian process prior over the background rates to estimate the static, nonlinear relationships of the entities and propose novel triggering kernels to embed the excitation strengths and their time decaying rates among the interactions. For scalable inference, we derive a fully-decomposed model evidence lower bound to dispose of the huge covariance matrix and expensive log summation terms. Then we develop an efficient stochastic optimization algorithm. We show the advantage of our approach in four real-world applications.

Interpretable Companions for Black-Box Models

Wed, 03 Jun 2020 00:00:00 +0000

We present an interpretable companion model for any pre-trained black-box classifiers. The idea is that for any input, a user can decide to either receive a prediction from the black-box model, with high accuracy but no explanations, or employ a \emph{companion rule} to obtain an interpretable prediction with slightly lower accuracy. The companion model is trained from data and the predictions of the black-box model, with the objective combining area under the transparency–accuracy curve and model complexity. Our model provides flexible choices for practitioners who face the dilemma of choosing between always using interpretable models and always using black-box models for a predictive task, so users can, for any given input, take a step back to resort to an interpretable prediction if they find the predictive performance satisfying, or stick to the black-box model if the rules are unsatisfying. To show the value of companion models, we design a human evaluation on more than a hundred people to investigate the tolerable accuracy loss to gain interpretability for humans.

DYNOTEARS: Structure Learning from Time-Series Data

Wed, 03 Jun 2020 00:00:00 +0000

We revisit the structure learning problem for dynamic Bayesian networks and propose a method that simultaneously estimates contemporaneous (intra-slice) and time-lagged (inter-slice) relationships between variables in a time-series. Our approach is score-based, and revolves around minimizing a penalized loss subject to an acyclicity constraint. To solve this problem, we leverage a recent algebraic result characterizing the acyclicity constraint as a smooth equality constraint. The resulting algorithm, which we call DYNOTEARS, outperforms other methods on simulated data, especially in high-dimensions as the number of variables increases. We also apply this algorithm on real datasets from two different domains, finance and molecular biology, and analyze the resulting output. Compared to state-of-the-art methods for learning dynamic Bayesian networks, our method is both scalable and accurate on real data. The simple formulation and competitive performance of our method make it suitable for a variety of problems where one seeks to learn connections between variables across time.

Characterization of Overlap in Observational Studies

Wed, 03 Jun 2020 00:00:00 +0000

Overlap between treatment groups is required for non-parametric estimation of causal effects. If a subgroup of subjects always receives the same intervention, we cannot estimate the effect of intervention changes on that subgroup without further assumptions. When overlap does not hold globally, characterizing local regions of overlap can inform the relevance of causal conclusions for new subjects, and can help guide additional data collection. To have impact, these descriptions must be interpretable for downstream users who are not machine learning experts, such as policy makers. We formalize overlap estimation as a problem of finding minimum volume sets subject to coverage constraints and reduce this problem to binary classification with Boolean rule classifiers. We then generalize this method to estimate overlap in off-policy policy evaluation. In several real-world applications, we demonstrate that these rules have comparable accuracy to black-box estimators and provide intuitive and informative explanations that can inform policy making.

ASAP: Architecture Search, Anneal and Prune

Wed, 03 Jun 2020 00:00:00 +0000

Automatic methods for Neural ArchitectureSearch (NAS) have been shown to produce state-of-the-art network models, yet, their main drawback is the computational complexity of the search process. As some primal methods optimized over a discrete search space, thousands of days of GPU were required for convergence. A recent approach is based on constructing a differentiable search space that enables gradient-based optimization, thus reducing the search time to a few days. While successful, such methods still include some incontinuous steps, e.g., the pruning of many weak connections at once. In this paper, we propose a differentiable search space that allows the annealing of architecture weights, while gradually pruning inferior operations, thus the search converges to a single output network in a continuous manner. Experiments on several vision datasets demonstrate the effectiveness of our method with respect to the search cost, accuracy and the memory footprint of the achieved model.

Permutation Invariant Graph Generation via Score-Based Generative Modeling

Wed, 03 Jun 2020 00:00:00 +0000

Learning generative models for graph-structured data is challenging because graphs are discrete, combinatorial, and the underlying data distribution is invariant to the ordering of nodes. However, most of the existing generative models for graphs are not invariant to the chosen ordering, which might lead to an undesirable bias in the learned distribution. To address this difficulty, we propose a permutation invariant approach to modeling graphs, using the recent framework of score-based generative modeling. In particular, we design a permutation equivariant, multi-channel graph neural network to model the gradient of the data distribution at the input graph (a.k.a., the score function). This permutation equivariant model of gradients implicitly defines a permutation invariant distribution for graphs. We train this graph neural network with score matching and sample from it with annealed Langevin dynamics. In our experiments, we first demonstrate the capacity of this new architecture in learning discrete graph algorithms. For graph generation, we find that our learning approach achieves better or comparable results to existing models on benchmark datasets.

Functional Gradient Boosting for Learning Residual-like Networks with Statistical Guarantees

Wed, 03 Jun 2020 00:00:00 +0000

Recently, several studies have proposed progressive or sequential layer-wise training methods based on the boosting theory for deep neural networks. However, most studies lack the global convergence guarantees or require weak learning conditions that can be verified a posteriori after running methods. Moreover, generalization bounds usually have a worse dependence on network depth. In this paper, to resolve these problems, we propose a new functional gradient boosting for learning deep residual-like networks in a layer-wise fashion with its statistical guarantees on multi-class classification tasks. In the proposed method, each residual block is recognized as a functional gradient (i.e., weak learner), and the functional gradient step is performed by stacking it on the network, resulting in a strong optimization ability. In the theoretical analysis, we show the global convergence of the method under a standard margin assumption on a data distribution instead of a weak learning condition, and we eliminate a worse dependence on the network depth in a generalization bound via a fine-grained convergence analysis. %, unlike existing studies. Moreover, we show that the existence of a learnable function with a large margin on a training dataset significantly improves a generalization bound. Finally, we experimentally demonstrate that our proposed method is certainly useful for learning deep residual networks.

Contextual Combinatorial Volatile Multi-armed Bandit with Adaptive Discretization

Wed, 03 Jun 2020 00:00:00 +0000

We consider contextual combinatorial volatile multi-armed bandit (CCV-MAB), in which at each round, the learner observes a set of available base arms and their contexts, and then, selects a super arm that contains $K$ base arms in order to maximize its cumulative reward. Under the semi-bandit feedback setting and assuming that the contexts lie in a space ${\cal X}$ endowed with the Euclidean norm and that the expected base arm outcomes (expected rewards) are Lipschitz continuous in the contexts (expected base arm outcomes), we propose an algorithm called Adaptive Contextual Combinatorial Upper Confidence Bound (ACC-UCB). This algorithm, which adaptively discretizes ${\cal X}$ to form estimates of base arm outcomes and uses an $\alpha$-approximation oracle as a subroutine to select a super arm in each round, achieves $\tilde{O} ( T^{(\bar{D}+1)/(\bar{D}+2) + \epsilon} )$ regret for any $\epsilon>0$, where $\bar{D}$ represents the approximate optimality dimension related to ${\cal X}$. This dimension captures both the benignness of the base arm arrivals and the structure of the expected reward. In addition, we provide a recipe for obtaining more optimistic regret bounds by taking into account the volatility of the base arms and show that ACC-UCB achieves significant performance gains compared to the state-of-the-art for worker selection in mobile crowdsourcing.

Distributionally Robust Bayesian Quadrature Optimization

Wed, 03 Jun 2020 00:00:00 +0000

Bayesian quadrature optimization (BQO) maximizes the expectation of an expensive black-box integrand taken over a known probability distribution. In this work, we study BQO under distributional uncertainty in which the underlying probability distribution is unknown except for a limited set of its i.i.d samples. A standard BQO approach maximizes the Monte Carlo estimate of the true expected objective given the fixed sample set. Though Monte Carlo estimate is unbiased, it has high variance given a small set of samples; thus can result in a spurious objective function. We adopt the distributionally robust optimization perspective to this problem by maximizing the expected objective under the most adversarial distribution. In particular, we propose a novel posterior sampling based algorithm, namely distributionally robust BQO (DRBQO) for this purpose. We demonstrate the empirical effectiveness of our proposed framework in synthetic and real-world problems, and characterize its theoretical convergence via Bayesian regret.

Decentralized gradient methods: does topology matter?

Wed, 03 Jun 2020 00:00:00 +0000

Consensus-based distributed optimization methods have recently been advocated as alternatives to parameter server and ring all-reduce paradigms for large scale training of machine learning models. In this case, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by averaging the estimates obtained from its neighbors, and applying a correction on the basis of its local dataset. While theoretical results suggest that worker communication topology should have strong impact on the number of epochs needed to converge, previous experiments have shown the opposite conclusion. This paper sheds lights on this apparent contradiction and show how sparse topologies can lead to faster convergence even in the absence of communication delays.

Robust Stackelberg buyers in repeated auctions

Wed, 03 Jun 2020 00:00:00 +0000

We consider the practical and classical setting where the seller is using an exploration stage to learn the value distributions of the bidders before running a revenue-maximizing auction in a exploitation phase. In this two-stage process, we exhibit practical, simple and robust strategies with large utility uplifts for the bidders. We quantify precisely the seller revenue against non-discounted buyers, complementing recent studies that had focused on impatient/heavily discounted buyers. We also prove the robustness of these shading strategies to sample approximation error of the seller, to bidder’s approximation error of the competition and to possible change of the mechanisms.

Convergence Analysis of Block Coordinate Algorithms with Determinantal Sampling

Wed, 03 Jun 2020 00:00:00 +0000

We analyze the convergence rate of the randomized Newton-like method introduced by Qu et. al. (2016) for smooth and convex objectives, which uses random coordinate blocks of a Hessian-over-approximation matrix M instead of the true Hessian. The convergence analysis of the algorithm is challenging because of its complex dependence on the structure of M. However, we show that when the coordinate blocks are sampled with probability proportional to their determinant, the convergence rate depends solely on the eigenvalue distribution of matrix M, and has an analytically tractable form. To do so, we derive a fundamental new expectation formula for determinantal point processes. We show that determinantal sampling allows us to reason about the optimal subset size of blocks in terms of the spectrum of M. Additionally, we provide a numerical evaluation of our analysis, demonstrating cases where determinantal sampling is superior or on par with uniform sampling.

Wasserstein Style Transfer

Wed, 03 Jun 2020 00:00:00 +0000

We propose Gaussian optimal transport for image style transfer in an Encoder/Decoder framework. Optimal transport for Gaussian measures has closed forms Monge mappings from source to target distributions. Moreover, interpolating between a content and a style image can be seen as geodesics in the Wasserstein Geometry. Using this insight, we show how to mix different target styles , using Wasserstein barycenter of Gaussian measures. Since Gaussians are closed under Wasserstein barycenter, this allows us a simple style transfer and style mixing and interpolation. Moreover we show how mixing different styles can be achieved using other geodesic metrics between gaussians such as the Fisher Rao metric, while the transport of the content to the new interpolate style is still performed with Gaussian OT maps. Our simple methodology allows to generate new stylized content interpolating between many artistic styles. The metric used in the interpolation results in different stylizations. A demo is available on https: //wasserstein-transfer.github.io.

Linear predictor on linearly-generated data with missing values: non consistency and solutions

Wed, 03 Jun 2020 00:00:00 +0000

We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully observed data and we show that, in the presence of missing values, the optimal predictor is not linear in general. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and the various missing value indicators. Due to its intrinsic complexity, we study a simple approximation and prove generalization bounds with finite samples, highlighting regimes for which each method performs best. We then show that multilayer perceptrons with ReLU activation functions can be consistent, and can explore good trade-offs between the true model and approximations. Our study highlights the interesting family of models that are beneficial to fit with missing values depending on the amount of data available.

The Quantile Snapshot Scan: Comparing Quantiles of Spatial Data from Two Snapshots in Time

Wed, 03 Jun 2020 00:00:00 +0000

We introduce the Quantile Snapshot Scan (Qsnap), a spatial scan algorithm which identifies spatial regions that differ the most between two snapshots in time. Qsnap is designed for spatial data with a numeric response and a vector of associated covariates for each spatial data point. Qsnap focuses on differences involving a specific quantile of the data distribution. A naive implementation of Qsnap is too computationally expensive for large datasets but our novel incremental update provides an order of magnitude speedup. We demonstrate Qsnap’s effectiveness over an extensive set of experiments on simulated data. In addition, we apply Qsnap to two real-world problems: discovering bird migration paths and identifying regions with dramatic changes in drought conditions.

Marginal Densities, Factor Graph Duality, and High-Temperature Series Expansions

Wed, 03 Jun 2020 00:00:00 +0000

We prove that the marginal densities of a global probability mass function in aprimal normal factor graph and the corresponding marginal densities in the dual normal factor graph are related via local mappings. The mapping depends on the Fourier transform of the local factors of the models. Details of the mapping, including its fixed points, are derived for the Ising model, and then extended to the Potts model. By employing the mapping, we can transform simultaneously all the estimated marginal densities from one domain to the other, which is advantageous if estimating the marginals can be carried out more efficiently in the dual domain.An example of particular significance is the ferromagnetic Ising model in a positive external field, for which there is a rapidly mixing Markov chain (called the subgraphs-world process) to generate configurations in the dual normal factor graph of the model. Our numerical experiments illustrate that the proposed procedure can provide more accurate estimates of marginal densities in various settings.

A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach

Wed, 03 Jun 2020 00:00:00 +0000

In this paper we consider solving saddle point problems using two variants of Gradient Descent-Ascent algorithms, Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods. We show that both of these algorithms admit a unified analysis as approximations of the classical proximal point method for solving saddle point problems. This viewpoint enables us to develop a new framework for analyzing EG and OGDA for bilinear and strongly convex-strongly concave settings. Moreover, we use the proximal point approximation interpretation to generalize the results for OGDA for a wide range of parameters.

Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles

Wed, 03 Jun 2020 00:00:00 +0000

Reinforcement learning (RL) methods have been shown to be capable of learning intelligent behavior in rich domains. However, this has largely been done in simulated domains without adequate focus on the process of building the simulator. In this paper, we consider a setting where we have access to an ensemble of pre-trained and possibly inaccurate simulators (models). We approximate the real environment using a state-dependent linear combination of the ensemble, where the coefficients are determined by the given state features and some unknown parameters. Our proposed algorithm provably learns a near-optimal policy with a sample complexity polynomial in the number of unknown parameters, and incurs no dependence on the size of the state (or action) space. As an extension, we also consider the more challenging problem of model selection, where the state features are unknown and can be chosen from a large candidate set. We provide exponential lower bounds that illustrate the fundamental hardness of this problem, and develop a provably efficient algorithm under additional natural assumptions.

LIBRE: Learning Interpretable Boolean Rule Ensembles

Wed, 03 Jun 2020 00:00:00 +0000

We present a novel method—LIBRE—learn an interpretable classifier, which materializes as a set of Boolean rules. LIBRE uses an ensemble of bottom-up, weak learners operating on a random subset of features, which allows for the learning of rules that generalize well on unseen data even in imbalanced settings. Weak learners are combined with a simple union so that the final ensemble is also interpretable. Experimental results indicate that LIBRE efficiently strikes the right balance between prediction accuracy, which is competitive with black-box methods, and interpretability, which is often superior to alternative methods from the literature.

Revisiting Stochastic Extragradient

Wed, 03 Jun 2020 00:00:00 +0000

We fix a fundamental issue in the stochastic extragradient method by providing a new sampling strategy that is motivated by approximating implicit updates. Since the existing stochastic extragradient algorithm, called Mirror-Prox, of (Juditsky, 2011) diverges on a simple bilinear problem when the domain is not bounded, we prove guarantees for solving variational inequality that go beyond existing settings. Furthermore, we illustrate numerically that the proposed variant converges faster than many other methods on several convex-concave saddle-point problems. We also discuss how extragradient can be applied to training Generative Adversarial Networks (GANs) and how it compares to other methods. Our experiments on GANs demonstrate that the introduced approach may make the training faster in terms of data passes, while its higher iteration complexity makes the advantage smaller.

A Characterization of Mean Squared Error for Estimator with Bagging

Wed, 03 Jun 2020 00:00:00 +0000

Bagging can significantly improve the generalization performance of unstable machine learning algorithms such as trees or neural networks. Though bagging is now widely used in practice and many empirical studies have explored its behavior, we still know little about the theoretical properties of bagged predictions. In this paper, we theoretically investigate how the bagging method can reduce the Mean Squared Error (MSE) when applied on a statistical estimator. First, we prove that for any estimator, increasing the number of bagged estimators $N$ in the average can only reduce the MSE. This intuitive result, observed empirically and discussed in the literature, has not yet been rigorously proved. Second, we focus on the standard estimator of variance called unbiased sample variance and we develop an exact analytical expression of the MSE for this estimator with bagging. This allows us to rigorously discuss the number of iterations $N$ and the batch size $m$ of the bagging method. From this expression, we state that only if the kurtosis of the distribution is greater than $\frac{3}{2}$, the MSE of the variance estimator can be reduced with bagging. This result is important because it demonstrates that for distribution with low kurtosis, bagging can only deteriorate the performance of a statistical prediction. Finally, we propose a novel general-purpose algorithm to estimate with high precision the variance of a sample.

Screening Data Points in Empirical Risk Minimization via Ellipsoidal Regions and Safe Loss Functions

Wed, 03 Jun 2020 00:00:00 +0000

We design simple screening tests to automatically discard data samples in empirical risk minimization withoutlosing optimization guarantees. We derive loss functions that produce dual objectives with a sparse solution. We also show how to regularize convex losses to ensure such a dual sparsity-inducing property, andpropose a general method to design screening tests for classification or regression based on ellipsoidal approximations of the optimal set. In addition to producing computational gains, our approach also allows us to compress a dataset into a subset of representative points.

Quantitative stability of optimal transport maps and linearization of the 2-Wasserstein space

Wed, 03 Jun 2020 00:00:00 +0000

This work studies an explicit embedding of the set of probability measures into a Hilbert space, defined using optimal transport maps from a reference probability density. This embedding linearizes to some extent the 2-Wasserstein space and is shown to be bi-Hölder continuous. It enables the direct use of generic supervised and unsupervised learning algorithms on measure data consistently w.r.t. the Wasserstein geometry.

Gaussianization Flows

Wed, 03 Jun 2020 00:00:00 +0000

Iterative Gaussianization is a fixed-point iteration procedure that allows one to transform a continuous distribution to Gaussian distribution. Based on iterative Gaussianization, we propose a new type of normalizing flow models that grants both efficient computation of likelihoods and efficient inversion for sample generation. We demonstrate that this new family of flow models, named as Gaussianization flows, are universal approximators for continuous probability distributions under some regularity conditions. This guaranteed expressivity, enabling them to capture multimodal target distributions better without compromising the efficiency in sample generation. Experimentally, we show that Gaussianization flows achieve better or comparable performance on several tabular datasets, compared to other efficiently invertible flow models such as Real NVP, Glow and FFJORD. In particular, Gaussianization flows are easier to initialize, demonstrate better robustness with respect to different transformations of the training data, and generalize better on small training sets.

Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation

Wed, 03 Jun 2020 00:00:00 +0000

We consider stochastic second-order methods for minimizing smooth and strongly-convex functions under an interpolation condition satisfied by over-parameterized models. Under this condition, we show that the regularized subsampled Newton method (R-SSN) achieves global linear convergence with an adaptive step-size and a constant batch-size. By growing the batch size for both the subsampled gradient and Hessian, we show that R-SSN can converge at a quadratic rate in a local neighbourhood of the solution. We also show that R-SSN attains local linear convergence for the family of self-concordant functions. Furthermore, we analyze stochastic BFGS algorithms in the interpolation setting and prove their global linear convergence. We empirically evaluate stochastic L-BFGS and a "Hessian-free" implementation of R-SSN for binary classification on synthetic, linearly-separable datasets and real datasets under a kernel mapping. Our experimental results demonstrate the fast convergence of these methods, both in terms of the number of iterations and wall-clock time.

A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players

Wed, 03 Jun 2020 00:00:00 +0000

We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider the challenging heterogeneous setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of leveraging forced collisions for implicit communication and that of performing matching eliminations. We present a finite-time analysis of our algorithm, giving the first sublinear minimax regret bound for this problem, and prove that if the optimal assignment of players to arms is unique, our algorithm attains the optimal O(ln(T)) regret, solving an open question raised at NeurIPS 2018 by Bistritz and Leshem (2018).

Automatic Differentiation of Some First-Order Methods in Parametric Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We aim at computing the derivative of the solution to a parametric optimization problem with respect to the involved parameters. For a class broader than that of strongly convex functions, this can be achieved by automatic differentiation of iterative minimization algorithms. If the iterative algorithm converges pointwise, then we prove that the derivative sequence also converges pointwise to the derivative of the minimizer with respect to the parameters. Moreover, we provide convergence rates for both sequences. In particular, we prove that the accelerated convergence rate of the Heavy-ball method compared to Gradient Descent also accelerates the derivative computation. An experiment with L2-Regularized Logistic Regression validates the theoretical results.

A Three Sample Hypothesis Test for Evaluating Generative Models

Wed, 03 Jun 2020 00:00:00 +0000

Detecting overfitting in generative models is an important challenge in machine learning. In this work, we formalize a form of overfitting that we call {\em{data-copying}} – where the generative model memorizes and outputs training samples or small variations thereof. We provide a three sample test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model, and study the performance of our test on several canonical models and datasets.

Formal Limitations on the Measurement of Mutual Information

Wed, 03 Jun 2020 00:00:00 +0000

Measuring mutual information from finite data is difficult. Recent work has considered variational methods maximizing a lower bound. In this paper, we prove that serious statistical limitations are inherent to any method of measuring mutual information. More specifically, we show that any distribution-free high-confidence lower bound on mutual information estimated from N samples cannot be larger than O(ln N).

RATQ: A Universal Fixed-Length Quantizer for Stochastic Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We present Rotated Adaptive Tetra-iterated Quantizer (RATQ), afixed-length quantizer for gradients in first order stochasticoptimization. RATQ is easy to implement and involves only a Hadamard transform computation and adaptive uniform quantization with appropriately chosen dynamic ranges. For noisy gradients with almost surely bounded Euclidean norms, we establish an informationtheoretic lower bound for optimization accuracy using finite precisiongradients and show that RATQ almost attains this lower bound. For mean square bounded noisy gradients, we use a gain-shape quantizer which separately quantizes the Euclidean norm and uses RATQ to quantize the normalized unit norm vector. We establish lower bounds for performance of any optimization procedure and shape quantizer, when used with a uniform gain quantizer. Finally, we propose an adaptive quantizer for gain which when used with RATQ for shape quantizer outperforms uniform gain quantization and is, in fact, close to optimal.

Support recovery and sup-norm convergence rates for sparse pivotal estimation

Wed, 03 Jun 2020 00:00:00 +0000

In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level. The canonical pivotal estimator is the square-root Lasso, formulated along with its derivatives as a “non-smooth + non-smooth” optimization problem. Modern techniques to solve these include smoothing the datafitting term, to benefit from fast efficient proximal algorithms. In this work we show minimax sup-norm convergence rates for non smoothed and smoothed, single task and multitask square-root Lasso-type estimators. Thanks to our theoretical analysis, we provide some guidelines on how to set the smoothing hyperparameter, and illustrate on synthetic data the interest of such guidelines.

BasisVAE: Translation-invariant feature-level clustering with Variational Autoencoders

Wed, 03 Jun 2020 00:00:00 +0000

Variational Autoencoders (VAEs) provide a flexible and scalable framework for non-linear dimensionality reduction. However, in application domains such as genomics where data sets are typically tabular and high-dimensional, a black-box approach to dimensionality reduction does not provide sufficient insights. Common data analysis workflows additionally use clustering techniques to identify groups of similar features. This usually leads to a two-stage process, however, it would be desirable to construct a joint modelling framework for simultaneous dimensionality reduction and clustering of features. In this paper, we propose to achieve this through the BasisVAE: a combination of the VAE and a probabilistic clustering prior, which lets us learn a one-hot basis function representation as part of the decoder network. Furthermore, for scenarios where not all features are aligned, we develop an extension to handle translation-invariant basis functions. We show how a collapsed variational inference scheme leads to scalable and efficient inference for BasisVAE, demonstrated on various toy examples as well as on single-cell gene expression data.

Neural Decomposition: Functional ANOVA with Variational Autoencoders

Wed, 03 Jun 2020 00:00:00 +0000

Variational Autoencoders (VAEs) have become a popular approach for dimensionality reduction. However, despite their ability to identify latent low-dimensional structures embedded within high-dimensional data, these latent representations are typically hard to interpret on their own. Due to the black-box nature of VAEs, their utility for healthcare and genomics applications has been limited. In this paper, we focus on characterising the sources of variation in Conditional VAEs. Our goal is to provide a feature-level variance decomposition, i.e. to decompose variation in the data by separating out the marginal additive effects of latent variables z and fixed inputs c from their non-linear interactions. We propose to achieve this through what we call Neural Decomposition – an adaptation of the well-known concept of functional ANOVA variance decomposition from classical statistics to deep learning models. We show how identifiability can be achieved by training models subject to constraints on the marginal properties of the decoder networks. We demonstrate the utility of our Neural Decomposition on a series of synthetic examples as well as high-dimensional genomics data.

Hyperbolic Manifold Regression

Wed, 03 Jun 2020 00:00:00 +0000

Geometric representation learning has shown great promise for important tasks inartificial intelligence and machine learning. However, an open problem is yethow to integrate non-Euclidean representations with standard machine learningmethods.In this work, we consider the task of regression onto hyperbolic space for whichwe propose two approaches: a non-parametric kernel-method for which we also proveexcess risk bounds and a parametric deep learning model that is informed bythe geodesics of the target space.By recasting predictions on trees as manifold regression problems we demonstrate the applications of our approach on two challenging tasks: 1)hierarchical classification via label embeddings and 2) inventing new conceptsby predicting their embedding in a continuous representation of a base taxonomy.In our experiments, we find that the proposed estimators outperform their naivecounterparts that perform regression in the ambient Euclidean space.

Learning in Gated Neural Networks

Wed, 03 Jun 2020 00:00:00 +0000

Gating is a key feature in modern neural networks including LSTMs, GRUs and sparsely-gated deep neural networks. The backbone of such gated networks is a mixture-of-experts layer, where several experts make regression decisions and gating controls how to weigh the decisions in an input-dependent manner. Despite having such a prominent role in both modern and classical machine learning, very little is understood about parameter recovery of mixture-of-experts since gradient descent and EM algorithms are known to be stuck in local optima in such models.In this paper, we perform a careful analysis of the optimization landscape and show that with appropriately designed loss functions, gradient descent can indeed learn the parameters accurately. A key idea underpinning our results is the design of two {\em distinct} loss functions, one for recovering the expert parameters and another for recovering the gating parameters. We demonstrate the first sample complexity results for parameter recovery in this model for any algorithm and demonstrate significant performance gains over standard loss functions in numerical experiments.

LdSM: Logarithm-depth Streaming Multi-label Decision Trees

Wed, 03 Jun 2020 00:00:00 +0000

We consider multi-label classification where the goal is to annotate each data point with the most relevant subset of labels from an extremely large label set. Efficient annotation can be achieved with balanced tree predictors, i.e. trees with logarithmic-depth in the label complexity, whose leaves correspond to labels. Designing prediction mechanism with such trees for real data applications is non-trivial as it needs to accommodate sending examples to multiple leaves while at the same time sustain high prediction accuracy. In this paper we develop the LdSM algorithm for the construction and training of multi-label decision trees, where in every node of the tree we optimize a novel objective function that favors balanced splits, maintains high class purity of children nodes, and allows sending examples to multiple directions but with a penalty that prevents tree over-growth. Each node of the tree is trained once the previous node is completed leading to a streaming approach for training. We analyze the proposed objective theoretically and show that minimizing it leads to pure and balanced data splits. Furthermore, we show a boosting theorem that captures its connection to the multi-label classification error. Experimental results on benchmark data sets demonstrate that our approach achieves high prediction accuracy and low prediction time and position LdSM as a competitive tool among existing state-of-the-art approaches.

Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data

Wed, 03 Jun 2020 00:00:00 +0000

Recently, new methods for model assessment, based on subsampling and posterior approximations, have been proposed for scaling leave-one-out cross-validation (LOO-CV) to large datasets. Although these methods work well for estimating predictive performance for individual models, they are less powerful in model comparison. We propose an efficient method for estimating differences in predictive performance by combining fast approximate LOO surrogates with exact LOO sub-sampling using the difference estimator and supply proofs with regards to scaling characteristics. The resulting approach can be orders of magnitude more efficient than previous approaches, as well as being better suited to model comparison.

RCD: Repetitive causal discovery of linear non-Gaussian acyclic models with latent confounders

Wed, 03 Jun 2020 00:00:00 +0000

Causal discovery from data affected by latent confounders is an important and difficult challenge. Causal functional model-based approaches have not been used to present variables whose relationships are affected by latent confounders, while some constraint-based methods can present them. This paper proposes a causal functional model-based method called repetitive causal discovery (RCD) to discover the causal structure of observed variables affected by latent confounders. RCD repeats inferring the causal directions between a small number of observed variables and determines whether the relationships are affected by latent confounders. RCD finally produces a causal graph where a bi-directed arrow indicates the pair of variables that have the same latent confounders, and a directed arrow indicates the causal direction of a pair of variables that are not affected by the same latent confounder. The results of experimental validation using simulated data and real-world data confirmed that RCD is effective in identifying latent confounders and causal directions between observed variables.

Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms

Wed, 03 Jun 2020 00:00:00 +0000

The statistical analysis of Randomized Numerical Linear Algebra (RandNLA) algorithms within the past few years has mostly focused on their performance as point estimators. However, this is insufficient for conducting statistical inference, e.g., constructing confidence intervals and hypothesis testing, since the distribution of the estimator is lacking. In this article, we develop asymptotic analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem. In particular, we derive the asymptotic distribution of a general sampling estimator with arbitrary sampling probabilities. The analysis is conducted in two complementary settings, i.e., when the objective of interest is to approximate the full sample estimator or is to infer the underlying ground truth model parameters. For each setting, we show that the sampling estimator is asymptotically normally distributed under mild regularity conditions. Moreover, the sampling estimator is asymptotically unbiased in both settings. Based on our asymptotic analysis, we use two criteria, the Asymptotic Mean Squared Error (AMSE) and the Expected Asymptotic Mean Squared Error (EAMSE), to identify optimal sampling probabilities. Several of these optimal sampling probability distributions are new to the literature, e.g., the root leverage sampling estimator and the predictor length sampling estimator. Our theoretical results clarify the role of leverage in the sampling process, and our empirical results demonstrate improvements over existing methods.

Additive Tree-Structured Covariance Function for Conditional Parameter Spaces in Bayesian Optimization

Wed, 03 Jun 2020 00:00:00 +0000

Bayesian optimization (BO) is a sample-efficient global optimization algorithm for black-box functions which are expensive to evaluate. Existing literature on model based optimization in conditional parameter spaces are usually built on trees. In this work, we generalize the additive assumption to tree-structured functions and propose an additive tree-structured covariance function, showing improved sample-efficiency, wider applicability and greater flexibility. Furthermore, by incorporating the structure information of parameter spaces and the additive assumption in the BO loop, we develop a parallel algorithm to optimize the acquisition function and this optimization can be performed in a low dimensional space. We demonstrate our method on an optimization benchmark function, as well as on a neural network model compression problem, and experimental results show our approach significantly outperforms the current state of the art for conditional parameter optimization including SMAC, TPE and Jenatton et al. (2017).

Exploiting Categorical Structure Using Tree-Based Methods

Wed, 03 Jun 2020 00:00:00 +0000

Standard methods of using categorical variables as predictors either endow them with an ordinal structure or assume they have no structure at all. However, categorical variables often possess structure that is more complicated than a linear ordering can capture. We develop a mathematical framework for representing the structure of categorical variables and show how to generalize decision trees to make use of this structure. This approach is applicable to methods such as Gradient Boosted Trees which use a decision tree as the underlying learner. We show results on weather data to demonstrate the improvement yielded by this approach.

Ensemble Gaussian Processes with Spectral Features for Online Interactive Learning with Scalability

Wed, 03 Jun 2020 00:00:00 +0000

Combining benefits of kernels with Bayesian models, Gaussian process (GP) based approaches have well-documented merits not only in learning over a rich class of nonlinear functions, but also quantifying the associated uncertainty. While most GP approaches rely on a single preselected prior, the present work employs a weighted ensemble of GP priors, each having a unique covariance (kernel) belonging to a prescribed kernel dictionary – which leads to a richer space of learning functions. Leveraging kernel approximants formed by spectral features for scalability, an online interactive ensemble (OI-E) GP framework is developed to jointly learn the sought function, and for the first time select interactively the EGP kernel on-the-fly. Performance of OI-EGP is benchmarked by the best fixed function estimator via regret analysis. Furthermore, the novel OI-EGP is adapted to accommodate dynamic learning functions. Synthetic and real data tests demonstrate the effectiveness of the proposed schemes.

Mitigating Overfitting in Supervised Classification from Two Unlabeled Datasets: A Consistent Risk Correction Approach

Wed, 03 Jun 2020 00:00:00 +0000

The recently proposed unlabeled-unlabeled (UU) classification method allows us to train a binary classifier only from two unlabeled datasets with different class priors. Since this method is based on the empirical risk minimization, it works as if it is a supervised classification method, compatible with any model and optimizer. However, this method sometimes suffers from severe overfitting, which we would like to prevent in this paper. Our empirical finding in applying the original UU method is that overfitting often co-occurs with the empirical risk going negative, which is not legitimate. Therefore, we propose to wrap the terms that cause a negative empirical risk by certain correction functions. Then, we prove the consistency of the corrected risk estimator and derive an estimation error bound for the corrected risk minimizer. Experiments show that our proposal can successfully mitigate overfitting of the UU method and significantly improve the classification accuracy.

Interpretable Deep Gaussian Processes with Moments

Wed, 03 Jun 2020 00:00:00 +0000

Deep Gaussian Processes (DGPs) combine the the expressiveness of Deep Neural Networks (DNNs) with quantified uncertainty of Gaussian Processes (GPs). Expressive power and intractable inference both result from the non-Gaussian distribution over composition functions. We propose interpretable DGP based on approximating DGP as a GP by calculating the exact moments, which additionally identify the heavy-tailed nature of some DGP distributions. Consequently, our approach admits interpretation as both NNs with specified activation functions and as a variational approximation to DGP. We identify the expressivity parameter of DGP and find non-local and non-stationary correlation from DGP composition. We provide general recipes for deriving the effective kernels for DGP of two, three, or infinitely many layers, composed of homogeneous or heterogeneous kernels. Results illustrate the expressiveness of our effective kernels through samples from the prior and inference on simulated and real data and demonstrate advantages of interpretability by analysis of analytic forms, and draw relations and equivalences across kernels.

Accelerating Gradient Boosting Machines

Wed, 03 Jun 2020 00:00:00 +0000

Gradient Boosting Machine (GBM) introduced by \cite{friedman2001greedy} is a widely popular ensembling technique and is routinely used in competitions such as Kaggle and the KDDCup \citep{chen2016xgboost}. In this work, we propose an Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov’s acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore, with naive application, the errors can accumulate in the momentum term. To overcome it, we design a “corrected pseudo residual” that serves as a new target for fitting a weak learner, in order to perform the z-update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM type of algorithm with a theoretically-justified accelerated convergence rate.

Optimizing Millions of Hyperparameters by Implicit Differentiation

Wed, 03 Jun 2020 00:00:00 +0000

We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results about the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyper-parameters. For example, we learn a data-augmentation network—where every weight is a hyperparameter tuned for validation performance—outputting augmented training examples. Jointly tuning weights and hyper-parameters is only a few times more costly in memory and compute than standard training.

Online Learning Using Only Peer Prediction

Wed, 03 Jun 2020 00:00:00 +0000

This paper considers a variant of the classical online learning problem with expert predictions. Our model’s differences and challenges are due to lacking any direct feedback on the loss each expert incurs at each time step $t$. We propose an approach that uses peer prediction and identify conditions where it succeeds. Our techniques revolve around a carefully designed peer score function $s()$ that scores experts’ predictions based on the peer consensus. We show a sufficient condition, that we call \emph{peer calibration}, under which standard online learning algorithms using loss feedback computed by the carefully crafted $s()$ have bounded regret with respect to the unrevealed ground truth values. We then demonstrate how suitable $s()$ functions can be derived for different assumptions and models.

Competing Bandits in Matching Markets

Wed, 03 Jun 2020 00:00:00 +0000

Stable matching, a classical model for two-sided markets, has long been studied assuming known preferences. In reality agents often have to learn about their preferences through exploration. With the advent of massive online markets powered by data-driven matching platforms, it has become necessary to better understand the interplay between learning and market objectives. We propose a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. Our model extends the standard multi-armed bandits framework to multiple players, with the added feature that arms have preferences over players. We study both centralized and decentralized approaches to this problem and show surprising exploration-exploitation trade-offs compared to the single player multi-armed bandits setting.

High Dimensional Robust Sparse Regression

Wed, 03 Jun 2020 00:00:00 +0000

We provide a novel – and to the best of our knowledge, the first – algorithm for high dimensional sparse regression with constant fraction of corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters with sub-linear sample complexity,in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators:when the covariance matrix in sparse regression is identity, our error guarantee is near information-theoretically optimal. We then deal with robust sparse regression with unknown structured covariance matrix. We propose a filtering algorithm whichconsists of a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: the filtering algorithm is flexible enough to deal with unknown covariance.Also, it is orderwise more efficient computationally than the ellipsoid algorithm.Using sub-linear sample complexity, our algorithm achieves the best known (and first) error guarantee. We demonstrate the effectiveness on large-scale sparse regression problems with arbitrary corruptions.

A Double Residual Compression Algorithm for Efficient Distributed Learning

Wed, 03 Jun 2020 00:00:00 +0000

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the communication cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this paper, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over $95%$ of the overall communication such that the obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with start-of-the-art baselines.

More Powerful Selective Kernel Tests for Feature Selection

Wed, 03 Jun 2020 00:00:00 +0000

Refining one’s hypotheses in light of data is a commonplace scientific practice, however,this approach introduces selection bias and can lead to specious statisticalanalysis.One approach of addressing this phenomena is via conditioning on the selection procedure, i.e., how we have used the data to generate our hypotheses, and prevents information to be used again after selection.Many selective inference (a.k.a. post-selection inference) algorithms typically take this approach but will “over-condition”for sake of tractability. While this practice obtains well calibrated $p$-values,it can incur a major loss in power. In our work, we extend two recent proposals for selecting features using the Maximum Mean Discrepancyand Hilbert Schmidt Independence Criterion to condition on the minimalconditioning event. We show how recent advances inmultiscale bootstrap makesthis possible and demonstrate our proposal over a range of synthetic and real world experiments.Our results show that our proposed test is indeed more powerful in most scenarios.

Automatic Differentiation of Sketched Regression

Wed, 03 Jun 2020 00:00:00 +0000

Sketching for speeding up regression problems involves using a sketching matrix $S$ to quickly find the approximate solution to a linear least squares regression (LLS) problem: given $A$ of size $n \times d$, with $n \gg d$, along with $b$ of size $n \times 1$, we seek a vector $y$ with minimal regression error $\lVert A y - b\rVert_2$. This approximation technique is now standard in data science, and many software systems use sketched regression internally, as a component. It is often useful to calculate derivatives (gradients for the purpose of optimization, for example) of such large systems, where sketched LLS is merely a component of a larger system whose derivatives are needed. To support Automatic Differentiation (AD) of systems containing sketched LLS, we consider propagating derivatives through $\textrm{LLS}$: both propagating perturbations (forward AD) and gradients (reverse AD). AD performs accurate differentiation and is efficient for problems with a huge number of independent variables. Since we use $\textrm{LLS}_S$ (sketched LLS) instead of $\textrm{LLS}$ for reasons of efficiency, propagation of derivatives also needs to trade accuracy for efficiency, presumably by sketching. There are two approaches for this: (a) use AD to transform the code that defines $\textrm{LLS}_S$, or (b) approximate exact derivative propagation through $\textrm{LLS}$ using sketching methods. We provide strong bounds on the errors produced due to these two natural forms of sketching in the context of AD, giving the first dimensionality reduction analysis for calculating the derivatives of a sketched computation. Our results crucially depend on the analysis of the operator norm of a sketched inverse matrix product. Extensive experiments on both synthetic and real-world experiments demonstrate the efficacy of our sketched gradients.

Sketching Transformed Matrices with Applications to Natural Language Processing

Wed, 03 Jun 2020 00:00:00 +0000

Suppose we are given a large matrix $A=(a_{i,j})$ that cannot be stored in memory but is in a disk or is presented in a data stream. However, we need to compute a matrix decomposition of the entry-wisely transformed matrix, $f(A):=(f(a_{i,j}))$ for some function $f$. Is it possible to do it in a space efficient way? Many machine learning applications indeed need to deal with such large transformed matrices, for example word embedding method in NLP needs to work with the pointwise mutual information (PMI) matrix, while the entrywise transformation makes it difficult to apply known linear algebraic tools. Existing approaches for this problem either need to store the whole matrix and perform the entry-wise transformation afterwards, which is space consuming or infeasible, or need to redesign the learning method, which is application specific and requires substantial remodeling.In this paper, we first propose a space-efficient sketching algorithm for computing the product of a given small matrix with the transformed matrix. It works for a general family of transformations with provable small error bounds and thus can be used as a primitive in downstream learning tasks. We then apply this primitive to two concrete applications: low-rank approximation and linear regressions. We show that our approach obtains small error and is efficient in both space and time. For instance, for a large $n\times n$ matrix $A$, we show that only $\tilde{O}(nk^3)$ space and a few scans over the matrix $A$ are needed to compute a rank-$k$ approximation of $\log(|A|+1)$ to a fixed accuracy. This is a nearly quadratic space improvement for small $k$. We complement our theoretical results with experiments of low-rank approximation on synthetic and real data.

Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks

Wed, 03 Jun 2020 00:00:00 +0000

Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels including significantly corrupted ones. Despite this (over)fitting capacity in this paper we demonstrate that such overparameterized networks have an intriguing robustness capability: they are surprisingly robust to label noise when first order methods with early stopping is used to train them. This paper also takes a step towards demystifying this phenomena. Under a rich dataset model, we show that gradient descent is provably robust to noise/corruption on a constant fraction of the labels. In particular, we prove that: (i) In the first few iterations where the updates are still in the vicinity of the initialization gradient descent only fits to the correct labels essentially ignoring the noisy labels. (ii) To start to overfit to the noisy labels network must stray rather far from the initialization which can only occur after many more iterations. Together, these results show that gradient descent with early stopping is provably robust to label noise and shed light on the empirical robustness of deep networks as well as commonly adopted heuristics to prevent overfitting.

Scalable Gradients for Stochastic Differential Equations

Wed, 03 Jun 2020 00:00:00 +0000

The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differentialequation whose solution is the gradient, a memory-efficient algorithm for cachingnoise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance ona 50-dimensional motion capture dataset.

Efficient Planning under Partial Observability with Unnormalized Q Functions and Spectral Learning

Wed, 03 Jun 2020 00:00:00 +0000

Learning and planning in partially-observable domains is one of the most difficult problems in reinforcement learning. Traditional methods consider these two problems as independent, resulting in a classic two-stage paradigm: first learn the environment dynamics and then compute the optimal policy accordingly. This approach, however, disconnects the reward information from the learning of the environment model and can consequently lead to representations that are sample inefficient and time consuming for planning purpose. In this paper, we propose a novel algorithm that incorporate reward information into the representations of the environment to unify these two stages. Our algorithm is closely related to the spectral learning algorithm for predicitive state representations and offers appealing theoretical guarantees and time complexity. We empirically show on two domains that our approach is more sample and time efficient compared to classical methods.

Censored Quantile Regression Forest

Wed, 03 Jun 2020 00:00:00 +0000

Random forests are powerful non-parametric regression method but are severely limited in their usage in the presence of randomly censored observations, and naively applied can exhibit poor predictive performance due to the incurred biases. Based on a local adaptive representation of random forests, we develop its regression adjustment for randomly censored regression quantile models. Regression adjustment is based on a new estimating equation that adapts to censoring and leads to quantile score whenever the data do not exhibit censoring. The proposed procedure named censored quantile regression forest, allows us to estimate quantiles of time-to-event without any parametric modeling assumption. We establish its consistency under mild model specifications. Numerical studies showcase a clear advantage of the proposed procedure.

Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction

Wed, 03 Jun 2020 00:00:00 +0000

Due to the imminent need to alleviate the communication burden in multi-agent and federated learning, the investigation of communication-efficient distributed optimization algorithms for empirical risk minimization has flourished recently. A large fraction of existing algorithms are developed for the master/slave setting, relying on the presence of a central parameter server. This paper focuses on distributed optimization in the network setting (also known as the decentralized setting), where each agent is only allowed to aggregate information from its neighbors over a graph. By properly adjusting the global gradient estimate via a tracking term, we first develop a communication-efficient approximate Newton-type method, called Network-DANE, which generalizes the attractive DANE algorithm to decentralized networks. Our key algorithmic ideas can be applied, in a systematic manner, to obtain decentralized versions of other master/slave distributed algorithms. Notably, we develop Network-SVRG/SARAH, which employ stochastic variance reduction at each agent to accelerate local computations. We establish linear convergence of Network-DANE and Network-SVRG for strongly convex losses, and Network-SARAH for quadratic losses, which shed light on the impact of data homogeneity, network connectivity, and local averaging upon the rate of convergence. Numerical evidence is provided to demonstrate the appealing performance of our algorithms over competitive baselines, in terms of both communication and computation efficiency.

Regularization via Structural Label Smoothing

Wed, 03 Jun 2020 00:00:00 +0000

Regularization is an effective way to promote the generalization performance of machine learning models. In this paper, we focus on label smoothing, a form of output distribution regularization that prevents overfitting of a neural network by softening the ground-truth labels in the training data in an attempt to penalize overconfident outputs. Existing approaches typically use cross-validation to impose this smoothing, which is uniform across all training data. In this paper, we show that such label smoothing imposes a quantifiable bias in the Bayes error rate of the training data, with regions of the feature space with high overlap and low marginal likelihood having a lower bias and regions of low overlap and high marginal likelihood having a higher bias. These theoretical results motivate a simple objective function for data-dependent smoothing to mitigate the potential negative consequences of the operation while maintaining its desirable properties as a regularizer. We call this approach Structural Label Smoothing (SLS). We implement SLS and empirically validate on synthetic, Higgs, SVHN, CIFAR-10, and CIFAR-100 datasets. The results confirm our theoretical insights and demonstrate the effectiveness of the proposed method in comparison to traditional label smoothing.

A Fast Anderson-Chebyshev Acceleration for Nonlinear Optimization

Wed, 03 Jun 2020 00:00:00 +0000

\emph{Anderson acceleration} (or Anderson mixing) is an efficient acceleration method for fixed point iterations $x_{t+1}=G(x_t)$, e.g., gradient descent can be viewed as iteratively applying the operation $G(x) \triangleq x-\alphaabla f(x)$. It is known that Anderson acceleration is quite efficient in practice and can be viewed as an extension of Krylov subspace methods for nonlinear problems. In this paper, we show that Anderson acceleration with Chebyshev polynomial can achieve the optimal convergence rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$, which improves the previous result $O(\kappa\ln\frac{1}{\epsilon})$ provided by (Toth & Kelley, 2015) for quadratic functions. Moreover, we provide a convergence analysis for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smooth parameter $L$) are not available, we propose a \emph{guessing algorithm} for guessing them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that the proposed Anderson-Chebyshev acceleration method converges significantly faster than other algorithms, e.g., vanilla gradient descent (GD), Nesterov’s Accelerated GD. Also, these algorithms combined with the proposed guessing algorithm (guessing the hyperparameters dynamically) achieve much better performance.

Understanding Generalization in Deep Learning via Tensor Methods

Wed, 03 Jun 2020 00:00:00 +0000

Deep neural networks generalize well on unseen data though the number of parameters often far exceeds the number of training examples. Recently proposed complexity measures have provided insights to understanding the generalizability in neural networks from perspectives of PAC-Bayes, robustness, overparametrization, compression and so on. In this work, we advance the understanding of the relations between the network’s architecture and its generalizability from the compression perspective. Using tensor analysis, we propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks; thus, in practice, our generalization bound outperforms the previous compression-based ones, especially for neural networks using tensors as their weight kernels (e.g. CNNs). Moreover, these intuitive measurements provide further insights into designing neural network architectures with properties favorable for better/guaranteed generalizability. Our experimental results demonstrate that through the proposed measurable properties, our generalization error bound matches the trend of the test error well. Our theoretical analysis further provides justifications for the empirical success and limitations of some widely-used tensor-based compression approaches. We also discover the improvements to the compressibility and robustness of current neural networks when incorporating tensor operations via our proposed layer-wise structure.

Robust Importance Weighting for Covariate Shift

Wed, 03 Jun 2020 00:00:00 +0000

In many learning problems, the training and testing data follow different distributions and a particularly common situation is the \textit{covariate shift}. To correct for sampling biases, most approaches, including the popular kernel mean matching (KMM), focus on estimating the importance weights between the two distributions. Reweighting-based methods, however, are exposed to high variance when the distributional discrepancy is large. On the other hand, the alternate approach of using nonparametric regression (NR) incurs high bias when the training size is limited. In this paper, we propose and analyze a new estimator that systematically integrates the residuals of NR with KMM reweighting, based on a control-variate perspective. The proposed estimator can be shown to either strictly outperform or match the best-known existing rates for both KMM and NR, and thus is a robust combination of both estimators. The experiments shows the estimator works well in practice.

On the Convergence of SARAH and Beyond

Wed, 03 Jun 2020 00:00:00 +0000

The main theme of this work is a unifying algorithm, \textbf{L}oop\textbf{L}ess \textbf{S}ARAH (L2S) for problems formulated as summation of $n$ individual loss functions. L2S broadens a recently developed variance reduction method known as SARAH. To find an $\epsilon$-accurate solution, L2S enjoys a complexity of ${\cal O}\big( (n+\kappa) \ln (1/\epsilon)\big)$ for strongly convex problems. For convex problems, when adopting an $n$-dependent step size, the complexity of L2S is ${\cal O}(n+ \sqrt{n}/\epsilon)$; while for more frequently adopted $n$-independent step size, the complexity is ${\cal O}(n+ n/\epsilon)$. Distinct from SARAH, our theoretical findings support an $n$-independent step size in convex problems without extra assumptions. For nonconvex problems, the complexity of L2S is ${\cal O}(n+ \sqrt{n}/\epsilon)$. Our numerical tests on neural networks suggest that L2S can have better generalization properties than SARAH. Along with L2S, our side results include the linear convergence of the last iteration for SARAH in strongly convex problems.

Wasserstein Smoothing: Certified Robustness against Wasserstein Adversarial Attacks

Wed, 03 Jun 2020 00:00:00 +0000

In the last couple of years, several adversarial attack methods based on different threat models have been proposed for the image classification problem. Most existing defenses consider additive threat models in which sample perturbations have bounded L_p norms. These defenses, however, can be vulnerable against adversarial attacks under non-additive threat models. An example of an attack method based on a non-additive threat model is the Wasserstein adversarial attack proposed by Wong et al. (2019), where the distance between an image and its adversarial example is determined by the Wasserstein metric ("earth-mover distance") between their normalized pixel intensities. Until now, there has been no certifiable defense against this type of attack. In this work, we propose the first defense with certified robustness against Wasserstein adversarial attacks using randomized smoothing. We develop this certificate by considering the space of possible flows between images, and representing this space such that Wasserstein distance between images is upper-bounded by L_1 distance in this flow-space. We can then apply existing randomized smoothing certificates for the L_1 metric. In MNIST and CIFAR-10 datasets, we find that our proposed defense is also practically effective, demonstrating significantly improved accuracy under Wasserstein adversarial attack compared to unprotected models.

Purifying Interaction Effects with the Functional ANOVA: An Efficient Algorithm for Recovering Identifiable Additive Models

Wed, 03 Jun 2020 00:00:00 +0000

Models which estimate main effects of individual variables alongside interaction effects have an identifiability challenge: effects can be freely moved between main effects and interaction effects without changing the model prediction. This is a critical problem for interpretability because it permits “contradictory" models to represent the same function. To solve this problem, we propose pure interaction effects: variance in the outcome which cannot be represented by any subset of features. This definition has an equivalence with the Functional ANOVA decomposition. To compute this decomposition, we present a fast, exact algorithm that transforms any piecewise-constant function (such as a tree-based model) into a purified, canonical representation. We apply this algorithm to Generalized Additive Models with interactions trained on several datasets and show large disparity, including contradictions, between the apparent and the purified effects. These results underscore the need to specify data distributions and ensure identifiability before interpreting model parameters.

The Implicit Regularization of Ordinary Least Squares Ensembles

Wed, 03 Jun 2020 00:00:00 +0000

Ensemble methods that average over a collection of independent predictors that are each limited to a subsampling of both the examples and features of the training data command a significant presence in machine learning, such as the ever-popular random forest, yet the nature of the subsampling effect, particularly of the features, is not well understood. We study the case of an ensemble of linear predictors, where each individual predictor is fit using ordinary least squares on a random submatrix of the data matrix. We show that, under standard Gaussianity assumptions, when the number of features selected for each predictor is optimally tuned, the asymptotic risk of a large ensemble is equal to the asymptotic ridge regression risk, which is known to be optimal among linear predictors in this setting. In addition to eliciting this implicit regularization that results from subsampling, we also connect this ensemble to the dropout technique used in training deep (neural) networks, another strategy that has been shown to have a ridge-like regularizing effect.

Thresholding Graph Bandits with GrAPL

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we introduce a new online decision making paradigm that we call Thresholding Graph Bandits. The main goal is to efficiently identify a subset of arms in a multi-armed bandit problem whose means are above a specified threshold. While traditionally in such problems, the arms are assumed to be independent, in our paradigm we further suppose that we have access to the similarity between the arms in the form of a graph, allowing us to gain information about the arm means with fewer samples. Such a feature is particularly relevant in modern decision making problems, where rapid decisions need to be made in spite of the large number of options available. We present GrAPL, a novel algorithm for the thresholding graph bandit problem. We demonstrate theoretically that this algorithm is effective in taking advantage of the graph structure when the structure is reflective of the distribution of the rewards. We confirm these theoretical findings via experiments on both synthetic and real data.

Prior-aware Composition Inference for Spectral Topic Models

Wed, 03 Jun 2020 00:00:00 +0000

Spectral algorithms operate on matrices or tensors of word co-occurrence to learn latent topics. These approaches remove the dependence on the original documents and produce substantial gains in efficiency with provable inference, but at a cost: the models can no longer infer any information about individual documents. Thresholded Linear Inverse is developed to learn document-specific topic compositions, but its linear characteristics limit the inference quality without considering any prior information on topic distributions. We propose two novel estimation methods that respect previously unclear prior structures of spectral topic models. Experiments on a variety of synthetic to real collections demonstrate that our Prior-Aware Dual Decomposition outperforms the baseline method, whereas our Prior-Aware Manifold Iteration performs even better on short realistic data.

Convergence Rates of Smooth Message Passing with Rounding in Entropy-Regularized MAP Inference

Wed, 03 Jun 2020 00:00:00 +0000

Maximum a posteriori (MAP) inference is a fundamental computational paradigm for statistical inference. In the setting of graphical models, MAP inference entails solving a combinatorial optimization problem to find the most likely configuration of the discrete-valued model. Linear programming (LP) relaxations in the Sherali-Adams hierarchy are widely used to attempt to solve this problem, and smooth message passing algorithms have been proposed to solve regularized versions of these LPs with great success. This paper leverages recent work in entropy-regularized LPs to analyze convergence rates of a class of edge-based smooth message passing algorithms to epsilon-optimality in the relaxation. With an appropriately chosen regularization constant, we present a theoretical guarantee on the number of iterations sufficient to recover the true integral MAP solution when the LP is tight and the solution is unique.

Contextual Constrained Learning for Dose-Finding Clinical Trials

Wed, 03 Jun 2020 00:00:00 +0000

Clinical trials in the medical domain are constrained by budgets. The number of patients that can be recruited is therefore limited. When a patient population is heterogeneous, this creates difficulties in learning subgroup specific responses to a particular drug and especially for a variety of dosages. In addition, patient recruitment can be difficult by the fact that clinical trials do not aim to provide a benefit to any given patient in the trial. In this paper, we propose C3T-Budget, a contextual constrained clinical trial algorithm for dose-finding under both budget and safety constraints. The algorithm aims to maximize drug efficacy within the clinical trial while also learning about the drug being tested. C3T-Budget recruits patients with consideration of the remaining budget, the remaining time, and the characteristics of each group, such as the population distribution, estimated expected efficacy, and estimation credibility. In addition, the algorithm aims to avoid unsafe dosages. These characteristics are further illustrated in a simulated clinical trial study, which corroborates the theoretical analysis and demonstrates an efficient budget usage as well as a balanced learning-treatment trade-off.

Risk Bounds for Learning Multiple Components with Permutation-Invariant Losses

Wed, 03 Jun 2020 00:00:00 +0000

This paper proposes a simple approach to derive efficient error bounds for learning multiple components with sparsity-inducing regularization. We show that for such regularization schemes, known decompositions of the Rademacher complexity over the components can be used in a more efficient manner to result in tighter bounds without too much effort. We give examples of application to switching regression and center-based clustering/vector quantization. Then, the complete workflow is illustrated on the problem of subspace clustering, for which decomposition results were not previously available. For all these problems, the proposed approach yields risk bounds with mild dependencies on the number of components and completely removes this dependence for nonconvex regularization schemes that could not be handled by previous methods.

A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case

Wed, 03 Jun 2020 00:00:00 +0000

Recent work by Su, Boyd and Candes made a connection between Nesterov’s accelerated gradient descent method and an ordinary differential equation (ODE). We show that this connection can be extended to the case of stochastic gradients, and develop Lyapunov function based convergence rates proof for Nesterov’s accelerated stochastic gradient descent. In the gradient case, we show Nesterov’s method arises as a straightforward discretization of a modified ODE. Established Lyapunov analysis is used to recover the accelerated rates of convergence in both continuous and discrete time. Moreover, the Lyapunov analysis can be extended to the case of stochastic gradients. The result is a unified approach to accelerationin both continuous and discrete time, and in for both stochastic and full gradients.

EM Converges for a Mixture of Many Linear Regressions

Wed, 03 Jun 2020 00:00:00 +0000

We study the convergence of the Expectation-Maximization (EM) algorithm for mixtures of linear regressions with an arbitrary number $k$ of components. We show that as long as signal-to-noise ratio (SNR) is $\tilde{\Omega}(k)$, well-initialized EM converges to the true regression parameters. Previous results for $k \geq 3$ have only established local convergence for the noiseless setting, i.e., where SNR is infinitely large. Our results enlarge the scope to the environment with noises, and notably, we establish a statistical error rate that is independent of the norm (or pairwise distance) of the regression parameters. In particular, our results imply exact recovery as $\sigma \rightarrow 0$, in contrast to most previous local convergence results for EM, where the statistical error scaled with the norm of parameters. Standard moment-method approaches may be applied to guarantee we are in the region where our local convergence guarantees apply.

Randomized Exploration in Generalized Linear Bandits

Wed, 03 Jun 2020 00:00:00 +0000

We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the number of features and $K$ is the number of arms. The former improves on prior work while the latter is the first for Gaussian noise perturbations in non-linear models. We empirically evaluate both GLM-TSL and GLM-FPL in logistic bandits, and apply GLM-FPL to neural network bandits. Our work showcases the role of randomization, beyond posterior sampling, in exploration.

Domain-Liftability of Relational Marginal Polytopes

Wed, 03 Jun 2020 00:00:00 +0000

We study computational aspects of "relational marginal polytopes" which are statistical relational learning counterparts of marginal polytopes, well-known from probabilistic graphical models. Here, given some first-order logic formula, we can define its relational marginal statistic to be the fraction of groundings that make this formula true in a given possible world. For a list of first-order logic formulas, the relational marginal polytope is the set of all points that correspond to expected values of the relational marginal statistics that are realizable. In this paper we study the following two problems: (i) Do domain-liftability results for the partition functions of Markov logic networks (MLNs)carry over to the problem of relational marginal polytope construction? (ii) Is the relational marginal polytope containment problem hard under some plausible complexity-theoretic assumptions? Our positive results have consequences for lifted weight learning of MLNs. In particular, we show that weight learning of MLNs is domain-liftable whenever the computation of the partition function of the respective MLNs is domain-liftable (this result has not been rigorously proven before).

Active Community Detection with Maximal Expected Model Change

Wed, 03 Jun 2020 00:00:00 +0000

We present a novel active learning algorithm for community detection on networks. Our proposed algorithm uses a Maximal Expected Model Change (MEMC) criterion for querying network nodes label assignments. MEMC detects nodes that maximally change the community assignment likelihood model following a query. Our method is inspired by detection in the benchmark Stochastic Block Model (SBM), where we provide sample complexity analysis and empirical study with SBM and real network data for binary as well as for the multi-class settings. The analysis also covers the most challenging case of sparse degree and below-detection-threshold SBMs, where we observe a super-linear error reduction. MEMC is shown to be superior to the random selection baseline and other state-of-the-art active learners.

Regularized Autoencoders via Relaxed Injective Probability Flow

Wed, 03 Jun 2020 00:00:00 +0000

Invertible flow-based generative models are an effective method for learning to generate samples, while allowing for tractable likelihood computation and inference. However, the invertibility requirement restricts models to have the same latent dimensionality as the inputs. This imposes significant architectural, memory, and computational costs, making them more challenging to scale than other classes of generative models such as Variational Autoencoders (VAEs). We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity. This also provides another perspective on regularized autoencoders (RAEs), with our final objectives resembling RAEs with specific regularizers that are derived by lower bounding the probability flow objective. We empirically demonstrate the promise of the proposed model, improving over VAEs and AEs in terms of sample quality.

Ivy: Instrumental Variable Synthesis for Causal Inference

Wed, 03 Jun 2020 00:00:00 +0000

A popular way to estimate the causal effect of a variable x on y from observational data is to use an instrumental variable (IV): a third variable z that affects y only through x. The more strongly z is associated with x, the more reliable the estimate is, but such strong IVs are difficult to find. Instead, practitioners combine more commonly available IV candidates—which are not necessarily strong, or even valid, IVs—into a single "summary" that is plugged into causal effect estimators in place of an IV. In genetic epidemiology, such approaches are known as allele scores. Allele scores require strong assumptions—independence and validity of all IV candidates—for the resulting estimate to be reliable. To relax these assumptions, we propose Ivy, a new method to combine IV candidates that can handle correlated and invalid IV candidates in a robust manner. Theoretically, we characterize this robustness, its limits, and its impact on the resulting causal estimates. Empirically, Ivy can correctly identify the directionality of known relationships and is robust against false discovery (median effect size <= 0.025) on three real-world datasets with no causal effects, while allele scores return more biased estimates (median effect size >= 0.118).

Gaussian Sketching yields a J-L Lemma in RKHS

Wed, 03 Jun 2020 00:00:00 +0000

The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\bm K$ yields an operator whose counterpart in an RKHS $\cal H$, is a \emph{random projection} operator—in the spirit of Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\bm{K}$ corresponds to a particular random operator in (infinite-dimensional) Hilbert space $\cal H$ that maps functions $f \in \cal H$ to a low-dimensional space $\bb R^d$, while preserving a weighted RKHS inner-product of the form $⟨f, g \rangle_{\Sigma} \doteq ⟨f, \Sigma^3 g \rangle_{\cal H}$, where $\Sigma$ is the \emph{covariance} operator induced by the data distribution. In particular, under similar assumptions as in kernel PCA (KPCA), or kernel $k$-means (K-$k$-means), well-separated subsets of feature-space $\{K(\cdot, x): x \in \cal X\}$ remain well-separated after such operation, which suggests similar benefits as in KPCA and/or K-$k$-means, albeit at the much cheaper cost of a random projection. In particular, our convergence rates suggest that, given a large dataset $\{X_i\}_{i=1}^N$ of size $N$, we can build the Gram matrix $\bm K$ on a much smaller subsample of size $n\ll N$, so that the sketch $Z\bm K$ is very cheap to obtain and subsequently apply as a projection operator on the original data $\{X_i\}_{i=1}^N$. We verify these insights empirically on synthetic data, and on real-world clustering applications.

Computing Tight Differential Privacy Guarantees Using FFT

Wed, 03 Jun 2020 00:00:00 +0000

Differentially private (DP) machine learning has recently become popular. The privacy loss of DP algorithms is commonly reported using (e.d)-DP. In this paper, we propose a numerical accountant for evaluating the privacy loss for algorithms with continuous one dimensional output. This accountant can be applied to the subsampled multidimensional Gaussian mechanism which underlies the popular DP stochastic gradient descent. The proposed method is based on a numerical approximation of an integral formula which gives the exact (e.d)-values. The approximation is carried out by discretising the integral and by evaluating discrete convolutions using the fast Fourier transform algorithm. We give theoretical error bounds which show the convergence of the approximation and guarantee its accuracy to an arbitrary degree. We give both theoretical error bounds and numerical error estimates for the approximation. Experimental comparisons with state-of-the-art techniques demonstrate significant improvements in bound tightness and/or computation time.

Learning Rate Adaptation for Differentially Private Learning

Wed, 03 Jun 2020 00:00:00 +0000

Differentially private learning has recently emerged as the leading approach for privacy-preserving machine learning. Differential privacy can complicate learning procedures because each access to the data needs to be carefully designed and carries a privacy cost. For example, standard parameter tuning with a validation set cannot be easily applied. In this paper, we propose a differentially private algorithm for the adaptation of the learning rate for differentially private stochastic gradient descent (SGD) that avoids the need for validation set use. The idea for the adaptiveness comes from the technique of extrapolation in numerical analysis: to get an estimate for the error against the gradient flow we compare the result obtained by one full step and two half-steps. We prove the privacy of the method using the moments accountant mechanism. This allows us to compute tight privacy bounds. Empirically we show that our method is competitive with manually tuned commonly used optimisation methods for training deep neural networks and differentially private variational inference.

ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations

Wed, 03 Jun 2020 00:00:00 +0000

In applications such as molecule design or drug discovery, it is desirable to have an algorithm which recommends new candidate molecules based on the results of past tests. These molecules first need to be synthesized and then tested for objective properties. We describe ChemBO, a Bayesian optimization framework for generating and optimizing organic molecules for desired molecular properties. While most existing data-driven methods for this problem do not account for sample efficiency or fail to enforce realistic constraints on synthesizability, our approach explores synthesis graphs in a sample-efficient way and produces synthesizable candidates. We implement ChemBO as a Gaussian process model and explore existing molecular kernels for it. Moreover, we propose a novel optimal-transport based distance and kernel that accounts for graphical information explicitly. In our experiments, we demonstrate the efficacy of the proposed approach on several molecular optimization problems.

Sublinear Optimal Policy Value Estimation in Contextual Bandits

Wed, 03 Jun 2020 00:00:00 +0000

We study the problem of estimating the expected reward of the optimal policy in the stochastic disjoint linear bandit setting. We prove that for certain settings it is possible to obtain an accurate estimate of the optimal policy value even with a sublinear number of samples, where a linear set would be needed to reliably estimate the reward that can be obtained by any policy. We establish near matching information theoretic lower bounds, showing that our algorithm achieves near optimal estimation error. Finally, we demonstrate the effectiveness of our algorithm on joke recommendation and cancer inhibition dosage selection problems using real datasets.

The Expressive Power of a Class of Normalizing Flow Models

Wed, 03 Jun 2020 00:00:00 +0000

Normalizing flows have received a great deal of recent attention as they allow flexible generative modeling as well as easy likelihood computation. While a wide variety of flow models have been proposed, there is little formal understanding of the representation power of these models. In this work, we study some basic normalizing flows and rigorously establish bounds on their expressive power. Our results indicate that while these flows are highly expressive in one dimension, in higher dimensions their representation power may be limited, especially when the flows have moderate depth.

Adaptive, Distribution-Free Prediction Intervals for Deep Networks

Wed, 03 Jun 2020 00:00:00 +0000

The machine learning literature contains several constructions for prediction intervals that are intuitively reasonable but ultimately ad-hoc in that they do not come with provable performance guarantees. We present methods from the statistics literature that can be used efficiently with neural networks under minimal assumptions with guaranteed performance. We propose a neural network that outputs three values instead of a single point estimate and optimizes a loss function motivated by the standard quantile regression loss. We provide two prediction interval methods with finite sample coverage guarantees solely under the assumption that the observations are independent and identically distributed. The first method leverages the conformal inference framework and provides average coverage. The second method provides a new, stronger guarantee by conditioning on the observed data. Lastly, our loss function does not compromise the predictive accuracy of the network like other prediction interval methods. We demonstrate the ease of use of our procedures as well as its improvements over other methods on both simulated and real data. As most deep networks can easily be modified by our method to output predictions with valid prediction intervals, its use should become standard practice, much like reporting standard errors along with mean estimates.

Simulator Calibration under Covariate Shift with Kernels

Wed, 03 Jun 2020 00:00:00 +0000

We propose a novel calibration method for computer simulators, dealing with the problem of covariate shift.Covariate shift is the situation where input distributions for training and test are different, and ubiquitous in applications of simulations. Our approach is based on Bayesian inference with kernel mean embedding of distributions, and on the use of an importance-weighted reproducing kernel for covariate shift adaptation.We provide a theoretical analysis for the proposed method, including a novel theoretical result for conditional mean embedding, as well as empirical investigations suggesting its effectiveness in practice.The experiments include calibration of a widely used simulator for industrial manufacturing processes, where we also demonstrate how the proposed method may be useful for sensitivity analysis of model parameters.

Distributionally Robust Bayesian Optimization

Wed, 03 Jun 2020 00:00:00 +0000

Robustness to distributional shift is one of the key challenges of contemporary machine learning. Attaining such robustness is the goal of distributionally robust optimization, which seeks a solution to an optimization problem that is worst-case robust under a specified distributional shift of an uncontrolled covariate. In this paper, we study such a problem when the distributional shift is measured via the maximum mean discrepancy (MMD). For the setting of zeroth-order, noisy optimization, we present a novel distributionally robust Bayesian optimization algorithm (DRBO). Our algorithm provably obtains sub-linear robust regret in various settings that differ in how the uncertain covariate is observed. We demonstrate the robust performance of our method on both synthetic and real-world benchmarks.

Two-sample Testing Using Deep Learning

Wed, 03 Jun 2020 00:00:00 +0000

We propose a two-sample testing procedure based on learned deep neural network representations. To this end, we define two test statistics that perform an asymptotic location test on data samples mapped onto a hidden layer. The tests are consistent and asymptotically control the type-1 error rate. Their test statistics can be evaluated in linear time (in the sample size). Suitable data representations are obtained in a data-driven way, by solving a supervised or unsupervised transfer-learning task on an auxiliary (potentially distinct) data set. If no auxiliary data is available, we split the data into two chunks: one for learning representations and one for computing the test statistic. In experiments on audio samples, natural images and three-dimensional neuroimaging data our tests yield significant decreases in type-2 error rate (up to 35 percentage points) compared to state-of-the-art two-sample tests such as kernel-methods and classifier two-sample tests.

Stochastic Variance-Reduced Algorithms for PCA with Arbitrary Mini-Batch Sizes

Wed, 03 Jun 2020 00:00:00 +0000

We present two stochastic variance-reduced PCA algorithms and their convergence analyses. By deriving explicit forms of step size, epoch length and batch size to ensure the optimal runtime, we show that the proposed algorithms can attain the optimal runtime with any batch sizes. Also, we establish global convergence of the algorithms based on a novel approach, which studies the optimality gap as a ratio of two expectation terms. The framework in our analysis is general and can be used to analyze other stochastic variance-reduced PCA algorithms and improve their analyses. Moreover, we introduce practical implementations of the algorithms which do not require hyper-parameters. The experimental results show that the proposed methodsd outperform other stochastic variance-reduced PCA algorithms regardless of the batch size.

Multi-level Gaussian Graphical Models Conditional on Covariates

Wed, 03 Jun 2020 00:00:00 +0000

We address the problem of learning the structure of a high-dimensional Gaussian graphical model conditional on covariates, when each sample belongs to groups at multiple levels of hierarchy. The existing statistical methods for learning covariate-conditioned Gaussian graphical models focused on learning the aggregate behavior of inputs and outputs in a single-layer network. We propose a statistical model called multi-level conditional Gaussian graphical models for modeling multi-level output networks influenced by both individual-level and group-level inputs. We describe a decomposition of our model into a product of two components, one for sum variables and the other for difference variables derived from the original variables. This decomposition leads to an efficient learning algorithm for both complete data and incomplete data with randomly missing individual observations, as the expensive repeated computation of the partition function can be avoided. We demonstrate our method on simulated data and real-world data in finance and genomics.

Lipschitz Continuous Autoencoders in Application to Anomaly Detection

Wed, 03 Jun 2020 00:00:00 +0000

Anomaly detection is the task of finding abnormal data that are distinct from normal behavior. Current deep learning-based anomaly detection methods train neural networks with normal data alone and calculate anomaly scores based on the trained model. In this work, we formalize current practices, build a theoretical framework of anomaly detection algorithms equipped with an objective function and a hypothesis space, and establish a desirable property of the anomaly detection algorithm, namely, admissibility. Admissibility implies that optimal autoencoders for normal data yield a larger reconstruction error for anomalous data than that for normal data on average. We then propose a class of admissible anomaly detection algorithms equipped with an integral probability metric-based objective function and a class of autoencoders, Lipschitz continuous autoencoders. The proposed algorithm for Wasserstein distance is implemented by minimizing an approximated Wasserstein distance with a penalty to enforce Lipschitz continuity with respect to Wasserstein distance. Through ablation studies, we demonstrate the efficacy of enforcing Lipschitz continuity of the proposed method. The proposed method is shown to be more effective in detecting anomalies than existing methods via applications to network traffic and image datasets.

On casting importance weighted autoencoder to an EM algorithm to learn deep generative models

Wed, 03 Jun 2020 00:00:00 +0000

We propose a new and general approach to learn deep generative models. Our approach is based on a new observation that the importance weighted autoencoders (IWAE, Burda et al. (2015)) can be understood as a procedure of estimating the MLE with an EM algorithm. Utilizing this interpretation, we develop a new learning algorithm called importance weighted EM algorithm (IWEM). IWEM is an EM algorithm with self-normalized importance sampling (snIS) where the proposal distribution is carefully selected to reduce the variance due to snIS. In addition, we devise an annealing strategy to stabilize the learning algorithm. For missing data problems, we propose a modified IWEM algorithm called miss-IWEM. Using multiple benchmark datasets, we demonstrate empirically that our proposed methods outperform IWAE with significant margins for both fully-observed and missing data cases.

Recommendation on a Budget: Column Space Recovery from Partially Observed Entries with Random or Active Sampling

Wed, 03 Jun 2020 00:00:00 +0000

We analyze alternating minimization for column space recovery of a partially observed, approximately low rank matrix with a growing number of columns and a fixed budget of observations per column. We prove that if the budget is greater than the rank of the matrix, column space recovery succeeds – as the number of columns grows, the estimate from alternating minimization converges to the true column space with probability tending to one. From our proof techniques, we naturally formulate an active sampling strategy for choosing entries of a column that is theoretically and empirically (on synthetic and real data) better than the commonly studied uniformly random sampling strategy.

Fair Decisions Despite Imperfect Predictions

Wed, 03 Jun 2020 00:00:00 +0000

Consequential decisions are increasingly informed by sophisticated data-driven predictive models. However, consistently learning accurate predictive models requires access to ground truth labels. Unfortunately, in practice, labels may only exist conditional on certain decisions—if a loan is denied, there is not even an option for the individual to pay back the loan. In this paper, we show that, in this selective labels setting, learning to predict is suboptimal in terms of both fairness and utility. To avoid this undesirable behavior, we propose to directly learn stochastic decision policies that maximize utility under fairness constraints. In the context of fair machine learning, our results suggest the need for a paradigm shift from "learning to predict" to "learning to decide". Experiments on synthetic and real-world data illustrate the favorable properties of learning to decide, in terms of both utility and fairness.

Variational Autoencoders and Nonlinear ICA: A Unifying Framework

Wed, 03 Jun 2020 00:00:00 +0000

The framework of variational autoencoders allows us to efficiently learn deep latent-variable models, such that the model’s marginal distribution over observed variables fits the data. Often, we’re interested in going a step further, and want to approximate the true joint distribution over observed and latent variables, including the true prior and posterior distributions over latent variables. This is known to be generally impossible due to unidentifiability of the model. We address this issue by showing that for a broad family of deep latent-variable models, identification of the true joint distribution over observed and latent variables is actually possible up to very simple transformations, thus achieving a principled and powerful form of disentanglement. Our result requires a factorized prior distribution over the latent variables that is conditioned on an additionally observed variable, such as a class label or almost any other observation. We build on recent developments in nonlinear ICA, which we extend to the case with noisy, undercomplete or discrete observations, integrated in a maximum likelihood framework. The result also trivially contains identifiable flow-based generative models as a special case.

Nonmyopic Gaussian Process Optimization with Macro-Actions

Wed, 03 Jun 2020 00:00:00 +0000

This paper presents a multi-staged approach to nonmyopic adaptive Gaussian process optimization (GPO) for Bayesian optimization (BO) of unknown, highly complex objective functions that, in contrast to existing nonmyopic adaptive BO algorithms, exploits the notion of macro-actions for scaling up to a further lookahead to match up to a larger available budget. To achieve this, we generalize GP upper confidence bound to a new acquisition function defined w.r.t. a nonmyopic adaptive macro-action policy, which is intractable to be optimized exactly due to an uncountable set of candidate outputs. The contribution of our work here is thus to derive a nonmyopic adaptive epsilon-Bayes-optimal macro-action GPO (epsilon-Macro-GPO) policy. To perform nonmyopic adaptive BO in real time, we then propose an asymptotically optimal anytime variant of our epsilon-Macro-GPO policy with a performance guarantee. We empirically evaluate the performance of our epsilon-Macro-GPO policy and its anytime variant in BO with synthetic and real-world datasets.

Elimination of All Bad Local Minima in Deep Learning

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. At every local minimum of any deep neural network with these added neurons, the set of parameters of the original neural network (without added neurons) is guaranteed to be a global minimum of the original neural network. The effects of the added neurons are proven to automatically vanish at every local minimum. Moreover, we provide a novel theoretical characterization of a failure mode of eliminating suboptimal local minima via an additional theorem and several examples. This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) necessary condition of local minima, which provides new insight into the elimination of local minima and is applicable to analyze various models and transformations of objective functions beyond the elimination of local minima.

Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization

Wed, 03 Jun 2020 00:00:00 +0000

We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and deep learning problems.

The True Sample Complexity of Identifying Good Arms

Wed, 03 Jun 2020 00:00:00 +0000

We consider two multi-armed bandit problems with $n$ arms: \emph{(i)} given an $\epsilon > 0$, identify an arm with mean that is within $\epsilon$ of the largest mean and \emph{(ii)} given a threshold $\mu_0$ and integer $k$, identify $k$ arms with means larger than $\mu_0$. Existing lower bounds and algorithms for the PAC framework suggest that both of these problems require $\Omega(n)$ samples. However, we argue that the PAC framework not only conflicts with how these algorithms are used in practice, but also that these results disagree with intuition that says \emph{(i)} requires only $\Theta(\frac{n}{m})$ samples where $m = |\{ i : \mu_i > \max_{j \in [n]} \mu_j - \epsilon\}|$ and \emph{(ii)} requires $\Theta(\frac{n}{m}k)$ samples where $m = |\{ i : \mu_i > \mu_0 \}|$. We provide definitions that formalize these intuitions, obtain lower bounds that match the above sample complexities, and develop explicit, practical algorithms that achieve nearly matching upper bounds.

Model-Agnostic Counterfactual Explanations for Consequential Decisions

Wed, 03 Jun 2020 00:00:00 +0000

Predictive models are being increasingly used to support consequential decision making at the individual level in contexts such as pretrial bail and loan approval. As a result, there is increasing social and legal pressure to provide explanations that help the affected individuals not only to understand why a prediction was output, but also how to act to obtain a desired outcome. To this end, several works have proposed optimization-based methods to generate nearest counterfactual explanations. However, these methods are often restricted to a particular subset of models (e.g., decision trees or linear models) and differentiable distance functions. In contrast, we build on standard theory and tools from formal verification and propose a novel algorithm that solves a sequence of satisfiability problems, where both the distance function (objective) and predictive model (constraints) are represented as logic formulae. As shown by our experiments on real-world data, our algorithm is: i) model-agnostic ({non-}linear, {non-}differentiable, {non-}convex); ii) data-type-agnostic (heterogeneous features); iii) distance-agnostic (l0, l1, l8, and combinations thereof); iv) able to generate plausible and diverse counterfactuals for any sample (i.e., 100% coverage); and v) at provably optimal distances.

Optimal Deterministic Coresets for Ridge Regression

Wed, 03 Jun 2020 00:00:00 +0000

We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X’ \in R^{d \times d’}$ for which $$\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT,$$ where ${OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2.$ In the special case of $\lambda = 0$, this is ordinary multi-response linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and re-weight a small subset of rows of $A$ and corresponding labels of $B$, denoted by $SA$ and $SB$, so that if $X’$ is the minimizer to $\min_{X’} \|SAX’-SB\|_F^2 + \lambda \|X’\|_F^2$, then $\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT$. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select $O({sd}_{\lambda}/\epsilon)$ rows of $A$ and $B$ to achieve this property, and prove a matching lower bound, showing that it is necessary to select $\Omega({sd}_{\lambda}/\epsilon)$ rows no matter what the weights are, for any $1 < 1/\epsilon \leq sd_{\lambda}$. Here ${sd}_{\lambda}$ is the statistical dimension of the input, and we assume $d’ = O({sd}_{\lambda}) \leq d$. In the case of ordinary regression, this gives a deterministic algorithm achieving $O(d/\epsilon)$ rows and a matching lower bound for any $1 \leq 1/\epsilon \leq d$; for $1/\epsilon > d$ we show $\Theta(d^2)$ rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with $O({sd}_{\lambda}/\epsilon)$ words of communication per server, in the important case when the rows of $A$ and $B$ have a constant number of non-zero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required $\Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2))$ communication.

Graph Coarsening with Preserved Spectral Properties

Wed, 03 Jun 2020 00:00:00 +0000

In graph coarsening, one aims to produce a coarse graph of reduced size while preserving important graph properties. However, as there is no consensus on which specific graph properties should be preserved by coarse graphs, measuring the differences between original and coarse graphs remains a key challenge. This work relies on spectral graph theory to justify a distance function constructed to measure the similarity between original and coarse graphs. We show that the proposed spectral distance captures the structural differences in the graph coarsening process. We also propose graph coarsening algorithms that aim to minimize the spectral distance. Experiments show that the proposed algorithms can outperform previous graph coarsening methods in graph classification and stochastic block recovery tasks.

Identifying and Correcting Label Bias in Machine Learning

Wed, 03 Jun 2020 00:00:00 +0000

Datasets often contain biases which unfairly disadvantage certain groups, and classifiers trained on such datasets can inherit these biases. In this paper, we provide a mathematical formulation of how this bias can arise. We do so by assuming the existence of underlying, unknown, and unbiased labels which are overwritten by an agent who intends to provide accurate labels but may have biases against certain groups. Despite the fact that we only observe the biased labels, we are able to show that the bias may nevertheless be corrected by re-weighting the data points without changing the labels. We show, with theoretical guarantees, that training on the re-weighted dataset corresponds to training on the unobserved but unbiased labels, thus leading to an unbiased machine learning classifier. Our procedure is fast and robust and can be used with virtually any learning algorithm. We evaluate on a number of standard machine learning fairness datasets and a variety of fairness notions, finding that our method outperforms standard approaches in achieving fair classification.

Inference of Dynamic Graph Changes for Functional Connectome

Wed, 03 Jun 2020 00:00:00 +0000

Dynamic functional connectivity is an effective measure for the brain’s responses to continuous stimuli. We propose an inferential method to detect the dynamic changes of brain networks based on time-varying graphical models. Whereas most existing methods focus on testing the existence of change points, the dynamics in the brain network offer more signals in many neuroscience studies. We propose a novel method to conduct hypothesis testing on changes in dynamic brain networks. We introduce a bootstrap statistic to approximate the supreme of the high-dimensional empirical processes over dynamically changing edges. Our simulations show that this framework can capture the change points with changed connectivity. Finally, we apply our method to a brain imaging dataset under a natural audio-video stimulus and illustrate that we are able to detect temporal changes in brain networks. The functions of the identified regions are consistent with specific emotional annotations, which are closely associated with changes inferred by our method.

Feature relevance quantification in explainable AI: A causal problem

Wed, 03 Jun 2020 00:00:00 +0000

We discuss promising recent contributions on quantifying feature relevance using Shapley values, where we observed some confusion on which probability distribution is the right one for dropped features. We argue that the confusion is based on not carefully distinguishing between observational and interventional conditional probabilities and try a clarification based on Pearl’s seminal work on causality. We conclude that unconditional rather than conditional expectations provide the right notion of dropping features. This contradicts the view of the authors of the software package SHAP. In that work, unconditional expectations (which we argue to be conceptually right) are only used as approximation for the conditional ones, which encouraged others to ’improve’ SHAP in a way that we believe to be flawed.

Bandit optimisation of functions in the Matérn kernel RKHS

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of optimising functions in the reproducing kernel Hilbert space (RKHS) of a Matérn kernel with smoothness parameter $u$ over the domain $[0,1]^d$ under noisy bandit feedback. Our contribution, the $\pi$-GP-UCB algorithm, is the first practical approach with guaranteed sublinear regret for all $u>1$ and $d \geq 1$. Empirical validation suggests better performance and drastically improved computational scalablity compared with its predecessor, Improved GP-UCB.

Spatio-temporal alignments: Optimal transport through space and time

Wed, 03 Jun 2020 00:00:00 +0000

Comparing data defined over space and time is notoriously hard. It involves quantifying both spatial and temporal variability while taking into account the chronological structure of the data. Dynamic Time Warping (DTW) computes a minimal cost alignment between time series that preserves the chronological order but is inherently blind to spatio-temporal shifts. In this paper, we propose Spatio-Temporal Alignments (STA), a new differentiable formulation of DTW that captures spatial and temporal variability. Spatial differences between time samples are captured using regularized Optimal transport. While temporal alignment cost exploits a smooth variant of DTW called soft-DTW. We show how smoothing DTW leads to alignment costs that increase quadratically with time shifts. The costs are expressed using an unbalanced Wasserstein distance to cope with observations that are not probabilities. Experiments on handwritten letters and brain imaging data confirm our theoretical findings and illustrate the effectiveness of STA as a dissimilarity for spatio-temporal data.

Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which sample size is gradually increasing to quickly obtain a solution whose empirical loss is under satisfactory statistical accuracy. Our proposed method is multistage in which the solution of a stage serves as a warm start for the next stage which contains more samples (including the samples in the previous stage). The proposed multistage algorithm reduces the number of passes over data to achieve the statistical accuracy of the full training set. Moreover, our algorithm in nature is easy to be distributed and shares the strong scaling property indicating that acceleration is always expected by using more computing nodes. Various iteration complexity results regarding descent direction computation, communication efficiency and stopping criteria are analyzed under convex setting. Our numerical results illustrate that the proposed method outperforms other comparable methods for solving learning problems including neural networks.

Flexible distribution-free conditional predictive bands using density estimators

Wed, 03 Jun 2020 00:00:00 +0000

Conformal methods create prediction bands that control average coverage assuming solely i.i.d. data. Besides average coverage, one might also desire to control conditional coverage, that is, coverage for every new testing point. However, without strong assumptions, conditional coverage is unachievable. Given this limitation, the literature has focused on methods with asymptotical conditional coverage. In order to obtain this property, these methods require strong conditions on the dependence between the target variable and the features. We introduce two conformal methods based on conditional density estimators that do not depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split and CD-split. While Dist-split asymptotically obtains optimal intervals, which are easier to interpret than general regions, CD-split obtains optimal size regions, which are smaller than intervals. CD-split also obtains local coverage by creating prediction bands locally on a partition of the features space. This partition is data-driven and scales to high-dimensional settings. In a wide variety of simulated scenarios, our methods have a better control of conditional coverage and have smaller length than previously proposed methods.

An Optimal Algorithm for Bandit Convex Optimization with Strongly-Convex and Smooth Loss

Wed, 03 Jun 2020 00:00:00 +0000

We consider non-stochastic bandit convex optimization with strongly-convex and smooth loss functions. For this problem, Hazan and Levy have proposed an algorithm with a regret bound of $\tilde{O}(d^{3/2} \sqrt{T})$ given access to an $O(d)$-self-concordant barrier over the feasible region, where $d$ and $T$ stand for the dimensionality of the feasible region and the number of rounds, respectively. However, there are no known efficient ways for constructing self-concordant barriers for general convex sets, and a $\tilde{O}(\sqrt{d})$ gap has remained between the upper and lower bounds, as the known regret lower bound is $\Omega(d\sqrt{T})$. Our study resolves these two issues by introducing an algorithm that achieves an optimal regret bound of $\tilde{O}(d \sqrt{T})$ under a mild assumption, without self-concordant barriers. More precisely, the algorithm requires only a membership oracle for the feasible region, and it achieves an optimal regret bound of $\tilde{O}(d\sqrt{T})$ under the assumption that the optimal solution is an interior of the feasible region. Even without this assumption, our algorithm achieves $\tilde{O}(d^{3/2}\sqrt{T})$-regret.

Stopping criterion for active learning based on deterministic generalization bounds

Wed, 03 Jun 2020 00:00:00 +0000

Active learning is a framework in which the learning machine can select the samples to be used for training. This technique is promising, particularly when the cost of data acquisition and labeling is high. In active learning, determining the timing at which learning should be stopped is a critical issue. In this study, we propose a criterion for automatically stopping active learning. The proposed stopping criterion is based on the difference in the expected generalization errors and hypothesis testing. We derive a novel upper bound for the difference in expected generalization errors before and after obtaining a new training datum based on PAC-Bayesian theory. Unlike ordinary PAC-Bayesian bounds, though, the proposed bound is deterministic; hence, there is no uncontrollable trade-off between the confidence and tightness of the inequality. We combine the upper bound with a statistical test to derive a stopping criterion for active learning. We demonstrate the effectiveness of the proposed method via experiments with both artificial and real datasets.

Optimal sampling in unbiased active learning

Wed, 03 Jun 2020 00:00:00 +0000

A common belief in unbiased active learning is that, in order to capture the most informative instances, the sampling probabilities should be proportional to the uncertainty of the class labels. We argue that this produces suboptimal predictions and present sampling schemes for unbiased pool-based active learning that minimise the actual prediction error, and demonstrate a better predictive performance than competing methods on a number of benchmark datasets. In contrast, both probabilistic and deterministic uncertainty sampling performed worse than simple random sampling on some of the datasets.

Fast Noise Removal for k-Means Clustering

Wed, 03 Jun 2020 00:00:00 +0000

This paper considers k-means clustering in the presence of noise. It is known that k-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called k-means clustering with outliers. The goal of k-means clustering with outliers is to discard up to a specified number z of points as noise/outliers and then find a k-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any k-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from k-means with outliers to k-means without outliers. Secondly, we show how to construct a coreset of size O(k log n). When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.

A Theoretical and Practical Framework for Regression and Classification from Truncated Samples

Wed, 03 Jun 2020 00:00:00 +0000

Machine learning and statistics are invaluable for extracting insights from data. A key assumption of most methods, however, is that they have access to independent samples from the distribution of relevant data. As such, these methods often perform poorly in the face of {\em biased data} which breaks this assumption. In this work, we consider the classical challenge of bias due to truncation, wherein samples falling outside of an “observation window” cannot be observed. We present a general framework for regression and classification from samples that are truncated according to the value of the dependent variable. The framework argues that stochastic gradient descent (SGD) can be efficiently executed on the population log-likelihood of the truncated sample. Our framework is broadly applicable, and we provide end-to-end guarantees for the well-studied problems of truncated logistic and probit regression, where we argue that the true model parameters can be identified computationally and statistically efficiently from truncated data, extending recent work on truncated linear regression. We also provide experiments to illustrate the practicality of our framework on synthetic and real data.

Robust Optimisation Monte Carlo

Wed, 03 Jun 2020 00:00:00 +0000

This paper is on Bayesian inference for parametric statistical models that are defined by a stochastic simulator which specifies how data is generated. Exact sampling is then possible but evaluating the likelihood function is typically prohibitively expensive. Approximate Bayesian Computation (ABC) is a framework to perform approximate inference in such situations. While basic ABC algorithms are widely applicable, they are notoriously slow and much research has focused on increasing their efficiency. Optimisation Monte Carlo (OMC) has recently been proposed as an efficient and embarrassingly parallel method that leverages optimisation to accelerate the inference. In this paper, we demonstrate an important previously unrecognised failure mode of OMC: It generates strongly overconfident approximations by collapsing regions of similar or near-constant likelihood into a single point. We propose an efficient, robust generalisation of OMC that corrects this. It makes fewer assumptions, retains the main benefits of OMC, and can be performed either as post-processing to OMC or as a stand-alone computation. We demonstrate the effectiveness of the proposed Robust OMC on toy examples and tasks in inverse-graphics where we perform Bayesian inference with a complex image renderer.

Local Differential Privacy for Sampling

Wed, 03 Jun 2020 00:00:00 +0000

Differential privacy (DP) is a leading privacy protection focused by design on individual privacy. In the local model of DP, strong privacy is achieved by privatizing each user’s individual data before sending it to an untrusted aggregator for analysis. While in recent years local DP has been adopted for practical deployments, most research in this area focuses on problems where each individual holds a single data record. In many problems of practical interest this assumption is unrealistic since nowadays most user-owned devices collect large quantities of data (e.g. pictures, text messages, time series). We propose to model this scenario by assuming each individual holds a distribution over the space of data records, and develop novel local DP methods to sample privately from these distributions. Our main contribution is a boosting-based density estimation algorithm for learning samplers that generate synthetic data while protecting the underlying distribution of each user with local DP. We give approximation guarantees quantifying how well these samplers approximate the true distribution. Experimental results against DP kernel density estimation and DP GANs displays the quality of our results.

Uncertainty Quantification for Deep Context-Aware Mobile Activity Recognition and Unknown Context Discovery

Wed, 03 Jun 2020 00:00:00 +0000

Activity recognition in wearable computing faces two key challenges: i) activity characteristics may be context-dependent and change under different contexts or situations; ii) unknown contexts and activities may occur from time to time, requiring flexibility and adaptability of the algorithm. We develop a context-aware mixture of deep models termed the $\alpha$-$\beta$ network coupled with uncertainty quantification (UQ) based upon maximum entropy to enhance human activity recognition performance. We improve accuracy and F score by 10% by identifying high-level contexts in a data-driven way to guide model development. In order to ensure training stability, we have used a clustering-based pre-training in both public and in-house datasets, demonstrating improved accuracy through unknown context discovery.

Fast Markov chain Monte Carlo algorithms via Lie groups

Wed, 03 Jun 2020 00:00:00 +0000

From basic considerations of the Lie group that preserves a target probability measure, we derive the Barker, Metropolis, and ensemble Markov chain Monte Carlo (MCMC) algorithms, as well as variants of waste-recycling Metropolis-Hastings and an altogether new MCMC algorithm. We illustrate these constructions with explicit numerical computations, and we empirically demonstrate on a spin glass that the new algorithm converges more quickly than its siblings.

Sharp Thresholds of the Information Cascade Fragility Under a Mismatched Model

Wed, 03 Jun 2020 00:00:00 +0000

We analyze a sequential decision making model in which decision makers (or, players) take their decisions based on their own private information as well as the actions of previous decision makers. Such decision making processes often lead to what is known as the \emph{information cascade} or \emph{herding} phenomenon. Specifically, a cascade develops when it seems rational for some players to abandon their own private information and imitate the actions of earlier players. The risk, however, is that if the initial decisions were wrong, then the whole cascade will be wrong. Nonetheless, information cascade are known to be fragile: there exists a sequence of \emph{revealing} probabilities $\{p_{\ell}\}_{\ell\geq1}$, such that if with probability $p_{\ell}$ player $\ell$ ignores the decisions of previous players, and rely on his private information only, then wrong cascades can be avoided. Previous related papers which study the fragility of information cascades always assume that the revealing probabilities are known to all players perfectly, which might be unrealistic in practice. Accordingly, in this paper we study a mismatch model where players believe that the revealing probabilities are $\{q_\ell\}_{\ell\in\mathbb{N}}$ when they truly are $\{p_\ell\}_{\ell\in\mathbb{N}}$, and study the effect of this mismatch on information cascades. We consider both adversarial and probabilistic sequential decision making models, and derive closed-form expressions for the optimal learning rates at which the error probability associated with a certain decision maker goes to zero. We prove several novel phase transitions in the behaviour of the asymptotic learning rate.

Validated Variational Inference via Practical Posterior Error Bounds

Wed, 03 Jun 2020 00:00:00 +0000

Variational inference has become an increasingly attractive fast alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, a major obstacle to the widespread use of variational methods is the lack of post-hoc accuracy measures that are both theoretically justified and computationally efficient. In this paper, we provide rigorous bounds on the error of posterior mean and uncertainty estimates that arise from full-distribution approximations, as in variational inference. Our bounds are widely applicable, as they require only that the approximating and exact posteriors have polynomial moments. Our bounds are also computationally efficient for variational inference because they require only standard values from variational objectives, straightforward analytic calculations, and simple Monte Carlo estimates. We show that our analysis naturally leads to a new and improved workflow for validated variational inference. Finally, we demonstrate the utility of our proposed workflow and error bounds on a robust regression problem and on a real-data example with a widely used multilevel hierarchical model.

Stochastic Neural Network with Kronecker Flow

Wed, 03 Jun 2020 00:00:00 +0000

Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and betterthan the baselines.

Linear Dynamics: Clustering without identification

Wed, 03 Jun 2020 00:00:00 +0000

Linear dynamical systems are a fundamental and powerful parametric model class. However, identifying the parameters of a linear dynamical system is a venerable task, permitting provably efficient solutions only in special cases. This work shows that the eigenspectrum of unknown linear dynamics can be identified without full system identification. We analyze a computationally efficient and provably convergent algorithm to estimate the eigenvalues of the state-transition matrix in a linear dynamical system.When applied to time series clustering, our algorithm can efficiently cluster multi-dimensional time series with temporal offsets and varying lengths, under the assumption that the time series are generated from linear dynamical systems. Evaluating our algorithm on both synthetic data and real electrocardiogram (ECG) signals, we see improvements in clustering quality over existing baselines.

Obfuscation via Information Density Estimation

Wed, 03 Jun 2020 00:00:00 +0000

Identifying features that leak information about sensitive attributes is a key challenge in the design of information obfuscation mechanisms. In this paper, we propose a framework to identify information-leaking features via information density estimation. Here, features whose information densities exceed a pre-defined threshold are deemed information-leaking features. Once these features are identified, we sequentially pass them through a targeted obfuscation mechanism with a provable leakage guarantee in terms of $\mathsf{E}_\gamma$-divergence. The core of this mechanism relies on a data-driven estimate of the trimmed information density for which we propose a novel estimator, named the \textit{trimmed information density estimator} (TIDE). We then use TIDE to implement our mechanism on three real-world datasets. Our approach can be used as a data-driven pipeline for designing obfuscation mechanisms targeting specific features.

Sequential no-Substitution k-Median-Clustering

Wed, 03 Jun 2020 00:00:00 +0000

We study the sample-based k-median clustering objective under a sequential setting without substitutions. In this setting, an i.i.d. sequence of examples is observed. An example can be selected as a center only immediately after it is observed, and it cannot be substituted later. The goal is to select a set of centers with a good k-median cost on the distribution which generated the sequence. We provide an efficient algorithm for this setting, and show that its multiplicative approximation factor is twice the approximation factor of an efficient offline algorithm. In addition, we show that if efficiency requirements are removed, there is an algorithm that can obtain the same approximation factor as the best offline algorithm. We demonstrate in experiments the performance of the efficient algorithm on real data sets. Our code is available at https://github.com/tomhess/No_Substitution_K_Median.

Safe-Bayesian Generalized Linear Regression

Wed, 03 Jun 2020 00:00:00 +0000

We study generalized Bayesian inference under misspecification, i.e. when the model is ‘wrong but useful’. Generalized Bayes equips the likelihood with a learning rate $\eta$. We show that for generalized linear models (GLMs), $\eta$-generalized Bayes concentrates around the best approximation of the truth within the model for specific $\eta eq 1$, even under severely misspecified noise, as long as the tails of the true distribution are exponential. We derive MCMC samplers for generalized Bayesian lasso and logistic regression and give examples of both simulated and real-world data in which generalized Bayes substantially outperforms standard Bayes.

Learning Hierarchical Interactions at Scale: A Convex Optimization Approach

Wed, 03 Jun 2020 00:00:00 +0000

In many learning settings, it is beneficial to augment the main features with pairwise interactions. Such interaction models can be often enhanced by performing variable selection under the so-called strong hierarchy constraint: an interaction is non-zero only if its associated main features are non-zero. Existing convex optimization-based algorithms face difficulties in handling problems where the number of main features p 10^3 (with total number of features p^2). In this paper, we study a convex relaxation which enforces strong hierarchy and develop a highly scalable algorithm based on proximal gradient descent. We introduce novel screening rules that allow for solving the complicated proximal problem in parallel. In addition, we introduce a specialized active-set strategy with gradient screening for avoiding costly gradient computations. The framework can handle problems having dense design matrices, with p = 50,000 ( 10^9 interactions)—instances that are much larger than the state of the art. Experiments on real and synthetic data suggest that our toolkit hierScale outperforms the state of the art in terms of prediction and variable selection and can achieve over a 4900x speed-up.

On Random Subsampling of Gaussian Process Regression: A Graphon-Based Analysis

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. Although subsampling discards a large part of training data, we show provable guarantees on the accuracy of the predictive mean/variance and its generalization ability.For analysis, we consider embedding kernel matrices into graphons, which encapsulate the difference of the sample size and enables us to evaluate the approximation and generalization errors in a unified manner. The experimental results show that the subsampling approximation achieves a better trade-off regarding accuracy and runtime than the ystrom and random Fourier expansion methods.

Dependent randomized rounding for clustering and partition systems with knapsack constraints

Wed, 03 Jun 2020 00:00:00 +0000

Clustering problems are fundamental to unsupervised learning. There is an increased emphasis on \emph{fairness} in machine learning and AI; one representative notion of fairness is that no single demographic group should be over-represented among the cluster-centers. This, and much more general clustering problems, can be formulated with “knapsack" and “partition" constraints. We develop new randomized algorithms targeting such problems, and study two in particular: multi-knapsack median and multi-knapsack center. Our rounding algorithms give new approximation and pseudo-approximation algorithms for these problems. One key technical tool we develop and use, which may be of independent interest, is a new tail bound analogous to Feige (2006) for sums of random variables with unbounded variances. Such bounds are very useful in inferring properties of large networks using few samples.

Adaptive Exploration in Linear Contextual Bandit

Wed, 03 Jun 2020 00:00:00 +0000

Contextual bandits serve as a fundamental model for many sequential decision making tasks. The most popular theoretically justified approaches are based on the optimism principle and Thompson sampling. While these algorithms can be practical, they are known to be suboptimal asymptotically. On the other hand, existing asymptotically optimal algorithms for this problem do not exploit the linear structure in an optimal way and suffer from lower-order terms that dominate the regret in all practically interesting regimes. We start to bridge the gap by designing an algorithm that is asymptotically optimal and has good finite-time empirical performance. At the same time, we make connections to the recent literature on when exploration-free methods are effective. Indeed, if the distribution of contexts is well behaved, then our algorithm acts mostly greedily and enjoys sub-logarithmic regret. Furthermore, our approach is adaptive in the sense that it automatically detects the nice case. Numerical results demonstrate significant regret reductions by our method relative to several baselines.

Sparse and Low-rank Tensor Estimation via Cubic Sketchings

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.

Stein Variational Inference for Discrete Distributions

Wed, 03 Jun 2020 00:00:00 +0000

Gradient-based approximate inference methods, such as Stein variational gradient descent (SVGD) \cite{liu2016stein}, provide simple and general-purpose inference engines for differentiable continuous distributions. However, existing forms of SVGD can not be directly applied to discrete distributions. In this work, we fill this gap by proposing a simple general-purpose framework that transforms discrete distributions to equivalent piecewise continuous distribution, on which we apply gradient-free Stein variational gradient descent to perform efficient approximate inference. Our empirical results show that our method outperforms traditional algorithms such as Gibbs sampling and discontinuous Hamiltonian Monte Carlo on various challenging benchmarks of discrete graphical models. We demonstrate that our method provides a promising tool for learning ensembles of binarized neural network (BNN), outperforming other widely used ensemble methods on learning binarized AlexNet on CIFAR-10. In addition, such transform can be straightforwardly employed in gradient-free kernelized Stein discrepancy to perform goodness-of-fit (GOF) test on discrete distributions. Our proposed method outperforms existing GOF test methods for intractable discrete distributions.

MAP Inference for Customized Determinantal Point Processes via Maximum Inner Product Search

Wed, 03 Jun 2020 00:00:00 +0000

Determinantal point processes (DPPs) are a good fit for modeling diversity in many machine learning applications. For instance, in recommender systems, one might have a basic DPP defined by item features, and a customized version of this DPP for each user with features re-weighted according to user preferences. While such models perform well, they are typically applied only to relatively small datasets, because existing maximum a posteriori (MAP) approximation algorithms are expensive. In this work, we propose a new MAP algorithm: we show that, by performing a one-time preprocessing step on a basic DPP, it is possible to run an approximate version of the standard greedy MAP approximation algorithm on any customized version of the DPP in time sublinear in the number of items. Our key observation is that the core computation can be written as a maximum inner product search (MIPS), which allows us to accelerate inference via approximate MIPS structures, e.g., trees or hash tables. We provide a theoretical analysis of the algorithm’s approximation quality, as well as empirical results on real-world datasets demonstrating that it is often orders of magnitude faster while sacrificing little accuracy.

Scalable Feature Selection for (Multitask) Gradient Boosted Trees

Wed, 03 Jun 2020 00:00:00 +0000

Gradient Boosted Decision Trees (GBDTs) are widely used for building ranking and relevance models in search and recommendation. Considerations such as latency and interpretability dictate the use of as few features as possible to train these models. Feature selection in GBDT models typically involves heuristically ranking the features by importance and selecting the top few, or by per- forming a full backward feature elimination routine. On-the-fly feature selection methods proposed previously scale suboptimally with the number of features, which can be daunt- ing in high dimensional settings. We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions, and enjoys favorable theoretical performance and computational guarantees. We show via ex- tensive experiments on both public and proprietary datasets that the proposed method offers significant speedups in training time, while being as competitive as existing GBDT methods in terms of model performance metrics. We also extend the method to the multitask setting, allowing the practitioner to select common features across tasks, as well as selecting task-specific features.

A Primal-Dual Solver for Large-Scale Tracking-by-Assignment

Wed, 03 Jun 2020 00:00:00 +0000

We propose a fast approximate solver for the combinatorial problem known as tracking-by-assignment, which we apply to cell tracking. The latter plays a key role in discovery in many life sciences, especially in cell and developmental biology. So far, in the most general setting this problem was addressed by off-the-shelf solvers like Gurobi, whose run time and memory requirements rapidly grow with the size of the input. In contrast, for our method this growth is nearly linear. Our contribution consists of a new (1) decomposable compact representation of the problem; (2) dual block-coordinate ascent method for optimizing the decomposition-based dual; and (3) primal heuristics that reconstructs a feasible integer solution based on the dual information. Compared to solving the problem with Gurobi, we observe an up to 60 times speed-up, while reducing the memory footprint significantly. We demonstrate the efficacy of our method on real-world tracking problems.

Statistical guarantees for local graph clustering

Wed, 03 Jun 2020 00:00:00 +0000

Local graph clustering methods aim to find small clusters in very large graphs. These methods take as input a graph and a seed node, and they return as output a good cluster in a running time that depends on the size of the output cluster but that is independent of the size of the input graph. In this paper, we adopt a statistical perspective on local graph clustering, and we analyze the performance of the l1-regularized PageRank method for the recovery of a single target cluster, given a seed node inside the cluster. Assuming the target cluster has been generated by a random model, we present two results. In the first, we show that the optimal support of l1-regularized PageRank recovers the full target cluster, with bounded false positives. In the second, we show that if the seed node is connected solely to the target cluster then the optimal support of l1-regularized PageRank recovers exactly the target cluster. We also show empirically that l1-regularized PageRank has a state-of-the-art performance on many real graphs, demonstrating the superiority of the method.

Fast Algorithms for Computational Optimal Transport and Wasserstein Barycenter

Wed, 03 Jun 2020 00:00:00 +0000

We provide theoretical complexity analysis for new algorithms to compute the optimal transport (OT) distance between two discrete probability distributions, and demonstrate their favorable practical performance compared to state-of-art primal-dual algorithms. First, we introduce the \emph{accelerated primal-dual randomized coordinate descent} (APDRCD) algorithm for computing the OT distance. We show that its complexity is $\bigOtil(\frac{n^{5/2}}{\varepsilon})$, where $n$ stands for the number of atoms of these probability measures and $\varepsilon > 0$ is the desired accuracy. This complexity bound matches the best known complexities of primal-dual algorithms for the OT problems, including the adaptive primal-dual accelerated gradient descent (APDAGD) and the adaptive primal-dual accelerated mirror descent (APDAMD) algorithms. Then, we demonstrate the improved practical efficiency of the APDRCD algorithm through extensive comparative experimental studies. We also propose a greedy version of APDRCD, which we refer to as \emph{accelerated primal-dual greedy coordinate descent} (APDGCD), to further enhance practical performance. Finally, we generalize the APDRCD and APDGCD algorithms to distributed algorithms for computing the Wasserstein barycenter for multiple probability distributions.

Differentiable Causal Backdoor Discovery

Wed, 03 Jun 2020 00:00:00 +0000

Discovering the causal effect of a decision is critical to nearly all forms of decision-making. In particular, it is a key quantity in drug development, in crafting government policy, and when implementing a real-world machine learning system. Given only observational data, confounders often obscure the true causal effect. Luckily, in some cases, it is possible to recover the causal effect by using certain observed variables to adjust for the effects of confounders. However, without access to the true causal model, finding this adjustment requires brute-force search. In this work, we present an algorithm that exploits auxiliary variables, similar to instruments, in order to find an appropriate adjustment by a gradient-based optimization method. We demonstrate that it outperforms practical alternatives in estimating the true causal effect, without knowledge of the full causal graph.

Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training

Wed, 03 Jun 2020 00:00:00 +0000

Despite the recent successes of deep neural networks, the corresponding training problem remains highly non-convex and difficult to optimize. Classes of models have been proposed that introduce greater structure to the objective function at the cost of lifting the dimension of the problem. However, these lifted methods sometimes perform poorly compared to traditional neural networks. In this paper, we introduce a new class of lifted models, Fenchel lifted networks, that enjoy the same benefits as previous lifted models, without suffering a degradation in performance over classical networks. Our model represents activation functions as equivalent biconvex constraints and uses Lagrange Multipliers to arrive at a rigorous lower bound of the traditional neural network training problem. This model is efficiently trained using block-coordinate descent and is parallelizable across data points and/or layers. We compare our model against standard fully connected and convolutional networks and show that we are able to match or beat their performance.

Bayesian Reinforcement Learning via Deep, Sparse Sampling

Wed, 03 Jun 2020 00:00:00 +0000

We address the problem of Bayesian reinforcement learning using efficient model-based online planning. We propose an optimism-free Bayes-adaptive algorithm to induce deeper and sparser exploration with a theoretical bound on its performance relative to the Bayes optimal as well as lower computational complexity. The main novelty is the use of a candidate policy generator, to generate long-term options in the planning tree (over beliefs), which allows us to create much sparser and deeper trees. Experimental results on different environments show that in comparison to the state-of-the-art, our algorithm is both computationally more efficient, and obtains significantly higher reward over time in discrete environments.

On Thompson Sampling for Smoother-than-Lipschitz Bandits

Wed, 03 Jun 2020 00:00:00 +0000

Thompson Sampling is a well established approach to bandit and reinforcement learning problems. However its use in continuum armed bandit problems has received relatively little attention. We provide the first bounds on the regret of Thompson Sampling for continuum armed bandits under weak conditions on the function class containing the true function and sub-exponential observation noise. The eluder dimension is a recently proposed measure of the complexity of a function class, which has been demonstrated to be useful in bounding the Bayesian regret of Thompson Sampling for simpler bandit problems under sub-Gaussian observation noise. We derive a new bound on the eluder dimension for classes of functions with Lipschitz derivatives, and generalise previous analyses in multiple regards.

A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

Wed, 03 Jun 2020 00:00:00 +0000

In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and their combinations: variance reduction, importance sampling, mini-batch sampling, quantization, and coordinate sub-sampling. As a by-product, we obtain the first unified theory of SGD and randomized coordinate descent (RCD) methods, the first unified theory of variance reduced and non-variance-reduced SGD methods, and the first unified theory of quantized and non-quantized methods. A key to our approach is a parametric assumption on the iterates and stochastic gradients. In a single theorem we establish a linear convergence result under this assumption and strong-quasi convexity of the loss function. Whenever we recover an existing method as a special case, our theorem gives the best known complexity result. Our approach can be used to motivate the development of new useful methods, and offers pre-proved convergence guarantees. To illustrate the strength of our approach, we develop five new variants of SGD, and through numerical experiments demonstrate some of their properties.

Gaussian-Smoothed Optimal Transport: Metric Structure and Statistical Efficiency

Wed, 03 Jun 2020 00:00:00 +0000

Optimal transport (OT), and in particular the Wasserstein distance, has seen a surge of interest and applications in machine learning. However, empirical approximation under Wasserstein distances suffers from a severe curse of dimensionality, rendering them impractical in high dimensions. As a result, entropically regularized OT has become a popular workaround. However, while it enjoys fast algorithms and better statistical properties, it looses the metric structure that Wasserstein distances enjoy. This work proposes a novel Gaussian-smoothed OT (GOT) framework, that achieves the best of both worlds: preserving the 1-Wasserstein metric structure while alleviating the empirical approximation curse of dimensionality. Furthermore, as the Gaussian-smoothing parameter shrinks to zero, GOT $\Gamma$-converges towards classic OT (with convergence of optimizers), thus serving as a natural extension. An empirical study that validates the theoretical results is provided, promoting Gaussian-smoothed OT as a powerful alternative to entropic OT.

Learning Ising and Potts Models with Latent Variables

Wed, 03 Jun 2020 00:00:00 +0000

We study the problem of learning graphical models with latent variables. We give the {\em first} efficient algorithms for learning: 1) ferromagnetic Ising models with latent variables under {\em arbitrary} external fields, and 2) ferromagnetic Potts model with latent variables under unidirectional non-negative external field. Our algorithms have optimal dependence on the dimension but suffer from a sub-optimal dependence on the underlying sparsity of the graph.Our results rely on two structural properties of the underlying graphical models. These in turn allow us to design an influence function which can be maximized greedily to recover the structure of the underlying graphical model. These structural results may be of independent interest.

Constructing a provably adversarially-robust classifier from a high accuracy one

Wed, 03 Jun 2020 00:00:00 +0000

Modern machine learning models with very high accuracy have been shown to be vulnerable to small, adversarially chosen perturbations of the input. Given black-box access to a high-accuracy classifier f, we show how to construct a new classifier g that has high accuracy and is also robust to adversarial L2-bounded perturbations. Our algorithm builds upon the framework of randomized smoothing that has been recently shown to outperform all previous defenses against L2-bounded adversaries. Using techniques like random partitions and doubling dimension, we are able to bound the adversarial error of g in terms of the optimum error. In this paper we focus on our conceptual contribution, but we do present two examples to illustrate our framework. We will argue that, under some assumptions, our bounds are optimal for these cases.

Alternating Minimization Converges Super-Linearly for Mixed Linear Regression

Wed, 03 Jun 2020 00:00:00 +0000

We address the problem of solving mixed random linear equations. In this problem, we have unlabeled observations coming from multiple linear regressions, and each observation corresponds to exactly one of the regression models. The goal is to learn the linear regressors from the observations. Classically, Alternating Minimization (AM) (which may be thought as a variant of Expectation Maximization (EM)) is used to solve this problem. AM iteratively alternates between the estimation of labels and solving the regression problems with the estimated labels. Empirically, it is observed that, for a large variety of non-convex problems including mixed linear regression, AM converges at a much faster rate compared to gradient based algorithms. However, the existing theory suggests similar rate of convergence, failing to capture this empirical behavior. In this paper, we close this gap between theory and practice for the special case of a mixture of $2$ linear regressions. We show that, provided initialized properly, AM enjoys a \emph{super-linear} rate of convergence. To the best of our knowledge, this is the first work that theoretically establishes such rate for AM. Hence, if we want to recover the unknown regressors upto an error (in $\ell_2$ norm) of $\epsilon$, AM only takes $\mathcal{O}(\log \log (1/\epsilon))$ iterations.

Low-rank regularization and solution uniqueness in over-parameterized matrix sensing

Wed, 03 Jun 2020 00:00:00 +0000

We consider the question whether algorithmic choices in over-parameterized linear matrix factorization introduce implicit low-rank regularization.We focus on the noiseless matrix sensing scenario over low-rank positive semi-definite (PSD) matrices over the reals, with a sensing mechanism that satisfies restricted isometry properties.Surprisingly, it was recently argued that for recovery of PSD matrices, gradient descent over a squared, \textit{full-rank} factorized space introduces implicit low-rank regularization.Thus, a clever choice of the recovery algorithm avoids the need for explicit low-rank regularization. In this contribution, we prove that in fact, under certain conditions, the PSD constraint by itself is sufficient to lead to a unique low-rank matrix recovery, without explicit or implicit regularization.Therefore, under these conditions, the set of PSD matrices that are consistent with the observed data, is a singleton, regardless of the algorithm used. Our numerical study indicates that this result is general and extends to cases beyond the those covered by the proof.

Integrals over Gaussians under Linear Domain Constraints

Wed, 03 Jun 2020 00:00:00 +0000

Integrals of linearly constrained multivariate Gaussian densities are a frequent problem in machine learning and statistics, arising in tasks like generalized linear models and Bayesian optimization. Yet they are notoriously hard to compute, and to further complicate matters, the numerical values of such integrals may be very small. We present an efficient black-box algorithm that exploits geometry for the estimation of integrals over a small, truncated Gaussian volume, and to simulate therefrom. Our algorithm uses the Holmes-Diaconis-Ross (HDR) method combined with an analytic version of elliptical slice sampling (ESS). Adapted to the linear setting, ESS allows for rejection-free sampling, because intersections of ellipses and domain boundaries have closed-form solutions. The key idea of HDR is to decompose the integral into easier-to-compute conditional probabilities by using a sequence of nested domains. Remarkably, it allows for direct computation of the logarithm of the integral value and thus enables the computation of extremely small probability masses. We demonstrate the effectiveness of our tailored combination of HDR and ESS on high-dimensional integrals and on entropy search for Bayesian optimization.

Tight Analysis of Privacy and Utility Tradeoff in Approximate Differential Privacy

Wed, 03 Jun 2020 00:00:00 +0000

We characterize the minimum noise amplitude and power for noise-adding mechanisms in (epsilon, delta)-differential privacy for single real-valued query function. We derive new lower bounds using the duality of linear programming, and new upper bounds by analyzing a special class of (epsilon, delta)-differentially private mechanisms, the truncated Laplacian mechanisms. We show that the multiplicative gap of the lower bounds and upper bounds goes to zero in various high privacy regimes, proving the tightness of the lower and upper bounds. In particular, our results close the previous constant multiplicative gap in the discrete setting. Numeric experiments show the improvement of the truncated Laplacian mechanism over the optimal Gaussian mechanism in all privacy regimes.

A Rule for Gradient Estimator Selection, with an Application to Variational Inference

Wed, 03 Jun 2020 00:00:00 +0000

Stochastic gradient descent (SGD) is the workhorse of modern machine learning. Sometimes, there are many different potential gradient estimators that can be used. When so, choosing the one with the best tradeoff between cost and variance is important. This paper analyzes the convergence rates of SGD as a function of time, rather than iterations. This results in a simple rule to select the estimator that leads to the best optimization convergence guarantee. This choice is the same for different variants of SGD, and with different assumptions about the objective (e.g. convexity or smoothness). Inspired by this principle, we propose a technique to automatically select an estimator when a finite pool of estimators is given. Then, we extend to infinite pools of estimators, where each one is indexed by control variate weights. Empirically, automatically choosing an estimator performs comparably to the best estimator chosen with hindsight.

Explaining the Explainer: A First Theoretical Analysis of LIME

Wed, 03 Jun 2020 00:00:00 +0000

Machine learning is used more and more often for sensitive applications, sometimes replacing humans in critical decision-making processes. As such, interpretability of these algorithms is a pressing need. One popular algorithm to provide interpretability is LIME (Local Interpretable Model-Agnostic Explanation). In this paper, we provide the first theoretical analysis of LIME. We derive closed-form expressions for the coefficients of the interpretable model when the function to explain is linear. The good news is that these coefficients are proportional to the gradient of the function to explain: LIME indeed discovers meaningful features. However, our analysis also reveals that poor choices of parameters can lead LIME to miss important features.

Conservative Exploration in Reinforcement Learning

Wed, 03 Jun 2020 00:00:00 +0000

While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesired in real-world applications where a minimum requirement is that the executed policies are guaranteed to perform at least as well as an existing baseline. In this paper, we introduce the notion of conservative exploration for average reward and finite horizon problems. We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning. We derive regret bounds showing that being conservative does not hinder the learning ability of these algorithms.

Improved Regret Bounds for Projection-free Bandit Convex Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We revisit the challenge of designing online algorithms for the bandit convex optimization problem (BCO) which are also scalable to high dimensional problems. Hence, we consider algorithms that are \textit{projection-free}, i.e., based on the conditional gradient method whose only access to the feasible decision set, is through a linear optimization oracle (as opposed to other methods which require potentially much more computationally-expensive subprocedures, such as computing Euclidean projections). We present the first such algorithm that attains $O(T^{3/4})$ expected regret using only $O(T)$ overall calls to the linear optimization oracle, in expectation, where $T$ in the number of prediction rounds. This improves over the $O(T^{4/5})$ expected regret bound recently obtained by \cite{Karbasi19}, and actually matches the current best regret bound for projection-free online learning in the \textit{full information} setting.

Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models

Wed, 03 Jun 2020 00:00:00 +0000

We propose automated augmented conjugate inference, a new inference method for non-conjugate Gaussian processes (GP) models.Our method automatically constructs an auxiliary variable augmentation that renders the GP model conditionally conjugate. Building on the conjugate structure of the augmented model, we develop two inference methods. First, a fast and scalable stochastic variational inference method that uses efficient block coordinate ascent updates, which are computed in closed form. Second, an asymptotically correct Gibbs sampler that is useful for small datasets.Our experiments show that our method is up two orders of magnitude faster and more robust than existing state-of-the-art black-box methods.

Enriched mixtures of generalised Gaussian process experts

Wed, 03 Jun 2020 00:00:00 +0000

Mixtures of experts probabilistically divide the input space into regions, where the assumptions of each expert, or conditional model, need only hold locally. Combined with Gaussian process (GP) experts, this results in a powerful and highly flexible model. We focus on alternative mixtures of GP experts, which model the joint distribution of the inputs and targets explicitly. We highlight issues of this approach in multi-dimensional input spaces, namely, poor scalability and the need for an unnecessarily large number of experts, degrading the predictive performance and increasing uncertainty. We construct a novel model to address these issues through a nested partitioning scheme that automatically infers the number of components at both levels. Multiple response types are accommodated through a generalised GP framework, while multiple input types are included through a factorised exponential family structure. We show the effectiveness of our approach in estimating a parsimonious probabilistic description of both synthetic data of increasing dimension and an Alzheimer’s challenge dataset.

A Topology Layer for Machine Learning

Wed, 03 Jun 2020 00:00:00 +0000

Topology applied to real world data using persistent homology has started to find applications within machine learning, including deep learning. We present a differentiable topology layer that computes persistent homology based on level set filtrations and edge-based filtrations. We present three novel applications: the topological layer can (i) regularize data reconstruction or the weights of machine learning models, (ii) construct a loss on the output of a deep generative network to incorporate topological priors, and (iii) perform topological adversarial attacks on deep networks trained with persistence features. The code is publicly available and we hope its availability will facilitate the use of persistent homology in deep learning and other gradient based applications.

POPCORN: Partially Observed Prediction Constrained Reinforcement Learning

Wed, 03 Jun 2020 00:00:00 +0000

Many medical decision-making tasks can be framed as partially observed Markov decision processes (POMDPs). However, prevailing two-stage approaches that first learn a POMDP and then solve it often fail because the model that best fits the data may not be well suited for planning. We introduce a new optimization objective that (a) produces both high-performing policies and high-quality generative models, even when some observations are irrelevant for planning, and (b) does so in batch off-policy settings that are typical in healthcare, when only retrospective data is available. We demonstrate our approach on synthetic examples and a challenging medical decision-making problem.

Noisy-Input Entropy Search for Efficient Robust Bayesian Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of robust optimization within the well-established Bayesian Optimization (BO) framework.While BO is intrinsically robust to noisy evaluations of the objective function, standard approaches do not consider the case of uncertainty about the input parameters.In this paper, we propose Noisy-Input Entropy Search (NES), a novel information-theoretic acquisition function that is designed to find robust optima for problems with both input and measurement noise.NES is based on the key insight that the robust objective in many cases can be modeled as a Gaussian process, however, it cannot be observed directly.We evaluate NES on several benchmark problems from the optimization literature and from engineering.The results show that NES reliably finds robust optima, outperforming existing methods from the literature on all benchmarks.

Approximate Inference with Wasserstein Gradient Flows

Wed, 03 Jun 2020 00:00:00 +0000

We present a novel approximate inference method for diffusion processes, based on the Wasserstein gradient flow formulation of the diffusion. In this formulation, the time-dependent density of the diffusion is derived as the limit of implicit Euler steps that follow the gradients of a particular free energy functional. Existing methods for computing Wasserstein gradient flows rely on discretization of the domain of the diffusion, prohibiting their application to domains in more than several dimensions. We propose instead a discretization-free inference method that computes the Wasserstein gradient flow directly in a space of continuous functions. We characterize approximation properties of the proposed method and evaluate it on a nonlinear filtering task, finding performance comparable to the state-of-the-art for filtering diffusions.

A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments

Wed, 03 Jun 2020 00:00:00 +0000

We introduce a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED). Our approach utilizes variational lower bounds on the expected information gain (EIG) of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct a pointwise EIG estimator, before passing this estimator to a separate optimizer. We provide a number of different variational objectives including the novel adaptive contrastive estimation (ACE) bound. Finally, we show that our gradient-based approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.

GP-VAE: Deep Probabilistic Time Series Imputation

Wed, 03 Jun 2020 00:00:00 +0000

Multivariate time series with missing values are common in areas such as healthcare and finance, and have grown in number and complexity over the years. This raises the question whether deep learning methodologies can outperform classical data imputation methods in this domain. However, naive applications of deep learning fall short in giving reliable confidence estimates and lack interpretability.We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. Our modeling assumption is simple and interpretable: the high dimensional time series has a lower-dimensional representation which evolves smoothly in time according to a Gaussian process. The non-linear dimensionality reduction in the presence of missing data is achieved using a VAE approach with a novel structured variational approximation. We demonstrate that our approach outperforms both classical and recent deep learning-based data imputation methods on high dimensional data from the domains of computer vision and healthcare.

Fairness Evaluation in Presence of Biased Noisy Labels

Wed, 03 Jun 2020 00:00:00 +0000

Risk assessment tools are widely used around the country to inform decision making within the criminal justice system. Recently, considerable attention has been devoted to the question of whether such tools may suffer from racial bias. In this type of assessment, a fundamental issue is that the training and evaluation of the model is based on a variable (arrest) that may represent a noisy version of an unobserved outcome of more central interest (offense). We propose a sensitivity analysis framework for assessing how assumptions on the noise across groups affect the predictive bias properties of the risk assessment model as a predictor of reoffense. Our experimental results on two real world criminal justice data sets demonstrate how even small biases in the observed labels may call into question the conclusions of an analysis based on the noisy outcome.

A Locally Adaptive Bayesian Cubature Method

Wed, 03 Jun 2020 00:00:00 +0000

Bayesian cubature (BC) is a popular inferential perspective on the cubature of expensive integrands, wherein the integrand is emulated using a stochastic process model. Several approaches have been put forward to encode sequential adaptation (i.e. dependence on previous integrand evaluations) into this framework. However, these proposals have been limited to either estimating the parameters of a stationary covariance model or focusing computational resources on regions where large values are taken by the integrand. In contrast, many classical adaptive cubature methods are locally adaptive in the sense that they focus computational resources on spatial regions in which local error estimates are largest. The main contributions of this work are twofold; first we establish that existing BC methods do not possess local adaptivity in the sense of many classical adaptive methods and secondly, we developed a novel BC method whose behaviour, demonstrated empirically, is analogous to such methods. Finally we present evidence that the novel method provides improved cubature performance, relative to standard BC, in a detailed empirical assessment.

Adaptive multi-fidelity optimization with fast learning rates

Wed, 03 Jun 2020 00:00:00 +0000

In multi-fidelity optimization, biased approximations of varying costs of the target function are available. This paper studies the problem of optimizing a locally smooth function with a limited budget, where the learner has to make a tradeoff between the cost and the bias of these approximations. We first prove lower bounds for the simple regret under different assumptions on the fidelities, based on a cost-to-bias function. We then present the Kometo algorithm which achieves, with additional logarithmic factors, the same rates without any knowledge of the function smoothness and fidelity assumptions, and improves previously proven guarantees. We finally empirically show that our algorithm outperforms previous multi-fidelity optimization methods without the knowledge of problem-dependent parameters.

Measuring Mutual Information Between All Pairs of Variables in Subquadratic Complexity

Wed, 03 Jun 2020 00:00:00 +0000

Finding associations between pairs of variables in large datasets is crucial for various disciplines. The brute force method for solving this problem requires computing the mutual information between $\binom{N}{2}$ pairs. In this paper, we consider the problem of finding pairs of variables with high mutual information in sub-quadratic complexity. This problem is analogous to the nearest neighbor search, where the goal is to find pairs among $N$ variables that are similar to each other. To solve this problem, we develop a new algorithm for finding associations based on constructing a decision tree that assigns a hash to each variable, in a way that for pairs with higher mutual information, the chance of having the same hash is higher. For any $1 \leq \lambda \leq 2$, we prove that in the case of binary data, we can reduce the number of necessary mutual information computations for finding all pairs satisfying $I(X, Y) > 2- \lambda$ from $O(N^2)$ to $O(N^\lambda)$, where $I(X,Y)$ is the empirical mutual information between variables $X$ and $Y$. Finally, we confirmed our theory by experiments on simulated and real data. The implementation of our method and experiments is publicly available at \href{https://github.com/mohimanilab/HashMI}{https://github.com/mohimanilab/HashMI}.

Learning with minibatch Wasserstein : asymptotic and gradient properties

Wed, 03 Jun 2020 00:00:00 +0000

Optimal transport distances are powerful tools to compare probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large scale datasets. To overcome this challenge, practitioners compute these distances on minibatches i.e., they average the outcome of several smaller optimal transport problems. We propose in this paper an analysis of this practice, which effects are not well understood so far. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators, gradients and a concentration bound around the expectation, but also with defects such as loss of distance property. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, GANs or color transfer that highlight the practical interest of this strategy.

AP-Perf: Incorporating Generic Performance Metrics in Differentiable Learning

Wed, 03 Jun 2020 00:00:00 +0000

We propose a method that enables practitioners to conveniently incorporate custom non-decomposable performance metrics into differentiable learning pipelines, notably those based upon neural network architectures. Our approach is based on the recently developed adversarial prediction framework, a distributionally robust approach that optimizes a metric in the worst case given the statistical summary of the empirical distribution. We formulate a marginal distribution technique to reduce the complexity of optimizing the adversarial prediction formulation over a vast range of non-decomposable metrics. We demonstrate how easy it is to write and incorporate complex custom metrics using our provided tool. Finally, we show the effectiveness of our approach various classification tasks on tabular datasets from the UCI repository and benchmark datasets, as well as image classification tasks. The code for our proposed method is available at https://github.com/rizalzaf/AdversarialPrediction.jl.

Radial Bayesian Neural Networks: Beyond Discrete Support In Large-Scale Bayesian Deep Learning

Wed, 03 Jun 2020 00:00:00 +0000

We propose Radial Bayesian Neural Networks (BNNs): a variational approximate posterior for BNNs which scales well to large models. Unlike scalable Bayesian deep learning methods like deep ensembles that have discrete support (assign exactly zero probability almost everywhere in weight-space) Radial BNNs maintain full support; letting them act as a prior for continual learning and avoiding the a priori implausibility of discrete support. Our method avoids a sampling problem in mean-field variational inference (MFVI) caused by the so-called ’soap-bubble’ pathology of multivariate Gaussians. We show that, unlike MFVI, Radial BNNs are robust to hyperparameters and can be efficiently applied to challenging real-world tasks without needing ad-hoc tweaks and intensive tuning: on a real-world medical imaging task Radial BNNs outperform MC dropout and deep ensembles.

Orthogonal Gradient Descent for Continual Learning

Wed, 03 Jun 2020 00:00:00 +0000

Neural networks are achieving state of the art and sometimes super-human performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from the problem of catastrophic forgetting; they forget how to solve previous tasks after being trained on a new task, despite having the essential capacity to solve both tasks if they were trained on both simultaneously. In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. We present the Orthogonal Gradient Descent (OGD) method, which accomplishes this goal by projecting the gradients from new tasks onto a subspace in which the neural network output on previous task does not change and the projected gradient is still in a useful direction for learning the new task. Our approach utilizes the high capacity of a neural network more efficiently and does not require storing the previously learned data that might raise privacy concerns. Experiments on common benchmarks reveal the effectiveness of the proposed OGD method.

Greed Meets Sparsity: Understanding and Improving Greedy Coordinate Descent for Sparse Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We consider greedy coordinate descent (GCD) for composite problems with sparsity inducing regularizers, including 1-norm regularization and non-negative constraints. Empirical evidence strongly suggests that GCD, when initialized with the zero vector, has an implicit screening ability that usually selects at each iteration coordinates that at are nonzero at the solution. Thus, for problems with sparse solutions, GCD can converge significantly faster than randomized coordinate descent. We present an improved convergence analysis of GCD for sparse optimization, and a formal analysis of its screening properties. We also propose and analyze an improved selection rule with stronger ability to produce sparse iterates. Numerical experiments on both synthetic and real-world data support our analysis and the effectiveness of the proposed selection rule.

Online Binary Space Partitioning Forests

Wed, 03 Jun 2020 00:00:00 +0000

The Binary Space Partitioning-Tree (BSP-Tree) process was recently proposed as an efficient strategy for space partitioning tasks. Because it uses more than one dimension to partition the space, the BSP-Tree process is more efficient and flexible than conventional axis-aligned cut strategies. However, due to its batch learning setting, it is not well suited to large-scale classification and regression problems. In this paper, we develop an online BSP-Forest framework to address this limitation. With the arrival of new data, the resulting online algorithm can simultaneously expand the space coverage and refine the partition structure, with guaranteed universal consistency for classification problems. The effectiveness and competitive performance of the online BSP-Forest is verified via simulations.

On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms

Wed, 03 Jun 2020 00:00:00 +0000

We study the convergence of a class of gradient-based Model-Agnostic Meta-Learning (MAML) methods and characterize their overall complexity as well as their best achievable accuracy in terms of gradient norm for nonconvex loss functions. We start with the MAML method and its first-order approximation (FO-MAML) and highlight the challenges that emerge in their analysis. By overcoming these challenges not only we provide the first theoretical guarantees for MAML and FO-MAML in nonconvex settings, but also we answer some of the unanswered questions for the implementation of these algorithms including how to choose their learning rate and the batch size for both tasks and datasets corresponding to tasks. In particular, we show that MAML can find an ?-first-order stationary point ( ?-FOSP) for any positive ? after at most O(1/?^2) iterations at the expense of requiring second-order information. We also show that FO-MAML which ignores the second-order information required in the update of MAML cannot achieve any small desired level of accuracy, i.e., FO-MAML cannot find an ?-FOSP for any ?>0. We further propose a new variant of the MAML algorithm called Hessian-free MAML which preserves all theoretical guarantees of MAML, without requiring access to second-order information.

Towards Competitive N-gram Smoothing

Wed, 03 Jun 2020 00:00:00 +0000

N-gram models remain a fundamental component of language modeling. In data-scarce regimes, they are a strong alternative to neural models. Even when not used as-is, recent work shows they can regularize neural models. Despite this success, the effectiveness of one of the best N-gram smoothing methods, the one suggested by Kneser and Ney (1995), is not fully understood. In the hopes of explaining this performance, we study it through the lens of competitive distribution estimation: the ability to perform as well as an oracle aware of further structure in the data. We first establish basic competitive properties of Kneser-Ney smoothing. We then investigate the nature of its backoff mechanism and show that it emerges from first principles, rather than being an assumption of the model. We do this by generalizing the Good-Turing estimator to the contextual setting. This exploration leads us to a powerful generalization of Kneser-Ney, which we conjecture to have even stronger competitive properties. Empirically, it significantly improves performance on language modeling, even matching feed-forward neural models. To show that the mechanisms at play are not restricted to language modeling, we demonstrate similar gains on the task of predicting attack types in the Global Terrorism Database.

Prophets, Secretaries, and Maximizing the Probability of Choosing the Best

Wed, 03 Jun 2020 00:00:00 +0000

Suppose a customer is faced with a sequence of fluctuating prices, such as for airfare or a product sold by a large online retailer. Given distributional information about what price they might face each day, how should they choose when to purchase in order to maximize the likelihood of getting the best price in retrospect? This is related to the classical secretary problem, but with values drawn from known distributions. In their pioneering work, Gilbert and Mosteller [\textit{J. Amer. Statist. Assoc. 1966}] showed that when the values are drawn i.i.d., there is a thresholding algorithm that selects the best value with probability approximately 0.58010.5801. However, the more general problem with non-identical distributions has remained unsolved.In this paper, we provide an algorithm for the case of non-identical distributions that selects the maximum element with probability 1/e1/e, and we show that this is tight. We further show that if the observations arrive in a random order, this barrier of 1/e1/e can be broken using a static threshold algorithm, and we show that our success probability is the best possible for any single-threshold algorithm under random observation order. Moreover, we prove that one can achieve a strictly better success probability using more general multi-threshold algorithms, unlike the non-random-order case. Along the way, we show that the best achievable success probability for the random-order case matches that of the i.i.d. case, which is approximately 0.58010.5801, under a “no-superstars” condition that no single distribution is very likely ex ante to generate the maximum value. We also extend our results to the problem of selecting one of the kk best values.One of the main tools in our analysis is a suitable “Poissonization” of random order distributions, which uses Le Cam’s theorem to connect the Poisson binomial distribution with the discrete Poisson distribution. This approach may be of independent interest.

Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models

Wed, 03 Jun 2020 00:00:00 +0000

We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that rectified linear units in neural networks act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one dimensional regression and classification, we prove that finite two-layer ReLU networks with norm regularization yield linear spline interpolation. In the more general higher dimensional case, we show that the training problem for two-layer networks can be cast as a convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cutting-plane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed form formulas for the optimal neural network weights in certain cases. Our results show that the hidden neurons of a ReLU network can be interpreted as convex autoencoders of the input layer. We also establish a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing. Extensive experimental results show that the proposed approach yields interpretable and accurate models.

Fast and Bayes-consistent nearest neighbors

Wed, 03 Jun 2020 00:00:00 +0000

Research on nearest-neighbor methods tends to focus somewhat dichotomously either on the statistical or the computational aspects – either on, say, Bayes consistency and rates of convergence or on techniques for speeding up the proximity search. This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrificing too much in the risk decay rate. We combine the locality-sensitive hashing (LSH) technique with a novel missing-mass argument to obtain a fast and Bayes-consistent classifier. Our algorithm’s prediction runtime compares favorably against state of the art approximate NN methods, while maintaining Bayes-consistency and attaining rates comparable to minimax. On samples of size $n$ in $\R^d$, our pre-processing phase has runtime $O(d n \log n)$, while the evaluation phase has runtime $O(d\log n)$ per query point.

Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data

Wed, 03 Jun 2020 00:00:00 +0000

We focus on the problem of unsupervised cell outlier detection and repair inmixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset areoutliers. However, identifying which cells are corrupted in aspecific row is an important problem in practice, and the very first steptowards repairing them. We introduce the Robust VariationalAutoencoder (RVAE), a deep generative model that learns the jointdistribution of the clean data while identifying the outlier cells, allowing their imputation (repair). RVAE explicitly learns the probability of each cell being an outlier, balancing differentlikelihood models in the row outlier score, making the method suitablefor outlier detection in mixed-type datasets.We show experimentallythat not only RVAE performs better than several state-of-the-art methods incell outlier detection and repair for tabular data, but also that is robust against theinitial hyper-parameter selection.

Sharp Analysis of Expectation-Maximization for Weakly Identifiable Models

Wed, 03 Jun 2020 00:00:00 +0000

We study a class of weakly identifiable location-scale mixture models for which the maximum likelihood estimates based on $n$ i.i.d. samples are known to have lower accuracy than the classical $n^{- \frac{1}{2}}$ error. We investigate whether the Expectation-Maximization (EM) algorithm also converges slowly for these models. We provide a rigorous characterization of EM for fitting a weakly identifiable Gaussian mixture in a univariate setting where we prove that the EM algorithm converges in order $n^{\frac{3}{4}}$ steps and returns estimates that are at a Euclidean distance of order ${ n^{- \frac{1}{8}}}$ and ${ n^{-\frac{1} {4}}}$ from the true location and scale parameter respectively. Establishing the slow rates in the univariate setting requires a novel localization argument with two stages, with each stage involving an epoch-based argument applied to a different surrogate EM operator at the population level. We demonstrate several multivariate ($d \geq 2$) examples that exhibit the same slow rates as the univariate case. We also prove slow statistical rates in higher dimensions in a special case, when the fitted covariance is constrained to be a multiple of identity.

Bayesian Image Classification with Deep Convolutional Gaussian Processes

Wed, 03 Jun 2020 00:00:00 +0000

In decision-making systems, it is important to have classifiers that have calibrated uncertainties, with an optimisation objective that can be used for automated model selection and training. Gaussian processes (GPs) provide uncertainty estimates and a marginal likelihood objective, but their weak inductive biases lead to inferior accuracy. This has limited their applicability in certain tasks (e.g. image classification). We propose a translation insensitive convolutional kernel, which relaxes the translation invariance constraint imposed by previous convolutional GPs. We show how we can use the marginal likelihood to learn the degree of insensitivity. We also reformulate GP image-to-image convolutional mappings as multi-output GPs, leading to deep convolutional GPs. We show experimentally that our new kernel improves performance in both single-layer and deep models. We also demonstrate that our fully Bayesian approach improves on dropout-based Bayesian deep learning methods in terms of uncertainty and marginal likelihood estimates.

A Diversity-aware Model for Majority Vote Ensemble Accuracy

Wed, 03 Jun 2020 00:00:00 +0000

Ensemble classifiers are a successful and popular approach for classification, and are frequently found to have better generalization performance than single models in practice. Although it is widely recognized that ‘diversity’ between ensemble members is important in achieving these performance gains, for classification ensembles it is not widely understood which diversity measures are most predictive of ensemble performance, nor how large an ensemble should be for a particular application. In this paper, we explore the predictive power of several common diversity measures and show – with extensive experiments – that contrary to earlier work that finds no clear link between these diversity measures (in isolation) and ensemble accuracy instead by using the $\rho$ diversity measure of Sneath and Sokal as an estimator for the dispersion parameter of a Polya-Eggenberger distribution we can predict, independently of the choice of base classifier family, the accuracy of a majority vote classifier ensemble ridiculously well. We discuss our model and some implications of our findings – such as diversity-aware (non-greedy) pruning of a majority-voting ensemble.

Distributed, partially collapsed MCMC for Bayesian Nonparametrics

Wed, 03 Jun 2020 00:00:00 +0000

Bayesian nonparametric (BNP) models provide elegant methods for discovering underlying latent features within a data set, but inference in such models can be slow. We exploit the fact that completely random measures, which commonly-used models like the Dirichlet process and the beta-Bernoulli process can be expressed using, are decomposable into independent sub-measures. We use this decomposition to partition the latent measure into a finite measure containing only instantiated components, and an infinite measure containing all other components. We then select different inference algorithms for the two components: uncollapsed samplers mix well on the finite measure, while collapsed samplers mix well on the infinite, sparsely occupied tail. The resulting hybrid algorithm can be applied to a wide class of models, and can be easily distributed to allow scalable inference without sacrificing asymptotic convergence guarantees.

Invertible Generative Modeling using Linear Rational Splines

Wed, 03 Jun 2020 00:00:00 +0000

Normalizing flows attempt to model an arbitrary probability distribution through a set of invertible mappings. These transformations are required to achieve a tractable Jacobian determinant that can be used in high-dimensional scenarios. The first normalizing flow designs used coupling layer mappings built upon affine transformations. The significant advantage of such models is their easy-to-compute inverse. Nevertheless, making use of affine transformations may limit the expressiveness of such models. Recently, invertible piecewise polynomial functions as a replacement for affine transformations have attracted attention. However, these methods require solving a polynomial equation to calculate their inverse. In this paper, we explore using linear rational splines as a replacement for affine transformations used in coupling layers. Besides having a straightforward inverse, inference and generation have similar cost and architecture in this method. Moreover, simulation results demonstrate the competitiveness of this approach’s performance compared to existing methods.

Precision-Recall Curves Using Information Divergence Frontiers

Wed, 03 Jun 2020 00:00:00 +0000

Despite the tremendous progress in the estimation of generative models, the development of tools for diagnosing their failures and assessing their performance has advanced at a much slower pace. Recent developments have investigated metrics that quantify which parts of the true distribution is modeled well, and, on the contrary, what the model fails to capture, akin to precision and recall in information retrieval. In this paper, we present a general evaluation framework for generative models that measures the trade-off between precision and recall using Renyi divergences. Our framework provides a novel perspective on existing techniques and extends them to more general domains. As a key advantage, this formulation encompasses both continuous and discrete models and allows for the design of efficient algorithms that do not have to quantize the data. We further analyze the biases of the approximations used in practice.

Dynamical Systems Theory for Causal Inference with Application to Synthetic Control Methods

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we adopt results in nonlinear time series analysis for causal inference in dynamical settings. Our motivation is policy analysis with panel data, particularly through the use of “synthetic control" methods. These methods regress pre-intervention outcomes of the treated unit to outcomes from a pool of control units, and then use the fitted regression model to estimate causal effects post-intervention. In this setting, we propose to screen out control units that have a weak dynamical relationship to the treated unit. In simulations, we show that this method can mitigate bias from “cherry-picking" of control units, which is usually an important concern. We illustrate on real-world applications, including the tobacco legislation example of \citet{Abadie2010}, and Brexit.

Locally Accelerated Conditional Gradients

Wed, 03 Jun 2020 00:00:00 +0000

Conditional gradients constitute a class of projection-free first-order algorithms for smooth convex optimization. As such, they are frequently used in solving smooth convex optimization problems over polytopes, for which the computational cost of projections is prohibitive. However, they do not enjoy the optimal convergence rates achieved by projection-based accelerated methods; moreover, achieving such globally-accelerated rates is information-theoretically impossible. To address this issue, we present Locally Accelerated Conditional Gradients – an algorithmic framework that couples accelerated steps with conditional gradient steps to achieve \emph{local} acceleration on smooth strongly convex problems. Our approach does not require projections onto the feasible set, but only on (typically low-dimensional) simplices, thus keeping the computational cost of projections at bay. Further, it achieves optimal accelerated local convergence. Our theoretical results are supported by numerical experiments, which demonstrate significant speedups over state of the art methods in both per-iteration progress and wall-clock time.

Bayesian experimental design using regularized determinantal point processes

Wed, 03 Jun 2020 00:00:00 +0000

We establish a fundamental connection between Bayesian experimental design and determinantal point processes (DPPs). Experimental design is a classical task in combinatorial optimization, where we wish to select a small subset of $d$-dimensional vectors to minimize a statistical optimality criterion. We show that a new regularized variant of DPPs can be used to design efficient algorithms for finding $(1+\epsilon)$-approximate solutions to experimental design under four commonly used optimality criteria: A-, C-, D- and V-optimality. A key novelty is that we offer improved guarantees under the Bayesian framework, where prior knowledge is incorporated into the criteria. Our algorithm returns a $(1+\epsilon)$-approximate solution when the subset size $k$ is $\Omega\left(\frac{d_A}{\epsilon} + \frac{\log(11/epsilon)}{\epsilon^2}\right)$, where $d_A << d$ is an effective dimension determined by prior knowledge (via a precision matrix $\mathbf{A}$).This is the first approximation guarantee where the dependence on $d$ is replaced by an effective dimension. Moreover, the time complexity of our algorithm significantly improves on existing approaches with comparable guarantees.

Logistic regression with peer-group effects via inference in higher-order Ising models

Wed, 03 Jun 2020 00:00:00 +0000

Spin glass models, such as the Sherrington-Kirkpatrick, Hopfield and Ising models, are all well-studied members of the exponential family of discrete distributions, and have been influential in a number of application domains where they are used to model correlation phenomena on networks. Conventionally these models have quadratic sufficient statistics and consequently capture correlations arising from pairwise interactions. In this work we study extensions of these models to models with higher-order sufficient statistics, modeling behavior on a social network with peer-group effects. In particular, we model binary outcomes on a network as a higher-order spin glass, where the behavior of an individual depends on a linear function of their own vector of covariates and some polynomial function of the behavior of others, capturing peer-group effects. Using a {\em single}, high-dimensional sample from such model our goal is to recover the coefficients of the linear function as well as the strength of the peer-group effects. The heart of our result is a novel approach for showing strong concavity of the log pseudo-likelihood of the model, implying statistical error rate of $\sqrt{d/n}$ for the Maximum Pseudo-Likelihood Estimator (MPLE), where $d$ is the dimensionality of the covariate vectors and $n$ is the size of the network (number of nodes). Our model generalizes vanilla logistic regression as well as the models studied in recent works of \cite{chatterjee2007estimation,ghosal2018joint,DDP19}, and our results extend these results to accommodate higher-order interactions.

Robust Learning from Discriminative Feature Feedback

Wed, 03 Jun 2020 00:00:00 +0000

Recent work introduced the model of "learning from discriminative feature feedback", in which a human annotator not only provides labels of instances, but also identifies discriminative features that highlight important differences between pairs of instances. It was shown that such feedback can be conducive to learning, and makes it possible to efficiently learn some concept classes that would otherwise be intractable. However, these results all relied upon *perfect* annotator feedback. In this paper, we introduce a more realistic, *robust* version of the framework, in which the annotator is allowed to make mistakes. We show how such errors can be handled algorithmically, in both an adversarial and a stochastic setting. In particular, we derive regret bounds in both settings that, as in the case of a perfect annotator, are independent of the number of features. We show that this result cannot be obtained by a naive reduction from the robust setting to the non-robust setting.

Modular Block-diagonal Curvature Approximations for Feedforward Architectures

Wed, 03 Jun 2020 00:00:00 +0000

We propose a modular extension of backpropagation for the computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian). The approach reduces the otherwise tedious manual derivation of these matrices into local modules, and is easy to integrate into existing machine learning libraries. Moreover, we develop a compact notation derived from matrix differential calculus. We outline different strategies applicable to our method. They subsume recently-proposed block-diagonal approximations as special cases, and are extended to convolutional neural networks in this work.

Validation of Approximate Likelihood and Emulator Models for Computationally Intensive Simulations

Wed, 03 Jun 2020 00:00:00 +0000

Complex phenomena in engineering and the sciences are often modeled with computationally intensive feed-forward simulations for which a tractable analytic likelihood does not exist. In these cases, it is sometimes necessary to estimate an approximate likelihood or fit a fast emulator model for efficient statistical inference; such surrogate models include Gaussian synthetic likelihoods and more recently neural density estimators such as autoregressive models and normalizing flows. To date, however, there is no consistent way of quantifying the quality of such a fit. Here we propose a statistical framework that can distinguish any arbitrary misspecified model from the target likelihood, and that in addition can identify with statistical confidence the regions of parameter as well as feature space where the fit is inadequate. At the heart of our approach is a two-sample test that quantifies the quality of the fit at fixed parameter values, and a global test that assesses goodness-of-fit across simulation parameters. While our general framework can incorporate any test statistic or distance metric, we specifically argue for a new two-sample test that can leverage any regression method to attain high power and provide diagnostics in complex data settings. Software for our approach is available on GitHub in Python and R.

A nonasymptotic law of iterated logarithm for general M-estimators

Wed, 03 Jun 2020 00:00:00 +0000

M-estimators are ubiquitous in machine learning and statistical learning theory. They are used both for defining prediction strategies and for evaluating their precision. In this paper, we propose the first non-asymptotic ’any-time’ deviation bounds for general M-estimators, where ’any-time’ means that the bound holds with a prescribed probability for every sample size. These bounds are non-asymptotic versions of the law of iterated logarithm. They are established under general assumptions such as Lipschitz continuity of the loss function and (local) curvature of thepopulation risk. These conditions are satisfied for most examples used in machine learning, including those ensuring robustness to outliers and to heavy tailed distributions. As an example of application, we consider the problem of best arm identification in a stochastic multi-arm bandit setting. We show that the established bound can be converted into a new algorithm, with provably optimal theoretical guarantees. Numerical experiments illustrating the validity of the algorithm are reported.

Rk-means: Fast Clustering for Relational Data

Wed, 03 Jun 2020 00:00:00 +0000

Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k-means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small grid coreset of the data matrix for subsequent cluster construction. This gives a constant approximation for the k-means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.

Hermitian matrices for clustering directed graphs: insights and applications

Wed, 03 Jun 2020 00:00:00 +0000

Graph clustering is a basic technique in machine learning, and has widespread applications in different domains. While spectral techniques have been successfully applied for clustering undirected graphs, the performance of spectral clustering algorithms for directed graphs (digraphs) is not in general satisfactory: these algorithms usually require symmetrising the matrix representing a digraph, and typical objective functions for undirected graph clustering do not capture cluster-structures in which the information given by the direction of the edges is crucial. To overcome these downsides, we propose a spectral clustering algorithm based on a complex-valued matrix representation of digraphs. We analyse its theoretical performance on a Stochastic Block Model for digraphs in which the cluster-structure is given not only by variations in edge densities, but also by the direction of the edges. The significance of our work is highlighted on a data set pertaining to internal migration in the United States: while previous spectral clustering algorithms for digraphs can only reveal that people are more likely to move between counties that are geographically close, our approach is able to cluster together counties with a similar socio-economical profile even when they are geographically distant, and illustrates how people tend to move from rural to more urbanised areas.

On Pruning for Score-Based Bayesian Network Structure Learning

Wed, 03 Jun 2020 00:00:00 +0000

Many algorithms for score-based Bayesian network structure learning (BNSL), in particular exact ones, take as input a collection of potentially optimal parent sets for each variable in the data. Constructing such collections naively is computationally intensive since the number of parent sets grows exponentially with the number of variables. Thus, pruning techniques are not only desirable but essential. While good pruning rules exist for the Bayesian Information Criterion (BIC), current results for the Bayesian Dirichlet equivalent uniform (BDeu) score reduce the search space very modestly, hampering the use of the (often preferred) BDeu. We derive new non-trivial theoretical upper bounds for the BDeu score that considerably improve on the state-of-the-art. Since the new bounds are mathematically proven to be tighter than previous ones and at little extra computational cost, they are a promising addition to BNSL methods.

Data Generation for Neural Programming by Example

Wed, 03 Jun 2020 00:00:00 +0000

Programming by example is the problem of synthesizing a program from a small set of input / output pairs. Recent works applying machine learning methods to this task show promise, but are typically reliant on generating synthetic examples for training. A particular challenge lies in generating meaningful sets of inputs and outputs, which well-characterize a given program and accurately demonstrate its behavior. Where examples used for testing are generated by the same method as training data then the performance of a model may be partly reliant on this similarity. In this paper we introduce a novel approach using an SMT solver to synthesize inputs which cover a diverse set of behaviors for a given program. We carry out a case study comparing this method to existing synthetic data generation procedures in the literature, and find that data generated using our approach improves both the discriminatory power of example sets and the ability of trained machine learning models to generalize to unfamiliar data.

Distributionally Robust Formulation and Model Selection for the Graphical Lasso

Wed, 03 Jun 2020 00:00:00 +0000

Building on a recent framework for distributionally robust optimization, we consider inverse covariance matrix estimation for multivariate data. A novel notion of Wasserstein ambiguity set is provided that is specifically tailored to this problem, leading to a tractable class of regularized estimators. Penalized likelihood estimators for Gaussian data, specifically the graphical lasso estimator, are special cases. Consequently, a direction connection is made between the radius of the Wasserstein ambiguity and the regularization parameter, so that the level of robustness of the estimator is shown to correspond to the level of confidence with which the ambiguity set contains a distribution with the population covariance. A unique feature of the formulation is that the radius can be expressed in closed-form as a function of the ordinary sample covariance matrix. Taking advantage of this finding, a simple algorithm is developed to determine a regularization parameter for graphical lasso, using only the bootstrapped sample covariance matrices, rendering computationally expensive repeated evaluation of the graphical lasso algorithm unnecessary. Alternatively, the distributionally robust formulation can also quantify the robustness of the corresponding estimator if one uses an off-the-shelf method such as cross-validation. Finally, a numerical study is performed to analyze the robustness of the proposed method relative to other automated tuning procedures used in practice.

Practical Nonisotropic Monte Carlo Sampling in High Dimensions via Determinantal Point Processes

Wed, 03 Jun 2020 00:00:00 +0000

We propose a new class of practical structured methods for nonisotropic Monte Carlo (MC) sampling, called DPPMC, designed for high-dimensional nonisotropic distributions where samples are correlated to reduce the variance of the estimator via determinantal point processes. We successfully apply DPPMCs to high-dimensional problems involving nonisotropic distributions arising in guided evolution strategy (GES) methods for reinforcement learning (RL), CMA-ES techniques and trust region algorithms for blackbox optimization, improving state-of-the-art in all these settings. In particular, we show that DPPMCs drastically improve exploration profiles of the existing evolution strategy algorithms. We further confirm our results, analyzing random feature map estimators for Gaussian mixture kernels. We provide theoretical justification of our empirical results, showing a connection between DPPMCs and recently introduced structured orthogonal MC methods for isotropic distributions.

Patient-Specific Effects of Medication Using Latent Force Models with Gaussian Processes

Wed, 03 Jun 2020 00:00:00 +0000

A multi-output Gaussian process (GP) is a flexible Bayesian nonparametric framework that has proven useful in jointly modeling the physiological states of patients in medical time series data. However, capturing the short-term effects of drugs and therapeutic interventions on patient physiological state remains challenging. We propose a novel approach that models the effect of interventions as a hybrid Gaussian process composed of a GP capturing patient baseline physiology convolved with a latent force model capturing effects of treatments on specific physiological features. The combination of a multi-output GP with a time-marked kernel GP leads to a well-characterized model of patients’ physiological state across a hospital stay, including response to interventions. Our model leads to analytically tractable cross-covariance functions that allow for scalable inference. Our hierarchical model includes estimates of patient-specific effects but allows sharing of support across patients. Our approach achieves competitive predictive performance on challenging hospital data, where we recover patient-specific response to the administration of three common drugs: one antihypertensive drug and two anticoagulants.

A Reduction from Reinforcement Learning to No-Regret Online Learning

Wed, 03 Jun 2020 00:00:00 +0000

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which "any" online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any $\gamma$-discounted tabular RL problem, with probability at least $1-\delta$, it learns an $\epsilon$-optimal policy using at most $\tilde{O}\left(\frac{|\SS||Å|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$ samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of $|\SS|$,$|Å|$, though at the cost of potential approximation bias.

Online Learning with Continuous Variations: Dynamic Regret and Reductions

Wed, 03 Jun 2020 00:00:00 +0000

Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup fails to capture regularity that can exist in practice. Motivated by this observation, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learner’s decisions. We show that COL appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of sublinear dynamic regret. We prove a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs. With this insight, we offer conditions for efficient algorithms that achieve sublinear dynamic regret, even when the losses are chosen adaptively without any a priori variation budget. Furthermore, we show for COL a reduction from dynamic regret to both static regret and convergence in the associated EP, allowing us to analyze the dynamic regret of many existing algorithms.

Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic Approximation

Wed, 03 Jun 2020 00:00:00 +0000

This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This is motivation for the focus on mean square error bounds for parameter estimates. It is shown that mean square error achieves the optimal rate of $O(1/n)$, subject to conditions on the step-size sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm design.

On Generalization Bounds of a Family of Recurrent Neural Networks

Wed, 03 Jun 2020 00:00:00 +0000

Recurrent Neural Networks (RNNs) have been widely applied to sequential data analysis. Due to their complicated modeling structures, however, the theory behind is still largely missing. To connect theory and practice, we study the generalization properties of vanilla RNNs as well as their variants, including Minimal Gated Unit (MGU), Long Short Term Memory (LSTM), and Convolutional (Conv) RNNs. Specifically, our theory is established under the PAC-Learning framework. The generalization bound is presented in terms of the spectral norms of the weight matrices and the total number of parameters. We also establish refined generalization bounds with additional norm assumptions, and draw a comparison among these bounds. We remark: (1) Our generalization bound for vanilla RNNs is significantly tighter than the best of existing results; (2) We are not aware of any other generalization bounds for MGU and LSTM RNNs in the exiting literature; (3) We demonstrate the advantages of these variants in generalization.

Black Box Submodular Maximization: Discrete and Continuous Settings

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we consider the problem of black box continuous submodular maximization where we only have access to the function values and no information about the derivatives is provided. For a monotone and continuous DR-submodular function, and subject to a bounded convex body constraint, we propose Black-box Continuous Greedy, a derivative-free algorithm that provably achieves the tight $[(1-1/e)OPT-\epsilon]$ approximation guarantee with $O(d/\epsilon^3)$ function evaluations. We then extend our result to the stochastic setting where function values are subject to stochastic zero-mean noise. It is through this stochastic generalization that we revisit the discrete submodular maximization problem and use the multi-linear extension as a bridge between discrete and continuous settings. Finally, we extensively evaluate the performance of our algorithm on continuous and discrete submodular objective functions using both synthetic and real data.

Contextual Online False Discovery Rate Control

Wed, 03 Jun 2020 00:00:00 +0000

Multiple hypothesis testing, a situation when we wish to consider many hypotheses, is a core problem in statistical inference that arises in almost every scientific field. In this setting, controlling the false discovery rate (FDR), which is the expected proportion of type I error, is an important challenge for making meaningful inferences. In this paper, we consider a setting where an ordered (possibly infinite) sequence of hypotheses arrives in a stream, and for each hypothesis we observe a p-value along with a set of features specific to that hypothesis. The decision whether or not to reject the current hypothesis must be made immediately at each time step, before the next hypothesis is observed. This model provides a general way of leveraging the side (contextual) information in the data to help maximize the number of discoveries while controlling the FDR.We propose a new class of powerful online testing procedures, where the rejection thresholds are learned sequentially by incorporating contextual information and previous results. We prove that any rule in this class controls online FDR under some standard assumptions. We then focus on a subclass of these procedures, based on weighting the rejection thresholds, to derive a practical algorithm that learns a parametric weight function in an online fashion to gain more discoveries. We also theoretically prove that our proposed procedures, under some easily verifiable assumptions, would lead to an increase of statistical power over a popular online testing procedure proposed by Javanmard and Montanari (2018). Finally, we demonstrate the superior performance of our procedure, by comparing it to state-of-the-art online multiple testing procedures, on both synthetic data and real data generated from different applications.

Efficient Spectrum-Revealing CUR Matrix Decomposition

Wed, 03 Jun 2020 00:00:00 +0000

The CUR matrix decomposition is an important tool for low-rank matrix approximation. It approximates a data matrix though selecting a small number of columns and rows of the matrix. Those CUR algorithms with gap-dependent approximation bounds can obtain high approximation quality for matrices with good singular value spectrum decay, but they have impractically high time complexities. In this paper, we propose a novel CUR algorithm based on truncated LU factorization with an efficient variant of complete pivoting. Our algorithm has gap-dependent approximation bounds on both spectral and Frobenius norms while maintaining high efficiency. Numerical experiments demonstrate the effectiveness of our algorithm and verify our theoretical guarantees.

The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits

Wed, 03 Jun 2020 00:00:00 +0000

We consider a decentralized multi-agent Multi Armed Bandit (MAB) setup consisting of $N$ agents, solving the same MAB instance to minimize individual cumulative regret. In our model, agents collaborate by exchanging messages through pairwise gossip style communications. We develop two novel algorithms, where each agent only plays from a subset of all the arms. Agents use the communication medium to recommend only arm-IDs (not samples), and thus update the set of arms from which they play. We establish that, if agents communicate $\Omega(\log(T))$ times through any connected pairwise gossip mechanism, then every agent’s regret is a factor of order $N$ smaller compared to the case of no collaborations. Furthermore, we show that the communication constraints only have a second order effect on the regret of our algorithm. We then analyze this second order term of the regret to derive bounds on the regret-communication tradeoffs. Finally, we empirically evaluate our algorithm and conclude that the insights are fundamental and not artifacts of our bounds. We also show a lower bound which gives that the regret scaling obtained by our algorithm cannot be improved even in the absence of any communication constraints. Our results demonstrate that even a minimal level of collaboration among agents greatly reduces regret for all agents.

Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection

Wed, 03 Jun 2020 00:00:00 +0000

Hierarchical Clustering is an unsupervised data analysis method which has been widely used for decades. Despite its popularity, it had an underdeveloped analytical foundation and to address this, Dasgupta recently introduced an optimization viewpoint of hierarchical clustering with pairwise similarity information that spurred a line of work shedding light on old algorithms (e.g., Average-Linkage), but also designing new algorithms. Here, for the maximization dual of Dasgupta’s objective (introduced by Moseley-Wang), we present polynomial-time 42.46% approximation algorithms that use Max-Uncut Bisection as a subroutine. The previous best worst-case approximation factor in polynomial time was 33.6%, improving only slightly over Average-Linkage which achieves 33.3%. Finally, we complement our positive results by providing APX-hardness (even for 0-1 similarities), under the Small Set Expansion hypothesis.

Learning Gaussian Graphical Models via Multiplicative Weights

Wed, 03 Jun 2020 00:00:00 +0000

Graphical model selection in Markov random fields is a fundamental problem in statistics and machine learning. Two particularly prominent models, the Ising model and Gaussian model, have largely developed in parallel using different (though often related) techniques, and several practical algorithms with rigorous sample complexity bounds have been established for each. In this paper, we adapt a recently proposed algorithm of Klivans and Meka (FOCS, 2017), based on the method of multiplicative weight updates, from the Ising model to the Gaussian model, via non-trivial modifications to both the algorithm and its analysis. The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature, has a low runtime $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.

OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits

Wed, 03 Jun 2020 00:00:00 +0000

We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for their alternate regime. We design a single computationally efficient algorithm that simultaneously obtains problem-dependent optimal regret rates in the simple multi-armed bandit regime and minimax optimal regret rates in the linear contextual bandit regime, without knowing a priori which of the two models generates the rewards. These results are proved under the condition of stochasticity of contextual information over multiple rounds. Our results should be viewed as a step towards principled data-dependent policy class selection for contextual bandits.

Langevin Monte Carlo without smoothness

Wed, 03 Jun 2020 00:00:00 +0000

Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant. The nonasymptotic dependence of its mixing time on the dimension and target accuracy is understood mainly in the setting of smooth (gradient-Lipschitz) log-densities, a serious limitation for applications in machine learning. In this paper, we remove this limitation, providing polynomial-time convergence guarantees for a variant of LMC in the setting of nonsmooth log-concave distributions. At a high level, our results follow by leveraging the implicit smoothing of the log-density that comes from a small Gaussian perturbation that we add to the iterates of the algorithm and controlling the bias and variance that are induced by this perturbation.

Entropy Weighted Power k-Means Clustering

Wed, 03 Jun 2020 00:00:00 +0000

Despite its well-known shortcomings, k-means remains one of the most widely used approaches to data clustering. Current research continues to tackle its flaws while attempting to preserve its simplicity. Recently, the power k-means algorithm was proposed to avoid poor local minima by annealing through a family of smoother surfaces. However, the approach lacks statistical guarantees and fails in high dimensions when many features are irrelevant. This paper addresses these issues by introducing entropy regularization to learn feature relevance while annealing. We prove consistency of the proposed approach and derive a scalable majorization-minimization algorithm that enjoys closed-form updates and convergence guarantees. In particular, our method retains the same computational complexity of k-means and power k-means, but yields significant improvements over both. Its merits are thoroughly assessed on a suite of real and synthetic data.

Stochastic Bandits with Delay-Dependent Payoffs

Wed, 03 Jun 2020 00:00:00 +0000

Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{\scO}\big(\!\sqrt{kT}\big)$, where $k$ is the number of arms and $T$ is time. Our algorithm uses only $\scO\big(k\ln\ln T)$ switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.

Budget-Constrained Bandits over General Cost and Reward Distributions

Wed, 03 Jun 2020 00:00:00 +0000

We consider a budget-constrained bandit problem where each arm pull incurs a random cost, and yields a random reward in return. The objective is to maximize the total expected reward under a budget constraint on the total cost. The model is general in the sense that it allows correlated and potentially heavy-tailed cost-reward pairs that can take on negative values as required by many applications. We show that if moments of order $(2+\gamma)$ for some $\gamma > 0$ exist for all cost-reward pairs, $O(\log B)$ regret is achievable for a budget $B>0$. In order to achieve tight regret bounds, we propose algorithms that exploit the correlation between the cost and reward of each arm by extracting the common information via linear minimum mean-square error estimation. We prove a regret lower bound for this problem, and show that the proposed algorithms achieve tight problem-dependent regret bounds, which are optimal up to a universal constant factor in the case of jointly Gaussian cost and reward pairs.

PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures

Wed, 03 Jun 2020 00:00:00 +0000

Persistence diagrams, the most common descriptors of Topological Data Analysis, encode topological properties of data and have already proved pivotal in many different applications of data science. However, since the metric space of persistence diagrams is not Hilbert, they end up being difficult inputs for most Machine Learning techniques. To address this concern, several vectorization methods have been put forward that embed persistence diagrams into either finite-dimensional Euclidean space or implicit infinite dimensional Hilbert space with kernels. In this work, we focus on persistence diagrams built on top of graphs. Relying on extended persistence theory and the so-called heat kernel signature, we show how graphs can be encoded by (extended) persistence diagrams in a provably stable way. We then propose a general and versatile framework for learning vectorizations of persistence diagrams, which encompasses most of the vectorization techniques used in the literature. We finally showcase the experimental strength of our setup by achieving competitive scores on classification tasks on real-life graph datasets.

Semi-Modular Inference: enhanced learning in multi-modular models by tempering the influence of components

Wed, 03 Jun 2020 00:00:00 +0000

Bayesian statistical inference loses predictive optimality when generative models are misspecified.Working within an existing coherent loss-based generalisation of Bayesian inference, we show existing Modular/Cut-model inference is coherent, and write down a new family of Semi-Modular Inference (SMI) schemes, indexed by an influence parameter, with Bayesian inference and Cut-models as special cases. We give a meta-learning criterion and estimation procedure to choose the inference scheme. This returns Bayesian inference when there is no misspecification.The framework applies naturally to Multi-modular models. Cut-model inference allows directed information flow from well-specified modules to misspecified modules, but not vice versa. An existing alternative power posterior method gives tunable but undirected control of information flow, improving prediction in some settings. In contrast, SMI allows \emph{tunable and directed} information flow between modules.We illustrate our methods on two standard test cases from the literature and a motivating archaeological data set.

Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer

Wed, 03 Jun 2020 00:00:00 +0000

In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. Applied on language modelling, our regularizer expresses the inductive bias that sequence variables should have high mutual information even though the model might not see abundant observations for complex long-range dependency. We show how the “next sentence prediction (classification)" heuristic can be derived in a principled way from our mutual information estimation framework, and be further extended to maximize the mutual information of sequence variables. The proposed approach not only is effective at increasing the mutual information of segments under the learned model but more importantly, leads to a higher likelihood on holdout data, and improved generation quality. Code is releasedat https://github.com/BorealisAI/BMI.

Conditional Linear Regression

Wed, 03 Jun 2020 00:00:00 +0000

Work in machine learning and statistics commonly focuses on building models that capture the vast majority of data, possibly ignoring a segment of the population as outliers. However, there may not exist a good, simple model for the distribution, so we seek to find a small subset where there exists such a model. We give a computationally efficient algorithm with theoretical analysis for the conditional linear regression task, which is the joint task of identifying a significant portion of the data distribution, described by a k-DNF, along with a linear predictor on that portion with a small loss. In contrast to work in robust statistics on small subsets, our loss bounds do not feature a dependence on the density of the portion we fit, and compared to previous work on conditional linear regression, our algorithm’s running time scales polynomially with the sparsity of the linear predictor. We also demonstrate empirically that our algorithm can leverage this advantage to obtain a k-DNF with a better linear predictor in practice.

Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of robust matrix completion, which aims to recover a low rank matrix $L_*$ and a sparse matrix $S_*$ from incomplete observations of their sum $M=L_*+S_*\in\mathbb{R}^{m\times n}$.Algorithmically, the robust matrix completion problem is transformed into a problem of solving a system of nonlinear equations,and the alternative direction method is then used to solve the nonlinear equations.In addition, the algorithm is highly parallelizable and suitable for large scale problems.Theoretically, we characterize the sufficient conditions for when $L_*$ can be approximated by a low rank approximation of the observed $M_*$.And under proper assumptions, it is shown that the algorithm converges to the true solution linearly.Numerical simulations show that the simple method works as expected and is comparable with state-of-the-art methods.

An Inverse-free Truncated Rayleigh-Ritz Method for Sparse Generalized Eigenvalue Problem

Wed, 03 Jun 2020 00:00:00 +0000

This paper considers the sparse generalized eigenvalue problem (SGEP), which aims to find the leading eigenvector with at most $k$ nonzero entries. SGEP naturally arises in many applications in machine learning, statistics, and scientific computing, for example, the sparse principal component analysis (SPCA), the sparse discriminant analysis (SDA), and the sparse canonical correlation analysis (SCCA). In this paper, we focus on the development of a three-stage algorithm named {\em inverse-free truncated Rayleigh-Ritz method} ({\em IFTRR}) to efficiently solve SGEP. In each iteration of IFTRR, only a small number of matrix-vector products is required. This makes IFTRR well-suited for large scale problems. Particularly, a new truncation strategy is proposed, which is able to find the support set of the leading eigenvector effectively. Theoretical results are developed to explain why IFTRR works well. Numerical simulations demonstrate the merits of IFTRR.

Approximate Inference in Discrete Distributions with Monte Carlo Tree Search and Value Functions

Wed, 03 Jun 2020 00:00:00 +0000

Exact probabilistic inference in discrete models is often prohibitively expensive, as it may require evaluating the (unnormalized) target density on its entire domain. Here we consider the setting where only a limited budget of calls to the unnormalized target density oracle is available, raising the challenge of where in its domain to allocate these function calls in order to construct a good approximate solution. We formulate this problem as an instance of sequential decision-making under uncertainty and leverage methods from reinforcement learning for probabilistic inference with budget constraints. In particular, we propose the TreeSample algorithm, an adaptation of Monte Carlo Tree Search to approximate inference. This algorithm caches all previous queries to the density oracle in an explicit search tree, and dynamically allocates new queries based on a "best-first" heuristic for exploration, using existing upper confidence bound methods. Our non-parametric inference method can be effectively combined with neural networks that compile approximate conditionals of the target, which are then used to guide the inference search and enable generalization of across multiple target distributions. We show empirically that TreeSample outperforms standard approximate inference methods on synthetic factor graphs.

Kernels over Sets of Finite Sets using RKHS Embeddings, with Application to Bayesian (Combinatorial) Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We focus on kernel methods for set-valued inputs and their application to Bayesian set optimization, notably combinatorial optimization. We investigate two classes of set kernels that both rely on Reproducing Kernel Hilbert Space embeddings, namely the "Double Sum" (DS) kernels recently considered in Bayesian set optimization, and a class introduced here called "Deep Embedding" (DE) kernels that essentially consists in applying a radial kernel on Hilbert space on top of the canonical distance induced by another kernel such as a DS kernel. We establish in particular that while DS kernels typically suffer from a lack of strict positive definiteness, vast subclasses of DE kernels built upon DS kernels do possess this property, enabling in turn combinatorial optimization without requiring to introduce a jitter parameter. Proofs of theoretical results about considered kernels are complemented by a few practicalities regarding hyperparameter fitting. We furthermore demonstrate the applicability of our approach in prediction and optimization tasks, relying both on toy examples and on two test cases from mechanical engineering and hydrogeology, respectively. Experimental results highlight the applicability and compared merits of the considered approaches while opening new perspectives in prediction and sequential design with set inputs.

Utility/Privacy Trade-off through the lens of Optimal Transport

Wed, 03 Jun 2020 00:00:00 +0000

Strategic information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to increase some utility. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that optimize a natural trade-off between both objectives.We formalize this as an optimization problem where the objective mapping is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.

Nonparametric Estimation in the Dynamic Bradley-Terry Model

Wed, 03 Jun 2020 00:00:00 +0000

We propose a time-varying generalization of the Bradley-Terry model that allows for nonparametric modeling of dynamic global rankings of distinct teams. We develop a novel estimator that relies on kernel smoothing to pre-process the pairwise comparisons over time and is applicable in sparse settings where the Bradley-Terry may not be fit. We obtain necessary and sufficient conditions for the existence and uniqueness of our estimator. We also derive time-varying oracle bounds for both the estimation error and the excess risk in the model-agnostic setting where the Bradley-Terry model is not necessarily the true data generating process. We thoroughly test the practical effectiveness of our model using both simulated and real world data and suggest an efficient data-driven approach for bandwidth tuning.

Learnable Bernoulli Dropout for Bayesian Deep Learning

Wed, 03 Jun 2020 00:00:00 +0000

In this work, we propose learnable Bernoulli dropout (LBD), a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters. By probabilistic modeling of Bernoulli dropout, our method enables more robust prediction and uncertainty quantification in deep models. Especially, when combined with variational auto-encoders (VAEs), LBD enables flexible semi-implicit posterior representations, leading to new semi-implicit VAE (SIVAE) models. We solve the optimization for training with respect to the dropout parameters using Augment-REINFORCE-Merge (ARM), an unbiased and low-variance gradient estimator. Our experiments on a range of tasks show the superior performance of our approach compared with other commonly used dropout schemes. Overall, LBD leads to improved accuracy and uncertainty estimates in image classification and semantic segmentation. Moreover, using SIVAE, we can achieve state-of-the-art performance on collaborative filtering for implicit feedback on several public datasets.

Corruption-Tolerant Gaussian Process Bandit Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of optimizing an unknown (typically non-convex) function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS), based on noisy bandit feedback. We consider a novel variant of this problem in which the point evaluations are not only corrupted by random noise, but also adversarial corruptions. We introduce an algorithm Fast-Slow GP-UCB based on Gaussian process methods, randomized selection between two instances labeled ’fast’ (but non-robust) and ’slow’ (but robust), enlarged confidence bounds, and the principle of optimism under uncertainty. We present a novel theoret- ical analysis upper bounding the cumulative regret in terms of the corruption level, the time horizon, and the underlying kernel, and we argue that certain dependencies cannot be improved. We observe that distinct algorithmic ideas are required depending on whether one is required to perform well in both the corrupted and non-corrupted settings, and whether the corruption level is known or not.

Adversarial Robustness Guarantees for Classification with Gaussian Processes

Wed, 03 Jun 2020 00:00:00 +0000

We investigate adversarial robustness of Gaussian Process classification (GPC) models. Specifically, given a compact subset of the input space $T\subseteq \mathbb{R}^d$ enclosing a test point $x^*$ and a GPC trained on a dataset $\mathcal{D}$, we aim to compute the minimum and the maximum classification probability for the GPC over all the points in $T$.In order to do so, we show how functions lower- and upper-bounding the GPC output in $T$ can be derived, and implement those in a branch and bound optimisation algorithm. For any error threshold $\epsilon > 0$ selected \emph{a priori}, we show that our algorithm is guaranteed to reach values $\epsilon$-close to the actual values in finitely many iterations.We apply our method to investigate the robustness of GPC models on a 2D synthetic dataset, the SPAM dataset and a subset of the MNIST dataset, providing comparisons of different GPC training techniques, and show how our method can be used for interpretability analysis. Our empirical analysis suggests that GPC robustness increases with more accurate posterior estimation.

Statistical and Computational Rates in Graph Logistic Regression

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of graph logistic regression, based on partial observation of a large network, and on side information associated to its vertices. The generative model is formulated as a matrix logistic regression. The performance of the model is analyzed in a high-dimensional regime under a structural assumption. The optimal statistical rates are derived, and an estimator based on penalized maximum likelihood is shown to attain it. The algorithmic aspects of this problem are also studied, and optimal rates under computational constraints are derived, and shown to differ from the information-theoretic rates - under a complexity assumption.

Ordering-Based Causal Structure Learning in the Presence of Latent Variables

Wed, 03 Jun 2020 00:00:00 +0000

We consider the task of learning a causal graph in the presence of latent confounders given i.i.d.samples from the model. While current algorithms for causal structure discovery in the presence of latent confounders are constraint-based, we here propose a hybrid approach. We prove that under assumptions weaker than faithfulness, any sparsest independence map (IMAP) of the distribution belongs to the Markov equivalence class of the true model. This motivates the Sparsest Poset formulation - that posets can be mapped to minimal IMAPs of the true model such that the sparsest of these IMAPs is Markov equivalent to the true model. Motivated by this result, we propose a greedy algorithm over the space of posets for causal structure discovery in the presence of latent confounders and compare its performance to the current state-of-the-art algorithms FCI and FCI+ on synthetic data.

Non-exchangeable feature allocation models with sublinear growth of the feature sizes

Wed, 03 Jun 2020 00:00:00 +0000

Feature allocation models are popular models used in different applications such as unsupervised learning or network modeling. In particular, the Indian buffet process is a flexible and simple one-parameter feature allocation model where the number of features grows unboundedly with the number of objects. The Indian buffet process, like most feature allocation models, satisfies a symmetry property of exchangeability: the distribution is invariant under permutation of the objects. While this property is desirable in some cases, it has some strong implications. Importantly, the number of objects sharing a particular feature grows linearly with the number of objects. In this article, we describe a class of non-exchangeable feature allocation models where the number of objects sharing a given feature grows sublinearly, where the rate can be controlled by a tuning parameter. We derive the asymptotic properties of the model, and show that such models provides a better fit and better predictive performances on various datasets.

Minimax Bounds for Structured Prediction Based on Factor Graphs

Wed, 03 Jun 2020 00:00:00 +0000

Structured prediction can be considered as a generalization of many standard supervised learning tasks, and is usually thought as a simultaneous prediction of multiple labels. One standard approach is to maximize a score function on the space of labels, which usually decomposes as a sum of unary and pairwise potentials, each depending on one or two specific labels, respectively.For this approach, several learning and inference algorithms have been proposed over the years, ranging from exact to approximate methods while balancing the computational complexity.However, in contrast to binary and multiclass classification, results on the necessary number of samples for achieving learning are still limited, even for a specific family of predictors such as factor graphs.In this work, we provide minimax lower bounds for a class of general factor-graph inference models in the context of structured prediction.That is, we characterize the necessary sample complexity for any conceivable algorithm to achieve learning of general factor-graph predictors.

Private Protocols for U-Statistics in the Local Model and Beyond

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we study the problem of computing $U$-statistics of degree $2$, i.e., quantities that come in the form of averages over pairs of data points, in the local model of differential privacy (LDP). The class of $U$-statistics covers many statistical estimates of interest, including Gini mean difference, Kendall’s tau coefficient and Area under the ROC Curve (AUC), as well as empirical risk measures for machine learning problems such as ranking, clustering and metric learning. We first introduce an LDP protocol based on quantizing the data into bins and applying randomized response, which guarantees an $\epsilon$-LDP estimate with a Mean Squared Error (MSE) of $O(1/\sqrt{n}\epsilon)$ under regularity assumptions on the $U$-statistic or the data distribution. We then propose a specialized protocol for AUC based on a novel use of hierarchical histograms that achieves MSE of $O(\alpha^3/n\epsilon^2)$ for arbitrary data distribution. We also show that 2-party secure computation allows to design a protocol with MSE of $O(1/n\epsilon^2)$, without any assumption on the kernel function or data distribution and with total communication linear in the number of users $n$. Finally, we evaluate the performance of our protocols through experiments on synthetic and real datasets.

Tighter Theory for Local SGD on Identical and Heterogeneous Data

Wed, 03 Jun 2020 00:00:00 +0000

We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. The tightness of our results is guaranteed by recovering known statements when we plug $H=1$, where $H$ is the number of local steps. The empirical evidence further validates the severe impact of data heterogeneity on the performance of local SGD.

Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems

Wed, 03 Jun 2020 00:00:00 +0000

Reinforcement learning is a promising approach to learning robotics controllers. It has recently been shown that algorithms based on finite-difference estimates of the policy gradient are competitive with algorithms based on the policy gradient theorem. We propose a theoretical framework for understanding this phenomenon. Our key insight is that many dynamical systems (especially those of interest in robotics control tasks) are nearly deterministic—i.e., they can be modeled as a deterministic system with a small stochastic perturbation. We show that for such systems, finite-difference estimates of the policy gradient can have substantially lower variance than estimates based on the policy gradient theorem. Finally, we empirically evaluate our insights in an experiment on the inverted pendulum.

RelatIF: Identifying Explanatory Training Samples via Relative Influence

Wed, 03 Jun 2020 00:00:00 +0000

In this work, we focus on the use of influence functions to identify relevant training examples that one might hope “explain” the predictions of a machine learning model. One shortcoming of influence functions is that the training examples deemed most “influential” are often outliers or mislabelled, making them poor choices for explanation. In order to address this shortcoming, we separate the role of global versus local influence. We introduce RelatIF, a new class of criteria for choosing relevant training examples by way of an optimization objective that places a constraint on global influence. RelatIF considers the local influence that an explanatory example has on a prediction relative to its global effects on the model. In empirical evaluations, we find that the examples returned by RelatIF are more intuitive when compared to those found using influence functions.

Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification

Wed, 03 Jun 2020 00:00:00 +0000

Complex classification performance metrics such as the F-measure and Jaccard index are often used, in order to handle class-imbalanced cases such as information retrieval and image segmentation. These performance metrics are not decomposable, that is, they cannot be expressed in a per-example manner, which hinders a straightforward application of M-estimation widely used in supervised learning. In this paper, we consider linear-fractional metrics, which are a family of classification performance metrics that encompasses many standard ones such as the F-measure and Jaccard index, and propose methods to directly maximize performances under those metrics. A clue to tackle their direct optimization is a calibrated surrogate utility, which is a tractable lower bound of the true utility function representing a given metric. We characterize sufficient conditions which make the surrogate maximization coincide with the maximization of the true utility. Simulation results on benchmark datasets validate the effectiveness of our calibrated surrogate maximization especially if the sample sizes are extremely small.

Hypothesis Testing Interpretations and Renyi Differential Privacy

Wed, 03 Jun 2020 00:00:00 +0000

Differential privacy is a de facto standard in data privacy, with applicationsin the public and private sectors. One way of explaining differential privacy,which is particularly appealing to statistician and social scientists, is bymeans of its statistical hypothesis testing interpretation. Informally, onecannot effectively test whether a specific individual has contributed her databy observing the output of a private mechanism—any test cannot have bothhigh significance and high power.In this paper, we identify some conditions under which a privacy definition given in terms of a statistical divergence satisfies a similar interpretation.These conditions are useful to analyze the distinguishing power of divergencesand we use them to study the hypothesis testing interpretation of somerelaxations of differential privacy based on Renyi divergence. Ouranalysis also results in an improved conversion rule between these definitionsand differential privacy.

Adversarial Risk Bounds through Sparsity based Compression

Wed, 03 Jun 2020 00:00:00 +0000

Neural networks have been shown to be vulnerable against minor adversarial perturbations of their inputs, especially for high dimensional data under $\ell_\infty$ attacks.To combat this problem, techniques like adversarial training have been employed to obtain models that are robust on the training set.However, the robustness of such models against adversarial perturbations may not generalize to unseen data.To study how robustness generalizes, recent works assume that the inputs have bounded $\ell_2$-norm in order to bound the adversarial risk for $\ell_\infty$ attacks with no explicit dimension dependence.In this work, we focus on $\ell_\infty$ attacks with $\ell_\infty$ bounded inputs and prove margin-based bounds.Specifically, we use a compression-based approach that relies on efficiently compressing the set of tunable parameters without distorting the adversarial risk. To achieve this, we apply the concept of effective sparsity and effective joint sparsity on the weight matrices of neural networks.This leads to bounds with no explicit dependence on the input dimension, neither on the number of classes.Our results show that neural networks with approximately sparse weight matrices not only enjoy enhanced robustness but also better generalization. Finally, empirical simulations show that the notion of effective joint sparsity plays a significant role in generalizing robustness to $\ell_\infty$ attacks.

How To Backdoor Federated Learning

Wed, 03 Jun 2020 00:00:00 +0000

Federated models are created by aggregating model updates submittedby participants. To protect confidentiality of the training data,the aggregator by design has no visibility into how these updates aregenerated. We show that this makes federated learning vulnerable to amodel-poisoning attack that is significantly more powerful than poisoningattacks that target only the training data.A single or multiple malicious participants can use modelreplacement to introduce backdoor functionality into the joint model,e.g., modify an image classifier so that it assigns an attacker-chosenlabel to images with certain features, or force a word predictor tocomplete certain sentences with an attacker-chosen word. We evaluatemodel replacement under different assumptions for the standardfederated-learning tasks and show that it greatly outperformstraining-data poisoning.Federated learning employs secure aggregation to protect confidentialityof participants’ local models and thus cannot detect anomalies inparticipants’ contributions to the joint model. To demonstrate thatanomaly detection would not have been effective in any case, we alsodevelop and evaluate a generic constrain-and-scale technique thatincorporates the evasion of defenses into the attacker’s loss functionduring training.

A Tight and Unified Analysis of Gradient-Based Methods for a Whole Spectrum of Differentiable Games

Wed, 03 Jun 2020 00:00:00 +0000

We consider differentiable games where the goal is to find a Nash equilibrium. The machine learning community has recently started using variants of the gradient method (GD). Prime examples are extragradient (EG), the optimistic gradient method (OG) and consensus optimization (CO) which enjoy linear convergence in cases like bilinear games, where the standard GD fails. The full benefits of theses relatively new methods are not known as there is no unified analysis for both strongly monotone and bilinear games. We provide new analysis of the EG’s local and global convergence properties and use is to get a tighter global convergence rate for OG and CO. Our analysis covers the whole range of settings between bilinear and strongly monotone games. It reveals that these methods converges via different mechanisms at these extremes; in between, it exploits the most favorable mechanism for the given problem. We then prove that EG achieves the optimal rate for a wide class of algorithms with any number of extrapolations. Our tight analysis of EG’s convergence rate in games shows that, unlike in convex minimization, EG may be much faster than GD.

Accelerating Smooth Games by Manipulating Spectral Shapes

Wed, 03 Jun 2020 00:00:00 +0000

We use matrix iteration theory to characterize acceleration in smooth games. We define the spectral shape of a family of games as the set containing all eigenvalues of the Jacobians of standard gradient dynamics in the family. Shapes restricted to the real line represent well-understood classes of problems, like minimization. Shapes spanning the complex plane capture the added numerical challenges in solving smooth games. In this framework, we describe gradient-based methods, such as extragradient, as transformations on the spectral shape. Using this perspective, we propose an optimal algorithm for bilinear games. For smooth and strongly monotone operators, we identify a continuum between convex minimization, where acceleration is possible using Polyak’s momentum, and the worst case where gradient descent is optimal. Finally, going beyond first-order methods, we propose an accelerated version of consensus optimization.

Equalized odds postprocessing under imperfect group information

Wed, 03 Jun 2020 00:00:00 +0000

Most approaches aiming to ensure a model’s fairness with respect to a protected attribute (such as gender or race) assume to know the true value of the attribute for every data point. In this paper, we ask to what extent fairness interventions can be effective even when only imperfect information about the protected attribute is available. In particular, we study the prominent equalized odds postprocessing method of Hardt et al. (2016) under a perturbation of the attribute. We identify conditions on the perturbation that guarantee that the bias of a classifier is reduced even by running equalized odds with the perturbed attribute. We also study the error of the resulting classifier. We empirically observe that under our identified conditions most often the error does not suffer from a perturbation of the protected attribute. For a special case, we formally prove this observation to be true.

Almost-Matching-Exactly for Treatment Effect Estimation under Network Interference

Wed, 03 Jun 2020 00:00:00 +0000

We propose a matching method that recovers direct treatment effects from randomized experiments where units are connected in an observed network, and units that share edges can potentially influence each others’ outcomes. Traditional treatment effect estimators for randomized experiments are biased and error prone in this setting. Our method matches units almost exactly on counts of unique subgraphs within their neighborhood graphs. The matches that we construct are interpretable and high-quality. Our method can be extended easily to accommodate additional unit-level covariate information. We show empirically that our method performs better than other existing methodologies for this problem, while producing meaningful, interpretable results.

Multi-attribute Bayesian optimization with interactive preference learning

Wed, 03 Jun 2020 00:00:00 +0000

We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker (DM) whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes and each vector of attributes is assigned a utility by the DM’s utility function, which may be learned approximately using preferences expressed over pairs of attribute vectors. Past work has used a point estimate of this utility function as if it were error-free within single-objective optimization. However, utility estimation errors may yield a poor suggested design. Furthermore, this approach produces a single suggested ‘best’ design, whereas DMs often prefer to choose from a menu. We propose a novel multi-attribute Bayesian optimization with preference learning approach. Our approach acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate that are good not just for a single estimated utility function but a range of likely ones. The outcome of our approach is a menu of designs and evaluated attributes from which the DM makes a final selection. We demonstrate the value and flexibility of our approach in a variety of experiments.

Naive Feature Selection: Sparsity in Naive Bayes

Wed, 03 Jun 2020 00:00:00 +0000

Due to its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our bound becomes tight as the marginal contribution of additional features decreases. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra relative cost compared to the classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, l_1-penalized logistic regression and LASSO, while being orders of magnitude faster. For a large data set, having more than with 1.6 million training points and about 12 million features, and with a non-optimized CPU implementation, our sparse naive Bayes model can be trained in less than 15 seconds.

An approximate KLD based experimental design for models with intractable likelihoods

Wed, 03 Jun 2020 00:00:00 +0000

Data collection is a critical step in statistical inference and data science,and the goal of statistical experimental design (ED) is to find the data collection setupthat can provide most information for the inference. In this work we consider a special type of ED problems where the likelihoods are not available in a closed form. In this case, the popular information-theoretic Kullback-Leibler divergence (KLD) based design criterioncan not be used directly, as it requires to evaluate the likelihood function. To address the issue, we derive a new utility function,which is a lower bound of the original KLD utility. This lower bound is expressed in terms of the summation of two or more entropies in the data space, and thus can be evaluated efficiently via entropy estimation methods.We provide several numerical examples to demonstrate the performance of the proposed method.

On the Completeness of Causal Discovery in the Presence of Latent Confounding with Tiered Background Knowledge

Wed, 03 Jun 2020 00:00:00 +0000

The discovery of causal relationships is a core part of scientific research. Accordingly, over the past several decades, algorithms have been developed to discover the causal structure for a system of variables from observational data. Learning ancestral graphs is of particular interest due to their ability to represent latent confounding implicitly with bi-directed edges. The well-known FCI algorithm provably recovers an ancestral graph for a system of variables encoding the sound and complete set of causal relationships identifiable from observational data. Additional causal relationships become identifiable with the incorporation of background knowledge; however, it is not known for what types of knowledge FCI remains complete. In this paper, we define tiered background knowledge and show that FCI is sound and complete with the incorporation of this knowledge.

A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms

Wed, 03 Jun 2020 00:00:00 +0000

We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes. We demonstrate its effectiveness by presenting simple and unified proofs of convergence for a variety of commonly-used methods. We show that value-based methods such as TD(?) and Q-Learning have update rules which are contractive in the space of distributions of functions, thus establishing their exponentially fast convergence to a stationary distribution. We demonstrate that the stationary distribution obtained by any algorithm whose target is an expected Bellman update has a mean which is equal to the true value function. Furthermore, we establish that the distributions concentrate around their mean as the step-size shrinks. We further analyse the optimistic policy iteration algorithm, for which the contraction property does not hold, and formulate a probabilistic policy improvement property which entails the convergence of the algorithm.

Derivative-Free & Order-Robust Optimisation

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we formalise order-robust optimisation as an instance of online learning minimising simple regret, and propose Vroom, a zero’th order optimisation algorithm capable of achieving vanishing regret in non-stationary environments, while recovering favorable rates under stochastic reward-generating processes. Our results are the first to target simple regret definitions in adversarial scenarios unveiling a challenge that has been rarely considered in prior work.

Understanding the Effects of Batching in Online Active Learning

Wed, 03 Jun 2020 00:00:00 +0000

Online active learning (AL) algorithms often assume immediate access to a label once a query has been made. However, due to practical constraints, the labels of these queried examples are generally only available in “batches”. In this work, we present an analysis for a generic class of batch online AL algorithms, which reveals that the effects of batching are in fact mild and only result in an additional label complexity term that is quasilinear in the batch size. To our knowledge, this provides the first theoretical justification for such algorithms and we show how they can be applied to batch variants of three canonical online AL algorithms: IWAL, ORIWAL, and DHM. Finally, we also present empirical results across several benchmark datasets that corroborate these theoretical insights.

Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces

Wed, 03 Jun 2020 00:00:00 +0000

This paper focuses on the problem of unsupervised alignment of hierarchical data such as ontologies or lexical databases. This problem arises across areas, from natural language processing to bioinformatics, and is typically solved by appeal to outside knowledge bases and label-textual similarity. In contrast, we approach the problem from a purely geometric perspective: given only a vector-space representation of the items in the two hierarchies, we seek to infer correspondences across them. Our work derives from and interweaves hyperbolic-space representations for hierarchical data, on one hand, and unsupervised word-alignment methods, on the other. We first provide a set of negative results showing how and why Euclidean methods fail in this hyperbolic setting. We then propose a novel approach based on optimal transport over hyperbolic spaces, and show that it outperforms standard embedding alignment techniques in various experiments on cross-lingual WordNet alignment and ontology matching tasks.

A Continuous-time Perspective for Modeling Acceleration in Riemannian Optimization

Wed, 03 Jun 2020 00:00:00 +0000

We propose a novel second-order ODE as the continuous-time limit of a Riemannian accelerated gradient-based method on a manifold with curvature bounded from below. This ODE can be seen as a generalization of the ODE derived for Euclidean spaces, and can also serve as an analysis tool. We analyze the convergence behavior of this ODE for different types of functions, such as geodesically convex, strongly-convex and weakly-quasi-convex. We demonstrate how such an ODE can be discretized using a semi-implicit and Nesterov-inspired numerical integrator, that empirically yields stable algorithms which are faithful to the continuous-time analysis and exhibit accelerated convergence.

Fair Correlation Clustering

Wed, 03 Jun 2020 00:00:00 +0000

In this paper, we study correlation clustering under fairness constraints. Fair variants of $k$-median and $k$-center clustering have been studied recently, and approximation algorithms using a notion called fairlet decomposition have been proposed. We obtain approximation algorithms for fair correlation clustering under several important types of fairness constraints.Our results hinge on obtaining a fairlet decomposition for correlation clustering by introducing a novel combinatorial optimization problem. We define a fairlet decomposition with cost similar to the $k$-median cost and this allows us to obtain approximation algorithms for a wide range of fairness constraints. We complement our theoretical results with an in-depth analysis of our algorithms on real graphs where we show that fair solutions to correlation clustering can be obtained with limited increase in cost compared to the state-of-the-art (unfair) algorithms.

Causal Bayesian Optimization

Wed, 03 Jun 2020 00:00:00 +0000

This paper studies the problem of globally optimizing a variable of interest that is part of a causal model in which a sequence of interventions can be performed. This problem arises in biology, operational research, communications and, more generally, in all fields where the goal is to optimize an output metric of a system of interconnected nodes. Our approach combines ideas from causal inference, uncertainty quantification and sequential decision making. In particular, it generalizes Bayesian optimization, which treats the input variables of the objective function as independent, to scenarios where causal information is available. We show how knowing the causal graph significantly improves the ability to reason about optimal decision making strategies decreasing the optimization cost while avoiding suboptimal solutions. We propose a new algorithm called Causal Bayesian Optimization (CBO). CBO automatically balances two trade-offs: the classical exploration-exploitation and the new observation-intervention, which emerges when combining real interventional data with the estimated intervention effects computed via do-calculus. We demonstrate the practical benefits of this method in a synthetic setting and in two real-world applications.

Expressiveness and Learning of Hidden Quantum Markov Models

Wed, 03 Jun 2020 00:00:00 +0000

Extending classical probabilistic reasoning using the quantum mechanical view of probability has been of recent interest, particularly in the development of hidden quantum Markov models (HQMMs) to model stochastic processes. However, there has been little progress in characterizing the expressiveness of such models and learning them from data. We tackle these problems by showing that HQMMs are a special subclass of the general class of observable operator models (OOMs) that do not suffer from the negative probability problem by design. We also provide a feasible retraction-based learning algorithm for HQMMs using constrained gradient descent on the Stiefel manifold of model parameters. We demonstrate that this approach is faster and scales to larger models than previous learning algorithms.

On the Sample Complexity of Learning Sum-Product Networks

Wed, 03 Jun 2020 00:00:00 +0000

Sum-Product Networks (SPNs) can be regarded as a form of deep graphical models that compactly represent deeply factored and mixed distributions. An SPN is a rooted directed acyclic graph (DAG) consisting of a set of leaves (corresponding to base distributions), a set of sum nodes (which represent mixtures of their children distributions) and a set of product nodes (representing the products of its children distributions). In this work, we initiate the study of the sample complexity of PAC-learning the set of distributions that correspond to SPNs. We show that the sample complexity of learning tree structured SPNs with the usual type of leaves (i.e., Gaussian or discrete) grows at most linearly (up to logarithmic factors) with the number of parameters of the SPN.More specifically, we show that the class of distributions that corresponds to tree structured Gaussian SPNs with $k$ mixing weights and $e$ ($d$-dimensional Gaussian) leaves can be learned within Total Variation error $\epsilon$ using at most $\widetilde{O}(\frac{ed^2+k}{\epsilon^2})$ samples. A similar result holds for tree structured SPNs with discrete leaves. We obtain the upper bounds based on the recently proposed notion of distribution compression schemes. More specifically, we show that if a (base) class of distributions $\cF$ admits an “efficient” compression, then the class of tree structured SPNs with leaves from $\cF$ also admits an efficient compression.

Doubly Sparse Variational Gaussian Processes

Wed, 03 Jun 2020 00:00:00 +0000

The use of Gaussian process models is typically limited to datasets with a few tens of thousands of observations due to their complexity and memory footprint.The two most commonly used methods to overcome this limitation are 1) the variational sparse approximation which relies on inducing points and 2) the state-space equivalent formulation of Gaussian processes which can be seen as exploiting some sparsity in the precision matrix.In this work, we propose to take the best of both worlds: we show that the inducing point framework is still valid for state space models and that it can bring further computational and memory savings. Furthermore, we provide the natural gradient formulation for the proposed variational parameterisation.Finally, this work makes it possible to use the state-space formulation inside deep Gaussian process models as illustrated in one of the experiments.

Budget Learning via Bracketing

Wed, 03 Jun 2020 00:00:00 +0000

Conventional machine learning applications in the mobile/IoT setting transmit data to a cloud-server for predictions. Due to cost considerations (power, latency, monetary), it is desirable to minimise device-to-server transmissions. The budget learning (BL) problem poses the learner’s goal as minimising use of the cloud while suffering no discernible loss in accuracy, under the constraint that the methods employed be edge-implementable. We propose a new formulation for the BL problem via the concept of bracketings. Concretely, we propose to sandwich the cloud’s prediction, $g,$ via functions $h^-, h^+$ from a ‘simple’ class so that $h^- \le g \le h^+$ nearly always. On an instance $x$, if $h^+(x)=h^-(x)$, we leverage local processing, and bypass the cloud. We explore theoretical aspects of this formulation, providing PAC-style learnability definitions; associating the notion of budget learnability to approximability via brackets; and giving VC-theoretic analyses of their properties. We empirically validate our theory on real-world datasets, demonstrating improved performance over prior gating based methods.

Value Preserving State-Action Abstractions

Wed, 03 Jun 2020 00:00:00 +0000

Abstraction can improve the sample efficiency of reinforcement learning. However, the process of abstraction inherently discards information, potentially compromising an agent’s ability to represent high-value policies. To mitigate this, we here introduce combinations of state abstractions and options that are guaranteed to preserve representation of near-optimal policies. We first define $\phi$-relative options, a general formalism for analyzing the value loss of options paired with a state abstraction, and present necessary and sufficient conditions for $\phi$-relative options to preserve near-optimal behavior in any finite Markov Decision Process. We further show that, under appropriate assumptions, $\phi$-relative options can be composed to induce hierarchical abstractions that are also guaranteed to represent high-value policies.

Best-item Learning in Random Utility Models with Subset Choices

Wed, 03 Jun 2020 00:00:00 +0000

We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. We identify a new property of such a RUM, termed the minimum advantage, that helps in characterizing the complexity of separating pairs of items based on their relative win/loss empirical counts, and can be bounded as a function of the noise distribution alone. We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee of $O(\frac{n}{c^2\epsilon^2} \log \frac{k}{\delta})$ rounds to identify an $\epsilon$-optimal item with confidence $1-\delta$, when the worst case pairwise advantage in the RUM has sensitivity at least $c$ to the parameter gaps of items. Fundamental lower bounds on PAC sample complexity show that this is near-optimal in terms of its dependence on $n,k$ and $c$.