Proceedings of Machine Learning Research
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
Held in Virtual on 13-15 April 2021
Published as Volume 130 by the Proceedings of Machine Learning Research on 18 March 2021.
Volume Edited by:
Arindam Banerjee
Kenji Fukumizu
Series Editors:
Neil D. Lawrence
Mark Reid
http://proceedings.mlr.press/v130/
Sat, 08 May 2021 17:46:00 +0000
No-Regret Reinforcement Learning with Heavy-Tailed Rewards Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging techniques from robust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and show that they achieve near-optimal regret bounds in this setting. Our algorithms also naturally generalize to deep reinforcement learning applications; we instantiate Heavy-DQN as an example of this. We demonstrate that all of our algorithms outperform baselines on both synthetic MDPs and standard RL benchmarks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhuang21a.html
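The abstract above refers to "techniques from robust mean estimation" without naming a specific estimator. As a rough illustration only, the sketch below shows median-of-means, one standard robust mean estimator for heavy-tailed samples. It is an assumption that this estimator is representative; the function name `median_of_means` and its parameters are hypothetical and are not taken from the paper.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Median-of-means estimate of the mean of `samples`.

    Illustrative sketch (not the paper's algorithm): split the samples into
    `num_blocks` equal-sized consecutive blocks, average each block, and
    return the median of the block averages. Leftover samples that do not
    fill a whole block are ignored.
    """
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


# A single extreme reward distorts the plain average but not the
# median of the block averages.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]
plain_mean = median_of_means(rewards, 1)   # one block: the ordinary mean
robust_mean = median_of_means(rewards, 3)  # three blocks of two samples
```

With one block the estimator reduces to the ordinary sample mean, so the number of blocks controls how much robustness is traded for statistical efficiency.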
http://proceedings.mlr.press/v130/zhuang21a.html One-pass Stochastic Gradient Descent in overparametrized two-layer neural networks There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous work assumes that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparameterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmid’s inequality. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhu21d.html
http://proceedings.mlr.press/v130/zhu21d.html Taming heavy-tailed features by shrinkage In this work, we focus on a variant of the generalized linear model (GLM) called corrupted GLM (CGLM) with heavy-tailed features and responses. To robustify the statistical inference on this model, we propose to apply L4-norm shrinkage to the feature vectors in the low-dimensional regime and apply elementwise shrinkage to them in the high-dimensional regime. Under bounded fourth moment assumptions, we show that the maximum likelihood estimator (MLE) based on the shrunk data enjoys nearly the minimax optimal rate with an exponential deviation bound. Our simulations demonstrate that the proposed feature shrinkage significantly enhances the statistical performance in linear regression and logistic regression on heavy-tailed data. Finally, we apply our shrinkage principle to guard against mislabeling and image noise in the handwritten digit recognition problem. We add an L4-norm shrinkage layer to the original neural net and reduce the test misclassification rate by more than 30% in relative terms in the presence of mislabeling and image noise. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhu21c.html
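The two shrinkage operations named in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes that L4-norm shrinkage rescales a feature vector so that its L4 norm does not exceed a threshold, and that elementwise shrinkage clips each coordinate to a threshold in absolute value. The function names and the threshold parameter `tau` are hypothetical.

```python
import math


def l4_norm(x):
    """L4 norm of a vector: (sum_i |x_i|^4)^(1/4)."""
    return sum(abs(v) ** 4 for v in x) ** 0.25


def shrink_l4(x, tau):
    """Assumed L4-norm shrinkage: rescale x so that its L4 norm is at most tau.

    Vectors whose L4 norm is already within tau are returned unchanged.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    norm = l4_norm(x)
    if norm <= tau:
        return list(x)
    scale = tau / norm
    return [scale * v for v in x]


def shrink_elementwise(x, tau):
    """Assumed elementwise shrinkage: clip each coordinate to [-tau, tau]."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return [math.copysign(min(abs(v), tau), v) for v in x]


# A feature vector with one very large coordinate.
features = [3.0, -4.0, 12.0]
shrunk_whole = shrink_l4(features, 2.0)            # direction kept, length capped
shrunk_coords = shrink_elementwise(features, 1.0)  # each coordinate capped
```

Rescaling the whole vector preserves its direction, whereas clipping acts on each coordinate independently, which is consistent with the abstract assigning the two operations to the low-dimensional and high-dimensional regimes respectively.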
http://proceedings.mlr.press/v130/zhu21c.html Deep Fourier Kernel for Self-Attentive Point Processes We present a novel attention-based model for discrete event data to capture complex non-linear temporal dependence structures. We borrow the idea from the attention mechanism and incorporate it into the point processes’ conditional intensity function. We further introduce a novel score function using Fourier kernel embedding, whose spectrum is represented using neural networks, which drastically differs from the traditional dot-product kernel and can capture a more complex similarity structure. We establish our approach’s theoretical properties and demonstrate our approach’s competitive performance compared to the state-of-the-art for synthetic and real data. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhu21b.html
http://proceedings.mlr.press/v130/zhu21b.html Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhu21a.html
http://proceedings.mlr.press/v130/zhu21a.html Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness, or the lack thereof, remains an important issue that is inadequately addressed. In this paper, we provide a distributionally robust formulation of offline learning policy in tabular RL that aims to learn a policy from historical data (collected by some other behavior policy) that is robust to the future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value (i.e. how robust it is in a perturbed environment) of any given policy and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $O_P\left(1/\sqrt{n}\right)$ regret, meaning that with high probability, the policy learned from using $n$ training data points will be $O\left(1/\sqrt{n}\right)$ close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach compared to non-robust RL algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhou21d.html
http://proceedings.mlr.press/v130/zhou21d.html Principal Subspace Estimation Under Information Diffusion Let $\mathbf{A} = \mathbf{L}_0 + \mathbf{S}_0$, where $\mathbf{L}_0 \in \mathbb{R}^{d\times d}$ is low rank and $\mathbf{S}_0$ is a perturbation matrix. We study the principal subspace estimation of $\mathbf{L}_0$ through observations $\mathbf{y}_j = f(\mathbf{A})\mathbf{x}_j$, $j=1,\dots,n$, where $f:\mathbb{R}\rightarrow \mathbb{R}$ is an unknown polynomial and $\mathbf{x}_j$’s are i.i.d. random input signals. Such models are widely used in graph signal processing to model information diffusion dynamics over networks with applications in network topology inference and data analysis. We develop an estimation procedure based on nuclear norm penalization, and establish upper bounds on the principal subspace estimation error when $\mathbf{A}$ is the adjacency matrix of a random graph generated by $\mathbf{L}_0$. Our theory shows that when the signal strength is strong enough, the exact rank of $\mathbf{L}_0$ can be recovered. By applying our results to blind community detection, we show that consistency of spectral clustering can be achieved for some popular stochastic block models. Together with the experimental results, our theory shows that there is a fundamental limit to using the principal components obtained from diffused graph signals, an approach commonly adopted in current practice. Finally, under some structured perturbation $\mathbf{S}_0$, we build the connection between this model and the spiked covariance model and develop a new estimation procedure. We show that such estimators can be optimal under the minimax paradigm. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhou21c.html
http://proceedings.mlr.press/v130/zhou21c.html Towards Understanding the Behaviors of Optimal Deep Active Learning Algorithms Active learning (AL) algorithms may achieve better performance with fewer data because the model guides the data selection process. While many algorithms have been proposed, there is little study on what the optimal AL algorithm looks like, which would help researchers understand where their models fall short and iterate on the design. In this paper, we present a simulated annealing algorithm to search for this optimal oracle and analyze it for several tasks. We present qualitative and quantitative insights into the behaviors of this oracle, comparing and contrasting them with those of various heuristics. Moreover, we are able to consistently improve the heuristics using one particular insight. We hope that our findings can better inform future active learning research. The code is available at https://github.com/YilunZhou/optimal-active-learning. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhou21b.html
http://proceedings.mlr.press/v130/zhou21b.html Curriculum Learning by Optimizing Learning Dynamics We study a novel curriculum learning scheme where in each round, samples are selected to achieve the greatest progress and fastest learning speed towards the ground-truth on all available samples. Inspired by an analysis of optimization dynamics under gradient flow for both regression and classification, the problem reduces to selecting training samples by a score computed from samples’ residual and linear temporal dynamics. It encourages the model to focus on the samples at the learning frontier, i.e., those with large loss but fast learning speed. The scores in discrete time can be estimated via already-available byproducts of training, and thus require a negligible amount of extra computation. We discuss the properties and potential advantages of the proposed dynamics optimization via current deep learning theory and empirical study. By integrating it with cyclical training of neural networks, we introduce "dynamics-optimized curriculum learning (DoCL)", which selects the training set for each step by weighted sampling based on the scores. On nine different datasets, DoCL significantly outperforms random mini-batch SGD and recent curriculum learning methods both in terms of efficiency and final performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhou21a.html
http://proceedings.mlr.press/v130/zhou21a.html Federated f-Differential Privacy Federated learning (FL) is a training paradigm where the clients collaboratively learn models by repeatedly sharing information without compromising much on the privacy of their local sensitive data. In this paper, we introduce federated $f$-differential privacy, a new notion specifically tailored to the federated setting, based on the framework of Gaussian differential privacy. Federated $f$-differential privacy operates on record level: it provides the privacy guarantee on each individual record of one client’s data against adversaries. We then propose a generic private federated learning framework \fedsync that accommodates a large family of state-of-the-art FL algorithms, which provably achieves federated $f$-differential privacy. Finally, we empirically demonstrate the trade-off between privacy guarantee and prediction performance for models trained by \fedsync in computer vision tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zheng21a.html
http://proceedings.mlr.press/v130/zheng21a.html Bayesian Active Learning by Soft Mean Objective Cost of Uncertainty To achieve label efficiency for training supervised learning models, pool-based active learning sequentially selects samples from a set of candidates as queries to label by optimizing an acquisition function. One category of existing methods adopts one-step-look-ahead strategies based on acquisition functions tailored with the learning objectives, for example based on the expected loss reduction (ELR) or the mean objective cost of uncertainty (MOCU) proposed recently. These active learning methods are optimal with the maximum classification error reduction when one considers a single query. However, it is well-known that there is no performance guarantee in the long run for these myopic methods. In this paper, we show that these methods are not guaranteed to converge to the optimal classifier of the true model because MOCU is not strictly concave. Moreover, we suggest a strictly concave approximation of MOCU—Soft MOCU—that can be used to define an acquisition function to guide Bayesian active learning with theoretical convergence guarantee. For training Bayesian classifiers with both synthetic and real-world data, our experiments demonstrate the superior performance of active learning by Soft MOCU compared to other existing methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhao21c.html
http://proceedings.mlr.press/v130/zhao21c.html Active Learning under Label Shift We address the problem of active learning under label shift: when the class proportions of source and target domains differ. We introduce a "medial distribution" to incorporate a tradeoff between importance weighting and class-balanced sampling and propose their combined usage in active learning. Our method is known as Mediated Active Learning under Label Shift (MALLS). It balances the bias from class-balanced sampling and the variance from importance weighting. We prove sample complexity and generalization guarantees for MALLS which show active learning reduces asymptotic sample complexity even under arbitrary label shift. We empirically demonstrate MALLS scales to high-dimensional datasets and can reduce the sample complexity of active learning by 60% in deep active learning tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhao21b.html
http://proceedings.mlr.press/v130/zhao21b.html Right Decisions from Wrong Predictions: A Mechanism Design Alternative to Individual Calibration Decision makers often need to rely on imperfect probabilistic forecasts. While average performance metrics are typically available, it is difficult to assess the quality of individual forecasts and the corresponding utilities. To convey confidence about individual predictions to decision-makers, we propose a compensation mechanism ensuring that the forecasted utility matches the actually accrued utility. While a naive scheme to compensate decision-makers for prediction errors can be exploited and might not be sustainable in the long run, we propose a mechanism based on fair bets and online learning that provably cannot be exploited. We demonstrate an application showing how passengers could confidently optimize individual travel plans based on flight delay probabilities estimated by an airline. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhao21a.html
http://proceedings.mlr.press/v130/zhao21a.html Meta-Learning Divergences for Variational Inference Variational inference (VI) plays an essential role in approximate Bayesian inference due to its computational efficiency and broad applicability. Crucial to the performance of VI is the selection of the associated divergence measure, as VI approximates the intractable distribution by minimizing this divergence. In this paper we propose a meta-learning algorithm to learn the divergence metric suited for the task of interest, automating the design of VI methods. In addition, we learn the initialization of the variational parameters without additional cost when our method is deployed in the few-shot learning scenarios. We demonstrate our approach outperforms standard VI on Gaussian mixture distribution approximation, Bayesian neural network regression, image generation with variational autoencoders and recommender systems with a partial variational autoencoder. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21o.html
http://proceedings.mlr.press/v130/zhang21o.html On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner. MBRL algorithms can be fairly complex due to the separate dynamics modeling and the subsequent planning algorithm, and as a result, they often possess tens of hyperparameters and architectural choices. For this reason, MBRL typically requires significant human expertise before it can be applied to new problems and domains. To alleviate this problem, we propose to use automatic hyperparameter optimization (HPO). We demonstrate that this problem can be tackled effectively with automated HPO, which yields significantly improved performance compared to human experts. In addition, we show that tuning several MBRL hyperparameters dynamically, i.e. during the training itself, further improves the performance compared to using static hyperparameters that are kept fixed for the whole training. Finally, our experiments provide valuable insights into the effects of several hyperparameters, such as plan horizon or learning rate, and their influence on the stability of training and resulting rewards. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21n.html
http://proceedings.mlr.press/v130/zhang21n.html Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization Viewing optimization methods as numerical integrators for ordinary differential equations (ODEs) provides a thought-provoking modern framework for studying accelerated first-order optimizers. In this literature, acceleration is often supposed to be linked to the quality of the integrator (accuracy, energy preservation, symplecticity). In this work, we propose a novel ordinary differential equation that questions this connection: both the explicit and the semi-implicit (a.k.a. symplectic) Euler discretizations of this ODE lead to an accelerated algorithm for convex programming. Although semi-implicit methods are well-known in numerical analysis to enjoy many desirable features for the integration of physical systems, our findings show that these properties do not necessarily relate to acceleration. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21m.html
http://proceedings.mlr.press/v130/zhang21m.html A Scalable Gradient Free Method for Bayesian Experimental Design with Implicit Models Bayesian experimental design (BED) addresses the question of how to choose designs that maximize the information gathered. For implicit models, where the likelihood is intractable but sampling is possible, conventional BED methods have difficulties in efficiently estimating the posterior distribution and maximizing the mutual information (MI) between data and parameters. Recent work proposed the use of gradient ascent to maximize a lower bound on MI to deal with these issues. However, the approach requires a sampling path to compute the pathwise gradient of the MI lower bound with respect to the design variables, and such a pathwise gradient is usually inaccessible for implicit models. In this paper, we propose a novel approach that leverages recent advances in stochastic approximate gradient ascent incorporated with a smoothed variational MI estimator for efficient and robust BED. Without the necessity of pathwise gradients, our approach allows the design process to be achieved through a unified procedure with an approximate gradient for implicit models. Several experiments show that our approach outperforms baseline methods, and significantly improves the scalability of BED in high-dimensional problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21l.html
http://proceedings.mlr.press/v130/zhang21l.html Efficient Designs Of SLOPE Penalty Sequences In Finite Dimension In linear regression, SLOPE is a new convex analysis method that generalizes the Lasso via the sorted $\ell_1$ penalty: larger fitted coefficients are penalized more heavily. This magnitude-dependent regularization requires an input of penalty sequence $\boldsymbol{\lambda}$, instead of a scalar penalty as in the Lasso case, thus making the design extremely expensive in computation. In this paper, we propose two efficient algorithms to design the possibly high-dimensional SLOPE penalty, in order to minimize the mean squared error. For Gaussian data matrices, we propose a first-order Projected Gradient Descent (PGD) under the Approximate Message Passing regime. For general data matrices, we present a zeroth-order Coordinate Descent (CD) to design a sub-class of SLOPE, referred to as the $k$-level SLOPE. Our CD allows a useful trade-off between the accuracy and the computation speed. We demonstrate the performance of SLOPE with our designs via extensive experiments on synthetic data and real-world datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21k.html
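The sorted $\ell_1$ penalty mentioned in the SLOPE abstract above pairs the largest penalty weight with the largest coefficient magnitude. The sketch below evaluates that penalty under the usual definition, $\sum_i \lambda_i |\beta|_{(i)}$ with $\lambda_1 \ge \dots \ge \lambda_p \ge 0$ and $|\beta|_{(1)} \ge \dots \ge |\beta|_{(p)}$. It only computes the penalty value and does not implement the PGD or CD design algorithms from the paper; the function name `sorted_l1_penalty` is hypothetical.

```python
def sorted_l1_penalty(beta, lam):
    """Sorted-L1 (SLOPE) penalty: sum_i lam[i] * |beta|_(i).

    `lam` must be non-increasing and non-negative; |beta|_(i) denotes the
    i-th largest absolute value among the coefficients.
    """
    if len(beta) != len(lam):
        raise ValueError("beta and lam must have the same length")
    if any(lam[i] < lam[i + 1] for i in range(len(lam) - 1)):
        raise ValueError("lam must be non-increasing")
    if lam and lam[-1] < 0:
        raise ValueError("lam must be non-negative")
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(l * m for l, m in zip(lam, magnitudes))


coefficients = [1.0, -3.0, 2.0]

# Decreasing weights: the largest coefficient (|-3|) gets the largest weight.
slope_value = sorted_l1_penalty(coefficients, [3.0, 2.0, 1.0])

# A constant sequence recovers the Lasso penalty, lambda * ||beta||_1.
lasso_value = sorted_l1_penalty(coefficients, [2.0, 2.0, 2.0])
```

Because a constant penalty sequence reproduces the Lasso, the design problem described in the abstract amounts to choosing a whole sequence where the Lasso needs only a single scalar.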
http://proceedings.mlr.press/v130/zhang21k.html Product Manifold Learning We consider dimensionality reduction for data sets with two or more independent degrees of freedom. For example, measurements of deformable shapes with several parts that move independently fall under this characterization. Mathematically, if the space of each continuous independent motion is a manifold, then their combination forms a product manifold. In this paper, we present an algorithm for manifold factorization given a sample of points from the product manifold. Our algorithm is based on spectral graph methods for manifold learning and the separability of the Laplacian operator on product spaces. Recovering the factors of a manifold yields meaningful lower-dimensional representations, allowing one to focus on particular aspects of the data space while ignoring others. We demonstrate the potential use of our method for an important and challenging problem in structural biology: mapping the motions of proteins and other large molecules using cryo-electron microscopy data sets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21j.html
http://proceedings.mlr.press/v130/zhang21j.html Detection and Defense of Topological Adversarial Attacks on Graphs Graph neural network (GNN) models achieve superior performance when classifying nodes in graph-structured data. Given that state-of-the-art GNNs share many similarities with their CNN cousins and that CNNs suffer adversarial vulnerabilities, there has also been interest in exploring analogous vulnerabilities in GNNs. Indeed, recent work has demonstrated that node classification performance of several graph models, including the popular graph convolution network (GCN) model, can be severely degraded through adversarial perturbations to the graph structure and the node features. In this work, we take a first step towards detecting adversarial attacks against graph models. We first propose a straightforward single node threshold test for detecting nodes subject to targeted attacks. Subsequently, we describe a kernel-based two-sample test for detecting whether a given subset of nodes within a graph has been maliciously corrupted. The efficacy of our algorithms is established via thorough experiments using commonly used node classification benchmark datasets. We also illustrate the potential practical benefit of our detection method by demonstrating its application to a real-world Bitcoin transaction network. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21i.html
http://proceedings.mlr.press/v130/zhang21i.html Animal pose estimation from video data with a hierarchical von Mises-Fisher-Gaussian model Animal pose estimation from video data is an important step in many biological studies, but current methods struggle in complex environments where occlusions are common and training data is scarce. Recent work has demonstrated improved accuracy with deep neural networks, but these methods often do not incorporate prior distributions that could improve localization. Here we present GIMBAL: a hierarchical von Mises-Fisher-Gaussian model that improves upon deep networks’ estimates by leveraging spatiotemporal constraints. The spatial constraints come from the animal’s skeleton, which induces a curved manifold of keypoint configurations. The temporal constraints come from the postural dynamics, which govern how angles between keypoints change over time. Importantly, the conditional conjugacy of the model permits simple and efficient Bayesian inference algorithms. We assess the model on a unique experimental dataset with video of a freely-behaving rodent from multiple viewpoints and ground-truth motion capture data for 20 keypoints. GIMBAL extends existing techniques, and in doing so offers more accurate estimates of keypoint positions, especially in challenging contexts. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21h.html
http://proceedings.mlr.press/v130/zhang21h.html Bayesian Coresets: Revisiting the Nonconvex Optimization Perspective Bayesian coresets have emerged as a promising approach for implementing scalable Bayesian inference. The Bayesian coreset problem involves selecting a (weighted) subset of the data samples, such that the posterior inference using the selected subset closely approximates the posterior inference using the full dataset. This manuscript revisits Bayesian coresets through the lens of sparsity constrained optimization. Leveraging recent advances in accelerated optimization methods, we propose and analyze a novel algorithm for coreset selection. We provide explicit convergence rate guarantees and present an empirical evaluation on a variety of benchmark datasets to highlight our proposed algorithm’s superior performance compared to state-of-the-art on speed and accuracy. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21g.html
http://proceedings.mlr.press/v130/zhang21g.html Provably Efficient Actor-Critic for Risk-Sensitive and Robust Adversarial RL: A Linear-Quadratic Case Risk-sensitivity plays a central role in artificial intelligence safety. In this paper, we study the global convergence of the actor-critic algorithm for risk-sensitive reinforcement learning (RSRL) with exponential utility, which remains challenging for policy optimization as it lacks the linearity needed to formulate policy gradient. To bypass such an issue of nonlinearity, we resort to the equivalence between RSRL and robust adversarial reinforcement learning (RARL), which is formulated as a zero-sum Markov game with a hypothetical adversary. In particular, the Nash equilibrium (NE) of such a game yields the optimal policy for RSRL, which is provably robust. We focus on a simple yet fundamental setting known as linear-quadratic (LQ) game. To attain the optimal policy, we develop a nested natural actor-critic algorithm, which provably converges to the NE of the LQ game at a sublinear rate, thus solving both RSRL and RARL. To the best of our knowledge, the proposed nested actor-critic algorithm appears to be the first model-free policy optimization algorithm that provably attains the optimal policy for RSRL and RARL in the LQ setting, which sheds light on more general settings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21f.html
http://proceedings.mlr.press/v130/zhang21f.html Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d + 2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space. A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21e.html
http://proceedings.mlr.press/v130/zhang21e.html On the Suboptimality of Negative Momentum for Minimax Optimization Smooth game optimization has recently attracted great interest in machine learning as it generalizes the single-objective optimization paradigm. However, game dynamics is more complex due to the interaction between different players and is therefore fundamentally different from minimization, posing new challenges for algorithm design. Notably, it has been shown that negative momentum is preferred due to its ability to reduce oscillation in game dynamics. Nevertheless, existing analysis of negative momentum was restricted to simple bilinear games. In this paper, we extend the analysis to smooth and strongly-convex strongly-concave minimax games by taking the variational inequality formulation. By connecting Polyak’s momentum with Chebyshev polynomials, we show that negative momentum accelerates convergence of game dynamics locally, though with a suboptimal rate. To the best of our knowledge, this is the first work that provides an explicit convergence rate for negative momentum in this setting. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21d.html
http://proceedings.mlr.press/v130/zhang21d.html Exploiting Equality Constraints in Causal Inference Assumptions about equality of effects are commonly made in causal inference tasks. For example, the well-known “difference-in-differences” method assumes that confounding remains constant across time periods. Similarly, it is not unreasonable to assume that causal effects apply equally to units undergoing interference. Finally, sensitivity analysis often hypothesizes equality among existing and unaccounted for confounders. Despite the ubiquity of these “equality constraints,” modern identification methods have not leveraged their presence in a systematic way. In this paper, we develop a novel graphical criterion that extends the well-known method of generalized instrumental sets to exploit such additional constraints for causal identification in linear models. We further demonstrate how it solves many diverse problems found in the literature in a general way, including difference-in-differences, interference, as well as benchmarking in sensitivity analysis. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21c.html
http://proceedings.mlr.press/v130/zhang21c.html Provable Hierarchical Imitation Learning via EM Due to recent empirical successes, the options framework for hierarchical reinforcement learning is gaining increasing popularity. Rather than learning from rewards, we consider learning an options-type hierarchical policy from expert demonstrations. Such a problem is referred to as hierarchical imitation learning. Converting this problem to parameter inference in a latent variable model, we develop convergence guarantees for the EM approach proposed by Daniel et al. (2016b). The population level algorithm is analyzed as an intermediate step, which is nontrivial due to the samples being correlated. If the expert policy can be parameterized by a variant of the options framework, then, under regularity conditions, we prove that the proposed algorithm converges with high probability to a norm ball around the true parameter. To our knowledge, this is the first performance guarantee for a hierarchical imitation learning algorithm that only observes primitive state-action pairs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21b.html
http://proceedings.mlr.press/v130/zhang21b.html Generalization Bounds for Stochastic Saddle Point Problems This paper studies the generalization bounds for the empirical saddle point (ESP) solution to stochastic saddle point (SSP) problems. For SSP with Lipschitz continuous and strongly convex-strongly concave objective functions, we establish an $O\left(1/n\right)$ generalization bound by using a probabilistic stability argument. We also provide generalization bounds under a variety of assumptions, including the cases without strong convexity and without bounded domains. We illustrate our results in three examples: batch policy learning in Markov decision process, stochastic composite optimization problem, and mixed strategy Nash equilibrium estimation for stochastic games. In each of these examples, we show that a regularized ESP solution enjoys a near-optimal sample complexity. To the best of our knowledge, this is the first set of results on the generalization theory of ESP. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/zhang21a.html
http://proceedings.mlr.press/v130/zhang21a.html Stability and Risk Bounds of Iterative Hard Thresholding The Iterative Hard Thresholding (IHT) algorithm is one of the most popular and promising greedy pursuit methods for high-dimensional statistical estimation under cardinality constraint. The existing analysis of IHT mostly focuses on parameter estimation and sparsity recovery consistency. From the perspective of statistical learning theory, another fundamental question is how well the IHT estimation would perform on unseen samples. The answer to this question is important for understanding the generalization ability of IHT yet has remained elusive. In this paper, we investigate this problem and develop a novel generalization theory for IHT from the viewpoint of algorithmic stability. Our theory reveals that: 1) under natural conditions on the empirical risk function over $n$ samples of dimension $p$, IHT with sparsity level $k$ enjoys an $\mathcal{\tilde O}(n^{-1/2}\sqrt{k\log(n)\log(p)})$ rate of convergence in sparse excess risk; and 2) a fast rate of order $\mathcal{\tilde O}(n^{-1}k(\log^3(n)+\log(p)))$ can be derived for strongly convex risk function under certain strong-signal conditions. The results have been substantialized to sparse linear regression and logistic regression models along with numerical evidence provided to support our theory. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yuan21a.html
http://proceedings.mlr.press/v130/yuan21a.html False Discovery Rates in Biological Networks The increasing availability of data has generated unprecedented prospects for network analyses in many biological fields, such as neuroscience (e.g., brain networks), genomics (e.g., gene-gene interaction networks), and ecology (e.g., species interaction networks). A powerful statistical framework for estimating such networks is Gaussian graphical models, but standard estimators for the corresponding graphs are prone to large numbers of false discoveries. In this paper, we introduce a novel graph estimator based on knockoffs that imitate the partial correlation structures of unconnected nodes. We then show that this new estimator provides accurate control of the false discovery rate and yet large power. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yu21a.html
http://proceedings.mlr.press/v130/yu21a.html Minimax Estimation of Laplacian Constrained Precision Matrices This paper considers the problem of high-dimensional sparse precision matrix estimation under Laplacian constraints. We prove that the Laplacian constraints bring favorable properties for estimation: the Gaussian maximum likelihood estimator exists and is unique almost surely on the basis of one observation, irrespective of the dimension. We establish the optimal rate of convergence under Frobenius norm by the derivation of the minimax lower and upper bounds. The minimax lower bound is obtained by applying Le Cam-Assouad’s method with a novel construction of a subparameter space of multivariate normal distributions. The minimax upper bound is established by designing an adaptive $\ell_1$-norm regularized maximum likelihood estimation method and quantifying the rate of convergence. We prove that the proposed estimator attains the optimal rate of convergence with an overwhelming probability. Numerical experiments demonstrate the effectiveness of the proposed estimator. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ying21a.html
http://proceedings.mlr.press/v130/ying21a.html Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning The problem of \emph{Offline Policy Evaluation} (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a \emph{fixed} target policy $\pi$, which does not provide useful bounds for offline policy learning as $\pi$ will then be data-dependent. We address this problem by \emph{simultaneously} evaluating all policies in a policy class $\Pi$ (uniform convergence in OPE) and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that the model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy under the \emph{time-inhomogeneous episodic} MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$). To the best of our knowledge, this is the first time the optimal rate is shown to be possible for the offline RL setting and the paper is the first that systematically investigates the uniform convergence in OPE. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yin21a.html
http://proceedings.mlr.press/v130/yin21a.html Deep Spectral Ranking Learning from ranking observations arises in many domains, and siamese deep neural networks have shown excellent inference performance in this setting. However, SGD does not scale well, as an epoch grows exponentially with the ranking observation size. We show that a spectral algorithm can be combined with deep learning methods to significantly accelerate training. We combine a spectral estimate of Plackett-Luce ranking scores with a deep model via the Alternating Directions Method of Multipliers with a Kullback-Leibler proximal penalty. Compared to a state-of-the-art siamese network, our algorithms are up to 175 times faster and attain better predictions by up to 26% Top-1 Accuracy and 6% Kendall-Tau correlation over five real-life ranking datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yildiz21a.html
http://proceedings.mlr.press/v130/yildiz21a.html Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive computationally efficient learning algorithms. We consider solving a binary classification problem using random features and stochastic gradient descent, both of which are common and widely used in practical large-scale problems. Although there are plenty of previous works investigating the efficiency of these algorithms in terms of the convergence of the objective loss function, these results suggest that the computational gain comes at the expense of the learning accuracy when dealing with general Lipschitz loss functions such as logistic loss. In this study, we analyze the properties of these algorithms in terms of the convergence not of the loss function, but the classification error under the strong low-noise condition, which reflects a realistic property of real-world datasets. We extend previous studies on SGD to a random features setting, examining a novel analysis about the error induced by the approximation of random features in terms of the distance between the generated hypothesis to show that an exponential convergence of the expected classification error is achieved even if random features approximation is applied. We demonstrate that the convergence rate does not depend on the number of features and there is a significant computational benefit in using random features in classification problems under the strong low-noise condition. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yashima21a.html
http://proceedings.mlr.press/v130/yashima21a.html Understanding Robustness in Teacher-Student Setting: A New Perspective Adversarial examples have appeared as a ubiquitous property of machine learning models where bounded adversarial perturbation could mislead the models to make arbitrarily incorrect predictions. Such examples provide a way to assess the robustness of machine learning models as well as a proxy for understanding the model training process. Extensive studies try to explain the existence of adversarial examples and provide ways to improve model robustness (e.g. adversarial training). While they mostly focus on models trained on datasets with predefined labels, we leverage the teacher-student framework and assume a teacher model, or \emph{oracle}, to provide the labels for given instances. We extend \citet{tian2019student} in the case of low-rank input data and show that \emph{student specialization} (trained student neuron is highly correlated with certain teacher neuron at the same layer) still happens within the input subspace, but the teacher and student nodes could \emph{differ wildly} out of the data subspace, which we conjecture leads to adversarial examples. Extensive experiments show that student specialization correlates strongly with model robustness in different scenarios, including student trained via standard training, adversarial training, confidence-calibrated adversarial training, and training with robust feature dataset. Our studies could shed light on the future exploration about adversarial examples, and enhancing model robustness via principled data augmentation. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yang21e.html
http://proceedings.mlr.press/v130/yang21e.html TenIPS: Inverse Propensity Sampling for Tensor Completion Tensors are widely used to represent multiway arrays of data. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. The algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a higher-order SVD approach, reweighting the observed tensor by the inverse propensities. We provide finite-sample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yang21d.html
http://proceedings.mlr.press/v130/yang21d.html Stability and Differential Privacy of Stochastic Gradient Descent for Pairwise Learning with Non-Smooth Loss Pairwise learning has recently received increasing attention since it subsumes many important machine learning tasks (e.g. AUC maximization and metric learning) into a unifying framework. In this paper, we give the first-ever-known stability and generalization analysis of stochastic gradient descent (SGD) for pairwise learning with non-smooth loss functions, which are widely used (e.g. Ranking SVM with the hinge loss). We introduce a novel decomposition in its stability analysis to decouple the pairwisely dependent random variables, and derive generalization bounds consistent with pointwise learning. Furthermore, we apply our stability analysis to develop differentially private SGD for pairwise learning, for which our utility bounds match with the state-of-the-art output perturbation method (Huai et al., 2020) with smooth losses. Finally, we illustrate the results using specific examples of AUC maximization and similarity metric learning. As a byproduct, we provide an affirmative solution to an open question on the advantage of the nuclear-norm constraint over Frobenius norm constraint in similarity metric learning. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yang21c.html
http://proceedings.mlr.press/v130/yang21c.html Q-learning with Logarithmic Regret This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yang21b.html
http://proceedings.mlr.press/v130/yang21b.html Fully Gap-Dependent Bounds for Multinomial Logit Bandit We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most $K$ from a pool of $N$ items, and the buyer purchases an item from the assortment according to an MNL choice model. The objective is to learn the model parameters and maximize the expected revenue. We present (i) an algorithm that identifies the optimal assortment $S^*$ within $\widetilde{O}(\sum_{i = 1}^N \Delta_i^{-2})$ time steps with high probability, and (ii) an algorithm that incurs $O(\sum_{i \notin S^*} K\Delta_i^{-1} \log T)$ regret in $T$ time steps. To our knowledge, our algorithms are the \emph{first} to achieve gap-dependent bounds that \emph{fully} depend on the suboptimality gaps of \emph{all} items. Our technical contributions include an algorithmic framework that relates the MNL-bandit problem to a variant of the top-$K$ arm identification problem in multi-armed bandits, a generalized epoch-based offering procedure, and a layer-based adaptive estimation procedure. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yang21a.html
http://proceedings.mlr.press/v130/yang21a.html Faster Kernel Interpolation for Gaussian Processes A key challenge in scaling Gaussian Process (GP) regression to massive datasets is that exact inference requires computation with a dense n × n kernel matrix, where n is the number of data points. Significant work focuses on approximating the kernel matrix via interpolation using a smaller set of m “inducing points”. Structured kernel interpolation (SKI) is among the most scalable methods: by placing inducing points on a dense grid and using structured matrix algebra, SKI achieves per-iteration time of O(n + m log m) for approximate inference. This linear scaling in n enables approximate inference for very large data sets; however, the cost is per-iteration, which remains a limitation for extremely large n. We show that the SKI per-iteration time can be reduced to O(m log m) after a single O(n) time precomputation step by reframing SKI as solving a natural Bayesian linear regression problem with a fixed set of m compact basis functions. For a fixed grid, our new method scales to truly massive data sets: after the initial linear time pass, all subsequent computations are independent of n. We demonstrate speedups in practice for a wide range of m and n and for all the main GP inference tasks, and apply the method to GP inference on a three-dimensional weather radar dataset with over 100 million points. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/yadav21a.html
http://proceedings.mlr.press/v130/yadav21a.html Couplings for Multinomial Hamiltonian Monte Carlo Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time (the time it takes for the coupled chains to meet) based on the notion of local contractivity. We evaluate our methods using three targets: 1,000 dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together pave the way for a wider and more practical use of coupled HMC methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21i.html
http://proceedings.mlr.press/v130/xu21i.html DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks Recent years have witnessed strong empirical performance of over-parameterized neural networks on various tasks and many advances in the theory, e.g. the universal approximation and provable convergence to global minimum. In this paper, we incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in the high dimensional linear problem. By doing so, we can exploit a wide class of networks to approximate the nuisance functions and to estimate the parameters of interest consistently. Therefore, we may offer the best of two worlds: the universal approximation ability from neural networks and the interpretability from the classic ordinary linear model, leading to valid inference and accurate prediction. We show the theoretical foundations that make this possible and demonstrate them with numerical experiments. Furthermore, we propose a framework, DebiNet, in which we plug arbitrary feature selection methods into our semi-parametric neural network and illustrate that our framework debiases the regularized estimators and performs well in terms of the post-selection inference and the generalization error. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21h.html
http://proceedings.mlr.press/v130/xu21h.html Meta Learning in the Continuous Time Limit In this paper, we establish the ordinary differential equation (ODE) that underlies the training dynamics of Model-Agnostic Meta-Learning (MAML). Our continuous-time limit view of the process eliminates the influence of the manually chosen step size of gradient descent and includes the existing gradient descent training algorithm as a special case that results from a specific discretization. We show that the MAML ODE enjoys a linear convergence rate to an approximate stationary point of the MAML loss function for strongly convex task losses, even when the corresponding MAML loss is non-convex. Moreover, through the analysis of the MAML ODE, we propose a new BI-MAML training algorithm that reduces the computational burden associated with existing MAML training methods, and empirical experiments are performed to showcase the superiority of our proposed methods in the rate of convergence with respect to the vanilla MAML algorithm. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21g.html
http://proceedings.mlr.press/v130/xu21g.html Optimal query complexity for private sequential learning against eavesdropping We study the query complexity of a learner-private sequential learning problem, motivated by the privacy and security concerns due to eavesdropping that arise in practical applications such as pricing and Federated Learning. A learner tries to estimate an unknown scalar value, by sequentially querying an external database and receiving binary responses; meanwhile, a third-party adversary observes the learner’s queries but not the responses. The learner’s goal is to design a querying strategy with the minimum number of queries (optimal query complexity) so that she can accurately estimate the true value, while the eavesdropping adversary even with the complete knowledge of her querying strategy cannot. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21f.html
http://proceedings.mlr.press/v130/xu21f.html Learning Matching Representations for Individualized Organ Transplantation Allocation Organ transplantation can improve life expectancy for recipients, but the probability of a successful transplant depends on the compatibility between donor and recipient features. Current medical practice relies on coarse rules for donor-recipient matching, but is short of domain knowledge regarding the complex factors underlying organ compatibility. In this paper, we formulate the problem of learning data-driven rules for donor-recipient matching using observational data for organ allocations and transplant outcomes. This problem departs from the standard supervised learning setup in that it involves matching two feature spaces (for donors and recipients), and requires estimating transplant outcomes under counterfactual matches not observed in the data. To address this problem, we propose a model based on representation learning to predict donor-recipient compatibility: our model learns representations that cluster donor features, and applies donor-invariant transformations to recipient features to predict transplant outcomes under a given donor-recipient feature instance. Experiments on several semi-synthetic and real-world datasets show that our model outperforms state-of-the-art allocation models and real-world policies executed by human experts. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21e.html
http://proceedings.mlr.press/v130/xu21e.html On the Faster Alternating Least-Squares for CCA We study alternating least-squares (ALS) for canonical correlation analysis (CCA). Recent research shows that the alternating least-squares solver for k-CCA can be directly accelerated with momentum and prominent performance gain has been observed in practice for the resulting simple algorithm. However, despite the simplicity, it is difficult for the accelerated rate to be analyzed in theory in order to explain and match the empirical performance gain. By looking into two neighboring iterations, in this work, we propose an even simpler variant of the faster alternating least-squares solver. Instead of applying momentum to each update for acceleration, the proposed variant only leverages momentum at every other iteration and can converge at a provably faster linear rate of nearly square-root dependence on the singular value gap of the whitened cross-covariance matrix. In addition to the high consistency between theory and practice, experimental studies also show that our variant of the alternating least-squares algorithm as a block CCA solver is even more pass efficient than other variants. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21d.html
http://proceedings.mlr.press/v130/xu21d.html Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous studies, the non-asymptotic analysis of linear TDC and Greedy-GQ has been studied in the Markovian setting, with a single-sample update at each iteration. For the nonlinear TDC algorithm, only the asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with mini-batch data for each update. For linear TDC, we provide a novel non-asymptotic analysis and our sample complexity result achieves the complexity $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$. For nonlinear TDC and Greedy-GQ, we show that both algorithms attain an $\epsilon$-accurate stationary solution with sample complexity $\mathcal{O}(\epsilon^{-2})$. This is the first non-asymptotic convergence result established for nonlinear TDC, and our result for Greedy-GQ improves on the previous result in order by a factor of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21c.html
http://proceedings.mlr.press/v130/xu21c.html A Stein Goodness-of-test for Exponential Random Graph Models We propose and analyse a novel nonparametric goodness-of-fit testing procedure for exchangeable exponential random graph models (ERGMs) when a single network realisation is observed. The test determines how likely it is that the observation is generated from a target unnormalised ERGM density. Our test statistics are derived from a kernel Stein discrepancy, a divergence constructed via Stein’s method using functions from a reproducing kernel Hilbert space (RKHS), combined with a discrete Stein operator for ERGMs. The test is a Monte Carlo test using simulated networks from the target ERGM. We show theoretical properties for the testing procedure w.r.t. a class of ERGMs. Simulation studies and real network applications are presented. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21b.html
http://proceedings.mlr.press/v130/xu21b.html Decision Making Problems with Funnel Structure: A Multi-Task Learning Approach with Application to Email Marketing Campaigns This paper studies the decision making problem with Funnel Structure. Funnel structure, a well-known concept in the marketing field, occurs in those systems where the decision maker interacts with the environment in a layered manner receiving far fewer observations from deep layers than shallow ones. For example, in the email marketing campaign application, the layers correspond to Open, Click and Purchase events. Conversions from Click to Purchase happen very infrequently because a purchase cannot be made unless the link in an email is clicked on. We formulate this challenging decision making problem as a contextual bandit with funnel structure and develop a multi-task learning algorithm that mitigates the lack of sufficient observations from deeper layers. We analyze both the prediction error and the regret of our algorithms. We verify our theory on prediction errors through a simple simulation. Experiments on both a simulated environment and an environment based on real-world data from a major email marketing company show that our algorithms offer significant improvement over previous methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xu21a.html
http://proceedings.mlr.press/v130/xu21a.html Adversarially Robust Estimate and Risk Analysis in Linear Regression Adversarial robust learning aims to design algorithms that are robust to small adversarial perturbations on input variables. Beyond the existing studies on the predictive performance to adversarial samples, our goal is to understand statistical properties of adversarial robust estimates and analyze adversarial risk in the setup of linear regression models. By discovering the statistical minimax rate of convergence of adversarial robust estimators, we emphasize the importance of incorporating model information, e.g., sparsity, in adversarial robust learning. Further, we reveal an explicit connection of adversarial and standard estimates, and propose a straightforward two-stage adversarial training framework, which facilitates the use of model structure information to improve adversarial robustness. In theory, the consistency of the adversarial robust estimator is proven and its Bahadur representation is also developed for the statistical inference purpose. The proposed estimator converges at a sharp rate under either the low-dimensional or the sparse scenario. Moreover, our theory confirms two phenomena in adversarial robust learning: adversarial robustness hurts generalization, and unlabeled data help improve the generalization. In the end, we conduct numerical simulations to verify our theory. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xing21c.html
http://proceedings.mlr.press/v130/xing21c.html On the Generalization Properties of Adversarial Training Modern machine learning and deep learning models are shown to be vulnerable when testing data are slightly perturbed. Theoretical studies of adversarial training algorithms mostly focus on their adversarial training losses or local convergence properties. In contrast, this paper studies the generalization performance of a generic adversarial training algorithm. Specifically, we consider linear regression models and two-layer neural networks (with lazy training) using squared loss in both the low-dimensional and the high-dimensional regime. In the former regime, after overcoming the non-smoothness of adversarial training, the adversarial risk of the trained models will converge to the minimal adversarial risk. In the latter regime, we discover that data interpolation prevents the adversarial robust estimator from being consistent (i.e., converging in probability). Therefore, inspired by successes of the least absolute shrinkage and selection operator (LASSO), we incorporate the $\mathcal{L}_1$ penalty in high-dimensional adversarial learning, and show that it leads to consistent adversarial robust estimation. A series of numerical studies is conducted to demonstrate how the smoothness and $\mathcal{L}_1$ penalization help to improve the adversarial robustness of DNN models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xing21b.html
http://proceedings.mlr.press/v130/xing21b.html Predictive Power of Nearest Neighbors Algorithm under Random Perturbation This work investigates the predictive performance of the classical $k$ Nearest Neighbors ($k$-NN) algorithm when the testing data are corrupted by random perturbation. The impact of the corruption level on the asymptotic regret is carefully characterized, and we reveal a phase-transition phenomenon: when the corruption level of the random perturbation $\omega$ is below a critical order (i.e., small-$\omega$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., large-$\omega$ regime), the asymptotic regret deteriorates polynomially. More importantly, the regret of the $k$-NN classifier heuristically matches the rate of minimax regret for randomly perturbed testing data, which implies the strong robustness of $k$-NN against random perturbation on testing data. In fact, we show that the classical $k$-NN can achieve no worse predictive performance, compared to the NN classifiers trained via the popular noise-injection strategy. Our numerical experiment also illustrates that combining a $k$-NN component with modern learning algorithms lets them inherit the strong robustness of $k$-NN. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in \cite{xue2017achieving} will at most achieve a sub-optimal rate when the data dimension $d>4$ even if $k$ is chosen optimally in the pre-processing step. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xing21a.html
http://proceedings.mlr.press/v130/xing21a.html Understanding the wiring evolution in differentiable neural architecture search Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are more preferred than deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit biases in the cost’s assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing NAS methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xie21a.html
http://proceedings.mlr.press/v130/xie21a.html Semi-Supervised Learning with Meta-Gradient In this work, we propose a simple yet effective meta-learning algorithm in semi-supervised learning. We notice that most existing consistency-based approaches suffer from overfitting and limited model generalization ability, especially when training with only a small number of labeled data. To alleviate this issue, we propose a learn-to-generalize regularization term by utilizing the label information and optimize the problem in a meta-learning fashion. Specifically, we seek the pseudo labels of the unlabeled data so that the model can generalize well on the labeled data, which is formulated as a nested optimization problem. We address this problem using the meta-gradient that bridges between the pseudo label and the regularization term. In addition, we introduce a simple first-order approximation to avoid computing higher-order derivatives and provide theoretic convergence analysis. Extensive evaluations on the SVHN, CIFAR, and ImageNet datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/xiao21a.html
http://proceedings.mlr.press/v130/xiao21a.html Completing the Picture: Randomized Smoothing Suffers from the Curse of Dimensionality for a Large Family of Distributions Randomized smoothing is currently the most competitive technique for providing provable robustness guarantees. Since this approach is model-agnostic and inherently scalable we can certify arbitrary classifiers. Despite its success, recent works show that for a small class of i.i.d. distributions, the largest $l_p$ radius that can be certified using randomized smoothing decreases as $O(1/d^{1/2-1/p})$ with dimension $d$ for $p > 2$. We complete the picture and show that similar no-go results hold for the $l_2$ norm for a much more general family of distributions which are continuous and symmetric about the origin. Specifically, we calculate two different upper bounds of the $l_2$ certified radius which have a constant multiplier of order $\Theta(1/d^{1/2})$. Moreover, we extend our results to $l_p (p>2)$ certification with spherical symmetric distributions solidifying the limitations of randomized smoothing. We discuss the implications of our results for how accuracy and robustness are related, and why robust training with noise augmentation can alleviate some of the limitations in practice. We also show that on real-world data the gap between the certified radius and our upper bounds is small. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wu21d.html
http://proceedings.mlr.press/v130/wu21d.html Prediction with Finitely many Errors Almost Surely Using only samples from a probabilistic model, we predict properties of the model and of future observations. The prediction game continues in an online fashion as the sample size grows with new observations. After each prediction, the predictor incurs a binary (0-1) loss. The probability model underlying a sample is otherwise unknown except that it belongs to a known class of models. The goal is to make finitely many errors (i.e. loss of 1) with probability 1 under the generating model, no matter what it may be in the known model class. Model classes admitting predictors that make only finitely many errors are eventually almost surely (eas) predictable. When the losses incurred are observable (the supervised case), we completely characterize eas predictable classes. We provide analogous results in the unsupervised case. Our results have a natural interpretation in terms of regularization. In eas-predictable classes, we study if it is possible to have a universal stopping rule that identifies (to any given confidence) when no more errors will be made. Classes admitting such a stopping rule are eas learnable. When samples are generated iid, we provide a complete characterization of eas learnability. We also study cases when samples are not generated iid, but a full characterization remains open at this point. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wu21c.html
http://proceedings.mlr.press/v130/wu21c.html Hierarchical Inducing Point Gaussian Process for Inter-domian Observations We examine the general problem of inter-domain Gaussian Processes (GPs): problems where the GP realization and the noisy observations of that realization lie on different domains. When the mapping between those domains is linear, such as integration or differentiation, inference is still closed form. However, many of the scaling and approximation techniques that our community has developed do not apply to this setting. In this work, we introduce the hierarchical inducing point GP (HIP-GP), a scalable inter-domain GP inference method that enables us to improve the approximation accuracy by increasing the number of inducing points to the millions. HIP-GP, which relies on inducing points with grid structure and a stationary kernel assumption, is suitable for low-dimensional problems. In developing HIP-GP, we introduce (1) a fast whitening strategy, and (2) a novel preconditioner for conjugate gradients which can be helpful in general GP settings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wu21b.html
http://proceedings.mlr.press/v130/wu21b.html Hadamard Wirtinger Flow for Sparse Phase Retrieval We consider the problem of reconstructing an $n$-dimensional $k$-sparse signal from a set of noiseless magnitude-only measurements. Formulating the problem as an unregularized empirical risk minimization task, we study the sample complexity performance of gradient descent with Hadamard parametrization, which we call Hadamard Wirtinger flow (HWF). Provided knowledge of the signal sparsity $k$, we prove that a single step of HWF is able to recover the support from $k(x^*_{max})^{-2}$ (modulo logarithmic term) samples, where $x^*_{max}$ is the largest component of the signal in magnitude. This support recovery procedure can be used to initialize existing reconstruction methods and yields algorithms with total runtime proportional to the cost of reading the data and improved sample complexity, which is linear in $k$ when the signal contains at least one large component. We numerically investigate the performance of HWF at convergence and show that, while not requiring any explicit form of regularization nor knowledge of $k$, HWF adapts to the signal sparsity and reconstructs sparse signals with fewer measurements than existing gradient based methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wu21a.html
http://proceedings.mlr.press/v130/wu21a.html Adaptive wavelet pooling for convolutional neural networks Convolutional neural networks (CNNs) have become the go-to choice for most image and video processing tasks. Most CNN architectures rely on pooling layers to reduce the resolution along spatial dimensions. The reduction allows subsequent deep convolution layers to operate with greater efficiency. This paper introduces adaptive wavelet pooling layers, which employ fast wavelet transforms (FWT) to reduce the feature resolution. The FWT decomposes the input features into multiple scales, reducing the feature dimensions by removing the fine-scale subbands. Our approach adds extra flexibility through wavelet-basis function optimization and coefficient weighting at different scales. The adaptive wavelet layers integrate directly into well-known CNNs like the LeNet, AlexNet, or DenseNet architectures. Using these networks, we validate our approach and find competitive performance on the MNIST, CIFAR10, and SVHN (street view house numbers) datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wolter21a.html
http://proceedings.mlr.press/v130/wolter21a.html Sparse Algorithms for Markovian Gaussian Processes Approximate Bayesian inference methods that scale to very large datasets are crucial in leveraging probabilistic models for real-world time series. Sparse Markovian Gaussian processes combine the use of inducing variables with efficient Kalman filter-like recursions, resulting in algorithms whose computational and memory requirements scale linearly in the number of inducing points, whilst also enabling parallel parameter updates and stochastic optimisation. Under this paradigm, we derive a general site-based approach to approximate inference, whereby we approximate the non-Gaussian likelihood with local Gaussian terms, called sites. Our approach results in a suite of novel sparse extensions to algorithms from both the machine learning and signal processing literature, including variational inference, expectation propagation, and the classical nonlinear Kalman smoothers. The derived methods are suited to large time series, and we also demonstrate their applicability to spatio-temporal data, where the model has separate inducing points in both time and space. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wilkinson21a.html
http://proceedings.mlr.press/v130/wilkinson21a.html Moment-Based Variational Inference for Stochastic Differential Equations Existing deterministic variational inference approaches for diffusion processes use simple proposals and target the marginal density of the posterior. We construct the variational process as a controlled version of the prior process and approximate the posterior by a set of moment functions. In combination with moment closure, the smoothing problem is reduced to a deterministic optimal control problem. Exploiting the path-wise Fisher information, we propose an optimization procedure that corresponds to a natural gradient descent in the variational parameters. Our approach allows for richer variational approximations that extend to state-dependent diffusion terms. The classical Gaussian process approximation is recovered as a special case. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wildner21a.html
http://proceedings.mlr.press/v130/wildner21a.html Foundations of Bayesian Learning from Synthetic Data There is significant growth and interest in the use of synthetic data as an enabler for machine learning in environments where the release of real data is restricted due to privacy or availability constraints. Despite a large number of methods for synthetic data generation, there are comparatively few results on the statistical properties of models learnt on synthetic data, and fewer still for situations where a researcher wishes to augment real data with another party’s synthesised data. We use a Bayesian paradigm to characterise the updating of model parameters when learning in these settings, demonstrating that caution should be taken when applying conventional learning algorithms without appropriate consideration of the synthetic data generating process and learning task at hand. Recent results from general Bayesian updating support a novel and robust approach to Bayesian synthetic-learning founded on decision theory that outperforms standard approaches across repeated experiments on supervised learning and inference problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wilde21a.html
http://proceedings.mlr.press/v130/wilde21a.html Bayesian Inference with Certifiable Adversarial Robustness We consider adversarial training of deep neural networks through the lens of Bayesian learning and present a principled framework for adversarial training of Bayesian Neural Networks (BNNs) with certifiable guarantees. We rely on techniques from constraint relaxation of non-convex optimisation problems and modify the standard cross-entropy error model to enforce posterior robustness to worst-case perturbations in $\epsilon$-balls around input points. We illustrate how the resulting framework can be combined with methods commonly employed for approximate inference of BNNs. In an empirical investigation, we demonstrate that the presented approach enables training of certifiably robust models on MNIST, FashionMNIST, and CIFAR-10 and can also be beneficial for uncertainty calibration. Our method is the first to directly train certifiable BNNs, thus facilitating their deployment in safety-critical applications. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wicker21a.html
http://proceedings.mlr.press/v130/wicker21a.html Inference in Stochastic Epidemic Models via Multinomial Approximations We introduce a new method for inference in stochastic epidemic models which uses recursive multinomial approximations to integrate over unobserved variables and thus circumvent likelihood intractability. The method is applicable to a class of discrete-time, finite-population compartmental models with partial, randomly under-reported or missing count observations. In contrast to state-of-the-art alternatives such as Approximate Bayesian Computation techniques, no forward simulation of the model is required and there are no tuning parameters. Evaluating the approximate marginal likelihood of model parameters is achieved through a computationally simple filtering recursion. The accuracy of the approximation is demonstrated through analysis of real and simulated data using a model of the 1995 Ebola outbreak in the Democratic Republic of Congo. We show how the method can be embedded within a Sequential Monte Carlo approach to estimating the time-varying reproduction number of COVID-19 in Wuhan, China, recently published by Kucharski et al. (2020). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/whiteley21a.html
http://proceedings.mlr.press/v130/whiteley21a.html Algorithms for Fairness in Sequential Decision Making It has recently been shown that if feedback effects of decisions are ignored, then imposing fairness constraints such as demographic parity or equality of opportunity can actually exacerbate unfairness. We propose to address this challenge by modeling feedback effects as Markov decision processes (MDPs). First, we propose analogs of fairness properties for the MDP setting. Second, we propose algorithms for learning fair decision-making policies for MDPs. Finally, we demonstrate the need to account for dynamical effects using simulations on a loan applicant MDP. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wen21a.html
http://proceedings.mlr.press/v130/wen21a.html Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $O(\sqrt{T})$ regret and another computationally efficient variant with $O(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $O(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $O(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wei21d.html
http://proceedings.mlr.press/v130/wei21d.html Sample Elicitation It is important to collect credible training samples $(x,y)$ for building data-intensive learning systems (e.g., a deep learning system). Asking people to report a complex distribution $p(x)$, though theoretically viable, is challenging in practice. This is primarily due to the cognitive load required for human agents to form a report of this highly complicated information. While classical elicitation mechanisms apply to eliciting a complex and generative (and continuous) distribution $p(x)$, we are interested in eliciting samples $x_i \sim p(x)$ from agents directly. We call this problem sample elicitation. This paper introduces a deep learning aided method to incentivize credible sample contributions from self-interested and rational agents. We show that with an accurate estimation of a certain $f$-divergence function we can achieve approximate incentive compatibility in eliciting truthful samples. We then present an efficient estimator with theoretical guarantees by studying the variational forms of the $f$-divergence function. We also show a connection between this sample elicitation problem and $f$-GAN, and how this connection can help reconstruct an estimator of the distribution based on collected samples. Experiments on synthetic data, MNIST, and CIFAR-10 datasets demonstrate that our mechanism elicits truthful samples. Our implementation is available at https://github.com/weijiaheng/Credible-sample-elicitation.git. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wei21c.html
http://proceedings.mlr.press/v130/wei21c.html Direct Loss Minimization for Sparse Gaussian Processes The paper provides a thorough investigation of Direct Loss Minimization (DLM), which optimizes the posterior to minimize predictive loss, in sparse Gaussian processes. For the conjugate case, we consider DLM for log-loss and DLM for square loss showing a significant performance improvement in both cases. The application of DLM in non-conjugate cases is more complex because the logarithm of expectation in the log-loss DLM objective is often intractable and simple sampling leads to biased estimates of gradients. The paper makes two technical contributions to address this. First, a new method using product sampling is proposed, which gives unbiased estimates of gradients (uPS) for the objective function. Second, a theoretical analysis of biased Monte Carlo estimates (bMC) shows that stochastic gradient descent converges despite the biased gradients. Experiments demonstrate empirical success of DLM. A comparison of the sampling methods shows that, while uPS is potentially more sample-efficient, bMC provides a better tradeoff in terms of convergence time and computational efficiency. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wei21b.html
http://proceedings.mlr.press/v130/wei21b.html Goodness-of-Fit Test for Mismatched Self-Exciting Processes Recently there have been many research efforts in developing generative models for self-exciting point processes, partly due to their broad applicability to real-world applications. However, we can rarely quantify how well the generative model captures the ground truth, since the ground truth is usually unknown. The challenge lies in the fact that generative models typically provide, at most, good approximations to the ground truth (e.g., through the rich representative power of neural networks), but they cannot be precisely the ground truth. We thus cannot use the classic goodness-of-fit (GOF) test framework to evaluate their performance. In this paper, we develop a GOF test for generative models of self-exciting processes by connecting this problem to the classical statistical theory of the Quasi-maximum-likelihood estimator (QMLE). We present a non-parametric self-normalizing statistic for the GOF test, the Generalized Score (GS) statistic, and explicitly capture the model misspecification when establishing the asymptotic distribution of the GS statistic. Numerical simulation and real-data experiments validate our theory and demonstrate the proposed GS test’s good performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wei21a.html
http://proceedings.mlr.press/v130/wei21a.html Graphical Normalizing Flows Normalizing flows model complex probability distributions by combining a base distribution with a series of bijective neural networks. State-of-the-art architectures rely on coupling and autoregressive transformations to lift up invertible functions from scalars to vectors. In this work, we revisit these transformations as probabilistic graphical models, showing they reduce to Bayesian networks with a pre-defined topology and a learnable density at each node. From this new perspective, we propose the graphical normalizing flow, a new invertible transformation with either a prescribed or a learnable graphical structure. This model provides a promising way to inject domain knowledge into normalizing flows while preserving both the interpretability of Bayesian networks and the representation capacity of normalizing flows. We show that graphical conditioners discover relevant graph structure when we cannot hypothesize it. In addition, we analyze the effect of $\ell_1$-penalization on the recovered structure and on the quality of the resulting density estimation. Finally, we show that graphical conditioners lead to competitive white box density estimators. Our implementation is available at <https://github.com/AWehenkel/DAG-NF>. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wehenkel21a.html
http://proceedings.mlr.press/v130/wehenkel21a.html Latent Derivative Bayesian Last Layer Networks Bayesian neural networks (BNN) are powerful parametric models for nonlinear regression with uncertainty quantification. However, the approximate inference techniques for weight space priors suffer from several drawbacks. The ‘Bayesian last layer’ (BLL) is an alternative BNN approach that learns the feature space for an exact Bayesian linear model with explicit predictive distributions. However, its predictions outside of the data distribution (OOD) are typically overconfident, as the marginal likelihood objective results in a learned feature space that overfits to the data. We overcome this weakness by introducing a functional prior on the model’s derivatives w.r.t. the inputs. Treating these Jacobians as latent variables, we incorporate the prior into the objective to influence the smoothness and diversity of the features, which enables greater predictive uncertainty. For the BLL, the Jacobians can be computed directly using forward mode automatic differentiation, and the distribution over Jacobians may be obtained in closed-form. We demonstrate this method enhances the BLL to Gaussian process-like performance on tasks where calibrated uncertainty is critical: OOD regression, Bayesian optimization and active learning, which include high-dimensional real-world datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/watson21a.html
http://proceedings.mlr.press/v130/watson21a.html The Multiple Instance Learning Gaussian Process Probit Model In the Multiple Instance Learning (MIL) scenario, the training data consists of instances grouped into bags. Bag labels indicate whether each bag contains at least one positive instance, but instance labels are not observed. Recently, Haussmann et al. (CVPR 2017) tackled the MIL instance label prediction task by introducing the Multiple Instance Learning Gaussian Process Logistic (MIL-GP-Logistic) model, an adaptation of the Gaussian Process Logistic Classification model that inherits its uncertainty quantification and flexibility. Notably, they provide a fast mean-field variational inference procedure. However, due to their choice of the logistic link, they do not maximize the ELBO objective directly, but rather a lower bound on it. This approximation, as we show, hurts predictive performance. In this work, we propose the Multiple Instance Learning Gaussian Process Probit (MIL-GP-Probit) model, an adaptation of the Gaussian Process Probit Classification model to solve the MIL instance label prediction problem. Leveraging the analytical tractability of the probit link, we give a variational inference procedure based on variable augmentation that maximizes the ELBO objective directly. Applying it, we show MIL-GP-Probit is significantly better calibrated than MIL-GP-Logistic on all 20 datasets of the benchmark 20 Newsgroups dataset collection, and achieves higher AUC than MIL-GP-Logistic on an additional 51 out of 59 datasets. Furthermore, we show how the probit formulation enables principled bag label predictions and a Gibbs sampling scheme. This is the first exact posterior inference procedure for any Bayesian model for the MIL scenario. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21h.html
http://proceedings.mlr.press/v130/wang21h.html Beyond Marginal Uncertainty: How Accurately can Bayesian Regression Models Estimate Posterior Predictive Correlations? While uncertainty estimation is a well-studied topic in deep learning, most such work focuses on marginal uncertainty estimates, i.e. the predictive mean and variance at individual input locations. But it is often more useful to estimate predictive correlations between the function values at different input locations. In this paper, we consider the problem of benchmarking how accurately Bayesian models can estimate predictive correlations. We first consider a downstream task which depends on posterior predictive correlations: transductive active learning (TAL). We find that TAL makes better use of models’ uncertainty estimates than ordinary active learning, and recommend this as a benchmark for evaluating Bayesian models. Since TAL is too expensive and indirect to guide development of algorithms, we introduce two metrics which more directly evaluate the predictive correlations and which can be computed efficiently: meta-correlations (i.e. the correlations between the model’s correlation estimates and the true values), and cross-normalized likelihoods (XLL). We validate these metrics by demonstrating their consistency with TAL performance and obtain insights about the relative performance of current Bayesian neural net and Gaussian process models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21g.html
http://proceedings.mlr.press/v130/wang21g.html The Sample Complexity of Meta Sparse Regression This paper addresses the meta-learning problem in sparse linear regression with infinite tasks. We assume that the learner can access several similar tasks. The goal of the learner is to transfer knowledge from the prior tasks to a similar but novel task. For $p$ parameters, size of the support set $k$, and $l$ samples per task, we show that $T \in O((k \log (p-k)) / l)$ tasks are sufficient in order to recover the common support of all tasks. With the recovered support, we can greatly reduce the sample complexity for estimating the parameter of the novel task, i.e., $l \in O(1)$ with respect to $T$ and $p$. We also prove that our rates are minimax optimal. A key difference between meta-learning and classical multi-task learning is that meta-learning focuses only on the recovery of the parameters of the novel task, while multi-task learning estimates the parameters of all tasks, which requires $l$ to grow with $T$. Instead, our efficient meta-learning estimator allows for $l$ to be constant with respect to $T$ (i.e., few-shot learning). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21f.html
http://proceedings.mlr.press/v130/wang21f.html Multitask Bandit Learning Through Heterogeneous Feedback Aggregation In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $\epsilon$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg($\epsilon$), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21e.html
http://proceedings.mlr.press/v130/wang21e.html Maximal Couplings of the Metropolis-Hastings Algorithm Couplings play a central role in the analysis of Markov chain Monte Carlo algorithms and appear increasingly often in the algorithms themselves, e.g. in convergence diagnostics, parallelization, and variance reduction techniques. Existing couplings of the Metropolis-Hastings algorithm handle the proposal and acceptance steps separately and fall short of the upper bound on one-step meeting probabilities given by the coupling inequality. This paper introduces maximal couplings which achieve this bound while retaining the practical advantages of current methods. We consider the properties of these couplings and examine their behavior on a selection of numerical examples. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21d.html
http://proceedings.mlr.press/v130/wang21d.html Multi-Fidelity High-Order Gaussian Processes for Physical Simulation The key task of physical simulation is to solve partial differential equations (PDEs) on discretized domains, which is known to be costly. In particular, high-fidelity solutions are much more expensive than low-fidelity ones. To reduce the cost, we consider novel Gaussian process (GP) models that leverage simulation examples of different fidelities to predict high-dimensional PDE solution outputs. Existing GP methods are either not scalable to high-dimensional outputs or lack effective strategies to integrate multi-fidelity examples. To address these issues, we propose Multi-Fidelity High-Order Gaussian Process (MFHoGP) that can capture complex correlations both between the outputs and between the fidelities to enhance solution estimation, and scale to large numbers of outputs. Based on a novel nonlinear coregionalization model, MFHoGP propagates bases throughout fidelities to fuse information, and places a deep matrix GP prior over the basis weights to capture the (nonlinear) relationships across the fidelities. To improve inference efficiency and quality, we use bases decomposition to largely reduce the model parameters, and layer-wise matrix Gaussian posteriors to capture the posterior dependency and to simplify the computation. Our stochastic variational learning algorithm successfully handles millions of outputs without extra sparse approximations. We show the advantages of our method in several typical applications. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21c.html
http://proceedings.mlr.press/v130/wang21c.html Shapley Flow: A Graph-based Approach to Interpreting Model Predictions Many existing approaches for estimating feature importance are problematic because they ignore or hide dependencies among features. A causal graph, which encodes the relationships among input variables, can aid in assigning feature importance. However, current approaches that assign credit to nodes in the causal graph fail to explain the entire graph. In light of these limitations, we propose Shapley Flow, a novel approach to interpreting machine learning models. It considers the entire causal graph, and assigns credit to edges instead of treating nodes as the fundamental unit of credit assignment. Shapley Flow is the unique solution to a generalization of the Shapley value axioms for directed acyclic graphs. We demonstrate the benefit of using Shapley Flow to reason about the impact of a model’s input on its output. In addition to maintaining insights from existing approaches, Shapley Flow extends the flat, set-based, view prevalent in game theory based explanation methods to a deeper, graph-based, view. This graph-based view enables users to understand the flow of importance through a system, and reason about potential interventions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wang21b.html
http://proceedings.mlr.press/v130/wang21b.html A comparative study on sampling with replacement vs Poisson sampling in optimal subsampling Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency, subsampling is often implemented with replacement or through Poisson subsampling. However, no rigorous investigation has been performed to study the difference between the two subsampling procedures such as their estimation efficiency and computational convenience. In the context of maximizing a general target function, this paper derives optimal subsampling probabilities for both subsampling with replacement and Poisson subsampling. The optimal subsampling probabilities minimize variance functions of the subsampling estimators. Furthermore, they provide deep insights on the theoretical similarities and differences between subsampling with replacement and Poisson subsampling. Practically implementable algorithms are proposed based on the optimal structural results, which are evaluated by both theoretical and empirical analysis. Thu, 18 Mar 2021 00:00:00 +0000
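The two subsampling procedures compared in this abstract can be illustrated with a minimal sketch. It assumes the usual textbook forms (exactly `r` i.i.d. draws for sampling with replacement, independent inclusion with probability `min(1, r * p_i)` for Poisson subsampling); the function names and the toy probabilities are hypothetical and are not the optimal probabilities derived in the paper.

```python
import random


def sample_with_replacement(probs, r, rng):
    """Draw exactly r indices i.i.d. from probs (which sum to 1); repeats are allowed."""
    return rng.choices(range(len(probs)), weights=probs, k=r)


def poisson_subsample(probs, r, rng):
    """Include each index independently with probability min(1, r * probs[i]).

    The subsample size is random, with expectation at most r, and no index repeats.
    """
    return [i for i, p in enumerate(probs) if rng.random() < min(1.0, r * p)]


rng = random.Random(0)
n, r = 1000, 50
# Toy nonuniform subsampling probabilities (an assumption for illustration only).
weights = [1.0 + (i % 10) for i in range(n)]
total = sum(weights)
probs = [w / total for w in weights]

swr = sample_with_replacement(probs, r, rng)  # fixed size r, possibly with repeats
poi = poisson_subsample(probs, r, rng)        # random size, one pass over the data
```

One practical difference visible here is that the Poisson variant decides on each data point independently, so it needs only a single pass and no global draw, at the price of a random subsample size.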
http://proceedings.mlr.press/v130/wang21a.html
http://proceedings.mlr.press/v130/wang21a.html Experimental Design for Regret Minimization in Linear Bandits In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms, which have been shown to be suboptimal in many cases, our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite-time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regimes. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/wagenmaker21a.html
http://proceedings.mlr.press/v130/wagenmaker21a.html Minimax Model Learning We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. We provide further analysis showing our loss can be used for off-policy optimization (OPO) and demonstrate its integration with more recent improvements in OPO. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/voloshin21a.html
http://proceedings.mlr.press/v130/voloshin21a.html Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints Many applications of AI involve scoring individuals using a learned function of their attributes. These predictive risk scores are then used to take decisions based on whether the score exceeds a certain threshold, which may vary depending on the context. The level of delegation granted to such systems in critical applications like credit lending and medical diagnosis will heavily depend on how questions of fairness can be answered. In this paper, we study fairness for the problem of learning scoring functions from binary labeled data, a classic learning task known as bipartite ranking. We argue that the functional nature of the ROC curve, the gold standard measure of ranking accuracy in this context, leads to several ways of formulating fairness constraints. We introduce general families of fairness definitions based on the AUC and on ROC curves, and show that our ROC-based constraints can be instantiated such that classifiers obtained by thresholding the scoring function satisfy classification fairness for a desired range of thresholds. We establish generalization bounds for scoring functions learned under such constraints, design practical learning algorithms, and show the relevance of our approach with numerical experiments on real and synthetic data. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vogel21a.html
http://proceedings.mlr.press/v130/vogel21a.html Large Scale K-Median Clustering for Stable Clustering Instances We study the problem of computing a good k-median clustering in a parallel computing environment. We design an efficient algorithm that gives a constant-factor approximation to the optimal solution for stable clustering instances. The notion of stability that we consider is resilience to perturbations of the distances between the points. Our computational experiments show that our algorithm works well in practice: we are able to find better clusterings than Lloyd’s algorithm and a centralized coreset construction using samples of the same size. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/voevodski21a.html
http://proceedings.mlr.press/v130/voevodski21a.html Causal Modeling with Stochastic Confounders This work extends causal inference in temporal models with stochastic confounders. We propose a new approach to variational estimation of causal inference based on a representer theorem with a random input space. We estimate causal effects involving latent confounders that may be interdependent and time-varying from sequential, repeated measurements in an observational study. Our approach extends current work that assumes independent, non-temporal latent confounders with potentially biased estimators. We introduce a simple yet elegant algorithm without parametric specification on model components. Our method avoids the need for expensive and careful parameterization in deploying complex models, such as deep neural networks in existing approaches, for causal inference and analysis. We demonstrate the effectiveness of our approach on various benchmark temporal datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vinh-vo21a.html
http://proceedings.mlr.press/v130/vinh-vo21a.html Deep Neural Networks Are Congestion Games: From Loss Landscape to Wardrop Equilibrium and Beyond The theoretical analysis of deep neural networks (DNN) is arguably among the most challenging research directions in machine learning (ML) right now, as it requires scientists to lay novel statistical learning foundations to explain their behaviour in practice. While some success has been achieved recently in this endeavour, the question of whether DNNs can be analyzed using the tools from other scientific fields outside the ML community has not received the attention it may well have deserved. In this paper, we explore the interplay between DNNs and game theory (GT), and show how one can benefit from the classic readily available results from the latter when analyzing the former. In particular, we consider the widely studied class of congestion games, and illustrate their intrinsic relatedness to both linear and non-linear DNNs and to the properties of their loss surface. Beyond retrieving the state-of-the-art results from the literature, we argue that our work provides a very promising novel tool for analyzing DNNs and support this claim by proposing concrete open problems that can significantly advance our understanding of DNNs when solved. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vesseron21a.html
http://proceedings.mlr.press/v130/vesseron21a.html Sample efficient learning of image-based diagnostic classifiers via probabilistic labels Deep learning approaches often require huge datasets to achieve good generalization. This complicates their use in tasks like image-based medical diagnosis, where the small training datasets are usually insufficient to learn appropriate data representations. For such sensitive tasks it is also important to provide confidence in the predictions. Here, we propose a way to learn and use probabilistic labels to train accurate and calibrated deep networks from relatively small datasets. We observe gains of up to 22% in the accuracy of models trained with these labels, as compared with traditional approaches, in three classification tasks: diagnosis of hip dysplasia, fatty liver, and glaucoma. The outputs of models trained with probabilistic labels are calibrated, allowing the interpretation of their predictions as proper probabilities. We anticipate this approach will apply to other tasks where few training instances are available and expert knowledge can be encoded as probabilities. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vega21a.html
http://proceedings.mlr.press/v130/vega21a.html Learning Shared Subgraphs in Ising Model Pairs Probabilistic graphical models (PGMs) are effective for capturing the statistical dependencies in stochastic databases. In many domains (e.g., working with multimodal data), one faces multiple information layers that can be modeled by structurally similar PGMs. While learning the structures of PGMs in isolation is well-investigated, the algorithmic design and performance limits of learning from multiple coupled PGMs are investigated far less. This paper considers learning the structural similarities shared by a pair of Ising PGMs. The objective is to learn the shared structure with no regard for the structures exclusive to either graph, which differs significantly from existing approaches that focus on the entire structure of the graphs. We propose an algorithm for the shared structure learning objective, evaluate its performance empirically, and compare with existing approaches on structure learning of single graphs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/varici21a.html
http://proceedings.mlr.press/v130/varici21a.html Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that consider strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Along with theoretical implications, this connection could have practical implications, including in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vankadara21a.html
http://proceedings.mlr.press/v130/vankadara21a.html Neural Empirical Bayes: Source Distribution Estimation and its Applications to Simulation-Based Inference We revisit g-modeling empirical Bayes in the absence of a tractable likelihood function, as is typical in scientific domains relying on computer simulations. We investigate how the empirical Bayesian can make use of neural density estimators first to use all noise-corrupted observations to estimate a prior or source distribution over uncorrupted samples, and then to perform single-observation posterior inference using the fitted source distribution. We propose an approach based on the direct maximization of the log-marginal likelihood of the observations, examining both biased and de-biased estimators, and comparing to variational approaches. We find that, up to symmetries, a neural empirical Bayes approach recovers ground truth source distributions. With the learned source distribution in hand, we show the applicability to likelihood-free inference and examine the quality of the resulting posterior estimates. Finally, we demonstrate the applicability of Neural Empirical Bayes on an inverse problem from collider physics. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vandegar21a.html
http://proceedings.mlr.press/v130/vandegar21a.html On Information Gain and Regret Bounds in Gaussian Process Bandits Consider the sequential optimization of an expensive to evaluate and possibly non-convex objective function $f$ from noisy feedback, that can be considered as a continuum-armed bandit problem. Upper bounds on the regret performance of several learning algorithms (GP-UCB, GP-TS, and their variants) are known under both a Bayesian (when $f$ is a sample from a Gaussian process (GP)) and a frequentist (when $f$ lives in a reproducing kernel Hilbert space) setting. The regret bounds often rely on the maximal information gain $\gamma_T$ between $T$ observations and the underlying GP (surrogate) model. We provide general bounds on $\gamma_T$ based on the decay rate of the eigenvalues of the GP kernel, whose specialisation for commonly used kernels improves the existing bounds on $\gamma_T$, and subsequently the regret bounds relying on $\gamma_T$ under numerous settings. For the Matérn family of kernels, where the lower bounds on $\gamma_T$, and regret under the frequentist setting, are known, our results close a huge polynomial in $T$ gap between the upper and lower bounds (up to logarithmic in $T$ factors). Thu, 18 Mar 2021 00:00:00 +0000
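For reference, the maximal information gain $\gamma_T$ mentioned in this abstract is commonly defined as below for a GP model with Gaussian observation noise. The symbols $\mathcal{X}$ (the domain), $\sigma^2$ (the noise variance), $K_A$ (the kernel matrix on the points in $A$), and $y_A$, $f_A$ (noisy observations and function values on $A$) are notation assumed here for illustration; they do not appear in the abstract.

```latex
\gamma_T \;=\; \max_{A \subset \mathcal{X},\; |A| = T} I(y_A; f_A),
\qquad
I(y_A; f_A) \;=\; \tfrac{1}{2} \log\det\!\left(I_T + \sigma^{-2} K_A\right)
```

Because the log-determinant depends on $K_A$ only through its eigenvalues, bounds on $\gamma_T$ can be expressed through the eigenvalue decay of the kernel, which is the route the abstract describes.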
http://proceedings.mlr.press/v130/vakili21a.html
http://proceedings.mlr.press/v130/vakili21a.html Hierarchical Clustering via Sketches and Hierarchical Correlation Clustering Recently, Hierarchical Clustering (HC) has been considered through the lens of optimization. In particular, two maximization objectives have been defined. Moseley and Wang defined the \emph{Revenue} objective to handle similarity information given by a weighted graph on the data points (w.l.o.g., $[0,1]$ weights), while Cohen-Addad et al. defined the \emph{Dissimilarity} objective to handle dissimilarity information. In this paper, we prove structural lemmas for both objectives allowing us to convert any HC tree to a tree with constant number of internal nodes while incurring an arbitrarily small loss in each objective. Although the best-known approximations are 0.585 and 0.667 respectively, using our lemmas we obtain approximations arbitrarily close to 1, if not all weights are small (i.e., there exist constants $\epsilon, \delta$ such that the fraction of weights smaller than $\delta$, is at most $1 - \epsilon$); such instances encompass many metric-based similarity instances, thereby improving upon prior work. Finally, we introduce Hierarchical Correlation Clustering (HCC) to handle instances that contain similarity and dissimilarity information simultaneously. For HCC, we provide an approximation of 0.4767 and for complementary similarity/dissimilarity weights (analogous to $+/-$ correlation clustering), we again present nearly-optimal approximations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/vainstein21a.html
http://proceedings.mlr.press/v130/vainstein21a.html Differentially Private Analysis on Graph Streams In this paper, we focus on answering queries, in a differentially private manner, on graph streams. We adopt the sliding window model of privacy, where we wish to perform analysis on the last $W$ updates and ensure that privacy is preserved for the entire stream. We show that in this model, the price of ensuring differential privacy is minimal. Furthermore, since differential privacy is preserved under post-processing, our results can be used as a subroutine in many tasks, including Lipschitz learning on graphs, cut functions, and spectral clustering. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/upadhyay21a.html
http://proceedings.mlr.press/v130/upadhyay21a.html A Statistical Perspective on Coreset Density Estimation Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of its information. This approach has led to significant computational speedups but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparametric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that the practical coreset kernel density estimators are near-minimax optimal over a large class of Hölder-smooth densities. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/turner21b.html
http://proceedings.mlr.press/v130/turner21b.html Efficient Interpolation of Density Estimators We study the problem of space and time efficient evaluation of a nonparametric estimator that approximates an unknown density. In the regime where consistent estimation is possible, we use a piecewise multivariate polynomial interpolation scheme to give a computationally efficient construction that converts the original estimator to a new estimator that can be queried efficiently and has low space requirements, all without adversely deteriorating the original approximation quality. Our result gives a new statistical perspective on the problem of fast evaluation of kernel density estimators in the presence of underlying smoothness. As a corollary, we give a succinct derivation of a classical result of Kolmogorov-Tikhomirov on the metric entropy of Hölder classes of smooth functions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/turner21a.html
http://proceedings.mlr.press/v130/turner21a.html Finding First-Order Nash Equilibria of Zero-Sum Games with the Regularized Nikaido-Isoda Function Efficiently finding First-order Nash Equilibria (FNE) in zero-sum games can be challenging, even in a two-player setting. This work proposes an algorithm for finding the FNEs of a two-player zero-sum game, in which the local cost functions can be non-convex, and the players only have access to local stochastic gradients. The proposed approach is based on reformulating the problem of interest as minimizing the Regularized Nikaido-Isoda (RNI) function. We show that the global minima of the RNI correspond to the set of FNEs, and that for certain classes of non-convex games the RNI minimization problem becomes convex. Moreover, we introduce a first-order (stochastic) optimization method, and establish its convergence to a neighborhood of a stationary solution of the RNI objective. The key in the analysis is to properly control the bias between the local stochastic gradient and the true one. Although the RNI function has been used in analyzing convex games, to our knowledge, this is the first time that the properties of the RNI formulation have been exploited to find FNEs for non-convex games in a stochastic setting. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/tsaknakis21a.html
http://proceedings.mlr.press/v130/tsaknakis21a.html Noise Contrastive Meta-Learning for Conditional Density Estimation using Kernel Mean Embeddings Current meta-learning approaches focus on learning functional representations of relationships between variables, \textit{i.e.} estimating conditional expectations in regression. In many applications, however, the conditional distributions cannot be meaningfully summarized solely by expectation (due to \textit{e.g.} multimodality). We introduce a novel technique for meta-learning conditional densities, which combines neural representation and noise contrastive estimation together with well-established literature in conditional mean embeddings into reproducing kernel Hilbert spaces. The method shows significant improvements over standard density estimation methods on synthetic and real-world data, by leveraging shared representations across multiple conditional density estimation tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ton21a.html
http://proceedings.mlr.press/v130/ton21a.html Good Classifiers are Abundant in the Interpolating Regime Within the machine learning community, the widely-used uniform convergence framework has been used to answer the question of how complex, over-parameterized models can generalize well to new data. This approach bounds the test error of the \emph{worst-case} model one could have fit to the data, but it has fundamental limitations. Inspired by the statistical mechanics approach to learning, we formally define and develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers from several model classes. We apply our method to compute this distribution for several real and synthetic datasets, with both linear and random feature classification models. We find that test errors tend to concentrate around a small \emph{typical} value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model on the same datasets, indicating that “bad” classifiers are extremely rare. We provide theoretical results in a simple setting in which we characterize the full asymptotic distribution of test errors, and we show that these indeed concentrate around a value $\varepsilon^*$, which we also identify exactly. We then formalize a more general conjecture supported by our empirical findings. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice, and that approaches based on the statistical mechanics of learning may offer a promising alternative. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/theisen21a.html
http://proceedings.mlr.press/v130/theisen21a.html Robust hypothesis testing and distribution estimation in Hellinger distance We propose a simple robust hypothesis test that has the same sample complexity as that of the optimal Neyman-Pearson test up to constants, but robust to distribution perturbations under Hellinger distance. We discuss the applicability of such a robust test for estimating distributions in Hellinger distance. We empirically demonstrate the power of the test on canonical distributions. Thu, 18 Mar 2021 00:00:00 +0000
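As a small illustration of the distance this abstract refers to, the sketch below computes the Hellinger distance between two discrete distributions. It assumes the normalization with a factor of 1/2 inside the square root, so that the distance lies in [0, 1]; other conventions omit that factor. The function name and the example distributions are hypothetical, and this is not the robust test itself.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Uses H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which is 0 for
    identical distributions and 1 for distributions with disjoint supports.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


fair = [0.5, 0.5]
biased = [0.9, 0.1]
distance = hellinger(fair, biased)  # strictly between 0 and 1
```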
http://proceedings.mlr.press/v130/theertha-suresh21a.html
http://proceedings.mlr.press/v130/theertha-suresh21a.html Optimal Quantisation of Probability Measures Using Maximum Mean Discrepancy Several researchers have proposed minimisation of maximum mean discrepancy (MMD) as a method to quantise probability measures, i.e., to approximate a distribution by a representative point set. We consider sequential algorithms that greedily minimise MMD over a discrete candidate set. We propose a novel non-myopic algorithm and, in order to both improve statistical efficiency and reduce computational cost, we investigate a variant that applies this technique to a mini-batch of the candidate set at each iteration. When the candidate points are sampled from the target, the consistency of these new algorithms, and of their mini-batch variants, is established. We demonstrate the algorithms on a range of important computational problems, including optimisation of nodes in Bayesian cubature and the thinning of Markov chain output. Thu, 18 Mar 2021 00:00:00 +0000
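The basic myopic greedy scheme that this abstract builds on can be sketched as follows. This is not the non-myopic or mini-batch algorithm proposed in the paper. It assumes a one-dimensional Gaussian kernel with unit bandwidth and uses the empirical measure of the candidate set as a stand-in for the target; all function names are hypothetical.

```python
import math
import random


def kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on the real line (the bandwidth is an illustrative choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(selected, target):
    """Squared MMD between the uniform empirical measures on two point sets."""
    m, n = len(selected), len(target)
    ss = sum(kernel(a, b) for a in selected for b in selected) / (m * m)
    st = sum(kernel(a, b) for a in selected for b in target) / (m * n)
    tt = sum(kernel(a, b) for a in target for b in target) / (n * n)
    return ss - 2.0 * st + tt


def greedy_mmd(candidates, m):
    """Pick m points one at a time, each minimising the MMD of the enlarged set.

    A candidate may be selected more than once.
    """
    selected = []
    for _ in range(m):
        best = min(candidates, key=lambda x: mmd_squared(selected + [x], candidates))
        selected.append(best)
    return selected


rng = random.Random(0)
candidates = [rng.gauss(0.0, 1.0) for _ in range(60)]
points = greedy_mmd(candidates, 5)
```

The sketch recomputes every kernel sum from scratch to stay short; a practical implementation would cache the kernel matrix and update the running sums incrementally.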
http://proceedings.mlr.press/v130/teymur21a.html
http://proceedings.mlr.press/v130/teymur21a.html Regret Minimization for Causal Inference on Large Treatment Space Predicting which action (treatment) will lead to a better outcome is a central task in decision support systems. To build a prediction model in real situations, learning from observational data with a sampling bias is a critical issue due to the lack of randomized controlled trial (RCT) data. To handle such biased observational data, recent efforts in causal inference and counterfactual machine learning have focused on debiased estimation of the potential outcomes on a binary action space and the difference between them, namely, the individual treatment effect. When it comes to a large action space (e.g., selecting an appropriate combination of medicines for a patient), however, the regression accuracy of the potential outcomes is no longer sufficient in practical terms to achieve a good decision-making performance. This is because a high mean accuracy on the large action space does not guarantee the nonexistence of a single potential outcome misestimation that misleads the whole decision. Our proposed loss minimizes the classification error of whether or not the action is relatively good for the individual target among all feasible actions, which further improves the decision-making performance, as we demonstrate. We also propose a network architecture and a regularizer that extracts a debiased representation not only from the individual feature but also from the biased action for better generalization in large action spaces. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the superiority of our method for large combinatorial action spaces. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/tanimoto21a.html
http://proceedings.mlr.press/v130/tanimoto21a.html Approximating Lipschitz continuous functions with GroupSort neural networks Recent advances in adversarial attacks and Wasserstein GANs have advocated for use of neural networks with restricted Lipschitz constants. Motivated by these observations, we study the recently introduced GroupSort neural networks, with constraints on the weights, and make a theoretical step towards a better understanding of their expressive power. We show in particular how these networks can represent any Lipschitz continuous piecewise linear functions. We also prove that they are well-suited for approximating Lipschitz continuous functions and exhibit upper bounds on both the depth and size. To conclude, the efficiency of GroupSort networks compared with more standard ReLU networks is illustrated in a set of synthetic experiments. Thu, 18 Mar 2021 00:00:00 +0000
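A minimal sketch of the GroupSort activation studied in this abstract, assuming the usual description: the pre-activations are split into consecutive groups of a fixed size and each group is sorted. The ascending sort order and the function name are choices made here for illustration.

```python
def group_sort(values, group_size):
    """Apply the GroupSort activation to a flat list of pre-activations.

    The input is split into consecutive groups of `group_size` entries and each
    group is sorted in ascending order. The output is a permutation of the
    input, so the activation preserves the norm of its argument.
    """
    if group_size <= 0 or len(values) % group_size != 0:
        raise ValueError("len(values) must be a positive multiple of group_size")
    out = []
    for start in range(0, len(values), group_size):
        out.extend(sorted(values[start:start + group_size]))
    return out


pairs = group_sort([3, 1, 2, 5, 4, 0], 2)    # → [1, 3, 2, 5, 0, 4]
triples = group_sort([3, 1, 2, 5, 4, 0], 3)  # → [1, 2, 3, 0, 4, 5]
```

With groups of size 2 each pair is mapped to its minimum and maximum, so no information about the magnitudes is discarded, unlike ReLU, which zeroes negative entries.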
http://proceedings.mlr.press/v130/tanielian21a.html
http://proceedings.mlr.press/v130/tanielian21a.html Robust Imitation Learning from Noisy Demonstrations Robust learning from noisy demonstrations is a practical but highly challenging problem in imitation learning. In this paper, we first theoretically show that robust imitation learning can be achieved by optimizing a classification risk with a symmetric loss. Based on this theoretical finding, we then propose a new imitation learning method that optimizes the classification risk by effectively combining pseudo-labeling with co-training. Unlike existing methods, our method does not require additional labels or strict assumptions about noise distributions. Experimental results on continuous-control benchmarks show that our method is more robust compared to state-of-the-art methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/tangkaratt21a.html
http://proceedings.mlr.press/v130/tangkaratt21a.html Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective. The E-step provides a natural interpretation of how ‘learning in hindsight’ techniques, such as HER, handle extremely sparse goal-conditioned rewards. The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images. We show that the combined algorithm, hEM, significantly outperforms model-free baselines on a wide range of goal-conditioned benchmarks with sparse rewards. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/tang21b.html
http://proceedings.mlr.press/v130/tang21b.html Linear Models are Robust Optimal Under Strategic Behavior There is an increasing use of algorithms to inform decisions in many settings, from student evaluations and college admissions to credit scoring. These decisions are made by applying a decision rule to individuals’ observed features. Given the impacts of these decisions on individuals, decision makers are increasingly required to be transparent about their decision making to offer the “right to explanation.” Meanwhile, being transparent also invites potential manipulations, also known as gaming, in which individuals use this knowledge to strategically alter their features in order to receive a more beneficial decision. In this work, we study the problem of \emph{robust} decision-making under strategic behavior. Prior works often assume that the decision maker has full knowledge of individuals’ cost structure for manipulations. We study the robust variant that relaxes this assumption: The decision maker does not have full knowledge but knows only a subset of the individuals’ available actions and associated costs. To approach this non-quantifiable uncertainty, we define robustness based on the worst-case guarantee of a decision, over all possible actions (including actions unknown to the decision maker) individuals might take. A decision rule is called \emph{robust optimal} if its worst-case performance is (weakly) better than that of all other decision rules. Our main contributions are two-fold. First, we provide a crisp characterization of the above robust optimality: For any decision rule under mild conditions that is robust optimal, there exists a linear decision rule that is equally robust optimal. Second, we explore the computational problem of searching for the robust optimal decision rule and, interestingly, we demonstrate the problem is closely related to distributionally robust optimization. 
We believe our results promote the use of simple linear decisions with uncertain individual manipulations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/tang21a.html
http://proceedings.mlr.press/v130/tang21a.html A Parameter-Free Algorithm for Misspecified Linear Contextual Bandits We investigate the misspecified linear contextual bandit (MLCB) problem, which is a generalization of the linear contextual bandit (LCB) problem. The MLCB problem is a decision-making problem in which a learner observes $d$-dimensional feature vectors, called arms, chooses an arm from $K$ arms, and then obtains a reward from the chosen arm in each round. The learner aims to maximize the sum of the rewards over $T$ rounds. In contrast to the LCB problem, the rewards in the MLCB problem may not be represented by a linear function of the feature vectors; instead, they are approximated by a linear function with an additive approximation parameter $\varepsilon \geq 0$. In this paper, we propose an algorithm that achieves $\tilde{O}(\sqrt{dT\log(K)} + \varepsilon\sqrt{d}T)$ regret, where $\tilde{O}(\cdot)$ ignores polylogarithmic factors in $d$ and $T$. This is the first algorithm that guarantees a high-probability regret bound for the MLCB problem without knowledge of the approximation parameter $\varepsilon$. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/takemura21a.html
http://proceedings.mlr.press/v130/takemura21a.html On the number of linear functions composing deep neural network: Towards a refined definition of neural networks complexity The classical approach to measure the expressive power of deep neural networks with piecewise linear activations is based on counting their maximum number of linear regions. This complexity measure is quite relevant to understand general properties of the expressivity of neural networks such as the benefit of depth over width. Nevertheless, it appears limited when it comes to comparing the expressivity of different network architectures. This lack becomes particularly prominent when considering permutation-invariant networks, due to the symmetrical redundancy among the linear regions. To tackle this, we propose a refined definition of piecewise linear function complexity: instead of counting the number of linear regions directly, we first introduce an equivalence relation among the linear functions composing a piecewise linear function and then count those linear functions relative to that equivalence relation. Our new complexity measure can clearly distinguish between the two aforementioned models, is consistent with the classical measure, and increases exponentially with depth. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/takai21a.html
http://proceedings.mlr.press/v130/takai21a.html Fundamental Limits of Ridge-Regularized Empirical Risk Minimization in High Dimensions Despite the popularity of Empirical Risk Minimization (ERM) algorithms, a theory that explains their statistical properties in modern high-dimensional regimes is only recently emerging. We characterize for the first time the fundamental limits on the statistical accuracy of convex ridge-regularized ERM for inference in high-dimensional generalized linear models. For a stylized setting with Gaussian features and problem dimensions that grow large at a proportional rate, we start with sharp performance characterizations and then derive tight lower bounds on the estimation and prediction error. Our bounds provably hold over a wide class of loss functions and for any value of the regularization parameter and of the sampling ratio. Our precise analysis has several attributes. First, it leads to a recipe for optimally tuning the loss function and the regularization parameter. Second, it allows us to precisely quantify the sub-optimality of popular heuristic choices, such as optimally-tuned least-squares. Third, we use the bounds to precisely assess the merits of ridge-regularization as a function of the sampling ratio. Our bounds are expressed in terms of the Fisher Information of random variables that are simple functions of the data distribution, thus making ties to corresponding bounds in classical statistics. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/taheri21a.html
http://proceedings.mlr.press/v130/taheri21a.html Amortized Bayesian Prototype Meta-learning: A New Probabilistic Meta-learning Approach to Few-shot Image Classification Probabilistic meta-learning methods have recently achieved impressive success in few-shot image classification. However, they introduce a huge number of random variables for neural network weights and thus severe computational and inferential challenges. In this paper, we propose a novel probabilistic meta-learning method called amortized Bayesian prototype meta-learning. In contrast to previous methods, we introduce only a small number of random variables for latent class prototypes rather than a huge number for network weights; we learn to learn the posterior distributions of these latent prototypes in an amortized inference way with no need for an extra amortization network, such that we can easily approximate their posteriors conditional on few labeled samples, whether at the meta-training or meta-testing stage. The proposed method can be trained end-to-end without any pre-training. Compared with other probabilistic meta-learning methods, our proposed approach is more interpretable with far fewer random variables, while still being able to achieve competitive performance for few-shot image classification problems on various benchmark datasets. Its excellent robustness and predictive uncertainty are also demonstrated through ablation studies. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sun21a.html
http://proceedings.mlr.press/v130/sun21a.html CONTRA: Contrarian statistics for controlled variable selection The holdout randomization test (HRT) discovers a set of covariates most predictive of a response. Given the covariate distribution, HRTs can explicitly control the false discovery rate (FDR). However, if this distribution is unknown and must be estimated from data, HRTs can inflate the FDR. To alleviate the inflation of FDR, we propose the contrarian randomization test (CONTRA), which is designed explicitly for scenarios where the covariate distribution must be estimated from data and may even be misspecified. Our key insight is to use an equal mixture of two “contrarian” probabilistic models in determining the importance of a covariate. One model is fit with the real data, while the other is fit using the same data, but with the covariate being tested replaced with samples from an estimate of the covariate distribution. CONTRA is flexible enough to achieve a power of 1 asymptotically, can reduce the FDR compared to state-of-the-art CVS methods when the covariate distribution is misspecified, and is computationally efficient in high dimensions and large sample sizes. We further demonstrate the effectiveness of CONTRA on numerous synthetic benchmarks, and highlight its capabilities on a genetic dataset. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sudarshan21a.html
http://proceedings.mlr.press/v130/sudarshan21a.html Evaluating Model Robustness and Stability to Dataset Shift As the use of machine learning in high impact domains becomes widespread, the importance of evaluating safety has increased. An important aspect of this is evaluating how robust a model is to changes in setting or population, which typically requires applying the model to multiple, independent datasets. Since the cost of collecting such datasets is often prohibitive, in this paper, we propose a framework for evaluating this type of stability using the available data. We use the original evaluation data to determine distributions under which the algorithm performs poorly, and estimate the algorithm’s performance on the "worst-case" distribution. We consider shifts in user defined conditional distributions, allowing some distributions to shift while keeping other portions of the data distribution fixed. For example, in a healthcare context, this allows us to consider shifts in clinical practice while keeping the patient population fixed. To address the challenges associated with estimation in complex, high-dimensional distributions, we derive a "debiased" estimator which maintains root-N consistency even when machine learning methods with slower convergence rates are used to estimate the nuisance parameters. In experiments on a real medical risk prediction task, we show this estimator can be used to analyze stability and accounts for realistic shifts that could not previously be expressed. The proposed framework allows practitioners to proactively evaluate the safety of their models without requiring additional data collection. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/subbaswamy21a.html
http://proceedings.mlr.press/v130/subbaswamy21a.html Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and (in asynchronous implementations) on the gradient staleness. Specifically, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a data-dependent parameter that explains the speedup saturation in both these settings. Our comprehensive theoretical analysis, for strongly convex, convex and non-convex settings, unifies and generalizes prior work directions that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically based guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/stich21a.html
http://proceedings.mlr.press/v130/stich21a.html Kernel Interpolation for Scalable Online Gaussian Processes Gaussian processes (GPs) provide a gold standard for performance in online settings, such as sample-efficient control and black box optimization, where we need to update a posterior distribution as we acquire data in a sequential online setting. However, updating a GP posterior to accommodate even a single new observation after having observed $n$ points incurs at least $\mathcal{O}(n)$ computations in the exact setting. We show how to use structured kernel interpolation to efficiently reuse computations for constant-time $\mathcal{O}(1)$ online updates with respect to the number of points $n$, while retaining exact inference. We demonstrate the promise of our approach in a range of online regression and classification settings, Bayesian optimization, and active sampling to reduce error in malaria incidence forecasting. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/stanton21a.html
http://proceedings.mlr.press/v130/stanton21a.html When OT meets MoM: Robust estimation of Wasserstein Distance Originating from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. It owes its recent ubiquity in generative modelling and variational inference to its ability to cope with distributions having non-overlapping support. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to leverage a Medians of Means (MoM) approach to provide robust estimates. Exploiting the dual Kantorovich formulation of the Wasserstein distance, we introduce and discuss novel MoM-based robust estimators whose consistency is studied under a data contamination model and for which convergence rates are provided. Beyond computational issues, the choice of the partition size, i.e., the unique parameter of these robust estimators, is investigated in numerical experiments. Furthermore, these MoM estimators make Wasserstein Generative Adversarial Network (WGAN) robust to outliers, as witnessed by an empirical study on two benchmarks, CIFAR10 and Fashion MNIST. Thu, 18 Mar 2021 00:00:00 +0000
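The MoM idea mentioned in this abstract can be sketched in its most basic form, robust estimation of a scalar mean. This is only the generic building block, not the paper's Wasserstein estimator; the function name `median_of_means`, the random partitioning, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def median_of_means(samples, n_blocks, seed=0):
    """Basic median-of-means estimate of a scalar mean.

    The samples are shuffled, split into `n_blocks` blocks, and the median
    of the block averages is returned. If fewer than half of the blocks
    contain an outlier, the median is taken among uncontaminated blocks.
    """
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    shuffled = samples[rng.permutation(samples.shape[0])]
    block_means = [block.mean() for block in np.array_split(shuffled, n_blocks)]
    return float(np.median(block_means))

# Synthetic check: 1000 clean draws centred at 1.0 plus 10 gross outliers.
rng = np.random.default_rng(1)
clean = rng.normal(loc=1.0, scale=1.0, size=1000)
polluted = np.concatenate([clean, np.full(10, 1e6)])
naive_estimate = float(polluted.mean())
robust_estimate = median_of_means(polluted, n_blocks=31)
```

The number of blocks plays the role of the partition size discussed in the abstract: here 31 blocks are used so that the 10 outliers can contaminate fewer than half of them.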
http://proceedings.mlr.press/v130/staerman21a.html
http://proceedings.mlr.press/v130/staerman21a.html Non-asymptotic Performance Guarantees for Neural Estimation of f-Divergences Statistical distances (SDs), which quantify the dissimilarity between probability distributions, are central to machine learning and statistics. A modern method for estimating such distances from data relies on parametrizing a variational form by a neural network (NN) and optimizing it. These estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there seems to be a fundamental tradeoff between the two sources of error involved: approximation and estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. This paper explores this tradeoff by means of non-asymptotic error bounds, focusing on three popular choices of SDs—Kullback-Leibler divergence, chi-squared divergence, and squared Hellinger distance. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. Numerical results validating the theory are also provided. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sreekumar21a.html
http://proceedings.mlr.press/v130/sreekumar21a.html Learning GPLVM with arbitrary kernels using the unscented transformation Gaussian Process Latent Variable Model (GPLVM) is a flexible framework to handle uncertain inputs in Gaussian Processes (GPs) and incorporate GPs as components of larger graphical models. Nonetheless, the standard GPLVM variational inference approach is tractable only for a narrow family of kernel functions. The most popular implementations of GPLVM circumvent this limitation using quadrature methods, which may become a computational bottleneck even for relatively low dimensions. For instance, the widely employed Gauss-Hermite quadrature has exponential complexity in the number of dimensions. In this work, we propose using the unscented transformation instead. Overall, this method presents comparable, if not better, performance than off-the-shelf solutions to GPLVM, and its computational complexity scales only linearly with the dimension. In contrast to Monte Carlo methods, our approach is deterministic and works well with quasi-Newton methods, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. We illustrate the applicability of our method with experiments on dimensionality reduction and multistep-ahead prediction with uncertainty propagation. Thu, 18 Mar 2021 00:00:00 +0000
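To make the linear scaling mentioned in this abstract concrete, the following sketch shows a textbook unscented transformation that pushes a Gaussian through a function using 2n + 1 deterministic sigma points. It is a generic illustration rather than the paper's GPLVM inference scheme; the function name `unscented_transform` and the default `kappa` are assumptions.

```python
import numpy as np

def unscented_transform(f, mean, cov, kappa=1.0):
    """Approximate the mean and covariance of f(x) for x ~ N(mean, cov).

    Uses the classic 2n + 1 sigma points, so the number of evaluations of
    `f` grows linearly with the input dimension n. `f` must map an array of
    shape (n,) to an array of shape (m,).
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.shape[0]
    # Columns of the Cholesky factor give the sigma-point offsets.
    chol = np.linalg.cholesky((n + kappa) * cov)
    points = np.vstack([mean, mean + chol.T, mean - chol.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    outputs = np.array([f(p) for p in points])
    out_mean = weights @ outputs
    centred = outputs - out_mean
    out_cov = (weights[:, None] * centred).T @ centred
    return out_mean, out_cov

# For a linear map the transformation is exact, which gives a simple check.
A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
b = np.array([0.5, -1.0, 2.0])
m = np.array([1.0, 2.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
ut_mean, ut_cov = unscented_transform(lambda x: A @ x + b, m, P)
```

Because the sigma points are deterministic, repeated calls return identical values, which is the property that makes such estimates compatible with quasi-Newton optimizers.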
http://proceedings.mlr.press/v130/souza21a.html
http://proceedings.mlr.press/v130/souza21a.html Ridge Regression with Over-parametrized Two-Layer Networks Converge to Ridgelet Spectrum Characterization of local minima draws much attention in theoretical studies of deep learning. In this study, we investigate the distribution of parameters in an over-parametrized finite neural network trained by ridge regularized empirical square risk minimization (RERM). We develop a new theory of ridgelet transform, a wavelet-like integral transform that provides a powerful and general framework for the theoretical study of neural networks involving not only the ReLU but general activation functions. We show that the distribution of the parameters converges to a spectrum of the ridgelet transform. This result provides a new insight into the characterization of the local minima of neural networks, and the theoretical background of an inductive bias theory based on lazy regimes. We confirm the visual resemblance between the parameter distribution trained by SGD, and the ridgelet spectrum calculated by numerical integration through numerical experiments with finite models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sonoda21a.html
http://proceedings.mlr.press/v130/sonoda21a.html Evading the Curse of Dimensionality in Unconstrained Private GLMs We revisit the well-studied problem of differentially private empirical risk minimization (ERM). We show that for unconstrained convex generalized linear models (GLMs), one can obtain an excess empirical risk of $\tilde{O}\left(\sqrt{\mathrm{rank}}/(\epsilon n)\right)$, where $\mathrm{rank}$ is the rank of the feature matrix in the GLM problem, $n$ is the number of data samples, and $\epsilon$ is the privacy parameter. This bound is attained via differentially private gradient descent (DP-GD). Furthermore, via the \emph{first lower bound for unconstrained private ERM}, we show that our upper bound is tight. In sharp contrast to the constrained ERM setting, there is no dependence on the dimensionality of the ambient model space ($p$). (Notice that $\mathrm{rank}\leq \min\{n, p\}$.) Besides, we obtain an analogous excess population risk bound which depends on $\mathrm{rank}$ instead of $p$. For the smooth non-convex GLM setting (i.e., where the objective function is non-convex but preserves the GLM structure), we further show that DP-GD attains a dimension-independent convergence of $\tilde{O}\left(\sqrt{\mathrm{rank}}/(\epsilon n)\right)$ to a first-order stationary point of the underlying objective. Finally, we show that for convex GLMs, a variant of DP-GD commonly used in practice (which involves clipping the individual gradients) also exhibits the same dimension-independent convergence to the minimum of a well-defined objective. To that end, we provide a structural lemma that characterizes the effect of clipping on the optimization profile of DP-GD. Thu, 18 Mar 2021 00:00:00 +0000
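The clipped variant of DP-GD mentioned at the end of this abstract can be sketched for logistic regression, one example of a convex GLM. This is a schematic under stated assumptions, not the paper's algorithm: the function name `clipped_noisy_gd`, the step size, and the `noise_std` parameter are hypothetical, and the noise level is NOT calibrated to any privacy budget here (that calibration would have to be done separately).

```python
import numpy as np

def clipped_noisy_gd(X, y, steps, lr, clip_norm, noise_std, seed=0):
    """Gradient descent on the logistic loss with per-example clipping.

    Each per-example gradient is rescaled to have norm at most `clip_norm`,
    the clipped gradients are summed, Gaussian noise with standard deviation
    `noise_std` is added to the sum, and the result is averaged over n.
    Labels y are assumed to be in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of log(1 + exp(-y * <x, w>)) for each example.
        coeffs = -y / (1.0 + np.exp(margins))
        grads = coeffs[:, None] * X
        norms = np.linalg.norm(grads, axis=1)
        scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
        clipped_sum = (grads * scale[:, None]).sum(axis=0)
        noisy_mean = (clipped_sum + rng.normal(0.0, noise_std, size=p)) / n
        w = w - lr * noisy_mean
    return w

# Noise-free run on separable synthetic data, used only as a sanity check.
data_rng = np.random.default_rng(0)
X_demo = data_rng.normal(size=(200, 5))
y_demo = np.sign(X_demo @ np.ones(5))
w_demo = clipped_noisy_gd(X_demo, y_demo, steps=200, lr=1.0, clip_norm=1.0, noise_std=0.0)
train_accuracy = float(np.mean(np.sign(X_demo @ w_demo) == y_demo))
```

Setting `noise_std=0.0` isolates the effect of clipping on the optimization path, which is the aspect the abstract's structural lemma is concerned with.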
http://proceedings.mlr.press/v130/song21a.html
http://proceedings.mlr.press/v130/song21a.html A Fast and Robust Method for Global Topological Functional Optimization Topological statistics, in the form of persistence diagrams, are a class of shape descriptors that capture global structural information in data. The mapping from data structures to persistence diagrams is almost everywhere differentiable, allowing for topological gradients to be backpropagated to ordinary gradients. However, as a method for optimizing a topological functional, this backpropagation method is expensive, unstable, and produces very fragile optima. Our contribution is to introduce a novel backpropagation scheme that is significantly faster, more stable, and produces more robust optima. Moreover, this scheme can also be used to produce a stable visualization of dots in a persistence diagram as a distribution over critical, and near-critical, simplices in the data structure. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/solomon21a.html
http://proceedings.mlr.press/v130/solomon21a.html Multi-Armed Bandits with Cost Subsidy In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned about optimizing cumulative costs and rewards. We present two applications, the intelligent SMS routing problem and the ad audience optimization problem faced by several businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms like Upper Confidence Bound and Thompson Sampling do not perform well for this problem. We then establish a fundamental lower bound on the performance of any online learning algorithm for this problem, highlighting the hardness of our problem in comparison to the classical MAB problem. We also present a simple variant of explore-then-commit and establish near-optimal regret bounds for this algorithm. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms for various instances and recommend a practical guide to employ different algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sinha21a.html
http://proceedings.mlr.press/v130/sinha21a.html Continuum-Armed Bandits: A Function Space Perspective The continuum-armed bandits problem involves optimizing an unknown objective function given an oracle that evaluates the function at a query point. In the most well-studied case, the objective function is assumed to be Lipschitz continuous and minimax rates of simple and cumulative regrets are known under both noiseless and noisy conditions. In this paper, we investigate continuum-armed bandits under more general smoothness conditions, namely Besov smoothness conditions, on the objective function. In both noiseless and noisy conditions, we derive minimax rates under both simple and cumulative regrets. In particular, our results show that minimax rates over objective functions in a Besov space are identical to minimax rates over objective functions in the smallest Hölder space into which the Besov space embeds. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/singh21a.html
http://proceedings.mlr.press/v130/singh21a.html The Minecraft Kernel: Modelling correlated Gaussian Processes in the Fourier domain In the univariate setting, using the kernel spectral representation is an appealing approach for generating stationary covariance functions. However, performing the same task for multiple-output Gaussian processes is substantially more challenging. We demonstrate that current approaches to modelling cross-covariances with a spectral mixture kernel possess a critical blind spot. Pairs of highly correlated (or highly anti-correlated) processes are not reproducible, aside from the special case when their spectral densities are of identical shape. We present a solution to this issue by replacing the conventional Gaussian components of a spectral mixture with block components of finite bandwidth (i.e. rectangular step functions). The proposed family of kernels represents the first multi-output generalisation of the spectral mixture kernel that can approximate any stationary multi-output kernel to arbitrary precision. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/simpson21a.html
http://proceedings.mlr.press/v130/simpson21a.html LENA: Communication-Efficient Distributed Learning with Self-Triggered Gradient Uploads In distributed optimization, parameter updates from the gradient computing node devices have to be aggregated in every iteration on the orchestrating server. When these updates are sent over an arbitrary commodity network, bandwidth and latency can be limiting factors. We propose a communication framework where nodes may skip unnecessary uploads. Every node locally accumulates an error vector in memory and self-triggers the upload of the memory contents to the parameter server using a significance filter. The server then uses a history of the nodes’ gradients to update the parameter. We characterize the convergence rate of our algorithm in smooth settings (strongly-convex, convex, and non-convex) and show that it enjoys the same convergence rate as when sending gradients every iteration, with substantially fewer uploads. Numerical experiments on real data indicate a significant reduction of used network resources (total communicated bits and latency), especially in large networks, compared to state-of-the-art algorithms. Our results provide important practical insights for using machine learning over resource-constrained networks, including Internet-of-Things and geo-separated datasets across the globe. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/shokri-ghadikolaei21a.html
http://proceedings.mlr.press/v130/shokri-ghadikolaei21a.html On Multilevel Monte Carlo Unbiased Gradient Estimation for Deep Latent Variable Models Standard variational schemes for training deep latent variable models rely on biased gradient estimates of the target objective. Techniques based on the Evidence Lower Bound (ELBO), and tighter variants obtained via importance sampling, produce biased gradient estimates of the true log-likelihood. The family of Reweighted Wake-Sleep (RWS) methods further relies on a biased estimator of the inference objective, which also biases training of the encoder. In this work, we show how Multilevel Monte Carlo (MLMC) can provide a natural framework for debiasing these methods with two different estimators. We prove rigorously that this approach yields unbiased gradient estimators with finite variance under reasonable conditions. Furthermore, we investigate methods that can reduce variance and ensure finite variance in practice. Finally, we show empirically that the proposed unbiased estimators outperform IWAE and other debiasing methods on a variety of applications at the same expected cost. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/shi21d.html
http://proceedings.mlr.press/v130/shi21d.html Federated Multi-armed Bandits with Personalization A general framework of personalized federated multi-armed bandits (PF-MAB) is proposed, which is a new bandit paradigm analogous to the federated learning (FL) framework in supervised learning and enjoys the features of FL with personalization. Under the PF-MAB framework, a mixed bandit learning problem that flexibly balances generalization and personalization is studied. A lower bound analysis for the mixed model is presented. We then propose the Personalized Federated Upper Confidence Bound (PF-UCB) algorithm, where the exploration length is chosen carefully to achieve the desired balance of learning the local model and supplying global information for the mixed learning objective. Theoretical analysis proves that PF-UCB achieves an O(log(T)) regret regardless of the degree of personalization, and has a similar instance dependency as the lower bound. Experiments using both synthetic and real-world datasets corroborate the theoretical analysis and demonstrate the effectiveness of the proposed algorithm. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/shi21c.html
http://proceedings.mlr.press/v130/shi21c.html A Deterministic Streaming Sketch for Ridge Regression We provide a deterministic space-efficient algorithm for estimating ridge regression. For n data points with d features and a large enough regularization parameter, we provide a solution within eps L_2 error using only O(d/eps) space. This is the first o(d^2) space deterministic streaming algorithm with guaranteed solution error and risk bound for this classic problem. The algorithm sketches the covariance matrix by variants of Frequent Directions, which implies it can operate in insertion-only streams and a variety of distributed data settings. In comparisons to randomized sketching algorithms on synthetic and real-world datasets, our algorithm has less empirical error using less space and similar time. Thu, 18 Mar 2021 00:00:00 +0000
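The covariance sketching step that this abstract attributes to Frequent Directions can be illustrated with the basic, textbook form of that algorithm. The sketch below is not the paper's variant and does not include the ridge regression solve; the function name `frequent_directions`, the 2·ell row buffer, and the assumption d >= 2·ell are choices made for this example.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Deterministic streaming sketch B (ell x d) with B.T @ B ~ A.T @ A.

    Rows are inserted into a buffer of 2 * ell rows. When the buffer is
    full, its singular values are shrunk so that at least half of the rows
    become zero and can be reused. Assumes a non-empty stream and
    d >= 2 * ell.
    """
    buffer = None
    next_free = 0

    def shrink(mat):
        _, s, vt = np.linalg.svd(mat, full_matrices=False)
        delta = s[ell] ** 2  # squared (ell + 1)-th largest singular value
        s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        return s_shrunk[:, None] * vt  # rows ell, ell + 1, ... are now zero

    for row in rows:
        row = np.asarray(row, dtype=float)
        if buffer is None:
            if row.shape[0] < 2 * ell:
                raise ValueError("this sketch assumes d >= 2 * ell")
            buffer = np.zeros((2 * ell, row.shape[0]))
        if next_free == 2 * ell:
            buffer = shrink(buffer)
            next_free = ell
        buffer[next_free] = row
        next_free += 1
    # Final shrink so that only ell non-zero rows are returned.
    return shrink(buffer)[:ell]

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
B = frequent_directions(A, ell=5)
spectral_error = np.linalg.norm(A.T @ A - B.T @ B, 2)
error_bound = np.linalg.norm(A, "fro") ** 2 / 5
```

The sketch never looks at a row twice and uses no randomness, which matches the insertion-only, deterministic setting described in the abstract; the spectral error of this basic version is at most the squared Frobenius norm of the data divided by `ell`.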
http://proceedings.mlr.press/v130/shi21b.html
http://proceedings.mlr.press/v130/shi21b.html Active Learning with Maximum Margin Sparse Gaussian Processes We present a maximum-margin sparse Gaussian Process (MM-SGP) for active learning (AL) of classification models for multi-class problems. The proposed model makes novel extensions to a GP by integrating maximum-margin constraints into its learning process, aiming to further improve its predictive power while keeping its inherent capability for uncertainty quantification. The MM constraints ensure a small "effective size" of the model, which allows MM-SGP to provide good predictive performance by using limited "active" data samples, a critical property for AL. Furthermore, as a Gaussian process model, MM-SGP will output both the predicted class distribution and the predictive variance, both of which are essential for defining a sampling function that effectively improves the decision boundaries of a large number of classes simultaneously. Finally, the sparse nature of MM-SGP ensures that it can be efficiently trained by solving a low-rank convex dual problem. Experimental results on both synthetic and real-world datasets show the effectiveness and efficiency of the proposed AL model. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/shi21a.html
http://proceedings.mlr.press/v130/shi21a.html Significance of Gradient Information in Bayesian Optimization We consider the problem of Bayesian Optimization (BO) in which the goal is to design an adaptive querying strategy to optimize a function $f:[0,1]^d\mapsto \mathbb{R}$. The function is assumed to be drawn from a Gaussian Process, and can only be accessed through noisy oracle queries. The most commonly used oracle in the BO literature is the noisy Zeroth-Order-Oracle (ZOO), which returns a noise-corrupted function value $y = f(x) + \eta$ at any point $x \in [0,1]^d$ queried by the agent. A less studied oracle in BO is the First-Order-Oracle (FOO), which also returns a noisy gradient value at the queried point. In this paper we consider the fundamental question of quantifying the possible improvement in regret that can be achieved under FOO access as compared to the case in which only ZOO access is available. Under some regularity assumptions on $K$, we first show that the expected cumulative regret with ZOO of any algorithm must satisfy a lower bound of $\Omega(\sqrt{2^d n})$, where $n$ is the query budget. This lower bound captures the appropriate scaling of the regret on both dimension $d$ and budget $n$, and relies on a novel reduction from BO to a multi-armed bandit (MAB) problem. We then propose a two-phase algorithm which, with some additional prior knowledge, achieves a vastly improved $\mathcal{O}\left( d (\log n)^2 \right)$ regret when given access to a FOO. Together, these two results highlight the significant value of incorporating gradient information in BO algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/shekhar21a.html
http://proceedings.mlr.press/v130/shekhar21a.html Sequential Random Sampling Revisited: Hidden Shuffle Method Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst-case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in few lines of code, offers strong worst-case guarantees and is shown to be faster than state-of-the-art methods while using comparably few random variates. Thu, 18 Mar 2021 00:00:00 +0000
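For context on what "sequential" sampling means in the abstract above, here is a minimal sketch of the classic selection-sampling baseline (often called Algorithm S). It is not the Hidden Shuffle method itself: it reports indices in ascending order and keeps only a counter, but it still visits every position rather than jumping between sampled positions. The function name and parameters are illustrative assumptions.

```python
import random


def selection_sample(n_total, n_sample, rng=None):
    """Yield n_sample distinct indices from range(n_total) in ascending order.

    Classic selection sampling, shown only as a baseline; this is NOT the
    Hidden Shuffle algorithm described in the abstract.
    """
    if not 0 <= n_sample <= n_total:
        raise ValueError("need 0 <= n_sample <= n_total")
    if rng is None:
        rng = random.Random()
    remaining = n_sample  # how many indices still have to be reported
    for index in range(n_total):
        if remaining == 0:
            return
        # Select this index with probability remaining / (n_total - index).
        if rng.random() * (n_total - index) < remaining:
            yield index
            remaining -= 1


sample = list(selection_sample(100, 10, random.Random(0)))
```

Each index is examined exactly once and only `remaining` is stored, so the iterator is lightweight, but the cost is linear in `n_total` regardless of the sample size.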
http://proceedings.mlr.press/v130/shekelyan21a.html
http://proceedings.mlr.press/v130/shekelyan21a.html On Learning Continuous Pairwise Markov Random Fields We consider learning a sparse pairwise Markov Random Field (MRF) with continuous-valued variables from i.i.d. samples. We adapt the algorithm of Vuffray et al. (2019) to this setting and provide finite-sample analysis revealing sample complexity scaling logarithmically with the number of variables, as in the discrete and Gaussian settings. Our approach is applicable to a large class of pairwise MRFs with continuous variables and also has desirable asymptotic properties, including consistency and normality under mild conditions. Further, we establish that the population version of the optimization criterion employed in Vuffray et al. (2019) can be interpreted as local maximum likelihood estimation (MLE). As part of our analysis, we introduce a robust variation of sparse linear regression à la Lasso, which may be of interest in its own right. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/shah21a.html
http://proceedings.mlr.press/v130/shah21a.html Generalization of Quasi-Newton Methods: Application to Robust Symmetric Multisecant Updates Quasi-Newton (qN) techniques approximate the Newton step by estimating the Hessian using the so-called secant equations. Some of these methods compute the Hessian using several secant equations but produce non-symmetric updates. Other quasi-Newton schemes, such as BFGS, enforce symmetry but cannot satisfy more than one secant equation. We propose a new type of quasi-Newton symmetric update using several secant equations in a least-squares sense. Our approach generalizes and unifies the design of quasi-Newton updates and satisfies provable robustness guarantees. Thu, 18 Mar 2021 00:00:00 +0000
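To make the secant equation in the abstract above concrete, the following sketch applies the standard BFGS update to a Hessian approximation `B` and checks that the result satisfies the single secant equation `B_new s = y`. This is the textbook BFGS formula, not the multisecant update proposed in the paper; the helper names and the 2x2 example values are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def bfgs_update(B, s, y):
    """Standard BFGS update of a symmetric Hessian approximation B.

    s is the step between iterates and y the corresponding gradient change.
    The result is symmetric and satisfies the one secant equation B_new s = y.
    """
    Bs = matvec(B, s)
    sBs = dot(s, Bs)
    ys = dot(y, s)
    if ys <= 0 or sBs <= 0:
        raise ValueError("curvature condition violated")
    n = len(s)
    return [
        [B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys for j in range(n)]
        for i in range(n)
    ]


B0 = [[1.0, 0.0], [0.0, 1.0]]  # initial approximation (identity)
s = [1.0, 2.0]                 # illustrative step
y = [3.0, 1.0]                 # illustrative gradient difference
B1 = bfgs_update(B0, s, y)
residual = [a - b for a, b in zip(matvec(B1, s), y)]  # should be ~0
```

The update enforces symmetry and exactly one secant pair `(s, y)`, which is the limitation the abstract refers to.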
http://proceedings.mlr.press/v130/scieur21a.html
http://proceedings.mlr.press/v130/scieur21a.html Rao-Blackwellised parallel MCMC Multiple proposal Markov chain Monte Carlo (MP-MCMC), as introduced in Calderhead (2014), allows for computationally efficient and parallelisable inference, whereby multiple states are proposed and computed simultaneously. In this paper, we improve the resulting integral estimators by sequentially using the multiple states within a Rao-Blackwellised estimator. We further propose a novel adaptive Rao-Blackwellised MP-MCMC algorithm, which generalises the adaptive MCMC algorithm introduced by Haario et al. (2001) to allow for multiple proposals. We prove its asymptotic unbiasedness, and demonstrate significant improvements in sampling efficiency through numerical studies. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/schwedes21a.html
http://proceedings.mlr.press/v130/schwedes21a.html Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties Counterfactual explanations (CEs) are a practical tool for demonstrating why machine learning classifiers make particular decisions. For CEs to be useful, it is important that they are easy for users to interpret. Existing methods for generating interpretable CEs rely on auxiliary generative models, which may not be suitable for complex datasets, and incur engineering overhead. We introduce a simple and fast method for generating interpretable CEs in a white-box setting without an auxiliary model, by using the predictive uncertainty of the classifier. Our experiments show that our proposed algorithm generates more interpretable CEs, according to IM1 scores (Van Looveren et al., 2019), than existing methods. Additionally, our approach allows us to estimate the uncertainty of a CE, which may be important in safety-critical applications, such as those in the medical domain. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/schut21a.html
http://proceedings.mlr.press/v130/schut21a.html A Spectral Analysis of Dot-product Kernels We present eigenvalue decay estimates of integral operators associated with compositional dot-product kernels. The estimates improve on previous ones established for power series kernels on spheres. This allows us to obtain the volumes of balls in the corresponding reproducing kernel Hilbert spaces. We discuss the consequences on statistical estimation with compositional dot product kernels and highlight interesting trade-offs between the approximation error and the statistical error depending on the number of compositions and the smoothness of the kernels. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/scetbon21b.html
http://proceedings.mlr.press/v130/scetbon21b.html Equitable and Optimal Transport with Multiple Agents We introduce an extension of the Optimal Transport problem when multiple costs are involved. Considering each cost as an agent, we aim to share equally between agents the work of transporting one distribution to another. To do so, we minimize the transportation cost of the agent who works the most. Another point of view is when the goal is to partition equitably goods between agents according to their heterogeneous preferences. Here we aim to maximize the utility of the least advantaged agent. This is a fair division problem. Like Optimal Transport, the problem can be cast as a linear optimization problem. When there is only one agent, we recover the Optimal Transport problem. When two agents are considered, we are able to recover Integral Probability Metrics defined by $\alpha$-Hölder functions, which include the widely-known Dudley metric. To the best of our knowledge, this is the first time a link is given between the Dudley metric and Optimal Transport. We provide an entropic regularization of that problem which leads to an alternative algorithm faster than the standard linear program. Thu, 18 Mar 2021 00:00:00 +0000
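As a rough illustration of entropic regularization, here is a sketch of the standard Sinkhorn iteration for the single-cost case, which the abstract above notes coincides with ordinary Optimal Transport when there is only one agent. It is not the multi-agent algorithm of the paper; the function name, the regularization strength `eps`, and the toy marginals are assumptions.

```python
import math


def sinkhorn(a, b, cost, eps=0.1, n_iters=500):
    """Entropy-regularized transport plan between histograms a and b.

    Standard single-cost Sinkhorn scaling; an illustrative sketch only.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


a = [0.5, 0.5]
b = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost)  # mass concentrates on the zero-cost diagonal
```

Each iteration needs only matrix-vector products with the kernel, which is the usual reason entropic schemes are cheaper per step than solving the linear program directly.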
http://proceedings.mlr.press/v130/scetbon21a.html
http://proceedings.mlr.press/v130/scetbon21a.html Distributionally Robust Optimization for Deep Kernel Multiple Instance Learning Multiple Instance Learning (MIL) provides a promising solution to many real-world problems, where labels are only available at the bag level but missing for instances due to a high labeling cost. As a powerful Bayesian non-parametric model, Gaussian Processes (GP) have been extended from classical supervised learning to MIL settings, aiming to identify the most likely positive (or least negative) instance from a positive (or negative) bag using only the bag-level labels. However, solely focusing on a single instance in a bag makes the model less robust to outliers or multi-modal scenarios, where a single bag contains a diverse set of positive instances. We propose a general GP mixture framework that simultaneously considers multiple instances through a latent mixture model. By adding a top-k constraint, the framework is equivalent to choosing the top-k most positive instances, making it more robust to outliers and multimodal scenarios. We further introduce a Distributionally Robust Optimization (DRO) constraint that removes the limitation of specifying a fixed k value. To ensure the prediction power over high-dimensional data (e.g., videos and images) that are common in MIL, we augment the GP kernel with fixed basis functions by using a deep neural network to learn adaptive basis functions so that the covariance structure of high-dimensional data can be accurately captured. Experiments are conducted on highly challenging real-world video anomaly detection tasks to demonstrate the effectiveness of the proposed model. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sapkota21a.html
http://proceedings.mlr.press/v130/sapkota21a.html Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship Online learning in a two-sided matching market, with demand side agents continuously competing to be matched with supply side (arms), abstracts the complex interactions under partial information on matching platforms (e.g. UpWork, TaskRabbit). We study the decentralized serial dictatorship setting, a two-sided matching market where the demand side agents have unknown and heterogeneous valuation over the supply side (arms), while the arms have known uniform preference over the demand side (agents). We design the first decentralized algorithm - UCB with Decentralized Dominant-arm Deletion (UCB-D3), for the agents, that does not require any knowledge of reward gaps or time horizon. UCB-D3 works in phases, where in each phase, agents delete dominated arms – the arms preferred by higher ranked agents, and play only from the non-dominated arms according to the UCB. At the end of the phase, agents broadcast in a decentralized fashion, their estimated preferred arms through pure exploitation. We prove a new regret lower bound for the decentralized serial dictatorship model, and prove that UCB-D3 achieves order optimal regret guarantee. Thu, 18 Mar 2021 00:00:00 +0000
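The phrase "play only from the non-dominated arms according to the UCB" in the abstract above can be pictured with the sketch below. It uses the textbook UCB1 index and takes the set of deleted (dominated) arms as an input; the exact confidence width, the phase structure, and the broadcast step of UCB-D3 are not reproduced, and all names here are hypothetical.

```python
import math


def ucb_index(mean, pulls, t):
    """Textbook UCB1 index; the constant 2 is an assumption, not UCB-D3's."""
    if pulls == 0:
        return math.inf  # force at least one pull of every arm
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def choose_arm(means, pulls, t, deleted=frozenset()):
    """Pick the arm with the largest UCB index among non-deleted arms."""
    candidates = [arm for arm in range(len(means)) if arm not in deleted]
    if not candidates:
        raise ValueError("all arms have been deleted")
    return max(candidates, key=lambda arm: ucb_index(means[arm], pulls[arm], t))


means = [0.9, 0.5, 0.1]   # empirical mean reward per arm (illustrative)
pulls = [10, 10, 10]      # number of pulls per arm so far
best_overall = choose_arm(means, pulls, t=30)
best_after_deletion = choose_arm(means, pulls, t=30, deleted={0})
```

Deleting arm 0 models an agent that yields the arm preferred by a higher-ranked agent and then runs UCB on what remains.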
http://proceedings.mlr.press/v130/sankararaman21a.html
http://proceedings.mlr.press/v130/sankararaman21a.html Differentiable Greedy Algorithm for Monotone Submodular Maximization: Guarantees, Gradient Estimators, and Applications Motivated by, e.g., sensitivity analysis and end-to-end learning, the demand for differentiable optimization algorithms has been increasing. This paper presents a theoretically guaranteed differentiable greedy algorithm for monotone submodular function maximization. We smooth the greedy algorithm via randomization, and prove that it almost recovers original approximation guarantees in expectation for the cases of cardinality and $\kappa$-extendible system constraints. We then present how to efficiently compute gradient estimators of any expected output-dependent quantities. We demonstrate the usefulness of our method by instantiating it for various applications. Thu, 18 Mar 2021 00:00:00 +0000
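For reference, the deterministic greedy algorithm that the abstract above smooths via randomization can be sketched as follows for a simple monotone submodular function, set coverage, under a cardinality constraint. This is the classical non-differentiable greedy, not the paper's smoothed variant; the coverage objective and the example sets are assumptions chosen for illustration.

```python
def greedy_max_coverage(sets, k):
    """Pick at most k sets greedily to maximize the number of covered elements.

    Coverage is monotone submodular, so this is an instance of the classical
    greedy algorithm under a cardinality constraint.
    """
    chosen = []
    covered = set()
    for _ in range(min(k, len(sets))):
        best, best_gain = None, 0
        for i, candidate in enumerate(sets):
            if i in chosen:
                continue
            gain = len(candidate - covered)  # marginal gain of adding set i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining set adds anything new
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}]
chosen, covered = greedy_max_coverage(sets, k=2)
```

The `argmax` over marginal gains is the step that blocks gradients, which is why a randomized (smoothed) selection is needed to differentiate through the procedure.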
http://proceedings.mlr.press/v130/sakaue21a.html
http://proceedings.mlr.press/v130/sakaue21a.html Differentially Private Monotone Submodular Maximization Under Matroid and Knapsack Constraints Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of non-negative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1-\frac{\kappa}{e})$-approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the non-private setting. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sadeghi21a.html
http://proceedings.mlr.press/v130/sadeghi21a.html Improved Exploration in Factored Average-Reward MDPs We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state-space $\mathcal S$ admit the respective factored forms of $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$, and the transition and reward functions are factored over $\mathcal X$ and $\mathcal S$. Assuming a known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of the $\mathcal S_i$’s and the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys a substantially improved regret empirically over existing algorithms that have frequentist regret guarantees. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sadegh-talebi21a.html
http://proceedings.mlr.press/v130/sadegh-talebi21a.html Regret-Optimal Filtering We consider the problem of filtering in linear state-space models (e.g., the Kalman filter setting) through the lens of regret optimization. Specifically, we study the problem of causally estimating a desired signal, generated by a linear state-space model driven by process noise, based on noisy observations of a related observation process. We define a novel regret criterion for estimator design as the difference of the estimation error energies between a clairvoyant estimator that has access to all future observations (a so-called smoother) and a causal one that only has access to current and past observations. The regret-optimal estimator is the causal estimator that minimizes the worst-case regret across all bounded-energy noise sequences. We provide a solution for the regret filtering problem at two levels. First, a horizon-independent solution at the operator level is obtained by reducing the regret to the well-known Nehari problem. Second, our main result for state-space models is an explicit estimator that achieves the optimal regret. The regret-optimal estimator is represented as a finite-dimensional state-space whose parameters can be computed by solving three Riccati equations and a single Lyapunov equation. We demonstrate the applicability and efficacy of the estimator in a variety of problems and observe that the estimator has average and worst-case performances that are simultaneously close to their optimal values. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/sabag21a.html
http://proceedings.mlr.press/v130/sabag21a.html Self-Concordant Analysis of Generalized Linear Bandits with Forgetting Contextual sequential decision problems with categorical or numerical observations are ubiquitous and Generalized Linear Bandits (GLB) offer a solid theoretical framework to address them. In contrast to the case of linear bandits, existing algorithms for GLB have two drawbacks undermining their applicability. First, they rely on excessively pessimistic concentration bounds due to the non-linear nature of the model. Second, they require either non-convex projection steps or burn-in phases to enforce boundedness of the estimators. Both of these issues are worsened when considering non-stationary models, in which the GLB parameter may vary with time. In this work, we focus on self-concordant GLB (which include logistic and Poisson regression) with forgetting achieved either by the use of a sliding window or exponential weights. We propose a novel confidence-based algorithm for the maximum-likelihood estimator with forgetting and analyze its performance in abruptly changing environments. These results as well as the accompanying numerical simulations highlight the potential of the proposed approach to address non-stationarity in GLB. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/russac21a.html
http://proceedings.mlr.press/v130/russac21a.html Towards Flexible Device Participation in Federated Learning Traditional federated learning algorithms impose strict requirements on the participation rates of devices, which limit the potential reach of federated learning. This paper extends the current learning paradigm to include devices that may become inactive, compute incomplete updates, and depart or arrive in the middle of training. We derive analytical results to illustrate how allowing more flexible device participation can affect the learning convergence when data is not independently and identically distributed (non-IID). We then propose a new federated aggregation scheme that converges even when devices may be inactive or return incomplete updates. We also study how the learning process can adapt to early departures or late arrivals, and analyze their impacts on the convergence. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ruan21a.html
http://proceedings.mlr.press/v130/ruan21a.html ATOL: Measure Vectorization for Automatic Topologically-Oriented Learning Robust topological information commonly comes in the form of a set of persistence diagrams, finite measures that are by nature difficult to plug into generic machine learning frameworks. We introduce a fast, learnt, unsupervised vectorization method for measures in Euclidean spaces and use it for reflecting underlying changes in topological behaviour in machine learning contexts. The algorithm is simple and efficiently discriminates important space regions where meaningful differences to the mean measure arise. It is proven to be able to separate clusters of persistence diagrams. We showcase the strength and robustness of our approach on a number of applications, from emulous and modern graph collections where the method reaches state-of-the-art performance to a geometric synthetic dynamical orbits problem. The proposed methodology comes with a single high-level tuning parameter: the total measure encoding budget. We provide fully open-access software. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/royer21a.html
http://proceedings.mlr.press/v130/royer21a.html Dynamic Cutset Networks Tractable probabilistic models (TPMs) are appealing because they admit polynomial-time inference for a wide variety of queries. In this work, we extend the cutset network (CN) framework, a powerful sub-class of TPMs that often outperforms probabilistic graphical models in terms of prediction accuracy, to the temporal domain. This extension, dubbed dynamic cutset networks (DCNs), uses a CN to model the prior distribution and a conditional CN to model the transition distribution. We show that although exact inference is intractable when arbitrary conditional CNs are used, particle filtering is efficient. To ensure tractability of exact inference, we introduce a novel constrained conditional model called AND/OR conditional cutset networks and show that under certain conditions exact inference is linear in the size of the corresponding constrained DCN. Experiments on several sequential datasets demonstrate the efficacy of our framework. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/roy21a.html
http://proceedings.mlr.press/v130/roy21a.html Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations Variational inference techniques based on inducing variables provide an elegant framework for scalable posterior estimation in Gaussian process (GP) models. Besides enabling scalability, one of their main advantages over sparse approximations using direct marginal likelihood maximization is that they provide a robust alternative for point estimation of the inducing inputs, i.e. the location of the inducing variables. In this work we challenge the common wisdom that optimizing the inducing inputs in the variational framework yields optimal performance. We show that, by revisiting old model approximations such as the fully-independent training conditionals endowed with powerful sampling-based inference methods, treating both inducing locations and GP hyper-parameters in a Bayesian way can improve performance significantly. Based on stochastic gradient Hamiltonian Monte Carlo, we develop a fully Bayesian approach to scalable GP and deep GP models, and demonstrate its state-of-the-art performance through an extensive experimental campaign across several regression and classification problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/rossi21a.html
http://proceedings.mlr.press/v130/rossi21a.html Provably Safe PAC-MDP Exploration Using Analogies A key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration (needed to attain good performance on the task) with safety (needed to avoid catastrophic failure). Although a growing line of work in reinforcement learning has investigated this area of "safe exploration," most existing techniques either 1) do not guarantee safety during the actual exploration process; and/or 2) limit the problem to a priori known and/or deterministic transition dynamics with strong smoothness assumptions. Addressing this gap, we propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in MDPs with unknown, stochastic dynamics. Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP sense. Additionally, ASE also guides exploration towards the most task-relevant states, which empirically results in significant improvements in terms of sample efficiency, when compared to existing methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/roderick21a.html
http://proceedings.mlr.press/v130/roderick21a.html Localizing Changes in High-Dimensional Regression Models This paper addresses the problem of localizing change points in high-dimensional linear regression models with piecewise constant regression coefficients. We develop a dynamic programming approach to estimate the locations of the change points whose performance improves upon the current state-of-the-art, even as the dimension, the sparsity of the regression coefficients, the temporal spacing between two consecutive change points, and the magnitude of the difference of two consecutive regression coefficient vectors are allowed to vary with the sample size. Furthermore, we devise a computationally-efficient refinement procedure that provably reduces the localization error of preliminary estimates of the change points. We demonstrate minimax lower bounds on the localization error that nearly match the upper bound on the localization error of our methodology and show that the signal-to-noise condition we impose is essentially the weakest possible based on information-theoretic arguments. Extensive numerical results support our theoretical findings, and experiments on real air quality data reveal change points supported by historical information not used by the algorithm. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/rinaldo21a.html
http://proceedings.mlr.press/v130/rinaldo21a.html Asymptotics of Ridge(less) Regression under General Source Condition We analyze the prediction error of ridge regression in an asymptotic regime where the sample size and dimension go to infinity at a proportional rate. In particular, we consider the role played by the structure of the true regression parameter. We observe that the case of a general deterministic parameter can be reduced to the case of a random parameter from a structured prior. The latter assumption is a natural adaptation of classic smoothness assumptions in nonparametric regression, which are known as source conditions in the context of regularization theory for inverse problems. Roughly speaking, we assume the large coefficients of the parameter are in correspondence to the principal components. In this setting a precise characterisation of the test error is obtained, depending on the input covariance and regression parameter structure. We illustrate this characterisation in a simplified setting to investigate the influence of the true parameter on optimal regularisation for overparameterized models. We show that interpolation (no regularisation) can be optimal even with bounded signal-to-noise ratio (SNR), provided that the parameter coefficients are larger on high-variance directions of the data, corresponding to a more regular function than posited by the regularization term. This contrasts with previous work considering ridge regression with isotropic prior, in which case interpolation is only optimal in the limit of infinite SNR. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/richards21b.html
http://proceedings.mlr.press/v130/richards21b.html Learning with Gradient Descent and Weakly Convex Losses We study the learning performance of gradient descent when the empirical risk is weakly convex, namely, the smallest negative eigenvalue of the empirical risk’s Hessian is bounded in magnitude. By showing that this eigenvalue can control the stability of gradient descent, generalisation error bounds are proven that hold under a wider range of step sizes compared to previous work. Out of sample guarantees are then achieved by decomposing the test error into generalisation, optimisation and approximation errors, each of which can be bounded and traded off with respect to algorithmic parameters, sample size and magnitude of this eigenvalue. In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity, specifically, the Hessian’s smallest eigenvalue during training can be controlled by the normalisation of the layers, i.e., network scaling. This allows test error guarantees to then be achieved when the population risk minimiser satisfies a complexity assumption. By trading off the network complexity and scaling, insights are gained into the implicit bias of neural network scaling, which are further supported by experimental findings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/richards21a.html
http://proceedings.mlr.press/v130/richards21a.html Online Active Model Selection for Pre-trained Classifiers Given $k$ pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can also be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/reza-karimi21a.html
http://proceedings.mlr.press/v130/reza-karimi21a.html Influence Decompositions For Neural Network Attribution Methods of neural network attribution have emerged out of a necessity for explanation and accountability in the predictions of black-box neural models. Most approaches use a variation of sensitivity analysis, where individual input variables are perturbed and the downstream effects on some output metric are measured. We demonstrate that a number of critical functional properties are not revealed when only considering lower-order perturbations. Motivated by these shortcomings, we propose a general framework for decomposing the orders of influence that a collection of input variables has on an output classification. These orders are based on the cardinality of input subsets which are perturbed to yield a change in classification. This decomposition can be naturally applied to attribute which input variables rely on higher-order coordination to impact the classification decision. We demonstrate that our approach correctly identifies higher-order attribution on a number of synthetic examples. Additionally, we showcase the differences between attribution in our approach and existing approaches on benchmark networks for MNIST and ImageNet. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/reing21a.html
http://proceedings.mlr.press/v130/reing21a.html RankDistil: Knowledge Distillation for Ranking Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a distillation framework for top-k ranking and draw connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the order of items of student and teacher, while penalizing large scores for items ranked low by the teacher. Building on this, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank, and establish statistical basis for the method. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/reddi21a.html
http://proceedings.mlr.press/v130/reddi21a.html Top-m identification for linear bandits Motivated by an application to drug repurposing, we propose the first algorithms to tackle the identification of the m ≥ 1 arms with largest means in a linear bandit model, in the fixed-confidence setting. These algorithms belong to the generic family of Gap-Index Focused Algorithms (GIFA) that we introduce for Top-m identification in linear bandits. We propose a unified analysis of these algorithms, which shows how the use of contexts might decrease the sample complexity. We further validate these algorithms empirically on simulated data and on a simple drug repurposing task. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/reda21a.html
http://proceedings.mlr.press/v130/reda21a.html Longitudinal Variational Autoencoder Longitudinal datasets, measured repeatedly over time from individual subjects, arise in many biomedical, psychological, social, and other studies. A common approach to analyse high-dimensional data that contains missing values is to learn a low-dimensional representation using variational autoencoders (VAEs). However, standard VAEs assume that the learnt representations are i.i.d., and fail to capture the correlations between the data samples. We propose the Longitudinal VAE (L-VAE), which uses a multi-output additive Gaussian process (GP) prior to extend the VAE’s capability to learn structured low-dimensional representations imposed by auxiliary covariate information, and derive a new KL divergence upper bound for such GPs. Our approach can simultaneously accommodate both time-varying shared and random effects, produce structured low-dimensional representations, disentangle effects of individual covariates or their interactions, and achieve highly accurate predictive performance. We compare our model against previous methods on synthetic as well as clinical datasets, and demonstrate state-of-the-art performance in data imputation, reconstruction, and long-term prediction tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ramchandran21b.html
http://proceedings.mlr.press/v130/ramchandran21b.html Latent Gaussian process with composite likelihoods and numerical quadrature Clinical patient records are an example of high-dimensional data that is typically collected from disparate sources and comprises multiple likelihoods with noisy as well as missing values. In this work, we propose an unsupervised generative model that can learn a low-dimensional representation among the observations in a latent space, while making use of all available data in a heterogeneous data setting with missing values. We improve upon the existing Gaussian process latent variable model (GPLVM) by incorporating multiple likelihoods and deep neural network parameterised back-constraints to create a non-linear dimensionality reduction technique for heterogeneous data. In addition, we develop a variational inference method for our model that uses numerical quadrature. We establish the effectiveness of our model and compare against existing GPLVM methods on a standard benchmark dataset as well as on clinical data of Parkinson’s disease patients treated at the HUS Helsinki University Hospital. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ramchandran21a.html
http://proceedings.mlr.press/v130/ramchandran21a.html Explicit Regularization of Stochastic Gradient Methods through Duality We consider stochastic gradient methods under the interpolation regime where a perfect fit can be obtained (minimum loss at each observation). While previous work highlighted the implicit regularization of such algorithms, we consider an explicit regularization framework as a minimum Bregman divergence convex feasibility problem. Using convex duality, we propose randomized Dykstra-style algorithms based on randomized dual coordinate ascent. For non-accelerated coordinate descent, we obtain an algorithm which bears strong similarities with (non-averaged) stochastic mirror descent on specific functions, as it is equivalent for quadratic objectives, and equivalent in the early iterations for more general objectives. It comes with the benefit of an explicit convergence theorem to a minimum norm solution. For accelerated coordinate descent, we obtain a new algorithm that has better convergence properties than existing stochastic gradient methods in the interpolating regime. This leads to accelerated versions of the perceptron for generic $\ell_p$-norm regularizers, which we illustrate in experiments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/raj21a.html
http://proceedings.mlr.press/v130/raj21a.html The Base Measure Problem and its Solution Probabilistic programming systems generally compute with probability density functions, leaving the base measure of each such function implicit. This mostly works, but creates problems when densities with respect to different base measures are accidentally combined or compared. Mistakes also happen when computing volume corrections for continuous changes of variables, which in general depend on the support measure. We motivate and clarify the problem in the context of a composable library of probability distributions and bijective transformations. We solve the problem by standardizing on Hausdorff measure as a base, and deriving formulas for comparing and combining mixed-dimension densities, as well as updating densities with respect to Hausdorff measure under diffeomorphic transformations. We also propose a software architecture that implements these formulas efficiently in the common case. We hope that by adopting our solution, probabilistic programming systems can become more robust and general, and make a broader class of models accessible to practitioners. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/radul21a.html
http://proceedings.mlr.press/v130/radul21a.html Adaptive Sampling for Fast Constrained Maximization of Submodular Functions Several large-scale machine learning tasks, such as data summarization, can be approached by maximizing functions that satisfy submodularity. These optimization problems often involve complex side constraints, imposed by the underlying application. In this paper, we develop an algorithm with poly-logarithmic adaptivity for non-monotone submodular maximization under general side constraints. The adaptive complexity of a problem is the minimal number of sequential rounds required to achieve the objective. Our algorithm is suitable to maximize a non-monotone submodular function under a p-system side constraint, and it achieves a (p + O(sqrt(p)))-approximation for this problem, after only poly-logarithmic adaptive rounds and polynomial queries to the valuation oracle function. Furthermore, our algorithm achieves a (p + O(1))-approximation when the given side constraint is a p-extendable system. This algorithm yields an exponential speed-up, with respect to the adaptivity, over any other known constant-factor approximation algorithm for this problem. It also competes with previously known results in terms of the query complexity. We perform experiments on various real-world applications. We find that, in comparison with commonly used heuristics, our algorithm performs better on these instances. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/quinzan21a.html
http://proceedings.mlr.press/v130/quinzan21a.html On the Memory Mechanism of Tensor-Power Recurrent Models The tensor-power (TP) recurrent model is a family of non-linear dynamical systems whose recurrence relation consists of a p-fold (a.k.a. degree-p) tensor product. Although such models frequently appear in advanced recurrent neural networks (RNNs), to date there has been limited study of their memory property, a critical characteristic in sequence tasks. In this work, we conduct a thorough investigation of the memory mechanism of TP recurrent models. Theoretically, we prove that a large degree p is an essential condition to achieve the long memory effect, yet it would lead to unstable dynamical behaviors. Empirically, we tackle this issue by extending the degree p from a discrete to a differentiable domain, such that it is efficiently learnable from a variety of datasets. Taken together, the new model is expected to benefit from the long memory effect in a stable manner. We experimentally show that the proposed model achieves competitive performance compared to various advanced RNNs in both the single-cell and seq2seq architectures. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/qiu21a.html
http://proceedings.mlr.press/v130/qiu21a.html Understanding Gradient Clipping In Incremental Gradient Methods We provide a theoretical analysis of how gradient clipping affects the convergence of incremental gradient methods when minimizing an objective function that is the sum of a large number of component functions. We show that clipping the gradients of component functions leads to bias in the descent direction, which is affected by the clipping threshold, the norms of gradients of component functions, together with the angles between gradients of component functions and the full gradient. We then propose some sufficient conditions under which the incremental gradient methods with gradient clipping can be shown to be convergent under the more general relaxed smoothness assumption. We also empirically observe that the angles between gradients of component functions and the full gradient generally decrease as the batch size increases, which may help to explain why larger batch sizes generally lead to faster convergence in training deep neural networks with gradient clipping. Thu, 18 Mar 2021 00:00:00 +0000
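The bias that per-component clipping can introduce is easy to see on a toy problem. The sketch below is not taken from the paper: it assumes three hypothetical one-dimensional components f_i(x) = 0.5 (x - a_i)^2, a cyclic incremental gradient loop, and a clipping threshold of 1. Without clipping the iterate settles near the true minimizer, the mean of the a_i; with clipping, the outlying component's gradient is capped and the iterate settles elsewhere.

```python
def clip(g: float, threshold: float) -> float:
    """Clip a scalar gradient to the interval [-threshold, threshold]."""
    return max(-threshold, min(threshold, g))


def incremental_gradient(targets, step, epochs, threshold=None, x0=0.0):
    """Cyclic incremental gradient on sum_i 0.5 * (x - a_i)^2.

    If `threshold` is given, each component gradient is clipped before
    the update (an illustrative choice, not the paper's exact scheme).
    """
    x = x0
    for _ in range(epochs):
        for a in targets:
            g = x - a  # gradient of the component 0.5 * (x - a)^2
            if threshold is not None:
                g = clip(g, threshold)
            x -= step * g
    return x


targets = [0.0, 0.0, 10.0]  # hypothetical data; true minimizer is 10/3
x_plain = incremental_gradient(targets, step=1e-3, epochs=5000)
x_clipped = incremental_gradient(targets, step=1e-3, epochs=5000, threshold=1.0)

print(f"no clipping:   x = {x_plain:.3f}")
print(f"with clipping: x = {x_clipped:.3f}")
```

With clipping, the stationary condition becomes 2x - 1 = 0 (the third component always contributes a gradient of -1), so the clipped iterate sits near 0.5 rather than 10/3.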
http://proceedings.mlr.press/v130/qian21a.html
http://proceedings.mlr.press/v130/qian21a.html Statistical Guarantees for Transformation Based Models with applications to Implicit Variational Inference Transformation based methods have been an attractive approach in non-parametric inference for problems such as unconditioned and conditional density estimation due to their unique hierarchical structure that models the data as flexible transformation of a set of common latent variables. More recently, transformation based models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both non-parametric inference and variational inference lacks theoretical justification. In the context of non-linear latent variable models (NL-LVM), we provide theoretical justification for the use of these models in non-parametric inference by showing that the support of the transformation induced prior in the space of densities is sufficiently large in the $L_1$ sense and show that for this class of priors the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the non-parametric setting we use the NL-LVM to construct an implicit family of variational distributions, deemed as GP-IVI. We delineate sufficient conditions under which GP-IVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback-Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit variational inference. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/plummer21a.html
http://proceedings.mlr.press/v130/plummer21a.html Designing Transportable Experiments Under S-admissability We consider the problem of designing a randomized experiment on a source population to estimate the Average Treatment Effect (ATE) on a target population. We propose a novel approach which explicitly considers the target when designing the experiment on the source. Under the covariate shift assumption, we design an unbiased importance-weighted estimator for the target population’s ATE. To reduce the variance of our estimator, we design a covariate balance condition (Target Balance) between the treatment and control groups based on the target population. We show that Target Balance achieves a higher variance reduction asymptotically than methods that do not consider the target population during the design phase. Our experiments illustrate that Target Balance reduces the variance even for small sample sizes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/phan21a.html
http://proceedings.mlr.press/v130/phan21a.html Differentially Private Online Submodular Maximization In this work we consider the problem of online submodular maximization under a cardinality constraint with differential privacy (DP). A stream of T submodular functions over a common finite ground set U arrives online, and at each time-step the decision maker must choose at most k elements of U before observing the function. The decision maker obtains a profit equal to the function evaluated on the chosen set and aims to learn a sequence of sets that achieves low expected regret. In the full-information setting, we develop an $(\varepsilon,\delta)$-DP algorithm with expected (1-1/e)-regret bound of $O( \frac{k^2\log |U|\sqrt{T \log k/\delta}}{\varepsilon} )$. This algorithm contains k ordered experts that learn the best marginal increments for each item over the whole time horizon while maintaining privacy of the functions. In the bandit setting, we provide an $(\varepsilon,\delta+ O(e^{-T^{1/3}}))$-DP algorithm with expected (1-1/e)-regret bound of $O( \frac{\sqrt{\log k/\delta}}{\varepsilon} (k (|U| \log |U|)^{1/3})^2 T^{2/3} )$. One challenge for privacy in this setting is that the payoff and feedback of expert i depend on the actions taken by her i-1 predecessors. This particular type of information leakage is not covered by post-processing, and new analysis is required. Our techniques for maintaining privacy with feedforward may be of independent interest. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/perez-salazar21a.html
http://proceedings.mlr.press/v130/perez-salazar21a.html Regression Discontinuity Design under Self-selection Regression Discontinuity (RD) design is commonly used to estimate the causal effect of a policy. Existing RD relies on the continuity assumption of potential outcomes. However, self-selection leads to different distributions of covariates on the two sides of the policy intervention, which violates this assumption. The standard RD estimators are no longer applicable in such a setting. We show that the direct causal effect can still be recovered under a class of weighted average treatment effects. We propose a set of estimators through a weighted local linear regression framework and prove the consistency and asymptotic normality of the estimators. We apply our method to a novel data set from Microsoft Bing on Generalized Second Price (GSP) auctions and show that placing the advertisement in the second-ranked position can increase click-ability by 1.91%. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/peng21a.html
http://proceedings.mlr.press/v130/peng21a.html Uniform Consistency of Cross-Validation Estimators for High-Dimensional Ridge Regression We examine generalized and leave-one-out cross-validation for ridge regression in a proportional asymptotic framework where the dimension of the feature space grows proportionally with the number of observations. Given i.i.d. samples from a linear model with an arbitrary feature covariance and a signal vector that is bounded in $\ell_2$ norm, we show that generalized cross-validation for ridge regression converges almost surely to the expected out-of-sample prediction error, uniformly over a range of ridge regularization parameters that includes zero (and even negative values). We prove the analogous result for leave-one-out cross-validation. As a consequence, we show that ridge tuning via minimization of generalized or leave-one-out cross-validation asymptotically almost surely delivers the optimal level of regularization for predictive accuracy, whether it be positive, negative, or zero. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/patil21a.html
http://proceedings.mlr.press/v130/patil21a.html A unified view of likelihood ratio and reparameterization gradients Reparameterization (RP) and likelihood ratio (LR) gradient estimators are used to estimate gradients of expectations throughout machine learning and reinforcement learning; however, they are usually explained as simple mathematical tricks, with no insight into their nature. We use a first-principles approach to explain that LR and RP are alternative methods of keeping track of the movement of probability mass, and the two are connected via the divergence theorem. Moreover, we show that the space of all possible estimators combining LR and RP can be completely parameterized by a flow field u(x) and importance sampling distribution q(x). We prove that there cannot exist a single-sample estimator of this type outside our characterized space, thus clarifying where we should be searching for better Monte Carlo gradient estimators. Thu, 18 Mar 2021 00:00:00 +0000
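As a concrete reminder of what the two standard estimators compute, here is a minimal sketch for a case with a known answer. It is a textbook example rather than the paper's construction, and the setup is assumed for illustration: x ~ N(mu, 1) and f(x) = x^2, so that E[f(x)] = mu^2 + 1 and the true gradient with respect to mu is 2 mu. The function name `lr_and_rp_gradients` is hypothetical.

```python
import random


def lr_and_rp_gradients(mu: float, num_samples: int, seed: int = 0):
    """Monte Carlo estimates of d/dmu E_{x ~ N(mu, 1)}[x^2].

    LR uses f(x) * d/dmu log p(x; mu) = x^2 * (x - mu).
    RP writes x = mu + eps with eps ~ N(0, 1) and differentiates
    through the sample: d/dmu (mu + eps)^2 = 2 * (mu + eps).
    """
    rng = random.Random(seed)
    lr_sum = 0.0
    rp_sum = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        x = mu + eps
        lr_sum += x * x * (x - mu)  # likelihood ratio (score function) term
        rp_sum += 2.0 * x           # reparameterization (pathwise) term
    return lr_sum / num_samples, rp_sum / num_samples


lr_grad, rp_grad = lr_and_rp_gradients(mu=1.5, num_samples=200_000)
print(f"LR estimate: {lr_grad:.3f}, RP estimate: {rp_grad:.3f}, exact: 3.0")
```

Both estimators are unbiased for the same gradient; in this particular example the RP term has much lower variance per sample than the LR term.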
http://proceedings.mlr.press/v130/parmas21a.html
http://proceedings.mlr.press/v130/parmas21a.html Local Competition and Stochasticity for Adversarial Robustness in Deep Learning This work addresses adversarial robustness in deep learning by considering deep networks with stochastic local winner-takes-all (LWTA) activations. This type of network unit results in sparse representations from each model layer, as the units are organized in blocks where only one unit generates a non-zero output. The main operating principle of the introduced units relies on stochastic arguments, as the network performs posterior sampling over competing units to select the winner. We combine these LWTA arguments with tools from the field of Bayesian non-parametrics, specifically the stick-breaking construction of the Indian Buffet Process, to allow for inferring the sub-part of each layer that is essential for modeling the data at hand. Then, inference is performed by means of stochastic variational Bayes. We perform a thorough experimental evaluation of our model using benchmark datasets. As we show, our method achieves high robustness to adversarial perturbations, with state-of-the-art performance in powerful adversarial attack schemes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/panousis21a.html
http://proceedings.mlr.press/v130/panousis21a.html Sketch based Memory for Neural Networks Deep learning has shown tremendous success on a variety of problems. However, unlike the traditional computational paradigm, most neural networks do not have access to a memory, which might be hampering their ability to scale to large data structures such as graphs, lookup-tables, and databases. We propose a neural architecture where sketch based memory is integrated into a neural network in a uniform manner at every layer. This architecture supplements a neural layer with information accessed from the memory before feeding it to the next layer, thereby significantly expanding the capacity of the network to solve larger problem instances. We show theoretically that problems involving key-value lookup that are traditionally stored in standard databases can now be solved using neural networks augmented by our memory architecture. We also show that our memory layer can be viewed as a kernel function. We show benefits on diverse problems such as long tail image classification, language modeling, and large graph multi-hop traversal, arguing that they are all built upon the classical key-value lookup problem (or the variant where the keys may be fuzzy). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/panigrahy21a.html
http://proceedings.mlr.press/v130/panigrahy21a.html Fourier Bases for Solving Permutation Puzzles Traditionally, permutation puzzles such as the Rubik’s Cube were often solved by heuristic search like $A^*\!$-search and value based reinforcement learning methods. Both heuristic search and Q-learning approaches to solving these puzzles can be reduced to learning a heuristic/value function to decide what puzzle move to make at each step. We propose learning a value function using the irreducible representations basis (which we will also call the Fourier basis) of the puzzle’s underlying group. Classical Fourier analysis on real valued functions tells us we can approximate smooth functions with low frequency basis functions. Similarly, smooth functions on finite groups can be represented by the analogous low frequency Fourier basis functions. We demonstrate the effectiveness of learning a value function in the Fourier basis for solving various permutation puzzles and show that it outperforms standard deep learning methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/pan21a.html
http://proceedings.mlr.press/v130/pan21a.html Stochastic Bandits with Linear Constraints We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward over the course of multiple rounds is maximum, and each of which has an expected cost below a certain threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits, propose a computationally efficient algorithm for this setting, and prove a regret bound that is better than simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/pacchiano21a.html
http://proceedings.mlr.press/v130/pacchiano21a.html Training a Single Bandit Arm In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to train novice workers of unknown quality in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward across the $K$ arms. This corresponds to the objective of training a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and an instance-independent regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy, featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). Our numerical experiments demonstrate the efficacy of this policy compared to several natural alternatives in practical parameter regimes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ozbay21a.html
http://proceedings.mlr.press/v130/ozbay21a.html A Theoretical Characterization of Semi-supervised Learning with Self-training for Gaussian Mixture Models Self-training is a classical approach in semi-supervised learning which is successfully applied to a variety of machine learning problems. Self-training algorithms generate pseudo-labels for the unlabeled examples and progressively refine these pseudo-labels, which hopefully coincide with the actual labels. This work provides theoretical insights into self-training algorithms with a focus on linear classifiers. First, we provide a sample complexity analysis for Gaussian mixture models with two components. This is established by a sharp non-asymptotic characterization of the self-training iterations which captures the evolution of the model accuracy in terms of a fixed-point iteration. Our analysis reveals the provable benefits of rejecting samples with low confidence and demonstrates how self-training iterations can gracefully improve the model accuracy. Secondly, we study a generalized GMM where the component means follow a distribution. We demonstrate that ridge regularization and class margin (i.e. separation between the component means) are crucial for success, and that lack of regularization may prevent self-training from identifying the core features in the data. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/oymak21a.html
http://proceedings.mlr.press/v130/oymak21a.html Associative Convolutional Layers We provide a general and easy to implement method for reducing the number of parameters of Convolutional Neural Networks (CNNs) during the training and inference phases. We introduce a simple trainable auxiliary neural network which can generate approximate versions of “slices” of the sets of convolutional filters of any CNN architecture from a low dimensional “code” space. These slices are then concatenated to form the sets of filters in the CNN architecture. The auxiliary neural network, which we call “Convolutional Slice Generator” (CSG), is unique to the network and provides the association among its convolutional layers. We apply our method to various CNN architectures including ResNet, DenseNet, MobileNet and ShuffleNet. Experiments on CIFAR-10 and ImageNet-1000, without any hyper-parameter tuning, show that our approach reduces the network parameters by approximately $2\times$ while the reduction in accuracy is confined to within one percent and sometimes the accuracy even improves after compression. Interestingly, through our experiments, we show that even when the CSG takes random binary values for its weights that are not learned, acceptable performance is still achieved. To show that our approach generalizes to other tasks, we apply it to an image segmentation architecture, Deeplab V3, on the Pascal VOC 2012 dataset. Results show that without any parameter tuning, there is $\approx 2.3\times$ parameter reduction and the mean Intersection over Union (mIoU) drops by $\approx 3\%$. Finally, we provide comparisons with several related methods showing the superiority of our method in terms of accuracy. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/omidvar21a.html
http://proceedings.mlr.press/v130/omidvar21a.html Unconstrained MAP Inference, Exponentiated Determinantal Point Processes, and Exponential Inapproximability We study the computational complexity of two hard problems on determinantal point processes (DPPs). One is maximum a posteriori (MAP) inference, i.e., to find a principal submatrix having the maximum determinant. The other is probabilistic inference on exponentiated DPPs (E-DPPs), which can sharpen or weaken the diversity preference of DPPs with an exponent parameter $p$. We prove the following complexity-theoretic hardness results that explain the difficulty in approximating unconstrained MAP inference and the normalizing constant for E-DPPs. (1) Unconstrained MAP inference for an $n \times n$ matrix is NP-hard to approximate within a $2^{\beta n}$-factor, where $\beta = 10^{-10^{13}}$. This result improves upon a $(9/8-\epsilon)$-factor inapproximability given by Kulesza and Taskar (2012). (2) The normalizing constant for E-DPPs of any (fixed) constant exponent $p \geq \beta^{-1} = 10^{10^{13}}$ is NP-hard to approximate within a $2^{\beta pn}$-factor. This gives a(nother) negative answer to open questions posed by Kulesza and Taskar (2012); Ohsaka and Matsuoka (2020). Thu, 18 Mar 2021 00:00:00 +0000
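To make the MAP objective concrete, the following sketch enumerates every principal submatrix of a small kernel matrix and returns the index set with the largest determinant. It is a brute-force illustration only, exponential in n, and is not an algorithm from the paper; the 3 x 3 matrix and the helper names `det` and `brute_force_map` are made up for the example.

```python
from itertools import combinations


def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    n = len(m)
    if n == 0:
        return 1.0  # determinant of the empty matrix, by convention
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def brute_force_map(kernel):
    """Return (index set, determinant) maximizing det of the principal submatrix."""
    n = len(kernel)
    best_set, best_det = (), 1.0  # start from the empty set
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            sub = [[kernel[i][j] for j in subset] for i in subset]
            d = det(sub)
            if d > best_det:
                best_set, best_det = subset, d
    return best_set, best_det


# Hypothetical positive semi-definite kernel used only for illustration.
L = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.5]]
best_set, best_det = brute_force_map(L)
print(best_set, best_det)
```

Here the maximizer is the pair of items 0 and 1 with determinant 3: adding item 2 multiplies the determinant by 0.5 and so lowers it. The search visits all 2^n subsets, which is the cost that approximation algorithms try to avoid and that the hardness results above constrain.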
http://proceedings.mlr.press/v130/ohsaka21a.html
http://proceedings.mlr.press/v130/ohsaka21a.html Novel Change of Measure Inequalities with Applications to PAC-Bayesian Bounds and Monte Carlo Estimation We introduce several novel change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. We show how the variational representation for $f$-divergences leads to novel change of measure inequalities. We also present a multiplicative change of measure inequality for $\alpha$-divergences and a generalized version of the Hammersley-Chapman-Robbins inequality. Finally, we present several applications of our change of measure inequalities, including PAC-Bayesian bounds for various classes of losses and non-asymptotic intervals for Monte Carlo estimates. Thu, 18 Mar 2021 00:00:00 +0000
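For readers unfamiliar with the term, the best-known change of measure inequality is the KL-based one that follows from the Donsker-Varadhan variational formula. It is given here only as background and is not quoted from the abstract; the symbols $P$, $Q$, and $\phi$ are notation assumed for this note. For distributions $Q \ll P$ and any measurable $\phi$ with $\mathbb{E}_{P}[e^{\phi}] < \infty$:

```latex
\mathbb{E}_{h \sim Q}\bigl[\phi(h)\bigr]
  \;\le\;
  \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_{h \sim P}\bigl[e^{\phi(h)}\bigr]
```

An expectation under $Q$ is thus controlled by a divergence term plus a quantity that depends only on $P$, which is the pattern that PAC-Bayesian bounds rely on.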
http://proceedings.mlr.press/v130/ohnishi21a.html
http://proceedings.mlr.press/v130/ohnishi21a.html Spectral Tensor Train Parameterization of Deep Learning Layers We study low-rank parameterizations of weight matrices with embedded spectral properties in the Deep Learning context. The low-rank property leads to parameter efficiency and permits taking computational shortcuts when computing mappings. Spectral properties are often subject to constraints in optimization problems, leading to better models and stability of optimization. We start by looking at the compact SVD parameterization of weight matrices and identifying redundancy sources in the parameterization. We further apply the Tensor Train (TT) decomposition to the compact SVD components, and propose a non-redundant differentiable parameterization of fixed TT-rank tensor manifolds, termed the Spectral Tensor Train Parameterization (STTP). We demonstrate the effects of neural network compression in the image classification setting, and both compression and improved training stability in the generative adversarial training setting. Project website: www.obukhov.ai/sttp Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/obukhov21a.html
http://proceedings.mlr.press/v130/obukhov21a.html Group testing for connected communities In this paper, we propose algorithms that leverage a known community structure to make group testing more efficient. We consider a population organized in disjoint communities: each individual participates in a community, and its infection probability depends on the community (s)he participates in. Use cases include families, students who participate in several classes, and workers who share common spaces. Group testing reduces the number of tests needed to identify the infected individuals by pooling diagnostic samples and testing them together. We show that if we design the testing strategy taking into account the community structure, we can significantly reduce the number of tests needed for adaptive and non-adaptive group testing, and can improve the reliability in cases where tests are noisy. Thu, 18 Mar 2021 00:00:00 +0000
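A small calculation shows why community structure can matter for pooling. The sketch below uses classical two-stage (Dorfman-style) pooling, a swapped-in textbook scheme and not the algorithms proposed in the paper: each pool is tested once, and every member of a positive pool is then retested individually. Tests are assumed noiseless, and the population, infection pattern, and function name `two_stage_tests` are hypothetical.

```python
def two_stage_tests(pools, infected):
    """Count tests used by two-stage pooling and return the infected found.

    pools:    list of lists of individual indices (one pooled test each)
    infected: set of indices that are truly infected (noiseless tests assumed)
    """
    tests = 0
    identified = set()
    for pool in pools:
        tests += 1  # one pooled test
        if any(i in infected for i in pool):
            tests += len(pool)  # retest each member of a positive pool
            identified.update(i for i in pool if i in infected)
    return tests, identified


# 12 individuals in 3 communities of 4; infections concentrated in community 0.
infected = {0, 1, 2}
community_pools = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
mixed_pools = [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]

aligned = two_stage_tests(community_pools, infected)
mixed = two_stage_tests(mixed_pools, infected)
print("community-aligned pools:", aligned[0], "tests")
print("community-agnostic pools:", mixed[0], "tests")
```

In this example, pooling along community lines needs 7 tests, while pools that mix communities all come back positive and need 15, more than the 12 tests of individual testing.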
http://proceedings.mlr.press/v130/nikolopoulos21a.html
http://proceedings.mlr.press/v130/nikolopoulos21a.html Hogwild! over Distributed Local Data Sets with Linearly Increasing Mini-Batch Sizes Hogwild! implements asynchronous Stochastic Gradient Descent (SGD) where multiple threads in parallel access a common repository containing training data, perform SGD iterations and update shared state that represents a jointly learned (global) model. We consider big data analysis where training data is distributed among local data sets in a heterogeneous way – and we wish to move SGD computations to local compute nodes where local data resides. The results of these local SGD computations are aggregated by a central “aggregator” which mimics Hogwild!. We show how local compute nodes can start choosing small mini-batch sizes which increase to larger ones in order to reduce communication cost (round interaction with the aggregator). We improve state-of-the-art literature and show O(K^{0.5}) communication rounds for heterogeneous data for strongly convex problems, where K is the total number of gradient computations across all local compute nodes. For our scheme, we prove a tight and novel non-trivial convergence analysis for strongly convex problems for heterogeneous data which does not use the bounded gradient assumption as seen in many existing publications. The tightness is a consequence of our proofs for lower and upper bounds of the convergence rate, which show a constant factor difference. We show experimental results for plain convex and non-convex problems for biased (i.e., heterogeneous) and unbiased local data sets. Thu, 18 Mar 2021 00:00:00 +0000
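The link between growing mini-batches and the square-root round count can be checked with simple arithmetic. The sketch below assumes a single node and a hypothetical schedule in which round i computes i gradients; the abstract does not specify the schedule, so these choices are illustrative only. After T rounds the node has computed T(T+1)/2 gradients, so reaching K gradients takes roughly sqrt(2K) rounds instead of K rounds at batch size 1.

```python
def communication_rounds(total_gradients: int, batch_size_for_round) -> int:
    """Number of rounds needed to compute `total_gradients` gradients.

    batch_size_for_round: function mapping the 1-based round index to the
    number of gradient computations done before the next communication.
    """
    rounds = 0
    used = 0
    while used < total_gradients:
        rounds += 1
        used += batch_size_for_round(rounds)
    return rounds


K = 5050  # total gradient computations (hypothetical budget)
constant_rounds = communication_rounds(K, lambda i: 1)  # batch size fixed at 1
linear_rounds = communication_rounds(K, lambda i: i)    # batch size grows linearly

print("constant batch size:", constant_rounds, "rounds")
print("linearly increasing:", linear_rounds, "rounds")
```

With K = 5050 the linear schedule finishes in 100 rounds, since 1 + 2 + ... + 100 = 5050, compared with 5050 rounds when every gradient is communicated.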
http://proceedings.mlr.press/v130/nguyen21a.html
http://proceedings.mlr.press/v130/nguyen21a.html Parametric Programming Approach for More Powerful and General Lasso Selective Inference Selective Inference (SI) has been actively studied in the past few years for conducting inference on the features of linear models that are adaptively selected by feature selection methods such as Lasso. The basic idea of SI is to make inference conditional on the selection event. Unfortunately, the main limitation of the original SI approach for Lasso is that the inference is conducted not only conditional on the selected features but also on their signs—this leads to loss of power because of over-conditioning. Although this limitation can be circumvented by considering the union of such selection events for all possible combinations of signs, this is only feasible when the number of selected features is sufficiently small. To address this computational bottleneck, we propose a parametric programming-based method that can conduct SI without conditioning on signs even when we have thousands of active features. The main idea is to compute the continuum path of Lasso solutions in the direction of a test statistic, and identify the subset of the data space corresponding to the feature selection event by following the solution path. The proposed parametric programming-based method not only avoids the aforementioned computational bottleneck but also improves the performance and practicality of SI for Lasso in various respects. We conduct several experiments to demonstrate the effectiveness and efficiency of our proposed method. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/nguyen-le-duy21a.html
http://proceedings.mlr.press/v130/nguyen-le-duy21a.html Predictive Complexity Priors Specifying a Bayesian prior is notoriously difficult for complex models such as neural networks. Reasoning about parameters is made challenging by the high-dimensionality and over-parameterization of the space. Priors that seem benign and uninformative can have unintuitive and detrimental effects on a model’s predictions. For this reason, we propose predictive complexity priors: a functional prior that is defined by comparing the model’s predictions to those of a reference model. Although originally defined on the model outputs, we transfer the prior to the model parameters via a change of variables. The traditional Bayesian workflow can then proceed as usual. We apply our predictive complexity prior to high-dimensional regression, reasoning over neural network depth, and sharing of statistical strength for few-shot learning. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/nalisnick21a.html
http://proceedings.mlr.press/v130/nalisnick21a.html Budgeted and Non-Budgeted Causal Bandits Learning good interventions in a causal graph can be modelled as a stochastic multi-armed bandit problem with side-information. First, we study this problem when interventions are more expensive than observations and a budget is specified. If there are no backdoor paths from the intervenable nodes to the reward node then we propose an algorithm to minimize simple regret that optimally trades-off observations and interventions based on the cost of intervention. We also propose an algorithm that accounts for the cost of interventions, utilizes causal side-information, and minimizes the expected cumulative regret without exceeding the budget. Our algorithm performs better than standard algorithms that do not take side-information into account. Finally, we study the problem of learning best interventions without budget constraint in general graphs and give an algorithm that achieves constant expected cumulative regret in terms of the instance parameters when the parent distribution of the reward variable for each intervention is known. Our results are experimentally validated and compared to the best-known bounds in the current literature. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/nair21a.html
http://proceedings.mlr.press/v130/nair21a.html Gradient Descent in RKHS with Importance Labeling Labeling is often expensive and is a fundamental limitation of supervised learning. In this paper, we study the importance labeling problem, in which we are given many unlabeled data points, select a limited number of them to be labeled, and then run a learning algorithm on the selected points. We propose a new importance labeling scheme that can effectively select an informative subset of unlabeled data in least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). We analyze the generalization error of gradient descent combined with our labeling scheme and show that the proposed algorithm achieves the optimal rate of convergence in much wider settings and, in particular, generalizes much better in a small noise setting than the usual uniform sampling scheme. Numerical experiments verify our theoretical findings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/murata21a.html
http://proceedings.mlr.press/v130/murata21a.html Private optimization without constraint violations We study the problem of differentially private optimization with linear constraints when the right-hand-side of the constraints depends on private data. This type of problem appears in many applications, especially resource allocation. Previous research provided solutions that retained privacy but sometimes violated the constraints. In many settings, however, the constraints cannot be violated under any circumstances. To address this hard requirement, we present an algorithm that releases a nearly-optimal solution satisfying the constraints with probability 1. We also prove a lower bound demonstrating that the difference between the objective value of our algorithm’s solution and the optimal solution is tight up to logarithmic factors among all differentially private algorithms. We conclude with experiments demonstrating that our algorithm can achieve nearly optimal performance while preserving privacy. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/munoz21a.html
http://proceedings.mlr.press/v130/munoz21a.html Stochastic Gradient Descent Meets Distribution Regression Stochastic gradient descent (SGD) provides a simple and efficient way to solve a broad range of machine learning problems. Here, we focus on distribution regression (DR), involving two stages of sampling: Firstly, we regress from probability measures to real-valued responses. Secondly, we sample bags from these distributions for utilizing them to solve the overall regression problem. Recently, DR has been tackled by applying kernel ridge regression and the learning properties of this approach are well understood. However, nothing is known about the learning properties of SGD for two stage sampling problems. We fill this gap and provide theoretical guarantees for the performance of SGD for DR. Our bounds are optimal in a mini-max sense under standard assumptions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/muecke21a.html
http://proceedings.mlr.press/v130/muecke21a.html On the Convergence of Gradient Descent in GANs: MMD GAN As a Gradient Flow We consider the maximum mean discrepancy (MMD) GAN problem and propose a parametric kernelized gradient flow that mimics the min-max game in gradient regularized MMD GAN. We show that this flow provides a descent direction minimizing the MMD on a statistical manifold of probability distributions. We then derive an explicit condition which ensures that gradient descent on the parameter space of the generator in gradient regularized MMD GAN is globally convergent to the target distribution. Under this condition, we give non-asymptotic convergence results for MMD GAN. Another contribution of this paper is the introduction of a dynamic formulation of a regularization of MMD and a demonstration that the parametric kernelized descent for MMD is the gradient flow of this functional with respect to the new Riemannian structure. Our theoretical result allows one to treat gradient flows for quite general functionals and thus has potential applications to other types of variational inference on a statistical manifold beyond GANs. Finally, numerical experiments suggest that our parametric kernelized gradient flow stabilizes GAN training and guarantees convergence. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mroueh21a.html
http://proceedings.mlr.press/v130/mroueh21a.html Hierarchical Clustering in General Metric Spaces using Approximate Nearest Neighbors Hierarchical clustering is a widely used data analysis method, but suffers from scalability issues, requiring quadratic time in general metric spaces. In this work, we demonstrate how approximate nearest neighbor (ANN) queries can be used to improve the running time of the popular single-linkage and average-linkage methods. Our proposed algorithms are the first subquadratic time algorithms for non-Euclidean metrics. We complement our theoretical analysis with an empirical evaluation showcasing our methods’ efficiency and accuracy. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/moseley21a.html
http://proceedings.mlr.press/v130/moseley21a.html Automatic Differentiation Variational Inference with Mixtures Automatic Differentiation Variational Inference (ADVI) is a useful tool for efficiently learning probabilistic models in machine learning. Generally approximate posteriors learned by ADVI are forced to be unimodal in order to facilitate use of the reparameterization trick. In this paper, we show how stratified sampling may be used to enable mixture distributions as the approximate posterior, and derive a new lower bound on the evidence analogous to the importance weighted autoencoder (IWAE). We show that this "SIWAE" is a tighter bound than both IWAE and the traditional ELBO, both of which are special instances of this bound. We verify empirically that the traditional ELBO objective disfavors the presence of multimodal posterior distributions and may therefore not be able to fully capture structure in the latent space. Our experiments show that using the SIWAE objective allows the encoder to learn more complex distributions which regularly contain multimodality, resulting in higher accuracy and better calibration in the presence of incomplete, limited, or corrupted data. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/morningstar21b.html
http://proceedings.mlr.press/v130/morningstar21b.html Density of States Estimation for Out of Distribution Detection Perhaps surprisingly, recent studies have shown probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of “density of states,” the DoSE decision rule avoids direct comparison of model probabilities, and instead utilizes the “probability of the model probability,” or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE’s state-of-the-art performance against other unsupervised OOD detectors on previously established “hard” benchmarks. Thu, 18 Mar 2021 00:00:00 +0000
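The decision rule described in the DoSE abstract above, scoring a test point by how typical its model statistic is rather than by the statistic itself, can be sketched as follows. This is a minimal sketch under stated assumptions: a single scalar statistic (for example a model log-likelihood), a hand-rolled 1-D Gaussian KDE, and hypothetical `bandwidth` and `quantile` values. It is not the authors' implementation, which the abstract says can use several statistics and other density estimators.

```python
import math


def gaussian_kde_log_density(train_stats, query, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate at `query`."""
    norm = len(train_stats) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((query - s) / bandwidth) ** 2) for s in train_stats)
    # A query far from every training statistic underflows to zero density.
    return math.log(total / norm) if total > 0.0 else float("-inf")


def is_anomalous(train_stats, query, bandwidth=0.5, quantile=0.05):
    """Flag `query` if its typicality falls below a low quantile of training typicalities.

    `bandwidth` and `quantile` are illustrative choices, not values from the paper.
    """
    scores = sorted(gaussian_kde_log_density(train_stats, s, bandwidth) for s in train_stats)
    threshold = scores[int(quantile * len(scores))]
    return gaussian_kde_log_density(train_stats, query, bandwidth) < threshold
```

Note that a test point whose log-likelihood is much higher than anything seen in training is flagged just like one whose log-likelihood is much lower, because both are atypical under the density of the statistic.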
http://proceedings.mlr.press/v130/morningstar21a.html
http://proceedings.mlr.press/v130/morningstar21a.html Independent Innovation Analysis for Nonlinear Vector Autoregressive Process The nonlinear vector autoregressive (NVAR) model provides an appealing framework to analyze multivariate time series obtained from a nonlinear dynamical system. However, the innovation (or error), which plays a key role by driving the dynamics, is almost always assumed to be additive. Additivity greatly limits the generality of the model, hindering analysis of general NVAR processes which have nonlinear interactions between the innovations. Here, we propose a new general framework called independent innovation analysis (IIA), which estimates the innovations from completely general NVAR. We assume mutual independence of the innovations as well as their modulation by an auxiliary variable (which is often taken as the time index and simply interpreted as nonstationarity). We show that IIA guarantees the identifiability of the innovations with arbitrary nonlinearities, up to a permutation and component-wise invertible nonlinearities. We also propose three estimation frameworks depending on the type of the auxiliary variable. We thus provide the first rigorous identifiability result for general NVAR, as well as very general tools for learning such models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/morioka21a.html
http://proceedings.mlr.press/v130/morioka21a.html Approximate Message Passing with Spectral Initialization for Generalized Linear Models We consider the problem of estimating a signal from measurements obtained via a generalized linear model. We focus on estimators based on approximate message passing (AMP), a family of iterative algorithms with many appealing features: the performance of AMP in the high-dimensional limit can be succinctly characterized under suitable model assumptions; AMP can also be tailored to the empirical distribution of the signal entries, and for a wide class of estimation problems, AMP is conjectured to be optimal among all polynomial-time algorithms. However, a major issue of AMP is that in many models (such as phase retrieval), it requires an initialization correlated with the ground-truth signal and independent from the measurement matrix. Assuming that such an initialization is available is typically not realistic. In this paper, we solve this problem by proposing an AMP algorithm initialized with a spectral estimator. With such an initialization, the standard AMP analysis fails since the spectral estimator depends in a complicated way on the design matrix. Our main contribution is a rigorous characterization of the performance of AMP with spectral initialization in the high-dimensional limit. The key technical idea is to define and analyze a two-phase artificial AMP algorithm that first produces the spectral estimator, and then closely approximates the iterates of the true AMP. We also provide numerical results that demonstrate the validity of the proposed approach. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mondelli21a.html
http://proceedings.mlr.press/v130/mondelli21a.html DAG-Structured Clustering by Nearest Neighbors Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structures are not only more flexible than trees, but also allow for points to be members of multiple clusters. We describe a scalable algorithm, Llama, which simply merges nearest neighbor substructures to form a DAG structure. Llama discovers structures that are more accurate than state-of-the-art tree-based techniques while remaining scalable to large-scale clustering benchmarks. Additionally, we support the proposed algorithm with theoretical guarantees on separated data, including types of data that cannot be correctly clustered by tree-based algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/monath21a.html
http://proceedings.mlr.press/v130/monath21a.html Iterative regularization for convex regularizers We study iterative regularization for linear models, when the bias is convex but not necessarily strongly convex. We characterize the stability properties of a primal-dual gradient based approach, analyzing its convergence in the presence of worst case deterministic noise. As a main example, we specialize and illustrate the results for the problem of robust sparse recovery. Key to our analysis is a combination of ideas from regularization theory and optimization in the presence of errors. Theoretical results are complemented by experiments showing that state-of-the-art performances are achieved with considerable computational speed-ups. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/molinari21a.html
http://proceedings.mlr.press/v130/molinari21a.html Non-Volume Preserving Hamiltonian Monte Carlo and No-U-Turn Samplers Volume preservation is usually regarded as a necessary property for the leapfrog transition functions that are used in Hamiltonian Monte Carlo (HMC) and No-U-Turn (NUTS) samplers to guarantee convergence to the target distribution. In this work we rigorously prove that with minimal algorithmic modifications, both HMC and NUTS can be combined with transition functions that are not necessarily volume preserving. In light of these results, we propose a non-volume preserving transition function that conserves the Hamiltonian better than the baseline leapfrog mechanism, on piecewise-continuous distributions. The resulting samplers do not require any assumptions on the geometry of the discontinuity boundaries, and our experimental results show a significant improvement upon traditional HMC and NUTS. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mohasel-afshar21a.html
http://proceedings.mlr.press/v130/mohasel-afshar21a.html Hidden Cost of Randomized Smoothing The fragility of modern machine learning models has drawn a considerable amount of attention from both academia and the public. While much effort has gone into either crafting adversarial attacks as a way to measure the robustness of neural networks or devising worst-case analytical robustness verification with guarantees, few methods enjoy both scalability and robustness guarantees at the same time. As an alternative to these attempts, randomized smoothing adopts a different prediction rule that enables statistical robustness arguments which easily scale to large networks. However, in this paper, we point out the side effects of current randomized smoothing workflows. Specifically, we articulate and prove two major points: 1) the decision boundaries of smoothed classifiers will shrink, resulting in disparity in class-wise accuracy; 2) applying noise augmentation in the training process does not necessarily resolve the shrinking issue due to the inconsistent learning objectives. Thu, 18 Mar 2021 00:00:00 +0000
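The shrinking effect named in the randomized smoothing abstract above can be seen in a one-dimensional toy case. This is a minimal sketch, assuming a hypothetical base classifier that predicts class 1 on the interval (-1, 1) and class 0 elsewhere, with Gaussian smoothing noise of standard deviation `sigma`. The toy setup is an illustration only and does not reproduce the paper's analysis.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def base_classifier(x: float) -> int:
    """Toy 1-D base classifier: class 1 on the interval (-1, 1), class 0 elsewhere."""
    return 1 if -1.0 < x < 1.0 else 0


def smoothed_classifier(x: float, sigma: float) -> int:
    """Majority class of base_classifier(x + noise) with noise ~ N(0, sigma^2).

    For this toy classifier the class-1 probability has a closed form,
    so no Monte Carlo sampling is needed.
    """
    prob_class_1 = normal_cdf((1.0 - x) / sigma) - normal_cdf((-1.0 - x) / sigma)
    return 1 if prob_class_1 > 0.5 else 0
```

With `sigma = 1.0` the point x = 0.97 moves from class 1 to class 0, so the class-1 region is narrower than (-1, 1). With `sigma = 2.0` even x = 0 is assigned class 0, so the bounded class disappears entirely in this toy example.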
http://proceedings.mlr.press/v130/mohapatra21a.html
http://proceedings.mlr.press/v130/mohapatra21a.html Diagnostic Uncertainty Calibration: Towards Reliable Machine Predictions in Medical Domain We propose an evaluation framework for class probability estimates (CPEs) in the presence of label uncertainty, which is commonly observed as diagnosis disagreement between experts in the medical domain. We also formalize evaluation metrics for higher-order statistics, including inter-rater disagreement, to assess predictions on label uncertainty. Moreover, we propose a novel post-hoc method called alpha-calibration, that equips neural network classifiers with calibrated distributions over CPEs. Using synthetic experiments and a large-scale medical imaging application, we show that our approach significantly enhances the reliability of uncertainty estimates: disagreement probabilities and posterior CPEs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mimori21a.html
http://proceedings.mlr.press/v130/mimori21a.html Tensor Networks for Probabilistic Sequence Modeling Tensor networks are a powerful modeling framework developed for computational many-body physics, which have only recently been applied within machine learning. In this work we utilize a uniform matrix product state (u-MPS) model for probabilistic modeling of sequence data. We first show that u-MPS enable sequence-level parallelism, with length-n sequences able to be evaluated in depth O(log n). We then introduce a novel generative algorithm giving trained u-MPS the ability to efficiently sample from a wide variety of conditional distributions, each one defined by a regular expression. Special cases of this algorithm correspond to autoregressive and fill-in-the-blank sampling, but more complex regular expressions permit the generation of richly structured data in a manner that has no direct analogue in neural generative models. Experiments on sequence modeling with synthetic and real text data show u-MPS outperforming a variety of baselines and effectively generalizing their predictions in the presence of limited data. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/miller21a.html
http://proceedings.mlr.press/v130/miller21a.html Continual Learning using a Bayesian Nonparametric Dictionary of Weight Factors Naively trained neural networks tend to experience catastrophic forgetting in sequential task settings, where data from previous tasks are unavailable. A number of methods, using various model expansion strategies, have been proposed recently as possible solutions. However, determining how much to expand the model is left to the practitioner, and often a constant schedule is chosen for simplicity, regardless of how complex the incoming task is. Instead, we propose a principled Bayesian nonparametric approach based on the Indian Buffet Process (IBP) prior, letting the data determine how much to expand the model complexity. We pair this with a factorization of the neural network’s weight matrices. Such an approach allows us to scale the number of factors of each weight matrix to the complexity of the task, while the IBP prior encourages sparse weight factor selection and factor reuse, promoting positive knowledge transfer between tasks. We demonstrate the effectiveness of our method on a number of continual learning benchmarks and analyze how weight factors are allocated and reused throughout the training. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mehta21a.html
http://proceedings.mlr.press/v130/mehta21a.html Differentiating the Value Function by using Convex Duality We consider the differentiation of the value function for parametric optimization problems. Such problems are ubiquitous in machine learning applications such as structured support vector machines, matrix factorization and min-min or minimax problems in general. Existing approaches for computing the derivative rely on strong assumptions of the parametric function. Therefore, in several scenarios there is no theoretical evidence that a given algorithmic differentiation strategy computes the true gradient information of the value function. We leverage a well known result from convex duality theory to relax the conditions and to derive convergence rates of the derivative approximation for several classes of parametric optimization problems in Machine Learning. We demonstrate the versatility of our approach in several experiments, including non-smooth parametric functions. Even in settings where other approaches are applicable, our duality based strategy shows a favorable performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mehmood21a.html
http://proceedings.mlr.press/v130/mehmood21a.html Location Trace Privacy Under Conditional Priors Providing meaningful privacy to users of location based services is particularly challenging when multiple locations are revealed in a short period of time. This is primarily due to the tremendous degree of dependence that can be anticipated between points. We propose a Rényi divergence based privacy framework for bounding expected privacy loss for conditionally dependent data. Additionally, we demonstrate an algorithm for achieving this privacy under Gaussian process conditional priors. This framework both exemplifies why conditionally dependent data is so challenging to protect and offers a strategy for preserving privacy to within a fixed radius for sensitive locations in a user’s trace. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/meehan21a.html
http://proceedings.mlr.press/v130/meehan21a.html Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees We develop a novel method that provides theoretical guarantees for learning from weak labelers without the (mostly unrealistic) assumption that the errors of the weak labelers are independent or come from a particular family of distributions. We show a rigorous technique for efficiently selecting small subsets of the labelers so that a majority vote from such subsets has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mazzetto21a.html
http://proceedings.mlr.press/v130/mazzetto21a.html Collaborative Classification from Noisy Labels We consider a setting where users interact with a collection of N items on an online platform. We are given class labels possibly corrupted by noise, and we seek to recover the true class of each item. We postulate a simple probabilistic model of the interactions between users and items, based on the assumption that users interact with classes in different proportions. We then develop a message-passing algorithm that decodes the noisy class labels efficiently. Under suitable assumptions, our method provably recovers all items’ true classes in the large N limit, even when the interaction graph remains sparse. Empirically, we show that our approach is effective on several practical applications, including predicting the location of businesses, the category of consumer goods, and the language of audio content. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/maystre21a.html
http://proceedings.mlr.press/v130/maystre21a.html Wyner-Ziv Estimators: Efficient Distributed Mean Estimation with Side-Information Communication efficient distributed mean estimation is an important primitive that arises in many distributed learning and optimization scenarios such as federated learning. Without any probabilistic assumptions on the underlying data, we study the problem of distributed mean estimation where the server has access to side information. We propose Wyner-Ziv estimators, which are efficient and near-optimal when an upper bound for the distance between the side information and the data is known. In a different direction, when there is no knowledge assumed about the distance between side information and the data, we present an alternative Wyner-Ziv estimator that uses correlated sampling. This latter setting offers universal recovery guarantees, and perhaps will be of interest in practice when the number of users is large, where keeping track of the distances between the data and the side information may not be possible. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mayekar21a.html
http://proceedings.mlr.press/v130/mayekar21a.html Tracking Regret Bounds for Online Submodular Optimization In this paper, we propose algorithms for online submodular optimization with tracking regret bounds. Online submodular optimization is a generic framework for sequential decision making used to select subsets. Existing algorithms for online submodular optimization have been shown to achieve small (static) regret, which means that the algorithm’s performance is comparable to the performance of a fixed optimal action. Such algorithms, however, may perform poorly in an environment that changes over time. To overcome this problem, we apply a tracking-regret-analysis framework to online submodular optimization, one by which output is assessed through comparison with time-varying optimal subsets. We propose algorithms for submodular minimization, monotone submodular maximization under a size constraint, and unconstrained submodular maximization, and we show tracking regret bounds. In addition, we show that our tracking regret bound for submodular minimization is nearly tight. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/matsuoka21a.html
http://proceedings.mlr.press/v130/matsuoka21a.html Misspecification in Prediction Problems and Robustness via Improper Learning We study probabilistic prediction games when the underlying model is misspecified, investigating the consequences of predicting using an incorrect parametric model. We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a “proper” predictor—one from the putative model class—relative to the best predictor in the same model class has lower bound scaling at least as $\sqrt{\gamma n}$, where $\gamma$ is a measure of the model misspecification to the true distribution in terms of total variation distance. In contrast, using an aggregation-based (improper) learner, one can obtain regret $d \log n$ for any underlying generating distribution, where $d$ is the dimension of the parameter; we exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family. These results suggest that simple strategies for aggregating multiple learners together should be more robust, and several experiments conform to this hypothesis. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/marsden21a.html
http://proceedings.mlr.press/v130/marsden21a.html Transforming Gaussian Processes With Normalizing Flows Gaussian Processes (GP) can be used as flexible, non-parametric function priors. Inspired by the growing body of work on Normalizing Flows, we enlarge this class of priors through a parametric invertible transformation that can be made input-dependent. Doing so also allows us to encode interpretable prior knowledge (e.g., boundedness constraints). We derive a variational approximation to the resulting Bayesian inference problem, which is as fast as stochastic variational GP regression (Hensman et al., 2013; Dezfouli and Bonilla, 2015). This makes the model a computationally efficient alternative to other hierarchical extensions of GP priors (Lázaro-Gredilla, 2012; Damianou and Lawrence, 2013). The resulting algorithm’s computational and inferential performance is excellent, and we demonstrate this on a range of data sets. For example, even with only 5 inducing points and an input-dependent flow, our method is consistently competitive with a standard sparse GP fitted using 100 inducing points. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/maronas21a.html
http://proceedings.mlr.press/v130/maronas21a.html High-Dimensional Multi-Task Averaging and Application to Kernel Mean Embedding We propose an improved estimator for the multi-task averaging problem, whose goal is the joint estimation of the means of multiple distributions using separate, independent data sets. The naive approach is to take the empirical mean of each data set individually, whereas the proposed method exploits similarities between tasks, without any related information being known in advance. First, for each data set, similar or neighboring means are determined from the data by multiple testing. Then each naive estimator is shrunk towards the local average of its neighbors. We prove theoretically that this approach provides a reduction in mean squared error. This improvement can be significant when the dimension of the input space is large; demonstrating a “blessing of dimensionality” phenomenon. An application of this approach is the estimation of multiple kernel mean embeddings, which plays an important role in many modern applications. The theoretical results are verified on artificial and real world data. Thu, 18 Mar 2021 00:00:00 +0000
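The two steps in the multi-task averaging abstract above (find neighboring means, then shrink each naive estimator toward the local average) can be sketched as follows. This is a minimal sketch under loud assumptions: a fixed distance threshold `radius` stands in for the paper's multiple-testing step, and `weight` is a hypothetical fixed shrinkage weight. Neither is taken from the paper.

```python
import math


def shrink_toward_neighbors(naive_means, radius, weight=0.5):
    """Shrink each naive mean toward the average of the means within `radius`.

    The fixed distance threshold replaces the paper's multiple-testing step,
    and `weight` is an illustrative shrinkage weight.
    """
    shrunk = []
    for m in naive_means:
        # Neighbours always include the task's own mean (distance zero).
        neighbors = [o for o in naive_means if math.dist(m, o) <= radius]
        local_avg = [sum(coord) / len(neighbors) for coord in zip(*neighbors)]
        shrunk.append([(1.0 - weight) * a + weight * b for a, b in zip(m, local_avg)])
    return shrunk
```

For the naive means (0, 0), (1, 0) and (10, 10) with `radius = 2`, the first two are pulled toward each other, giving (0.25, 0) and (0.75, 0), while the isolated third mean is left unchanged.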
http://proceedings.mlr.press/v130/marienwald21a.html
http://proceedings.mlr.press/v130/marienwald21a.html An Analysis of LIME for Text Data Text data are increasingly handled in an automated fashion by machine learning algorithms. But the models handling these data are not always well-understood due to their complexity and are more and more often referred to as “black-boxes.” Interpretability methods aim to explain how these models operate. Among them, LIME has become one of the most popular in recent years. However, it comes without theoretical guarantees: even for simple models, we are not sure that LIME behaves accurately. In this paper, we provide a first theoretical analysis of LIME for text data. As a consequence of our theoretical findings, we show that LIME indeed provides meaningful explanations for simple models, namely decision trees and linear models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mardaoui21a.html
http://proceedings.mlr.press/v130/mardaoui21a.html A Theory of Multiple-Source Adaptation with Limited Target Labeled Data We study multiple-source domain adaptation, when the learner has access to abundant labeled data from multiple-source domains and limited labeled data from the target domain. We analyze existing algorithms for this problem, and propose a novel algorithm based on model selection. Our algorithms are efficient, and experiments on real data-sets empirically demonstrate their benefits. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/mansour21a.html
http://proceedings.mlr.press/v130/mansour21a.html Fast Adaptation with Linearized Neural Networks The inductive biases of trained neural networks are difficult to understand and, consequently, to adapt to new settings. We study the inductive biases of linearizations of neural networks, which we show to be surprisingly good summaries of the full network functions. Inspired by this finding, we propose a technique for embedding these inductive biases into Gaussian processes through a kernel designed from the Jacobian of the network. In this setting, domain adaptation takes the form of interpretable posterior inference, with accompanying uncertainty estimation. This inference is analytic and free of local optima issues found in standard techniques such as fine-tuning neural network weights to a new task. We develop significant computational speed-ups based on matrix multiplies, including a novel implementation for scalable Fisher vector products. Our experiments on both image classification and regression demonstrate the promise and convenience of this framework for transfer learning, compared to neural network fine-tuning. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/maddox21a.html
http://proceedings.mlr.press/v130/maddox21a.html Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Due to the combinatorial number of possible hierarchical clusterings, approximate algorithms are typically used for inference. In contrast to existing methods, we present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of N elements, which is super-exponentially more efficient than explicitly considering each of the (2N − 3)!! possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that outperforms greedy and beam search baselines. Thu, 18 Mar 2021 00:00:00 +0000
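The scaling claim in the Cluster Trellis abstract can be made concrete with a little arithmetic. The sketch below only compares the two counts the abstract names, the (2N − 3)!! hierarchies and the 2^N subsets in the powerset. It does not implement the trellis itself, and the function names are illustrative, not taken from the paper.

```python
def double_factorial(n: int) -> int:
    """Return n!! = n * (n - 2) * (n - 4) * ..., with n!! = 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def num_hierarchies(n_elements: int) -> int:
    """Count of hierarchies quoted in the abstract: (2N - 3)!!."""
    return double_factorial(2 * n_elements - 3)


def num_subsets(n_elements: int) -> int:
    """Size of the powerset of N elements: 2^N."""
    return 2 ** n_elements


if __name__ == "__main__":
    # Print N, the number of hierarchies, and the powerset size side by side.
    for n in (4, 10, 20):
        print(n, num_hierarchies(n), num_subsets(n))
```

Even at N = 10 the two counts differ by more than four orders of magnitude, which is the gap the abstract describes as super-exponential.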
http://proceedings.mlr.press/v130/macaluso21a.html
http://proceedings.mlr.press/v130/macaluso21a.html Causal Inference under Networked Interference and Intervention Policy Enhancement Estimating individual treatment effects from data of randomized experiments is a critical task in causal inference. The Stable Unit Treatment Value Assumption (SUTVA) is usually made in causal inference. However, interference can introduce bias when the assigned treatment on one unit affects the potential outcomes of the neighboring units. This interference phenomenon is known as spillover effect in economics or peer effect in social science. Usually, in randomized experiments or observational studies with interconnected units, one can only observe treatment responses under interference. Hence, the issue of how to estimate the superimposed causal effect and recover the individual treatment effect in the presence of interference becomes a challenging task in causal inference. In this work, we study causal effect estimation under general network interference using Graph Neural Networks, which are powerful tools for capturing node and link dependencies in graphs. After deriving causal effect estimators, we further study intervention policy improvement on the graph under capacity constraint. We give policy regret bounds under network interference and treatment capacity constraint. Furthermore, a heuristic graph structure-dependent error bound for Graph Neural Network-based causal estimators is provided. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ma21c.html
http://proceedings.mlr.press/v130/ma21c.html Reaping the Benefits of Bundling under High Production Costs It is well-known that selling different goods in a single bundle can significantly increase revenue. However, bundling is no longer profitable if the goods have high production costs. To overcome this challenge, we introduce a new mechanism, Pure Bundling with Disposal for Cost (PBDC), where after buying the bundle, the customer is allowed to return any subset of goods for their costs. We provide two types of guarantees on the profit of PBDC mechanisms relative to the optimum in the presence of production costs, under the assumption that customers have valuations which are additive over the items and drawn independently. We first provide a distribution-dependent guarantee which shows that PBDC earns at least 1-6c^{2/3} of the optimal profit, where c denotes the coefficient of variation of the welfare random variable. c approaches 0 if there are a large number of items whose individual valuations have bounded coefficients of variation, and our constants improve upon those from the classical result of Bakos and Brynjolfsson (1999) without costs. We then provide a distribution-free guarantee which shows that either PBDC or individual sales earns at least 1/5.2 times the optimal profit, generalizing and improving the constant of 1/6 from the celebrated result of Babaioff et al. (2014). Conversely, we also provide the best-known upper bound on the performance of any partitioning mechanism (which captures both individual sales and pure bundling), of 1/1.19 times the optimal profit, improving on the previously-known upper bound of 1/1.08. Finally, we conduct simulations under the same playing field as the extensive numerical study of Chu et al. (2011), which confirm that PBDC outperforms other simple pricing schemes overall. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ma21b.html
http://proceedings.mlr.press/v130/ma21b.html Learning-to-Rank with Partitioned Preference: Fast Estimation for the Plackett-Luce Model We consider the problem of listwise learning-to-rank (LTR) on data with \textit{partitioned preference}, where a set of items are sliced into ordered and disjoint partitions, but the ranking of items within a partition is unknown. The Plackett-Luce (PL) model has been widely used in listwise LTR methods. However, given $N$ items with $M$ partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of $O(N+S!)$, where $S$ is the maximum size of the top $M-1$ partitions. This computational challenge restrains existing PL-based listwise LTR methods to only a special case of partitioned preference, \textit{top-$K$ ranking}, where the exact order of the top $K$ items is known. In this paper, we exploit a random utility model formulation of the PL model and propose an efficient approach through numerical integration for calculating the likelihood. This numerical approach reduces the aforementioned time complexity to $O(N+MS)$, which allows training deep-neural-network-based ranking models with a large output space. We demonstrate that the proposed method outperforms well-known LTR baselines and remains scalable through both simulation experiments and applications to real-world eXtreme Multi-Label (XML) classification tasks. The proposed method also achieves state-of-the-art performance on XML datasets with relatively large numbers of labels per sample. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ma21a.html
http://proceedings.mlr.press/v130/ma21a.html On the Effect of Auxiliary Tasks on Representation Dynamics While auxiliary tasks play a key role in shaping the representations learnt by reinforcement learning agents, much is still unknown about the mechanisms through which this is achieved. This work develops our understanding of the relationship between auxiliary tasks, environment structure, and representations by analysing the dynamics of temporal difference algorithms. Through this approach, we establish a connection between the spectral decomposition of the transition operator and the representations induced by a variety of auxiliary tasks. We then leverage insights from these theoretical results to inform the selection of auxiliary tasks for deep reinforcement learning agents in sparse-reward environments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lyle21a.html
http://proceedings.mlr.press/v130/lyle21a.html Benchmarking Simulation-Based Inference Recent advances in probabilistic modelling have led to a large number of simulation-based inference algorithms which do not require numerical evaluation of likelihoods. However, a public benchmark with appropriate performance metrics for such ‘likelihood-free’ algorithms has been lacking. This has made it difficult to compare algorithms and identify their strengths and weaknesses. We set out to fill this gap: We provide a benchmark with inference tasks and suitable performance metrics, with an initial selection of algorithms including recent approaches employing neural networks and classical Approximate Bayesian Computation methods. We found that the choice of performance metric is critical, that even state-of-the-art algorithms have substantial room for improvement, and that sequential estimation improves sample efficiency. Neural network-based approaches generally exhibit better performance, but there is no uniformly best algorithm. We provide practical advice and highlight the potential of the benchmark to diagnose problems and improve algorithms. The results can be explored interactively on a companion website. All code is open source, making it possible to contribute further benchmark tasks and inference algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lueckmann21a.html
http://proceedings.mlr.press/v130/lueckmann21a.html Low-Rank Generalized Linear Bandit Problems In a low-rank linear bandit problem, the reward of an action (represented by a matrix of size $d_1 \times d_2$) is the inner product between the action and an unknown low-rank matrix $\Theta^*$. We propose an algorithm based on a novel combination of online-to-confidence-set conversion \citep{abbasi2012online} and the exponentially weighted average forecaster constructed by a covering of low-rank matrices. In $T$ rounds, our algorithm achieves $\widetilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$ regret that improves upon the standard linear bandit regret bound of $\widetilde{O}(d_1d_2\sqrt{T})$ when the rank of $\Theta^*$: $r \ll \min\{d_1,d_2\}$. We also extend our algorithmic approach to the generalized linear setting to get an algorithm which enjoys a similar bound under regularity conditions on the link function. To get around the computational intractability of covering based approaches, we propose an efficient algorithm by extending the "Explore-Subspace-Then-Refine" algorithm of \citet{jun2019bilinear}. Our efficient algorithm achieves $\widetilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$ regret under a mild condition on the action set $\mathcal{X}$ and the $r$-th singular value of $\Theta^*$. Our upper bounds match the conjectured lower bound of \cite{jun2019bilinear} for a subclass of low-rank linear bandit problems. Further, we show that existing lower bounds for the sparse linear bandit problem strongly suggest that our regret bounds are unimprovable. To complement our theoretical contributions, we also conduct experiments to demonstrate that our algorithm can greatly outperform the performance of the standard linear bandit approach when $\Theta^*$ is low-rank. Thu, 18 Mar 2021 00:00:00 +0000
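As a rough reading aid for the two regret bounds quoted in the low-rank bandit abstract above, the following worked comparison takes the square case $d_1 = d_2 = d$ and ignores the constants and logarithmic factors hidden in $\widetilde{O}$. It is a back-of-the-envelope sketch under those assumptions, not a statement from the paper.

```latex
% Square case d_1 = d_2 = d, constants and log factors dropped.
\begin{align*}
\text{low-rank bound:} \quad & (d_1 + d_2)^{3/2}\sqrt{rT} = (2d)^{3/2}\sqrt{rT} = 2^{3/2}\, d^{3/2}\sqrt{r}\,\sqrt{T}, \\
\text{standard bound:} \quad & d_1 d_2 \sqrt{T} = d^{2}\sqrt{T}, \\
\text{improvement when:} \quad & 2^{3/2}\, d^{3/2}\sqrt{r} < d^{2}
  \iff \sqrt{r} < \frac{\sqrt{d}}{2^{3/2}}
  \iff r < \frac{d}{8}.
\end{align*}
```

In this simplified accounting the low-rank bound is the smaller one once the rank is below roughly an eighth of the dimension, consistent with the abstract's condition $r \ll \min\{d_1, d_2\}$.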
http://proceedings.mlr.press/v130/lu21a.html
http://proceedings.mlr.press/v130/lu21a.html Hyperbolic graph embedding with enhanced semi-implicit variational inference. Efficient modeling of relational data arising in physical, social, and information sciences is challenging due to complicated dependencies within the data. In this work we build off of semi-implicit graph variational auto-encoders to capture higher order statistics in a low-dimensional graph latent representation. We incorporate hyperbolic geometry in the latent space through a Poincaré embedding to efficiently represent graphs exhibiting hierarchical structure. To address the naive posterior latent distribution assumptions in classical variational inference, we use semi-implicit hierarchical variational Bayes to implicitly capture posteriors of given graph data, which may exhibit heavy tails, multiple modes, skewness, and highly correlated latent structures. We show that the existing semi-implicit variational inference objective provably reduces information in the observed graph. Based on this observation, we estimate and add an additional mutual information term to the semi-implicit variational inference learning objective to capture rich correlations arising between the input and latent spaces. We show that the inclusion of this regularization term in conjunction with the Poincaré embedding boosts the quality of learned high-level representations and enables more flexible and faithful graphical modeling. We experimentally demonstrate that our approach outperforms existing graph variational auto-encoders both in Euclidean and in hyperbolic spaces for edge link prediction and node classification. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lotfi-rezaabad21a.html
http://proceedings.mlr.press/v130/lotfi-rezaabad21a.html Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models. Thu, 18 Mar 2021 00:00:00 +0000
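The idea in the stochastic Polyak step-size abstract above can be sketched on a toy interpolation problem. The example uses per-sample least-squares losses $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$ on a consistent linear system, so every $f_i^* = 0$ is known. The capped step-size form $\gamma = \min\{(f_i(x) - f_i^*) / (c\,\|\nabla f_i(x)\|^2),\ \gamma_{\max}\}$ is the commonly cited variant. The abstract does not spell it out, so the exact form, the constant `c`, the cap `gamma_max`, and the function name are assumptions.

```python
import random


def sps_sgd(rows, targets, steps=2000, c=0.5, gamma_max=10.0, seed=0):
    """SGD with a (capped) stochastic Polyak step-size on least squares.

    Assumes the system rows @ x = targets is consistent, so each per-sample
    optimal value f_i^* equals 0 (the interpolation setting).
    """
    rng = random.Random(seed)
    x = [0.0] * len(rows[0])
    for _ in range(steps):
        i = rng.randrange(len(rows))
        a, b = rows[i], targets[i]
        residual = sum(aj * xj for aj, xj in zip(a, x)) - b
        loss = 0.5 * residual ** 2            # f_i(x), with f_i^* = 0 assumed
        grad = [residual * aj for aj in a]    # gradient of f_i at x
        grad_sq = sum(g * g for g in grad)
        if grad_sq == 0.0:
            continue  # sample already fit exactly; no update needed
        gamma = min(loss / (c * grad_sq), gamma_max)  # Polyak-type step
        x = [xj - gamma * g for xj, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    # Consistent system with solution x* = [2, -1].
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    targets = [2.0, -1.0, 1.0]
    print(sps_sgd(rows, targets))
```

No learning rate is tuned by hand here: the step is computed from the current loss and gradient norm. With `c = 0.5` on this particular loss, each update projects onto the hyperplane of the sampled equation, which is why the iterates reach the solution of the consistent system.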
http://proceedings.mlr.press/v130/loizou21a.html
http://proceedings.mlr.press/v130/loizou21a.html Contrastive learning of strong-mixing continuous-time stochastic processes Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data. It has recently emerged as one of the leading learning paradigms in the absence of labels across many different domains (e.g. brain imaging, text, images). However, theoretical understanding of many aspects of training, both statistical and algorithmic, remains fairly elusive. In this work, we study the setting of time series: more precisely, when we get data from a strong-mixing continuous-time stochastic process. We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case. Moreover, we give sample complexity bounds for solving this task and quantitatively characterize what the value of the contrastive loss implies for distributional closeness of the learned kernel. As a byproduct, we illuminate the appropriate settings for the contrastive distribution, as well as other hyperparameters in this setup. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21h.html
http://proceedings.mlr.press/v130/liu21h.html Variable Selection with Rigorous Uncertainty Quantification using Deep Bayesian Neural Networks: Posterior Concentration and Bernstein-von Mises Phenomenon This work develops a theoretical basis for the deep Bayesian neural network (BNN)’s ability to perform high-dimensional variable selection with rigorous uncertainty quantification. We develop new Bayesian non-parametric theorems to show that a properly configured deep BNN (1) learns the variable importance effectively in high dimensions, with a learning rate that can sometimes “break” the curse of dimensionality, and (2) provides rigorous uncertainty quantification for variable importance, in the sense that its 95% credible intervals for variable importance indeed cover the truth 95% of the time (i.e. the Bernstein-von Mises (BvM) phenomenon). The theoretical results suggest a simple variable selection algorithm based on the BNN’s credible intervals. Extensive simulation confirms the theoretical findings and shows that the proposed algorithm outperforms existing classic and neural-network-based variable selection methods, particularly in high dimensions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21g.html
http://proceedings.mlr.press/v130/liu21g.html Smooth Bandit Optimization: Generalization to Hölder Space We consider bandit optimization of a smooth reward function, where the goal is cumulative regret minimization. This problem has been studied for $\alpha$-Hölder continuous (including Lipschitz) functions with $0<\alpha\leq 1$. Our main result generalizes the reward function to Hölder space with exponent $\alpha>1$ to bridge the gap between Lipschitz bandits and infinitely-differentiable models such as linear bandits. For Hölder continuous functions, approaches based on random sampling in bins of a discretized domain suffice for optimality. In contrast, we propose a class of two-layer algorithms that deploy misspecified linear/polynomial bandit algorithms in bins. We demonstrate that the proposed algorithm can exploit higher-order smoothness of the function by deriving a regret upper bound of $\tilde{O}(T^\frac{d+\alpha}{d+2\alpha})$ when $\alpha>1$, which matches the existing lower bound. We also study adaptation to unknown function smoothness over a continuous scale of Hölder spaces indexed by $\alpha$, with a bandit model selection approach applied with our proposed two-layer algorithms. We show that it achieves a regret rate that matches the existing lower bound for adaptation within the $\alpha\leq 1$ subset. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21f.html
http://proceedings.mlr.press/v130/liu21f.html Noisy Gradient Descent Converges to Flat Minima for Nonconvex Matrix Factorization Ample empirical evidence has corroborated the importance of noise in nonconvex optimization problems. The theory behind such empirical observations, however, is still largely unknown. This paper studies this fundamental problem through investigating the nonconvex rectangular matrix factorization problem, which has infinitely many global minima due to rotation and scaling invariance. Hence, gradient descent (GD) can converge to any optimum, depending on the initialization. In contrast, we show that a perturbed form of GD with an arbitrary initialization converges to a global optimum that is uniquely determined by the injected noise. Our result implies that the noise imposes an implicit bias towards certain optima. Numerical experiments are provided to support our theory. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21e.html
http://proceedings.mlr.press/v130/liu21e.html Learning with Hyperspherical Uniformity Owing to their over-parameterized nature, neural networks are a powerful tool for nonlinear function approximation. In order to achieve good generalization on unseen data, a suitable inductive bias is of great importance for neural networks. One of the most straightforward ways to provide it is to regularize the neural network with some additional objectives. L2 regularization serves as a standard regularization for neural networks. Despite its popularity, it essentially regularizes one dimension of the individual neuron, which is not strong enough to control the capacity of highly over-parameterized neural networks. Motivated by this, hyperspherical uniformity is proposed as a novel family of relational regularizations that impact the interaction among neurons. We consider several geometrically distinct ways to achieve hyperspherical uniformity. The effectiveness of hyperspherical uniformity is justified by theoretical insights and empirical evaluations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21d.html
http://proceedings.mlr.press/v130/liu21d.html Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC class in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establishing stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning for saving privacy budget. The novel components in the proofs include a more refined analysis of the majority voting classifier (which could be of independent interest) and an observation that the synthetic “student” learning problem is nearly realizable by construction under the Tsybakov noise condition. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21c.html
http://proceedings.mlr.press/v130/liu21c.html Kernel regression in high dimensions: Refined analysis beyond double descent In this paper, we provide a precise characterization of generalization properties of high dimensional kernel ridge regression across the under- and over-parameterized regimes, depending on whether the number of training data n exceeds the feature dimension d. By establishing a bias-variance decomposition of the expected excess risk, we show that, while the bias is (almost) independent of d and monotonically decreases with n, the variance depends on n,d and can be unimodal or monotonically decreasing under different regularization schemes. Our refined analysis goes beyond the double descent theory by showing that, depending on the data eigen-profile and the level of regularization, the kernel regression risk curve can be a double-descent-like, bell-shaped, or monotonic function of n. Experiments on synthetic and real data are conducted to support our theoretical findings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21b.html
http://proceedings.mlr.press/v130/liu21b.html Fast Learning in Reproducing Kernel Krein Spaces via Signed Measures In this paper, we attempt to solve a long-standing open question for non-positive definite (non-PD) kernels in the machine learning community: can a given non-PD kernel be decomposed into the difference of two PD kernels (termed positive decomposition)? We cast this question in a distributional view by introducing the signed measure, which transforms positive decomposition into measure decomposition: a series of non-PD kernels can be associated with the linear combination of specific finite Borel measures. In this manner, our distribution-based framework provides a necessary and sufficient condition to answer this open question. Specifically, this solution is also computationally implementable in practice to scale non-PD kernels in large sample cases, which allows us to devise the first random features algorithm to obtain an unbiased estimator. Experimental results on several benchmark datasets verify the effectiveness of our algorithm over the existing methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liu21a.html
http://proceedings.mlr.press/v130/liu21a.html On the Privacy Properties of GAN-generated Samples The privacy implications of generative adversarial networks (GANs) are a topic of great interest, leading to several recent algorithms for training GANs with privacy guarantees. By drawing connections to the generalization properties of GANs, we prove that under some assumptions, GAN-generated samples inherently satisfy some (weak) privacy guarantees. First, we show that if a GAN is trained on m samples and used to generate n samples, the generated samples are (epsilon, delta)-differentially-private for (epsilon, delta) pairs where delta scales as O(n/m). We show that under some special conditions, this upper bound is tight. Next, we study the robustness of GAN-generated samples to membership inference attacks. We model membership inference as a hypothesis test in which the adversary must determine whether a given sample was drawn from the training dataset or from the underlying data distribution. We show that this adversary can achieve an area under the ROC curve that scales no better than O(m^{-1/4}). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lin21b.html
http://proceedings.mlr.press/v130/lin21b.html On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models or supervised learning. Yet, the behavior of minimum Wasserstein estimators is poorly understood, notably in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures by choosing a $k$-dimensional subspace onto which they can be projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving previous literature that has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, by averaging rather than optimizing on subspaces. Our complexity bounds can help explain why both PRW and IPRW distances outperform Wasserstein distances empirically in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide an asymptotic guarantee of two types of minimum PRW estimators and formulate a central limit theorem for max-sliced Wasserstein estimator under model misspecification. To enable our analysis on PRW with projection dimension larger than one, we devise a novel combination of variational analysis and statistical theory. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lin21a.html
http://proceedings.mlr.press/v130/lin21a.html Model updating after interventions paradoxically introduces bias Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such ‘naive updating’ when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liley21a.html
http://proceedings.mlr.press/v130/liley21a.html CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform – a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction. Thu, 18 Mar 2021 00:00:00 +0000
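The compact WY idea named in the CWY abstract above can be written out as a worked identity. The form below is the standard compact WY representation of a product of Householder reflections with unit vectors $u_i$. The abstract does not state the formula, so the notation ($U$, $S$, $\mathrm{striu}$ for the strictly upper-triangular part) is an assumption for illustration. The two-reflection case is expanded as a check.

```latex
% Householder reflections with unit vectors u_1, ..., u_L in R^N:
%   H_i = I - 2 u_i u_i^T.
% Compact WY form of their product (notation assumed for illustration):
\begin{align*}
H_1 H_2 \cdots H_L &= I - U S^{-1} U^\top,
\qquad U = [\,u_1 \;\cdots\; u_L\,],
\qquad S = \tfrac{1}{2} I + \operatorname{striu}(U^\top U). \\[4pt]
% Check for L = 2, writing a = u_1^T u_2:
S &= \begin{pmatrix} \tfrac12 & a \\ 0 & \tfrac12 \end{pmatrix},
\qquad
S^{-1} = \begin{pmatrix} 2 & -4a \\ 0 & 2 \end{pmatrix}, \\
I - U S^{-1} U^\top
  &= I - 2 u_1 u_1^\top + 4a\, u_1 u_2^\top - 2 u_2 u_2^\top \\
  &= (I - 2 u_1 u_1^\top)(I - 2 u_2 u_2^\top) = H_1 H_2 .
\end{align*}
```

The point of this form is that the product is obtained from a few matrix multiplications and one small triangular solve, rather than from $L$ reflections applied one after another, which is what makes it friendly to parallel hardware.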
http://proceedings.mlr.press/v130/likhosherstov21a.html
http://proceedings.mlr.press/v130/likhosherstov21a.html Accelerating Metropolis-Hastings with Lightweight Inference Compilation In order to construct accurate proposers for Metropolis-Hastings Markov Chain Monte Carlo, we integrate ideas from probabilistic graphical models and neural networks in an open-source framework we call Lightweight Inference Compilation (LIC). LIC implements amortized inference within an open-universe declarative probabilistic programming language (PPL). Graph neural networks are used to parameterize proposal distributions as functions of Markov blankets, which during “compilation” are optimized to approximate single-site Gibbs sampling distributions. Unlike prior work in inference compilation (IC), LIC forgoes importance sampling of linear execution traces in favor of operating directly on Bayesian networks. Through using a declarative PPL, the Markov blankets of nodes (which may be non-static) are queried at inference-time to produce proposers. Experimental results show LIC can produce proposers which have fewer parameters, greater robustness to nuisance random variables, and improved posterior sampling in a Bayesian logistic regression and n-schools inference application. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/liang21a.html
http://proceedings.mlr.press/v130/liang21a.html Nonlinear Projection Based Gradient Estimation for Query Efficient Blackbox Attacks Gradient estimation and vector space projection have been studied as two distinct topics. We aim to bridge the gap between the two by investigating how to efficiently estimate gradient based on a projected low-dimensional space. We first provide lower and upper bounds for gradient estimation under both linear and nonlinear projections, and outline checkable sufficient conditions under which one is better than the other. Moreover, we analyze the query complexity for the projection-based gradient estimation and present a sufficient condition for query-efficient estimators. Built upon our theoretic analysis, we propose a novel query-efficient Nonlinear Gradient Projection-based Boundary Blackbox Attack (NonLinear-BA). We conduct extensive experiments on four image datasets: ImageNet, CelebA, CIFAR-10, and MNIST, and show the superiority of the proposed methods compared with the state-of-the-art baselines. In particular, we show that the projection-based boundary blackbox attacks are able to achieve much smaller magnitude of perturbations with 100% attack success rate based on efficient queries. Both linear and nonlinear projections demonstrate their advantages under different conditions. We also evaluate NonLinear-BA against the commercial online API MEGVII Face++, and demonstrate the high blackbox attack performance both quantitatively and qualitatively. The code is publicly available at https://github.com/AI-secure/NonLinear-BA. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/li21f.html
http://proceedings.mlr.press/v130/li21f.html One-Sketch-for-All: Non-linear Random Features from Compressed Linear Measurements The commonly used Gaussian kernel has a tuning parameter $\gamma$. This makes it challenging to design quantization schemes for random Fourier features (RFF), a popular technique for approximating the Gaussian kernel. Intuitively one would expect that a different quantizer is needed for a different $\gamma$ value (and we need to store a different set of quantized data for each $\gamma$). Fortunately, the recent work \citep{Report:Li_2021_RFF} showed that only one Lloyd-Max (LM) quantizer is needed as the marginal distribution of RFF is free of the tuning parameter $\gamma$. On the other hand, \citet{Report:Li_2021_RFF} still required storing a different set of quantized data for each $\gamma$ value. In this paper, we adopt the “one-sketch-for-all” strategy for quantizing RFFs. Basically, we only store one set of quantized data after applying random projections on the original data. From the same set of quantized data, we can construct approximate RFFs to approximate Gaussian kernels for any tuning parameter $\gamma$. Compared with \citet{Report:Li_2021_RFF}, our proposed scheme would lose some accuracy as one would expect. Nevertheless, the proposed method still performs noticeably better than the quantization scheme based on random rounding. We provide statistical analysis on the properties of the proposed method and experiments are conducted to empirically illustrate its effectiveness. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/li21e.html
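The "one-sketch-for-all" idea in the abstract above can be illustrated with a minimal sketch that stores the random projections once and rebuilds random Fourier features for any $\gamma$ afterwards. This is not the authors' code: the quantization step is omitted entirely, the function names (`sketch`, `rff_from_sketch`, `gaussian_kernel`) are hypothetical, and the kernel convention $\exp(-\gamma^2 \|x-y\|^2/2)$ is an assumption made here for illustration.

```python
import numpy as np

def sketch(X, num_features, rng):
    """Project the data once; the stored sketch does not depend on gamma."""
    W = rng.standard_normal((X.shape[1], num_features))  # Gaussian projection directions
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)  # random phases
    return X @ W, b

def rff_from_sketch(P, b, gamma):
    """Build random Fourier features for a given gamma from the stored sketch."""
    num_features = P.shape[1]
    return np.sqrt(2.0 / num_features) * np.cos(gamma * P + b)

def gaussian_kernel(X, gamma):
    """Exact kernel under the assumed convention exp(-gamma^2 * ||x - y||^2 / 2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * gamma ** 2 * sq_dists)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
P, b = sketch(X, num_features=20000, rng=rng)  # stored once
for gamma in (0.5, 1.0):                       # reused for several gamma values
    Z = rff_from_sketch(P, b, gamma)
    max_err = np.max(np.abs(Z @ Z.T - gaussian_kernel(X, gamma)))
    print(f"gamma={gamma}: max abs kernel error = {max_err:.4f}")
```

Because $\gamma$ only rescales the stored projections, the same array `P` serves every kernel width; the scheme in the paper additionally quantizes the stored projections, which this sketch does not attempt.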
http://proceedings.mlr.press/v130/li21e.html Rate-improved inexact augmented Lagrangian method for constrained nonconvex optimization First-order methods have been studied for nonlinear constrained optimization within the framework of the augmented Lagrangian method (ALM) or penalty method. We propose an improved inexact ALM (iALM) and conduct a unified analysis for nonconvex problems with either convex or nonconvex constraints. Under certain regularity conditions (that are also assumed by existing works), we show an $\tilde{O}(\varepsilon^{-\frac{5}{2}})$ complexity result for a problem with a nonconvex objective and convex constraints and an $\tilde{O}(\varepsilon^{-3})$ complexity result for a problem with a nonconvex objective and nonconvex constraints, where the complexity is measured by the number of first-order oracles to yield an $\varepsilon$-KKT solution. Both results are the best known. The same-order complexity results have been achieved by penalty methods. However, two different analysis techniques are used to obtain the results, and more importantly, the penalty methods generally perform significantly worse than iALM in practice. Our improved iALM and analysis close the gap between theory and practice. Numerical experiments on nonconvex problems with convex or nonconvex constraints are provided to demonstrate the effectiveness of our proposed method. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/li21d.html
http://proceedings.mlr.press/v130/li21d.html Unifying Clustered and Non-stationary Bandits Non-stationary bandits and clustered bandits lift the restrictive assumptions in contextual bandits and provide solutions to many important real-world scenarios. Though they have been studied independently so far, we point out the essence in solving these two problems overlaps considerably. In this work, we connect these two strands of bandit research under the notion of test of homogeneity, which seamlessly addresses change detection for non-stationary bandit and cluster identification for clustered bandit in a unified solution framework. Rigorous regret analysis and extensive empirical evaluations demonstrate the value of our proposed solution, especially its flexibility in handling various environment assumptions, e.g., a clustered non-stationary environment. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/li21c.html
http://proceedings.mlr.press/v130/li21c.html Tight Regret Bounds for Infinite-armed Linear Contextual Bandits Linear contextual bandit is a class of sequential decision-making problems with important applications in recommendation systems, online advertising, healthcare, and other machine learning-related tasks. While there is much prior research, tight regret bounds of linear contextual bandit with infinite action sets remain open. In this paper, we consider the linear contextual bandit problem with (changing) infinite action sets. We prove a regret upper bound on the order of $O(\sqrt{d^2 T\log T}) \cdot \mathrm{poly}(\log\log T)$, where $d$ is the domain dimension and $T$ is the time horizon. Our upper bound matches the previous lower bound of $\Omega(\sqrt{d^2 T\log T})$ in [Li et al., 2019] up to iterated logarithmic terms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/li21b.html
http://proceedings.mlr.press/v130/li21b.html Online Forgetting Process for Linear Regression Models Motivated by the EU’s "Right To Be Forgotten" regulation, we initiate a study of statistical data deletion problems where users’ data are accessible only for a limited period of time. This setting is formulated as an online supervised learning task with \textit{constant memory limit}. We propose a deletion-aware algorithm \texttt{FIFD-OLS} for the low dimensional case, and witness a catastrophic rank swinging phenomenon due to the data deletion operation, which leads to statistical inefficiency. As a remedy, we propose the \texttt{FIFD-Adaptive Ridge} algorithm with a novel online regularization scheme that effectively offsets the uncertainty from deletion. In theory, we provide the cumulative regret upper bound for both online forgetting algorithms. In the experiments, we show that \texttt{FIFD-Adaptive Ridge} outperforms the ridge regression algorithm with a fixed regularization level, which we hope sheds some light on more complex statistical models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/li21a.html
http://proceedings.mlr.press/v130/li21a.html PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to automate. We present PClean, a probabilistic programming language (PPL) for leveraging dataset-specific knowledge to automate Bayesian cleaning. Compared to general-purpose PPLs, PClean tackles a restricted problem domain, enabling three modeling and inference innovations: (1) a non-parametric model of relational database instances, which users’ programs customize; (2) a novel sequential Monte Carlo inference algorithm that exploits the structure of PClean’s model class; and (3) a compiler that generates near-optimal SMC proposals and blocked-Gibbs rejuvenation kernels based on the user’s model and data. We show empirically that short (< 50-line) PClean programs can: be faster and more accurate than generic PPL inference on data-cleaning benchmarks; match state-of-the-art data-cleaning systems in terms of accuracy and runtime (unlike generic PPL inference in the same runtime); and scale to real-world datasets with millions of records. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lew21a.html
http://proceedings.mlr.press/v130/lew21a.html LassoNet: Neural Networks with Feature Sparsity Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by allowing a feature to participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. In experiments with real and simulated data, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lemhadri21a.html
http://proceedings.mlr.press/v130/lemhadri21a.html Distribution Regression for Sequential Data Distribution regression refers to the supervised learning problem where labels are only available for groups of inputs instead of individual inputs. In this paper, we develop a rigorous mathematical framework for distribution regression where inputs are complex data streams. Leveraging properties of the expected signature and a recent signature kernel trick for sequential data from stochastic analysis, we introduce two new learning techniques, one feature-based and the other kernel-based. Each is suited to a different data regime in terms of the number of data streams and the dimensionality of the individual streams. We provide theoretical results on the universality of both approaches and demonstrate empirically their robustness to irregularly sampled multivariate time-series, achieving state-of-the-art performance on both synthetic and real-world examples from thermodynamics, mathematical finance and agricultural science. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lemercier21a.html
http://proceedings.mlr.press/v130/lemercier21a.html Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifically, Daskalakis et al. 2018 and Liang-Stokes 2019 show last iterate convergence of the so-called "Optimistic Gradient Descent/Ascent" for the case of \textit{unconstrained} min-max optimization. Moreover, in Mertikopoulos et al. 2019 the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm uses \textit{vanishing stepsizes}. In this work, we show that "Optimistic Multiplicative-Weights Update (OMWU)" with \textit{constant stepsize} exhibits last iterate convergence locally for convex-concave games, generalizing the results of Daskalakis and Panageas 2019, where last iterate convergence of OMWU was shown only for the \textit{bilinear case}. To the best of our knowledge, this is the first result about last-iterate convergence for constrained zero-sum games (beyond the bilinear case) in which the dynamics use constant step-sizes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lei21a.html
http://proceedings.mlr.press/v130/lei21a.html Online Model Selection for Reinforcement Learning with Function Approximation Deep reinforcement learning has achieved impressive successes yet often requires a very large amount of interaction data. This result is perhaps unsurprising, as using complicated function approximation often requires more data to fit, and early theoretical results on linear Markov decision processes provide regret bounds that scale with the dimension of the linear approximation. Ideally, we would like to automatically identify the minimal dimension of the approximation that is sufficient to encode an optimal policy. Towards this end, we consider the problem of model selection in RL with function approximation, given a set of candidate RL algorithms with known regret guarantees. The learner’s goal is to adapt to the complexity of the optimal algorithm without knowing it a priori. We present a meta-algorithm that successively rejects increasingly complex models using a simple statistical test. Given at least one candidate that satisfies realizability, we prove the meta-algorithm adapts to the optimal complexity with regret that is only marginally suboptimal in the number of episodes and number of candidate algorithms. The dimension and horizon dependencies remain optimal with respect to the best candidate, and our meta-algorithmic approach is flexible to incorporate multiple candidate algorithms and models. Finally, we show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds that depend on the gaps between the maximal values attainable by the candidates. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lee21d.html
http://proceedings.mlr.press/v130/lee21d.html SDF-Bayes: Cautious Optimism in Safe Dose-Finding Clinical Trials with Drug Combinations and Heterogeneous Patient Groups Phase I clinical trials are designed to test the safety (non-toxicity) of drugs and find the maximum tolerated dose (MTD). This task becomes significantly more challenging when multiple-drug dose-combinations (DC) are involved, due to the inherent conflict between the exponentially increasing DC candidates and the limited patient budget. This paper proposes a novel Bayesian design, SDF-Bayes, for finding the MTD for drug combinations in the presence of safety constraints. Rather than the conventional principle of escalating or de-escalating the current dose of one drug (perhaps alternating between drugs), SDF-Bayes proceeds by cautious optimism: it chooses the next DC that, on the basis of current information, is most likely to be the MTD (optimism), subject to the constraint that it only chooses DCs that have a high probability of being safe (caution). We also propose an extension, SDF-Bayes-AR, that accounts for patient heterogeneity and enables heterogeneous patient recruitment. Extensive experiments based on both synthetic and real-world datasets demonstrate the advantages of SDF-Bayes over state of the art DC trial designs in terms of accuracy and safety. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lee21c.html
http://proceedings.mlr.press/v130/lee21c.html Reinforcement Learning for Mean Field Games with Strategic Complementarities Mean Field Games (MFG) are a class of games with a very large number of agents, and the standard equilibrium concept is a Mean Field Equilibrium (MFE). Algorithms for learning MFE in dynamic MFGs are unknown in general. Our focus is on an important subclass that possesses a monotonicity property called Strategic Complementarities (MFG-SC). We introduce a natural refinement to the equilibrium concept that we call Trembling-Hand-Perfect MFE (T-MFE), which allows agents to employ a measure of randomization while accounting for the impact of such randomization on their payoffs. We propose a simple algorithm for computing T-MFE under a known model. We also introduce a model-free and a model-based approach to learning T-MFE and provide sample complexities of both algorithms. We also develop a fully online learning scheme that obviates the need for a simulator. Finally, we empirically evaluate the performance of the proposed algorithms via examples motivated by real-world applications. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lee21b.html
http://proceedings.mlr.press/v130/lee21b.html A Variational Information Bottleneck Approach to Multi-Omics Data Integration Integration of data from multiple omics techniques is becoming increasingly important in biomedical research. Due to non-uniformity and technical limitations in omics platforms, such integrative analyses on multiple omics, which we refer to as views, involve learning from incomplete observations with various view-missing patterns. This is challenging because i) complex interactions within and across observed views need to be properly addressed for optimal predictive power and ii) observations with various view-missing patterns need to be flexibly integrated. To address such challenges, we propose a deep variational information bottleneck (IB) approach for incomplete multi-view observations. Our method applies the IB framework on marginal and joint representations of the observed views to focus on intra-view and inter-view interactions that are relevant for the target. Most importantly, by modeling the joint representations as a product of marginal representations, we can efficiently learn from observed views with various view-missing patterns. Experiments on real-world datasets show that our method consistently achieves gain from data integration and outperforms state-of-the-art benchmarks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lee21a.html
http://proceedings.mlr.press/v130/lee21a.html Flow-based Alignment Approaches for Probability Measures in Different Spaces Gromov-Wasserstein (GW) is a powerful tool to compare probability measures whose supports are in different metric spaces. However, GW suffers from a computational drawback since it requires solving a complex non-convex quadratic program. In this work, we consider a specific family of cost metrics, namely, tree metrics for supports of each probability measure, to develop efficient and scalable discrepancies between the probability measures. Leveraging a tree structure, we propose to align flows from a root to each support instead of pair-wise tree metrics of supports, i.e., flows from a support to another support, in GW. Consequently, we propose a novel discrepancy, named Flow-based Alignment (FlowAlign), by matching the flows of the probability measures. FlowAlign is computationally fast and scalable for large-scale applications. Further exploring the tree structure, we propose a variant of FlowAlign, named Depth-based Alignment (DepthAlign), by aligning the flows hierarchically along each depth level of the tree structures. Theoretically, we prove that both FlowAlign and DepthAlign are pseudo-metrics. We also derive tree-sliced variants of the proposed discrepancies for applications without prior knowledge about tree structures for probability measures, computed by averaging FlowAlign/DepthAlign using random tree metrics, adaptively sampled from supports of probability measures. Empirically, we test our proposed approaches against other variants of GW baselines on a few benchmark tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/le21b.html
http://proceedings.mlr.press/v130/le21b.html Entropy Partial Transport with Tree Metrics: Theory and Practice Optimal transport (OT) theory provides powerful tools to compare probability measures. However, OT is limited to nonnegative measures having the same mass, and suffers serious drawbacks about its computation and statistics. This leads to several proposals of regularized variants of OT in the recent literature. In this work, we consider an entropy partial transport (EPT) problem for nonnegative measures on a tree having different masses. The EPT is shown to be equivalent to a standard complete OT problem on a one-node extended tree. We derive its dual formulation, then leverage this to propose a novel regularization for EPT which admits fast computation and negative definiteness. To our knowledge, the proposed regularized EPT is the first approach that yields a closed-form solution among available variants of unbalanced OT for general nonnegative measures. For practical applications without prior knowledge about the tree structure for measures, we propose tree-sliced variants of the regularized EPT, computed by averaging the regularized EPT between these measures using random tree metrics, built adaptively from support data points. Exploiting the negative definiteness of our regularized EPT, we introduce a positive definite kernel, and evaluate it against other baselines on benchmark tasks such as document classification with word embedding and topological data analysis. In addition, we empirically demonstrate that our regularization also provides effective approximations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/le21a.html
http://proceedings.mlr.press/v130/le21a.html An Analysis of the Adaptation Speed of Causal Models Consider a collection of datasets generated by unknown interventions on an unknown structural causal model $G$. Recently, Bengio et al. (2020) conjectured that among all candidate models, $G$ is the fastest to adapt from one dataset to another, and supported this conjecture with promising experiments. Indeed, intuitively $G$ has fewer mechanisms to adapt, but this justification is incomplete. Our contribution is a more thorough analysis of this hypothesis. We investigate the adaptation speed of cause-effect SCMs. Using convergence rates from stochastic optimization, we justify that a relevant proxy for adaptation speed is distance in parameter space after intervention. Applying this proxy to categorical and normal cause-effect models, we show two results. When the intervention is on the cause variable, the SCM with the correct causal direction is advantaged by a large factor. When the intervention is on the effect variable, we characterize the relative adaptation speed. Surprisingly, we find situations where the anticausal model is advantaged, falsifying the initial hypothesis. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/le-priol21a.html
http://proceedings.mlr.press/v130/le-priol21a.html Thresholded Adaptive Validation: Tuning the Graphical Lasso for Graph Recovery Many Machine Learning algorithms are formulated as regularized optimization problems, but their performance hinges on a regularization parameter that needs to be calibrated to each application at hand. In this paper, we propose a general calibration scheme for regularized optimization problems and apply it to the graphical lasso, which is a method for Gaussian graphical modeling. The scheme is equipped with theoretical guarantees and motivates a thresholding pipeline that can improve graph recovery. Moreover, requiring at most one line search over the regularization path, the calibration scheme is computationally more efficient than competing schemes that are based on resampling. Finally, we show in simulations that our approach can improve on the graph recovery of other approaches considerably. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/laszkiewicz21a.html
http://proceedings.mlr.press/v130/laszkiewicz21a.html Linearly Constrained Gaussian Processes with Boundary Conditions One goal in Bayesian machine learning is to encode prior knowledge into prior distributions, to model data efficiently. We consider prior knowledge from systems of linear partial differential equations together with their boundary conditions. We construct multi-output Gaussian process priors with realizations in the solution set of such systems, in particular only such solutions can be represented by Gaussian process regression. The construction is fully algorithmic via Gröbner bases and it does not employ any approximation. It builds these priors combining two parametrizations via a pullback: the first parametrizes the solutions for the system of differential equations and the second parametrizes all functions adhering to the boundary conditions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lange-hegermann21a.html
http://proceedings.mlr.press/v130/lange-hegermann21a.html Beyond Perturbation Stability: LP Recovery Guarantees for MAP Inference on Noisy Stable Instances Several works have shown that perturbation stable instances of the MAP inference problem can be solved exactly using a natural linear programming (LP) relaxation. However, most of these works give few (or no) guarantees for the LP solutions on instances that do not satisfy the relatively strict perturbation stability definitions. In this work, we go beyond these stability results by showing that the LP approximately recovers the MAP solution of a stable instance even after the instance is corrupted by noise. This "noisy stable" model realistically fits with practical MAP inference problems: we design an algorithm for finding "close" stable instances, and show that several real-world instances from computer vision have nearby instances that are perturbation stable. These results suggest a new theoretical explanation for the excellent performance of this LP relaxation in practice. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lang21a.html
http://proceedings.mlr.press/v130/lang21a.html Neural Function Modules with Sparse Arguments: A Dynamic Approach to Integrating Information across Layers Feed-forward neural networks consist of a sequence of layers, in which each layer performs some processing on the information from the previous layer. A downside to this approach is that each layer (or module, as multiple modules can operate in parallel) is tasked with processing the entire hidden state, rather than a particular part of the state which is most relevant for that module. Methods which only operate on a small number of input variables are an essential part of most programming languages, and they allow for improved modularity and code re-usability. Our proposed method, Neural Function Modules (NFM), aims to introduce the same structural capability into deep learning. Most of the work in the context of feed-forward networks combining top-down and bottom-up feedback is limited to classification problems. The key contribution of our work is to combine attention, sparsity, top-down and bottom-up feedback, in a flexible algorithm which, as we show, improves the results in standard classification, out-of-domain generalization, generative modeling, and learning representations in the context of reinforcement learning. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/lamb21a.html
http://proceedings.mlr.press/v130/lamb21a.html All of the Fairness for Edge Prediction with Optimal Transport Machine learning and data mining algorithms have been increasingly used recently to support decision-making systems in many areas of high societal importance such as healthcare, education, or security. While being very efficient in their predictive abilities, the deployed algorithms sometimes tend to learn an inductive model with a discriminative bias due to the presence of such bias in the learning sample. This problem gave rise to a new field of algorithmic fairness where the goal is to correct the discriminative bias introduced by a certain attribute in order to decorrelate it from the model’s output. In this paper, we study the problem of fairness for the task of edge prediction in graphs, a largely underinvestigated scenario compared to the more popular setting of fair classification. To this end, we formulate the problem of fair edge prediction, analyze it theoretically, and propose an embedding-agnostic repairing procedure for the adjacency matrix of an arbitrary graph with a trade-off between group and individual fairness. We experimentally show the versatility of our approach and its capacity to provide explicit control over different notions of fairness and prediction accuracy. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/laclau21a.html
http://proceedings.mlr.press/v130/laclau21a.html On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression We study the convergence rates of the EM algorithm for learning two-component mixed linear regression under all regimes of signal-to-noise ratio (SNR). We resolve a long-standing question that many recent results have attempted to tackle: we completely characterize the convergence behavior of EM, and show that the EM algorithm achieves minimax optimal sample complexity under all SNR regimes. In particular, when the SNR is sufficiently large, the EM updates converge to the true parameter $\theta^{*}$ at the standard parametric convergence rate $\mathcal{O}((d/n)^{1/2})$ after $\mathcal{O}(\log(n/d))$ iterations. In the regime where the SNR is above $\mathcal{O}((d/n)^{1/4})$ and below some constant, the EM iterates converge to a $\mathcal{O}({\rm SNR}^{-1} (d/n)^{1/2})$ neighborhood of the true parameter, when the number of iterations is of the order $\mathcal{O}({\rm SNR}^{-2} \log(n/d))$. In the low SNR regime where the SNR is below $\mathcal{O}((d/n)^{1/4})$, we show that EM converges to a $\mathcal{O}((d/n)^{1/4})$ neighborhood of the true parameters, after $\mathcal{O}((n/d)^{1/2})$ iterations. Notably, these results are achieved under mild conditions of either random initialization or an efficiently computable local initialization. By providing tight convergence guarantees of the EM algorithm in middle-to-low SNR regimes, we fill the remaining gap in the literature, and significantly, reveal that in low SNR, EM changes rate, matching the $n^{-1/4}$ rate of the MLE, a behavior that previous work had been unable to show. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kwon21b.html
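For readers unfamiliar with the setting of the abstract above, the EM iteration for two-component mixed linear regression can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes the symmetric, balanced model $y = z\, x^\top \theta^{*} + \varepsilon$ with $z \in \{-1, +1\}$ equally likely and a known noise level $\sigma$, and the function name `em_two_component_mlr` and the starting point are hypothetical.

```python
import numpy as np

def em_two_component_mlr(X, y, theta0, sigma, num_iters=50):
    """EM for the symmetric model y = z * <x, theta*> + noise, z in {-1, +1}."""
    theta = np.asarray(theta0, dtype=float)
    gram = X.T @ X
    for _ in range(num_iters):
        # E-step: posterior mean of the hidden sign z given (x, y) and current theta
        soft_signs = np.tanh(y * (X @ theta) / sigma ** 2)
        # M-step: least squares on the sign-corrected responses
        theta = np.linalg.solve(gram, X.T @ (soft_signs * y))
    return theta

rng = np.random.default_rng(0)
n, d, sigma = 2000, 3, 0.5
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, d))
z = rng.choice([-1.0, 1.0], size=n)
y = z * (X @ theta_star) + sigma * rng.standard_normal(n)

theta_hat = em_two_component_mlr(X, y, theta0=[1.0, -1.0, 0.0], sigma=sigma)
# The model is only identifiable up to a global sign flip.
err = min(np.linalg.norm(theta_hat - theta_star), np.linalg.norm(theta_hat + theta_star))
print("estimation error up to sign:", err)
```

The example sits in the high-SNR regime ($\|\theta^{*}\|/\sigma$ well above a constant), where the abstract states that the iterates approach $\theta^{*}$ at the parametric rate; the low-SNR behavior it describes is not exercised here.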
http://proceedings.mlr.press/v130/kwon21b.html Efficient Computation and Analysis of Distributional Shapley Values Distributional data Shapley value (DShapley) has recently been proposed as a principled framework to quantify the contribution of an individual datum in machine learning. DShapley develops the foundational game theory concept of Shapley values into a statistical framework and can be applied to identify data points that are useful (or harmful) to a learning algorithm. Estimating DShapley is computationally expensive, however, and this can be a major challenge to using it in practice. Moreover, there has been little mathematical analysis of how this value depends on data characteristics. In this paper, we derive the first analytic expressions for DShapley for the canonical problems of linear regression, binary classification, and non-parametric density estimation. These analytic forms provide new algorithms to estimate DShapley that are several orders of magnitude faster than previous state-of-the-art methods. Furthermore, our formulas are directly interpretable and provide quantitative insights into how the value varies for different types of data. We demonstrate the practical efficacy of our approach on multiple real and synthetic datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kwon21a.html
http://proceedings.mlr.press/v130/kwon21a.html Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting We consider off-policy evaluation in the contextual bandit setting for the purpose of obtaining a robust off-policy selection strategy, where the selection strategy is evaluated based on the value of the chosen policy in a set of proposal (target) policies. We propose a new method to compute a lower bound on the value of an arbitrary target policy given some logged data in contextual bandits for a desired coverage. The lower bound is built around the so-called Self-normalized Importance Weighting (SN) estimator. It combines the use of a semi-empirical Efron-Stein tail inequality to control the concentration and Harris’ inequality to control the bias. The new approach is evaluated on a number of synthetic and real datasets and is found to be superior to its main competitors, both in terms of tightness of the confidence intervals and the quality of the policies chosen. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kuzborskij21a.html
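The self-normalized importance weighting (SN) estimator named in the abstract above is, in its standard form, a ratio of weighted sums over the logged data. The sketch below shows only this point estimate, not the paper's lower confidence bound; the function name `sn_value_estimate` and the toy numbers are hypothetical.

```python
import numpy as np

def sn_value_estimate(target_probs, logging_probs, rewards):
    """Self-normalized importance weighting estimate of a target policy's value.

    target_probs[i]  : probability the target policy assigns to the logged action i
    logging_probs[i] : probability the logging policy assigned to that action
    rewards[i]       : observed reward for the logged action
    """
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return np.sum(weights * np.asarray(rewards, dtype=float)) / np.sum(weights)

# Toy logged data with rewards in [0, 1].
target_probs = [0.5, 0.2, 0.9]
logging_probs = [0.5, 0.4, 0.3]
rewards = [1.0, 0.0, 1.0]

print(sn_value_estimate(target_probs, logging_probs, rewards))  # 4 / 4.5
```

On this toy data the plain importance-weighted average is 4/3, which falls outside the reward range, while the self-normalized estimate 4/4.5 stays inside it. Normalizing by the sum of weights introduces some bias in exchange for this kind of stability, which is the bias the abstract says is controlled via Harris' inequality.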
http://proceedings.mlr.press/v130/kuzborskij21a.html Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not only loose, in that it does not capture that EM can make more progress than a gradient step, but the assumptions fail to hold for textbook examples of EM like Gaussian mixtures. In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence. Then, we show how the KL divergence is related to first-order stationarity via Bregman divergences. In contrast to previous works, the analysis is invariant to the choice of parametrization and holds with minimal assumptions. We also show applications of these ideas to local linear (and superlinear) convergence rates, generalized EM, and non-exponential family distributions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kunstner21a.html
http://proceedings.mlr.press/v130/kunstner21a.html Context-Specific Likelihood Weighting Sampling is a popular method for approximate inference when exact inference is impractical. Generally, sampling algorithms do not exploit context-specific independence (CSI) properties of probability distributions. We introduce context-specific likelihood weighting (CS-LW), a new sampling methodology, which besides exploiting the classical conditional independence properties, also exploits CSI properties. Unlike the standard likelihood weighting, CS-LW is based on partial assignments of random variables and requires fewer samples for convergence due to the sampling variance reduction. Furthermore, the speed of generating samples increases. Our novel notion of contextual assignments theoretically justifies CS-LW. We empirically show that CS-LW is competitive with state-of-the-art algorithms for approximate inference in the presence of a significant amount of CSIs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kumar21b.html
http://proceedings.mlr.press/v130/kumar21b.html The Teaching Dimension of Kernel Perceptron Algorithmic machine teaching has been studied under the linear setting where exact teaching is possible. However, little is known for teaching nonlinear learners. Here, we establish the sample complexity of teaching, aka teaching dimension, for kernelized perceptrons for different families of feature maps. As a warm-up, we show that the teaching complexity is $\Theta(d)$ for the exact teaching of linear perceptrons in $\mathbb{R}^d$, and $\Theta(d^k)$ for kernel perceptron with a polynomial kernel of order $k$. Furthermore, under certain smoothness assumptions on the data distribution, we establish a rigorous bound on the complexity for approximately teaching a Gaussian kernel perceptron. We provide numerical examples of the optimal (approximate) teaching set under several canonical settings for linear, polynomial and Gaussian kernel perceptrons. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kumar21a.html
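For readers unfamiliar with the learner being taught, here is a minimal sketch of a kernel perceptron with a polynomial kernel. It shows the learner only, not the teaching-set construction from the paper. The function names, the inhomogeneous form of the kernel, and the XOR data are assumptions chosen for illustration.

```python
def poly_kernel(x, z, degree=2):
    """Inhomogeneous polynomial kernel (1 + <x, z>)^degree."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def decision_score(x, points, labels, alphas, kernel):
    """Kernel expansion of the perceptron's decision function at x."""
    return sum(a * y * kernel(p, x) for a, p, y in zip(alphas, points, labels))


def train_kernel_perceptron(points, labels, kernel, epochs=50):
    """Mistake-driven training; alphas[i] counts mistakes made on example i."""
    alphas = [0] * len(points)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * decision_score(x, points, labels, alphas, kernel) <= 0:
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return alphas


def predict(x, points, labels, alphas, kernel):
    return 1 if decision_score(x, points, labels, alphas, kernel) > 0 else -1


if __name__ == "__main__":
    # XOR is not linearly separable but is separable with a degree-2 kernel.
    points = [(0, 0), (1, 1), (0, 1), (1, 0)]
    labels = [-1, -1, 1, 1]
    alphas = train_kernel_perceptron(points, labels, poly_kernel)
    print([predict(p, points, labels, alphas, poly_kernel) for p in points])
```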
http://proceedings.mlr.press/v130/kumar21a.html Quantifying the Privacy Risks of Learning High-Dimensional Graphical Models Models leak information about their training data. This enables attackers to infer sensitive information about their training sets, notably determine if a data sample was part of the model’s training set. The existing works empirically show the possibility of these membership inference (tracing) attacks against complex deep learning models. However, the attack results are dependent on the specific training data, can be obtained only after the tedious process of training the model and performing the attack, and are missing any measure of the confidence and unused potential power of the attack. In this paper, we theoretically analyze the maximum power of tracing attacks against high-dimensional graphical models, with the focus on Bayesian networks. We provide a tight upper bound on the power (true positive rate) of these attacks, with respect to their error (false positive rate), for a given model structure even before learning its parameters. As it should be, the bound is independent of the knowledge and algorithm of any specific attack. It can help in identifying which model structures leak more information, how adding new parameters to the model increases its privacy risk, and what can be gained by adding new data points to decrease the overall information leakage. It provides a measure of the potential leakage of a model given its structure, as a function of the model complexity and the size of the training set. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kumar-murakonda21a.html
http://proceedings.mlr.press/v130/kumar-murakonda21a.html Tractable contextual bandits beyond realizability Tractable contextual bandit algorithms often rely on the realizability assumption – i.e., that the true expected reward model belongs to a known class, such as linear functions. In this work, we present a tractable bandit algorithm that is not sensitive to the realizability assumption and computationally reduces to solving a constrained regression problem in every epoch. When realizability does not hold, our algorithm ensures the same guarantees on regret achieved by realizability-based algorithms under realizability, up to an additive term that accounts for the misspecification error. This extra term is proportional to T times a function of the mean squared error between the best model in the class and the true model, where T is the total number of time-steps. Our work sheds light on the bias-variance trade-off for tractable contextual bandits. This trade-off is not captured by algorithms that assume realizability, since under this assumption there exists an estimator in the class that attains zero bias. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kumar-krishnamurthy21a.html
http://proceedings.mlr.press/v130/kumar-krishnamurthy21a.html Quick Streaming Algorithms for Maximization of Monotone Submodular Functions in Linear Time We consider the problem of monotone, submodular maximization over a ground set of size $n$ subject to cardinality constraint $k$. For this problem, we introduce the first deterministic algorithms with linear time complexity; these algorithms are streaming algorithms. Our single-pass algorithm obtains a constant ratio in $\lceil n / c \rceil + c$ oracle queries, for any $c \ge 1$. In addition, we propose a deterministic, multi-pass streaming algorithm with a constant number of passes that achieves nearly the optimal ratio with linear query and time complexities. We prove a lower bound that implies no constant-factor approximation exists using $o(n)$ queries, even if queries to infeasible sets are allowed. An empirical analysis demonstrates that our algorithms require fewer queries (often substantially less than $n$) yet still achieve better objective value than the current state-of-the-art algorithms, including single-pass, multi-pass, and non-streaming algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kuhnle21a.html
http://proceedings.mlr.press/v130/kuhnle21a.html Fair for All: Best-effort Fairness Guarantees for Classification Standard approaches to group-based notions of fairness, such as parity and equalized odds, try to equalize absolute measures of performance across known groups (based on race, gender, etc.). Consequently, a group that is inherently harder to classify may hold back the performance on other groups; and no guarantees can be provided for unforeseen groups. Instead, we propose a fairness notion whose guarantee, on each group $g$ in a class $\mathcal{G}$, is relative to the performance of the best classifier on $g$. We apply this notion to broad classes of groups, in particular, where (a) $\mathcal{G}$ consists of all possible groups (subsets) in the data, and (b) $\mathcal{G}$ is more streamlined. For the first setting, which is akin to groups being completely unknown, we devise the PF (Proportional Fairness) classifier, which guarantees, on any possible group $g$, an accuracy that is proportional to that of the optimal classifier for $g$, scaled by the relative size of $g$ in the data set. Due to including all possible groups, some of which could be too complex to be relevant, the worst-case theoretical guarantees here have to be proportionally weaker for smaller subsets. For the second setting, we devise the BeFair (Best-effort Fair) framework which seeks an accuracy, on every $g \in \mathcal{G}$, which approximates that of the optimal classifier on $g$, independent of the size of $g$. Aiming for such a guarantee results in a non-convex problem, and we design novel techniques to get around this difficulty when $\mathcal{G}$ is the set of linear hypotheses. We test our algorithms on real-world data sets, and present interesting comparative insights on their performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/krishnaswamy21a.html
http://proceedings.mlr.press/v130/krishnaswamy21a.html Revisiting Projection-free Online Learning: the Strongly Convex Case Projection-free optimization algorithms, which are mostly based on the classical Frank-Wolfe method, have gained significant interest in the machine learning community in recent years due to their ability to handle convex constraints that are popular in many applications, but for which computing projections is often computationally impractical in high-dimensional settings, and hence prohibit the use of most standard projection-based methods. In particular, significant research effort has been devoted to projection-free methods for online learning. In this paper we revisit the Online Frank-Wolfe (OFW) method suggested by Hazan and Kale (2012) and fill a gap that has been left unnoticed for several years: OFW achieves a faster rate of $O(T^{2/3})$ on strongly convex functions (as opposed to the standard $O(T^{3/4})$ for convex but not strongly convex functions), where $T$ is the sequence length. This is somewhat surprising since it is known that for offline optimization, in general, strong convexity does not lead to faster rates for Frank-Wolfe. We also revisit the bandit setting under strong convexity and prove a similar bound of $\tilde O(T^{2/3})$ (instead of $O(T^{3/4})$ without strong convexity). Hence, in the current state-of-affairs, the best projection-free upper-bounds for the full-information and bandit settings with strongly convex and nonsmooth functions match up to logarithmic factors in $T$. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kretzu21a.html
http://proceedings.mlr.press/v130/kretzu21a.html A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free! Decentralized optimization methods enable on-device training of machine learning models without a central coordinator. In many scenarios communication between devices is energy demanding and time consuming and forms the bottleneck of the entire system. We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators to the communicated messages. By combining our scheme with a new variance reduction technique that progressively throughout the iterations reduces the adverse effect of the injected quantization noise, we obtain a scheme that converges linearly on strongly convex decentralized problems while using compressed communication only. We prove that our method can solve the problems without any increase in the number of communications compared to the baseline which does not perform any communication compression while still allowing for a significant compression factor which depends on the conditioning of the problem and the topology of the network. We confirm our theoretical findings in numerical experiments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kovalev21a.html
http://proceedings.mlr.press/v130/kovalev21a.html Tight Differential Privacy for Discrete-Valued Mechanisms and for the Subsampled Gaussian Mechanism Using FFT We propose a numerical accountant for evaluating the tight (ε,δ)-privacy loss for algorithms with discrete one dimensional output. The method is based on the privacy loss distribution formalism and it uses the recently introduced fast Fourier transform based accounting technique. We carry out an error analysis of the method in terms of moment bounds of the privacy loss distribution which leads to rigorous lower and upper bounds for the true (ε,δ)-values. As an application, we present a novel approach to accurate privacy accounting of the subsampled Gaussian mechanism. This completes the previously proposed analysis by giving strict lower and upper bounds for the privacy parameters. We demonstrate the performance of the accountant on the binomial mechanism and show that our approach allows decreasing noise variance up to 75 percent at equal privacy compared to existing bounds in the literature. We also illustrate how to compute tight bounds for the exponential mechanism applied to counting queries. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/koskela21a.html
http://proceedings.mlr.press/v130/koskela21a.html Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach We propose a novel estimator of the mutual information between two ordinal vectors $x$ and $y$. Our approach is inductive (as opposed to deductive) in that it depends on the data generating distribution solely through some nonparametric properties revealing associations in the data, and does not require having enough data to fully characterize the true joint distributions $P_{x, y}$. Specifically, our approach consists of (i) noting that $I\left(y; x\right) = I\left(u_y; u_x\right)$ where $u_y$ and $u_x$ are the copula-uniform dual representations of $y$ and $x$ (i.e. their images under the probability integral transform), and (ii) estimating the copula entropies $h\left(u_y\right)$, $h\left(u_x\right)$ and $h\left(u_y, u_x\right)$ by solving a maximum-entropy problem over the space of copula densities under a constraint of the type $\alpha_m = E\left[\phi_m(u_y, u_x)\right]$. We prove that, so long as the constraint is feasible, this problem admits a unique solution, it is in the exponential family, and it can be learned by solving a convex optimization problem. The resulting estimator, which we denote MIND, is marginal-invariant, always non-negative, unbounded for any sample size $n$, consistent, has MSE rate $O(1/n)$, and is more data-efficient than competing approaches. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kom-samo21a.html
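Step (i) of the approach above relies on the probability integral transform. An empirical version of that transform can be sketched with ranks, as below. This is only the preprocessing step, not the MIND estimator itself. The function name and the rank / (n + 1) scaling are assumptions made for illustration, and ties are assumed to be absent.

```python
import math


def empirical_copula_transform(values):
    """Map each value to rank / (n + 1), an empirical probability integral transform.

    Assumes no ties. The output depends on the data only through their ordering.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


if __name__ == "__main__":
    x = [3.0, 1.0, 2.0]
    print(empirical_copula_transform(x))
    # A strictly increasing map such as exp leaves the ranks unchanged.
    print(empirical_copula_transform([math.exp(v) for v in x]))
```

Because the transform depends only on ranks, any quantity computed from its output is unchanged by strictly increasing transformations of each marginal, which is the marginal-invariance property mentioned in the abstract.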
http://proceedings.mlr.press/v130/kom-samo21a.html Sharp Analysis of a Simple Model for Random Forests Random forests have become an important tool for improving accuracy in regression and classification problems since their inception by Leo Breiman in 2001. In this paper, we revisit a historically important random forest model, called centered random forests, originally proposed by Breiman in 2004 and later studied by Gérard Biau in 2012, where a feature is selected at random and the split occurs at the midpoint of the node along the chosen feature. If the regression function is $d$-dimensional and Lipschitz, we show that, given access to $n$ observations, the mean-squared prediction error is $O((n(\log n)^{(d-1)/2})^{-\frac{1}{d\log2+1}})$. This positively answers an outstanding question of Biau about whether the rate of convergence for this random forest model could be improved beyond $O(n^{-\frac{1}{d(4/3)\log2+1}})$. Furthermore, by a refined analysis of the approximation and estimation errors for linear models, we show that our new rate cannot be improved in general. Finally, we generalize our analysis and improve current prediction error bounds for another random forest model, called median random forests, in which each tree is constructed from subsampled data and the splits are performed at the empirical median along a chosen feature. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/klusowski21b.html
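The centered splitting rule described above (pick a feature at random, split the node at its midpoint) can be sketched as follows. This is a simplified illustration under stated assumptions: features lie in [0, 1), every tree has the same fixed depth, only the root-to-leaf path of the query point is sampled rather than a full tree, and an empty cell falls back to the global mean. All names are hypothetical.

```python
import random


def centered_tree_predict(x, X, y, depth, rng):
    """Predict at x with one centered tree, sampling only the path that x follows."""
    d = len(x)
    lower, upper = [0.0] * d, [1.0] * d
    for _ in range(depth):
        j = rng.randrange(d)                 # feature chosen uniformly at random
        mid = (lower[j] + upper[j]) / 2.0    # split at the midpoint of the node
        if x[j] < mid:
            upper[j] = mid
        else:
            lower[j] = mid
    # Average the responses of training points that fall in the same cell as x.
    in_cell = [
        yi for xi, yi in zip(X, y)
        if all(lo <= v < hi for v, lo, hi in zip(xi, lower, upper))
    ]
    if not in_cell:                          # assumption: empty cell -> global mean
        return sum(y) / len(y)
    return sum(in_cell) / len(in_cell)


def centered_forest_predict(x, X, y, depth, n_trees=100, seed=0):
    """Average the predictions of n_trees independent centered trees."""
    rng = random.Random(seed)
    preds = [centered_tree_predict(x, X, y, depth, rng) for _ in range(n_trees)]
    return sum(preds) / n_trees


if __name__ == "__main__":
    X = [[0.1], [0.3], [0.7]]
    y = [1.0, 3.0, 10.0]
    print(centered_forest_predict([0.2], X, y, depth=1, n_trees=5))
```

Because the splits never look at the responses, the cell containing a query point depends only on the features drawn along its path, which is why sampling that path alone suffices for a single prediction.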
http://proceedings.mlr.press/v130/klusowski21b.html Nonparametric Variable Screening with Optimal Decision Stumps Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree based variable importance measures, pinning down their theoretical properties has been challenging and therefore largely unexplored. To address this gap between theory and practice, we derive finite sample performance guarantees for variable selection in nonparametric models using a single-level CART decision tree (a decision stump). Under standard operating assumptions in variable screening literature, we find that the marginal signal strength of each variable and ambient dimensionality can be considerably weaker and higher, respectively, than state-of-the-art nonparametric variable selection methods. Furthermore, unlike previous marginal screening methods that estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, thereby eliminating the need for tuning the number of basis terms. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/klusowski21a.html
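The screening idea above can be sketched by fitting one decision stump per variable and ranking variables by the impurity reduction of the best split. The sketch below uses the squared-error (CART regression) criterion. The function names are hypothetical, and the selection threshold and guarantees from the paper are not reproduced.

```python
def stump_impurity_reduction(x, y):
    """Largest decrease in mean squared error achievable by one split on x."""
    n = len(y)
    pairs = sorted(zip(x, y))
    total_sum = sum(y)
    total_sq = sum(v * v for v in y)
    parent_sse = total_sq - total_sum ** 2 / n
    best = 0.0
    left_sum = 0.0
    left_sq = 0.0
    for k in range(1, n):
        xv, yv = pairs[k - 1]
        left_sum += yv
        left_sq += yv * yv
        if pairs[k][0] == xv:
            continue  # cannot split between two equal feature values
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors of the two children around their own means.
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        best = max(best, parent_sse - sse)
    return best / n


def rank_variables(columns, y):
    """Indices of the variables, sorted from most to least important."""
    scores = [stump_impurity_reduction(col, y) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    noise = [0, 1, 0, 1]      # unrelated to y
    signal = [0, 1, 2, 3]     # a split between 1 and 2 explains y exactly
    print(rank_variables([noise, signal], y))
```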
http://proceedings.mlr.press/v130/klusowski21a.html Causal Autoregressive Flows Two apparently unrelated fields — normalizing flows and causality — have recently received considerable attention in the machine learning community. In this work, we highlight an intrinsic correspondence between a simple family of autoregressive normalizing flows and identifiable causal models. We exploit the fact that autoregressive flow architectures define an ordering over variables, analogous to a causal ordering, to show that they are well-suited to performing a range of causal inference tasks, ranging from causal discovery to making interventional and counterfactual predictions. First, we show that causal models derived from both affine and additive autoregressive flows with fixed orderings over variables are identifiable, i.e. the true direction of causal influence can be recovered. This provides a generalization of the additive noise model well-known in causal discovery. Second, we derive a bivariate measure of causal direction based on likelihood ratios, leveraging the fact that flow models can estimate normalized log-densities of data. Third, we demonstrate that flows naturally allow for direct evaluation of both interventional and counterfactual queries, the latter case being possible due to the invertible nature of flows. Finally, throughout a series of experiments on synthetic and real data, the proposed method is shown to outperform current approaches for causal discovery as well as making accurate interventional and counterfactual predictions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/khemakhem21a.html
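To make the role of invertibility concrete, here is a minimal bivariate sketch of an affine autoregressive flow with the fixed ordering x1 then x2, together with a counterfactual query. The functions `s` and `t` stand in for learned scale and shift networks and are hypothetical, as are all other names. The sketch illustrates the mechanism only, not the authors' model or training procedure.

```python
import math


def flow_forward(z1, z2, s, t):
    """Map latent noise (z1, z2) to observations (x1, x2); x2 depends on x1."""
    x1 = z1
    x2 = math.exp(s(x1)) * z2 + t(x1)
    return x1, x2


def flow_inverse(x1, x2, s, t):
    """Recover the latent noise from an observation (always possible: exp > 0)."""
    z1 = x1
    z2 = (x2 - t(x1)) * math.exp(-s(x1))
    return z1, z2


def counterfactual_x2(x1_obs, x2_obs, x1_new, s, t):
    """Value x2 would have taken had x1 been x1_new, keeping the same noise."""
    _, z2 = flow_inverse(x1_obs, x2_obs, s, t)     # abduction: infer the noise
    return math.exp(s(x1_new)) * z2 + t(x1_new)    # action and prediction


if __name__ == "__main__":
    s = lambda x1: 0.0          # zero log-scale: reduces to an additive noise model
    t = lambda x1: 2.0 * x1     # x2 = 2 * x1 + noise
    print(counterfactual_x2(1.0, 2.5, 3.0, s, t))
```

With a zero log-scale the flow reduces to an additive noise model, which matches the abstract's remark that affine flows generalize that model.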
http://proceedings.mlr.press/v130/khemakhem21a.html Self-Supervised Steering Angle Prediction for Vehicle Control Using Visual Odometry Vision-based learning methods for self-driving cars have primarily used supervised approaches that require a large number of labels for training. However, those labels are usually difficult and expensive to obtain. In this paper, we demonstrate how a model can be trained to control a vehicle’s trajectory using camera poses estimated through visual odometry methods in an entirely self-supervised fashion. We propose a scalable framework that leverages trajectory information from several different runs using a camera setup placed at the front of a car. Experimental results on the CARLA simulator demonstrate that our proposed approach performs on par with the model trained with supervision. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/khan21a.html
http://proceedings.mlr.press/v130/khan21a.html Projection-Free Optimization on Uniformly Convex Sets The Frank-Wolfe method solves smooth constrained convex optimization problems at a generic sublinear rate of $\mathcal{O}(1/T)$, and it (or its variants) enjoys accelerated convergence rates for two fundamental classes of constraints: polytopes and strongly-convex sets. Uniformly convex sets non-trivially subsume strongly convex sets and form a large variety of curved convex sets commonly encountered in machine learning and signal processing. For instance, the $\ell_p$-balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show that these sets systematically induce accelerated convergence rates for the original Frank-Wolfe algorithm, which continuously interpolate between known rates. Our accelerated convergence rates emphasize that it is the curvature of the constraint sets – not just their strong convexity – that leads to accelerated convergence rates. These results also importantly highlight that the Frank-Wolfe algorithm is adaptive to much more generic constraint set structures, thus explaining faster empirical convergence. Finally, we also show accelerated convergence rates when the set is only locally uniformly convex around the optima and provide similar results in online linear optimization. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kerdreux21a.html
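For context, the original Frank-Wolfe iteration on a curved constraint set can be sketched on the Euclidean ball (the $\ell_2$-ball, which is strongly convex). The example below uses the standard step size 2 / (t + 2). The function names and the quadratic objective are assumptions for illustration, and the code does not reproduce the paper's analysis for general uniformly convex sets.

```python
import math


def frank_wolfe_l2_ball(grad, x0, radius=1.0, steps=100):
    """Frank-Wolfe over {x : ||x||_2 <= radius}; no projection is ever computed."""
    x = list(x0)
    for t in range(steps):
        g = grad(x)
        norm = math.sqrt(sum(v * v for v in g))
        if norm == 0.0:
            break  # zero gradient: x is already a stationary point
        # Linear minimization oracle: the ball point minimizing <g, s>.
        s = [-radius * v / norm for v in g]
        gamma = 2.0 / (t + 2.0)
        # Convex combination of x and s keeps the iterate inside the ball.
        x = [xi + gamma * (si - xi) for xi, si in zip(x, s)]
    return x


if __name__ == "__main__":
    b = [3.0, 4.0]  # lies outside the unit ball, so the optimum is on the boundary
    grad = lambda x: [xi - bi for xi, bi in zip(x, b)]  # gradient of 0.5 * ||x - b||^2
    print(frank_wolfe_l2_ball(grad, [0.0, 0.0]))
```

Each iteration needs only a linear minimization over the set, which for a ball has a closed form. That is the sense in which the method is projection-free.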
http://proceedings.mlr.press/v130/kerdreux21a.html On the role of data in PAC-Bayes The dominant term in PAC-Bayes bounds is often the Kullback-Leibler divergence between the posterior and prior. For so-called linear PAC-Bayes risk bounds based on the empirical risk of a fixed posterior kernel, it is possible to minimize the expected value of the bound by choosing the prior to be the expected posterior, which we call the oracle prior because it is distribution dependent. In this work, we show that the bound based on the oracle prior can be suboptimal: In some cases, a stronger bound is obtained by using a data-dependent oracle prior, i.e., a conditional expectation of the posterior, given a subset of the training data that is then excluded from the empirical risk term. While using data to learn a prior is a known heuristic, its essential role in optimal bounds is new. In fact, we show that using data can mean the difference between vacuous and nonvacuous bounds. We apply this new principle in the setting of nonconvex learning, simulating data-dependent oracle priors on MNIST and Fashion MNIST with and without held-out data, and demonstrating new nonvacuous bounds in both cases. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/karolina-dziugaite21a.html
http://proceedings.mlr.press/v130/karolina-dziugaite21a.html Does Invariant Risk Minimization Capture Invariance? We show that the Invariant Risk Minimization (IRM) formulation of Arjovsky et al. (2019) can fail to capture "natural" invariances, at least when used in its practical "linear" form, and even on very simple problems which directly follow the motivating examples for IRM. This can lead to worse generalization on new environments, even when compared to unconstrained ERM. The issue stems from a significant gap between the linear variant (as in their concrete method IRMv1) and the full non-linear IRM formulation. Additionally, even when capturing the "right" invariances, we show that it is possible for IRM to learn a sub-optimal predictor, due to the loss function not being invariant across environments. The issues arise even when measuring invariance on the population distributions, but are exacerbated by the fact that IRM is extremely fragile to sampling. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kamath21a.html
http://proceedings.mlr.press/v130/kamath21a.html Graph Gamma Process Linear Dynamical Systems We introduce graph gamma process (GGP) linear dynamical systems to model real-valued multivariate time series. GGP generates $S$ latent states that are shared by $K$ different communities, each of which is characterized by its own pattern of activation probabilities imposed on an $S\times S$ directed sparse graph, and allows both $S$ and $K$ to grow without bound. For temporal pattern discovery, the latent representation under the model is used to decompose the time series into a parsimonious set of multivariate sub-sequences generated by formed communities. In each sub-sequence, different data dimensions often share similar temporal patterns but may exhibit distinct magnitudes, hence allowing the superposition of all sub-sequences to exhibit diverse behaviors at different data dimensions. On both synthetic and real-world time series, the proposed nonparametric Bayesian dynamic models, which are initialized at random, consistently exhibit good predictive performance in comparison to a variety of baseline models, revealing interpretable latent state transition patterns and decomposing the time series into distinctly behaved sub-sequences. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/kalantari21a.html
http://proceedings.mlr.press/v130/kalantari21a.html Abstract Value Iteration for Hierarchical Reinforcement Learning We propose a novel hierarchical reinforcement learning framework for control with continuous state and action spaces. In our framework, the user specifies subgoal regions which are subsets of states; then, we (i) learn options that serve as transitions between these subgoal regions, and (ii) construct a high-level plan in the resulting abstract decision process (ADP). A key challenge is that the ADP may not be Markov; we propose two algorithms for planning in the ADP that address this issue. Our first algorithm is conservative, allowing us to prove theoretical guarantees on its performance, which help inform the design of subgoal regions. Our second algorithm is a practical one that interweaves planning at the abstract level and learning at the concrete level. In our experiments, we demonstrate that our approach outperforms state-of-the-art hierarchical reinforcement learning algorithms on several challenging benchmarks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jothimurugan21a.html
http://proceedings.mlr.press/v130/jothimurugan21a.html On the Consistency of Metric and Non-Metric K-Medoids We establish the consistency of K-medoids in the context of metric spaces. We start by proving that K-medoids is asymptotically equivalent to K-means restricted to the support of the underlying distribution under general conditions, including a wide selection of loss functions. This asymptotic equivalence, in turn, enables us to apply the work of Parna (1986) on the consistency of K-means. This general approach applies also to non-metric settings where only an ordering of the dissimilarities is available. We consider two types of ordinal information: one where all quadruple comparisons are available; and one where only triple comparisons are available. We provide some numerical experiments to illustrate our theory. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jiang21c.html
http://proceedings.mlr.press/v130/jiang21c.html Learning the Truth From Only One Side of the Story Learning under one-sided feedback (i.e., where we only observe the labels for examples we predicted positively on) is a fundamental problem in machine learning – applications include lending and recommendation systems. Despite this, there has been surprisingly little progress made in ways to mitigate the effects of the sampling bias that arises. We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution. We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically. Our method leverages variance estimation techniques to efficiently learn under uncertainty, offering a more principled alternative compared to existing approaches. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jiang21b.html
http://proceedings.mlr.press/v130/jiang21b.html Learning to Defend by Learning to Attack Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training is essentially solving a bilevel optimization problem. The leader problem is trying to learn a robust classifier, while the follower maximization is trying to generate adversarial samples. Unfortunately, such a bilevel problem is difficult to solve due to its highly complicated structure. This work proposes a new adversarial training method based on a generic learning-to-learn (L2L) framework. Specifically, instead of applying existing hand-designed algorithms for the inner problem, we learn an optimizer, which is parametrized as a convolutional neural network. At the same time, a robust classifier is learned to defend against the adversarial attack generated by the learned optimizer. Experiments over CIFAR-10 and CIFAR-100 datasets demonstrate that L2L outperforms existing adversarial training methods in both classification accuracy and computational efficiency. Moreover, our L2L framework can be extended to generative adversarial imitation learning and stabilize the training. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jiang21a.html
http://proceedings.mlr.press/v130/jiang21a.html Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in their Interpretations. While the need for interpretable machine learning has been established, many common approaches are slow, lack fidelity, or are hard to evaluate. Amortized explanation methods reduce the cost of providing interpretations by learning a global selector model that returns feature importances for a single instance of data. The selector model is trained to optimize the fidelity of the interpretations, as evaluated by a predictor model for the target. Popular methods learn the selector and predictor model in concert, which we show allows predictions to be encoded within interpretations. We introduce EVAL-X as a method to quantitatively evaluate interpretations and REAL-X as an amortized explanation method, which learn a predictor model that approximates the true data generating distribution given any subset of the input. We show EVAL-X can detect when predictions are encoded in interpretations and show the advantages of REAL-X through quantitative and radiologist evaluation. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jethani21a.html
http://proceedings.mlr.press/v130/jethani21a.html Scalable Gaussian Process Variational Autoencoders Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jazbec21a.html
http://proceedings.mlr.press/v130/jazbec21a.html Online probabilistic label trees We introduce online probabilistic label trees (OPLTs), an algorithm that trains a label tree classifier in a fully online manner without any prior knowledge about the number of training instances, their features and labels. OPLTs are characterized by low time and space complexity as well as strong theoretical guarantees. They can be used for online multi-label and multi-class classification, including the very challenging scenarios of one- or few-shot learning. We demonstrate the attractiveness of OPLTs in a wide empirical study on several instances of the tasks mentioned above. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jasinska-kobus21a.html
http://proceedings.mlr.press/v130/jasinska-kobus21a.html Improving Classifier Confidence using Lossy Label-Invariant Transformations Providing reliable model uncertainty estimates is imperative to enabling robust decision making by autonomous agents and humans alike. While recently there have been significant advances in confidence calibration for trained models, examples with poor calibration persist in most calibrated models. Consequently, multiple techniques have been proposed that leverage label-invariant transformations of the input (i.e., an input manifold) to improve worst-case confidence calibration. However, manifold-based confidence calibration techniques generally do not scale and/or require expensive retraining when applied to models with large input spaces (e.g., ImageNet). In this paper, we present the recursive lossy label-invariant calibration (ReCal) technique that leverages label-invariant transformations of the input that induce a loss of discriminatory information to recursively group (and calibrate) inputs – without requiring model retraining. We show that ReCal outperforms other calibration methods on multiple datasets, especially on large-scale datasets such as ImageNet. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jang21a.html
http://proceedings.mlr.press/v130/jang21a.html Sampling in Combinatorial Spaces with SurVAE Flow Augmented MCMC Hybrid Monte Carlo (HMC) is a powerful Markov Chain Monte Carlo method for sampling from complex continuous distributions. However, a major limitation of HMC is its inability to be applied to discrete domains due to the lack of a gradient signal. In this work, we introduce a new approach based on augmenting Monte Carlo methods with SurVAE Flow to sample from discrete distributions using a combination of neural transport methods like normalizing flows, variational dequantization, and the Metropolis-Hastings rule. Our method first learns a continuous embedding of the discrete space using a surjective map and subsequently learns a bijective transformation from the continuous space to an approximately Gaussian distributed latent variable. Sampling proceeds by simulating MCMC chains in the latent space and mapping these samples to the target discrete space via the learned transformations. We demonstrate the efficacy of our algorithm on a range of examples from statistics, computational physics, and machine learning, and observe improvements compared to alternative algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jaini21a.html
http://proceedings.mlr.press/v130/jaini21a.html SONIA: A Symmetric Blockwise Truncated Optimization Algorithm This work presents a new optimization algorithm for empirical risk minimization. The algorithm bridges the gap between first- and second-order methods by computing a search direction that uses a second-order-type update in one subspace, coupled with a scaled steepest descent step in the orthogonal complement. To this end, partial curvature information is incorporated to help with ill-conditioning, while simultaneously allowing the algorithm to scale to the large problem dimensions often encountered in machine learning applications. Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases. A stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees. Numerical results confirm the strengths of the new approach on standard machine learning problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/jahani21a.html
http://proceedings.mlr.press/v130/jahani21a.html Approximate Data Deletion from Machine Learning Models Deleting data from a trained machine learning (ML) model is a critical task in many applications. For example, we may want to remove the influence of training points that might be out of date or outliers. Regulations such as the EU’s General Data Protection Regulation also stipulate that individuals can request to have their data deleted. The naive approach to data deletion is to retrain the ML model on the remaining data, but this is too time consuming. In this work, we propose a new approximate deletion method for linear and logistic models whose computational cost is linear in the feature dimension d and independent of the number of training data n. This is a significant gain over all existing methods, which all have superlinear time dependence on the dimension. We also develop a new feature-injection test to evaluate the thoroughness of data deletion from ML models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/izzo21a.html
http://proceedings.mlr.press/v130/izzo21a.html Mean-Variance Analysis in Bayesian Optimization under Uncertainty We consider active learning (AL) in an uncertain environment in which trade-offs between multiple risk measures need to be considered. As an AL problem in such an uncertain environment, we study the Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and variance of investment uncertainty. In this paper, we specifically focus on the BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by a Gaussian process (GP), we derive bounds on the two risk measures and propose an AL algorithm for each of the above three scenarios based on the risk measure bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/iwazaki21a.html
http://proceedings.mlr.press/v130/iwazaki21a.html Improving predictions of Bayesian neural nets via local linearization The generalized Gauss-Newton (GGN) approximation is often used to make practical Bayesian deep learning approaches scalable by replacing a second order derivative with a product of first order derivatives. In this paper we argue that the GGN approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN), which turns the BNN into a generalized linear model (GLM). Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one. We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation. It extends previous results in this vein to general likelihoods and has an equivalent Gaussian process formulation, which enables alternative inference schemes for BNNs in function space. We demonstrate the effectiveness of our approach on several standard classification datasets as well as on out-of-distribution detection. We provide an implementation at https://github.com/AlexImmer/BNN-predictions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/immer21a.html
http://proceedings.mlr.press/v130/immer21a.html Regularized Policies are Reward Robust Entropic regularization of policies in Reinforcement Learning (RL) is a commonly used heuristic to ensure that the learned policy explores the state-space sufficiently before overfitting to a local optimal policy. The primary motivation for using entropy is for exploration and disambiguating optimal policies; however, the theoretical effects are not entirely understood. In this work, we study the more general regularized RL objective and, using Fenchel duality, derive the dual problem, which takes the form of an adversarial reward problem. In particular, we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward. Our result allows us to reinterpret the popular entropic regularization scheme as a form of robustification. Furthermore, due to the generality of our results, we apply them to other existing regularization schemes. Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/husain21a.html
http://proceedings.mlr.press/v130/husain21a.html Learning User Preferences in Non-Stationary Environments Recommendation systems often use online collaborative filtering (CF) algorithms to identify items a given user likes over time, based on ratings that this user and a large number of other users have provided in the past. This problem has been studied extensively when users’ preferences do not change over time (static case), an assumption that is often violated in practical settings. In this paper, we introduce a novel model for online non-stationary recommendation systems which allows for temporal uncertainties in the users’ preferences. For this model, we propose a user-based CF algorithm, and provide a theoretical analysis of its achievable reward. Compared to related non-stationary multi-armed bandit literature, the main fundamental difficulty in our model lies in the fact that variations in the preferences of a certain user may affect the recommendations for other users severely. We also test our algorithm over real-world datasets, showing its effectiveness in real-world applications. One of the main surprising observations in our experiments is the fact that our algorithm outperforms other static algorithms even when preferences do not change over time. This hints toward the general conclusion that in practice, dynamic algorithms, such as the one we propose, might be beneficial even in stationary environments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/huleihel21a.html
http://proceedings.mlr.press/v130/huleihel21a.html Alternating Direction Method of Multipliers for Quantization Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization (ADMM-Q) algorithm, which is a variant of the widely-used ADMM method applied to our discrete optimization problem. We establish the convergence of the iterates of ADMM-Q to certain stationary points. To the best of our knowledge, this is the first analysis of an ADMM-type method for problems with discrete variables/constraints. Based on our theoretical insights, we develop a few variants of ADMM-Q that can handle inexact update rules, and have improved performance via the use of "soft projection" and "injecting randomness to the algorithm". We empirically evaluate the efficacy of our proposed approaches. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/huang21a.html
http://proceedings.mlr.press/v130/huang21a.html Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $\varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $\varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time $\widetilde O(Nd)$. Our results extend recent work on computationally-efficient robust estimation to a more widely applicable incomplete-data setting. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hu21b.html
http://proceedings.mlr.press/v130/hu21b.html Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noise. We establish a lower bound on the L2 estimation error with respect to the GD iteration, which is away from zero without a delicate choice of early stopping. In turn, through a comprehensive analysis of L2-regularized GD trajectories, we prove that for an overparametrized one-hidden-layer ReLU neural network with L2 regularization: (1) the output is close to that of kernel ridge regression with the corresponding neural tangent kernel; (2) the minimax optimal rate of the L2 estimation error is achieved. Numerical experiments confirm our theory and further demonstrate that the L2 regularization approach improves the training robustness and works for a wider range of neural networks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hu21a.html
http://proceedings.mlr.press/v130/hu21a.html On the proliferation of support vectors in high dimensions The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors where all training examples are support vectors. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation, and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hsu21a.html
http://proceedings.mlr.press/v130/hsu21a.html Hyperparameter Transfer Learning with Adaptive Complexity Bayesian optimization (BO) is a data-efficient approach to automatically tune the hyperparameters of machine learning models. In practice, one frequently has to solve similar hyperparameter tuning problems sequentially. For example, one might have to tune a type of neural network learned across a series of different classification problems. Recent work on multi-task BO exploits knowledge gained from previous hyperparameter tuning tasks to speed up a new tuning task. However, previous approaches do not account for the fact that BO is a sequential decision making procedure. Hence, there is in general a mismatch between the number of evaluations collected in the current tuning task compared to the number of evaluations accumulated in all previously completed tasks. In this work, we enable multi-task BO to compensate for this mismatch, such that the transfer learning procedure is able to handle different data regimes in a principled way. We propose a new multi-task BO method that learns a set of ordered, non-linear basis functions of increasing complexity via nested drop-out and automatic relevance determination. Experiments on a variety of hyperparameter tuning problems show that our method improves the sample efficiency of recently published multi-task BO methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/horvath21a.html
http://proceedings.mlr.press/v130/horvath21a.html Bayesian Model Averaging for Causality Estimation and its Approximation based on Gaussian Scale Mixture Distributions In estimation of the causal effect under linear Structural Causal Models (SCMs), it is common practice to first identify the causal structure, estimate the probability distributions, and then calculate the causal effect. However, if the goal is to estimate the causal effect, it is not necessary to fix a single causal structure or probability distributions. In this paper, we first show from a Bayesian perspective that it is Bayes optimal to weight (average) the causal effects estimated under each model rather than estimating a single model. This idea is also known as Bayesian model averaging. Although Bayesian model averaging is optimal, the weighting calculations become computationally hard as the number of candidate models increases. We develop an approximation to the Bayes optimal estimator by using Gaussian scale mixture distributions. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/horii21a.html
http://proceedings.mlr.press/v130/horii21a.html Non-Stationary Off-Policy Optimization Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hong21a.html
http://proceedings.mlr.press/v130/hong21a.html Learning with risk-averse feedback under potentially heavy tails We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses (and gradients) incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not change too fast or slow around just the quantile of interest. With this estimator in hand, we then derive a new learning algorithm which robustly chooses among candidates produced by stochastic gradient-driven sub-processes, obtain excess CVaR bounds, and finally complement the theory with a regression application. Thu, 18 Mar 2021 00:00:00 +0000
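For readers unfamiliar with the quantity being minimized in the abstract above, the following is a minimal sketch of the textbook plug-in estimate of CVaR at level alpha: the average of the worst (1 - alpha) fraction of observed losses. It is not the robust estimator proposed in the paper, which is designed for heavy-tailed losses; the function name `empirical_cvar` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def empirical_cvar(losses, alpha):
    """Naive plug-in CVaR: mean of the worst (1 - alpha) fraction of losses.

    Illustrative only; this simple average is sensitive to outliers and is
    NOT the heavy-tail-robust estimator described in the abstract.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    # Sort from largest to smallest loss so the tail comes first.
    ordered = sorted(losses, reverse=True)
    # Number of samples in the upper tail (at least one).
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

With alpha = 0 the estimate reduces to the ordinary sample mean, and as alpha approaches 1 it approaches the largest observed loss, which is why a single extreme sample can dominate the naive estimate under heavy tails.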
http://proceedings.mlr.press/v130/holland21b.html
http://proceedings.mlr.press/v130/holland21b.html Robustness and scalability under heavy tails, without strong convexity Real-world data is laden with outlying values. The challenge for machine learning is that the learner typically has no prior knowledge of whether the feedback it receives (losses, gradients, etc.) will be heavy-tailed or not. In this work, we study a simple, cost-efficient algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed. The core technique introduces a simple robust validation sub-routine, which is used to boost the confidence of inexpensive gradient-based sub-processes. Compared with recent robust gradient descent methods from the literature, dimension dependence (both risk bounds and cost) is substantially improved, without relying upon strong convexity or expensive per-step robustification. We also empirically show that the proposed procedure cannot simply be replaced with naive cross-validation. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/holland21a.html
http://proceedings.mlr.press/v130/holland21a.html An Adaptive-MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo Hamiltonian Monte Carlo (HMC) is a powerful MCMC algorithm based on simulating Hamiltonian dynamics. Its performance depends strongly on choosing appropriate values for two parameters: the step size used in the simulation, and how long the simulation runs for. The step-size parameter can be tuned using standard adaptive-MCMC strategies, but it is less obvious how to tune the simulation-length parameter. The no-U-turn sampler (NUTS) eliminates this problematic simulation-length parameter, but NUTS’s relatively complex control flow makes it difficult to efficiently run many parallel chains on accelerators such as GPUs. NUTS also spends some extra gradient evaluations relative to HMC in order to decide how long to run each iteration without violating detailed balance. We propose ChEES-HMC, a simple adaptive-MCMC scheme for automatically tuning HMC’s simulation-length parameter, which minimizes a proxy for the autocorrelation of the state’s second moments. We evaluate ChEES-HMC and NUTS on many tasks, and find that ChEES-HMC typically yields larger effective sample sizes per gradient evaluation than NUTS does. When running many chains on a GPU, ChEES-HMC can also run significantly more gradient evaluations per second than NUTS, allowing it to quickly provide accurate estimates of posterior expectations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hoffman21a.html
http://proceedings.mlr.press/v130/hoffman21a.html Learning Contact Dynamics using Physically Structured Neural Networks Learning physically structured representations of dynamical systems that include contact between different objects is an important problem for learning-based approaches in robotics. Black-box neural networks can learn to approximately represent discontinuous dynamics, but they typically require large quantities of data and often suffer from pathological behaviour when forecasting for longer time horizons. In this work, we use connections between deep neural networks and differential equations to design a family of deep network architectures for representing contact dynamics between objects. We show that these networks can learn discontinuous contact events in a data-efficient manner from noisy observations in settings that are traditionally difficult for black-box approaches and recent physics inspired neural networks. Our results indicate that an idealised form of touch feedback—which is heavily relied upon by biological systems—is a key component of making this learning problem tractable. Together with the inductive biases introduced through the network architectures, our techniques enable accurate learning of contact dynamics from observations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hochlehnert21a.html
http://proceedings.mlr.press/v130/hochlehnert21a.html Online Robust Control of Nonlinear Systems with Large Uncertainty Robust control is a core approach for controlling systems with performance guarantees that are robust to modeling error, and is widely used in real-world systems. However, current robust control approaches can only handle small system uncertainty, and thus require significant effort in system identification prior to controller design. We present an online approach that robustly controls a nonlinear system under large model uncertainty. Our approach is based on decomposing the problem into two sub-problems, “robust control design” (which assumes small model uncertainty) and “chasing consistent models”, which can be solved using existing tools from control theory and online learning, respectively. We provide a learning convergence analysis that yields a finite mistake bound on the number of times performance requirements are not met and can provide strong safety guarantees, by bounding the worst-case state deviation. To the best of our knowledge, this is the first approach for online robust control of nonlinear systems with such learning theoretic and safety guarantees. We also show how to instantiate this framework for general robotic systems, demonstrating the practicality of our approach. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ho21a.html
http://proceedings.mlr.press/v130/ho21a.html Shadow Manifold Hamiltonian Monte Carlo Hamiltonian Monte Carlo and its descendants have found success in machine learning and computational statistics due to their ability to draw samples in high dimensions with greater efficiency than classical MCMC. One of these derivatives, Riemannian manifold Hamiltonian Monte Carlo (RMHMC), better adapts the sampler to the geometry of the target density, allowing for improved performances in sampling problems with complex geometric features. Other approaches have boosted acceptance rates by sampling from an integrator-dependent “shadow density” and compensating for the induced bias via importance sampling. We combine the benefits of RMHMC with those attained by sampling from the shadow density, by deriving the shadow Hamiltonian corresponding to the generalized leapfrog integrator used in RMHMC. This leads to a new algorithm, shadow manifold Hamiltonian Monte Carlo, that shows improved performance over RMHMC, and leaves the target density invariant. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/heide21a.html
http://proceedings.mlr.press/v130/heide21a.html Stable ResNet Deep ResNet architectures have achieved state-of-the-art performance on many tasks. While they solve the problem of gradient vanishing, they might suffer from gradient exploding as the depth becomes large (Yang et al. 2017). Moreover, recent results have shown that ResNet might lose expressivity as the depth goes to infinity (Yang et al. 2017, Hayou et al. 2019). To resolve these issues, we introduce a new class of ResNet architectures, called Stable ResNet, that have the property of stabilizing the gradient while ensuring expressivity in the infinite depth limit. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hayou21a.html
http://proceedings.mlr.press/v130/hayou21a.html The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry The Fisher information matrix (FIM) is fundamental to understanding the trainability of deep neural nets (DNN), since it describes the parameter space’s local metric. We investigate the spectral distribution of the conditional FIM, which is the FIM given a single sample, by focusing on fully-connected networks achieving dynamical isometry. Then, while dynamical isometry is known to keep specific backpropagated signals independent of the depth, we find that the parameter space’s local metric linearly depends on the depth even under dynamical isometry. More precisely, we reveal that the conditional FIM’s spectrum concentrates around the maximum and the value grows linearly as the depth increases. To examine the spectrum, considering random initialization and the wide limit, we construct an algebraic methodology based on free probability theory. As a byproduct, we provide an analysis of the solvable spectral distribution in two-hidden-layer cases. Lastly, experimental results verify that the appropriate learning rate for the online training of DNNs is inversely proportional to depth, which is determined by the conditional FIM’s spectrum. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hayase21a.html
http://proceedings.mlr.press/v130/hayase21a.html Learning Partially Known Stochastic Dynamics with Empirical PAC Bayes Neural Stochastic Differential Equations model a dynamical environment with neural nets assigned to their drift and diffusion terms. The high expressive power of their nonlinearity comes at the expense of instability in the identification of the large set of free parameters. This paper presents a recipe to improve the prediction accuracy of such models in three steps: i) accounting for epistemic uncertainty by assuming probabilistic weights, ii) incorporation of partial knowledge on the state dynamics, and iii) training the resultant hybrid model by an objective derived from a PAC-Bayesian generalization bound. We observe in our experiments that this recipe effectively translates partial and noisy prior knowledge into an improved model fit. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/haussmann21a.html
http://proceedings.mlr.press/v130/haussmann21a.html DP-MERF: Differentially Private Mean Embeddings with RandomFeatures for Practical Privacy-preserving Data Generation We propose a differentially private data generation paradigm using random feature representations of kernel mean embeddings when comparing the distribution of true data with that of synthetic data. We exploit the random feature representations for two important benefits. First, we require a minimal privacy cost for training deep generative models. This is because unlike kernel-based distance metrics that require computing the kernel matrix on all pairs of true and synthetic data points, we can detach the data-dependent term from the term solely dependent on synthetic data. Hence, we need to perturb the data-dependent term once and for all and then use it repeatedly during the generator training. Second, we can obtain an analytic sensitivity of the kernel mean embedding as the random features are norm bounded by construction. This removes the necessity of hyper-parameter search for a clipping norm to handle the unknown sensitivity of a generator network. We provide several variants of our algorithm, differentially-private mean embeddings with random features (DP-MERF) to jointly generate labels and input features for datasets such as heterogeneous tabular data and image data. Our algorithm achieves drastically better privacy-utility trade-offs than existing methods when tested on several datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/harder21a.html
http://proceedings.mlr.press/v130/harder21a.html Adaptive Approximate Policy Iteration Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains. However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys an $O(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $O(T^{3/4})$ for the average-reward case with function approximation. Our algorithm and analysis rely on online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hao21b.html
http://proceedings.mlr.press/v130/hao21b.html Online Sparse Reinforcement Learning We investigate the hardness of online reinforcement learning in sparse linear Markov decision processes (MDPs), with a special focus on the high-dimensional regime where the ambient dimension is larger than the number of episodes. Our contribution is two-fold. First, we provide a lower bound showing that linear regret is generally unavoidable, even if there exists a policy that collects well-conditioned data. Second, we show that if the learner has oracle access to a policy that collects well-conditioned data, then a variant of Lasso fitted Q-iteration enjoys a regret of $O(N^{2/3})$ where $N$ is the number of episodes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hao21a.html
http://proceedings.mlr.press/v130/hao21a.html Toward a General Theory of Online Selective Sampling: Trading Off Mistakes and Queries While the literature on the theory of pool-based active learning has seen much progress in the past 15 years, and is now fairly mature, much less is known about its cousin problem: online selective sampling. In the stochastic online learning setting, there is a stream of iid data, the learner is required to predict a label for each instance, and we are interested in the rate of growth of the number of mistakes the learner makes. In the selective sampling variant of this problem, after each prediction, the learner can optionally request to observe the true classification of the point. This introduces a trade-off between the number of these queries and the number of mistakes as a function of the number T of samples in the sequence. This work explores various properties of the optimal trade-off curve, both abstractly (for general VC classes) and more concretely for several constructed examples that expose important properties of the trade-off. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/hanneke21a.html
http://proceedings.mlr.press/v130/hanneke21a.html On the High Accuracy Limitation of Adaptive Property Estimation Recent years have witnessed the success of adaptive (or unified) approaches in estimating symmetric properties of discrete distributions, where the learner first obtains a distribution estimator independent of the target property, and then plugs the estimator into the target property as the final estimator. Several such approaches have been proposed and proved to be adaptively optimal, i.e. they achieve the optimal sample complexity for a large class of properties within a low accuracy, especially for a large estimation error $\varepsilon\gg n^{-1/3}$ where $n$ is the sample size. In this paper, we characterize the high accuracy limitation, or the penalty for adaptation, for general adaptive approaches. Specifically, we obtain the first known adaptation lower bound that under a mild condition, any adaptive approach cannot achieve the optimal sample complexity for every $1$-Lipschitz property within accuracy $\varepsilon \ll n^{-1/3}$. In particular, this result disproves a conjecture in [Acharya et al. 2017] that the profile maximum likelihood (PML) plug-in approach is optimal in property estimation for all ranges of $\varepsilon$, and confirms a conjecture in [Han and Shiragur 2020] that their competitive analysis of the PML is tight. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/han21b.html
http://proceedings.mlr.press/v130/han21b.html Simultaneously Reconciled Quantile Forecasting of Hierarchically Related Time Series Many real-life applications involve simultaneously forecasting multiple time series that are hierarchically related via aggregation or disaggregation operations. For instance, commercial organizations often want to forecast inventories simultaneously at store, city, and state levels for resource planning purposes. In such applications, it is important that the forecasts, in addition to being reasonably accurate, are also consistent w.r.t one another. Although forecasting such hierarchical time series has been pursued by economists and data scientists, the current state-of-the-art models use strong assumptions, e.g., all forecasts being unbiased estimates, noise distribution being Gaussian. Besides, state-of-the-art models have not harnessed the power of modern nonlinear models, especially ones based on deep learning. In this paper, we propose using a flexible nonlinear model that optimizes quantile regression loss coupled with suitable regularization terms to maintain the consistency of forecasts across hierarchies. The theoretical framework introduced herein can be applied to any forecasting model with an underlying differentiable loss function. A proof of optimality of our proposed method is also provided. Simulation studies over a range of datasets highlight the efficacy of our approach. Thu, 18 Mar 2021 00:00:00 +0000
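The quantile regression loss mentioned in the abstract above is commonly written as the pinball loss. The following is a minimal sketch of that loss together with a hypothetical consistency penalty; the names `pinball_loss` and `coherence_penalty`, and the squared-gap form of the penalty, are illustrative assumptions and are not taken from the paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss at level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-prediction (diff >= 0) costs q per unit,
        # over-prediction costs (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def coherence_penalty(parent_pred, child_preds):
    """Hypothetical regularizer: squared gap between a parent forecast
    and the sum of its children (an assumption, not the paper's term)."""
    return (parent_pred - sum(child_preds)) ** 2
```

At a high quantile level such as q = 0.9, under-predicting by two units costs far more than over-predicting by two units, which is what pushes the fitted value toward the upper quantile.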
http://proceedings.mlr.press/v130/han21a.html
http://proceedings.mlr.press/v130/han21a.html Federated Learning with Compression: Unified Analysis and Sharp Guarantees In federated learning, communication cost is often a critical bottleneck to scale up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable trends to deal with the communication overhead of federated algorithms are gradient compression and local computation with periodic communication. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodical compressed (quantized or sparsified) communication and analyze their convergence properties in both homogeneous and heterogeneous local data distributions settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both strongly convex and non-convex objective functions. To mitigate data heterogeneity, we introduce a local gradient tracking scheme and obtain sharp convergence rates that match the best-known communication complexities without compression for convex, strongly convex, and nonconvex settings. We complement our theoretical results by demonstrating the effectiveness of our proposed methods on real-world datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/haddadpour21a.html
http://proceedings.mlr.press/v130/haddadpour21a.html Fractional moment-preserving initialization schemes for training deep neural networks A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly for preserving the variance of pre-activations. On the other hand, several studies show that during the training process, the distribution of stochastic gradients can be heavy-tailed especially for small batch sizes. In this case, weights and therefore pre-activations can be modeled with a heavy-tailed distribution that has an infinite variance but has a finite (non-integer) fractional moment of order $s$ with $s < 2$. Motivated by this fact, we develop initialization schemes for fully connected feed-forward networks that can provably preserve any given moment of order $s\in (0,2]$ over the layers for a class of activations including ReLU, Leaky ReLU, Randomized Leaky ReLU, and linear activations. These generalized schemes recover traditional initialization schemes in the limit $s \to 2$ and serve as part of a principled theory for initialization. For all these schemes, we show that the network output admits a finite almost sure limit as the number of layers grows, and the limit is heavy-tailed in some settings. This sheds further light on the origins of heavy tails during signal propagation in DNNs. We also prove that the logarithm of the norm of the network outputs, if properly scaled, will converge to a Gaussian distribution with an explicit mean and variance we can compute depending on the activation used, the value of $s$ chosen and the network width, where log-normality serves as a further justification of why the norm of the network output can be heavy-tailed in DNNs. We also prove that our initialization scheme avoids small network output values more frequently compared to traditional approaches. 
Our results extend if dropout is used and the proposed initialization strategy does not have an extra cost during the training procedure. We show through numerical experiments that our initialization can improve the training and test performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gurbuzbalaban21a.html
http://proceedings.mlr.press/v130/gurbuzbalaban21a.html Learning Temporal Point Processes with Intermittent Observations Marked temporal point processes (MTPP) have emerged as a powerful framework to model the underlying generative mechanism of asynchronous events localized in continuous time. Most existing models and inference methods in the MTPP framework consider only the complete observation scenario, i.e., the event sequence being modeled is completely observed with no missing events – an ideal setting barely encountered in practice. A recent line of work which considers missing events uses supervised learning techniques which require a missing or observed label for each event. In this work, we provide a novel unsupervised model and inference method for MTPPs in the presence of missing events. We first model the generative processes of observed events and missing events using two MTPPs, where the missing events are represented as latent random variables. Then we devise an unsupervised training method that jointly learns both the MTPPs by means of variational inference. Experiments with real datasets show that our modeling and inference frameworks can effectively impute the missing data among the observed events, which in turn enhances its predictive prowess. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gupta21a.html
http://proceedings.mlr.press/v130/gupta21a.html Minimal enumeration of all possible total effects in a Markov equivalence class In observational studies, when a total causal effect of interest is not identified, the set of all possible effects can be reported instead. This typically occurs when the underlying causal DAG is only known up to a Markov equivalence class, or a refinement thereof due to background knowledge. As such, the class of possible causal DAGs is represented by a maximally oriented partially directed acyclic graph (MPDAG), which contains both directed and undirected edges. We characterize the minimal additional edge orientations required to identify a given total effect. A recursive algorithm is then developed to enumerate subclasses of DAGs, such that the total effect in each subclass is identified as a distinct functional of the observed distribution. This resolves an issue with existing methods, which often report possible total effects with duplicates, namely those that are numerically distinct due to sampling variability but are in fact causally identical. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/guo21c.html
http://proceedings.mlr.press/v130/guo21c.html Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings Cycle-consistent training is widely used for jointly learning a forward and inverse mapping between two domains of interest without the cumbersome requirement of collecting matched pairs within each domain. In this regard, the implicit assumption is that there exists (at least approximately) a ground-truth bijection such that a given input from either domain can be accurately reconstructed from successive application of the respective mappings. But in many applications no such bijection can be expected to exist and large reconstruction errors can compromise the success of cycle-consistent training. As one important instance of this limitation, we consider practically-relevant situations where there exists a many-to-one or surjective mapping between domains. To address this regime, we develop a conditional variational autoencoder (CVAE) approach that can be viewed as converting surjective mappings to implicit bijections whereby reconstruction errors in both directions can be minimized, and as a natural byproduct, realistic output diversity can be obtained in the one-to-many direction. As theoretical motivation, we analyze a simplified scenario whereby minima of the proposed CVAE-based energy function align with the recovery of ground-truth surjective mappings. On the empirical side, we consider a synthetic image dataset with known ground-truth, as well as a real-world application involving natural language generation from knowledge graphs and vice versa, a prototypical surjective case. For the latter, our CVAE pipeline can capture such many-to-one mappings during cycle training while promoting textural diversity for graph-to-text tasks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/guo21b.html
http://proceedings.mlr.press/v130/guo21b.html Consistent k-Median: Simpler, Better and Robust In this paper we introduce and study the online consistent k-clustering with outliers problem, generalizing the non-outlier version of the problem studied in Lattanzi-Vassilvitskii [18]. We show that a simple local-search based online algorithm can give a bicriteria constant approximation for the problem with O(k^2 log^2(nD)) swaps of medians (recourse) in total, where D is the diameter of the metric. When restricted to the problem without outliers, our algorithm is simpler, deterministic and gives a better approximation ratio and recourse, compared to that of Lattanzi-Vassilvitskii [18]. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/guo21a.html
http://proceedings.mlr.press/v130/guo21a.html Latent variable modeling with random features Gaussian process-based latent variable models are flexible and theoretically grounded tools for nonlinear dimension reduction, but generalizing to non-Gaussian data likelihoods within this nonlinear framework is statistically challenging. Here, we use random features to develop a family of nonlinear dimension reduction models that are easily extensible to non-Gaussian data likelihoods; we call these random feature latent variable models (RFLVMs). By approximating a nonlinear relationship between the latent space and the observations with a function that is linear with respect to random features, we induce closed-form gradients of the posterior distribution with respect to the latent variable. This allows the RFLVM framework to support computationally tractable nonlinear latent variable models for a variety of data likelihoods in the exponential family without specialized derivations. Our generalized RFLVMs produce results comparable with other state-of-the-art dimension reduction methods on diverse types of data, including neural spike train recordings, images, and text data. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gundersen21a.html
http://proceedings.mlr.press/v130/gundersen21a.html Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent We present a direct (primal only) derivation of Mirror Descent as a “partial” discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential function. We contrast this discretization to Natural Gradient Descent, which is obtained by a “full” forward Euler discretization. This view helps shed light on the relationship between the methods and allows generalizing Mirror Descent to any Riemannian geometry in $\mathbb{R}^d$, even when the metric tensor is not a Hessian, and thus there is no “dual.” Thu, 18 Mar 2021 00:00:00 +0000
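To make the contrast in the abstract above concrete, here is a minimal sketch that runs the standard mirror descent update and the natural gradient update side by side on the positive orthant. It assumes the potential $\phi(x) = \sum_i (x_i \log x_i - x_i)$, whose Hessian is $\mathrm{diag}(1/x_i)$, and a toy quadratic objective; the potential, the objective, and all function names are illustrative choices and do not come from the paper.

```python
import math


def grad_f(x, target):
    """Gradient of the toy objective f(x) = 0.5 * sum((x_i - target_i)^2)."""
    return [xi - ti for xi, ti in zip(x, target)]


def mirror_descent_step(x, g, lr):
    # Step in the dual variable grad phi(x) = log(x), then map back:
    # log(x_new) = log(x) - lr * g, i.e. x_new = x * exp(-lr * g).
    return [xi * math.exp(-lr * gi) for xi, gi in zip(x, g)]


def natural_gradient_step(x, g, lr):
    # Forward Euler on the flow x' = -H(x)^{-1} g with H(x) = diag(1 / x_i),
    # so H(x)^{-1} g is the elementwise product x * g.
    return [xi - lr * xi * gi for xi, gi in zip(x, g)]


def run(step, x0, target, lr=0.1, iters=500):
    """Iterate the given update rule from x0 on the toy objective."""
    x = list(x0)
    for _ in range(iters):
        x = step(x, grad_f(x, target), lr)
    return x
```

Both updates follow the same Riemannian gradient flow to first order in the step size, but the multiplicative mirror descent step keeps the iterate positive for any step size, while the natural gradient step does so only when the step is small enough.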
http://proceedings.mlr.press/v130/gunasekar21a.html
http://proceedings.mlr.press/v130/gunasekar21a.html A Study of Condition Numbers for First-Order Optimization In this work we introduce a new framework for the theoretical study of convergence and tuning of first-order optimization algorithms (FOA). The study of such algorithms typically requires assumptions on the objective functions: the most popular ones are probably smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy. Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics which we prove to be continuous. We describe their mutual relations and provide their guaranteed convergence rates for the Gradient Descent algorithm accordingly tuned. Finally we discuss how our work impacts the theoretical understanding of FOA and their performances. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/guille-escuret21a.html
http://proceedings.mlr.press/v130/guille-escuret21a.html When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence Generative adversarial imitation learning (GAIL) is a popular inverse reinforcement learning approach for jointly optimizing policy and reward from expert trajectories. A primary question about GAIL is whether applying a certain policy gradient algorithm to GAIL attains a global minimizer (i.e., yields the expert policy), for which existing understanding is very limited. Such global convergence has been shown only for the linear (or linear-type) MDP and linear (or linearizable) reward. In this paper, we study GAIL under general MDP and for nonlinear reward function classes (as long as the objective function is strongly concave with respect to the reward parameter). We characterize the global convergence with a sublinear rate for a broad range of commonly used policy gradient algorithms, all of which are implemented in an alternating manner with stochastic gradient ascent for reward update, including projected policy gradient (PPG)-GAIL, Frank-Wolfe policy gradient (FWPG)-GAIL, trust region policy optimization (TRPO)-GAIL and natural policy gradient (NPG)-GAIL. This is the first systematic theoretical study of GAIL for global convergence. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/guan21a.html
http://proceedings.mlr.press/v130/guan21a.html High-Dimensional Feature Selection for Sample Efficient Treatment Effect Estimation The estimation of causal treatment effects from observational data is a fundamental problem in causal inference. To avoid bias, the effect estimator must control for all confounders. Hence practitioners often collect data for as many covariates as possible to raise the chances of including the relevant confounders. While this addresses the bias, it has the side effect of significantly increasing the number of data samples required to accurately estimate the effect due to the increased dimensionality. In this work, we consider the setting where out of a large number of covariates $X$ that satisfy strong ignorability, an unknown sparse subset $S$ is sufficient to include to achieve zero bias, i.e. $c$-equivalent to $X$. We propose a common objective function involving outcomes across treatment cohorts with nonconvex joint sparsity regularization that is guaranteed to recover $S$ with high probability under a linear outcome model for $Y$ and subgaussian covariates for each of the treatment cohorts. This improves the effect estimation sample complexity so that it scales with the cardinality of the sparse subset $S$ and $\log |X|$, as opposed to the cardinality of the full set $X$. We validate our approach with experiments on treatment effect estimation. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/greenewald21a.html
http://proceedings.mlr.press/v130/greenewald21a.html Minimax Optimal Regression over Sobolev Spaces via Laplacian Regularization on Neighborhood Graphs In this paper we study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression. Under standard regularity conditions, we establish upper bounds on the error of the Laplacian smoothing estimator $\widehat{f}$, and a goodness-of-fit test also based on $\widehat{f}$. These upper bounds match the minimax optimal estimation and testing rates of convergence over the first-order Sobolev class $H^1(\mathcal{X})$, for $\mathcal{X} \subseteq \mathbb{R}^d$ and $1 \leq d < 4$; in the estimation problem, for $d = 4$, they are optimal modulo a $\log n$ factor. Additionally, we prove that Laplacian smoothing is manifold-adaptive: if $\mathcal{X} \subseteq \mathbb{R}^d$ is an $m$-dimensional manifold with $m < d$, then the error rate of Laplacian smoothing (in either estimation or testing) depends only on $m$, in the same way it would if $\mathcal{X}$ were a full-dimensional set in $\mathbb{R}^m$. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/green21a.html
http://proceedings.mlr.press/v130/green21a.html Convergence Properties of Stochastic Hypergradients Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. The method that we propose is a stochastic variant of the approximate implicit differentiation approach in (Pedregosa, 2016). We provide bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. In particular, our main bound is agnostic to the choice of the two stochastic solvers employed by the procedure. We provide numerical experiments to support our theoretical analysis and to show the advantage of using stochastic hypergradients in practice. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/grazzi21a.html
http://proceedings.mlr.press/v130/grazzi21a.html SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation Stochastic Gradient Descent (SGD) is being used routinely for optimizing non-convex functions. Yet, the standard convergence theory for SGD in the smooth non-convex setting gives a slow sublinear convergence to a stationary point. In this work, we provide several convergence theorems for SGD showing convergence to a global minimum for non-convex problems satisfying some extra structural assumptions. In particular, we focus on two large classes of structured non-convex functions: (i) Quasar (Strongly) Convex functions (a generalization of convex functions) and (ii) functions satisfying the Polyak-Łojasiewicz condition (a generalization of strongly-convex functions). Our analysis relies on an Expected Residual condition which we show is a strictly weaker assumption than previously used growth conditions, expected smoothness or bounded variance assumptions. We provide theoretical guarantees for the convergence of SGD for different step-size selections including constant, decreasing and the recently proposed stochastic Polyak step-size. In addition, all of our analysis holds for the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching and determine an optimal minibatch size. Finally, we show that for models that interpolate the training data, we can dispense with our Expected Residual condition and give state-of-the-art results in this setting. Thu, 18 Mar 2021 00:00:00 +0000
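As a rough illustration of the stochastic Polyak step-size in the interpolation regime mentioned above, the sketch below applies a capped step $\gamma = \min\{(f_i(x) - f_i^*) / (c\,\|\nabla f_i(x)\|^2),\ \gamma_{\max}\}$ to a consistent least-squares problem, where every per-sample minimum $f_i^*$ is zero. The problem instance, the constants `c` and `gamma_max`, and the function name `sps_sgd` are assumptions made for this example, not settings from the paper.

```python
import random


def sps_sgd(rows, targets, steps=200, c=0.5, gamma_max=10.0, seed=0):
    """SGD with a capped stochastic Polyak step on f_i(x) = 0.5 * (a_i . x - b_i)^2.

    Assumes the linear system is consistent, so each f_i attains the value 0.
    """
    rng = random.Random(seed)
    x = [0.0] * len(rows[0])
    for _ in range(steps):
        i = rng.randrange(len(rows))
        a, b = rows[i], targets[i]
        residual = sum(aj * xj for aj, xj in zip(a, x)) - b
        loss = 0.5 * residual ** 2  # f_i(x) - f_i^*, with f_i^* = 0
        grad = [residual * aj for aj in a]
        grad_sq = sum(g * g for g in grad)
        if grad_sq == 0.0:
            continue  # sample i is already fit exactly; nothing to update
        gamma = min(loss / (c * grad_sq), gamma_max)
        x = [xj - gamma * g for xj, g in zip(x, grad)]
    return x
```

With `c = 0.5` on this loss the uncapped step equals $1/\|a_i\|^2$, so each update projects the iterate onto the hyperplane $a_i \cdot x = b_i$; no learning rate has to be tuned by hand.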
http://proceedings.mlr.press/v130/gower21a.html
http://proceedings.mlr.press/v130/gower21a.html Nested Barycentric Coordinate System as an Explicit Feature Map We introduce a new embedding technique based on a barycentric coordinate system. We show that our embedding can be used to transform the problem of polytope approximation into that of finding a linear classifier in a higher (but nevertheless quite sparse) dimensional representation. This embedding in effect maps a piecewise linear function into a single linear function, and allows us to invoke well-known algorithms for the latter problem to solve the former. We demonstrate that our embedding has applications to the problems of approximating separating polytopes – in fact, it can approximate any convex body and multiple convex bodies – as well as to classification by separating polytopes and piecewise linear regression. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gottlieb21a.html
http://proceedings.mlr.press/v130/gottlieb21a.html Local SGD: Unified Theory and New Efficient Methods We present a unified framework for analyzing local SGD methods in the convex and strongly convex regimes for distributed/federated training of supervised machine learning models. We recover several known methods as a special case of our general framework, including Local SGD/FedAvg, SCAFFOLD, and several variants of SGD not originally designed for federated learning. Our framework covers both the identical and heterogeneous data settings, supports both random and deterministic number of local steps, and can work with a wide array of local stochastic gradient estimators, including shifted estimators which are able to adjust the fixed points of local iterations for faster convergence. As an application of our framework, we develop multiple novel FL optimizers which are superior to existing methods. In particular, we develop the first linearly converging local SGD method which does not require any data homogeneity or other strong assumptions. Thu, 18 Mar 2021 00:00:00 +0000
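The communication pattern shared by Local SGD/FedAvg-style methods, several local steps on each worker followed by averaging, can be sketched as follows. This is a toy illustration under strong assumptions: each worker `m` holds the scalar objective $f_m(x) = \tfrac{1}{2}(x - c_m)^2$, gradient noise is optional Gaussian noise, and the name `local_sgd` and all parameters are hypothetical. It is not one of the new methods developed in the paper.

```python
import random


def local_sgd(centers, rounds=50, local_steps=5, lr=0.1, noise_std=0.0, seed=0):
    """Local SGD on worker objectives f_m(x) = 0.5 * (x - centers[m])^2.

    Each round, every worker starts from the shared model, takes
    `local_steps` (optionally noisy) gradient steps on its own objective,
    and the server then averages the resulting local models.
    """
    rng = random.Random(seed)
    x = 0.0  # shared model
    for _ in range(rounds):
        local_models = []
        for c in centers:
            x_local = x
            for _ in range(local_steps):
                grad = (x_local - c) + rng.gauss(0.0, noise_std)
                x_local -= lr * grad
            local_models.append(x_local)
        # Communication step: average the local models.
        x = sum(local_models) / len(local_models)
    return x
```

In this toy case all workers share the same curvature, so averaging recovers the global minimizer (the mean of `centers`) even though the workers' data differ; with differing curvatures, plain local steps generally drift, which is the kind of effect that corrections such as shifted gradient estimators aim to address.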
http://proceedings.mlr.press/v130/gorbunov21a.html
http://proceedings.mlr.press/v130/gorbunov21a.html Variational Selective Autoencoder: Learning from Partially-Observed Heterogeneous Data Learning from heterogeneous data poses challenges such as combining data from various sources and of different types. Meanwhile, heterogeneous data are often associated with missingness in real-world applications due to heterogeneity and noise of input sources. In this work, we propose the variational selective autoencoder (VSAE), a general framework to learn representations from partially-observed heterogeneous data. VSAE learns the latent dependencies in heterogeneous data by modeling the joint distribution of observed data, unobserved data, and the imputation mask which represents how the data are missing. It results in a unified model for various downstream tasks including data generation and imputation. Evaluation on both low-dimensional and high-dimensional heterogeneous datasets for these two tasks shows improvement over state-of-the-art models. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gong21a.html
http://proceedings.mlr.press/v130/gong21a.html Learning Smooth and Fair Representations This paper explores the statistical properties of fair representation learning, a pre-processing method that preemptively removes the correlations between features and sensitive attributes by mapping features to a fair representation space. We show that the demographic parity of a representation can be certified from a finite sample if and only if the mapping guarantees that the chi-squared mutual information between features and representations is finite for distributions of the features. Empirically, we find that smoothing representations with an additive Gaussian white noise provides generalization guarantees of fairness certificates, which improves upon existing fair representation learning approaches. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gitiaux21a.html
http://proceedings.mlr.press/v130/gitiaux21a.html Shuffled Model of Differential Privacy in Federated Learning We consider a distributed empirical risk minimization (ERM) optimization problem with communication efficiency and privacy requirements, motivated by the federated learning (FL) framework. We propose a distributed communication-efficient and local differentially private stochastic gradient descent (CLDP-SGD) algorithm and analyze its communication, privacy, and convergence trade-offs. Since each iteration of the CLDP-SGD aggregates the client-side local gradients, we develop (optimal) communication-efficient schemes for mean estimation for several $\ell_p$ spaces under local differential privacy (LDP). To overcome the performance limitations of LDP, CLDP-SGD takes advantage of the inherent privacy amplification provided by client subsampling and data subsampling at each selected client (through SGD) as well as the recently developed shuffled model of privacy. For convex loss functions, we prove that the proposed CLDP-SGD algorithm matches the known lower bounds on the centralized private ERM while using a finite number of bits per iteration for each client, i.e., effectively getting communication efficiency for “free”. We also provide preliminary experimental results supporting the theory. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/girgis21a.html
http://proceedings.mlr.press/v130/girgis21a.html Competing AI: How does competition feedback affect machine learning? This paper studies how competition affects machine learning (ML) predictors. As ML becomes more ubiquitous, it is often deployed by companies to compete over customers. For example, digital platforms like Yelp use ML to predict user preference and make recommendations. A service that is more often queried by users, perhaps because it more accurately anticipates user preferences, is also more likely to obtain additional user data (e.g. in the form of a Yelp review). Thus, competing predictors cause feedback loops whereby a predictor’s performance impacts what training data it receives and biases its predictions over time. We introduce a flexible model of competing ML predictors that enables both rapid experimentation and theoretical tractability. We show with empirical and mathematical analysis that competition causes predictors to specialize for specific sub-populations at the cost of worse performance over the general population. We further analyze the impact of predictor specialization on the overall prediction quality experienced by users. We show that having too few or too many competing predictors in a market can hurt the overall prediction quality. Our theory is complemented by experiments on several real datasets using popular learning algorithms, such as neural networks and nearest neighbor methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ginart21a.html
http://proceedings.mlr.press/v130/ginart21a.html A Limited-Capacity Minimax Theorem for Non-Convex Games or: How I Learned to Stop Worrying about Mixed-Nash and Love Neural Nets Adversarial training, a special case of multi-objective optimization, is an increasingly prevalent machine learning technique: some of its most notable applications include GAN-based generative modeling and self-play techniques in reinforcement learning which have been applied to complex games such as Go or Poker. In practice, a single pair of networks is typically trained in order to find an approximate equilibrium of a highly nonconcave-nonconvex adversarial problem. However, while a classic result in game theory states such an equilibrium exists in concave-convex games, there is no analogous guarantee if the payoff is nonconcave-nonconvex. Our main contribution is to provide an approximate minimax theorem for a large class of games where the players pick neural networks including WGAN, StarCraft II and Blotto Game. Our findings rely on the fact that despite being nonconcave-nonconvex with respect to the neural network parameters, these games are concave-convex with respect to the actual models (e.g., functions or distributions) represented by these neural networks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gidel21a.html
http://proceedings.mlr.press/v130/gidel21a.html Variational inference for nonlinear ordinary differential equations We apply the reparameterisation trick to obtain a variational formulation of Bayesian inference in nonlinear ODE models. By invoking the linear noise approximation we also extend this variational formulation to a stochastic kinetic model. Our proposed inference method does not depend on any emulation of the ODE solution and only requires the extension of automatic differentiation to an ODE. We achieve this through a novel and holistic approach that uses both forward and adjoint sensitivity analysis techniques. Consequently, this approach can cater to both small and large ODE models efficiently. Upon benchmarking on some widely used mechanistic models, the proposed inference method produced a reliable approximation to the posterior distribution, with a significant reduction in execution time, in comparison to MCMC. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ghosh21b.html
http://proceedings.mlr.press/v130/ghosh21b.html Problem-Complexity Adaptive Model Selection for Stochastic Linear Bandits We consider the problem of model selection for two popular stochastic linear bandit settings, and propose algorithms that adapt to the unknown problem complexity. In the first setting, we consider the $K$-armed mixture bandits, where the mean reward of arm $i \in [K]$ is $\mu_i+ \langle \alpha_{i,t},\theta^* \rangle$, with $\alpha_{i,t} \in \mathbb{R}^d$ being the known context vector and $\mu_i \in [-1,1]$ and $\theta^*$ are unknown parameters. We define $\|\theta^*\|$ as the problem complexity and consider a sequence of nested hypothesis classes, each positing a different upper bound on $\|\theta^*\|$. Exploiting this, we propose Adaptive Linear Bandit (ALB), a novel phase-based algorithm that adapts to the true problem complexity, $\|\theta^*\|$. We show that ALB achieves regret scaling of $\widetilde{O}(\|\theta^*\|\sqrt{T})$, where $\|\theta^*\|$ is a priori unknown. As a corollary, when $\theta^*=0$, ALB recovers the minimax regret for the simple bandit algorithm without such knowledge of $\theta^*$. ALB is the first algorithm that uses the parameter norm as a model selection criterion for linear bandits. Prior state-of-the-art algorithms achieve a regret of $\widetilde{O}(L\sqrt{T})$, where $L$ is the upper bound on $\|\theta^*\|$, fed as an input to the problem. In the second setting, we consider the standard linear bandit problem (with possibly an infinite number of arms) where the sparsity of $\theta^*$, denoted by $d^* \leq d$, is unknown to the algorithm. Defining $d^*$ as the problem complexity (similar to Foster et al. ’19), we show that ALB achieves $\widetilde{O}(d^*\sqrt{T})$ regret, matching that of an oracle who knew the true sparsity level. This methodology is then extended to the case of finitely many arms and similar results are proven. 
We further verify through synthetic and real-data experiments that the performance gains are fundamental and not artifacts of mathematical bounds. In particular, we show $1.5-3$x drop in cumulative regret over non-adaptive algorithms. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ghosh21a.html
http://proceedings.mlr.press/v130/ghosh21a.html Graph Community Detection from Coarse Measurements: Recovery Conditions for the Coarsened Weighted Stochastic Block Model We study the problem of community recovery from coarse measurements of a graph. In contrast to the problem of community recovery of a fully observed graph, one often encounters situations when measurements of a graph are made at low-resolution, each measurement integrating across multiple graph nodes. Such low-resolution measurements effectively induce a coarse graph with its own communities. Our objective is to develop conditions on the graph structure, the quantity, and properties of measurements, under which we can recover the community organization in this coarse graph. In this paper, we build on the stochastic block model by mathematically formalizing the coarsening process, and characterizing its impact on the community members and connections. Accordingly, we characterize an error bound for community recovery. The error bound yields simple and closed-form asymptotic conditions to achieve the perfect recovery of the coarse graph communities. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ghoroghchian21a.html
http://proceedings.mlr.press/v130/ghoroghchian21a.html Robust and Private Learning of Halfspaces In this work, we study the trade-off between differential privacy and adversarial robustness under $L_2$-perturbations in the context of learning halfspaces. We prove nearly tight bounds on the sample complexity of robust private learning of halfspaces for a large regime of parameters. A highlight of our results is that robust and private learning is harder than robust or private learning alone. We complement our theoretical analysis with experimental results on the MNIST and USPS datasets, for a learning algorithm that is both differentially private and adversarially robust. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ghazi21a.html
http://proceedings.mlr.press/v130/ghazi21a.html Deep Generative Missingness Pattern-Set Mixture Models We propose a variational autoencoder architecture to model both ignorable and nonignorable missing data using pattern-set mixtures as proposed by Little (1993). Our model explicitly learns to cluster the missing data into missingness pattern sets based on the observed data and missingness masks. Underpinning our approach is the assumption that the data distribution under missingness is probabilistically semi-supervised by samples from the observed data distribution. Our setup trades off the characteristics of ignorable and nonignorable missingness and can thus be applied to data of both types. We evaluate our method on a wide range of data sets with different types of missingness and achieve state-of-the-art imputation performance. Our model outperforms many common imputation algorithms, especially when the amount of missing data is high and the missingness mechanism is nonignorable. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ghalebikesabi21a.html
http://proceedings.mlr.press/v130/ghalebikesabi21a.html Reinforcement Learning for Constrained Markov Decision Processes In this paper, we consider the problem of optimization and learning for constrained and multi-objective Markov decision processes, for both discounted rewards and expected average rewards. We formulate the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem, which we here call Markov-Bandit games. We extend $Q$-learning to solve Markov-Bandit games and show that our new $Q$-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the constrained and multi-objective Markov decision problems. We provide numerical examples where we calculate the optimal policies and show by simulations that the algorithm converges to the calculated optimal policies. To the best of our knowledge, this is the first time Q-learning algorithms guarantee convergence to optimal stationary policies for the multi-objective Reinforcement Learning problem with discounted and expected average rewards, respectively. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gattami21a.html
http://proceedings.mlr.press/v130/gattami21a.html Learn to Expect the Unexpected: Probably Approximately Correct Domain Generalization Domain generalization is the problem of machine learning when the training data and the test data come from different “domains” (data distributions). We propose an elementary theoretical model of the domain generalization problem, introducing the concept of a meta-distribution over domains. In our model, the training data available to a learning algorithm consist of multiple datasets, each from a single domain, drawn in turn from the meta-distribution. We show that our model can capture a rich range of learning phenomena specific to domain generalization for three different settings: learning with Massart noise, learning decision trees, and feature selection. We demonstrate approaches that leverage domain generalization to reduce computational or data requirements in each of these settings. Experiments demonstrate that our feature selection algorithm indeed ignores spurious correlations and improves generalization. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/garg21a.html
http://proceedings.mlr.press/v130/garg21a.html Neural Enhanced Belief Propagation on Factor Graphs A graphical model is a structured representation of locally dependent random variables. A traditional method to reason over these random variables is to perform inference using belief propagation. When provided with the true data generating process, belief propagation can infer the optimal posterior probability estimates in tree structured factor graphs. However, in many cases we may only have access to a poor approximation of the data generating process, or we may face loops in the factor graph, leading to suboptimal estimates. In this work we first extend graph neural networks to factor graphs (FG-GNN). We then propose a new hybrid model that runs an FG-GNN conjointly with belief propagation. The FG-GNN receives as input messages from belief propagation at every inference iteration and outputs a corrected version of them. As a result, we obtain a more accurate algorithm that combines the benefits of both belief propagation and graph neural networks. We apply our ideas to error correction decoding tasks, and we show that our algorithm can outperform belief propagation for LDPC codes on bursty channels. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/garcia-satorras21a.html
http://proceedings.mlr.press/v130/garcia-satorras21a.html Selective Classification via One-Sided Prediction We propose a novel method for selective classification (SC), a problem which allows a classifier to abstain from predicting some instances, thus trading off accuracy against coverage (the fraction of instances predicted). In contrast to prior gating or confidence-set based work, our proposed method optimises a collection of class-wise decoupled one-sided empirical risks, and is in essence a method for explicitly finding the largest decision sets for each class that have few false positives. This one-sided prediction (OSP) based relaxation yields an SC scheme that attains near-optimal coverage in the practically relevant high target accuracy regime, and further admits efficient implementation, leading to a flexible and principled method for SC. We theoretically derive generalization bounds for SC and OSP, and empirically we show that our scheme strongly outperforms state of the art methods in coverage at small error levels. Thu, 18 Mar 2021 00:00:00 +0000
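The accuracy/coverage trade-off described in the abstract above can be made concrete with a small sketch. It uses plain confidence thresholding as the abstention rule, which is an assumption made only for illustration; it is not the one-sided prediction (OSP) method of the paper. The function name `selective_metrics` and its arguments are hypothetical.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, accuracy on accepted instances) for a thresholded classifier.

    The classifier abstains whenever its confidence is below `threshold`.
    Accuracy is None when every instance is rejected.
    """
    accepted = [
        (pred, label)
        for conf, pred, label in zip(confidences, predictions, labels)
        if conf >= threshold
    ]
    # Coverage is the fraction of instances on which a prediction is made.
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, None
    accuracy = sum(pred == label for pred, label in accepted) / len(accepted)
    return coverage, accuracy


confidences = [0.9, 0.8, 0.6, 0.4]
predictions = [1, 0, 1, 1]
labels = [1, 0, 0, 1]

# Raising the threshold lowers coverage and, on this toy data, raises accuracy.
full = selective_metrics(confidences, predictions, labels, threshold=0.0)
strict = selective_metrics(confidences, predictions, labels, threshold=0.7)
```

On this toy data the strict threshold keeps half of the instances and gets all of them right, while predicting on everything includes one mistake.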
http://proceedings.mlr.press/v130/gangrade21a.html
http://proceedings.mlr.press/v130/gangrade21a.html vqSGD: Vector Quantized Stochastic Gradient Descent In this work, we present a family of vector quantization schemes vqSGD (Vector-Quantized Stochastic Gradient Descent) that provide an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization. In the process we derive the following fundamental information theoretic fact: $\Theta(\frac{d}{R^2})$ bits are necessary and sufficient (up to an additive $O(\log d)$ term) to describe an unbiased estimator $\hat{g}(g)$ for any $g$ in the $d$-dimensional unit sphere, under the constraint that $\|\hat{g}(g)\|_2\le R$ almost surely. In particular, we consider a randomized scheme based on the convex hull of a point set, that returns an unbiased estimator of a $d$-dimensional gradient vector with almost surely bounded norm. We provide multiple efficient instances of our scheme, that are near optimal, and require only $o(d)$ bits of communication at the expense of tolerable increase in error. The instances of our quantization scheme are obtained using the properties of binary error-correcting codes and provide a smooth tradeoff between the communication and the estimation error of quantization. Furthermore, we show that vqSGD also offers some automatic privacy guarantees. Thu, 18 Mar 2021 00:00:00 +0000
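A minimal sketch of a convex-hull quantizer of the kind the abstract above describes, assuming the point set is the scaled cross-polytope $\{\pm\sqrt{d}\,e_i\}$. This choice and all function names are illustrative assumptions, not necessarily the authors' exact construction. Any $g$ with $\|g\|_2 \le 1$ satisfies $\|g\|_1 \le \sqrt{d}$, so it is a convex combination of these $2d$ vertices; sampling one vertex with the combination weights gives an unbiased estimator that can be sent as a single index.

```python
import math
import random


def vertex_probabilities(g):
    """Weights over the 2d vertices ±sqrt(d)·e_i whose weighted mean equals g.

    Index i stands for +sqrt(d)·e_i and index d + i for -sqrt(d)·e_i.
    """
    d = len(g)
    if math.sqrt(sum(x * x for x in g)) > 1.0 + 1e-9:
        raise ValueError("g must lie in the unit Euclidean ball")
    radius = math.sqrt(d)
    l1 = sum(abs(x) for x in g)
    # Leftover mass is spread evenly over all 2d vertices. The vertices sum
    # to zero, so this padding does not bias the estimator.
    pad = max(0.0, 1.0 - l1 / radius) / (2 * d)
    probs = [pad] * (2 * d)
    for i, x in enumerate(g):
        if x >= 0:
            probs[i] += x / radius
        else:
            probs[d + i] += -x / radius
    return probs


def quantize(g, rng):
    """Sample one vertex index; this index is all that needs to be communicated."""
    probs = vertex_probabilities(g)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def decode(index, d):
    """Reconstruct the vertex (the unbiased estimate of g) from its index."""
    vec = [0.0] * d
    if index < d:
        vec[index] = math.sqrt(d)
    else:
        vec[index - d] = -math.sqrt(d)
    return vec


g = [0.3, -0.4, 0.5, 0.0]
estimate = decode(quantize(g, random.Random(0)), len(g))
```

In this sketch the message is one of $2d$ indices, so it costs about $\log_2(2d)$ bits, and the price is an estimator of norm $\sqrt{d}$ rather than at most $1$.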
http://proceedings.mlr.press/v130/gandikota21a.html
http://proceedings.mlr.press/v130/gandikota21a.html Causal Inference with Selectively Deconfounded Data Given only data generated by a standard confounding graph with unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large confounded observational dataset (confounder unobserved) alongside a small deconfounded observational dataset (confounder revealed) when estimating the ATE. Our theoretical results suggest that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases—say, genetics—we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/gan21a.html
http://proceedings.mlr.press/v130/gan21a.html γ-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based γ-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear time complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/fujisawa21a.html
http://proceedings.mlr.press/v130/fujisawa21a.html Free-rider Attacks on Model Aggregation in Federated Learning Free-rider attacks against federated learning consist in feigning participation in the federated learning process with the goal of obtaining the final aggregated model without actually contributing any data. This kind of attack is critical in sensitive applications of federated learning when data is scarce and the model has high commercial value. We introduce here the first theoretical and experimental analysis of free-rider attacks on federated learning schemes based on iterative parameter aggregation, such as FedAvg or FedProx, and provide formal guarantees for these attacks to converge to the aggregated models of the fair participants. We first show that a straightforward implementation of this attack can be simply achieved by not updating the local parameters during the iterative federated optimization. As this attack can be detected by adopting simple countermeasures at the server level, we subsequently study more complex disguising schemes based on stochastic updates of the free-rider parameters. We demonstrate the proposed strategies on a number of experimental scenarios, in both iid and non-iid settings. We conclude by providing recommendations to avoid free-rider attacks in real-world applications of federated learning, especially in sensitive domains where security of data and models is critical. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/fraboni21a.html
http://proceedings.mlr.press/v130/fraboni21a.html Aggregating Incomplete and Noisy Rankings We consider the problem of learning the true ordering of a set of alternatives from largely incomplete and noisy rankings. We introduce a natural generalization of both the Mallows model, a popular model of ranking distributions, and the extensively studied model of ranking from pairwise comparisons. Our selective Mallows model outputs a noisy ranking on any given subset of alternatives, based on an underlying Mallows distribution. Assuming a sequence of subsets where each pair of alternatives appears frequently enough, we obtain strong asymptotically tight upper and lower bounds on the sample complexity of learning the underlying complete central ranking and the (identities and the) ranking of the top k alternatives from selective Mallows rankings. Moreover, building on the work of (Braverman and Mossel, 2009), we show how to efficiently compute the maximum likelihood complete ranking from selective Mallows rankings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/fotakis21a.html
http://proceedings.mlr.press/v130/fotakis21a.html Measure Transport with Kernel Stein Discrepancy Measure transport underpins several recent algorithms for posterior approximation in the Bayesian context, wherein a transport map is sought to minimise the Kullback–Leibler divergence (KLD) from the posterior to the approximation. The KLD is a strong mode of convergence, requiring absolute continuity of measures and placing restrictions on which transport maps can be permitted. Here we propose to minimise a kernel Stein discrepancy (KSD) instead, requiring only that the set of transport maps is dense in an $L^2$ sense and demonstrating how this condition can be validated. The consistency of the associated posterior approximation is established and empirical results suggest that KSD is a competitive and more flexible alternative to KLD for measure transport. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/fisher21a.html
http://proceedings.mlr.press/v130/fisher21a.html A Contraction Approach to Model-based Reinforcement Learning Despite its experimental success, Model-based Reinforcement Learning still lacks a complete theoretical understanding. To this end, we analyze the error in the cumulative reward using a contraction approach. We consider both stochastic and deterministic state transitions for continuous (non-discrete) state and action spaces. This approach doesn’t require strong assumptions and can recover the typical quadratic error to the horizon. We prove that branched rollouts can reduce this error and are essential for deterministic transitions to have a Bellman contraction. Our analysis of policy mismatch error also applies to Imitation Learning. In this case, we show that GAN-type learning has an advantage over Behavioral Cloning when its discriminator is well-trained. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/fan21a.html
http://proceedings.mlr.press/v130/fan21a.html A Variational Inference Approach to Learning Multivariate Wold Processes Temporal point-processes are often used for mathematical modeling of sequences of discrete events with asynchronous timestamps. We focus on a class of temporal point-process models called multivariate Wold processes (MWP). These processes are well suited to model real-world communication dynamics. Statistical inference on such processes often requires learning their corresponding parameters using a set of observed timestamps. In this work, we relax some of the restrictive modeling assumptions made in the state-of-the-art and introduce a Bayesian approach for inferring the parameters of MWP. We develop a computationally efficient variational inference algorithm that allows scaling up the approach to high-dimensional processes and long sequences of observations. Our experimental results on both synthetic and real-world datasets show that our proposed algorithm outperforms existing methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/etesami21a.html
http://proceedings.mlr.press/v130/etesami21a.html Scalable Constrained Bayesian Optimization The global optimization of a high-dimensional black-box function under black-box constraints is a pervasive task in machine learning, control, and engineering. These problems are challenging since the feasible set is typically non-convex and hard to find, in addition to the curses of dimensionality and the heterogeneity of the underlying functions. In particular, these characteristics dramatically impact the performance of Bayesian optimization methods, which otherwise have become the de facto standard for sample-efficient optimization in unconstrained settings, leaving practitioners with evolutionary strategies or heuristics. We propose the scalable constrained Bayesian optimization (SCBO) algorithm that overcomes the above challenges and pushes the applicability of Bayesian optimization far beyond the state-of-the-art. A comprehensive experimental evaluation demonstrates that SCBO achieves excellent results on a variety of benchmarks. To this end, we propose two new control problems that we expect to be of independent value for the scientific community. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/eriksson21a.html
http://proceedings.mlr.press/v130/eriksson21a.html Fisher Auto-Encoders It has been conjectured that the Fisher divergence is more robust to model uncertainty than the conventional Kullback-Leibler (KL) divergence. This motivates the design of a new class of robust generative auto-encoders (AE) referred to as Fisher auto-encoders. Our approach is to design Fisher AEs by minimizing the Fisher divergence between the intractable joint distribution of observed data and latent variables, with that of the postulated/modeled joint distribution. In contrast to KL-based variational AEs (VAEs), the Fisher AE can exactly quantify the distance between the true and the model-based posterior distributions. Qualitative and quantitative results are provided on both MNIST and celebA datasets demonstrating the competitive performance of Fisher AEs in terms of robustness compared to other AEs such as VAEs and Wasserstein AEs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/elkhalil21a.html
http://proceedings.mlr.press/v130/elkhalil21a.html The Unexpected Deterministic and Universal Behavior of Large Softmax Classifiers This paper provides a large dimensional analysis of the Softmax classifier. We discover and prove that, when the classifier is trained on data satisfying loose statistical modeling assumptions, its weights become deterministic and solely depend on the data statistical means and covariances. As a striking consequence, despite the implicit and non-linear nature of the underlying optimization problem, the performance of the Softmax classifier is the same as if performed on a mere Gaussian mixture model, thereby disrupting the intuition that non-linearities inherently extract advanced statistical features from the data. Our findings are theoretically as well as numerically sustained on CNN representations of images produced by GANs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/el-amine-seddik21a.html
http://proceedings.mlr.press/v130/el-amine-seddik21a.html Improved Complexity Bounds in Wasserstein Barycenter Problem In this paper, we focus on computational aspects of the Wasserstein barycenter problem. We propose two algorithms to compute Wasserstein barycenters of $m$ discrete measures of size $n$ with accuracy $\varepsilon$. The first algorithm, based on mirror prox with a specific norm, meets the complexity of the celebrated accelerated iterative Bregman projections (IBP), namely $\widetilde O(mn^2\sqrt n/\varepsilon)$, but without the limitations of (accelerated) IBP, which is numerically unstable under a small regularization parameter. The second algorithm, based on area-convexity and dual extrapolation, improves the previously best-known convergence rates for the Wasserstein barycenter problem, enjoying $\widetilde O(mn^2/\varepsilon)$ complexity. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/dvinskikh21a.html
http://proceedings.mlr.press/v130/dvinskikh21a.html On Riemannian Stochastic Approximation Schemes with Fixed Step-Size This paper studies fixed step-size stochastic approximation (SA) schemes, including stochastic gradient schemes, in a Riemannian framework. It is motivated by several applications, where geodesics can be computed explicitly, and their use accelerates crude Euclidean methods. A fixed step-size scheme defines a family of time-homogeneous Markov chains, parametrized by the step-size. Here, using this formulation, non-asymptotic performance bounds are derived, under Lyapunov conditions. Then, for any step-size, the corresponding Markov chain is proved to admit a unique stationary distribution, and to be geometrically ergodic. This result gives rise to a family of stationary distributions indexed by the step-size, which is further shown to converge to a Dirac measure, concentrated at the solution of the problem at hand, as the step-size goes to $0$. Finally, the asymptotic rate of this convergence is established, through an asymptotic expansion of the bias, and a central limit theorem. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/durmus21a.html
http://proceedings.mlr.press/v130/durmus21a.html A constrained risk inequality for general losses We provide a general constrained risk inequality that applies to arbitrary non-decreasing losses, extending a result of Brown and Low [Ann. Stat. 1996]. Given two distributions $P_0$ and $P_1$, we find a lower bound for the risk of estimating a parameter $\theta(P_1)$ under $P_1$ given an upper bound on the risk of estimating the parameter $\theta(P_0)$ under $P_0$. The inequality is a useful pedagogical tool, as its proof relies only on the Cauchy-Schwarz inequality, it applies to general losses, and it transparently gives risk lower bounds on super-efficient and adaptive estimators. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/duchi21a.html
http://proceedings.mlr.press/v130/duchi21a.html No-Regret Algorithms for Private Gaussian Process Bandit Optimization The widespread proliferation of data-driven decision-making has ushered in a recent interest in the design of privacy-preserving algorithms. In this paper, we consider the ubiquitous problem of Gaussian process (GP) bandit optimization from the lens of privacy-preserving statistics. We propose a solution for differentially private GP bandit optimization that combines uniform kernel approximation with random perturbations, providing a generic framework to create differentially-private (DP) Gaussian process bandit algorithms. For two specific DP settings, joint and local differential privacy, we provide algorithms based on efficient quadrature Fourier feature approximators that are computationally efficient and provably no-regret for a class of stationary kernel functions. In contrast to previous work, our algorithms maintain differential privacy throughout the optimization procedure and critically do not rely on the sample path for prediction, making them scalable and straightforward to release as well. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/dubey21a.html
http://proceedings.mlr.press/v130/dubey21a.html Wasserstein Random Forests and Applications in Heterogeneous Treatment Effects We present new insights into causal inference in the context of Heterogeneous Treatment Effects by proposing natural variants of Random Forests to estimate the key conditional distributions. To achieve this, we recast Breiman’s original splitting criterion in terms of Wasserstein distances between empirical measures. This reformulation indicates that Random Forests are well adapted to estimate conditional distributions and provides a natural extension of the algorithm to multivariate outputs. Following the philosophy of Breiman’s construction, we propose some variants of the splitting rule that are well-suited to the conditional distribution estimation problem. Some preliminary theoretical connections are established along with various numerical experiments, which show how our approach may help to conduct more transparent causal inference in complex situations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/du21a.html
http://proceedings.mlr.press/v130/du21a.html A Bayesian nonparametric approach to count-min sketch under power-law data streams The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens’ frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token’s frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data. Thu, 18 Mar 2021 00:00:00 +0000
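For readers unfamiliar with the data structure, the following is a minimal sketch of the classical count-min sketch that the abstract above builds on. It does not implement the Bayesian nonparametric estimator of the paper. The class name, the use of salted SHA-256 as the hash family, and the parameter names `width` and `depth` are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch: `depth` hash rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, token, row):
        # One hash function per row, obtained by salting the token with the row index.
        digest = hashlib.sha256(f"{row}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, token, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(token, row)] += count

    def estimate(self, token):
        # Collisions can only inflate a counter, so the minimum over rows
        # never underestimates the true frequency.
        return min(
            self.table[row][self._bucket(token, row)] for row in range(self.depth)
        )


stream = ["the"] * 50 + ["of"] * 20 + ["sketch"] * 3 + ["nigp"]
cms = CountMinSketch(width=64, depth=4)
for token in stream:
    cms.add(token)
```

The minimum-over-rows estimate is exact when at least one row has no collision for the queried token and is an overestimate otherwise, which is why rare tokens are the hard case that the abstract highlights.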
http://proceedings.mlr.press/v130/dolera21a.html
http://proceedings.mlr.press/v130/dolera21a.html A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix Continual learning (CL) is a setting in which an agent has to learn from an incoming stream of data during its entire lifetime. Although major advances have been made in the field, one recurring problem which remains unsolved is that of Catastrophic Forgetting (CF). While the issue has been extensively studied empirically, little attention has been paid from a theoretical angle. In this paper, we show that the impact of CF increases as two tasks increasingly align. We introduce a measure of task similarity called the NTK overlap matrix which is at the core of CF. We analyze common projected gradient algorithms and demonstrate how they mitigate forgetting. Then, we propose a variant of Orthogonal Gradient Descent (OGD) which leverages structure of the data through Principal Component Analysis (PCA). Experiments support our theoretical findings and show how our method can help reduce CF on classical CL datasets. Thu, 18 Mar 2021 00:00:00 +0000
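The projection step behind Orthogonal Gradient Descent, mentioned in the abstract above, can be sketched as follows: gradients stored from earlier tasks are orthonormalised, and each new gradient is projected onto the orthogonal complement of their span before the update. This is a generic sketch of plain OGD under the assumption that gradients are flat lists of floats; the PCA-based variant proposed in the paper is not reproduced, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of the stored gradients."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_orthogonal(grad, basis):
    """Remove from `grad` every component lying in the span of `basis`."""
    g = list(grad)
    for b in basis:
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


# Gradients remembered from a previous task span the first two coordinates.
stored = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(stored)

# The new task's gradient keeps only its component outside that span.
update_direction = project_orthogonal([1.0, 2.0, 3.0], basis)
```

Because the projected direction is orthogonal to every stored gradient, a small step along it leaves the earlier task's outputs unchanged to first order, which is the mechanism by which such methods limit forgetting.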
http://proceedings.mlr.press/v130/doan21a.html
http://proceedings.mlr.press/v130/doan21a.html GANs with Conditional Independence Graphs: On Subadditivity of Probability Divergences Generative Adversarial Networks (GANs) are modern methods to learn the underlying distribution of a data set. GANs have been widely used in sample synthesis, de-noising, domain transfer, etc. GANs, however, are designed in a model-free fashion where no additional information about the underlying distribution is available. In many applications, however, practitioners have access to the underlying independence graph of the variables, either as a Bayesian network or a Markov Random Field (MRF). We ask: how can one use this additional information in designing model-based GANs? In this paper, we provide theoretical foundations to answer this question by studying subadditivity properties of probability divergences, which establish upper bounds on the distance between two high-dimensional distributions by the sum of distances between their marginals over (local) neighborhoods of the graphical structure of the Bayes-net or the MRF. We prove that several popular probability divergences satisfy some notion of subadditivity under mild conditions. These results lead to a principled design of a model-based GAN that uses a set of simple discriminators on the neighborhoods of the Bayes-net/MRF, rather than a giant discriminator on the entire network, providing significant statistical and computational benefits. Our experiments on synthetic and real-world datasets demonstrate the benefits of our principled design of model-based GANs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ding21e.html
http://proceedings.mlr.press/v130/ding21e.html Provably Efficient Safe Exploration via Primal-Dual Policy Optimization We study the safe reinforcement learning problem using the constrained Markov decision processes in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing safe reinforcement learning algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an Optimistic Primal-Dual Proximal Policy Optimization (OPDOP) algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for constrained Markov decision processes in the function approximation setting, with safe exploration. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ding21d.html
http://proceedings.mlr.press/v130/ding21d.html Dual Principal Component Pursuit for Learning a Union of Hyperplanes: Theory and Algorithms State-of-the-art subspace clustering methods are based on convex formulations whose theoretical guarantees require the subspaces to be low-dimensional. Dual Principal Component Pursuit (DPCP) is a non-convex method that is specifically designed for learning high-dimensional subspaces, such as hyperplanes. However, existing analyses of DPCP in the multi-hyperplane case lack a precise characterization of the distribution of the data and involve quantities that are difficult to interpret. Moreover, the provable algorithm based on recursive linear programming is not efficient. In this paper, we introduce a new notion of geometric dominance, which explicitly captures the distribution of the data, and derive both geometric and probabilistic conditions under which a global solution to DPCP is a normal vector to a geometrically dominant hyperplane. We then prove that the DPCP problem for a union of hyperplanes satisfies a Riemannian regularity condition, and use this result to show that a scalable Riemannian subgradient method exhibits (local) linear convergence to the normal vector of the geometrically dominant hyperplane. Finally, we show that integrating DPCP into popular subspace clustering schemes, such as K-ensembles, leads to superior or competitive performance over the state-of-the-art in clustering hyperplanes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ding21c.html
http://proceedings.mlr.press/v130/ding21c.html Random Coordinate Underdamped Langevin Monte Carlo The Underdamped Langevin Monte Carlo (ULMC) is a popular Markov chain Monte Carlo sampling method. It requires the computation of the full gradient of the log-density at each iteration, an expensive operation if the dimension of the problem is high. We propose a sampling method called Random Coordinate ULMC (RC-ULMC), which selects a single coordinate at each iteration to be updated and leaves the other coordinates untouched. We investigate the computational complexity of RC-ULMC and compare it with the classical ULMC for strongly log-concave probability distributions. We show that RC-ULMC is always cheaper than the classical ULMC, with a significant cost reduction when the problem is highly skewed and high dimensional. Our complexity bound for RC-ULMC is also tight in terms of dimension dependence. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ding21b.html
http://proceedings.mlr.press/v130/ding21b.html An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandits, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ding21a.html
http://proceedings.mlr.press/v130/ding21a.html Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization The use of min-max optimization in the adversarial training of deep neural network classifiers, and the training of generative adversarial networks has motivated the study of nonconvex-nonconcave optimization objectives, which frequently arise in these applications. Unfortunately, recent results have established that even approximate first-order stationary points of such objectives are intractable, even under smoothness conditions, motivating the study of min-max objectives with additional structure. We introduce a new class of structured nonconvex-nonconcave min-max optimization problems, proposing a generalization of the extragradient algorithm which provably converges to a stationary point. The algorithm applies not only to Euclidean spaces, but also to general $\ell_p$-normed finite-dimensional real vector spaces. We also discuss its stability under stochastic oracles and provide bounds on its sample complexity. Our iteration complexity and sample complexity bounds either match or improve the best known bounds for the same or less general nonconvex-nonconcave settings, such as those that satisfy variational coherence or in which a weak solution to the associated variational inequality problem is assumed to exist. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/diakonikolas21a.html
http://proceedings.mlr.press/v130/diakonikolas21a.html Improving Adversarial Robustness via Unlabeled Out-of-Domain Data Data augmentation by incorporating cheap unlabeled data from multiple domains is a powerful way to improve prediction especially when there is limited labeled data. In this work, we investigate how adversarial robustness can be enhanced by leveraging out-of-domain unlabeled data. We demonstrate that for broad classes of distributions and classifiers, there exists a sample complexity gap between standard and robust classification. We quantify the extent to which this gap can be bridged by leveraging unlabeled samples from a shifted domain by providing both upper and lower bounds. Moreover, we show settings where we achieve better adversarial robustness when the unlabeled data come from a shifted domain rather than the same domain as the labeled data. We also investigate how to leverage out-of-domain data when some structural information, such as sparsity, is shared between labeled and unlabeled domains. Experimentally, we augment object recognition datasets (CIFAR-10, CINIC-10, and SVHN) with easy-to-obtain and unlabeled out-of-domain data and demonstrate substantial improvement in the model’s robustness against $\ell_\infty$ adversarial attacks on the original domain. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/deng21b.html
http://proceedings.mlr.press/v130/deng21b.html Local Stochastic Gradient Descent Ascent: Convergence Analysis and Communication Efficiency Local SGD is a promising approach to overcome the communication overhead in distributed learning by reducing the synchronization frequency among worker nodes. Despite the recent theoretical advances of local SGD in empirical risk minimization, the efficiency of its counterpart in minimax optimization remains unexplored. Motivated by large scale minimax learning problems, such as adversarial robust learning and GANs, we propose local Stochastic Gradient Descent Ascent (local SGDA), where the primal and dual variables can be trained locally and averaged periodically to significantly reduce the number of communications. We show that local SGDA can provably optimize distributed minimax problems in both homogeneous and heterogeneous data with reduced number of communications and establish convergence rates under strongly-convex-strongly-concave and nonconvex-strongly-concave settings. In addition, we propose a novel variant, dubbed as local SGDA+, to solve nonconvex-nonconcave problems. We also give corroborating empirical evidence on different distributed minimax problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/deng21a.html
http://proceedings.mlr.press/v130/deng21a.html Combinatorial Gaussian Process Bandits with Probabilistically Triggered Arms Combinatorial bandit models and algorithms are used in many sequential decision-making tasks ranging from item list recommendation to influence maximization. Typical algorithms proposed for combinatorial bandits, including combinatorial UCB (CUCB) and combinatorial Thompson sampling (CTS), do not exploit correlations between base arms during the learning process. Moreover, their regret is usually analyzed under independent base arm outcomes. In this paper, we use Gaussian Processes (GPs) to model correlations between base arms. In particular, we consider a combinatorial bandit model with probabilistically triggered arms, and assume that the expected base arm outcome function is a sample from a GP. We assume that the learner has access to an exact computation oracle, which returns an optimal solution given expected base arm outcomes, and analyze the regret of the Combinatorial Gaussian Process Upper Confidence Bound (ComGP-UCB) algorithm for this setting. Under a (triggering probability modulated) Lipschitz continuity assumption on the expected reward function, we derive ($O( \sqrt{m T \log T \gamma_{T, \boldsymbol{\mu}}^{PTA}})$) $O(m \sqrt{\frac{T \log T}{p^*}})$ upper bounds for the regret of ComGP-UCB that hold with high probability, where $m$ denotes the number of base arms, $p^*$ denotes the minimum non-zero triggering probability, and $\gamma_{T, \boldsymbol{\mu}}^{PTA}$ denotes the pseudo-information gain. Finally, we show via simulations that when the correlations between base arm outcomes are strong, ComGP-UCB significantly outperforms CUCB and CTS. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/demirel21a.html
http://proceedings.mlr.press/v130/demirel21a.html Regularized ERM on random subspaces We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data, recovering as a special case Nyström approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, that might not be smooth, such as the hinge loss used in support vector machines. This extension requires developing new proofs, that use different technical tools. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance. Theoretical results are illustrated with simple numerical experiments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/della-vecchia21a.html
http://proceedings.mlr.press/v130/della-vecchia21a.html A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which quantifies its level of non-stationarity. Our method generalizes previous approaches based on sliding windows and exponential discounting used to handle changing environments. We further propose a practical implementation of KeRNS, analyze its regret, and validate it experimentally. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/darwiche-domingues21a.html
http://proceedings.mlr.press/v130/darwiche-domingues21a.html Nonparametric Estimation of Heterogeneous Treatment Effects: From Theory to Learning Algorithms The need to evaluate treatment effectiveness is ubiquitous in most of empirical science, and interest in flexibly investigating effect heterogeneity is growing rapidly. To do so, a multitude of model-agnostic, nonparametric meta-learners have been proposed in recent years. Such learners decompose the treatment effect estimation problem into separate sub-problems, each solvable using standard supervised learning methods. Choosing between different meta-learners in a data-driven manner is difficult, as it requires access to counterfactual information. Therefore, with the ultimate goal of building better understanding of the conditions under which some learners can be expected to perform better than others a priori, we theoretically analyze four broad meta-learning strategies which rely on plug-in estimation and pseudo-outcome regression. We highlight how this theoretical reasoning can be used to guide principled algorithm design and translate our analyses into practice by considering a variety of neural network architectures as base-learners for the discussed meta-learning strategies. In a simulation study, we showcase the relative strengths of the learners under different data-generating processes. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/curth21a.html
http://proceedings.mlr.press/v130/curth21a.html A Change of Variables Method For Rectangular Matrix-Vector Products Rectangular matrix-vector products (MVPs) are used extensively throughout machine learning and are fundamental to neural networks such as multi-layer perceptrons. However, the use of rectangular MVPs in successive normalizing flow transformations is notably missing. This paper identifies this methodological gap and plugs it with a tall and wide MVP change of variables formula. Our theory builds up to a practical algorithm that envelops existing dimensionality increasing flow methods such as augmented flows. We show that tall MVPs are closely related to the stochastic inverse of wide MVPs and empirically demonstrate that they improve density estimation over existing dimension changing methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/cunningham21a.html
http://proceedings.mlr.press/v130/cunningham21a.html Approximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning The recent mean field game (MFG) formalism facilitates otherwise intractable computation of approximate Nash equilibria in many-agent settings. In this paper, we consider discrete-time finite MFGs subject to finite-horizon objectives. We show that all discrete-time finite MFGs with non-constant fixed point operators fail to be contractive as typically assumed in existing MFG literature, barring convergence via fixed point iteration. Instead, we incorporate entropy-regularization and Boltzmann policies into the fixed point iteration. As a result, we obtain provable convergence to approximate fixed points where existing methods fail, and reach the original goal of approximate Nash equilibria. All proposed methods are evaluated with respect to their exploitability, on both instructive examples with tractable exact solutions and high-dimensional problems where exact methods become intractable. In high-dimensional scenarios, we apply established deep reinforcement learning methods and empirically combine fictitious play with our approximations. Thu, 18 Mar 2021 00:00:00 +0000
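As an illustration of the Boltzmann policies mentioned in the abstract above, the following is a minimal sketch of a softmax policy over action values. It is not the authors' implementation; the function name `boltzmann_policy` and the `temperature` parameter are assumptions made for this example.

```python
import math


def boltzmann_policy(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

A higher temperature moves the policy toward uniform, while a lower temperature concentrates probability on the highest-valued action.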
http://proceedings.mlr.press/v130/cui21a.html
http://proceedings.mlr.press/v130/cui21a.html Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression The Shapley value concept from cooperative game theory has become a popular technique for interpreting ML models, but efficiently estimating these values remains challenging, particularly in the model-agnostic setting. Here, we revisit the idea of estimating Shapley values via linear regression to understand and improve upon this approach. By analyzing the original KernelSHAP alongside a newly proposed unbiased version, we develop techniques to detect its convergence and calculate uncertainty estimates. We also find that the original version incurs a negligible increase in bias in exchange for significantly lower variance, and we propose a variance reduction technique that further accelerates the convergence of both estimators. Finally, we develop a version of KernelSHAP for stochastic cooperative games that yields fast new estimators for two global explanation methods. Thu, 18 Mar 2021 00:00:00 +0000
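For reference, the quantity that KernelSHAP estimates can be computed exactly for very small games. The sketch below enumerates all player orderings and averages marginal contributions; it is the textbook definition of the Shapley value, not the regression-based estimator studied in the paper, and the names `exact_shapley` and `weights` are assumptions for this example.

```python
from itertools import permutations


def exact_shapley(players, value):
    """Average each player's marginal contribution over all orderings.

    `value` maps a frozenset of players to a number. The cost grows
    factorially with the number of players, so this is only a reference
    for tiny games.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical additive game: each player's Shapley value equals its weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
phi = exact_shapley(list(weights), lambda s: sum(weights[p] for p in s))
```

The factorial cost of this enumeration is what motivates sampling-based estimators in the first place.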
http://proceedings.mlr.press/v130/covert21a.html
http://proceedings.mlr.press/v130/covert21a.html Variational Autoencoder with Learned Latent Structure The manifold hypothesis states that high-dimensional data can be modeled as lying on or near a low-dimensional, nonlinear manifold. Variational Autoencoders (VAEs) approximate this manifold by learning mappings from low-dimensional latent vectors to high-dimensional data while encouraging a global structure in the latent space through the use of a specified prior distribution. When this prior does not match the structure of the true data manifold, it can lead to a less accurate model of the data. To resolve this mismatch, we introduce the Variational Autoencoder with Learned Latent Structure (VAELLS) which incorporates a learnable manifold model into the latent space of a VAE. This enables us to learn the nonlinear manifold structure from the data and use that structure to define a prior in the latent space. The integration of a latent manifold model not only ensures that our prior is well-matched to the data, but also allows us to define generative transformation paths in the latent space and describe class manifolds with transformations stemming from examples of each class. We validate our model on examples with known latent structure and also demonstrate its capabilities on a real-world dataset. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/connor21a.html
http://proceedings.mlr.press/v130/connor21a.html Offline detection of change-points in the mean for stationary graph signals. This paper addresses the problem of segmenting a stream of graph signals: we aim to detect changes in the mean of the multivariate signal defined over the nodes of a known graph. We propose an offline algorithm that relies on the concept of graph signal stationarity and allows the convenient translation of the problem from the original vertex domain to the spectral domain (Graph Fourier Transform), where it is much easier to solve. Although the obtained spectral representation is sparse in real applications, to the best of our knowledge this property has not been much exploited in the existing related literature. Our main contribution is a change-point detection algorithm that adopts a model selection perspective, which takes into account the sparsity of the spectral representation and determines automatically the number of change-points. Our detector comes with a proof of a non-asymptotic oracle inequality, and numerical experiments demonstrate the validity of our method. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/concha-duarte21a.html
http://proceedings.mlr.press/v130/concha-duarte21a.html Differentially Private Weighted Sampling Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS), a method that sanitizes a weighted sample so as to ensure element-level differential privacy, while retaining its utility to the maximum extent possible. PWS maximizes the reporting probabilities of keys and estimation quality of a broad family of statistics. PWS improves over the state of the art even for the well-studied special case of private histograms, when no sampling is performed. We empirically observe significant performance gains: a 20%-300% increase in key reporting for common Zipfian frequency distributions and accurate estimation with 2x-8x lower frequencies. PWS is applied as a post-processing step on a non-private sample, without requiring the original data. Therefore, it can be a seamless addition to existing implementations, such as those optimized for distributed or streamed data. We believe that due to practicality and performance, PWS may become a method of choice in applications where privacy is desired. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/cohen21b.html
http://proceedings.mlr.press/v130/cohen21b.html Aligning Time Series on Incomparable Spaces Dynamic time warping (DTW) is a useful method for aligning, comparing and combining time series, but it requires them to live in comparable spaces. In this work, we consider a setting in which time series live on different spaces without a sensible ground metric, causing DTW to become ill-defined. To alleviate this, we propose Gromov dynamic time warping (GDTW), a distance between time series on potentially incomparable spaces that avoids the comparability requirement by instead considering intra-relational geometry. We demonstrate its effectiveness at aligning, combining and comparing time series living on incomparable spaces. We further propose a smoothed version of GDTW as a differentiable loss and assess its properties in a variety of settings, including barycentric averaging, generative modeling and imitation learning. Thu, 18 Mar 2021 00:00:00 +0000
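To make the comparability requirement concrete, here is a minimal sketch of classical DTW (not the proposed GDTW). It needs a ground distance `dist` between elements of the two series, which is exactly what is unavailable when the series live on incomparable spaces. The function name `dtw_cost` and the default absolute-difference distance are assumptions for this example.

```python
def dtw_cost(xs, ys, dist=lambda a, b: abs(a - b)):
    """Classical dynamic time warping cost between two sequences.

    cost[i][j] holds the cheapest alignment of xs[:i] with ys[:j].
    """
    n, m = len(xs), len(ys)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(xs[i - 1], ys[j - 1])
            # Extend the best of: advance xs only, advance ys only, advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

For example, `dtw_cost([0, 1, 2], [0, 1, 1, 2])` is zero because the repeated `1` can be absorbed by the warping.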
http://proceedings.mlr.press/v130/cohen21a.html
http://proceedings.mlr.press/v130/cohen21a.html Online k-means Clustering We study the problem of learning a clustering of an online set of points. The specific formulation we use is the k-means objective: At each time step the algorithm has to maintain a set of k candidate centers and the loss incurred by the algorithm is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the k-means objective in hindsight. We show that provided the data lies in a bounded region, learning is possible, namely an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centers at each round) implies an offline efficient algorithm for the k-means problem, which is known to be NP-hard. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + \epsilon)OPT$ and present a no-regret algorithm with runtime $O\left(T \mathrm{poly}(\log(T),k,d,1/\epsilon)^{O(kd)}\right)$. Our algorithm is based on maintaining a set of points of bounded size which is a coreset that helps identify the relevant regions of the space for running an adaptive, more efficient, variant of the MWUA. We show that simpler online algorithms, such as Follow The Leader (FTL), fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data. Our theoretical results answer an open question of Dasgupta (2008). Thu, 18 Mar 2021 00:00:00 +0000
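The per-round loss described in the abstract above can be sketched as follows. This only evaluates the loss of a given sequence of center sets; it does not implement MWUA or the coreset-based algorithm, and the names `closest_center_loss` and `cumulative_loss` are assumptions for this example.

```python
def closest_center_loss(point, centers):
    """Squared Euclidean distance from `point` to its nearest center."""
    return min(
        sum((p - c) ** 2 for p, c in zip(point, center))
        for center in centers
    )


def cumulative_loss(points, center_sets):
    """Total online loss when center_sets[t] is committed before points[t] arrives."""
    return sum(closest_center_loss(x, cs) for x, cs in zip(points, center_sets))
```

Regret would then be this cumulative loss minus the loss of the best fixed set of k centers chosen in hindsight.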
http://proceedings.mlr.press/v130/cohen-addad21a.html
http://proceedings.mlr.press/v130/cohen-addad21a.html A Hybrid Approximation to the Marginal Likelihood Computing the marginal likelihood or evidence is one of the core challenges in Bayesian analysis. While there are many established methods for estimating this quantity, they predominantly rely on using a large number of posterior samples obtained from a Markov Chain Monte Carlo (MCMC) algorithm. As the dimension of the parameter space increases, however, many of these methods become prohibitively slow and potentially inaccurate. In this paper, we propose a novel method in which we use the MCMC samples to learn a high probability partition of the parameter space and then form a deterministic approximation over each of these partition sets. This two-step procedure, which constitutes both a probabilistic and a deterministic component, is termed a Hybrid approximation to the marginal likelihood. We demonstrate its versatility in a plethora of examples with varying dimension and sample size, and we also highlight the Hybrid approximation’s effectiveness in situations where there is either a limited number or only approximate MCMC samples available. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chuu21a.html
http://proceedings.mlr.press/v130/chuu21a.html No-regret Algorithms for Multi-task Bayesian Optimization We consider multi-objective optimization (MOO) of an unknown vector-valued function in the non-parametric Bayesian optimization (BO) setting. Our aim is to maximize the expected cumulative utility of all objectives, as expressed by a given prior over a set of scalarization functions. Most existing BO algorithms do not model the fact that the multiple objectives, or equivalently, tasks can share similarities, and even the few that do lack rigorous, finite-time regret guarantees that capture explicitly inter-task structure. In this work, we address this problem by modelling inter-task dependencies using a multi-task kernel and develop two novel BO algorithms based on random scalarization of the objectives. Our algorithms employ vector-valued kernel regression as a stepping stone and belong to the upper confidence bound class of algorithms. Under a smoothness assumption that the unknown vector-valued function is an element of the reproducing kernel Hilbert space associated with the multi-task kernel, we derive worst-case regret bounds for our algorithms that explicitly capture the similarities between tasks. We numerically benchmark our algorithms on both synthetic and real-life MOO problems, and show the advantages offered by learning with multi-task kernels. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chowdhury21c.html
http://proceedings.mlr.press/v130/chowdhury21c.html Reinforcement Learning in Parametric MDPs with Exponential Families Extending model-based regret minimization strategies for Markov decision processes (MDPs) beyond discrete state-action spaces requires structural assumptions on the reward and transition models. Existing parametric approaches establish regret guarantees by making strong assumptions about either the state transition distribution or the value function as a function of state-action features, and often do not satisfactorily capture classical problems like linear dynamical systems or factored MDPs. This paper introduces a new MDP transition model defined by a collection of linearly parameterized exponential families with $d$ unknown parameters. For finite-horizon episodic RL with horizon $H$ in this MDP model, we propose a model-based upper confidence RL algorithm (Exp-UCRL) that solves a penalized maximum likelihood estimation problem to learn the $d$-dimensional representation of the transition distribution, balancing the exploitation-exploration tradeoff using confidence sets in the exponential family space. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde O(d\sqrt{H^3 N})$, sub-linear in total time $N$, linear in dimension $d$, and polynomial in the planning horizon $H$. This is achieved by deriving a novel concentration inequality for conditional exponential families that might be of independent interest. The exponential family MDP model also admits an efficient posterior sampling-style algorithm for which a similar guarantee on the Bayesian regret is shown. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chowdhury21b.html
http://proceedings.mlr.press/v130/chowdhury21b.html Generalized Spectral Clustering via Gromov-Wasserstein Learning We establish a bridge between spectral clustering and Gromov-Wasserstein Learning (GWL), a recent optimal transport-based approach to graph partitioning. This connection both explains and improves upon the state-of-the-art performance of GWL. The Gromov-Wasserstein framework provides probabilistic correspondences between nodes of source and target graphs via a quadratic programming relaxation of the node matching problem. Our results utilize and connect the observations that the GW geometric structure remains valid for any rank-2 tensor, in particular the adjacency, distance, and various kernel matrices on graphs, and that the heat kernel outperforms the adjacency matrix in producing stable and informative node correspondences. Using the heat kernel in the GWL framework provides new multiscale graph comparisons without compromising theoretical guarantees, while immediately yielding improved empirical results. A key insight of the GWL framework toward graph partitioning was to compute GW correspondences from a source graph to a template graph with isolated, self-connected nodes. We show that when comparing against a two-node template graph using the heat kernel at the infinite time limit, the resulting partition agrees with the partition produced by the Fiedler vector. This in turn yields a new insight into the k-cut graph partitioning problem through the lens of optimal transport. Our experiments on a range of real-world networks achieve comparable results to, and in many cases outperform, the state-of-the-art achieved by GWL. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chowdhury21a.html
http://proceedings.mlr.press/v130/chowdhury21a.html Learning Individually Fair Classifier with Path-Specific Causal-Effect Constraint Machine learning is used to make decisions for individuals in various fields, which requires us to achieve good prediction accuracy while ensuring fairness with respect to sensitive features (e.g., race and gender). This problem, however, remains difficult in complex real-world scenarios. To quantify unfairness in such situations, existing methods utilize path-specific causal effects. However, none of them can ensure fairness for each individual without making impractical functional assumptions about the data. In this paper, we propose a far more practical framework for learning an individually fair classifier. To avoid restrictive functional assumptions, we define the probability of individual unfairness (PIU) and solve an optimization problem where PIU’s upper bound, which can be estimated from data, is controlled to be close to zero. We elucidate why our method can guarantee fairness for each individual. Experimental results show that our method can learn an individually fair classifier at a slight cost of accuracy. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chikahara21a.html
http://proceedings.mlr.press/v130/chikahara21a.html Fast and Smooth Interpolation on Wasserstein Space We propose a new method for smoothly interpolating probability measures using the geometry of optimal transport. To that end, we reduce this problem to the classical Euclidean setting, allowing us to directly leverage the extensive toolbox of spline interpolation. Unlike previous approaches to measure-valued splines, our interpolated curves (i) have a clear interpretation as governing particle flows, which is natural for applications, and (ii) come with the first approximation guarantees on Wasserstein space. Finally, we demonstrate the broad applicability of our interpolation methodology by fitting surfaces of measures using thin-plate splines. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chewi21a.html
http://proceedings.mlr.press/v130/chewi21a.html List Learning with Attribute Noise We introduce and study the model of list learning with attribute noise. Learning with attribute noise was introduced by Shackelford and Volper (COLT, 1988) as a variant of PAC learning, in which the algorithm has access to noisy examples and uncorrupted labels, and the goal is to recover an accurate hypothesis. Sloan (COLT, 1988) and Goldman and Sloan (Algorithmica, 1995) discovered information-theoretic limits to learning in this model, which have impeded further progress. In this article we extend the model to that of list learning, drawing inspiration from the list-decoding model in coding theory, and its recent variant studied in the context of learning. On the positive side, we show that sparse conjunctions can be efficiently list learned under some assumptions on the underlying ground-truth distribution. On the negative side, our results show that even in the list-learning model, efficient learning of parities and majorities is not possible regardless of the representation used. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/cheraghchi21a.html
http://proceedings.mlr.press/v130/cheraghchi21a.html Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios—well-specified, misspecified, and corrected models—to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21g.html
http://proceedings.mlr.press/v130/chen21g.html Accumulations of Projections—A Unified Framework for Random Sketches in Kernel Ridge Regression Building a sketch of an n-by-n empirical kernel matrix is a common approach to accelerate the computation of many kernel methods. In this paper, we propose a unified framework for constructing sketching methods in kernel ridge regression (KRR), which views the sketching matrix S as an accumulation of m rescaled sub-sampling matrices with independent columns. Our framework incorporates two commonly used sketching methods, sub-sampling sketches (known as the Nyström method) and sub-Gaussian sketches, as special cases with m=1 and m=infinity respectively. Under the new framework, we provide a unified error analysis of sketching approximation and show that our accumulation scheme improves the low accuracy of sub-sampling sketches when a certain incoherence characteristic is high, and accelerates the more accurate but computationally heavier sub-Gaussian sketches. By optimally choosing the number m of accumulations, we show that a best trade-off between computational efficiency and statistical accuracy can be achieved. In practice, the sketching method can be implemented as efficiently as the sub-sampling sketches, as only minor extra matrix additions are needed. Our empirical evaluations also demonstrate that the proposed method may attain accuracy close to that of sub-Gaussian sketches while being as efficient as sub-sampling-based sketches. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21f.html
http://proceedings.mlr.press/v130/chen21f.html Fast Statistical Leverage Score Approximation in Kernel Ridge Regression Nyström approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems through sub-sampling the n-by-n empirical kernel matrix appearing in the objective function. However, the performance of such a sub-sampling method heavily relies on correctly estimating the statistical leverage scores for forming the sampling distribution, which can be as costly as solving the original KRR. In this work, we propose a linear time (modulo poly-log terms) algorithm to accurately approximate the statistical leverage scores in the stationary-kernel-based KRR with theoretical guarantees. Particularly, by analyzing the first-order condition of the KRR objective, we derive an analytic formula, which depends on both the input distribution and the spectral density of stationary kernels, for capturing the non-uniformity of the statistical leverage scores. Numerical experiments demonstrate that with the same prediction accuracy our method is orders of magnitude more efficient than existing methods in selecting the representative sub-samples in the Nyström approximation. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21e.html
http://proceedings.mlr.press/v130/chen21e.html Active Online Learning with Hidden Shifting Domains Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring a label at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that depends on the durations and dimensionalities of the hidden domains. Our algorithm can adaptively deal with interleaving spans of inputs from different domains. We also generalize our results to non-linear regression for hypothesis classes with bounded eluder dimension and adaptive adversaries. Experiments on synthetic and realistic datasets demonstrate that our algorithm achieves lower regret than uniform queries and greedy queries with equal labeling budget. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21d.html
http://proceedings.mlr.press/v130/chen21d.html Communication Efficient Primal-Dual Algorithm for Nonconvex Nonsmooth Distributed Optimization Decentralized optimization problems frequently appear in large-scale machine learning. However, few works address the difficult nonconvex nonsmooth case. In this paper, we propose a decentralized primal-dual algorithm to solve this type of problem in a decentralized manner. The proposed algorithm achieves an $\mathcal{O}(1/\epsilon^2)$ iteration complexity to attain an $\epsilon$-solution, which is the well-known lower iteration complexity bound for nonconvex optimization. To our knowledge, it is the first algorithm achieving this rate under a nonconvex, nonsmooth decentralized setting. Furthermore, to reduce communication overhead, we also modify our algorithm by compressing the vectors exchanged between agents. The iteration complexity of the algorithm with compression is still $\mathcal{O}(1/\epsilon^2)$. In addition, we apply the proposed algorithm to solve a nonconvex linear regression problem and to train a deep learning model, both of which demonstrate the efficiency and efficacy of the proposed algorithm. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21c.html
http://proceedings.mlr.press/v130/chen21c.html Learning Prediction Intervals for Regression: Generalization and Calibration We study the generation of prediction intervals in regression for uncertainty quantification. This task can be formalized as an empirical constrained optimization problem that minimizes the average interval width while maintaining the coverage accuracy across data. We strengthen the existing literature by studying two aspects of this empirical optimization. First is a general learning theory to characterize the optimality-feasibility tradeoff that encompasses Lipschitz continuity and VC-subgraph classes, which are exemplified in regression trees and neural networks. Second is a calibration machinery and the corresponding statistical theory to optimally select the regularization parameter that manages this tradeoff, which bypasses the overfitting issues in previous approaches in coverage attainment. We empirically demonstrate the strengths of our interval generation and calibration algorithms in terms of testing performances compared to existing benchmarks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21b.html
http://proceedings.mlr.press/v130/chen21b.html CADA: Communication-Adaptive Distributed Adam Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method, justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chen21a.html
http://proceedings.mlr.press/v130/chen21a.html Maximizing Agreements for Ranking, Clustering and Hierarchical Clustering via MAX-CUT In this paper, we study a number of well-known combinatorial optimization problems that fit in the following paradigm: the input is a collection of (potentially inconsistent) local relationships between the elements of a ground set (e.g., pairwise comparisons, similar/dissimilar pairs, or ancestry structure of triples of points), and the goal is to aggregate this information into a global structure (e.g., a ranking, a clustering, or a hierarchical clustering) in a way that maximizes agreement with the input. Well-studied problems such as rank aggregation, correlation clustering, and hierarchical clustering with triplet constraints fall in this class of problems. We study these problems on stochastic instances with a hidden embedded ground truth solution. Our main algorithmic contribution is a unified technique that uses the maximum cut problem in graphs to approximately solve these problems. Using this technique, we can often get approximation guarantees in the stochastic setting that are better than the known worst case inapproximability bounds for the corresponding problem. On the negative side, we improve the worst case inapproximability bound on several hierarchical clustering formulations through a reduction to related ranking problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/chatziafratis21a.html
http://proceedings.mlr.press/v130/chatziafratis21a.html Convergence and Accuracy Trade-Offs in Federated Learning and Meta-Learning We study a family of algorithms, which we refer to as local update methods, generalizing many federated and meta-learning algorithms. We prove that for quadratic models, local update methods are equivalent to first-order optimization on a surrogate loss we exactly characterize. Moreover, fundamental algorithmic choices (such as learning rates) explicitly govern a trade-off between the condition number of the surrogate loss and its alignment with the true loss. We derive novel convergence rates showcasing these trade-offs and highlight their importance in communication-limited settings. Using these insights, we are able to compare local update methods based on their convergence/accuracy trade-off, not just their convergence to critical points of the empirical loss. Our results shed new light on a broad range of phenomena, including the efficacy of server momentum in federated learning and the impact of proximal client updates. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/charles21a.html
http://proceedings.mlr.press/v130/charles21a.html Approximation Algorithms for Orthogonal Non-negative Matrix Factorization In the non-negative matrix factorization (NMF) problem, the input is an $m\times n$ matrix $M$ with non-negative entries and the goal is to factorize it as $M\approx AW$. The $m\times k$ matrix $A$ and the $k\times n$ matrix $W$ are both constrained to have non-negative entries. This is in contrast to singular value decomposition, where the matrices $A$ and $W$ can have negative entries but must satisfy the orthogonality constraint: the columns of $A$ are orthogonal and the rows of $W$ are also orthogonal. The orthogonal non-negative matrix factorization (ONMF) problem imposes both the non-negativity and the orthogonality constraints, and previous work showed that it leads to better performances than NMF on many clustering tasks. We give the first constant-factor approximation algorithm for ONMF when one or both of $A$ and $W$ are subject to the orthogonality constraint. We also show an interesting connection to the correlation clustering problem on bipartite graphs. Our experiments on synthetic and real-world data show that our algorithm achieves similar or smaller errors compared to previous ONMF algorithms while ensuring perfect orthogonality (many previous algorithms do not satisfy the hard orthogonality constraint). Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/charikar21a.html
http://proceedings.mlr.press/v130/charikar21a.html Logical Team Q-learning: An approach towards factored policies in cooperative MARL We address the challenge of learning factored policies in cooperative MARL scenarios. In particular, we consider the situation in which a team of agents collaborates to optimize a common cost. The goal is to obtain factored policies that determine the individual behavior of each agent so that the resulting joint policy is optimal. The main contribution of this work is the introduction of Logical Team Q-learning (LTQL). LTQL does not rely on assumptions about the environment and hence is generally applicable to any collaborative MARL scenario. We derive LTQL as a stochastic approximation to a dynamic programming method we introduce in this work. We conclude the paper by providing experiments (both in the tabular and deep settings) that illustrate the claims. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/cassano21a.html
http://proceedings.mlr.press/v130/cassano21a.html Feedback Coding for Active Learning The iterative selection of examples for labeling in active machine learning is conceptually similar to feedback channel coding in information theory: in both tasks, the objective is to seek a minimal sequence of actions to encode information in the presence of noise. While this high-level overlap has been previously noted, there remain open questions on how to best formulate active learning as a communications system to leverage existing analysis and algorithms in feedback coding. In this work, we formally identify and leverage the structural commonalities between the two problems, including the characterization of encoder and noisy channel components, to design a new algorithm. Specifically, we develop an optimal transport-based feedback coding scheme called Approximate Posterior Matching (APM) for the task of active example selection and explore its application to Bayesian logistic regression, a popular model in active learning. We evaluate APM on a variety of datasets and demonstrate learning performance comparable to existing active learning methods, at a reduced computational cost. These results demonstrate the potential of directly deploying concepts from feedback channel coding to design efficient active learning strategies. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/canal21a.html
http://proceedings.mlr.press/v130/canal21a.html Learning Bijective Feature Maps for Linear ICA Separating high-dimensional data like images into independent latent factors, i.e., independent component analysis (ICA), remains an open research problem. As we show, existing probabilistic deep generative models (DGMs), which are tailor-made for image data, underperform on non-linear ICA tasks. To address this, we propose a DGM which combines bijective feature maps with a linear ICA model to learn interpretable latent structures for high-dimensional data. Given the complexities of jointly training such a hybrid model, we introduce novel theory that constrains linear ICA to lie close to the manifold of orthogonal rectangular matrices, the Stiefel manifold. By doing so we create models that converge quickly, are easy to train, and achieve better unsupervised latent factor discovery than flow-based models, linear ICA, and Variational Autoencoders on images. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/camuto21b.html
http://proceedings.mlr.press/v130/camuto21b.html Towards a Theoretical Understanding of the Robustness of Variational Autoencoders We make inroads into understanding the robustness of Variational Autoencoders (VAEs) to adversarial attacks and other input perturbations. While previous work has developed algorithmic approaches to attacking and defending VAEs, there remains a lack of formalization for what it means for a VAE to be robust. To address this, we develop a novel criterion for robustness in probabilistic models: $r$-robustness. We then use this to construct the first theoretical results for the robustness of VAEs, deriving margins in the input space for which we can provide guarantees about the resulting reconstruction. Informally, we are able to define a region within which any perturbation will produce a reconstruction that is similar to the original reconstruction. To support our analysis, we show that VAEs trained using disentangling methods not only score well under our robustness metrics, but that the reasons for this can be interpreted through our theoretical results. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/camuto21a.html
http://proceedings.mlr.press/v130/camuto21a.html Identification of Matrix Joint Block Diagonalization Given a set $\mathcal{C}=\{C_i\}_{i=1}^m$ of square matrices, the matrix blind joint block diagonalization problem (BJBDP) is to find a full column rank matrix $A$ such that $C_i=A\Sigma_iA^{\top}$ for all $i$, where $\Sigma_i$’s are all block diagonal matrices with as many diagonal blocks as possible. The BJBDP plays an important role in independent subspace analysis. This paper considers the identification problem for BJBDP, that is, under what conditions and by what means, we can identify the diagonalizer $A$ and the block diagonal structure of $\Sigma_i$, especially when there is noise in $C_i$’s. In this paper, we propose a “bi-block diagonalization” method to solve BJBDP, and establish sufficient conditions for when the method is able to accomplish the task. Numerical simulations validate our theoretical results. To the best of the authors’ knowledge, current numerical methods for BJBDP have no theoretical guarantees for the identification of the exact solution, whereas our method does. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/cai21a.html
http://proceedings.mlr.press/v130/cai21a.html Why did the distribution change? We describe a formal approach based on graphical causal models to identify the "root causes" of the change in the probability distribution of variables. After factorizing the joint distribution into conditional distributions of each variable, given its parents (the "causal mechanisms"), we attribute the change to changes of these causal mechanisms. This attribution analysis accounts for the fact that mechanisms often change independently and sometimes only some of them change. Through simulations, we study the performance of our distribution change attribution proposal. We then present a real-world case study identifying the drivers of the difference in the income distribution between men and women. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/budhathoki21a.html
http://proceedings.mlr.press/v130/budhathoki21a.html A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks When equipped with efficient optimization algorithms, over-parameterized neural networks have demonstrated a high level of performance even though the loss function is non-convex and non-smooth. While many works have focused on understanding the loss dynamics of neural networks trained with gradient descent (GD), in this work we consider a broad class of optimization algorithms that are commonly used in practice. For example, we show from a dynamical system perspective that the Heavy Ball (HB) method can converge to the global minimum on mean squared error (MSE) at a linear rate (similar to GD); however, the Nesterov accelerated gradient descent (NAG) may only converge to the global minimum sublinearly. Our results rely on the connection between the neural tangent kernel (NTK) and finitely-wide over-parameterized neural networks with ReLU activation, which leads to analyzing the limiting ordinary differential equations (ODE) for optimization algorithms. We show that optimizing the non-convex loss over the weights corresponds to optimizing some strongly convex loss over the prediction error. As a consequence, we can leverage the classical convex optimization theory to understand the convergence behavior of neural networks. We believe our approach can also be extended to other optimization algorithms and network architectures. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bu21a.html
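The update rules named in the abstract above (GD and Heavy Ball) can be sketched on a toy problem. This is only an illustration of the two iterations on a one-dimensional strongly convex quadratic; it is not the paper's neural-network analysis, and the function names, step size, and momentum value are assumptions chosen for the example.

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent: x_{k+1} = x_k - lr * grad(x_k)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def heavy_ball(grad, x0, lr, momentum, steps):
    """Heavy Ball: adds a momentum term proportional to the previous displacement."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - lr * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x


def quadratic_grad(x):
    """Gradient of the toy loss f(x) = (x - 3)^2 (hypothetical example loss)."""
    return 2.0 * (x - 3.0)


x_gd = gradient_descent(quadratic_grad, x0=0.0, lr=0.1, steps=100)
x_hb = heavy_ball(quadratic_grad, x0=0.0, lr=0.1, momentum=0.5, steps=100)
```

On this toy loss both iterates approach the minimizer at 3; the sketch says nothing about the relative rates established in the paper.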
http://proceedings.mlr.press/v130/bu21a.html On the convergence of the Metropolis algorithm with fixed-order updates for multivariate binary probability distributions The Metropolis algorithm is arguably the most fundamental Markov chain Monte Carlo (MCMC) method. But the algorithm is not guaranteed to converge to the desired distribution in the case of multivariate binary distributions (e.g., Ising models or stochastic neural networks such as Boltzmann machines) if the variables (sites or neurons) are updated in a fixed order, a setting commonly used in practice. The reason is that the corresponding Markov chain may not be irreducible. We propose a modified Metropolis transition operator that behaves almost always identically to the standard Metropolis operator and prove that it ensures irreducibility and convergence to the limiting distribution in the multivariate binary case with fixed-order updates. The result provides an explanation for the behaviour of Metropolis MCMC in that setting and closes a long-standing theoretical gap. We experimentally studied the standard and modified Metropolis operator for models where they actually behave differently. If the standard algorithm also converges, the modified operator exhibits similar (if not better) performance in terms of convergence speed. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/brugge21a.html
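A toy illustration of the reducibility issue described in the abstract above, assuming the standard Metropolis acceptance rule with single-bit flip proposals. The uniform target and the helper names are assumptions made for this sketch and are not taken from the paper. Under a uniform target every flip is accepted, so a fixed-order sweep deterministically complements the state and the sweep-level chain only ever visits two of the eight states.

```python
import math
import random


def metropolis_sweep(x, log_p, rng):
    """One fixed-order sweep: propose flipping each bit in index order."""
    x = list(x)
    for i in range(len(x)):
        proposal = x.copy()
        proposal[i] = 1 - proposal[i]
        # Standard Metropolis acceptance probability min(1, p(x') / p(x)).
        accept_prob = min(1.0, math.exp(log_p(proposal) - log_p(x)))
        if rng.random() < accept_prob:
            x = proposal
    return tuple(x)


def uniform_log_p(x):
    """Unnormalized log-probability of a uniform target (toy assumption)."""
    return 0.0


rng = random.Random(0)
state = (0, 1, 0)
visited = {state}
for _ in range(10):
    state = metropolis_sweep(state, uniform_log_p, rng)
    visited.add(state)
```

Here `visited` contains only the start state and its bitwise complement, which is the kind of failure to explore the state space that the abstract attributes to fixed-order updates.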
http://proceedings.mlr.press/v130/brugge21a.html Follow Your Star: New Frameworks for Online Stochastic Matching with Known and Unknown Patience We study several generalizations of the Online Bipartite Matching problem. We consider settings with stochastic rewards, patience constraints, and weights (considering both vertex- and edge-weighted variants). We introduce a stochastic variant of the patience-constrained problem, where the patience is chosen randomly according to some known distribution and is not known in advance. We also consider stochastic arrival settings (i.e. the nature in which the online vertices arrive is determined by a known random process), which are natural settings that are able to beat the hard worst-case bounds of adversarial arrivals. We design black-box algorithms for star graphs under various models of patience, which solve the problem optimally for deterministic or geometrically-distributed patience, and yield a 1/2-approximation for any patience distribution. These star graph algorithms are then used as black boxes to solve the online matching problems under different arrival settings. We show improved (or first-known) competitive ratios for these problems. We also present negative results that include formalizing the concept of a stochasticity gap for LP upper bounds on these problems, showing some new stochasticity gaps for popular LPs, and bounding the worst-case performance of some greedy approaches. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/brubach21a.html
http://proceedings.mlr.press/v130/brubach21a.html Clustering multilayer graphs with missing nodes Relationships between agents can be conveniently represented by graphs. When these relationships have different modalities, they are better modelled by multilayer graphs where each layer is associated with one modality. Such graphs arise naturally in many contexts including biological and social networks. Clustering is a fundamental problem in network analysis where the goal is to regroup nodes with similar connectivity profiles. In the past decade, various clustering methods have been extended from the unilayer setting to multilayer graphs in order to incorporate the information provided by each layer. While most existing works assume, rather restrictively, that all layers share the same set of nodes, we propose a new framework that allows for layers to be defined on different sets of nodes. In particular, the nodes not recorded in a layer are treated as missing. Within this paradigm, we investigate several generalizations of well-known clustering methods in the complete setting to the incomplete one and prove consistency results under the Multi-Layer Stochastic Block Model assumption. Our theoretical results are complemented by thorough numerical comparisons between our proposed algorithms on synthetic data, and also on several real datasets, thus highlighting the promising behaviour of our methods in various realistic settings. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/braun21a.html
http://proceedings.mlr.press/v130/braun21a.html Rate-Regularization and Generalization in Variational Autoencoders Variational autoencoders (VAEs) optimize an objective that comprises a reconstruction loss (the distortion) and a KL term (the rate). The rate is an upper bound on the mutual information, which is often interpreted as a regularizer that controls the degree of compression. We here examine whether inclusion of the rate term also improves generalization. We perform rate-distortion analyses in which we control the strength of the rate term, the network capacity, and the difficulty of the generalization problem. Lowering the strength of the rate term paradoxically improves generalization in most settings, and reducing the mutual information typically leads to underfitting. Moreover, we show that generalization performance continues to improve even after the mutual information saturates, indicating that the gap on the bound (i.e. the KL divergence relative to the inference marginal) affects generalization. This suggests that the standard spherical Gaussian prior is not an inductive bias that typically improves generalization, prompting further work to understand what choices of priors improve generalization in VAEs. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bozkurt21a.html
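The rate and distortion terms in the abstract above can be written out in standard VAE notation. The symbols below ($q(z\mid x)$ for the encoder, $p(x\mid z)$ for the decoder, $p(z)$ for the prior, $\beta$ for the strength of the rate term) are the usual conventions and are assumed here rather than taken from the paper.

```latex
\begin{align}
\mathcal{L}_\beta(x)
  &= \underbrace{-\mathbb{E}_{q(z\mid x)}\bigl[\log p(x\mid z)\bigr]}_{\text{distortion } D}
   + \beta\,\underbrace{\mathrm{KL}\bigl(q(z\mid x)\,\|\,p(z)\bigr)}_{\text{rate } R}, \\
\mathbb{E}_{p_{\mathrm{data}}(x)}[R]
  &= I_q(x;z) + \mathrm{KL}\bigl(q(z)\,\|\,p(z)\bigr) \;\ge\; I_q(x;z),
\qquad q(z) = \mathbb{E}_{p_{\mathrm{data}}(x)}\bigl[q(z\mid x)\bigr].
\end{align}
```

The second line is the standard decomposition behind the statement that the rate upper-bounds the mutual information: the gap is the KL divergence between the inference marginal $q(z)$ and the prior.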
http://proceedings.mlr.press/v130/bozkurt21a.html Nonlinear Functional Output Regression: A Dictionary Approach To address functional-output regression, we introduce projection learning (PL), a novel dictionary-based approach that learns to predict a function that is expanded on a dictionary while minimizing an empirical risk based on a functional loss. PL makes it possible to use non-orthogonal dictionaries and can then be combined with dictionary learning; it is thus much more flexible than expansion-based approaches relying on vectorial losses. This general method is instantiated with reproducing kernel Hilbert spaces of vector-valued functions as kernel-based projection learning (KPL). For the functional square loss, two closed-form estimators are proposed, one for fully observed output functions and the other for partially observed ones. Both are backed theoretically by an excess risk analysis. Then, in the more general setting of integral losses based on differentiable ground losses, KPL is implemented using first-order optimization for both fully and partially observed output functions. Finally, several robustness aspects of the proposed algorithms are highlighted on a toy dataset, and a study on two real datasets shows that they are competitive compared to other nonlinear approaches. Notably, using the square loss and a learnt dictionary, KPL enjoys a particularly attractive trade-off between computational cost and performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bouche21a.html
http://proceedings.mlr.press/v130/bouche21a.html Calibrated Adaptive Probabilistic ODE Solvers Probabilistic solvers for ordinary differential equations assign a posterior measure to the solution of an initial value problem. The joint covariance of this distribution provides an estimate of the (global) approximation error. The contraction rate of this error estimate as a function of the solver’s step-size identifies it as a well-calibrated worst-case error, but its explicit numerical value for a certain step size is not automatically a good estimate of the explicit error. Addressing this issue, we introduce, discuss, and assess several probabilistically motivated ways to calibrate the uncertainty estimate. Numerical experiments demonstrate that these calibration methods interact efficiently with adaptive step-size selection, resulting in descriptive, and efficiently computable posteriors. We demonstrate the efficiency of the methodology by benchmarking against the classic, widely used Dormand-Prince 4/5 Runge-Kutta method. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bosch21a.html
http://proceedings.mlr.press/v130/bosch21a.html Matérn Gaussian Processes on Graphs Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic partial differential equation characterization of Matérn Gaussian processes—a widely-used model class in the Euclidean setting—to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs and provide techniques that allow them to be trained using standard methods, such as inducing points. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings, thereby making them more accessible to practitioners and easier to deploy within larger learning frameworks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/borovitskiy21a.html
http://proceedings.mlr.press/v130/borovitskiy21a.html Stochastic Linear Bandits Robust to Adversarial Attacks We consider a stochastic linear bandit problem in which the rewards are not only subject to random noise, but also adversarial attacks subject to a suitable budget $C$ (i.e., an upper bound on the sum of corruption magnitudes across the time horizon). We provide two variants of a Robust Phased Elimination algorithm, one that knows $C$ and one that does not. Both variants are shown to attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms respectively having a linear and quadratic dependency on $C$ in general. We present algorithm-independent lower bounds showing that these additive terms are near-optimal. In addition, in a contextual setting, we revisit a setup of diverse contexts, and show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bogunovic21a.html
http://proceedings.mlr.press/v130/bogunovic21a.html Learning Complexity of Simulated Annealing Simulated annealing is an effective and general means of optimization. It is in fact inspired by metallurgy, where the temperature of a material determines its behavior in thermodynamics. Likewise, in simulated annealing, the actions that the algorithm takes depend entirely on the value of a variable which captures the notion of temperature. Typically, simulated annealing starts with a high temperature, which makes the algorithm highly unpredictable, and gradually cools the temperature down to become more stable. A key component that plays a crucial role in the performance of simulated annealing is the criterion under which the temperature changes, namely the cooling schedule. Motivated by this, we study the following question in this work: "Given enough samples to the instances of a specific class of optimization problems, can we design optimal (or approximately optimal) cooling schedules that minimize the runtime or maximize the success rate of the algorithm on average when the underlying problem is drawn uniformly at random from the same class?" We provide positive results both in terms of sample complexity and simulation complexity. For sample complexity, we show that $O(m^{1/2})$ samples suffice to find an approximately optimal cooling schedule of length $m$. We complement this result by giving a lower bound of $\Omega(m^{1/3})$ on the sample complexity of any learning algorithm that provides an almost optimal cooling schedule. These results are general and rely on no assumption. For simulation complexity, however, we make additional assumptions to measure the success rate of an algorithm. To this end, we introduce the monotone stationary graph that models the performance of simulated annealing. Based on this model, we present polynomial time algorithms with provable guarantees for the learning problem. Thu, 18 Mar 2021 00:00:00 +0000
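As a rough illustration of the role a cooling schedule plays in the simulated annealing abstract above, here is a minimal sketch. It is not the paper's method: the geometric schedule, the Metropolis acceptance rule, the toy objective, and all function names (`geometric_schedule`, `simulated_annealing`, `objective`, `neighbor`) are assumptions chosen for illustration. The schedule is simply a list of `m` temperatures, which is the object whose choice the abstract proposes to learn.

```python
import math
import random


def geometric_schedule(t_start, t_end, m):
    """Hypothetical cooling schedule: m temperatures decaying geometrically."""
    if m == 1:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (m - 1))
    return [t_start * ratio ** k for k in range(m)]


def simulated_annealing(objective, neighbor, x0, schedule, rng):
    """Minimize `objective`, taking one step per temperature in `schedule`."""
    x, fx = x0, objective(x0)
    best_x, best_fx = x, fx
    for temperature in schedule:
        y = neighbor(x, rng)
        fy = objective(y)
        # Metropolis rule (an assumption here): always accept improvements,
        # accept worse moves with probability exp(-(fy - fx) / temperature).
        if fy <= fx or rng.random() < math.exp(-(fy - fx) / temperature):
            x, fx = y, fy
            if fx < best_fx:
                best_x, best_fx = x, fx
    return best_x, best_fx


def objective(x):
    # Toy objective on the integers, minimized at x = 3.
    return (x - 3) ** 2


def neighbor(x, rng):
    # Toy neighborhood: move one step left or right.
    return x + rng.choice([-1, 1])


schedule = geometric_schedule(10.0, 0.01, 200)
best_x, best_fx = simulated_annealing(
    objective, neighbor, 0, schedule, random.Random(0)
)
print(best_x, best_fx)
```

High early temperatures make almost any move acceptable, while low late temperatures make the search nearly greedy. Swapping `geometric_schedule` for another list of temperatures changes the behavior without touching the search loop.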
http://proceedings.mlr.press/v130/blum21a.html
http://proceedings.mlr.press/v130/blum21a.html Differentiable Divergences Between Time Series Computing the discrepancy between time series of variable sizes is notoriously challenging. While dynamic time warping (DTW) is popularly used for this purpose, it is not differentiable everywhere and is known to lead to bad local optima when used as a “loss”. Soft-DTW addresses these issues, but it is not a positive definite divergence: due to the bias introduced by entropic regularization, it can be negative and it is not minimized when the time series are equal. We propose in this paper a new divergence, dubbed soft-DTW divergence, which aims to correct these issues. We study its properties; in particular, under conditions on the ground cost, we show that it is a valid divergence: it is non-negative and minimized if and only if the two time series are equal. We also propose a new “sharp” variant by further removing entropic bias. We showcase our divergences on time series averaging and demonstrate significant accuracy improvements compared to both DTW and soft-DTW on 84 time series classification datasets. Thu, 18 Mar 2021 00:00:00 +0000
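The entropic bias described in the abstract above can be seen in a small sketch. The code below assumes scalar time series, a squared-difference ground cost, and the debiased form D(x, y) = SDTW(x, y) - (SDTW(x, x) + SDTW(y, y)) / 2 for the soft-DTW divergence. The function names are hypothetical, and whether this particular ground cost meets the abstract's conditions for a valid divergence is not checked here.

```python
import math


def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two scalar sequences (squared-difference cost)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # R[i][j] is the soft alignment cost of x[:i] against y[:j].
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            R[i][j] = cost + softmin(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[n][m]


def soft_dtw_divergence(x, y, gamma=1.0):
    """Debiased soft-DTW: subtract the average of the two self-comparison terms."""
    return soft_dtw(x, y, gamma) - 0.5 * (
        soft_dtw(x, x, gamma) + soft_dtw(y, y, gamma)
    )


x = [0.0, 0.0, 0.0]
y = [5.0, 5.0, 5.0]
self_value = soft_dtw(x, x)           # negative: the entropic bias
self_div = soft_dtw_divergence(x, x)  # zero after debiasing
cross_div = soft_dtw_divergence(x, y)
print(self_value, self_div, cross_div)
```

In this example every alignment of `x` with itself has zero cost, so the smoothed minimum falls below zero, which is the bias the abstract refers to. Subtracting the self-comparison terms removes it.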
http://proceedings.mlr.press/v130/blondel21a.html
http://proceedings.mlr.press/v130/blondel21a.html On the Absence of Spurious Local Minima in Nonlinear Low-Rank Matrix Recovery Problems The restricted isometry property (RIP) is a well-known condition that guarantees the absence of spurious local minima in low-rank matrix recovery problems with linear measurements. In this paper, we introduce a novel property named bound difference property (BDP) to study low-rank matrix recovery problems with nonlinear measurements. Using RIP and BDP jointly, we propose a new criterion to certify the nonexistence of spurious local minima in the rank-1 case, and prove that it leads to a much stronger theoretical guarantee than the existing bounds on RIP. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bi21a.html
http://proceedings.mlr.press/v130/bi21a.html Efficient Statistics for Sparse Graphical Models from Truncated Samples In this paper, we study high-dimensional estimation from truncated samples. We focus on two fundamental and classical problems: (i) inference of sparse Gaussian graphical models and (ii) support recovery of sparse linear models. (i) For Gaussian graphical models, suppose d-dimensional samples x are generated from a Gaussian N(mu, Sigma) and observed only if they belong to a subset S of R^d. We show that mu and Sigma can be estimated with error epsilon in the Frobenius norm, using O(nz(Sigma^{-1})/epsilon^2) samples from a truncated N(mu, Sigma) and having access to a membership oracle for S. The set S is assumed to have non-trivial measure under the unknown distribution but is otherwise arbitrary. (ii) For sparse linear regression, suppose samples (x,y) are generated where y = <x,Omega*> + N(0,1) and (x, y) is seen only if y belongs to a truncation set S of the reals. We consider the case that Omega* is sparse with a support set of size k. Our main result is to establish precise conditions on the problem dimension d, the support size k, the number of observations n, and properties of the samples and the truncation that are sufficient to recover the support of Omega*. Specifically, we show that under some mild assumptions, only O(k^2 log d) samples are needed to estimate Omega* in the infinity-norm up to a bounded error. Similar results are also established for estimating Omega* in the Euclidean norm up to arbitrary error. For both problems, our estimator minimizes the sum of the finite population negative log-likelihood function and an ell_1-regularization term. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bhattacharyya21a.html
http://proceedings.mlr.press/v130/bhattacharyya21a.html Differentiable Causal Discovery Under Unmeasured Confounding The data drawn from biological, economic, and social systems are often confounded due to the presence of unmeasured variables. Prior work in causal discovery has focused on discrete search procedures for selecting acyclic directed mixed graphs (ADMGs), specifically ancestral ADMGs, that encode ordinary conditional independence constraints among the observed variables of the system. However, confounded systems also exhibit more general equality restrictions that cannot be represented via these graphs, placing a limit on the kinds of structures that can be learned using ancestral ADMGs. In this work, we derive differentiable algebraic constraints that fully characterize the space of ancestral ADMGs, as well as more general classes of ADMGs, arid ADMGs and bow-free ADMGs, that capture all equality restrictions on the observed variables. We use these constraints to cast causal discovery as a continuous optimization problem and design differentiable procedures to find the best fitting ADMG when the data comes from a confounded linear system of equations with correlated errors. We demonstrate the efficacy of our method through simulations and application to a protein expression dataset. Code implementing our methods is open-source and publicly available at https://gitlab.com/rbhatta8/dcd and will be incorporated into the Ananke package. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bhattacharya21a.html
http://proceedings.mlr.press/v130/bhattacharya21a.html Power of Hints for Online Learning with Movement Costs We consider the online linear optimization problem with movement costs, a variant of online learning in which the learner must not only respond to cost vectors $c_t$ with points $x_t$ in order to maintain low regret, but is also penalized for movement by an additional cost $\|x_t-x_{t+1}\|^{1+\epsilon}$ for some $\epsilon>0$. Classically, simple algorithms that obtain the optimal $\sqrt{T}$ regret already are very stable and do not incur a significant movement cost. However, recent work has shown that when the learning algorithm is provided with weak “hint” vectors that have a positive correlation with the costs, the regret can be significantly improved to $\log(T)$. In this work, we study the stability of such algorithms, and provide matching upper and lower bounds showing that incorporating movement costs results in intricate tradeoffs between $\log(T)$ when $\epsilon\ge 1$ and $\sqrt{T}$ regret when $\epsilon=0$. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bhaskara21b.html
http://proceedings.mlr.press/v130/bhaskara21b.html Principal Component Regression with Semirandom Observations via Matrix Completion Principal Component Regression (PCR) is a popular method for prediction from data, and is one way to address the so-called multi-collinearity problem in regression. It was shown recently that algorithms for PCR such as hard singular value thresholding (HSVT) are also quite robust, in that they can handle data that has missing or noisy covariates. However, such spectral approaches require strong distributional assumptions on which entries are observed. Specifically, every covariate is assumed to be observed with probability (exactly) $p$, for some value of $p$. Our goal in this work is to weaken this requirement, and as a step towards this, we study a “semi-random” model. In this model, every covariate is revealed with probability $p$, and then an adversary comes in and reveals additional covariates. While the model seems intuitively easier, it is well known that algorithms such as HSVT perform poorly. Our approach is based on studying the closely related problem of Noisy Matrix Completion in a semi-random setting. By considering a new semidefinite programming relaxation, we develop new guarantees for matrix completion, which is our core technical contribution. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bhaskara21a.html
http://proceedings.mlr.press/v130/bhaskara21a.html On the Linear Convergence of Policy Gradient Methods for Finite MDPs We revisit the finite time analysis of policy gradient methods in one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact gradient evaluations. There has been some recent work viewing this setting as an instance of smooth non-linear optimization problems, to show sub-linear convergence rates with small step-sizes. Here, we take a completely different perspective based on illuminating connections with policy iteration, to show how many variants of policy gradient algorithms succeed with large step-sizes and attain a linear rate of convergence. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bhandari21a.html
http://proceedings.mlr.press/v130/bhandari21a.html Anderson acceleration of coordinate descent Acceleration of first order methods is mainly obtained via inertia à la Nesterov, or via nonlinear extrapolation. The latter has known a recent surge of interest, with successful applications to gradient and proximal gradient techniques. On multiple Machine Learning problems, coordinate descent achieves performance significantly superior to full-gradient methods. Speeding up coordinate descent in practice is not easy: inertially accelerated versions of coordinate descent are theoretically accelerated, but might not always lead to practical speed-ups. We propose an accelerated version of coordinate descent using extrapolation, showing considerable speed up in practice, compared to inertial accelerated coordinate descent and extrapolated (proximal) gradient descent. Experiments on least squares, Lasso, elastic net and logistic regression validate the approach. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bertrand21a.html
http://proceedings.mlr.press/v130/bertrand21a.html Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings where experimentation is limited, such as healthcare. But, in these very same settings, observed actions are often confounded by unobserved variables making OPE even more difficult. We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders. We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data. Our method involves two stages. In the first, we show how to use proxies to estimate stationary distribution ratios, extending recent work on breaking the curse of horizon to the confounded setting. In the second, we show optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions. We establish theoretical guarantees of consistency and benchmark our method empirically. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bennett21a.html
http://proceedings.mlr.press/v130/bennett21a.html Interpretable Random Forests via Rule Extraction We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm, which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as “black boxes” because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly restrictive for applications with critical decisions at stake. On the other hand, algorithms with a simple structure (typically decision trees, rule algorithms, or sparse linear models) are well known for their instability. This undesirable feature makes the conclusions of the data analysis unreliable and turns out to be a strong operational limitation. This motivates the design of SIRUS, based on random forests, which combines a simple structure, a remarkably stable behavior when data is perturbed, and an accuracy comparable to its competitors. We demonstrate the efficiency of the method both empirically (through experiments) and theoretically (with the proof of its asymptotic stability). An R/C++ software implementation, sirus, is available from CRAN. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/benard21a.html
http://proceedings.mlr.press/v130/benard21a.html Optimizing Percentile Criterion using Robust MDPs We address the problem of computing reliable policies in reinforcement learning problems with limited data. In particular, we compute policies that achieve good returns with high confidence when deployed. This objective, known as the percentile criterion, can be optimized using Robust MDPs (RMDPs). RMDPs generalize MDPs to allow for uncertain transition probabilities chosen adversarially from given ambiguity sets. We show that the RMDP solution’s sub-optimality depends on the spans of the ambiguity sets along the value function. We then propose new algorithms that minimize the span of ambiguity sets defined by weighted L1 and L-infinity norms. Our primary focus is on Bayesian guarantees, but we also describe how our methods apply to frequentist guarantees and derive new concentration inequalities for weighted L1 and L-infinity norms. Experimental results indicate that our optimized ambiguity sets improve significantly on prior construction methods. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/behzadian21a.html
http://proceedings.mlr.press/v130/behzadian21a.html Understanding and Mitigating Exploding Inverses in Invertible Neural Networks Invertible neural networks (INNs) have been used to design generative models, implement memory-saving gradient computation, and solve inverse problems. In this work, we show that commonly-used INN architectures suffer from exploding inverses and are thus prone to becoming numerically non-invertible. Across a wide range of INN use-cases, we reveal failures including the non-applicability of the change-of-variables formula on in- and out-of-distribution (OOD) data, incorrect gradients for memory-saving backprop, and the inability to sample from normalizing flow models. We further derive bi-Lipschitz properties of atomic building blocks of common architectures. These insights into the stability of INNs then provide ways forward to remedy these failures. For tasks where local invertibility is sufficient, like memory-saving backprop, we propose a flexible and efficient regularizer. For problems where global invertibility is necessary, such as applying normalizing flows on OOD data, we show the importance of designing stable INN building blocks. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/behrmann21a.html
http://proceedings.mlr.press/v130/behrmann21a.html Gaming Helps! Learning from Strategic Interactions in Natural Dynamics We consider an online regression setting in which individuals adapt to the regression model: arriving individuals may access the model throughout the process, and invest strategically in modifying their own features so as to improve their predicted score. Such feature manipulation, or “gaming”, has been observed in various scenarios—from credit assessment to school admissions, posing a challenge for the learner. Surprisingly, we find that such strategic manipulation may in fact help the learner recover the meaningful variables in settings where an agent can invest in improving meaningful features—that is, the features that, when changed, affect the true label, as opposed to non-meaningful features that have no effect. We show that even simple behavior on the learner’s part allows her to simultaneously i) accurately recover the meaningful features, and ii) incentivize agents to invest in these meaningful features, providing incentives for improvement. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bechavod21a.html
http://proceedings.mlr.press/v130/bechavod21a.html Contextual Blocking Bandits We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms’ mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. The above contextual setting captures important scenarios such as recommendation systems or ad placement with diverse users. This problem has been recently studied [Dickerson et al., AAAI 2018] in the full-information setting (i.e., assuming knowledge of the mean context-dependent arm rewards), where competitive ratio bounds have been derived. We focus on the bandit setting, where these means are initially unknown; we propose a UCB-based variant of the full-information algorithm that guarantees a $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-optimal strategy in $T$ time steps, matching the $\Omega(\log(T))$ regret lower bound in this setting. Due to the time correlations caused by blocking, existing techniques for upper bounding regret fail. For proving our regret bounds, we introduce the novel concepts of delayed exploitation and opportunistic subsampling and combine them with ideas from combinatorial bandits and non-stationary Markov chains coupling. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/basu21a.html
http://proceedings.mlr.press/v130/basu21a.html Logistic Q-Learning We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables efficient exact model-free implementation. The main feature of our algorithm (called QREPS) is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error. We provide a practical saddle-point optimization method for minimizing this loss function and provide an error-propagation analysis that relates the quality of the individual updates to the performance of the output policy. Finally, we demonstrate the effectiveness of our method on a range of benchmark problems. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bas-serrano21a.html
http://proceedings.mlr.press/v130/bas-serrano21a.html Implicit Regularization via Neural Feature Alignment We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al. (2018), along a small number of task-relevant directions. This can be interpreted as a combined mechanism of feature selection and compression. By extrapolating a new analysis of Rademacher complexity bounds for linear models, we motivate and study a heuristic complexity measure that captures this phenomenon, in terms of sequences of tangent kernel classes along optimization paths. The code for our experiments is available at https://github.com/tfjgeorge/ntk_alignment. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/baratin21a.html
http://proceedings.mlr.press/v130/baratin21a.html Fenchel-Young Losses with Skewed Entropies for Class-posterior Probability Estimation We study class-posterior probability estimation (CPE) for binary responses where one class has far fewer data than the other. For example, events such as species co-occurrence in ecology and wars in political science are often much rarer than non-events. Logistic regression has been widely used for CPE, although it tends to underestimate the probability of rare events. Its main drawback is the symmetry of the logit link: symmetric links can be misled by small and imbalanced samples because they are incentivized to overestimate the majority class with finite samples. Parametric skewed links have been proposed to overcome this limitation, but their estimation usually results in nonconvex optimization, unlike the logit link. Such nonconvexity is knotty not only from the computational viewpoint but also in terms of parameter identifiability. In this paper, we provide a procedure to derive a convex loss for a skewed link based on the recently proposed Fenchel-Young losses. The derived losses are always convex and have a nice property suitable for class imbalance. The simulation shows the practicality of the derived losses. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bao21b.html
http://proceedings.mlr.press/v130/bao21b.html One-Round Communication Efficient Distributed M-Estimation Communication cost and local computation complexity are two main bottlenecks of distributed statistical learning. In this paper, we consider the distributed M-estimation problem in both the regular and sparse cases and propose a novel one-round communication efficient algorithm. For the regular distributed M-estimator, asymptotic normality is provided to conduct statistical inference. For the sparse distributed M-estimator, we only require solving a quadratic Lasso problem in the master machine using the same local information as the regular distributed M-estimator. Consequently, the computation complexity of the local machine is sufficiently reduced compared with the existing debiased sparse estimator. Under mild conditions, the theoretical results guarantee that our proposed distributed estimators achieve a (near-)optimal statistical convergence rate. The effectiveness of our proposed algorithm is verified through experiments across different M-estimation problems using both synthetic and real benchmark datasets. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bao21a.html
http://proceedings.mlr.press/v130/bao21a.html The Sample Complexity of Level Set Approximation We study the problem of approximating the level set of an unknown function by sequentially querying its values. We introduce a family of algorithms called Bisect and Approximate through which we reduce the level set approximation problem to a local function approximation problem. We then show how this approach leads to rate-optimal sample complexity guarantees for Hölder functions, and we investigate how such rates improve when additional smoothness or other structural assumptions hold true. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/bachoc21a.html
http://proceedings.mlr.press/v130/bachoc21a.html An Optimal Reduction of TV-Denoising to Adaptive Online Learning We consider the problem of estimating a function from $n$ noisy samples whose discrete Total Variation (TV) is bounded by $C_n$. We reveal a deep connection to the seemingly disparate problem of Strongly Adaptive online learning [Daniely et al., 2015] and provide an $O(n \log n)$ time algorithm that attains the near minimax optimal rate of $\tilde O (n^{1/3}C_n^{2/3})$ under squared error loss. The resulting algorithm runs online and optimally adapts to the unknown smoothness parameter $C_n$. This leads to a new and more versatile alternative to wavelet-based methods for (1) adaptively estimating TV bounded functions; (2) online forecasting of TV bounded trends in time series. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/baby21a.html
http://proceedings.mlr.press/v130/baby21a.html Nearest Neighbour Based Estimates of Gradients: Sharp Nonasymptotic Bounds and Applications Motivated by a wide variety of applications, ranging from stochastic optimization to dimension reduction through variable selection, the problem of estimating gradients accurately is of crucial importance in statistics and learning theory. We consider here the classic regression setup, where a real valued square integrable r.v. Y is to be predicted upon observing a (possibly high dimensional) random vector X by means of a predictive function f(X) as accurately as possible in the mean-squared sense and study a nearest-neighbour-based pointwise estimate of the gradient of the optimal predictive function, the regression function m(x)=E[Y | X=x]. Under classic smoothness conditions combined with the assumption that the tails of Y-m(X) are sub-Gaussian, we prove nonasymptotic bounds improving upon those obtained for alternative estimation methods. Beyond the novel theoretical results established, several illustrative numerical experiments have been carried out. The latter provide strong empirical evidence that the estimation method proposed works very well for various statistical problems involving gradient estimation, namely dimensionality reduction, stochastic gradient descent optimization and quantifying disentanglement. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ausset21a.html
http://proceedings.mlr.press/v130/ausset21a.html Counterfactual Representation Learning with Balancing Weights A key to causal inference with observational data is achieving balance in predictive features associated with each treatment type. Recent literature has explored representation learning to achieve this goal. In this work, we discuss the pitfalls of these strategies – such as a steep trade-off between achieving balance and predictive power – and present a remedy via the integration of balancing weights in causal learning. Specifically, we theoretically link balance to the quality of propensity estimation, emphasize the importance of identifying a proper target population, and elaborate on the complementary roles of feature balancing and weight adjustments. Using these concepts, we then develop an algorithm for flexible, scalable and accurate estimation of causal effects. Finally, we show how the learned weighted representations may serve to facilitate alternative causal learning procedures with appealing statistical features. We conduct an extensive set of experiments on both synthetic examples and standard benchmarks, and report encouraging results relative to state-of-the-art baselines. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/assaad21a.html
http://proceedings.mlr.press/v130/assaad21a.html Bandit algorithms: Letting go of logarithmic regret for statistical robustness We study regret minimization in a stochastic multi-armed bandit setting, and establish a fundamental trade-off between the regret suffered under an algorithm, and its statistical robustness. Considering broad classes of underlying arms’ distributions, we show that bandit learning algorithms with logarithmic regret are always inconsistent and that consistent learning algorithms always suffer a super-logarithmic regret. This result highlights the inevitable statistical fragility of all ‘logarithmic regret’ bandit algorithms available in the literature - for instance, if a UCB algorithm designed for 1-subGaussian distributions is used in a subGaussian setting with a mismatched variance parameter, the learning performance could be inconsistent. Next, we show a positive result: statistically robust and consistent learning performance is attainable if we allow the regret to be slightly worse than logarithmic. Specifically, we propose three classes of distribution oblivious algorithms that achieve an asymptotic regret that is arbitrarily close to logarithmic. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ashutosh21a.html
http://proceedings.mlr.press/v130/ashutosh21a.html Geometrically Enriched Latent Spaces A common assumption in generative models is that the generator immerses the latent space into a Euclidean ambient space. Instead, we consider the ambient space to be a Riemannian manifold, which allows for encoding domain knowledge through the associated Riemannian metric. Shortest paths can then be defined accordingly in the latent space to both follow the learned manifold and respect the ambient geometry. Through careful design of the ambient metric we can ensure that shortest paths are well-behaved even for deterministic generators that otherwise would exhibit a misleading bias. Experimentally we show that our approach improves the interpretability and the functionality of learned representations both using stochastic and deterministic generators. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/arvanitidis21a.html
http://proceedings.mlr.press/v130/arvanitidis21a.html Corralling Stochastic Bandit Algorithms We study the problem of corralling stochastic bandit algorithms, that is, combining multiple bandit algorithms designed for a stochastic environment, with the goal of devising a corralling algorithm that performs almost as well as the best base algorithm. We give two general algorithms for this setting, which we show benefit from favorable regret guarantees. We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward, and depends on the gap between the highest reward and other rewards. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/arora21a.html
http://proceedings.mlr.press/v130/arora21a.html When MAML Can Adapt Fast and How to Assist When It Cannot Model-Agnostic Meta-Learning (MAML) and its variants have achieved success in meta-learning tasks on many datasets and settings. Nonetheless, we have just started to understand and analyze how they are able to adapt fast to new tasks. In this work, we contribute by conducting a series of empirical and theoretical studies, and discover several interesting, previously unknown properties of the algorithm. First, we find MAML adapts better with a deep architecture even if the tasks need only a shallow one. Secondly, linear layers can be added to the output layers of a shallower model to increase the depth without altering the modelling capacity, leading to improved performance in adaptation. Alternatively, an external and separate neural network meta-optimizer can also be used to transform the gradient updates of a smaller model so as to obtain improved performance in adaptation. Drawing on this evidence, we theorize that for a deep neural network to meta-learn well, the upper layers must transform the gradients of the bottom layers as if the upper layers were an external meta-optimizer, operating on a smaller network that is composed of the bottom layers. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/arnold21a.html
http://proceedings.mlr.press/v130/arnold21a.html Deep Probabilistic Accelerated Evaluation: A Robust Certifiable Rare-Event Simulation Methodology for Black-Box Safety-Critical Systems Evaluating the reliability of intelligent physical systems against rare safety-critical events poses a huge testing burden for real-world applications. Simulation provides a useful platform to evaluate the extremal risks of these systems before their deployment. Importance Sampling (IS), while proven to be powerful for rare-event simulation, faces challenges in handling these learning-based systems due to their black-box nature that fundamentally undermines its efficiency guarantee, which can lead to under-estimation without being diagnostically detected. We propose a framework called Deep Probabilistic Accelerated Evaluation (Deep-PrAE) to design statistically guaranteed IS, by converting black-box samplers that are versatile but could lack guarantees, into one with what we call a relaxed efficiency certificate that allows accurate estimation of bounds on the safety-critical event probability. We present the theory of Deep-PrAE that combines the dominating point concept with rare-event set learning via deep neural network classifiers, and demonstrate its effectiveness in numerical examples including the safety-testing of an intelligent driving algorithm. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/arief21a.html
http://proceedings.mlr.press/v130/arief21a.html Faster & More Reliable Tuning of Neural Networks: Bayesian Optimization with Importance Sampling Many contemporary machine learning models require extensive tuning of hyperparameters to perform well. A variety of methods, such as Bayesian optimization, have been developed to automate and expedite this process. However, tuning remains extremely costly as it typically requires repeatedly fully training models. To address this issue, Bayesian optimization methods have been extended to use cheap, partially trained models to extrapolate to expensive complete models. While this approach enlarges the set of explored hyperparameters, including many low-fidelity observations adds to the intrinsic randomness of the procedure and makes extrapolation challenging. We propose to accelerate hyperparameter tuning for neural networks in a robust way by taking into account the relative amount of information contributed by each training example. To do so, we integrate importance sampling with Bayesian optimization, which significantly increases the quality of the black-box function evaluations and their runtime. To overcome the additional overhead cost of using importance sampling, we cast hyperparameter search as a multi-task Bayesian optimization problem over both hyperparameters and importance sampling design, which achieves the best of both worlds. Through learning a trade-off between training complexity and quality, our method improves upon validation error, in both the average and the worst case. We show that this results in more reliable performance of our method in less wall-clock time across a variety of datasets and complex neural architectures. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ariafar21a.html
http://proceedings.mlr.press/v130/ariafar21a.html Efficient Balanced Treatment Assignments for Experimentation In this work, we address the problem of balanced treatment assignment for experiments by considering an interpretation of the problem as optimization of a two-sample test between test and control units. Using this lens we provide an assignment algorithm that is optimal with respect to the minimum spanning tree test of Friedman and Rafsky [1979]. This assignment to treatment groups may be performed exactly in polynomial time and allows for the design of experiments explicitly targeting the individual treatment effect. We provide a probabilistic interpretation of this process in terms of the most probable element of designs drawn from a determinantal point process. We provide a novel formulation of estimation as transductive inference and show how the tree structures used in design can also be used in an adjustment estimator. We conclude with a simulation study demonstrating the improved efficacy of our method. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/arbour21a.html
http://proceedings.mlr.press/v130/arbour21a.html Direct-Search for a Class of Stochastic Min-Max Problems Recent applications in machine learning have renewed the interest of the community in min-max optimization problems. While gradient-based optimization methods are widely used to solve such problems, there are however many scenarios where these techniques are not well-suited, or even not applicable when the gradient is not accessible. We investigate the use of direct-search methods that belong to a class of derivative-free techniques that only access the objective function through an oracle. In this work, we design a novel algorithm in the context of min-max saddle point games where one sequentially updates the min and the max player. We prove convergence of this algorithm under mild assumptions, where the objective of the max-player satisfies the Polyak-Łojasiewicz (PL) condition, while the min-player is characterized by a nonconvex objective. Our method only assumes dynamically adjusted accurate estimates of the oracle with a fixed probability. To the best of our knowledge, our analysis is the first one to address the convergence of a direct-search method for min-max objectives in a stochastic setting. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/anagnostidis21a.html
http://proceedings.mlr.press/v130/anagnostidis21a.html Robust Learning under Strong Noise via SQs This work provides several new insights on the robustness of Kearns’ statistical query framework against challenging label-noise models. First, we build on a recent result by \cite{DBLP:journals/corr/abs-2006-04787} that showed noise tolerance of distribution-independently evolvable concept classes under Massart noise. Specifically, we extend their characterization to more general noise models, including the Tsybakov model which considerably generalizes the Massart condition by allowing the flipping probability to be arbitrarily close to $\frac{1}{2}$ for a subset of the domain. As a corollary, we employ an evolutionary algorithm by \cite{DBLP:conf/colt/KanadeVV10} to obtain the first polynomial time algorithm with arbitrarily small excess error for learning linear threshold functions over any spherically symmetric distribution in the presence of spherically symmetric Tsybakov noise. Moreover, we posit access to a stronger oracle, in which for every labeled example we additionally obtain its flipping probability. In this model, we show that every SQ learnable class admits an efficient learning algorithm with $\mathrm{opt} + \epsilon$ misclassification error for a broad class of noise models. This setting substantially generalizes the widely-studied problem of classification under RCN with known noise rate, and corresponds to a non-convex optimization problem even when the noise function – i.e. the flipping probabilities of all points – is known in advance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/anagnostides21a.html
http://proceedings.mlr.press/v130/anagnostides21a.html Automatic structured variational inference Stochastic variational inference offers an attractive option as a default method for differentiable probabilistic programming. However, the performance of the variational approach depends on the choice of an appropriate variational family. Here, we introduce automatic structured variational inference (ASVI), a fully automated method for constructing structured variational families, inspired by the closed-form update in conjugate Bayesian models. These pseudo-conjugate families incorporate the forward pass of the input probabilistic program and can therefore capture complex statistical dependencies. Pseudo-conjugate families have the same space and time complexity as the input probabilistic program and are therefore tractable for a very large family of models including both continuous and discrete variables. We validate our automatic variational method on a wide range of both low- and high-dimensional inference problems. We find that ASVI provides a clear improvement in performance when compared with other popular approaches such as mean field family and inverse autoregressive flows. We provide a fully automatic open source implementation of ASVI in TensorFlow Probability. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ambrogioni21a.html
http://proceedings.mlr.press/v130/ambrogioni21a.html Momentum Improves Optimization on Riemannian Manifolds We develop a new Riemannian descent algorithm that relies on momentum to improve over existing first-order methods for geodesically convex optimization. In contrast, accelerated convergence rates proved in prior work have only been shown to hold for geodesically strongly-convex objective functions. We further extend our algorithm to geodesically weakly-quasi-convex objectives. Our proofs of convergence rely on a novel estimate sequence that illustrates the dependency of the convergence rate on the curvature of the manifold. We validate our theoretical results empirically on several optimization problems defined on the sphere and on the manifold of positive definite matrices. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/alimisis21a.html
http://proceedings.mlr.press/v130/alimisis21a.html CLAR: Contrastive Learning of Auditory Representations Learning rich visual representations using contrastive self-supervised learning has been extremely successful. However, it is still a major question whether we could use a similar approach to learn superior auditory representations. In this paper, we expand on prior work (SimCLR) to learn better auditory representations. We (1) introduce various data augmentations suitable for auditory data and evaluate their impact on predictive performance, (2) show that training with time-frequency audio features substantially improves the quality of the learned representations compared to raw signals, and (3) demonstrate that training with both supervised and contrastive losses simultaneously improves the learned representations compared to self-supervised pre-training followed by supervised fine-tuning. We illustrate that by combining all these methods and with substantially less labeled data, our framework (CLAR) achieves significant improvement in prediction performance compared to the supervised approach. Moreover, compared to the self-supervised approach, our framework converges faster with significantly better representations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/al-tahan21a.html
http://proceedings.mlr.press/v130/al-tahan21a.html On Data Efficiency of Meta-learning Meta-learning has enabled learning statistical models that can be quickly adapted to new prediction tasks. Motivated by use-cases in personalized federated learning, we study an often overlooked aspect of modern meta-learning algorithms: their data efficiency. To shed more light on which methods are more efficient, we use techniques from algorithmic stability to derive bounds on the transfer risk that have important practical implications, indicating how much supervision is needed and how it must be allocated for each method to attain the desired level of generalization. Further, we introduce a new simple framework for evaluating meta-learning methods under a limit on the available supervision, conduct an empirical study of MAML, Reptile, and ProtoNets, and demonstrate the differences in the behavior of these methods on few-shot and federated learning benchmarks. Finally, we propose active meta-learning, which incorporates active data selection into learning-to-learn, leading to better performance of all methods in the limited supervision regime. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/al-shedivat21a.html
http://proceedings.mlr.press/v130/al-shedivat21a.html Probabilistic Sequential Matrix Factorization We introduce the probabilistic sequential matrix factorization (PSMF) method for factorizing time-varying and non-stationary datasets consisting of high-dimensional time-series. In particular, we consider nonlinear Gaussian state-space models where sequential approximate inference results in the factorization of a data matrix into a dictionary and time-varying coefficients with potentially nonlinear Markovian dependencies. The assumed Markovian structure on the coefficients enables us to encode temporal dependencies into a low-dimensional feature space. The proposed inference method is solely based on an approximate extended Kalman filtering scheme, which makes the resulting method particularly efficient. PSMF can account for temporal nonlinearities and, more importantly, can be used to calibrate and estimate generic differentiable nonlinear subspace models. We also introduce a robust version of PSMF, called rPSMF, which uses Student-t filters to handle model misspecification. We show that PSMF can be used in multiple contexts: modeling time series with a periodic subspace, robustifying changepoint detection methods, and imputing missing data in several high-dimensional time-series, such as measurements of pollutants across London. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/akyildiz21a.html
http://proceedings.mlr.press/v130/akyildiz21a.html Mirror Descent View for Neural Network Quantization Quantizing large Neural Networks (NN) while maintaining performance is highly desirable for resource-limited devices due to reduced memory and time complexity. It is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mapping from continuous to quantized ones) which would enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of auxiliary variables (unconstrained), and show that it is strikingly analogous to the Straight Through Estimator (STE) based method which is typically viewed as a “trick” to avoid the vanishing-gradients issue. Our experiments on CIFAR-10/100, TinyImageNet, and ImageNet classification datasets with VGG-16, ResNet-18, and MobileNetV2 architectures show that our MD variants yield state-of-the-art performance. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ajanthan21a.html
http://proceedings.mlr.press/v130/ajanthan21a.html Linear Regression Games: Convergence Guarantees to Approximate Out-of-Distribution Solutions Recently, invariant risk minimization (IRM) (Arjovsky et al. 2019) was proposed as a promising solution to address out-of-distribution (OOD) generalization. In Ahuja et al. (2020), it was shown that solving for the Nash equilibria of a new class of “ensemble-games” is equivalent to solving IRM. In this work, we extend the framework in Ahuja et al. (2020) for linear regressions by projecting the ensemble-game on an $\ell_{\infty}$ ball. We show that such projections help achieve non-trivial out-of-distribution guarantees despite not achieving perfect invariance. For linear models with confounders, we prove that Nash equilibria of these games are closer to the ideal OOD solutions than the standard empirical risk minimization (ERM) and we also provide learning algorithms that provably converge to these Nash Equilibria. Empirical comparisons of the proposed approach with the state-of-the-art show consistent gains in achieving OOD solutions in several settings involving anti-causal variables and confounders. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/ahuja21a.html
http://proceedings.mlr.press/v130/ahuja21a.html Quantum Tensor Networks, Stochastic Processes, and Weighted Automata Modeling joint probability distributions over sequences has been studied from many perspectives. The physics community developed matrix product states, a tensor-train decomposition for probabilistic modeling, motivated by the need to tractably model many-body systems. But similar models have also been studied in the stochastic processes and weighted automata literature, with little work on how these bodies of work relate to each other. We address this gap by showing how stationary or uniform versions of popular quantum tensor network models have equivalent representations in the stochastic processes and weighted automata literature, in the limit of infinitely long sequences. We demonstrate several equivalence results between models used in these three communities: (i) uniform variants of matrix product states, Born machines and locally purified states from the quantum tensor networks literature, (ii) predictive state representations, hidden Markov models, norm-observable operator models and hidden quantum Markov models from the stochastic process literature, and (iii) stochastic weighted automata, probabilistic automata and quadratic automata from the formal languages literature. Such connections may open the door for results and methods developed in one area to be applied in another. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/adhikary21a.html
http://proceedings.mlr.press/v130/adhikary21a.html Dirichlet Pruning for Convolutional Neural Networks We introduce Dirichlet pruning, a novel post-processing technique to transform a large neural network model into a compressed one. Dirichlet pruning is a form of structured pruning which assigns the Dirichlet distribution over each layer’s channels in convolutional layers (or neurons in fully-connected layers), and learns the parameters of the distribution over these units using variational inference. The learnt parameters allow us to informatively and intuitively remove unimportant units, resulting in a compact architecture containing only crucial features for the task at hand. This method yields a low GPU footprint, as the number of parameters is linear in the number of channels (or neurons) and training requires as little as one epoch to converge. We perform extensive experiments, in particular on larger architectures such as VGG and WideResNet (94% and 72% compression rate, respectively) where our method achieves state-of-the-art compression performance and provides interpretable features as a by-product. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/adamczewski21a.html
http://proceedings.mlr.press/v130/adamczewski21a.html Explore the Context: Optimal Data Collection for Context-Conditional Dynamics Models In this paper, we learn dynamics models for parametrized families of dynamical systems with varying properties. The dynamics models are formulated as stochastic processes conditioned on a latent context variable which is inferred from observed transitions of the respective system. The probabilistic formulation allows us to compute an action sequence which, for a limited number of environment interactions, optimally explores the given system within the parametrized family. This is achieved by steering the system through transitions being most informative for the context variable. We demonstrate the effectiveness of our method for exploration on a non-linear toy-problem and two well-known reinforcement learning environments. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/achterhold21a.html
http://proceedings.mlr.press/v130/achterhold21a.html Instance-Wise Minimax-Optimal Algorithms for Logistic Bandits Logistic Bandits have recently attracted substantial attention, by providing an uncluttered yet challenging framework for understanding the impact of non-linearity in parametrized bandits. It was shown by Faury et al. (2020) that the learning-theoretic difficulties of Logistic Bandits can be embodied by a large (sometimes prohibitively large) problem-dependent constant $\kappa$, characterizing the magnitude of the reward’s non-linearity. In this paper we introduce an algorithm for which we provide a refined analysis. This allows for a better characterization of the effect of non-linearity and yields improved problem-dependent guarantees. In the most favorable cases this leads to a regret upper-bound scaling as $\tilde{\mathcal{O}}(d\sqrt{T/\kappa})$, which dramatically improves over the $\tilde{\mathcal{O}}(d\sqrt{T}+\kappa)$ state-of-the-art guarantees. We prove that this rate is minimax-optimal by deriving an $\Omega(d\sqrt{T/\kappa})$ problem-dependent lower-bound. Our analysis identifies two regimes (permanent and transitory) of the regret, which ultimately reconciles Faury et al. (2020) with the Bayesian approach of Dong et al. (2019). In contrast to previous works, we find that in the permanent regime non-linearity can dramatically ease the exploration-exploitation trade-off. While it also impacts the length of the transitory phase in a problem-dependent fashion, we show that this impact is mild in most reasonable configurations. Thu, 18 Mar 2021 00:00:00 +0000
http://proceedings.mlr.press/v130/abeille21a.html
http://proceedings.mlr.press/v130/abeille21a.html