Proceedings of Machine Learning Research
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
Held online on 27-30 July 2021
Published as Volume 161 by the Proceedings of Machine Learning Research on 01 December 2021.
Volume Edited by:
Cassio de Campos
Marloes H. Maathuis
Series Editors:
Neil D. Lawrence
Mark Reid
https://proceedings.mlr.press/v161/
Wed, 01 Dec 2021 16:47:32 +0000
Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave. At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov chain. Under certain conditions on the target distribution, we prove that $\tilde O(d^4\epsilon^{-2})$ stochastic gradient evaluations suffice to guarantee $\epsilon$-sampling error in terms of the total variation distance, where $d$ is the problem dimension. This improves existing results on the convergence rate of SGLD [Raginsky et al., 2017, Xu et al., 2018]. We further show that, provided an additional Hessian Lipschitz condition on the log-density function, SGLD is guaranteed to achieve $\epsilon$-sampling error within $\tilde O(d^{15/4}\epsilon^{-3/2})$ stochastic gradient evaluations. Our proof technique provides a new way to study the convergence of Langevin-based algorithms, and sheds some light on the design of fast stochastic-gradient-based sampling algorithms.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zou21a.html
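For readers unfamiliar with SGLD, the basic update analyzed in this line of work can be sketched as follows. This is a minimal illustration, not the paper's exact variant: it assumes a one-dimensional target proportional to $\exp(-f(x))$ with $f(x) = \frac{1}{n}\sum_i \frac{1}{2}(x - a_i)^2$, unit inverse temperature, and a fixed step size; the names `sgld_step` and `run_sgld` are hypothetical.

```python
import math
import random


def sgld_step(x, stoch_grad, eta, noise):
    """One SGLD update: a stochastic gradient step on f plus scaled Gaussian noise."""
    return x - eta * stoch_grad + math.sqrt(2.0 * eta) * noise


def run_sgld(data, eta=0.05, n_steps=20000, batch_size=2, seed=0):
    """Run SGLD on the toy potential f(x) = mean_i 0.5 * (x - a_i)^2 (an assumption)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        batch = rng.sample(data, batch_size)
        # Unbiased minibatch estimate of grad f(x) = x - mean(data).
        grad = sum(x - a for a in batch) / batch_size
        x = sgld_step(x, grad, eta, rng.gauss(0.0, 1.0))
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = run_sgld([-1.0, 0.0, 1.0, 2.0])
    print("sample mean:", sum(draws) / len(draws))
```

For this toy potential the target is a unit-variance Gaussian centered at the data mean, so the sample mean of a long run should land near 0.5.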
Unsupervised program synthesis for images by sampling without replacement
Program synthesis has emerged as a successful approach to the image parsing task. Most prior works rely on a two-step scheme involving supervised pretraining of a Seq2Seq model with synthetic programs followed by reinforcement learning (RL) for fine-tuning with real reference images. Fully unsupervised approaches promise to train the model directly on the target images without requiring curated pretraining datasets. However, they struggle with the inherent sparsity of meaningful programs in the search space. In this paper, we present the first unsupervised algorithm capable of parsing constructive solid geometry (CSG) images into context-free grammar (CFG) without pretraining. To tackle the non-Markovian sparse reward problem, we combine three key ingredients: (i) a grammar-encoded tree LSTM ensuring program validity, (ii) entropy regularization, and (iii) sampling without replacement from the CFG syntax tree. Empirically, our algorithm recovers meaningful programs in large search spaces (up to $3.8 \times 10^{28}$). Further, even though our approach is fully unsupervised, it generalizes better than supervised methods on the synthetic 2D CSG dataset. On the 2D computer aided design (CAD) dataset, our approach significantly outperforms the supervised pretrained model and is competitive with the refined model.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhou21b.html
Task similarity aware meta learning: theory-inspired improvement on MAML
Few-shot learning ability is heavily desired for machine intelligence. By meta-learning a model initialization from training tasks with fast adaptation ability to new tasks, model-agnostic meta-learning (MAML) has achieved remarkable success in a number of few-shot learning applications. However, a theoretical understanding of the learning ability of MAML is still missing, which hinders the principled development of new and more advanced meta-learning methods. In this work, we solve this problem by theoretically justifying the fast adaptation capability of MAML when applied to new tasks. Specifically, we prove that the learnt meta-initialization can benefit the fast adaptation to new tasks with only a few steps of gradient descent. This result explicitly reveals the benefits of the unique designs in MAML. Then we propose a theory-inspired task similarity aware MAML which clusters tasks into multiple groups according to the estimated optimal model parameters and learns group-specific initializations. The proposed method improves upon MAML by speeding up the adaptation and giving stronger few-shot learning ability. Experimental results on few-shot classification tasks confirm its advantages.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhou21a.html
Diagnostics for conditional density models and Bayesian inference algorithms
There has been growing interest in the AI community in precise uncertainty quantification. Conditional density models $f(y|x)$, where $x$ represents potentially high-dimensional features, are an integral part of uncertainty quantification in prediction and Bayesian inference. However, it is challenging to assess conditional density estimates and gain insight into modes of failure. While existing diagnostic tools can determine whether an approximated conditional density is compatible overall with a data sample, they lack a principled framework for identifying, locating, and interpreting the nature of statistically significant discrepancies over the entire feature space. In this paper, we present rigorous and easy-to-interpret diagnostics such as (i) the “Local Coverage Test” (LCT), which distinguishes an arbitrarily misspecified model from the true conditional density of the sample, and (ii) “Amortized Local P-P plots” (ALP), which can quickly provide interpretable graphical summaries of distributional differences at any location $x$ in the feature space. Our validation procedures scale to high dimensions and can potentially adapt to any type of data at hand. We demonstrate the effectiveness of LCT and ALP through a simulated experiment and applications to prediction and parameter inference for image data.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhao21b.html
BayLIME: Bayesian local interpretable model-agnostic explanations
Given the pressing need for assuring algorithmic transparency, Explainable AI (XAI) has emerged as one of the key areas of AI research. In this paper, we develop BayLIME, a novel Bayesian extension to the LIME framework, one of the most widely used approaches in XAI. Compared to LIME, BayLIME exploits prior knowledge and Bayesian reasoning to improve both the consistency in repeated explanations of a single prediction and the robustness to kernel settings. BayLIME also exhibits better explanation fidelity than the state-of-the-art (LIME, SHAP and GradCAM) by its ability to integrate prior knowledge from, e.g., a variety of other XAI techniques, as well as verification and validation (V&V) methods. We demonstrate the desirable properties of BayLIME through both theoretical analysis and extensive experiments.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhao21a.html
Enabling long-range exploration in minimization of multimodal functions
We consider the problem of minimizing multi-modal loss functions with a large number of local optima. Since the local gradient points to the direction of the steepest slope in an infinitesimal neighborhood, an optimizer guided by the local gradient is often trapped in a local minimum. To address this issue, we develop a novel nonlocal gradient to skip small local minima by capturing major structures of the loss’s landscape in black-box optimization. The nonlocal gradient is defined by a directional Gaussian smoothing (DGS) approach. The key idea of DGS is to conduct 1D long-range exploration with a large smoothing radius along $d$ orthogonal directions in $\mathbb{R}^d$, each of which defines a nonlocal directional derivative as a 1D integral. Such long-range exploration enables the nonlocal gradient to skip small local minima. The $d$ directional derivatives are then assembled to form the nonlocal gradient. We use the Gauss-Hermite quadrature rule to approximate the $d$ 1D integrals to obtain an accurate estimator. The superior performance of our method is demonstrated in three sets of examples, including benchmark functions for global optimization, and two real-world scientific problems.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhang21e.html
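The DGS construction described in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the directional derivative of the Gaussian-smoothed 1D slice is written as $\frac{1}{\sigma}\mathbb{E}_{v\sim N(0,1)}[v\,F(x+\sigma v\,\xi)]$ and approximated with a hard-coded 3-point Gauss-Hermite rule (a practical estimator would likely use more nodes). The function names are hypothetical.

```python
import math

# 3-point Gauss-Hermite rule for integrals of exp(-t^2) * h(t) over the real line.
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6.0,
    2.0 * math.sqrt(math.pi) / 3.0,
    math.sqrt(math.pi) / 6.0,
)


def dgs_directional_derivative(f, x, direction, sigma):
    """Derivative at x of the Gaussian-smoothed 1D slice of f along `direction`."""
    total = 0.0
    for t, w in zip(GH_NODES, GH_WEIGHTS):
        shift = math.sqrt(2.0) * sigma * t
        point = [xi + shift * di for xi, di in zip(x, direction)]
        total += w * math.sqrt(2.0) * t * f(point)
    return total / (sigma * math.sqrt(math.pi))


def dgs_gradient(f, x, sigma, directions=None):
    """Assemble the nonlocal gradient from d orthonormal directional derivatives."""
    d = len(x)
    if directions is None:
        # Default to the standard basis; any orthonormal basis could be used.
        directions = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    derivs = [dgs_directional_derivative(f, x, xi, sigma) for xi in directions]
    return [sum(derivs[i] * directions[i][j] for i in range(d)) for j in range(d)]


if __name__ == "__main__":
    sphere = lambda p: sum(v * v for v in p)
    print(dgs_gradient(sphere, [1.0, -2.0], sigma=3.0))
```

On a quadratic such as the sphere function, Gaussian smoothing only adds a constant, so the nonlocal gradient coincides with the ordinary gradient regardless of the smoothing radius; the difference appears on multimodal landscapes.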
Dynamic visualization for L1 fusion convex clustering in near-linear time
Convex clustering has drawn recent attention because of its competitive performance and its guarantee of global optimality. However, convex clustering is infeasible for large-scale data sets due to its high computational cost. We propose a novel method to solve the L1 fusion convex clustering problem by dynamic programming. We develop the Convex clustering Path Algorithm In Near-linear Time (C-PAINT) to construct the solution path efficiently. The proposed C-PAINT yields the exact solution, whereas other general convex solvers applied to convex clustering depend on tuning parameters such as step size and threshold and usually take many iterations to converge. Apart from a sorting step that takes almost no time in practice, the main part of the algorithm takes only linear time. Thus, C-PAINT has superior scalability compared to other state-of-the-art algorithms. Moreover, C-PAINT enables the path visualization of clustering solutions for large data. In particular, experiments show our proposed method can solve the convex clustering problem with $10^7$ data points in two minutes. We demonstrate the proposed method using both synthetic data and real data. Our algorithms are implemented in the dpcc R package.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhang21d.html
The complexity of nonconvex-strongly-concave minimax optimization
This paper studies the complexity for finding approximate stationary points of nonconvex-strongly-concave (NC-SC) smooth minimax problems, in both general and averaged smooth finite-sum settings. We establish nontrivial lower complexity bounds for the two settings, respectively. Our results reveal substantial gaps between these limits and best-known upper bounds in the literature. To close these gaps, we introduce a generic acceleration scheme that deploys existing gradient-based methods to solve a sequence of crafted strongly-convex-strongly-concave subproblems. In the general setting, the complexity of our proposed algorithm nearly matches the lower bound; in particular, it removes an additional poly-logarithmic dependence on accuracy present in previous works. In the averaged smooth finite-sum setting, our proposed algorithm improves over previous algorithms by providing a nearly-tight dependence on the condition number.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhang21c.html
Structured sparsification with joint optimization of group convolution and channel shuffle
Recent advances in convolutional neural networks (CNNs) usually come at the expense of excessive computational overhead and memory footprint. Network compression aims to alleviate this issue by training compact models with comparable performance. However, existing compression techniques either entail dedicated expert design or compromise with a moderate performance drop. In this paper, we propose a novel structured sparsification method for efficient network compression. The proposed method automatically induces structured sparsity on the convolutional weights, thereby facilitating the implementation of the compressed model with the highly-optimized group convolution. We further address the problem of inter-group communication with a learnable channel shuffle mechanism. The proposed approach can be easily applied to compress many network architectures with a negligible performance drop. Extensive experimental results and analysis demonstrate that our approach is competitive with recent network compression methods and offers a sound accuracy-complexity trade-off.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhang21b.html
On the distributional properties of adaptive gradients
Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims at providing a series of theoretical analyses of its statistical properties justified by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zhang21a.html
NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation
In this paper, we present a non-parametric structured latent variable model for image generation, called NP-DRAW, which sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas. Our key contributions are as follows. 1) We propose a non-parametric prior distribution over the appearance of image parts so that the latent variable “what-to-draw” per step becomes a categorical random variable. This improves the expressiveness and greatly eases the learning compared to Gaussians used in the literature. 2) We model the sequential dependency structure of parts via a Transformer, which is more powerful and easier to train compared to RNNs used in the literature. 3) We propose an effective heuristic parsing algorithm to pre-train the prior. Experiments on MNIST, Omniglot, CIFAR-10, and CelebA show that our method significantly outperforms previous structured image models like DRAW and AIR and is competitive with other generic generative models. Moreover, we show that our model’s inherent compositionality and interpretability bring significant benefits in the low-data learning regime and latent space editing.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zeng21b.html
A decentralized policy gradient approach to multi-task reinforcement learning
We develop a mathematical framework for solving multi-task reinforcement learning (MTRL) problems based on a type of policy gradient method. The goal in MTRL is to learn a common policy that operates effectively in different environments; these environments have similar (or overlapping) state spaces, but have different rewards and dynamics. We highlight two fundamental challenges in MTRL that are not present in its single-task counterpart, and illustrate them with simple examples. We then develop a decentralized entropy-regularized policy gradient method for solving the MTRL problem, and study its finite-time convergence rate. We demonstrate the effectiveness of the proposed method using a series of numerical experiments. These experiments range from small-scale "GridWorld" problems that readily demonstrate the trade-offs involved in multi-task learning to large-scale problems, where common policies are learned to navigate an airborne drone in multiple (simulated) environments.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zeng21a.html
PROVIDE: a probabilistic framework for unsupervised video decomposition
Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress in static scenes, such models are unable to leverage important dynamic cues present in videos. We propose PROVIDE, a novel unsupervised framework for PRObabilistic VIdeo DEcomposition based on a temporal extension of iterative inference. PROVIDE is powerful enough to jointly model complex individual multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by leveraging 2D-LSTM, temporally conditioned inference and generation within the iterative amortized inference for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects’ dynamics, and can be used to predict trajectories of each object separately. Additionally, we show that our model has a high accuracy even without color information. We demonstrate the decomposition capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets, one of which was curated for this work and will be made publicly available.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/zablotskaia21a.html
Leveraging probabilistic circuits for nonparametric multi-output regression
Inspired by recent advances in the field of expert-based approximations of Gaussian processes (GPs), we present an expert-based approach to large-scale multi-output regression using single-output GP experts. Employing a deeply structured mixture of single-output GPs encoded via a probabilistic circuit allows us to capture correlations between multiple output dimensions accurately. By recursively partitioning the covariate space and the output space, posterior inference in our model reduces to inference on single-output GP experts, which only need to be conditioned on a small subset of the observations. We show that inference can be performed exactly and efficiently in our model, that it can capture correlations between output dimensions and, hence, often outperforms approaches that do not incorporate inter-output correlations, as demonstrated on several data sets in terms of the negative log predictive density.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/yu21a.html
Multi-output Gaussian Processes for uncertainty-aware recommender systems
Recommender systems are often designed based on a collaborative filtering approach, where user preferences are predicted by modelling interactions between users and items. Many common approaches to solve the collaborative filtering task are based on learning representations of users and items, including simple matrix factorization, Gaussian process latent variable models, and neural-network-based embeddings. While matrix factorization approaches fail to model nonlinear relations, neural networks can potentially capture such complex relations with unprecedented predictive power and are highly scalable. However, neither of them is able to model predictive uncertainties. In contrast, Gaussian-process-based models can generate a predictive distribution, but cannot scale to large amounts of data. In this manuscript, we propose a novel approach combining the representation learning paradigm of collaborative filtering with multi-output Gaussian processes in a joint framework to generate uncertainty-aware recommendations. We introduce an efficient strategy for model training and inference, resulting in a model that scales to very large and sparse datasets and achieves competitive performance in terms of classical metrics quantifying the reconstruction error. In addition to accurately predicting user preferences, our model also provides meaningful uncertainty estimates about that prediction.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/yang21a.html
Explaining fast improvement in online imitation learning
Online imitation learning (IL) is an algorithmic framework that leverages interactions with expert policies for efficient policy optimization. Here policies are optimized by performing online learning on a sequence of loss functions that encourage the learner to mimic expert actions, and if the online learning has no regret, the agent can provably learn an expert-like policy. Online IL has demonstrated empirical successes in many applications and, interestingly, its policy improvement speed observed in practice is usually much faster than existing theory suggests. In this work, we provide an explanation of this phenomenon. Let $\xi$ denote the policy class bias and assume the online IL loss functions are convex, smooth, and non-negative. We prove that, after $N$ rounds of online IL with stochastic feedback, the policy improves at a rate of $\tilde{O}(1/N + \sqrt{\xi/N})$ in both expectation and high probability. In other words, we show that adopting a sufficiently expressive policy class in online IL has two benefits: both the policy improvement speed increases and the performance bias decreases.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/yan21a.html
Simple combinatorial algorithms for combinatorial bandits: corruptions and approximations
We consider the stochastic combinatorial semi-bandit problem with adversarial corruptions. We provide a simple combinatorial algorithm that can achieve a regret of $\tilde{O}\left(C+d^2K/\Delta_{\min}\right)$, where $C$ is the total amount of corruptions, $d$ is the maximal number of arms one can play in each round, and $K$ is the number of arms. If one selects only one arm in each round, we achieve a regret of $\tilde{O}\left(C+\sum_{\Delta_i>0}(1/\Delta_i)\right)$. Our algorithm is combinatorial and improves on the previous combinatorial algorithm by [Gupta et al., COLT2019] (their bound is $\tilde{O}\left(KC+\sum_{\Delta_i>0}(1/\Delta_i)\right)$), and almost matches the best known bounds obtained by [Zimmert et al., ICML2019] and [Zimmert and Seldin, AISTATS2019] (up to a logarithmic factor). Note that the algorithms in [Zimmert et al., ICML2019] and [Zimmert and Seldin, AISTATS2019] require one to solve complex convex programs, while our algorithm is combinatorial, very easy to implement, requires weaker assumptions, and has very low oracle complexity and running time. We also study the setting where we only get access to an approximation oracle for the stochastic combinatorial semi-bandit problem. Our algorithm achieves an (approximation) regret bound of $\tilde{O}\left(d\sqrt{KT}\right)$. Our algorithm is very simple, only worse than the best known regret bound by $\sqrt{d}$, and has much lower oracle complexity than previous work.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/xu21b.html
Robust reinforcement learning under minimax regret for green security
Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers. Importantly, the deterrence effect of patrols on adversaries’ future behavior makes patrol planning a sequential decision-making problem. Therefore, we focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature. We formulate the problem as a game between the defender and nature, who controls the parameter values of the adversarial behavior, and design an algorithm, MIRROR, to find a robust policy. MIRROR uses two reinforcement learning-based oracles and solves a restricted game considering limited defender strategies and parameter values. We evaluate MIRROR on real-world poaching data.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/xu21a.html
Extendability of causal graphical models: Algorithms and computational complexity
Finding a consistent DAG extension for a given partially directed acyclic graph (PDAG) is a basic building block used in graphical causal analysis. In 1992, Dor and Tarsi proposed an algorithm with time complexity $O(n^4)$, which has been widely used in causal theory and practice so far. It is a long-standing open question whether an extension can be computed faster and, in particular, it was conjectured that a linear-time method may exist. The main contributions of our work are two-fold: Firstly, we propose a new algorithm for the extension problem for PDAGs which runs in time $O(n^3)$; secondly, we show that, under a computational intractability assumption, our cubic algorithm is optimal. Thus, our impossibility result disproves the conjecture that a linear-time method exists. Based on these results, we present a full complexity landscape for finding extensions in various causal graphical models. We extend the techniques to recognition problems and apply them to design an effective algorithm for closing a PDAG under the orientation rules of Meek.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/wienobst21a.html
Certification of iterative predictions in Bayesian neural networks
We consider the problem of computing reach-avoid probabilities for iterative predictions made with Bayesian neural network (BNN) models. Specifically, we leverage bound propagation techniques and backward recursion to compute lower bounds for the probability that trajectories of the BNN model reach a given set of states while avoiding a set of unsafe states. We use the lower bounds in the context of control and reinforcement learning to provide safety certification for given control policies, as well as to synthesize control policies that improve the certification bounds. On a set of benchmarks, we demonstrate that our framework can be employed to certify policies over BNN predictions for problems of more than 10 dimensions, and to effectively synthesize policies that significantly increase the lower bound on the satisfaction probability.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/wicker21a.html
Exploring the loss landscape in neural architecture search
Neural architecture search (NAS) has seen a steep rise in interest over the last few years. Many algorithms for NAS consist of searching through a space of architectures by iteratively choosing an architecture, evaluating its performance by training it, and using all prior evaluations to come up with the next choice. The evaluation step is noisy: the final accuracy varies based on the random initialization of the weights. Prior work has focused on devising new search algorithms to handle this noise, rather than quantifying or understanding the level of noise in architecture evaluations. In this work, we show that (1) the simplest hill-climbing algorithm is a powerful baseline for NAS, and (2) when the noise in popular NAS benchmark datasets is reduced to a minimum, the loss landscape becomes near-convex, causing hill-climbing to outperform many popular state-of-the-art algorithms. We further back up this observation by showing that the number of local minima is substantially reduced as the noise decreases and by giving a theoretical characterization of the performance of local search in NAS. Based on our findings, for NAS research we suggest (1) using local search as a baseline, and (2) denoising the training pipeline when possible.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/white21a.html
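The hill-climbing baseline mentioned in the abstract above can be sketched as a generic local search over discrete architecture encodings. This is a minimal sketch under stated assumptions, not the paper's code: an architecture is assumed to be a non-empty tuple of operation indices, a neighbor differs in exactly one position, and `evaluate` is a hypothetical stand-in for training an architecture and returning its validation loss.

```python
def neighbors(arch, num_ops):
    """Yield every architecture that differs from `arch` in exactly one position."""
    for i in range(len(arch)):
        for op in range(num_ops):
            if op != arch[i]:
                yield arch[:i] + (op,) + arch[i + 1:]


def local_search(start, evaluate, num_ops, max_iters=1000):
    """Hill climbing: move to the best neighbor until no neighbor improves the loss."""
    current, current_loss = start, evaluate(start)
    for _ in range(max_iters):
        best, best_loss = min(
            ((cand, evaluate(cand)) for cand in neighbors(current, num_ops)),
            key=lambda pair: pair[1],
        )
        if best_loss >= current_loss:
            break  # local minimum reached
        current, current_loss = best, best_loss
    return current, current_loss


if __name__ == "__main__":
    target = (2, 0, 1)
    # Toy noiseless loss: Hamming distance to a hidden target architecture.
    toy_loss = lambda arch: sum(a != b for a, b in zip(arch, target))
    print(local_search((0, 0, 0), toy_loss, num_ops=3))
```

With a noiseless, well-behaved loss like the toy one here, the search walks straight to the global minimum; with noisy evaluations it can stall at spurious local minima, which is the effect the abstract attributes to training noise.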
Local explanations via necessity and sufficiency: unifying theory and practice
Necessity and sufficiency are the building blocks of all successful explanations. Yet despite their importance, these notions have been conceptually underdeveloped and inconsistently applied in explainable artificial intelligence (XAI), a fast-growing research area that is so far lacking in firm theoretical foundations. Building on work in logic, probability, and causality, we establish the central role of necessity and sufficiency in XAI, unifying seemingly disparate methods in a single formal framework. We provide a sound and complete algorithm for computing explanatory factors with respect to a given context, and demonstrate its flexibility and competitive performance against state-of-the-art alternatives on various tasks.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/watson21a.html
Explicit pairwise factorized graph neural network for semi-supervised node classification
Node features and structural information of a graph are both crucial for semi-supervised node classification problems. A variety of graph neural network (GNN) based approaches have been proposed to tackle these problems, which typically determine output labels through feature aggregation. This can be problematic, as it implies conditional independence of output nodes given hidden representations, despite their direct connections in the graph. To learn the direct influence among output nodes in a graph, we propose the Explicit Pairwise Factorized Graph Neural Network (EPFGNN), which models the whole graph as a partially observed Markov Random Field. It contains explicit pairwise factors to model output-output relations and uses a GNN backbone to model input-output relations. To balance model complexity and expressivity, the pairwise factors have a shared component and a separate scaling coefficient for each edge. We apply the EM algorithm to train our model, and utilize a star-shaped piecewise likelihood for the tractable surrogate objective. We conduct experiments on various datasets, which show that our model can effectively improve the performance for semi-supervised node classification on graphs.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/wang21d.html
CORe: Capitalizing On Rewards in Bandit Exploration
We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, the sufficient optimism in the mean reward estimates is achieved by exploiting the variance in the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its exploration is fully data-dependent. It does not rely on any external noise and adapts to different problems without parameter tuning. We derive a $\tilde O(d\sqrt{n\log K})$ gap-free bound on the $n$-round regret of CORe in a stochastic linear bandit, where $d$ is the number of features and $K$ is the number of arms. Extensive empirical evaluation on multiple synthetic and real-world problems demonstrates the effectiveness of CORe.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/wang21c.html
Statistically robust neural network classification
Despite their numerous successes, adversarial risk metrics do not provide an appropriate measure of robustness in many scenarios. For example, test-time perturbations may occur in a probabilistic manner rather than being generated by an explicit adversary, while the poor train-test generalization of adversarial metrics can limit their usage to simple problems. Motivated by this, we develop a probabilistic robust risk framework, the statistically robust risk (SRR), which considers pointwise corruption distributions, as opposed to worst-case adversaries. The SRR provides a distinct and complementary measure of robust performance, compared to natural and adversarial risk. We show that the SRR admits estimation and training schemes which are as simple and efficient as for the natural risk: these simply require noising the inputs, but with a principled derivation for exactly how and why this should be done. Furthermore, we demonstrate both theoretically and experimentally that it can provide superior generalization performance compared with adversarial risks, enabling application to high-dimensional datasets.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/wang21b.html
https://proceedings.mlr.press/v161/wang21b.htmlNatural language adversarial defense through synonym encodingIn the area of natural language processing, deep learning models have recently been shown to be vulnerable to various types of adversarial perturbations, but relatively little work has been done on the defense side. In particular, there exist few effective defense methods against the successful synonym substitution based attacks that preserve the syntactic structure and semantic information of the original text while fooling the deep learning models. We contribute in this direction and propose a novel adversarial defense method called <em>Synonym Encoding Method</em> (SEM). Specifically, SEM inserts an encoder before the input layer of the target model to map each cluster of synonyms to a unique encoding and trains the model to eliminate possible adversarial perturbations without modifying the network architecture or adding extra data. Extensive experiments demonstrate that SEM can effectively defend against the current synonym substitution based attacks and block the transferability of adversarial examples. SEM is also easy and efficient to scale to large models and big datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/wang21a.html
https://proceedings.mlr.press/v161/wang21a.htmlPost-hoc loss-calibration for Bayesian neural networksBayesian decision theory provides an elegant framework for acting optimally under uncertainty when tractable posterior distributions are available. Modern Bayesian models, however, typically involve intractable posteriors that are approximated with, potentially crude, surrogates. This difficulty has engendered loss-calibrated techniques that aim to learn posterior approximations that favor high-utility decisions. In this paper, focusing on Bayesian neural networks, we develop methods for correcting approximate posterior predictive distributions encouraging them to prefer high-utility decisions. In contrast to previous work, our approach is agnostic to the choice of the approximate inference algorithm, allows for efficient test time decision making through amortization, and empirically produces higher quality decisions. We demonstrate the effectiveness of our approach through controlled experiments spanning a diversity of tasks and datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/vadera21a.html
https://proceedings.mlr.press/v161/vadera21a.htmlKnow your limits: Uncertainty estimation with ReLU classifiers fails at reliable OOD detectionA crucial requirement for reliable deployment of deep learning models for safety-critical applications is the ability to identify out-of-distribution (OOD) data points, samples which differ from the training data and on which a model might underperform. Previous work has attempted to tackle this problem using uncertainty estimation techniques. However, there is empirical evidence that a large family of these techniques do not detect OOD reliably in classification tasks. This paper gives a theoretical explanation for said experimental findings and illustrates it on synthetic data. We prove that such techniques are not able to reliably identify OOD samples in a classification setting, since their level of confidence is generalized to unseen areas of the feature space. This result stems from the interplay between the representation of ReLU networks as piece-wise affine transformations, the saturating nature of activation functions like softmax, and the most widely-used uncertainty metrics.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ulmer21a.html
https://proceedings.mlr.press/v161/ulmer21a.htmlProbabilistic selection of inducing points in sparse Gaussian processesSparse Gaussian processes and various extensions thereof are enabled through inducing points, that simultaneously bottleneck the predictive capacity and act as the main contributor towards model complexity. However, the number of inducing points is generally not associated with uncertainty which prevents us from applying the apparatus of Bayesian reasoning for identifying an appropriate trade-off. In this work we place a point process prior on the inducing points and approximate the associated posterior through stochastic variational inference. By letting the prior encourage a moderate number of inducing points, we enable the model to learn which and how many points to utilise. We experimentally show that fewer inducing points are preferred by the model as the points become less informative, and further demonstrate how the method can be employed in deep Gaussian processes and latent variable modelling.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/uhrenholt21a.html
https://proceedings.mlr.press/v161/uhrenholt21a.htmlBias-corrected peaks-over-threshold estimation of the CVaRThe conditional value-at-risk (CVaR) is a useful risk measure in fields such as machine learning, finance, insurance, energy, etc. When measuring very extreme risk, the commonly used CVaR estimation method of sample averaging does not work well due to limited data above the value-at-risk (VaR), the quantile corresponding to the CVaR level. To mitigate this problem, the CVaR can be estimated by extrapolating above a lower threshold than the VaR using a generalized Pareto distribution (GPD), which is often referred to as the peaks-over-threshold (POT) approach. This method often requires a very high threshold to fit well, leading to high variance in estimation, and can induce significant bias if the threshold is chosen too low. In this paper, we address this bias-variance tradeoff by deriving a new expression for the GPD approximation error of the CVaR, a bias term induced by the choice of threshold, as well as a bias correction method for the estimated GPD parameters. This leads to the derivation of a new CVaR estimator that is asymptotically unbiased and less sensitive to lower thresholds being used. An asymptotic confidence interval for the estimator is also constructed. In a practical setting, we show through experiments that our estimator provides a significant performance improvement compared with competing CVaR estimators in finite samples from heavy-tailed distributions.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/troop21a.html
https://proceedings.mlr.press/v161/troop21a.htmlCausal and interventional Markov boundariesFeature selection is an important problem in machine learning, which aims to select variables that lead to an optimal predictive model. In this paper, we focus on feature selection for post-intervention outcome prediction from pre-intervention variables. We are motivated by healthcare settings, where the goal is often to select the treatment that will maximize a specific patient’s outcome; however, we often do not have sufficient randomized control trial data to identify well the conditional treatment effect. We show how we can use observational data to improve feature selection and effect estimation in two cases: (a) using observational data when we know the causal graph, and (b) when we do not know the causal graph but have observational and limited experimental data. Our paper extends the notion of Markov boundary to treatment-outcome pairs. We provide theoretical guarantees for the methods we introduce. In simulated data, we show that combining observational and experimental data improves feature selection and effect estimation.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/triantafillou21a.html
https://proceedings.mlr.press/v161/triantafillou21a.htmlInformation theoretic meta learning with Gaussian processesWe formulate meta learning using information theoretic concepts; namely, mutual information and the information bottleneck. The idea is to learn a stochastic representation or encoding of the task description, given by a training set, that is highly informative about predicting the validation set. By making use of variational approximations to the mutual information, we derive a general and tractable framework for meta learning. This framework unifies existing gradient-based algorithms and also allows us to derive new algorithms. In particular, we develop a memory-based algorithm that uses Gaussian processes to obtain non-parametric encoding representations. We demonstrate our method on a few-shot regression problem and on four few-shot classification problems, obtaining competitive accuracy when compared to existing baselines.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/titsias21a.html
https://proceedings.mlr.press/v161/titsias21a.htmlIncorporating causal graphical prior knowledge into predictive modeling via simple data augmentationCausal graphs (CGs) are compact representations of the knowledge of the data generating processes behind the data distributions. When a CG is available, e.g., from the domain knowledge, we can infer the conditional independence (CI) relations that should hold in the data distribution. However, it is not straightforward how to incorporate this knowledge into predictive modeling. In this work, we propose a model-agnostic data augmentation method that allows us to exploit the prior knowledge of the CI encoded in a CG for supervised machine learning. We theoretically justify the proposed method by providing an excess risk bound indicating that the proposed method suppresses overfitting by reducing the apparent complexity of the predictor hypothesis class. Using real-world data with CGs provided by domain experts, we experimentally show that the proposed method is effective in improving the prediction accuracy, especially in the small-data regime.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/teshima21a.html
https://proceedings.mlr.press/v161/teshima21a.htmlBandits with partially observable confounded dataWe study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds. Our results demonstrate the ability to take advantage of confounded offline data. Particularly, we prove regret bounds that improve current bounds by a factor related to the visible dimensionality of the contexts in the data. Our results indicate that confounded offline data can significantly improve online learning algorithms. Finally, we demonstrate various characteristics of our approach through synthetic simulations.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/tennenholtz21a.html
https://proceedings.mlr.press/v161/tennenholtz21a.htmlCombining pseudo-point and state space approximations for sum-separable Gaussian ProcessesGaussian processes (GPs) are important probabilistic tools for inference and learning in spatio-temporal modelling problems such as those in climate science and epidemiology. However, existing GP approximations do not simultaneously support large numbers of off-the-grid spatial data-points and long time-series which is a hallmark of many applications. Pseudo-point approximations, one of the gold-standard methods for scaling GPs to large data sets, are well suited for handling off-the-grid spatial data. However, they cannot handle long temporal observation horizons effectively reverting to cubic computational scaling in the time dimension. State space GP approximations are well suited to handling temporal data, if the temporal GP prior admits a Markov form, leading to linear complexity in the number of temporal observations, but have a cubic spatial cost and cannot handle off-the-grid spatial data. In this work we show that there is a simple and elegant way to combine pseudo-point methods with the state space GP approximation framework to get the best of both worlds. The approach hinges on a surprising conditional independence property which applies to space–time separable GPs. We demonstrate empirically that the combined approach is more scalable and applicable to a greater range of spatio-temporal problems than either method on its own.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/tebbutt21a.html
https://proceedings.mlr.press/v161/tebbutt21a.htmlSymmetric Wasserstein autoencodersLeveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. The resulting algorithm jointly optimizes the modelling losses in both the data and the latent spaces with the loss in the data space leading to the denoising effect. With the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the quality of the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both the generation and reconstruction. We empirically show the superior performance of SWAEs over the state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/sun21a.html
https://proceedings.mlr.press/v161/sun21a.htmlConfidence in causal discovery with linear causal modelsStructural causal models postulate noisy functional relations among a set of interacting variables. The causal structure underlying each such model is naturally represented by a directed graph whose edges indicate for each variable which other variables it causally depends upon. Under a number of different model assumptions, it has been shown that this causal graph and, thus also, causal effects are identifiable from mere observational data. For these models, practical algorithms have been devised to learn the graph. Moreover, when the graph is known, standard techniques may be used to give estimates and confidence intervals for causal effects. We argue, however, that a two-step method that first learns a graph and then treats the graph as known yields confidence intervals that are overly optimistic and can drastically fail to account for the uncertain causal structure. To address this issue we lay out a framework based on test inversion that allows us to give confidence regions for total causal effects that capture both sources of uncertainty: causal structure and numerical size of nonzero effects. Our ideas are developed in the context of bivariate linear causal models with homoscedastic errors, but as we exemplify they are generalizable to larger systems as well as other settings such as, in particular, linear non-Gaussian models.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/strieder21a.html
https://proceedings.mlr.press/v161/strieder21a.htmlLearning proposals for probabilistic programs with inference combinatorsWe develop operators for construction of proposals in probabilistic programs, which we refer to as inference combinators. Inference combinators define a grammar over importance samplers that compose primitive operations such as application of a transition kernel and importance resampling. Proposals in these samplers can be parameterized using neural networks, which in turn can be trained by optimizing variational objectives. The result is a framework for user-programmable variational methods that are correct by construction and can be tailored to specific models. We demonstrate the flexibility of this framework by implementing advanced variational methods based on amortized Gibbs sampling and annealing.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/stites21a.html
https://proceedings.mlr.press/v161/stites21a.htmlSubseasonal climate prediction in the western US using Bayesian spatial modelsSubseasonal climate forecasting is the task of predicting climate variables, such as temperature and precipitation, in a two-week to two-month time horizon. The primary predictors for such prediction problems are spatio-temporal satellite and ground measurements of a variety of climate variables in the atmosphere, ocean, and land, which, however, have rather limited predictive signal at the subseasonal time horizon. We propose a carefully constructed spatial hierarchical Bayesian regression model that makes use of the inherent spatial structure of the subseasonal climate prediction task. We then use our Bayesian model to derive decision-theoretically optimal point estimates with respect to various performance measures of interest to climate science. As we show, our approach handily improves on various off-the-shelf ML baselines. Since our method is based on a Bayesian framework, we are also able to quantify the uncertainty in our predictions, which is particularly crucial for difficult tasks such as subseasonal prediction, where we expect any model to have considerable uncertainty at different test locations under different scenarios.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/srinivasan21b.html
https://proceedings.mlr.press/v161/srinivasan21b.htmlPath dependent structural equation modelsCausal analyses of longitudinal data generally assume that the qualitative causal structure relating variables remains invariant over time. In structured systems that transition between qualitatively different states in discrete time steps, such an approach is deficient on two fronts. First, time-varying variables may have state-specific causal relationships that need to be captured. Second, an intervention can result in state transitions downstream of the intervention different from those actually observed in the data. In other words, interventions may counterfactually alter the subsequent temporal evolution of the system. We introduce a generalization of causal graphical models, Path Dependent Structural Equation Models (PDSEMs), that can describe such systems. We show how causal inference may be performed in such models and illustrate its use in simulations and data obtained from a septoplasty surgical procedure.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/srinivasan21a.html
https://proceedings.mlr.press/v161/srinivasan21a.htmlPLSO: A generative framework for decomposing nonstationary time-series into piecewise stationary oscillatory componentsTo capture the slowly time-varying spectral content of real-world time-series, a common paradigm is to partition the data into approximately stationary intervals and perform inference in the time-frequency domain. However, this approach lacks a corresponding nonstationary time-domain generative model for the entire data and thus, time-domain inference occurs in each interval separately. This results in distortion/discontinuity around interval boundaries and can consequently lead to erroneous inferences based on any quantities derived from the posterior, such as the phase. To address these shortcomings, we propose the Piecewise Locally Stationary Oscillation (PLSO) model for decomposing time-series data with slowly time-varying spectra into several oscillatory, piecewise-stationary processes. PLSO, as a nonstationary time-domain generative model, enables inference on the entire time-series without boundary effects and simultaneously provides a characterization of its time-varying spectral properties. We also propose a novel two-stage inference algorithm that combines Kalman theory and an accelerated proximal gradient algorithm. We demonstrate these points through experiments on simulated data and real neural data from the rat and the human brain.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/song21a.html
https://proceedings.mlr.press/v161/song21a.htmlUnsupervised anomaly detection with adversarial mirrored autoencodersDetecting out-of-distribution (OOD) samples is of paramount importance in all Machine Learning applications. Deep generative modeling has emerged as a dominant paradigm to model complex data distributions without labels. However, prior work has shown that generative models tend to assign higher likelihoods to OOD samples compared to the data distribution on which they were trained. First, we propose Adversarial Mirrored Autoencoder (AMA), a variant of Adversarial Autoencoder, which uses a mirrored Wasserstein loss in the discriminator to enforce better semantic-level reconstruction. We also propose a latent space regularization to learn a compact manifold for in-distribution samples. The use of AMA produces better feature representations that improve anomaly detection performance. Second, we put forward an alternative measure of anomaly score to replace the reconstruction-based metric which has been traditionally used in generative model-based anomaly detection methods. Our method outperforms the current state-of-the-art methods for anomaly detection on several OOD detection benchmarks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/somepalli21a.html
https://proceedings.mlr.press/v161/somepalli21a.htmlInvariant representation learning for treatment effect estimationThe defining challenge for causal inference from observational data is the presence of ‘confounders’, covariates that affect both treatment assignment and the outcome. To address this challenge, practitioners collect and adjust for the covariates, hoping that they adequately correct for confounding. However, including every observed covariate in the adjustment runs the risk of including ‘bad controls’, variables that <em>induce</em> bias when they are conditioned on. The problem is that we do not always know which variables in the covariate set are safe to adjust for and which are not. To address this problem, we develop Nearly Invariant Causal Estimation (NICE). NICE uses invariant risk minimization (IRM) [Arj19] to learn a representation of the covariates that, under some assumptions, strips out bad controls but preserves sufficient information to adjust for confounding. Adjusting for the learned representation, rather than the covariates themselves, avoids the induced bias and provides valid causal inferences. We evaluate NICE on both synthetic and semi-synthetic data. When the covariates contain unknown collider variables and other bad controls, NICE performs better than adjusting for all the covariates.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/shi21a.html
https://proceedings.mlr.press/v161/shi21a.htmlGraph-based semi-supervised learning through the lens of safetyGraph-based semi-supervised learning (G-SSL) algorithms have witnessed rapid development and widespread usage across a variety of applications in recent years. However, the theoretical characterisation of the efficacy of such algorithms has remained an under-explored area. We introduce a novel algorithm for G-SSL, CSX, whose objective function extends those of Label Propagation and Expander, two popular G-SSL algorithms. We provide data-dependent generalisation error bounds for all three aforementioned algorithms when they are applied to graphs drawn from a partially labelled extension of a versatile latent space graph generative model. The bounds we obtain enable us to characterise the predictive performance as measured by accuracy in terms of homophily and label quantity. Building on this we develop a key notion of GLM-safety which enables us to compare G-SSL algorithms on the basis of the range of graphs on which they obtain a guaranteed accuracy. We show that the proposed algorithm CSX has a better GLM-safety profile than Label Propagation and Expander while achieving comparable or better accuracy on synthetic as well as real-world benchmark networks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/sheshadri21a.html
https://proceedings.mlr.press/v161/sheshadri21a.htmlSketching curvature for efficient out-of-distribution detection for deep neural networksIn order to safely deploy Deep Neural Networks (DNNs) within the perception pipelines of real-time decision making systems, there is a need for safeguards that can detect out-of-training-distribution (OoD) inputs both efficiently and accurately. Building on recent work leveraging the local curvature of DNNs to reason about epistemic uncertainty, we propose Sketching Curvature for OoD Detection (SCOD), an architecture-agnostic framework for equipping any trained DNN with a task-relevant epistemic uncertainty estimate. Offline, given a trained model and its training data, SCOD employs tools from matrix sketching to tractably compute a low-rank approximation of the Fisher information matrix, which characterizes which directions in the weight space are most influential on the predictions over the training data. Online, we estimate uncertainty by measuring how much perturbations orthogonal to these directions can alter predictions at a new test input. We apply SCOD to pre-trained networks of varying architectures on several tasks, ranging from regression to classification. We demonstrate that SCOD achieves comparable or better OoD detection performance with lower computational burden relative to existing baselines.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/sharma21a.html
https://proceedings.mlr.press/v161/sharma21a.htmlPrincipal component analysis in the stochastic differential privacy modelIn this paper, we study the differentially private Principal Component Analysis (PCA) problem in stochastic optimization settings. We first propose a new stochastic gradient perturbation PCA mechanism (DP-SPCA) for the calculation of the right singular subspace to achieve $(\epsilon,\delta)$-differential privacy. For achieving a better utility guarantee and performance, we then present a new differential privacy stochastic variance reduction mechanism (DP-VRPCA) with gradient perturbation for PCA. To the best of our knowledge, this is the first work of stochastic gradient perturbation for $(\epsilon,\delta)$-differentially private PCA. We also compare the proposed algorithms with existing state-of-the-art methods, and experiments on real-world datasets and on classification tasks confirm the improved theoretical guarantees of our algorithms.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/shang21a.html
https://proceedings.mlr.press/v161/shang21a.htmlOn the distribution of penultimate activations of classification networksThis paper studies probability distributions of penultimate activations of classification networks. We show that, when a classification network is trained with the cross-entropy loss, its final classification layer forms a Generative-Discriminative pair with a generative classifier based on a specific distribution of penultimate activations. More importantly, the distribution is parameterized by the weights of the final fully-connected layer, and can be considered as a generative model that synthesizes the penultimate activations without feeding input data. We empirically demonstrate that this generative model enables stable knowledge distillation in the presence of domain shift, and can transfer knowledge from a classifier to variational autoencoders and generative adversarial networks for class-conditional image generation.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/seo21a.html
https://proceedings.mlr.press/v161/seo21a.htmlIdentifying untrustworthy predictions in neural networks by geometric gradient analysisThe susceptibility of deep neural networks to untrustworthy predictions, including out-of-distribution (OOD) data and adversarial examples, still prevents their widespread use in safety-critical applications. Most existing methods either require retraining of a given model to achieve robust identification of adversarial attacks or are limited to out-of-distribution sample detection only. In this work, we propose a geometric gradient analysis (GGA) to improve the identification of untrustworthy predictions without retraining of a given model. GGA analyzes the geometry of the loss landscape of neural networks based on the saliency maps of their respective input. We observe considerable differences between the input gradient geometry of trustworthy and untrustworthy predictions. Using these differences, GGA outperforms prior approaches in detecting OOD data and adversarial attacks, including state-of-the-art and adaptive attacks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/schwinn21a.html
https://proceedings.mlr.press/v161/schwinn21a.htmlClassification with abstention but without disparitiesClassification with abstention has gained a lot of attention in recent years as it allows human decision-makers to be incorporated in the process. Yet, abstention can potentially amplify disparities and lead to discriminatory predictions. The goal of this work is to build a general purpose classification algorithm, which is able to abstain from prediction, while avoiding disparate impact. We formalize this problem as risk minimization under fairness and abstention constraints for which we derive the form of the optimal classifier. Building on this result, we propose a post-processing classification algorithm, which is able to modify any off-the-shelf score-based classifier using only an unlabeled sample. We establish finite sample risk, fairness, and abstention guarantees for the proposed algorithm. In particular, it is shown that fairness and abstention constraints can be achieved independently from the initial classifier as long as sufficient unlabeled data is available. The risk guarantee is established in terms of the quality of the initial classifier. Our post-processing scheme reduces to a sparse linear program allowing for an efficient implementation, which we provide. Finally, we validate our method empirically, showing that moderate abstention rates make it possible to bypass the risk-fairness trade-off.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/schreuder21a.html
https://proceedings.mlr.press/v161/schreuder21a.htmlDoubly non-central beta matrix factorization for DNA methylation dataWe present a new non-negative matrix factorization model for $(0,1)$ bounded-support data based on the doubly non-central beta (DNCB) distribution, a generalization of the beta distribution. The expressiveness of the DNCB distribution is particularly useful for modeling DNA methylation datasets, which are typically highly dispersed and multi-modal; however, the model structure is sufficiently general that it can be adapted to many other domains where latent representations of $(0,1)$ bounded-support data are of interest. Although the DNCB distribution lacks a closed-form conjugate prior, several augmentations let us derive an efficient posterior inference algorithm composed entirely of analytic updates. Our model improves out-of-sample predictive performance on both real and synthetic DNA methylation datasets over state-of-the-art methods in bioinformatics. In addition, our model yields meaningful latent representations that accord with existing biological knowledge.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/schein21a.html
https://proceedings.mlr.press/v161/schein21a.htmlEfficient online inference for nonparametric mixture modelsNatural data are often well-described as belonging to latent clusters. When the number of clusters is unknown, Bayesian nonparametric (BNP) models can provide a flexible and powerful technique to model the data. However, algorithms for inference in nonparametric mixture models fail to meet two critical requirements for practical use: (1) that inference can be performed online, and (2) that inference is efficient in the large time/sample limit. In this work, we propose a novel Bayesian recursion to efficiently infer a posterior distribution over discrete latent variables from a sequence of observations in an online manner, assuming a Chinese Restaurant Process prior on the sequence of latent variables. Our recursive filter, which we call the Recursive Chinese Restaurant Process (R-CRP), has quasilinear average time complexity and logarithmic average space complexity in the total number of observations. We experimentally compare our filtering method against both online and offline inference algorithms including Markov chain Monte Carlo, variational approximations and DP-Means, and demonstrate that our inference algorithm achieves comparable or better performance for a fraction of the runtime.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/schaeffer21a.html
https://proceedings.mlr.press/v161/schaeffer21a.htmlModeling financial uncertainty with multivariate temporal entropy-based curriculumsIn the financial realm, profit generation greatly relies on the complicated task of stock prediction. Lately, neural methods have shown success in exploiting stock affecting signals from textual data across news and tweets to forecast stock performance. However, the dynamic, stochastic, and variably influential nature of text and prices makes it difficult to train neural stock trading models, limiting predictive performance and profits. To transcend this limitation, we propose a novel multi-modal curriculum learning approach: FinCLASS, which evaluates stock affecting signals via entropy-based heuristics and measures their linguistic and price-based complexities in a time-aware, hierarchical fashion. We show that training financial models can benefit by exposing neural networks to easier examples of stock affecting signals early during the training phase, before introducing samples having more complex linguistic and price-based temporal variations. Through experiments on benchmark English tweets and Chinese financial news spanning two major indexes and four global markets, we show how FinCLASS outperforms state-of-the-art across financial tasks of stock movement prediction, volatility regression, and profit generation. Through ablative and qualitative experiments, we set the case for FinCLASS as a generalizable framework for developing natural language-centric neural models for financial tasks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/sawhney21a.html
https://proceedings.mlr.press/v161/sawhney21a.htmlImproved generalization bounds of group invariant / equivariant deep networks via quotient feature spacesNumerous invariant (or equivariant) neural networks have succeeded in handling invariant data such as point clouds and graphs. However, a generalization theory for these neural networks has not been well developed, because several essential factors for the theory, such as network size and margin distribution, are not deeply connected to the invariance and equivariance. In this study, we develop a novel generalization error bound for invariant and equivariant deep neural networks. To describe the effect of invariance and equivariance on generalization, we develop a notion of a <em>quotient feature space</em>, which measures the effect of group actions for the properties. Our main result proves that the volume of quotient feature spaces can describe the generalization error. Furthermore, the bound shows that the invariance and equivariance significantly improve the leading term of the bound. We apply our result to specific invariant and equivariant networks, such as DeepSets (Zaheer et al., NIPS 2017), and show that their generalization bound is considerably improved by $\sqrt{n!}$, where $n!$ is the number of permutations. We also discuss the expressive power of invariant DNNs and show that they can achieve an optimal approximation rate. Moreover, we conducted experiments to support our results.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/sannai21a.html
https://proceedings.mlr.press/v161/sannai21a.htmlHierarchical infinite relational modelThis paper describes the hierarchical infinite relational model (HIRM), a new probabilistic generative model for noisy, sparse, and heterogeneous relational data. Given a set of relations defined over a collection of domains, the model first infers multiple non-overlapping clusters of relations using a top-level Chinese restaurant process. Within each cluster of relations, a Dirichlet process mixture is then used to partition the domain entities and model the probability distribution of relation values. The HIRM generalizes the standard infinite relational model and can be used for a variety of data analysis tasks including dependence detection, clustering, and density estimation. We present new algorithms for fully Bayesian posterior inference via Gibbs sampling. We illustrate the efficacy of the method on a density estimation benchmark of twenty object-attribute datasets with up to 18 million cells and use it to discover relational structure in real-world datasets from politics and genomics.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/saad21a.html
https://proceedings.mlr.press/v161/saad21a.htmlThe neural moving average model for scalable variational inference of state space modelsVariational inference has had great success in scaling approximate Bayesian inference to big data by exploiting mini-batch training. To date, however, this strategy has been most applicable to models of independent data. We propose an extension to state space models of time series data based on a novel generative model for latent temporal states: the neural moving average model. This permits a subsequence to be sampled without drawing from the entire distribution, enabling training iterations to use mini-batches of the time series at low computational cost. We illustrate our method on autoregressive, Lotka-Volterra, FitzHugh-Nagumo and stochastic volatility models, achieving accurate parameter estimation in a short time.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ryder21a.html
https://proceedings.mlr.press/v161/ryder21a.htmlUnbiased gradient estimation for variational auto-encoders using coupled Markov chainsThe variational auto-encoder (VAE) is a deep latent variable model that has two neural networks in an autoencoder-like architecture; one of them parameterizes the model’s likelihood. Fitting its parameters via maximum likelihood (ML) is challenging since the computation of the marginal likelihood involves an intractable integral over the latent space; thus the VAE is trained instead by maximizing a variational lower bound. Here, we develop a ML training scheme for VAEs by introducing unbiased estimators of the log-likelihood gradient. We obtain the estimators by augmenting the latent space with a set of importance samples, similarly to the importance weighted auto-encoder (IWAE), and then constructing a Markov chain Monte Carlo coupling procedure on this augmented space. We provide the conditions under which the estimators can be computed in finite time and with finite variance. We show experimentally that VAEs fitted with unbiased estimators exhibit better predictive performance.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ruiz21a.html
https://proceedings.mlr.press/v161/ruiz21a.htmlCompositional abstraction error and a category of causal modelsInterventional causal models describe several joint distributions over some variables used to describe a system, one for each intervention setting. They provide a formal recipe for how to move between the different joint distributions and make predictions about the variables upon intervening on the system. Yet, it is difficult to formalise how we may change the underlying variables used to describe the system, say moving from fine-grained to coarse-grained variables. Here, we argue that compositionality is a desideratum for such model transformations and the associated errors: When abstracting a reference model M iteratively, first obtaining M’ and then further simplifying that to obtain M”, we expect the composite transformation from M to M” to exist and its error to be bounded by the errors incurred by each individual transformation step. Category theory, the study of mathematical objects via compositional transformations between them, offers a natural language to develop our framework for model transformations and abstractions. We introduce a category of finite interventional causal models and, leveraging theory of enriched categories, prove the desired compositionality properties for our framework.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/rischel21a.html
https://proceedings.mlr.press/v161/rischel21a.htmlMaximal ancestral graph structure learning via exact searchGeneralizing Bayesian networks, maximal ancestral graphs (MAGs) are a theoretically appealing model class for dealing with unobserved variables. Despite significant advances in developing practical exact algorithms for learning score-optimal Bayesian networks, practical exact algorithms for learning score-optimal MAGs have not been developed to date. We develop here methodology for score-based structure learning of directed maximal ancestral graphs. In particular, we develop local score computation employing a linear Gaussian BIC score, as well as score pruning techniques, which are essential for exact structure learning approaches. Furthermore, employing dynamic programming and branch and bound, we present a first exact search algorithm that is guaranteed to find a globally optimal MAG for given local scores. The experiments show that our approach is able to find considerably higher scoring MAGs than previously proposed inexact approaches.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/rantanen21a.html
https://proceedings.mlr.press/v161/rantanen21a.htmlClass balancing GAN with a classifier in the loopGenerative Adversarial Networks (GANs) have swiftly evolved to imitate increasingly complex image distributions. However, the majority of the developments focus on the performance of GANs on balanced datasets. We find that the existing GANs and their training regimes which work well on balanced datasets fail to be effective in the case of imbalanced (i.e. long-tailed) datasets. In this work we introduce a novel theoretically motivated Class Balancing regularizer for training GANs. Our regularizer makes use of the knowledge from a pre-trained classifier to ensure balanced learning of all the classes in the dataset. This is achieved via modelling the effective class frequency based on the exponential forgetting observed in neural networks and encouraging the GAN to focus on underrepresented classes. We demonstrate the utility of our regularizer in learning representations for long-tailed distributions via achieving better performance than existing approaches over multiple datasets. Specifically, when applied to an unconditional GAN, it improves the FID from $13.03$ to $9.01$ on the long-tailed iNaturalist-$2019$ dataset.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/rangwani21a.html
https://proceedings.mlr.press/v161/rangwani21a.htmlVariance reduction in frequency estimators via control variates methodGenerating succinct summaries (also known as <em>sketches</em>) of massive data streams is becoming increasingly important. Such a task typically requires fast, accurate, and small space algorithms in order to support the downstream applications, mainly in areas such as data analysis, machine learning and data mining. A fundamental and well-studied problem in this context is that of estimating the frequencies of the items appearing in a data stream. The Count-Min-Sketch (Cormode and Muthukrishnan, J. Algorithms, 55(1):58–75, 2005) and Count-Sketch (Charikar et al., Theor. Comput. Sci., 312(1):3–15, 2004) are two known classical algorithms for this purpose. However, a limitation of these techniques is that the variance of their estimate tends to be large. In this work, we address this problem and suggest a technique that reduces the variance in their respective estimates, at the cost of little computational overhead. Our technique relies on the classical Control-Variate trick (Lavenberg and Welch, Manage. Sci., 27:322–335, 1981) used for reducing variance in Monte-Carlo simulation. We present a theoretical analysis of our proposal by carefully choosing the control variates and complement them with experiments on synthetic as well as real-world datasets.Wed, 01 Dec 2021 00:00:00 +0000
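For readers unfamiliar with the baseline estimator named in the abstract above, here is a minimal sketch of a plain Count-Min-Sketch frequency estimator. It does not implement the control-variate correction the abstract proposes; the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Plain Count-Min-Sketch (no variance reduction); parameters are illustrative."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting the hash with the row number gives a different hash per row.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum over rows
        # never underestimates the true count for non-negative updates.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

With non-negative updates the estimate is an upper bound on the true frequency; the overestimate comes from hash collisions, which is the error source that variance-reduction techniques aim to shrink.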
https://proceedings.mlr.press/v161/pratap21a.html
https://proceedings.mlr.press/v161/pratap21a.htmlCompetitive policy optimizationA core challenge in policy optimization in competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. We propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods utilize only linear approximations, and hence do not capture players’ interactions. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We theoretically study these methods, and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/prajapat21a.html
https://proceedings.mlr.press/v161/prajapat21a.htmlDistribution-free uncertainty quantification for classification under label shiftTrustworthy deployment of ML models requires a proper measure of uncertainty, especially in safety-critical applications. We focus on uncertainty quantification (UQ) for classification problems via two avenues — prediction sets using conformal prediction and calibration of probabilistic predictors by post-hoc binning — since these possess distribution-free guarantees for i.i.d. data. Two common ways of generalizing beyond the i.i.d. setting include handling <em>covariate</em> and <em>label</em> shift. Within the context of distribution-free UQ, the former has already received attention, but not the latter. It is known that label shift hurts prediction, and we first argue that it also hurts UQ, by showing degradation in coverage and calibration. Piggybacking on recent progress in addressing label shift (for better prediction), we examine the right way to achieve UQ by reweighting the aforementioned conformal and calibration procedures whenever some unlabeled data from the target distribution is available. We examine these techniques theoretically in a distribution-free framework and demonstrate their excellent practical performance.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/podkopaev21a.html
https://proceedings.mlr.press/v161/podkopaev21a.htmlGP-ConvCNP: Better generalization for conditional convolutional Neural Processes on time series dataNeural Processes (NPs) are a family of conditional generative models that are able to model a distribution over functions, in a way that allows them to perform predictions at test time conditioned on a number of context points. A recent addition to this family, Convolutional Conditional Neural Processes (ConvCNP), have shown remarkable improvement in performance over prior art, but we find that they sometimes struggle to generalize when applied to time series data. In particular, they are not robust to distribution shifts and fail to extrapolate observed patterns into the future. By incorporating a Gaussian Process into the model, we are able to remedy this and at the same time improve performance within distribution. As an added benefit, the Gaussian Process reintroduces the possibility to sample from the model, a key feature of other members in the NP family.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/petersen21a.html
https://proceedings.mlr.press/v161/petersen21a.htmlAddressing fairness in classification with a model-agnostic multi-objective algorithmThe goal of fairness in classification is to learn a classifier that does not discriminate against groups of individuals based on sensitive attributes, such as race and gender. One approach to designing fair algorithms is to use relaxations of fairness notions as regularization terms or in a constrained optimization problem. We observe that the hyperbolic tangent function can approximate the indicator function. We leverage this property to define a differentiable relaxation that approximates fairness notions provably better than existing relaxations. In addition, we propose a model-agnostic multi-objective architecture that can simultaneously optimize for multiple fairness notions and multiple sensitive attributes and supports all statistical parity-based notions of fairness. We use our relaxation with the multi-objective architecture to learn fair classifiers. Experiments on public datasets show that our method suffers a significantly lower loss of accuracy than current debiasing algorithms relative to the unconstrained model.Wed, 01 Dec 2021 00:00:00 +0000
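The observation in the abstract above, that the hyperbolic tangent can approximate the indicator function, can be sketched as follows. The abstract does not give the exact relaxation, so the shifted-and-scaled form, the `sharpness` parameter, and the helper `relaxed_parity_gap` are assumptions for illustration rather than the authors' definition.

```python
import math


def tanh_indicator(score, sharpness=10.0):
    """Smooth stand-in for the indicator 1[score > 0]; `sharpness` is hypothetical."""
    return 0.5 * (1.0 + math.tanh(sharpness * score))


def relaxed_parity_gap(scores, groups, sharpness=10.0):
    """Absolute difference in soft positive rates between group 0 and group 1.

    Assumes both groups are non-empty. Replacing the hard threshold with
    tanh_indicator makes the gap differentiable in the scores.
    """
    rates = []
    for g in (0, 1):
        members = [s for s, grp in zip(scores, groups) if grp == g]
        soft_positives = sum(tanh_indicator(s, sharpness) for s in members)
        rates.append(soft_positives / len(members))
    return abs(rates[0] - rates[1])
```

As `sharpness` grows, `tanh_indicator` approaches the hard step function, at the cost of steeper gradients near zero; a quantity such as `relaxed_parity_gap` could then be used as a regularization term in the sense the abstract describes.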
https://proceedings.mlr.press/v161/padh21a.html
https://proceedings.mlr.press/v161/padh21a.htmlTowards tractable optimism in model-based reinforcement learningThe principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/pacchiano21a.html
https://proceedings.mlr.press/v161/pacchiano21a.htmlUncertainty-aware sensitivity analysis using Rényi divergencesFor nonlinear supervised learning models, assessing the importance of predictor variables or their interactions is not straightforward because importance can vary in the domain of the variables. Importance can be assessed locally with sensitivity analysis using general methods that rely on the model’s predictions or their derivatives. In this work, we extend derivative based sensitivity analysis to a Bayesian setting by differentiating the Rényi divergence of a model’s predictive distribution. By utilising the predictive distribution instead of a point prediction, the model uncertainty is taken into account in a principled way. Our empirical results on simulated and real data sets demonstrate accurate and reliable identification of important variables and interaction effects compared to alternative methods.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/paananen21a.html
https://proceedings.mlr.press/v161/paananen21a.htmlNo-regret approximate inference via Bayesian optimisationWe consider Bayesian inference problems where the likelihood function is either expensive to evaluate or only available via noisy estimates. This setting encompasses application scenarios involving, for example, large datasets or models whose likelihood evaluations require expensive simulations. We formulate this problem within a Bayesian optimisation framework over a space of probability distributions and derive an upper confidence bound (UCB) algorithm to propose non-parametric distribution candidates. The algorithm is designed to minimise regret, which is defined as the Kullback-Leibler divergence with respect to the true posterior in this case. Equipped with a Gaussian process surrogate model, we show that the resulting UCB algorithm achieves asymptotically no regret. The method can be easily implemented as a batch Bayesian optimisation algorithm whose point evaluations are selected via Markov chain Monte Carlo. Experimental results demonstrate the method’s performance on inference problems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/oliveira21a.html
https://proceedings.mlr.press/v161/oliveira21a.htmlApproximation algorithm for submodular maximization under submodular coverWe study a new optimization problem called <em>submodular maximization under submodular cover</em> (SMSC), which requires finding a fixed-size set such that one monotone submodular function $f$ is maximized subject to the constraint that another monotone submodular function $g$ is maximized approximately. SMSC is preferable to submodular function maximization when one wants to maximize two objective functions simultaneously. We propose an optimization framework for SMSC, which guarantees a constant-factor approximation. Our algorithm’s key idea is to construct a new instance of submodular function maximization from a given instance of SMSC, which can be approximated efficiently. Besides, if we are given an approximation oracle for submodular function maximization, our algorithm provably produces nearly optimal solutions. We experimentally evaluate the proposed algorithm in terms of sensor placement and movie recommendation using real-world data.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ohsaka21a.html
https://proceedings.mlr.press/v161/ohsaka21a.htmlMixed variable Bayesian optimization with frequency modulated kernelsThe sample efficiency of Bayesian optimization (BO) is often boosted by Gaussian Process (GP) surrogate models. However, on mixed variable spaces, surrogate models other than GPs are prevalent, mainly due to the lack of kernels which can model complex dependencies across different types of variables. In this paper, we propose the frequency modulated (FM) kernel, which flexibly models dependencies among different types of variables, so that BO can enjoy further improved sample efficiency. The FM kernel uses distances on continuous variables to modulate the graph Fourier spectrum derived from discrete variables. However, the frequency modulation does not always define a kernel with the similarity measure behavior which returns higher values for pairs of more similar points. Therefore, we specify and prove conditions for FM kernels to be positive definite and to exhibit the similarity measure behavior. In experiments, we demonstrate the improved sample efficiency of GP BO using FM kernels (BO-FM). On synthetic problems and hyperparameter optimization problems, BO-FM outperforms competitors consistently. Also, the importance of the frequency modulation principle is empirically demonstrated on the same problems. On joint optimization of neural architectures and SGD hyperparameters, BO-FM outperforms competitors including Regularized evolution (RE) and BOHB. Remarkably, BO-FM performs better even than RE and BOHB using three times as many evaluations.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/oh21a.html
https://proceedings.mlr.press/v161/oh21a.htmlThe promises and pitfalls of deep kernel learningDeep kernel learning and related techniques promise to combine the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes. One crucial aspect of these models is an expectation that, because they are treated as Gaussian process models optimized using the marginal likelihood, they are protected from overfitting. However, we identify pathological behavior, including overfitting, on a simple toy example. We explore this pathology, explaining its origins and considering how it applies to real datasets. Through careful experimentation on UCI datasets, CIFAR-10, and the UTKFace dataset, we find that the overfitting from overparameterized deep kernel learning, in which the model is “somewhat Bayesian”, can in certain scenarios be worse than that from not being Bayesian at all. However, we find that a fully Bayesian treatment of deep kernel learning can rectify this overfitting and obtain the desired performance improvements over standard neural networks and Gaussian processes.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ober21a.html
https://proceedings.mlr.press/v161/ober21a.htmlMatrix games with bandit feedbackWe study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each other’s actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the new algorithms. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best-response to the learner’s mixed strategy. Along the way, we show that Thompson sampling fails catastrophically in this setting and provide empirical comparison to existing algorithms.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/o-donoghue21a.html
https://proceedings.mlr.press/v161/o-donoghue21a.htmlRobust principal component analysis for generalized multi-view modelsIt has long been known that principal component analysis (PCA) is not robust with respect to gross data corruption. This has been addressed by robust principal component analysis (RPCA). The first computationally tractable definition of RPCA decomposes a data matrix into a low-rank and a sparse component. The low-rank component represents the principal components, while the sparse component accounts for the data corruption. Previous works consider the corruption of individual entries or whole columns of the data matrix. In contrast, we consider a more general form of data corruption that affects groups of measurements. We show that the decomposition approach remains computationally tractable and allows the exact recovery of the decomposition when only the corrupted data matrix is given. Experiments on synthetic data corroborate our theoretical findings, and experiments on several real-world datasets from different domains demonstrate the wide applicability of our generalized approach.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nussbaum21a.html
https://proceedings.mlr.press/v161/nussbaum21a.htmlTensor-train density estimationEstimation of a probability density function from samples is one of the central problems in statistics and machine learning. Modern neural network-based models can learn high dimensional distributions but have problems with hyperparameter selection and are often prone to instabilities during training and inference. We propose a new efficient tensor train-based model for density estimation (TTDE). Such density parametrization allows exact sampling and calculation of cumulative and marginal density functions and of the partition function. It also has very intuitive hyperparameters. We develop an efficient non-adversarial training procedure for TTDE based on Riemannian optimization. Experimental results demonstrate the competitive performance of the proposed method in density estimation and sampling tasks, while TTDE significantly outperforms competitors in training speed.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/novikov21a.html
https://proceedings.mlr.press/v161/novikov21a.htmlTrusted-maximizers entropy search for efficient Bayesian optimizationInformation-based Bayesian optimization (BO) algorithms have achieved state-of-the-art performance in optimizing a black-box objective function. However, they usually require several approximations or simplifying assumptions (without clearly understanding their effects on the BO performance) and/or their generalization to batch BO is computationally unwieldy, especially with an increasing batch size. To alleviate these issues, this paper presents a novel trusted-maximizers entropy search (TES) acquisition function: It measures how much an input query contributes to the information gain on the maximizer over a finite set of trusted maximizers, i.e., inputs optimizing functions that are sampled from the Gaussian process posterior belief of the objective function. Evaluating TES requires either only a stochastic approximation with sampling or a deterministic approximation with expectation propagation, both of which are investigated and empirically evaluated using synthetic benchmark objective functions and real-world optimization problems, e.g., hyperparameter tuning of a convolutional neural network and synthesizing physically realizable faces to fool a black-box face recognition system. Though TES can naturally be generalized to a batch variant with either approximation, the latter is amenable to be scaled to a much larger batch size in our experiments.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nguyen21d.html
https://proceedings.mlr.press/v161/nguyen21d.htmlLearning to learn with Gaussian processesThis paper presents Gaussian process meta-learning (GPML) for few-shot regression, which explicitly exploits the distance between regression problems/tasks using a novel task kernel. It contrasts sharply with the popular metric-based meta-learning approach which is based on the distance between data inputs or their embeddings in the few-shot learning literature. Apart from the superior predictive performance by capturing the diversity of different tasks, GPML offers a set of representative tasks that are useful for understanding the task distribution. We empirically demonstrate the performance and interpretability of GPML in several few-shot regression problems involving a multimodal task distribution and real-world datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nguyen21c.html
https://proceedings.mlr.press/v161/nguyen21c.htmlProbabilistic task modelling for meta-learningWe propose <em>probabilistic task modelling</em> – a generative probabilistic model for collections of tasks used in meta-learning. The proposed model combines variational auto-encoding and latent Dirichlet allocation to model each task as a mixture of Gaussian distributions in an embedding space. Such modelling provides an explicit representation of a task through its task-theme mixture. We present an efficient approximate inference technique based on variational inference for empirical Bayes parameter estimation. We perform empirical evaluations to validate the <em>task uncertainty</em> and <em>task distance</em> produced by the proposed method through correlation diagrams of the prediction accuracy on testing tasks. We also carry out experiments on task selection in meta-learning to demonstrate how the task relatedness inferred from the proposed model helps to facilitate meta-learning algorithms.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nguyen21b.html
https://proceedings.mlr.press/v161/nguyen21b.htmlMost: multi-source domain adaptation via optimal transport for student-teacher learningMulti-source domain adaptation (DA) is more challenging than conventional DA because the knowledge is transferred from several source domains to a target domain. To this end, we propose in this paper a novel model for multi-source DA using the theory of optimal transport and imitation learning. More specifically, our approach consists of two cooperative agents: a teacher classifier and a student classifier. The teacher classifier is a combined expert that leverages the knowledge of domain experts and can be theoretically guaranteed to handle source examples perfectly, while the student classifier acting on the target domain tries to imitate the teacher classifier acting on the source domains. Our rigorous theory developed based on optimal transport makes this cross-domain imitation possible and also helps to mitigate not only the data shift but also the label shift, which are inherently thorny issues in DA research. We conduct comprehensive experiments on real-world datasets to demonstrate the merit of our approach and its optimal transport based imitation learning viewpoint. Experimental results show that, to the best of our knowledge, our proposed method achieves state-of-the-art performance on benchmark datasets for multi-source domain adaptation including Digits-five, Office-Caltech10, and Office-31.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nguyen21a.html
https://proceedings.mlr.press/v161/nguyen21a.htmlGenerative Archimedean copulasWe propose a new generative modeling technique for learning multidimensional cumulative distribution functions (CDFs) in the form of copulas. Specifically, we consider certain classes of copulas known as Archimedean and hierarchical Archimedean copulas, popular for their parsimonious representation and ability to model different tail dependencies. We consider their representation as mixture models with Laplace transforms of latent random variables from generative neural networks. This alternative representation allows for computational efficiencies and easy sampling, especially in high dimensions. We describe multiple methods for optimizing the network parameters. Finally, we present empirical results that demonstrate the efficacy of our proposed method in learning multidimensional CDFs and its computational efficiency compared to existing methods.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ng21a.html
https://proceedings.mlr.press/v161/ng21a.htmlGraph reparameterizations for enabling 1000+ Monte Carlo iterations in Bayesian deep neural networksUncertainty estimation in deep models is essential in many real-world applications and has benefited from developments over the last several years. Recent evidence suggests that existing solutions dependent on simple Gaussian formulations may not be sufficient. However, moving to other distributions necessitates Monte Carlo (MC) sampling to estimate quantities such as the KL divergence: it could be expensive and scales poorly as the dimensions of both the input data and the model grow. This is directly related to the structure of the computation graph, which can grow linearly as a function of the number of MC samples needed. Here, we construct a framework to describe these computation graphs, and identify probability families where the graph size can be independent or only weakly dependent on the number of MC samples. These families correspond directly to large classes of distributions. Empirically, we can run a much larger number of iterations for MC approximations for larger architectures used in computer vision with gains in performance measured in confident accuracy, stability of training, memory and training time.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nazarovs21b.html
https://proceedings.mlr.press/v161/nazarovs21b.htmlA variational approximation for analyzing the dynamics of panel dataPanel data involving longitudinal measurements of the same set of participants or entities taken over multiple time points is common in studies to understand early childhood development and disease modeling. Deep hybrid models that marry the predictive power of neural networks with physical simulators such as differential equations, are starting to drive advances in such applications. The task of modeling not just the observations/data but the hidden dynamics that are captured by the measurements poses interesting statistical/computational questions. We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing such panel data. We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem. We then derive Evidence Based Lower Bounds for ME-NODE, and develop (efficient) training algorithms using MC based sampling methods and numerical ODE solvers. We demonstrate ME-NODE’s utility on tasks spanning the spectrum from simulations and toy datasets to real longitudinal 3D imaging data from an Alzheimer’s disease (AD) study, and study the performance for accuracy of reconstruction for interpolation, uncertainty estimates and personalized prediction.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nazarovs21a.html
https://proceedings.mlr.press/v161/nazarovs21a.htmlGlobal explanations with decision rules: a co-learning approachBlack-box machine learning models can be extremely accurate. Yet, in critical applications such as healthcare or justice, if models cannot be explained, domain experts will be reluctant to use them. A common way to explain a black-box model is to approximate it by a simpler model such as a decision tree. In this paper, we propose a co-learning framework to learn decision rules as explanations of black-box models through knowledge distillation and simultaneously constrain the black-box model by these explanations; all of this in a differentiable manner. To do so, we introduce the soft truncated Gaussian mixture analysis (STruGMA), a probabilistic model which encapsulates hyper-rectangle decision rules. With STruGMA, global explanations can be extracted by any rule learner such as decision lists, sets or trees. We provide experimental evidence that our framework can globally explain differentiable black-box models such as neural networks. In particular, the explanation fidelity is increased, while the accuracy of the models is only marginally impacted.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/nanfack21a.html
https://proceedings.mlr.press/v161/nanfack21a.htmlvariational combinatorial sequential monte carlo methods for bayesian phylogenetic inferenceBayesian phylogenetic inference is often conducted via local or sequential search over topologies and branch lengths using algorithms such as random-walk Markov chain Monte Carlo (MCMC) or Combinatorial Sequential Monte Carlo (CSMC). However, when MCMC is used for evolutionary parameter learning, convergence requires long runs with inefficient exploration of the state space. We introduce Variational Combinatorial Sequential Monte Carlo (VCSMC), a powerful framework that establishes variational sequential search to learn distributions over intricate combinatorial structures. We then develop nested CSMC, an efficient proposal distribution for CSMC and prove that nested CSMC is an exact approximation to the (intractable) locally optimal proposal. We use nested CSMC to define a second objective, VNCSMC which yields tighter lower bounds than VCSMC. We show that VCSMC and VNCSMC are computationally efficient and explore higher probability spaces than existing methods on a range of tasks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/moretti21a.html
https://proceedings.mlr.press/v161/moretti21a.htmlFlexAE: flexibly learning latent priors for wasserstein auto-encodersAuto-Encoder (AE) based neural generative frameworks model the joint-distribution between the data and the latent space using an Encoder-Decoder pair, with regularization imposed in terms of a prior over the latent space. Despite their advantages, such as stability in training and efficient inference, the performance of AE based models has not reached the superior standards of other generative models such as Generative Adversarial Networks (GANs). Motivated by this, we examine the effect of the latent prior on the generation quality of deterministic AE models in this paper. Specifically, we consider the class of Generative AE models with deterministic Encoder-Decoder pair (such as Wasserstein Auto-Encoder (WAE), Adversarial Auto-Encoder (AAE)), and show that having a fixed prior distribution, <em>a priori</em>, oblivious to the dimensionality of the ‘true’ latent space, will lead to the infeasibility of the optimization problem considered. As a remedy to the issue mentioned above, we introduce an additional state space in the form of flexibly learnable latent priors, in the optimization objective of WAE/AAE. Additionally, we employ a latent-space interpolation based smoothing scheme to address the non-smoothness that may arise from highly flexible priors. We show the efficacy of our proposed models, called FlexAE and FlexAE-SR, through several experiments on multiple datasets, and demonstrate that FlexAE-SR is the new state-of-the-art for the AE based generative models in terms of generation quality as measured by several metrics such as Fréchet Inception Distance and Precision/Recall score.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/mondal21a.html
https://proceedings.mlr.press/v161/mondal21a.htmlA unifying framework for observer-aware planning and its complexityBeing aware of observers and the inferences they make about an agent’s behavior is crucial for successful multi-agent interaction. Existing works on observer-aware planning use different assumptions and techniques to produce observer-aware behaviors. We argue that observer-aware planning, in its most general form, can be modeled as an Interactive POMDP (I-POMDP), which requires complex modeling and is hard to solve. Hence, we introduce a less complex framework for producing observer-aware behaviors called Observer-Aware MDP (OAMDP) and analyze its relationship to I-POMDP. We establish the complexity of OAMDPs and show that they can improve interpretability of agent behaviors in several scenarios.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/miura21a.html
https://proceedings.mlr.press/v161/miura21a.htmlThe curious case of adversarially robust models: More data can help, double descend, or hurt generalizationAdversarial training has shown its ability in producing models that are robust to perturbations on the input data, but usually at the expense of a decrease in the standard accuracy. To mitigate this issue, it is commonly believed that more training data will eventually help such adversarially robust models generalize better on the benign/unperturbed test data. In this paper, however, we challenge this conventional belief and show that more training data can hurt the generalization of adversarially robust models in classification problems. We first investigate the Gaussian mixture classification with a linear loss and identify three regimes based on the strength of the adversary. In the weak adversary regime, more data improves the generalization of adversarially robust models. In the medium adversary regime, with more training data, the generalization loss exhibits a double descent curve, which implies the existence of an intermediate stage where more training data hurts the generalization. In the strong adversary regime, more data almost immediately causes the generalization error to increase. Then we analyze a two-dimensional classification problem with a 0-1 loss. We prove that more data always hurts generalization of adversarially trained models with large perturbations. Empirical studies confirm our theoretical results.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/min21a.html
https://proceedings.mlr.press/v161/min21a.htmlFederated stochastic gradient Langevin dynamicsStochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling. Although we can easily extend SGLD to distributed settings, it suffers from two issues when applied to federated non-IID data. First, the variance of these estimates increases significantly. Second, delaying communication causes the Markov chains to diverge from the true posterior even for very simple models. To alleviate both these problems, we propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates. Notably, conducive gradients are easy to compute, and since we only calculate the approximations once, they incur negligible overhead. We apply conducive gradients to distributed stochastic gradient Langevin dynamics (DSGLD) and call the resulting method “federated stochastic gradient Langevin dynamics” (FSGLD). We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails. We also show that FSGLD outperforms DSGLD for non-IID federated data with experiments on metric learning and neural networks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/mekkaoui21a.html
https://proceedings.mlr.press/v161/mekkaoui21a.htmlq-Paths: Generalizing the geometric annealing path using power meansMany common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average. While alternatives such as the moment-averaging path have demonstrated performance gains in some settings, their practical applicability remains limited by exponential family endpoint assumptions and a lack of closed form energy function. In this work, we introduce $q$-paths, a family of paths which is derived from a generalized notion of the mean, includes the geometric and arithmetic mixtures as special cases, and admits a simple closed form involving the deformed logarithm function from nonextensive thermodynamics. Following previous analysis of the geometric path, we interpret our $q$-paths as corresponding to a $q$-exponential family of distributions, and provide a variational representation of intermediate densities as minimizing a mixture of $\alpha$-divergences to the endpoints. We show that small deviations away from the geometric path yield empirical gains for Bayesian inference using Sequential Monte Carlo and generative model evaluation using Annealed Importance Sampling.Wed, 01 Dec 2021 00:00:00 +0000
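The power-mean construction described above can be sketched for scalar density values as follows. This is a hedged illustration, assuming the unnormalized intermediate density takes the power-mean form $[(1-\beta)\,\pi_0^{1-q} + \beta\,\pi_1^{1-q}]^{1/(1-q)}$ with mixing parameter $0 < \beta < 1$; the function name and the log-space implementation are illustrative choices, not the authors' code.

```python
import math


def q_path_log_density(log_p0, log_p1, beta, q):
    """Log of the unnormalized q-path density between two endpoint densities.

    Assumed form: [(1 - beta) * p0**(1 - q) + beta * p1**(1 - q)]**(1 / (1 - q)).
    q = 0 gives the arithmetic mixture; q = 1 is handled as the limiting
    geometric path (1 - beta) * log p0 + beta * log p1.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    if abs(q - 1.0) < 1e-12:
        return (1.0 - beta) * log_p0 + beta * log_p1
    a = math.log1p(-beta) + (1.0 - q) * log_p0
    b = math.log(beta) + (1.0 - q) * log_p1
    m = max(a, b)  # log-sum-exp for numerical stability
    return (m + math.log(math.exp(a - m) + math.exp(b - m))) / (1.0 - q)
```

Evaluating the function at `q` slightly below 1 gives values close to the geometric path, which corresponds to the "small deviations away from the geometric path" studied in the abstract.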
https://proceedings.mlr.press/v161/masrani21a.html
https://proceedings.mlr.press/v161/masrani21a.htmlNearest neighbor search under uncertaintyNearest Neighbor Search (NNS) is a central task in knowledge representation, learning, and reasoning. There is vast literature on efficient algorithms for constructing data structures and performing exact and approximate NNS. This paper studies NNS under Uncertainty (NNSU). Specifically, consider the setting in which an NNS algorithm has access only to a stochastic distance oracle that provides a noisy, unbiased estimate of the distance between any pair of points, rather than the exact distance. This models many situations of practical importance, including NNS based on human similarity judgements, physical measurements, or fast, randomized approximations to exact distances. A naive approach to NNSU could employ any standard NNS algorithm and repeatedly query and average results from the stochastic oracle (to reduce noise) whenever it needs a pairwise distance. The problem is that a sufficient number of repeated queries is unknown in advance; e.g., a point may be distant from all but one other point (crude distance estimates suffice) or it may be close to a large number of other points (accurate estimates are necessary). This paper shows how ideas from cover trees and multi-armed bandits can be leveraged to develop an NNSU algorithm that has optimal dependence on the dataset size and the (unknown) geometry of the dataset.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/mason21a.html
https://proceedings.mlr.press/v161/mason21a.htmlA weaker faithfulness assumption based on triple interactionsOne of the core assumptions in causal discovery is the faithfulness assumption—i.e. assuming that independencies found in the data are due to separations in the true causal graph. This assumption can, however, be violated in many ways, including xor connections, deterministic functions or cancelling paths. In this work, we propose a weaker assumption that we call 2-adjacency faithfulness. In contrast to adjacency faithfulness, which assumes that there is no conditional independence between each pair of variables that are connected in the causal graph, we only require no conditional independence between a node and a subset of its Markov blanket that can contain up to two nodes. Equivalently, we adapt orientation faithfulness to this setting. We further propose a sound orientation rule for causal discovery that applies under weaker assumptions. As a proof of concept, we derive a modified Grow and Shrink algorithm that recovers the Markov blanket of a target node and prove its correctness under strictly weaker assumptions than the standard faithfulness assumption.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/marx21a.html
https://proceedings.mlr.press/v161/marx21a.htmlNeural markov logic networksWe introduce neural Markov logic networks (NMLNs), a statistical relational learning system that borrows ideas from Markov logic. Like Markov logic networks (MLNs), NMLNs are an exponential-family model for modelling distributions over possible worlds, but unlike MLNs, they do not rely on explicitly specified first-order logic rules. Instead, NMLNs learn an implicit representation of such rules as a neural network that acts as a potential function on fragments of the relational structure. Similarly to many neural symbolic methods, NMLNs can exploit embeddings of constants but, unlike them, NMLNs work well also in their absence. This is extremely important for predicting in settings other than the transductive one. We showcase the potential of NMLNs on knowledge-base completion, triple classification and on generation of molecular (graph) data.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/marra21a.html
https://proceedings.mlr.press/v161/marra21a.htmlDeep kernels with probabilistic embeddings for small-data learningGaussian Processes (GPs) are known to provide accurate predictions and uncertainty estimates even with small amounts of labeled data by capturing similarity between data points through their kernel function. However, traditional GP kernels are not very effective at capturing similarity between high dimensional data points. Neural networks can be used to learn good representations that encode intricate structures in high dimensional data, and these representations can be used as inputs to the GP kernel. However, the huge data requirement of neural networks makes this approach ineffective in small data settings. To solve the conflicting problems of representation learning and data efficiency, we propose to learn deep kernels on probabilistic embeddings by using a probabilistic neural network. Our approach maps high-dimensional data to a probability distribution in a low dimensional subspace and then computes a kernel between these distributions to capture similarity. To enable end-to-end learning, we derive a functional gradient descent procedure for training the model. Experiments on a variety of datasets show that our approach outperforms the state-of-the-art in GP kernel learning in both supervised and semi-supervised settings. We also extend our approach to other small-data paradigms such as few-shot classification where it outperforms previous approaches on mini-Imagenet and CUB datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/mallick21a.html
https://proceedings.mlr.press/v161/mallick21a.htmlCausal additive models with unobserved variablesCausal discovery from data affected by unobserved variables is an important but difficult problem to solve. The effects that unobserved variables have on the relationships between observed variables are more complex in nonlinear cases than in linear cases. In this study, we focus on causal additive models in the presence of unobserved variables. Causal additive models exhibit structural equations that are additive in the variables and error terms. We take into account the presence of not only unobserved common causes but also unobserved intermediate variables. Our theoretical results show that, when the causal relationships are nonlinear and there are unobserved variables, it is not possible to identify all the causal relationships between observed variables through regression and independence tests. However, our theoretical results also show that it is possible to avoid incorrect inferences. We propose a method to identify all the causal relationships that are theoretically possible to identify without being biased by unobserved variables. The empirical results using artificial data and simulated functional magnetic resonance imaging (fMRI) data show that our method effectively infers causal structures in the presence of unobserved variables.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/maeda21a.html
https://proceedings.mlr.press/v161/maeda21a.htmlImproving uncertainty calibration of deep neural networks via truth discovery and geometric optimizationDeep Neural Networks (DNNs), despite their tremendous success in recent years, could still cast doubts on their predictions due to the intrinsic uncertainty associated with their learning process. Ensemble techniques and post-hoc calibrations are two types of approaches that have individually shown promise in improving the uncertainty calibration of DNNs. However, the synergistic effect of the two types of methods has not been well explored. In this paper, we propose a truth discovery framework to integrate ensemble-based and post-hoc calibration methods. Using the geometric variance of the ensemble candidates as a good indicator for sample uncertainty, we design an accuracy-preserving truth estimator with provably no accuracy drop. Furthermore, we show that post-hoc calibration can also be enhanced by truth discovery-regularized optimization. On large-scale datasets including CIFAR and ImageNet, our method shows consistent improvement against state-of-the-art calibration approaches on both histogram-based and kernel density-based evaluation metrics. Our code is available at https://github.com/horsepurve/truly-uncertain.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ma21a.html
https://proceedings.mlr.press/v161/ma21a.htmlPath-BN: Towards effective batch normalization in the Path Space for ReLU networksNeural networks with ReLU activation functions (abbrev. ReLU Networks) have demonstrated their success in many applications. Recently, researchers noticed that ReLU networks are positively scale-invariant (PSI) while the weights are not. This mismatch may lead to undesirable behaviors in the optimization process. Hence, some new algorithms that conduct optimization directly in the <em>path space</em> (the path space is proven to be PSI) were developed, such as Stochastic Gradient Descent (SGD) in the path space. However, it is still unknown whether other deep learning techniques, such as batch normalization (BN), could also have their counterparts in the path space. In this paper, we conduct a formal study on the design of BN in the path space. First, we propose <em>path-reparameterization</em> of ReLU networks, in which the weights in the networks are reparameterized by path-values. Then, the feedforward and backward propagation of the path-reparameterized networks can calculate the values of the hidden nodes and the gradients in the path space, respectively. Next, we design a novel way to do batch normalization for the path-reparameterized ReLU networks, called <em>Path-BN</em>. Specifically, we notice that path-reparameterized ReLU NNs have a portion of constant weights which play more critical roles in forming the basis of the path space. We propose to exclude these constant weights when doing batch normalization and prove that, by doing so, the scale and the direction of the trained parameters can be more effectively decoupled during training. Finally, we conduct experiments on benchmark datasets. The results show that our proposed Path-BN can improve the performance of the optimization algorithms in the path space.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/luo21b.html
https://proceedings.mlr.press/v161/luo21b.htmlHierarchical probabilistic model for blind source separation via Legendre transformationWe present a novel blind source separation (BSS) method, called information geometric blind source separation (IGBSS). Our formulation is based on the log-linear model equipped with a hierarchically structured sample space, which has theoretical guarantees to uniquely recover a set of source signals by minimizing the KL divergence from a set of mixed signals. Source signals, received signals, and mixing matrices are realized as different layers in our hierarchical sample space. Our empirical results on images and time series data demonstrate that our approach is superior to well established techniques and is able to separate signals with complex interactions.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/luo21a.html
https://proceedings.mlr.press/v161/luo21a.htmlVariance-dependent best arm identificationWe study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of $n$ arms indexed from $1$ to $n$, each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $\theta_i$ and variance $\sigma_i^2$. Assume $\theta_1 > \theta_2 \geq \cdots \geq\theta_n$. We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms and makes future decisions based on the gathered information using a novel approach called <em>grouped median elimination</em>. The proposed algorithm guarantees to output the best arm with probability $(1-\delta)$ and uses at most $O \left(\sum_{i = 1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)(\ln \delta^{-1} + \ln \ln \Delta_i^{-1})\right)$ samples, where $\Delta_i$ ($i \geq 2$) denotes the reward gap between arm $i$ and the best arm and we define $\Delta_1 = \Delta_2$. This achieves a significant advantage over the variance-independent algorithms in some favorable scenarios and is the first result that removes the extra $\ln n$ factor on the best arm compared with the state-of-the-art. We further show that $\Omega \left( \sum_{i = 1}^n \left( \frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \right) \ln \delta^{-1} \right)$ samples are necessary for an algorithm to achieve the same goal, thereby illustrating that our algorithm is optimal up to doubly logarithmic terms.Wed, 01 Dec 2021 00:00:00 +0000
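To make the variance-dependent quantity in the bounds above concrete, the sketch below evaluates $\sum_{i=1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)\ln \delta^{-1}$, the expression inside the stated lower bound, for a given problem instance. It drops the constants hidden by the $\Omega(\cdot)$ notation and the doubly logarithmic term of the upper bound, so it indicates scaling only. The function names are hypothetical, and the input is assumed to be sorted so that the first arm is the unique best arm.

```python
import math


def reward_gaps(means):
    """Gaps Delta_i to the best arm, with Delta_1 defined as Delta_2.

    Assumes means[0] > means[1] >= ... >= means[-1] and at least two arms.
    """
    if len(means) < 2 or means[0] <= means[1]:
        raise ValueError("need at least two arms and a unique best arm first")
    tail = [means[0] - m for m in means[1:]]
    return [tail[0]] + tail


def variance_dependent_complexity(means, variances, delta):
    """Sum over arms of (sigma_i^2 / Delta_i^2 + 1 / Delta_i) * ln(1 / delta)."""
    log_term = math.log(1.0 / delta)
    return sum(
        (var / gap ** 2 + 1.0 / gap) * log_term
        for var, gap in zip(variances, reward_gaps(means))
    )
```

When the variances are much smaller than the gaps, the $1/\Delta_i$ terms dominate the sum, which is the kind of favorable scenario in which the abstract reports an advantage over variance-independent algorithms.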
https://proceedings.mlr.press/v161/lu21a.html
https://proceedings.mlr.press/v161/lu21a.htmlStrategically efficient exploration in competitive multi-agent reinforcement learningHigh sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty can significantly improve the sample efficiency of RL in single agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We will show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/loftin21a.html
https://proceedings.mlr.press/v161/loftin21a.htmlSimilarity measure for sparse time course data based on Gaussian processesWe propose a similarity measure for sparsely sampled time course data in the form of a log-likelihood ratio of Gaussian processes (GP). The proposed GP similarity is similar to a Bayes factor and provides enhanced robustness to noise in sparse time series, such as those found in various biological settings, e.g., gene transcriptomics. We show that the GP measure is equivalent to the Euclidean distance when the noise variance in the GP is negligible compared to the noise variance of the signal. Our numerical experiments on both synthetic and real data show improved performance of the GP similarity when used in conjunction with two distance-based clustering methods.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/liu21a.html
https://proceedings.mlr.press/v161/liu21a.htmlOn random kernels of residual architecturesWe analyze the finite corrections to the neural tangent kernel (NTK) of residual and densely connected networks, as a function of both depth and width. Surprisingly, our analysis reveals that given a fixed depth, residual networks provide the best tradeoff between the parameter complexity and the coefficient of variation (normalized variance), followed by densely connected networks and vanilla MLPs. In networks that do not use skip connections, convergence to the NTK requires one to fix the depth while increasing the layers’ width. In contrast, our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided a proper initialization is used. In DenseNets, however, the convergence of the NTK to its limit as the width tends to infinity is guaranteed, at a rate that is independent of both the depth and scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/littwin21a.html
https://proceedings.mlr.press/v161/littwin21a.htmlDimension reduction for data with heterogeneous missingnessDimension reduction plays a pivotal role in analysing high-dimensional data. However, observations with missing values present serious difficulties in directly applying standard dimension reduction techniques. As a large number of dimension reduction approaches are based on the Gram matrix, we first investigate the effects of missingness on dimension reduction by studying the statistical properties of the Gram matrix with or without missingness, and then we present a bias-corrected Gram matrix with nice statistical properties under heterogeneous missingness. Extensive empirical results, on both simulated and publicly available real datasets, show that the proposed unbiased Gram matrix can significantly improve a broad spectrum of representative dimension reduction approaches.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ling21a.html
https://proceedings.mlr.press/v161/ling21a.htmlBayesian optimization for modular black-box systems with switching costsMost existing black-box optimization methods assume that all variables in the system being optimized have equal cost and can change freely at each iteration. However, in many real-world systems, inputs are passed through a sequence of different operations or modules, making variables in earlier stages of processing more costly to update. Such structure induces a dynamic cost from switching variables in the early parts of a data processing pipeline. In this work, we propose a new algorithm for switch-cost-aware optimization called Lazy Modular Bayesian Optimization (LaMBO). This method efficiently identifies the global optimum while minimizing cost through a passive change of variables in early modules. The method is theoretically grounded and achieves a vanishing regret regularized with switching cost. We apply LaMBO to multiple synthetic functions and a three-stage image segmentation pipeline used in a neuroimaging task, where we obtain promising improvements over existing cost-aware Bayesian optimization algorithms. Our results demonstrate that LaMBO is an effective strategy for black-box optimization capable of minimizing switching costs.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/lin21c.html
https://proceedings.mlr.press/v161/lin21c.htmlEscaping from zero gradient: Revisiting action-constrained reinforcement learning via Frank-Wolfe policy optimizationAction-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of a robot with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they could suffer from the zero-gradient problem due to the tight coupling of the policy gradient and the projection, which results in sample-inefficient training and slow convergence. To tackle this issue, we propose a learning algorithm that decouples the action constraints from the policy parameter update by leveraging state-wise Frank-Wolfe and a regression-based policy update scheme. Moreover, we show that the proposed algorithm enjoys convergence and policy improvement properties in the tabular case as well as generalizes the popular DDPG algorithm for action-constrained RL in the general case. Through experiments, we demonstrate that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/lin21b.html
https://proceedings.mlr.press/v161/lin21b.htmlAn unsupervised video game playstyle metric via state discretizationWhen playing video games, different players usually have their own playstyles. Recently, video game AIs have improved greatly in playing strength. However, past research on analyzing player behavior still relied on heuristic rules or on behavior features that require game-environment support, which makes it exhausting for developers to define the features that discriminate among playstyles. In this paper, we propose the first metric for video game playstyles derived directly from game observations and actions, without any prior specification of the playstyle in the target game. Our proposed method is built upon a novel scheme of learning discrete representations that can map game observations into latent discrete states, such that playstyles can be exhibited from these discrete states. Namely, we measure the playstyle distance based on game observations aligned to the same states. We demonstrate high playstyle accuracy of our metric in experiments on several video game platforms, including TORCS, RGSK, and seven Atari games, and for different agents including rule-based AI bots, learning-based AI bots, and human players.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/lin21a.html
https://proceedings.mlr.press/v161/lin21a.htmlCLAIM: curriculum learning policy for influence maximization in unknown social networksInfluence maximization is the problem of finding a small subset of nodes in a network that can maximize the diffusion of information. Recently, it has also found application in HIV prevention, substance abuse prevention, micro-finance adoption, etc., where the goal is to identify the set of peer leaders in a real-world physical social network who can disseminate information to a large group of people. Unlike online social networks, real-world networks are not completely known, and collecting information about the network is costly as it involves surveying multiple people. In this paper, we focus on this problem of network discovery for influence maximization. The existing work in this direction proposes a reinforcement learning framework. Because environment interactions in real-world settings are costly, it is important for reinforcement learning algorithms to require as few environment interactions as possible, i.e., to be sample efficient. In this work, we propose CLAIM - Curriculum LeArning Policy for Influence Maximization to improve the sample efficiency of RL methods. We conduct experiments on real-world datasets and show that our approach can outperform the current best approach.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/li21b.html
https://proceedings.mlr.press/v161/li21b.htmlTractable computation of expected kernelsComputing the expectation of kernel functions is a ubiquitous task in machine learning, with applications from classical support vector machines to exploiting kernel embeddings of distributions in probabilistic modeling, statistical inference, causal discovery, and deep learning. In all these scenarios, we tend to resort to Monte Carlo estimates as expectations of kernels are intractable in general. In this work, we characterize the conditions under which we can compute expected kernels exactly and efficiently, by leveraging recent advances in probabilistic circuit representations. We first construct a circuit representation for kernels and propose an approach to such tractable computation. We then demonstrate possible advancements for kernel embedding frameworks by exploiting tractable expected kernels to derive new algorithms for two challenging scenarios: 1) reasoning under missing data with kernel support vector regressors; 2) devising a collapsed black-box importance sampling scheme. Finally, we empirically evaluate both algorithms and show that they outperform standard baselines on a variety of datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/li21a.html
https://proceedings.mlr.press/v161/li21a.htmlConvergence behavior of belief propagation: estimating regions of attraction via Lyapunov functionsIn this work, we estimate the regions of attraction for belief propagation. This extends existing stability analysis and provides initial message values for which belief propagation is guaranteed to converge. Our approach utilizes the theory of Lyapunov functions that, however, does not readily yield useful regions of attraction. Therefore, we utilize polynomial sum-of-squares relaxations and provide an algorithm that computes valid Lyapunov functions. This admits a novel way of studying the solution space of belief propagation. Finally, we apply our approach to small-scale models and discuss the effect of the potentials on the regions of attraction.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/leisenberger21a.html
https://proceedings.mlr.press/v161/leisenberger21a.htmlA Nonmyopic Approach to Cost-Constrained Bayesian OptimizationBayesian optimization (BO) is a popular method for optimizing expensive-to-evaluate black-box functions. BO budgets are typically given in iterations, which implicitly assumes each evaluation has the same cost. In fact, in many BO applications, evaluation costs vary significantly in different regions of the search space. In hyperparameter optimization, the time spent on neural network training increases with layer size; in clinical trials, the monetary costs of drug compounds vary; and in optimal control, control actions have differing complexities. Cost-constrained BO measures convergence with alternative cost metrics such as time, money, or energy, for which the sample efficiency of standard BO methods is ill-suited. For cost-constrained BO, cost efficiency is far more important than sample efficiency. In this paper, we formulate cost-constrained BO as a constrained Markov decision process (CMDP), and develop an efficient rollout approximation to the optimal CMDP policy that takes both the cost and future iterations into account. We validate our method on a collection of hyperparameter optimization problems as well as a sensor set selection application.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/lee21a.html
https://proceedings.mlr.press/v161/lee21a.htmlHierarchical learning of Hidden Markov Models with clustering regularizationHierarchical learning of generative models is useful for representing and interpreting complex data. For instance, one application is to learn an HMM to represent an individual’s eye fixations on a stimulus, and then cluster individuals’ HMMs to discover common eye gaze strategies. However, learning the individual representation models from observations and clustering individual models into group models are often considered as two separate tasks. In this paper, we propose a novel tree structure variational Bayesian method to learn the individual model and group model simultaneously by treating the group models as the parents of individual models, so that the individual model is learned from observations and regularized by its parents, and conversely, the parent model is optimized to best represent its children. Due to the regularization process, our method has advantages when the number of training samples decreases. Experimental results on synthetic datasets demonstrate the effectiveness of the proposed method.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/lan21a.html
https://proceedings.mlr.press/v161/lan21a.htmlDisentangling mixtures of unknown causal interventionsIn many real-world scenarios, such as gene knockout experiments, targeted interventions are often accompanied by unknown interventions at off-target sites. Moreover, different units can get randomly exposed to different unknown interventions, thereby creating a mixture of interventions. Identifying different components of this mixture can be very valuable in some applications. Motivated by such situations, in this work, we study the problem of identifying all components present in a mixture of interventions on a given causal Bayesian Network. We construct examples to show that, in general, the components are not identifiable from the mixture distribution. Next, assuming that the given network satisfies a positivity condition, we show that, if the set of mixture components satisfies a mild exclusion assumption, then they can be uniquely identified. Our proof gives an efficient algorithm to recover these targets from the exponentially large search space of possible targets. In the more realistic scenario, where distributions are given via finitely many samples, we conduct a simulation study to analyze the performance of an algorithm derived from our identifiability proof.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kumar21a.html
https://proceedings.mlr.press/v161/kumar21a.htmlLearnable uncertainty under Laplace approximationsLaplace approximations are classic, computationally lightweight means for constructing Bayesian neural networks (BNNs). As in other approximate BNNs, one cannot necessarily expect the induced predictive uncertainty to be calibrated. Here we develop a formalism to explicitly “train” the uncertainty in a decoupled way to the prediction itself. To this end, we introduce <em>uncertainty units</em> for Laplace-approximated networks: Hidden units associated with a particular weight structure that can be added to any pre-trained, point-estimated network. Due to their weights, these units are inactive—they do not affect the predictions. But their presence changes the geometry (in particular the Hessian) of the loss landscape, thereby affecting the network’s uncertainty estimates under a Laplace approximation. We show that such units can be trained via an uncertainty-aware objective, improving standard Laplace approximations’ performance in various uncertainty quantification tasks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kristiadi21a.html
https://proceedings.mlr.press/v161/kristiadi21a.htmlTrumpets: Injective flows for inference and inverse problemsWe propose injective generative models called Trumpets that generalize invertible normalizing flows. The proposed generators progressively increase dimension from a low-dimensional latent space. We demonstrate that Trumpets can be trained orders of magnitude faster than standard flows while yielding samples of comparable or better quality. They retain many of the advantages of the standard flows such as training based on maximum likelihood and a fast, exact inverse of the generator. Since Trumpets are injective and have fast inverses, they can be effectively used for downstream Bayesian inference. To wit, we use Trumpet priors for maximum a posteriori estimation in the context of image reconstruction from compressive measurements, outperforming competitive baselines in terms of reconstruction quality and speed. We then propose an efficient method for posterior characterization and uncertainty quantification with Trumpets by taking advantage of the low-dimensional latent space.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kothari21a.html
https://proceedings.mlr.press/v161/kothari21a.htmlInvestigating vulnerabilities of deep neural policiesReinforcement learning policies based on deep neural networks are vulnerable to imperceptible adversarial perturbations to their inputs, in much the same way as neural network image classifiers. Recent work has proposed several methods to improve the robustness of deep reinforcement learning agents to adversarial perturbations based on training in the presence of these imperceptible perturbations (i.e. adversarial training). In this paper, we study the effects of adversarial training on the neural policy learned by the agent. In particular, we follow two distinct parallel approaches to investigate the outcomes of adversarial training on deep neural policies based on worst-case distributional shift and feature sensitivity. For the first approach, we compare the Fourier spectrum of minimal perturbations computed for both adversarially trained and vanilla trained neural policies. Via experiments in the OpenAI Atari environments we show that minimal perturbations computed for adversarially trained policies are more focused on lower frequencies in the Fourier domain, indicating a higher sensitivity of these policies to low frequency perturbations. For the second approach, we propose a novel method to measure the feature sensitivities of deep neural policies and we compare these feature sensitivity differences in state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies. We believe our results can be an initial step towards understanding the relationship between adversarial training and different notions of robustness for neural policies.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/korkmaz21a.html
https://proceedings.mlr.press/v161/korkmaz21a.htmlStochastic model for sunk cost biasWe present a novel model for capturing the behavior of an agent exhibiting sunk-cost bias in a stochastic environment. Agents exhibiting sunk-cost bias take into account the effort they have already spent on an endeavor when they evaluate whether to continue or abandon it. We model planning tasks in which an agent with this type of bias tries to reach a designated goal. Our model structures this problem as a type of Markov decision process: loosely speaking, the agent traverses a directed acyclic graph with probabilistic transitions, paying costs for its actions as it tries to reach a target node containing a specified reward. The agent’s sunk cost bias is modeled by a cost that it incurs for abandoning the traversal: if the agent decides to stop traversing the graph, it incurs a cost of $\lambda \cdot C_{sunk}$, where ${\lambda \geq 0}$ is a parameter that captures the extent of the bias and $C_{sunk}$ is the sum of costs already invested. We analyze the behavior of two types of agents: naive agents that are unaware of their bias, and sophisticated agents that are aware of it. Since optimal (bias-free) behavior in this problem can involve abandoning the traversal before reaching the goal, the bias exhibited by these types of agents can result in sub-optimal behavior by shifting their decisions about abandonment. We show that in contrast to optimal agents, it is computationally hard to compute the optimal policy for a sophisticated agent. Our main results quantify the loss exhibited by these two types of agents with respect to an optimal agent. We present both general and topology-specific bounds.Wed, 01 Dec 2021 00:00:00 +0000
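The abandonment penalty described in the abstract above, $\lambda \cdot C_{sunk}$, can be made concrete with a small worked example. The sketch below only evaluates that one formula; the function name `abandonment_cost` and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def abandonment_cost(costs_paid: Sequence[float], lam: float) -> float:
    """Return lam * C_sunk, where C_sunk is the sum of costs already invested."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    c_sunk = sum(costs_paid)  # total cost invested so far
    return lam * c_sunk


# Hypothetical traversal: the agent has paid 1.0, 2.0 and 0.5 on the edges so far.
biased_penalty = abandonment_cost([1.0, 2.0, 0.5], lam=2.0)
unbiased_penalty = abandonment_cost([1.0, 2.0, 0.5], lam=0.0)
```

With $\lambda = 0$ the penalty vanishes and quitting is free, which corresponds to the bias-free agent; larger $\lambda$ makes abandoning more expensive the more the agent has already invested.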
https://proceedings.mlr.press/v161/kleinberg21a.html
https://proceedings.mlr.press/v161/kleinberg21a.htmlRegstar: efficient strategy synthesis for adversarial patrolling gamesWe design a new efficient strategy synthesis method applicable to adversarial patrolling problems on graphs with arbitrary-length edges and possibly imperfect intrusion detection. The core ingredient is an efficient algorithm for computing the value and the gradient of a function assigning to every strategy its “protection” achieved. This allows for designing an efficient strategy improvement algorithm by differentiable programming and optimization techniques. Our method is the first one applicable to real-world patrolling graphs of reasonable sizes. It outperforms the state-of-the-art strategy synthesis algorithm by a margin.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/klaska21a.html
https://proceedings.mlr.press/v161/klaska21a.htmlpRSL: Interpretable multi-label stacking by learning probabilistic rulesA key task in multi-label classification is modeling the structure between the involved classes. Modeling this structure by probabilistic and interpretable means enables application in a broad variety of tasks such as zero-shot learning or learning from incomplete data. In this paper, we present the probabilistic rule stacking learner (pRSL) which uses probabilistic propositional logic rules and belief propagation to combine the predictions of several underlying classifiers. We derive algorithms for exact and approximate inference and learning, and show that pRSL reaches state-of-the-art performance on various benchmark datasets. In the process, we introduce a novel multicategorical generalization of the noisy-or gate. Additionally, we report simulation results on the quality of loopy belief propagation algorithms for approximate inference in bipartite noisy-or networks.Wed, 01 Dec 2021 00:00:00 +0000
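For readers unfamiliar with the noisy-or gate that the abstract above generalizes, here is a minimal sketch of the classical binary version, in which each active cause independently fails to trigger the effect with probability 1 - p_i. This is the textbook gate only, not the paper's multicategorical generalization; the function name `noisy_or` and the `leak` parameter are illustrative assumptions.

```python
from typing import Sequence


def noisy_or(active: Sequence[int], probs: Sequence[float], leak: float = 0.0) -> float:
    """P(effect = 1 | causes) for a binary noisy-or gate.

    active[i] is 1 if cause i is present, probs[i] is the probability that
    cause i alone triggers the effect, and leak is the probability that the
    effect occurs with no active cause.
    """
    prob_no_effect = 1.0 - leak
    for is_on, p in zip(active, probs):
        if is_on:
            # Each active cause independently fails with probability 1 - p.
            prob_no_effect *= 1.0 - p
    return 1.0 - prob_no_effect


both_on = noisy_or([1, 1], [0.5, 0.5])
none_on = noisy_or([0, 0], [0.5, 0.5])
```

Two active causes with strength 0.5 each leave a 0.25 chance that neither fires, so the effect occurs with probability 0.75.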
https://proceedings.mlr.press/v161/kirchhof21a.html
https://proceedings.mlr.press/v161/kirchhof21a.htmlGeometric rates of convergence for kernel-based sampling algorithmsThe rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to some target probability measure, is investigated. Under verifiable conditions on the chosen kernel and target measure, we establish a near-geometric rate of convergence for target measures that are nearly atomic. Furthermore, we show these algorithms perform comparably to the theoretical best possible sampling algorithm under the maximum mean discrepancy. An analysis is also conducted in a distributed setting. Our theoretical developments are supported by empirical observations on simulated data as well as a real world application.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/khanna21a.html
https://proceedings.mlr.press/v161/khanna21a.htmlHierarchical Indian buffet neural networks for Bayesian continual learningWe place an Indian Buffet process (IBP) prior over the structure of a Bayesian Neural Network (BNN), thus allowing the complexity of the BNN to increase and decrease automatically. We further extend this model such that the prior on the structure of each hidden layer is shared globally across all layers, using a Hierarchical-IBP (H-IBP). We apply this model to the problem of resource allocation in Continual Learning (CL) where new tasks occur and the network requires extra resources. Our model uses online variational inference with reparameterisation of the Bernoulli and Beta distributions, which constitute the IBP and H-IBP priors. As we automatically learn the number of weights in each layer of the BNN, overfitting and underfitting problems are largely overcome. We show empirically that our approach offers a competitive edge over existing methods in CL.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kessler21a.html
https://proceedings.mlr.press/v161/kessler21a.htmlConstrained differentially private federated learning for low-bandwidth devicesFederated learning has become a prominent approach when different entities want to collaboratively learn a common model without sharing their training data. However, federated learning has two main drawbacks. First, it is quite bandwidth inefficient as it involves a lot of message exchanges between the aggregating server and the participating entities. This bandwidth and corresponding processing costs could be prohibitive if the participating entities are, for example, mobile devices. Furthermore, although federated learning improves privacy by not sharing data, recent attacks have shown that it still leaks information about the training data. This paper presents a novel privacy-preserving federated learning scheme. The proposed scheme provides theoretical privacy guarantees, as it is based on Differential Privacy. Furthermore, it optimizes the model accuracy by constraining the model learning phase to a few selected weights. Finally, as shown experimentally, it reduces the upstream <em>and</em> downstream bandwidth by up to 99.9% compared to standard federated learning, making it practical for mobile systems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kerkouche21a.html
https://proceedings.mlr.press/v161/kerkouche21a.htmlApproximate implication with d-separationThe graphical structure of Probabilistic Graphical Models (PGMs) encodes the conditional independence (CI) relations that hold in the modeled distribution. Graph algorithms, such as d-separation, use this structure to infer additional conditional independencies, and to query whether a specific CI holds in the distribution. The premise of all current systems of inference for deriving CIs in PGMs is that the CIs used to construct the PGM hold exactly. In practice, algorithms for extracting the structure of PGMs from data discover approximate CIs that do not hold exactly in the distribution. In this paper, we ask how the error in this set propagates to the inferred CIs read off the graphical structure. More precisely, what guarantee can we provide on the inferred CI when the set of CIs that entailed it holds only approximately? It has recently been shown that in the general case, no such guarantee can be provided. We prove that such a guarantee exists for the set of CIs inferred in directed graphical models, making the d-separation algorithm a sound and complete system for inferring approximate CIs. We also prove an approximation guarantee for independence relations derived from marginal CIs.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kenig21a.html
https://proceedings.mlr.press/v161/kenig21a.htmlSGD with low-dimensional gradients with applications to private and distributed learningIn this paper, we consider constrained optimization problems subject to a convex set C. Stochastic gradient descent (SGD) is a simple and popular stochastic optimization algorithm that has been the workhorse of machine learning for many years. We show a new and surprising fact about SGD: depending on the constraint set C, one can operate SGD with much lower-dimensional stochastic gradients without affecting its performance. In particular, we design an optimization algorithm that operates with the lower-dimensional (compressed) stochastic gradients, and establish that with the right set of parameters it has the exact same dimension-free convergence guarantees as that of regular SGD for popular convex and nonconvex optimization settings. We also present two applications of these bounds, one for improving the empirical risk minimization bounds in differentially private nonconvex optimization, and another for reducing communication costs with distributed SGD. Additionally, we show that this connection between constraint set structure and gradient compression extends beyond SGD to the conditional gradient (Frank-Wolfe) method. The geometry of the constraint set, captured by its Gaussian width, plays an important role in all our results.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kasiviswanathan21a.html
https://proceedings.mlr.press/v161/kasiviswanathan21a.htmlUncertainty in minimum cost multicuts for image and motion segmentationThe minimum cost lifted multicut approach has shown good practical performance in a wide range of applications such as image decomposition, mesh segmentation, multiple object tracking and motion segmentation. It addresses such problems in a graph-based model, where real-valued costs are assigned to the edges between entities such that the minimum cut decomposes the graph into an optimal number of segments. Driven by a probabilistic formulation of minimum cost multicuts, we provide a measure for the uncertainties of the decisions made during the optimization. We argue that access to such uncertainties is crucial for many practical applications and conduct an evaluation by means of sparsifications on three different, widely used datasets in the context of image decomposition (BSDS-500) and motion segmentation (DAVIS$_{2016}$ and FBMS$_{59}$) in terms of variation of information (VI) and Rand index (RI).Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kardoost21a.html
https://proceedings.mlr.press/v161/kardoost21a.htmlGradient-based optimization for multi-resource spatial coverage problemsResource allocation for coverage of geographical spaces is a challenging problem in robotics, sensor networks and security domains. Conventional solution approaches either: (a) rely on exploiting spatio-temporal structure of specific coverage problems, or (b) use genetic algorithms when targeting general coverage problems where no special exploitable structure exists. In this work, we propose the coverage gradient theorem, which provides a gradient estimator for a broad class of spatial coverage objectives using a combination of Newton-Leibniz theorem and implicit boundary differentiation. We also propose a tractable framework to approximate the coverage objectives and their gradients using spatial discretization and empirically demonstrate the efficacy of our framework on multi-resource spatial coverage problems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kamra21a.html
https://proceedings.mlr.press/v161/kamra21a.htmlRISAN: Robust instance specific deep abstention networkIn this paper, we propose deep architectures for learning instance specific abstain (reject option) binary classifiers. The proposed approach uses the double sigmoid loss function described by Kulin Shah and Naresh Manwani ("Online Active Learning of Reject Option Classifiers", AAAI, 2020) as a performance measure. We show that the double sigmoid loss is classification calibrated. We also show that the excess risk of 0-d-1 loss is upper bounded by the excess risk of double sigmoid loss. We derive the generalization error bounds for the proposed architecture for reject option classifiers. To show the effectiveness of the proposed approach, we experiment with several real world datasets. We observe that the proposed approach not only performs comparably to the state-of-the-art approaches but is also robust against label noise. We also provide visualizations to observe the important features learned by the network corresponding to the abstaining decision.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/kalra21a.html
https://proceedings.mlr.press/v161/kalra21a.htmlGeneralization error bounds for deep unfolding RNNsRecurrent Neural Networks (RNNs) are powerful models with the ability to model sequential data. However, they are often viewed as black boxes and lack interpretability. Deep unfolding methods take a step towards interpretability by designing deep neural networks as learned variations of iterative optimization algorithms to solve various signal processing tasks. In this paper, we explore theoretical aspects of deep unfolding RNNs in terms of their generalization ability. Specifically, we derive generalization error bounds for a class of deep unfolding RNNs via Rademacher complexity analysis. To our knowledge, these are the first generalization bounds proposed for deep unfolding RNNs. We show theoretically that our bounds are tighter than similar ones for other recent RNNs, in terms of the number of timesteps. By training models in a classification setting, we demonstrate that deep unfolding RNNs can outperform traditional RNNs in standard sequence classification tasks. These experiments allow us to relate the empirical generalization error to the theoretical bounds. In particular, we show that over-parametrized deep unfolding models like reweighted-RNN achieve tight theoretical error bounds with minimal decrease in accuracy, when trained with explicit regularization.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/joukovsky21a.html
https://proceedings.mlr.press/v161/joukovsky21a.htmlTreeBERT: A tree-based pre-trained model for programming languageSource code can be parsed into the abstract syntax tree (AST) based on defined syntax rules. However, in pre-training, little work has considered the incorporation of tree structure into the learning process. In this paper, we present TreeBERT, a tree-based pre-trained model for improving programming language-oriented generation tasks. To utilize tree structure, TreeBERT represents the AST corresponding to the code as a set of composition paths and introduces node position embedding. The model is trained by tree masked language modeling (TMLM) and node order prediction (NOP) with a hybrid objective. TMLM uses a novel masking strategy designed according to the tree’s characteristics to help the model understand the AST and infer the missing semantics of the AST. With NOP, TreeBERT extracts the syntactical structure by learning the order constraints of nodes in AST. We pre-trained TreeBERT on datasets covering multiple programming languages. On code summarization and code documentation tasks, TreeBERT outperforms other pre-trained models and state-of-the-art models designed for these tasks. Furthermore, TreeBERT performs well when transferred to the pre-trained unseen programming language.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/jiang21a.html
https://proceedings.mlr.press/v161/jiang21a.htmlVariational refinement for importance sampling using the forward Kullback-Leibler divergenceVariational Inference (VI) is a popular alternative to asymptotically exact sampling in Bayesian inference. Its main workhorse is optimization over a reverse Kullback-Leibler divergence (RKL), which typically underestimates the tail of the posterior leading to miscalibration and potential degeneracy. Importance sampling (IS), on the other hand, is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures. The quality of IS crucially depends on the choice of the proposal distribution. Ideally, the proposal distribution has heavier tails than the target, which is rarely achievable by minimizing the RKL. We thus propose a novel combination of optimization and sampling techniques for approximate Bayesian inference by constructing an IS proposal distribution through the minimization of a forward KL (FKL) divergence. This approach guarantees asymptotic consistency and a fast convergence towards both the optimal IS estimator and the optimal variational approximation. We empirically demonstrate on real data that our method is competitive with variational boosting and MCMC.Wed, 01 Dec 2021 00:00:00 +0000
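The role of the proposal distribution in importance sampling, which the abstract above emphasizes, can be seen in a minimal self-normalized estimator. The sketch below is not the paper's forward-KL refinement: the standard normal target, the wider Gaussian proposal (standard deviation 2), and the names `self_normalized_is` and `log_normal_pdf` are assumptions chosen so that the proposal covers the target's tails and the importance weights stay bounded.

```python
import math
import random
from typing import Callable, Tuple


def log_normal_pdf(x: float, mean: float, std: float) -> float:
    """Log-density of a univariate Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def self_normalized_is(
    f: Callable[[float], float], n: int, seed: int = 0
) -> Tuple[float, float]:
    """Estimate E_p[f(x)] for p = N(0, 1) using the proposal q = N(0, 2^2).

    Returns the estimate and the effective sample size of the weights.
    """
    rng = random.Random(seed)
    weights, values = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)  # draw from the wider proposal q
        log_w = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, 2.0)
        weights.append(math.exp(log_w))  # importance weight p(x) / q(x)
        values.append(f(x))
    total = sum(weights)
    estimate = sum(w * v for w, v in zip(weights, values)) / total
    ess = total * total / sum(w * w for w in weights)  # effective sample size
    return estimate, ess


# Second moment of a standard normal; the true value is 1.
second_moment, ess = self_normalized_is(lambda x: x * x, n=20000)
```

If the proposal were narrower than the target, a few samples in the tails would receive very large weights and the effective sample size would collapse, which is the failure mode the abstract attributes to proposals obtained by minimizing the reverse KL divergence.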
https://proceedings.mlr.press/v161/jerfel21a.html
https://proceedings.mlr.press/v161/jerfel21a.htmlSubset-of-data variational inference for deep Gaussian-processes regressionDeep Gaussian Processes (DGPs) are multi-layer, flexible extensions of Gaussian Processes but their training remains challenging. Most existing methods for inference in DGPs use sparse approximations, which require optimization over a large number of inducing inputs and their locations across layers. In this paper, we simplify the training by setting the locations to a fixed subset of data and sampling the inducing inputs from a variational distribution. This reduces the trainable parameters and computation cost without any performance degradation, as demonstrated by our empirical results on regression data sets. Our modifications simplify and stabilize DGP training methods while making them amenable to sampling schemes such as leverage score and determinantal point processes.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/jain21a.html
https://proceedings.mlr.press/v161/jain21a.htmlGenerating adversarial examples with graph neural networksRecent years have witnessed the deployment of adversarial attacks to evaluate the robustness of Neural Networks. Past work in this field has relied on traditional optimization algorithms that ignore the inherent structure of the problem and data, or generative methods that rely purely on learning and often fail to generate adversarial examples where they are hard to find. To alleviate these deficiencies, we propose a novel attack based on a graph neural network (GNN) that takes advantage of the strengths of both approaches; we call it AdvGNN. Our GNN architecture closely resembles the network we wish to attack. During inference, we perform forward-backward passes through the GNN layers to guide an iterative procedure towards adversarial examples. During training, its parameters are estimated via a loss function that encourages the efficient computation of adversarial examples over a time horizon. We show that our method beats state-of-the-art adversarial attacks, including PGD-attack, MI-FGSM, and Carlini and Wagner attack, reducing the time required to generate adversarial examples with small perturbation norms by over 65%. Moreover, AdvGNN achieves good generalization performance on unseen networks. Finally, we provide a new challenging dataset specifically designed to allow for a more illustrative comparison of adversarial attacks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/jaeckle21a.html
https://proceedings.mlr.press/v161/jaeckle21a.htmlEfficient debiased evidence estimation by multilevel Monte Carlo samplingIn this paper, we propose a new stochastic optimization algorithm for Bayesian inference based on multilevel Monte Carlo (MLMC) methods. In Bayesian statistics, biased estimators of the model evidence have been often used as stochastic objectives because the existing debiasing techniques are computationally costly to apply. To overcome this issue, we apply an MLMC sampling technique to construct low-variance unbiased estimators both for the model evidence and its gradient. In the theoretical analysis, we show that the computational cost required for our proposed MLMC estimator to estimate the model evidence or its gradient with a given accuracy is an order of magnitude smaller than those of the previously known estimators. Our numerical experiments confirm considerable computational savings compared to the conventional estimators. Combining our MLMC estimator with gradient-based stochastic optimization results in a new scalable, efficient, debiased inference algorithm for Bayesian statistical models.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ishikawa21a.html
https://proceedings.mlr.press/v161/ishikawa21a.htmlInference of causal effects when control variables are unknownConventional methods in causal effect inference typically rely on specifying a valid set of control variables. When this set is unknown or misspecified, inferences will be erroneous. We propose a method for inferring average causal effects when all potential confounders are observed, but the control variables are unknown. When the data-generating process belongs to the class of acyclical linear structural causal models, we prove that the method yields asymptotically valid confidence intervals. Our results build upon a smooth characterization of linear directed acyclic graphs. We verify the capability of the method to produce valid confidence intervals for average causal effects using synthetic data, even when the appropriate specification of control variables is unknown.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/hult21a.html
https://proceedings.mlr.press/v161/hult21a.htmlStochastic continuous normalizing flows: training SDEs as ODEsWe provide a general theoretical framework for stochastic continuous normalizing flows, an extension of continuous normalizing flows for density estimation of stochastic differential equations (SDEs). Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated. Doing so enables the treatment of SDEs as random ordinary differential equations, which can be trained using existing techniques. For scalar loss functions, this approach naturally recovers the stochastic adjoint method of Li et al. [2020] for training neural SDEs, while supporting a more flexible class of approximations.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/hodgkinson21a.html
https://proceedings.mlr.press/v161/hodgkinson21a.htmlTighter Generalization Bounds for Iterative Differentially Private Learning AlgorithmsThis paper studies the relationship between generalization and privacy preservation of machine learning in two steps. We first establish an alignment between the two facets for any learning algorithm. We prove that $(\varepsilon, \delta)$-differential privacy implies an on-average generalization bound for a multi-sample-set learning algorithm, which further leads to a high-probability bound for any learning algorithm. We then investigate how the iterative nature shared by most learning algorithms influences privacy preservation and further generalization. Three composition theorems are proved to approximate the differential privacy of an iterative algorithm through the differential privacy of its every iteration. Integrating the above two steps, we eventually deliver generalization bounds for iterative learning algorithms. Our results are strictly tighter than the existing works. Particularly, our generalization bounds do not rely on the model size which is prohibitively large in deep learning. Experiments of MLP, VGG, and ResNet on MNIST, CIFAR-10, and CIFAR-100 are in full agreement with our theory. The theory applies to a wide spectrum of learning algorithms. In this paper, it is applied to the Gaussian mechanism as an example.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/he21a.html
https://proceedings.mlr.press/v161/he21a.htmlGaussian process nowcasting: application to COVID-19 mortality reportingUpdating observations of a signal due to the delays in the measurement process is a common problem in signal processing, with prominent examples in a wide range of fields. An important example of this problem is the nowcasting of COVID-19 mortality: given a stream of reported counts of daily deaths, can we correct for the delays in reporting to paint an accurate picture of the present, with uncertainty? Without this correction, raw data will often mislead by suggesting an improving situation. We present a flexible approach using a latent Gaussian process that is capable of describing the changing auto-correlation structure present in the reporting time-delay surface. This approach also yields robust estimates of uncertainty for the estimated nowcasted numbers of deaths. We test assumptions in model specification such as the choice of kernel or hyper priors, and evaluate model performance on a challenging real dataset from Brazil. Our experiments show that Gaussian process nowcasting performs favourably against both comparable methods, and against a small sample of expert human predictions. Our approach has substantial practical utility in disease modelling — by applying our approach to COVID-19 mortality data from Brazil, where reporting delays are large, we can make informative predictions on important epidemiological quantities such as the current effective reproduction number.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/hawryluk21a.html
https://proceedings.mlr.press/v161/hawryluk21a.htmlMeasuring data leakage in machine-learning models with Fisher informationMachine-learning models contain information about the data they were trained on. This information leaks either through the model itself or through predictions made by the model. Consequently, when the training data contains sensitive attributes, assessing the amount of information leakage is paramount. We propose a method to quantify this leakage using the Fisher information of the model about the data. Unlike the worst-case <em>a priori</em> guarantees of differential privacy, <em>Fisher information loss</em> measures leakage with respect to specific examples, attributes, or sub-populations within the dataset. We motivate Fisher information loss through the Cramér-Rao bound and delineate the implied threat model. We provide efficient methods to compute Fisher information loss for output-perturbed generalized linear models. Finally, we empirically validate Fisher information loss as a useful measure of information leakage.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/hannun21a.html
https://proceedings.mlr.press/v161/hannun21a.htmlTestification of Condorcet Winners in dueling banditsSeveral algorithms for finding the best arm in the dueling bandits setting assume the existence of a Condorcet winner (CW), that is, an arm that uniformly dominates all other arms. Yet, by simply relying on this assumption but not verifying it, such algorithms may produce doubtful results in cases where it actually fails to hold. Even worse, the problem may not be noticed, and an alleged CW may still be produced. In this paper, we therefore address the problem as a “testification” task, by which we mean a combination of testing and identification: The online identification of the CW is combined with the statistical testing of the CW assumption. Thus, instead of returning a supposed CW at some point, the learner has the possibility to stop sampling and refuse an answer in case it feels confident that the CW assumption is violated. Analyzing the testification problem formally, we derive lower bounds on the expected sample complexity of any online algorithm solving it. Moreover, a concrete algorithm is proposed, which achieves the optimal sample complexity up to logarithmic terms.Wed, 01 Dec 2021 00:00:00 +0000
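As a small illustration of the Condorcet winner assumption discussed in the abstract above, the sketch below checks whether a CW exists when the pairwise preference probabilities are fully known. This is not the paper's online testification algorithm, which must decide from noisy duels; the function name `condorcet_winner` and the matrix layout `P[i][j]` (probability that arm i beats arm j) are assumptions made for this example.

```python
from typing import Optional, Sequence


def condorcet_winner(P: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the Condorcet winner, or None if there is none.

    P[i][j] is the (assumed known) probability that arm i beats arm j.
    An arm is a Condorcet winner if it beats every other arm with
    probability strictly greater than 1/2.
    """
    n = len(P)
    for i in range(n):
        if all(P[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


# Arm 0 beats both other arms, so the CW assumption holds.
with_cw = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]

# Cyclic preferences (0 beats 1, 1 beats 2, 2 beats 0): no CW exists.
cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]

print(condorcet_winner(with_cw))
print(condorcet_winner(cyclic))
```

The cyclic matrix is the kind of instance where an algorithm that simply assumes a CW could return a misleading answer, which is the failure mode the abstract describes.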
https://proceedings.mlr.press/v161/haddenhorst21a.html
https://proceedings.mlr.press/v161/haddenhorst21a.htmlInteger programming-based error-correcting output code design for robust classificationError-Correcting Output Codes (ECOCs) offer a principled approach for combining binary classifiers into multiclass classifiers. In this paper, we study the problem of designing optimal ECOCs to achieve both nominal and adversarial accuracy using Support Vector Machines (SVMs) and binary deep neural networks. We develop a scalable Integer Programming (IP) formulation to design minimal codebooks with desirable error correcting properties. Our work leverages the advances in IP solution techniques to generate codebooks with optimality guarantees. To achieve tractability, we exploit the underlying graph-theoretic structure of the constraint set. Particularly, the size of the constraint set can be significantly reduced using edge clique covers. Using this reduction technique along with Plotkin’s bound in coding theory, we demonstrate that our approach is scalable to a large number of classes. The resulting codebooks achieve a high nominal accuracy relative to standard codebooks (e.g., one-vs-all, one-vs-one, and dense/sparse codes). Interestingly, our codebooks provide non-trivial robustness to white-box attacks without any adversarial training.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gupta21c.html
https://proceedings.mlr.press/v161/gupta21c.htmlEstimating treatment effects with observed confounders and mediatorsGiven a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators. For example, the backdoor formula applies when all confounders are observed and the frontdoor formula applies when an observed mediator transmits the causal effect. In this paper, we investigate the over-identified scenario where both confounders and mediators are observed, rendering both estimators valid. Addressing the linear Gaussian causal model, we demonstrate that either estimator can dominate the other by an unbounded constant factor. Next, we derive an optimal estimator, which leverages all observed variables, and bound its finite-sample variance. We show that it strictly outperforms the backdoor and frontdoor estimators and that this improvement can be unbounded. We also present a procedure for combining two datasets, one with observed confounders and another with observed mediators. Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gupta21b.html
https://proceedings.mlr.press/v161/gupta21b.htmlLocalNewton: Reducing communication rounds for distributed learningTo address the communication bottleneck problem in distributed optimization within a master-worker framework, we propose LocalNewton, a distributed second-order algorithm with <em>local averaging</em>. In LocalNewton, the worker machines update their model in every iteration by finding a suitable second-order descent direction using only the data and model stored in their own local memory. We let the workers run multiple such iterations locally and communicate the models to the master node only once every few (say $L$) iterations. LocalNewton is highly practical since it requires only one hyperparameter, the number $L$ of local iterations. We use novel matrix concentration based techniques to obtain theoretical guarantees for LocalNewton, and we validate them with detailed empirical evaluation. To enhance practicability, we devise an adaptive scheme to choose $L$, and we show that this reduces the number of local iterations in worker machines between two model synchronizations as the training proceeds, successively refining the model quality at the master. Via extensive experiments using several real-world datasets with AWS Lambda workers and an AWS EC2 master, we show that LocalNewton requires fewer than $60\%$ of the communication rounds (between master and workers) and less than $40\%$ of the end-to-end running time, compared to state-of-the-art algorithms, to reach the same training loss.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gupta21a.html
https://proceedings.mlr.press/v161/gupta21a.htmlActive multi-fidelity Bayesian online changepoint detectionOnline algorithms for detecting changepoints, or abrupt shifts in the behavior of a time series, are often deployed with limited resources, e.g., to edge computing settings such as mobile phones or industrial sensors. In these scenarios it may be beneficial to trade the cost of collecting an environmental measurement against the quality or “fidelity” of this measurement and how the measurement affects changepoint estimation. For instance, one might decide between inertial measurements or GPS to determine changepoints for motion. A Bayesian approach to changepoint detection is particularly appealing because we can represent our posterior uncertainty about changepoints and make active, cost-sensitive decisions about data fidelity to reduce this posterior uncertainty. Moreover, the total cost could be dramatically lowered through active fidelity switching, while remaining robust to changes in data distribution. We propose a multi-fidelity approach that makes cost-sensitive decisions about which data fidelity to collect based on maximizing information gain with respect to changepoints. We evaluate this framework on synthetic, video, and audio data and show that this information-based approach results in accurate predictions while reducing total cost.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gundersen21a.html
https://proceedings.mlr.press/v161/gundersen21a.htmlStaying in shape: learning invariant shape representations using contrastive learningCreating representations of shapes that are invariant to isometric or almost-isometric transformations has long been an area of interest in shape analysis, since enforcing invariance allows the learning of more effective and robust shape representations. Most existing invariant shape representations are handcrafted, and previous work on learning shape representations does not focus on producing invariant representations. To solve the problem of learning unsupervised invariant shape representations, we use contrastive learning, which produces discriminative representations through learning invariance to user-specified data augmentations. To produce representations that are specifically isometry and almost-isometry invariant, we propose new data augmentations that randomly sample these transformations. We show experimentally that our method outperforms previous unsupervised learning approaches in both effectiveness and robustness.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gu21a.html
https://proceedings.mlr.press/v161/gu21a.htmlProbabilistic DAG searchExciting contemporary machine learning problems have recently been phrased in the classic formalism of tree search, most famously the game of Go. Interestingly, the state-space underlying these sequential decision-making problems often possesses a more general latent structure than can be captured by a tree. In this work, we develop a probabilistic framework to exploit a search space’s latent structure and thereby share information across the search tree. The method is based on a combination of approximate inference in jointly Gaussian models for the explored part of the problem, and an abstraction for the unexplored part that imposes a reduction of complexity ad hoc. We empirically find our algorithm to compare favorably to existing non-probabilistic alternatives in Tic-Tac-Toe and a feature selection application.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/grosse21a.html
https://proceedings.mlr.press/v161/grosse21a.htmlExact and approximate hierarchical clustering using A*Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel <em>trellis</em> data structure. This results in an exact algorithm that scales beyond previous state of the art (from a search space with $10^{12}$ trees to $10^{15}$ trees) and an approximate algorithm that improves over baselines, even in enormous search spaces (that contain more than $10^{1000}$ trees). Empirically we demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/greenberg21a.html
https://proceedings.mlr.press/v161/greenberg21a.htmlCondition number bounds for causal inferenceAn important achievement in the field of causal inference was a complete characterization of when a causal effect, in a system modeled by a causal graph, can be determined uniquely from purely observational data. The identification algorithms resulting from this work produce exact <em>symbolic</em> expressions for causal effects, in terms of the observational probabilities. More recent work has looked at the <em>numerical</em> properties of these expressions, in particular using the classical notion of the <em>condition number</em>. In its classical interpretation, the condition number quantifies the sensitivity of the output values of the expressions to small numerical perturbations in the input observational probabilities. In the context of causal identification, the condition number has also been shown to be related to the effect of certain kinds of uncertainties in the <em>structure</em> of the causal graphical model. In this paper, we first give an upper bound on the condition number for the interesting case of causal graphical models with small “confounded components”. We then develop a tight characterization of the condition number of any given causal identification problem. Finally, we use our tight characterization to give a specific example where the condition number can be much lower than that obtained via generic bounds on the condition number, and to show that even “equivalent” expressions for causal identification can behave very differently with respect to their numerical stability properties.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gordon21a.html
https://proceedings.mlr.press/v161/gordon21a.htmlContextual policy transfer in reinforcement learning domains via deep mixtures-of-expertsIn reinforcement learning, agents that consider the context or current state when transferring source policies have been shown to outperform context-free approaches. However, existing approaches suffer from limitations, including sensitivity to sparse or delayed rewards and estimation errors in values. One important insight is that explicit learned models of the source dynamics, when available, could benefit contextual transfer in such settings. In this paper, we assume a family of tasks with shared sub-goals but different dynamics, and availability of estimated dynamics and policies for source tasks. To deal with possible estimation errors in dynamics, we introduce a novel Bayesian mixture-of-experts for learning state-dependent beliefs over source task dynamics that match the target dynamics using state transitions collected from the target task. The mixture is easy to interpret, is robust to estimation errors in dynamics, and is compatible with most RL algorithms. We incorporate it into standard policy reuse frameworks and demonstrate its effectiveness on benchmarks from OpenAI gym.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gimelfarb21a.html
https://proceedings.mlr.press/v161/gimelfarb21a.htmlDecentralized multi-agent active search for sparse signalsActive search refers to the problem of efficiently locating targets in an unknown environment by actively making data-collection decisions. In this paper, we focus on multiple aerial robots (agents) detecting targets such as gas leaks, radiation sources or human survivors of disasters. One of the main challenges of active search with multiple agents in unknown environments is the impracticality of central coordination due to the difficulties of connectivity maintenance. In this paper, we propose two distinct active search algorithms that allow for multiple robots to independently make data-collection decisions without a central coordinator. Throughout, we assume that targets are sparsely located around the environment, in keeping with compressive sensing assumptions and their applicability in real world scenarios. Additionally, while most common sensing algorithms assume that agents can sense the entire environment (e.g. compressive sensing) or sense point-wise (e.g. Bayesian Optimization) at a time, we make a realistic assumption that each agent can only sense a contiguous region of space at each time step. We provide simulation results as well as theoretical analysis to demonstrate the efficacy of our proposed algorithms.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ghods21a.html
https://proceedings.mlr.press/v161/ghods21a.htmlNo-regret learning with high-probability in adversarial Markov decision processesIn a variety of problems, a decision-maker is unaware of the loss function associated with a task, yet it has to minimize this unknown loss in order to accomplish the task. Furthermore, the decision-maker’s task may evolve, resulting in a varying loss function. In this setting, we explore sequential decision-making problems modeled by adversarial Markov decision processes, where the loss function may arbitrarily change at every time step. We consider the bandit feedback scenario, where the agent observes only the loss corresponding to its actions. We propose an algorithm, called <em>online relative-entropy policy search with implicit exploration</em>, that achieves a sublinear regret not only in expectation but, more importantly, with high probability. In particular, we prove that by employing an <em>optimistically biased</em> loss estimator, the proposed algorithm achieves a regret of $\tilde{\mathcal{O}}((T|\act||\st|)^{\pp} \sqrt{\tau})$, where $|\st|$ is the number of states, $|\act|$ is the number of actions, $\tau$ is the mixing time, and $T$ is the time horizon. To our knowledge, the proposed algorithm is the first scheme that enjoys such high-probability regret bounds for general adversarial Markov decision processes under the presence of bandit feedback.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ghasemi21a.html
https://proceedings.mlr.press/v161/ghasemi21a.htmlLearning probabilistic sentential decision diagrams under logic constraints by sampling and averagingProbabilistic Sentential Decision Diagrams (PSDDs) are effective tools for combining uncertain knowledge in the form of (learned) probabilities and certain knowledge in the form of logical constraints. Despite some promising recent advances in the topic, very little attention has been given to the problem of effectively learning PSDDs from data and logical constraints in large domains. In this paper, we show that a simple strategy of sampling and averaging PSDDs leads to state-of-the-art performance in many tasks. We overcome some of the issues with previous methods by employing a top-down generation of circuits from a logic formula represented as a BDD. We discuss how to locally grow the circuit while achieving a good trade-off between complexity and goodness-of-fit of the resulting model. Generalization error is further decreased by aggregating sampled circuits through an ensemble of models. Experiments with various domains show that the approach efficiently learns good models even in very low data regimes, while remaining competitive for large sample sizes.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/geh21a.html
https://proceedings.mlr.press/v161/geh21a.htmlCorrelated weights in infinite limits of deep convolutional neural networksInfinite width limits of deep neural networks often have tractable forms. They have been used to analyse the behaviour of finite networks, as well as being useful methods in their own right. When investigating infinitely wide convolutional neural networks (CNNs), it was observed that the correlations arising from spatial weight sharing disappear in the infinite limit. This is undesirable, as spatial correlation is the main motivation behind CNNs. We show that the loss of this property is not a consequence of the infinite limit, but rather of choosing an independent weight prior. Correlating the weights maintains the correlations in the activations. Varying the amount of correlation interpolates between independent-weight limits and mean-pooling. Empirical evaluation of the infinitely wide network shows that optimal performance is achieved between the extremes, indicating that correlations can be useful.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/garriga-alonso21a.html
https://proceedings.mlr.press/v161/garriga-alonso21a.htmlLearning and certification under instance-targeted poisoningIn this paper, we study PAC learnability and certification under instance-targeted poisoning attacks, where the adversary may change a fraction of the training set with the goal of fooling the learner at a specific target instance. Our first contribution is to formalize the problem in various settings and to explicitly discuss subtle aspects such as the learner’s randomness and whether (or not) the adversary’s attack can depend on it. We show that when the budget of the adversary scales sublinearly with the sample complexity, PAC learnability and certification are achievable. In contrast, when the adversary’s budget grows linearly with the sample complexity, the adversary can potentially drive up the expected 0-1 loss to one. We also study distribution-specific PAC learning in the same attack model and show that proper learning with certification is possible for learning half spaces under natural distributions. Finally, we empirically study the robustness of K nearest neighbour, logistic regression, multi-layer perceptron, and convolutional neural network on real data sets against targeted-poisoning attacks. Our experimental results show that many models, especially state-of-the-art neural networks, are indeed vulnerable to these strong attacks. Interestingly, we observe that methods with high standard accuracy might be more vulnerable to instance-targeted poisoning attacks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gao21b.html
https://proceedings.mlr.press/v161/gao21b.htmlContrastive prototype learning with augmented embeddings for few-shot learningMost recent few-shot learning (FSL) methods are based on meta-learning with episodic training. In each meta-training episode, a discriminative feature embedding and/or classifier are first constructed from a support set in an inner loop, and then evaluated in an outer loop using a query set for model updating. This query set sample centered learning objective is, however, intrinsically limited in addressing the lack of training data in the support set. In this paper, a novel contrastive prototype learning with augmented embeddings (CPLAE) model is proposed to overcome this limitation. First, data augmentations are introduced to both the support and query sets, with each sample now being represented as an augmented embedding (AE) composed of concatenated embeddings of both the original and augmented versions. Second, a novel support set class prototype centered contrastive loss is proposed for contrastive prototype learning (CPL). With a class prototype as an anchor, CPL aims to pull the query samples of the same class closer and those of different classes further away. This support set sample centered loss is highly complementary to the existing query centered loss, fully exploiting the limited training data in each episode. Extensive experiments on several benchmarks demonstrate that our proposed CPLAE achieves new state-of-the-art performance.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/gao21a.html
https://proceedings.mlr.press/v161/gao21a.htmlGeneralized parametric path problemsParametric path problems arise independently in diverse domains, ranging from transportation to finance, where they are studied under various assumptions. We formulate a general path problem with relaxed assumptions, and describe how this formulation is applicable in these domains. We study the complexity of the general problem, and a variant of it where preprocessing is allowed. We show that when the parametric weights are linear functions, algorithms remain tractable even under our relaxed assumptions. Furthermore, we show that if the weights are allowed to be non-linear, the problem becomes NP-hard. We also study the multi-dimensional version of the problem where the weight functions are parameterized by multiple parameters. We show that even with two parameters, this problem is NP-hard.Wed, 01 Dec 2021 00:00:00 +0000
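To make the linear-weight setting in the abstract above concrete, here is a minimal sketch of a parametric shortest-path query in which every edge weight is a linear function a + b·λ of a single parameter λ. It only evaluates the shortest path at a fixed λ with Dijkstra's algorithm and is not the paper's algorithm; the graph, the `Edge` layout and the function name `shortest_path` are assumptions made for this example, and the weights are assumed non-negative at the queried λ.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical edge layout: (head, a, b), meaning weight(lam) = a + b * lam.
Edge = Tuple[str, float, float]


def shortest_path(graph: Dict[str, List[Edge]], source: str, target: str,
                  lam: float) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm on the graph with weights evaluated at lam."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, a, b in graph.get(u, []):
            w = a + b * lam
            if w < 0:
                raise ValueError("Dijkstra requires non-negative weights")
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk predecessors back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Route via A costs 2 + 2*lam; route via B costs a constant 4.
G: Dict[str, List[Edge]] = {
    "s": [("A", 1.0, 0.0), ("B", 2.0, 0.0)],
    "A": [("t", 1.0, 2.0)],
    "B": [("t", 2.0, 0.0)],
}

for lam in (0.0, 2.0):
    print(lam, shortest_path(G, "s", "t", lam))
```

In this toy graph the optimal route switches from A to B at λ = 1, where the two linear cost functions cross; tracking such breakpoints over a range of λ is what distinguishes a parametric query from a single shortest-path computation.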
https://proceedings.mlr.press/v161/gajjar21a.html
https://proceedings.mlr.press/v161/gajjar21a.htmlPartial Identifiability in Discrete Data with Measurement ErrorWhen data contains measurement errors, it is necessary to make modeling assumptions relating the error-prone measurements to the unobserved true values. Work on measurement error has largely focused on models that fully identify the parameter of interest. As a result, many practically useful models that result in <em>bounds</em> on the target parameter – known as partial identification – have been neglected. In this work, we present a method for partial identification in a class of measurement error models involving discrete variables. We focus on models that impose linear constraints on the target parameter, allowing us to compute partial identification bounds using off-the-shelf LP solvers. We show how several common measurement error assumptions can be composed with an extended class of instrumental variable-type models to create such linear constraint sets. We further show how this approach can be used to bound causal parameters, such as the average treatment effect, when treatment or outcome variables are measured with error. Using data from the Oregon Health Insurance Experiment, we apply this method to estimate bounds on the effect Medicaid enrollment has on depression when depression is measured with error.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/finkelstein21b.html
https://proceedings.mlr.press/v161/finkelstein21b.htmlEntropic Inequality Constraints from e-separation Relations in Directed Acyclic Graphs with Hidden VariablesDirected acyclic graphs (DAGs) with hidden variables are often used to characterize causal relations between variables in a system. When some variables are unobserved, DAGs imply a notoriously complicated set of constraints on the distribution of observed variables. In this work, we present entropic inequality constraints that are implied by e-separation relations in hidden variable DAGs with discrete observed variables. The constraints can intuitively be understood to follow from the fact that the capacity of variables along a causal pathway to convey information is restricted by their entropy; e.g. at the extreme case, a variable with entropy 0 can convey no information. We show how these constraints can be used to learn about the true causal model from an observed data distribution. In addition, we propose a measure of causal influence called the minimal mediary entropy, and demonstrate that it can concisely augment traditional measures such as the average treatment effect.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/finkelstein21a.html
https://proceedings.mlr.press/v161/finkelstein21a.htmlOn the effects of quantisation on model uncertainty in Bayesian neural networksBayesian neural networks (BNNs) are making significant progress in many research areas where decision-making needs to be accompanied by uncertainty estimation. Being able to quantify uncertainty while making decisions is essential for understanding when the model is over-/under-confident, and hence BNNs are attracting interest in safety-critical applications, such as autonomous driving, healthcare, and robotics. Nevertheless, BNNs have not been as widely used in industrial practice, mainly because of their increased memory and compute costs. In this work, we investigate quantisation of BNNs by compressing 32-bit floating-point weights and activations to their integer counterparts, an approach that has already been successful in reducing the compute demand in standard pointwise neural networks. We study three types of quantised BNNs, evaluate them under a wide range of settings, and empirically demonstrate that a uniform quantisation scheme applied to BNNs does not substantially decrease their quality of uncertainty estimation.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ferianc21a.html
https://proceedings.mlr.press/v161/ferianc21a.htmlNon-PSD matrix sketching with applications to regression and optimizationA variety of dimensionality reduction techniques have been applied for computations involving large matrices. The underlying matrix is randomly compressed into a smaller one, while approximately retaining many of its original properties. As a result, much of the expensive computation can be performed on the small matrix. The sketching of positive semidefinite (PSD) matrices is well understood, but there are many applications where the related matrices are not PSD, including Hessian matrices in non-convex optimization and covariance matrices in regression applications involving complex numbers. In this paper, we present novel dimensionality reduction methods for non-PSD matrices, as well as their "square-roots", which involve matrices with complex entries. We show how these techniques can be used for multiple downstream tasks. In particular, we show how to use the proposed matrix sketching techniques for both convex and non-convex optimization, $\ell_p$-regression for every $1 \le p < \infty$, and vector-matrix-vector queries.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/feng21a.html
https://proceedings.mlr.press/v161/feng21a.htmlLifted reasoning meets weighted model integrationExact inference in probabilistic graphical models is particularly challenging in the presence of relational and other deterministic constraints. For discrete domains, weighted model counting has emerged as an effective and general approach in a variety of formalisms. Weighted first-order model counting, which allows relational atoms and function-free first order logic has pushed the envelope further, by exploiting symmetry properties over indistinguishable groups of objects, and by extension avoids the need to perform inference on the exponential ground theory. Given the limitation to discrete domains, the formulation of weighted model integration was proposed as an extension to weighted model counting for mixed discrete-continuous domains over both symbolic and numeric weight functions. While that formulation has enjoyed considerable attention in recent years, there is very little understanding on whether the task can be solved at a lifted level, that is, whether we can reason with relational models by avoiding grounding. In this paper, we consider this question. We show how to generalize algorithmic ideas known in the circuit compilation for function-free lifted inference to functions with a continuous range.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/feldstein21a.html
https://proceedings.mlr.press/v161/feldstein21a.htmlBayesian streaming sparse Tucker decompositionTucker decomposition is a classical tensor factorization model. Compared with the most widely used CP decomposition, the Tucker model is much more flexible and interpretable in that it accounts for every possible (multiplicative) interaction between the factors in different modes. However, this also brings in the risk of overfitting and computational challenges, especially in the case of fast streaming data. To address these issues, we develop BASS-Tucker, a BAyesian Streaming Sparse Tucker decomposition method. We place a spike-and-slab prior over the core tensor elements to automatically select meaningful factor interactions so as to prevent overfitting and to further enhance the interpretability. To enable efficient streaming factorization, we use conditional moment matching and Delta’s method to develop one-shot incremental update of the latent factors and core tensor upon receiving each streaming batch. Thereby, we avoid processing the data points one by one as in the standard assumed density filtering, which needs to update the core tensor for each point and is quite inefficient. We explicitly introduce and update a sparse prior approximation in the running posterior to fulfill effective sparse estimation in the streaming inference. We show the advantage of BASS-Tucker in several real-world applications.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/fang21b.html
https://proceedings.mlr.press/v161/fang21b.htmlEfficient greedy coordinate descent via variable partitioningGreedy coordinate descent (GCD) is an efficient optimization algorithm for a wide range of machine learning and data mining applications. GCD could be significantly faster than randomized coordinate descent (RCD) if they have similar per iteration cost. Nevertheless, in some cases, the greedy rule used in GCD cannot be efficiently implemented, leading to huge per iteration cost and making GCD slower than RCD. To alleviate the cost per iteration, the existing solutions rely on maximum inner product search (MIPS) as an approximate greedy rule. But it has been empirically shown that GCD with approximate greedy rule could suffer from slow convergence even with the state-of-the-art MIPS algorithms. We propose a hybrid coordinate descent algorithm with a simple variable partition strategy to tackle the cases when greedy rule cannot be implemented efficiently. The convergence rate and theoretical properties of the new algorithm are presented. The proposed method is shown to be especially useful when the data matrix has a group structure. Numerical experiments with both synthetic and real-world data demonstrate that our new algorithm is competitive against RCD, GCD, approximate GCD with MIPS and their accelerated variants.Wed, 01 Dec 2021 00:00:00 +0000
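To make the contrast between the greedy and randomized rules concrete, here is a minimal sketch of coordinate descent on a small strictly convex quadratic. It is not the hybrid partitioning algorithm proposed in the paper: the matrix `A`, vector `b`, and the function names are assumptions chosen for illustration, and the greedy rule shown is the plain Gauss-Southwell rule (largest absolute gradient entry), whose full-gradient cost per step is the expense the abstract refers to.

```python
import random

def gradient(A, b, x):
    """Gradient of f(x) = 0.5 * x^T A x - b^T x, i.e. A x - b."""
    n = len(b)
    return [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]

def objective(A, b, x):
    """Value of f(x) = 0.5 * x^T A x - b^T x."""
    n = len(b)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))

def coordinate_descent(A, b, steps, rule="greedy", seed=0):
    """Exact coordinate minimization; `rule` decides which coordinate to update."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        g = gradient(A, b, x)  # full gradient: cheap here, costly at scale
        if rule == "greedy":
            # Gauss-Southwell rule: coordinate with the largest |gradient| entry.
            i = max(range(n), key=lambda k: abs(g[k]))
        else:
            # Randomized rule: pick a coordinate uniformly at random.
            i = rng.randrange(n)
        x[i] -= g[i] / A[i][i]  # exact minimizer of f along coordinate i
    return x

# Hypothetical symmetric, diagonally dominant (hence positive definite) problem.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_greedy = coordinate_descent(A, b, steps=200, rule="greedy")
x_random = coordinate_descent(A, b, steps=200, rule="random")
```

Both rules use the same exact one-dimensional update, so any difference in progress per step comes only from how the coordinate is chosen.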
https://proceedings.mlr.press/v161/fang21a.html
https://proceedings.mlr.press/v161/fang21a.htmlDependency in DAG models with hidden variablesDirected acyclic graph models with hidden variables have been much studied, particularly in view of their computational efficiency and connection with causal methods. In this paper we provide the circumstances under which it is possible for two variables to be identically equal, while all other observed variables stay jointly independent of them and mutually of each other. We find that this is possible if and only if the two variables are ‘densely connected’; in other words, if applications of identifiable causal interventions on the graph cannot (non-trivially) separate them. As a consequence of this, we can also allow such pairs of random variables to have any bivariate joint distribution that we choose. This has implications for model search, since it suggests that we can restrict attention to graphs in which densely connected vertices are always joined by an edge.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/evans21a.html
https://proceedings.mlr.press/v161/evans21a.htmlTowards robust episodic meta-learningMeta-learning learns across historical tasks with the goal to discover a representation from which it is easy to adapt to unseen tasks. Episodic meta-learning attempts to simulate a realistic setting by generating a set of small artificial tasks from a larger set of training tasks for meta-training and proceeds in a similar fashion for meta-testing. However, this (meta-)learning paradigm has recently been shown to be brittle, suggesting that the inductive bias encoded in the learned representations is inadequate. In this work we propose to compose episodes to robustify meta-learning in the few-shot setting in order to learn more efficiently and to generalize better to new tasks. We make use of active learning scoring rules to select the data to be included in the episodes. We assume that the meta-learner is given new tasks at random, but the data associated to the tasks can be selected from a larger pool of unlabeled data, and investigate where active learning can boost the performance of episodic meta-learning. We show that instead of selecting samples at random, it is better to select samples in an active manner especially in settings with out-of-distribution and class-imbalanced tasks. We evaluate our method with Prototypical Networks, foMAML and protoMAML, reporting significant improvements on public benchmarks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ermis21a.html
https://proceedings.mlr.press/v161/ermis21a.htmlHigh-dimensional Bayesian optimization with sparse axis-aligned subspacesBayesian optimization (BO) is a powerful paradigm for efficient optimization of black-box objective functions. High-dimensional BO presents a particular challenge, in part because the curse of dimensionality makes it difficult to define—as well as do inference over—a suitable class of surrogate models. We argue that Gaussian process surrogate models defined on sparse axis-aligned subspaces offer an attractive compromise between flexibility and parsimony. We demonstrate that our approach, which relies on Hamiltonian Monte Carlo for inference, can rapidly identify sparse subspaces relevant to modeling the unknown objective function, enabling sample-efficient high-dimensional BO. In an extensive suite of experiments comparing to existing methods for high-dimensional BO we demonstrate that our algorithm, Sparse Axis-Aligned Subspace BO (SAASBO), achieves excellent performance on several synthetic and real-world problems without the need to set problem-specific hyperparameters.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/eriksson21a.html
https://proceedings.mlr.press/v161/eriksson21a.htmlWhen is particle filtering efficient for planning in partially observed linear dynamical systems?Particle filtering is a popular method for inferring latent states in stochastic dynamical systems, whose theoretical properties have been well studied in machine learning and statistics communities. In many control problems, e.g., partially observed linear dynamical systems (POLDS), oftentimes the inferred latent state is further used for planning at each step. This paper initiates a rigorous study on the efficiency of particle filtering for sequential planning, and gives the first particle complexity bounds. Though errors in past actions may affect the future, we are able to bound the number of particles needed so that the long-run reward of the policy based on particle filtering is close to that based on exact inference. In particular, we show that, in stable systems, polynomially many particles suffice. Key in our proof is a coupling of the ideal sequence based on the exact planning and the sequence generated by approximate planning based on particle filtering. We believe this technique can be useful in other sequential decision-making problems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/du21a.html
https://proceedings.mlr.press/v161/du21a.htmlPALM: Probabilistic area loss Minimization for Protein Sequence AlignmentProtein sequence alignment is a fundamental problem in computational structural biology and is widely used for protein 3D structure prediction and protein homology detection. Most existing programs for detecting protein sequence alignments are based on the likelihood information of amino acids and are sensitive to alignment noise. We present a novel method, PALM, for modeling pairwise protein structure alignments, using the area distance to reduce the biological measurement noise. PALM generatively learns the alignment of two protein sequences with a probabilistic area distance objective, which can denoise the measurement errors contained in the ground-truth alignments. During learning, we show that the optimization is computationally efficient by estimating the gradients via dynamically sampling alignments. Empirically, we show that PALM can generate sequence alignments with higher precision and recall, as well as smaller area distance, than the competing methods, especially for long protein sequences and remote homologies. This study suggests that PALM is worth trying on large-scale protein sequence alignment problems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ding21c.html
https://proceedings.mlr.press/v161/ding21c.htmlDefending SVMs against poisoning attacks: the hardness and DBSCAN approachAdversarial machine learning has attracted a great amount of attention in recent years. Due to the great importance of support vector machines (SVM) in machine learning, we consider defending SVM against poisoning attacks in this paper. We study two commonly used strategies for defending: designing robust SVM algorithms and data sanitization. Though several robust SVM algorithms have been proposed before, most of them either lack adversarial resilience, or rely on strong assumptions about the data distribution or the attacker’s behavior. Moreover, the research on the hardness of designing a quality-guaranteed adversarially-resilient SVM algorithm is still quite limited. We are the first, to the best of our knowledge, to prove that even the simplest hard-margin one-class SVM with adversarial outliers problem is NP-complete, and has no fully polynomial-time approximation scheme (FPTAS) unless P=NP. For data sanitization, we explain the effectiveness of DBSCAN (as a density-based outlier removal method) for defending against poisoning attacks. In particular, we link it to the intrinsic dimensionality by proving a sampling theorem in doubling metrics. In our empirical experiments, we systematically compare several defenses including the DBSCAN and robust SVM methods, and investigate the influence of the intrinsic dimensionality and the poisoned fraction on their performance.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ding21b.html
https://proceedings.mlr.press/v161/ding21b.htmlXOR-SGD: provable convex stochastic optimization for decision-making under uncertaintyMany decision-making problems under uncertainty can be formulated as convex stochastic optimization, which minimizes a convex objective in expectation across exponentially many probabilistic scenarios. Despite its convexity, evaluating the objective function is #P-hard. Previous approaches use samples from MCMC and its variants to approximate the objective function but have a slow mixing rate. We present XOR-SGD, a stochastic gradient descent (SGD) approach guaranteed to converge to solutions that are at most a constant away from the true optimum in a linear number of iterations. XOR-SGD harnesses XOR-sampling, which reduces the sample approximation of the expectation into queries of NP oracles via hashing and projection. We evaluate XOR-SGD on two real-world applications. The first, a stochastic inventory management problem, searches for a robust inventory management plan in preparation for virus pandemics, natural disasters, etc. The second, a network design problem, decides an optimal land conservation plan that promotes the free movement of wildlife. We show that our approach finds better solutions with drastically fewer samples than a couple of state-of-the-art solvers.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ding21a.html
https://proceedings.mlr.press/v161/ding21a.htmlWeighted model counting with conditional weights for Bayesian networksWeighted model counting (WMC) has emerged as the unifying inference mechanism across many (probabilistic) domains. Encoding an inference problem as an instance of WMC typically necessitates adding extra literals and clauses. This is partly so because the predominant definition of WMC assigns weights to models based on weights on literals, and this severely restricts what probability distributions can be represented. We develop a measure-theoretic perspective on WMC and propose a way to encode conditional weights on literals analogously to conditional probabilities. This representation can be as succinct as standard WMC with weights on literals but can also expand as needed to represent probability distributions with less structure. To demonstrate the performance benefits of conditional weights over the addition of extra literals, we develop a new WMC encoding for Bayesian networks and adapt a state-of-the-art WMC algorithm ADDMC to the new format. Our experiments show that the new encoding significantly improves the performance of the algorithm on most benchmark instances.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/dilkas21a.html
https://proceedings.mlr.press/v161/dilkas21a.htmlThompson sampling for Markov games with piecewise stationary opponent policiesReinforcement learning problems with multiple agents pose the challenge of efficiently adapting to nonstationary dynamics arising from other agents’ strategic behavior. Although several algorithms exist for these problems with promising empirical results, regret analysis and efficient use of other-agent models in general-sum games is very limited. We propose an algorithm (TSMG) for general-sum Markov games against agents that switch between several stationary policies, combining change detection with Thompson sampling to learn parametric models of these policies. Under standard assumptions for parametric Markov decision process (MDP) learning, the expected regret of TSMG in the worst case over policy parameters and switch schedules is near-optimal in time and number of switches, up to logarithmic factors. Our experiments on simulated games show that TSMG can outperform standard Thompson sampling and a version of Thompson sampling with a static reset schedule, despite the violation of an assumption that the MDPs induced by the other player are ergodic.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/digiovanni21a.html
https://proceedings.mlr.press/v161/digiovanni21a.htmlRandom probabilistic circuitsDensity estimation could be viewed as a core component in machine learning, since a good estimator could be used to solve many tasks such as classification, regression, and imputing missing values. The main challenge of density estimation is balancing the model expressiveness and its learning and inference complexity. Probabilistic circuits (PCs) model a probability distribution as a computational graph. By imposing specific structural properties on such models many inference tasks become tractable. However, learning PCs usually relies on greedy and time consuming procedures. In this paper we propose a new unified approach to efficiently learn PCs having several structural properties. We introduce extremely randomized PCs (XPCs), PCs with a random structure. We show their advantage on standard density estimation benchmarks when compared to other density estimators.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/di-mauro21a.html
https://proceedings.mlr.press/v161/di-mauro21a.htmlA heuristic for statistical seriationWe study the statistical seriation problem, where the goal is to estimate a matrix whose rows satisfy the same shape constraint after a permutation of the columns. This is an important classical problem that has close connections to the statistical literature on permutation-based models and wide applications ranging from archaeology to biology. Specifically, we consider the case where the rows are monotonically increasing after an unknown permutation of the columns. Past work has shown that the least-squares estimator is optimal up to logarithmic factors, but efficient algorithms for computing the least-squares estimator remain unknown to date. We approach this important problem from a heuristic perspective. Specifically, we replace the combinatorial permutation constraint by a continuous regularization term, and then use projected gradient descent to obtain a local minimum of the non-convex objective. We show that the attained local minimum is the global minimum in certain special cases under the noiseless setting, and preserves desirable properties under the noisy setting. Simulation results reveal that our proposed algorithm outperforms prior algorithms when (1) the underlying model is more complex than simplistic parametric assumptions such as low-rankedness, or (2) the signal-to-noise ratio is high. Under partial observations, the proposed algorithm requires an initialization, and different initializations may lead to different local minima. We empirically observe that the proposed algorithm yields consistent improvement over the initialization, even though different initializations start with different levels of quality.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/dhull21a.html
https://proceedings.mlr.press/v161/dhull21a.htmlSum-product laws and efficient algorithms for imprecise Markov chainsWe propose two sum-product laws for imprecise Markov chains, and use these laws to derive two algorithms to efficiently compute lower and upper expectations for imprecise Markov chains under complete independence and epistemic irrelevance. These algorithms work for inferences that have a corresponding sum-product decomposition, and we argue that many well-known inferences fit their scope. We illustrate our results on a simple epidemiological example.Wed, 01 Dec 2021 00:00:00 +0000
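For readers unfamiliar with lower expectations in imprecise Markov chains, the following sketch shows the standard backward recursion for a function of the final state. It is not the paper's sum-product laws or algorithms: it assumes each state comes with a finite list of candidate transition rows (standing in for the extreme points of a credal set), and the two-state "healthy/infected" chain and all names are hypothetical.

```python
def lower_transition(rows_by_state, g):
    """For each state x, the minimum expected value of g over x's candidate rows."""
    return [
        min(sum(p * gy for p, gy in zip(row, g)) for row in rows)
        for rows in rows_by_state
    ]

def lower_expectation(rows_by_state, f, n_steps):
    """Lower expectation of f(X_n) for each initial state, by backward recursion."""
    g = list(f)
    for _ in range(n_steps):
        g = lower_transition(rows_by_state, g)
    return g

def upper_expectation(rows_by_state, f, n_steps):
    """Upper expectation via conjugacy: upper(f) = -lower(-f)."""
    neg = lower_expectation(rows_by_state, [-v for v in f], n_steps)
    return [-v for v in neg]

# Hypothetical two-state chain (0 = healthy, 1 = infected). Each state has two
# candidate transition rows; the true row is only known to lie among them.
rows_by_state = [
    [[0.9, 0.1], [0.7, 0.3]],  # candidate rows when currently healthy
    [[0.4, 0.6], [0.2, 0.8]],  # candidate rows when currently infected
]
healthy_indicator = [1.0, 0.0]

lower = lower_expectation(rows_by_state, healthy_indicator, n_steps=2)
upper = upper_expectation(rows_by_state, healthy_indicator, n_steps=2)
```

With these numbers, `lower` and `upper` bracket the probability of being healthy after two steps from each starting state; the recursion costs one pass over the candidate rows per time step rather than an enumeration of all compatible chains.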
https://proceedings.mlr.press/v161/de-bock21a.html
https://proceedings.mlr.press/v161/de-bock21a.htmlAsynchronous $ε$-Greedy Bayesian OptimisationBatch Bayesian optimisation (BO) is a successful technique for the optimisation of expensive black-box functions. Asynchronous BO can reduce wallclock time by starting a new evaluation as soon as another finishes, thus maximising resource utilisation. To maximise resource allocation, we develop a novel asynchronous BO method, AEGiS (Asynchronous $\epsilon$-Greedy Global Search) that combines greedy search, exploiting the surrogate’s mean prediction, with Thompson sampling and random selection from the approximate Pareto set describing the trade-off between exploitation (surrogate mean prediction) and exploration (surrogate posterior variance). We demonstrate empirically the efficacy of AEGiS on synthetic benchmark problems, meta-surrogate hyperparameter tuning problems and real-world problems, showing that AEGiS generally outperforms existing methods for asynchronous BO. When a single worker is available performance is no worse than BO using expected improvement.Wed, 01 Dec 2021 00:00:00 +0000
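The exploitation/exploration trade-off described above can be sketched as an $\epsilon$-greedy choice between the candidate with the best surrogate mean and a random member of the mean/variance Pareto set. This is a simplified illustration, not AEGiS itself: it assumes minimisation, omits the Thompson sampling branch (which needs a posterior sampler), and the candidate values and function names are hypothetical.

```python
import random

def pareto_set(candidates):
    """Indices of candidates not dominated in (lower mean, higher variance)."""
    front = []
    for i, (mi, vi) in enumerate(candidates):
        dominated = any(
            (mj <= mi and vj >= vi) and (mj < mi or vj > vi)
            for j, (mj, vj) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

def epsilon_greedy_pick(candidates, epsilon, rng):
    """With probability epsilon explore the Pareto set; otherwise exploit the lowest mean."""
    if rng.random() < epsilon:
        return rng.choice(pareto_set(candidates))
    return min(range(len(candidates)), key=lambda i: candidates[i][0])

# Hypothetical (posterior mean, posterior variance) pairs for four candidate points.
candidates = [(0.2, 0.1), (0.5, 0.9), (0.3, 0.05), (0.4, 0.5)]
front = pareto_set(candidates)
choice = epsilon_greedy_pick(candidates, epsilon=0.1, rng=random.Random(0))
```

In an asynchronous setting, a function like `epsilon_greedy_pick` would be called each time a worker becomes free, so no worker waits for a full batch to complete.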
https://proceedings.mlr.press/v161/de-ath21a.html
https://proceedings.mlr.press/v161/de-ath21a.htmlSDM-Net: A simple and effective model for generalized zero-shot learningZero-Shot Learning (ZSL) is a classification task where some classes referred to as <em>unseen classes</em> have no training images. Instead, we only have side information about seen and unseen classes, often in the form of semantic or descriptive attributes. Lack of training images from a set of classes restricts the use of standard classification techniques and losses, including the widespread cross-entropy loss. We introduce a novel Similarity Distribution Matching Network (SDM-Net) which is a standard fully connected neural network architecture with a non-trainable penultimate layer consisting of class attributes. The output layer of SDM-Net consists of both seen and unseen classes. To enable zero-shot learning, during training, we regularize the model such that the predicted distribution of unseen class is close in KL divergence to the distribution of similarities between the correct seen class and all the unseen classes. We evaluate the proposed model on five benchmark datasets for zero-shot learning, AwA1, AwA2, aPY, SUN, and CUB datasets. We show that, despite the simplicity, our approach achieves competitive performance with state-of-the-art methods in Generalized-ZSL setting for all of these datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/daghaghi21a.html
https://proceedings.mlr.press/v161/daghaghi21a.htmlMinimax sample complexity for turn-based stochastic gameThe empirical success of multi-agent reinforcement learning is encouraging, but few theoretical guarantees have been established. In this work, we prove that the plug-in solver approach, probably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSGs). Specifically, we perform planning in an empirical TBSG by utilizing a ‘simulator’ that allows sampling from arbitrary state-action pairs. We show that the empirical Nash equilibrium strategy is an approximate Nash equilibrium strategy in the true TBSG and give both problem-dependent and problem-independent bounds. We develop reward perturbation techniques to tackle the non-stationarity in the game and a Taylor-expansion-type analysis to improve the dependence on approximation error. With these novel techniques, we prove the minimax sample complexity of turn-based stochastic games.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/cui21a.html
https://proceedings.mlr.press/v161/cui21a.htmlFormal verification of neural networks for safety-critical tasks in deep reinforcement learningIn recent years, neural networks have achieved groundbreaking successes in a wide variety of applications. However, for safety-critical tasks, such as robotics and healthcare, it is necessary to provide some specific guarantees before deployment in a real-world context. Even in these scenarios, where high-cost equipment and human safety are involved, the evaluation of the models is usually performed with the standard metrics (i.e., cumulative reward or success rate). In this paper, we introduce a novel metric for the evaluation of models in safety-critical tasks, the <em>violation rate</em>. We build our work upon the concept of formal verification for neural networks, providing a new formulation for the safety properties that aims to ensure that the agent always makes rational decisions. To perform this evaluation, we present ProVe (Property Verifier), a novel approach based on interval algebra, designed for the analysis of our novel <em>behavioral</em> properties. We apply our method to different domains (i.e., mapless navigation for mobile robots, trajectory generation for manipulators, and the standard ACAS benchmark). Results show that the violation rate computed by ProVe provides a good evaluation of the safety of trained models.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/corsi21a.html
https://proceedings.mlr.press/v161/corsi21a.htmlScaling Hamiltonian Monte Carlo inference for Bayesian neural networks with symmetric splittingHamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) approach that exhibits favourable exploration properties in high-dimensional models such as neural networks. Unfortunately, HMC has limited use in large-data regimes and little work has explored suitable approaches that aim to preserve the entire Hamiltonian. In our work, we introduce a new symmetric integration scheme for split HMC that does not rely on stochastic gradients. We show that our new formulation is more efficient than previous approaches and is easy to implement with a single GPU. As a result, we are able to perform full HMC over common deep learning architectures using entire data sets. In addition, when we compare with stochastic gradient MCMC, we show that our method achieves better performance in both accuracy and uncertainty quantification. Our approach demonstrates HMC as a feasible option when considering inference schemes for large-scale machine learning problems.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/cobb21a.html
https://proceedings.mlr.press/v161/cobb21a.htmlFeaturized density ratio estimationDensity ratio estimation serves as an important technique in the unsupervised machine learning toolbox. However, such ratios are difficult to estimate for complex, high-dimensional data, particularly when the densities of interest are sufficiently different. In our work, we propose to leverage an invertible generative model to map the two distributions into a common feature space prior to estimation. This featurization brings the densities closer together in latent space, sidestepping pathological scenarios where the learned density ratios in input space can be arbitrarily inaccurate. At the same time, the invertibility of our feature map guarantees that the ratios computed in feature space are equivalent to those in input space. Empirically, we demonstrate the efficacy of our approach in a variety of downstream tasks that require access to accurate density ratios such as mutual information estimation, targeted sampling in deep generative models, and classification with data augmentation.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/choi21a.html
https://proceedings.mlr.press/v161/choi21a.htmlContingency-aware influence maximization: A reinforcement learning approachThe influence maximization (IM) problem aims at finding a subset of seed nodes in a social network that maximize the spread of influence. In this study, we focus on a sub-class of IM problems, where whether the nodes are willing to be the seeds when being invited is uncertain, called contingency-aware IM. Such contingency aware IM is critical for applications for non-profit organizations in low resource communities (e.g., spreading awareness of disease prevention). Despite the initial success, a major practical obstacle in promoting the solutions to more communities is the tremendous runtime of the greedy algorithms and the lack of high performance computing (HPC) for the non-profits in the field – whenever there is a new social network, the non-profits usually do not have the HPCs to recalculate the solutions. Motivated by this and inspired by the line of works that use reinforcement learning (RL) to address combinatorial optimization on graphs, we formalize the problem as a Markov Decision Process (MDP), and use RL to learn an IM policy over historically seen networks, and generalize to unseen networks with negligible runtime at test phase. To fully exploit the properties of our targeted problem, we propose two technical innovations that improve the existing methods, including state-abstraction and theoretically grounded reward shaping. Empirical results show that our method achieves influence as high as the state-of-the-art methods for contingency-aware IM, while having negligible runtime at test phase.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/chen21b.html
https://proceedings.mlr.press/v161/chen21b.htmlCombinatorial semi-bandit in the non-stationary environmentIn this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only approximate offline oracle exists (Wang and Chen, NIPS 2017), our algorithm achieves $\tilde{O}(m\sqrt{N T}/\Delta_{\min})$ distribution-dependent regret in the switching case, and $\tilde{O}({V}^{1/3}T^{2/3})$ distribution-independent regret in the dynamic case, where ${N}$ is the number of switchings and ${V}$ is the sum of the total “distribution changes”, $m$ is the total number of arms, and $\Delta_{\min}$ is a gap variable dependent on the distributions of arm outcomes. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter ${N}$ or ${V}$ in advance. We further show that by employing another technique, our algorithm no longer needs to know the parameters ${N}$ or ${V}$ but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we apply a new technique to design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/chen21a.html
https://proceedings.mlr.press/v161/chen21a.htmlMulti-task and meta-learning with sparse linear banditsMotivated by recent developments on meta-learning with linear contextual bandit tasks, we study the benefit of feature learning in both the multi-task and meta-learning settings. We focus on the case that the task weight vectors are <em>jointly sparse</em>, i.e. they share the same small set of predictive features. Starting from previous work on standard linear regression with the group-lasso estimator we provide novel oracle-inequalities for this estimator when samples are collected by a bandit policy. Subsequently, building on a recent lasso-bandit policy, we investigate its group-lasso variant and analyze its regret bound. We specialize the proposed policy to the multi-task and meta-learning settings, demonstrating its theoretical advantage. We also point out a deficiency in the state-of-the-art lower bound and observe that our method has a smaller upper bound. Preliminary experiments confirm the effectiveness of our approach in practice.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/cella21a.html
https://proceedings.mlr.press/v161/cella21a.htmlVariational inference with continuously-indexed normalizing flowsContinuously-indexed flows (CIFs) have recently achieved improvements over baseline normalizing flows on a variety of density estimation tasks. CIFs do not possess a closed-form marginal density, and so, unlike standard flows, cannot be plugged in directly to a variational inference (VI) scheme in order to produce a more expressive family of approximate posteriors. However, we show here how CIFs can be used as part of an auxiliary VI scheme to formulate and train expressive posterior approximations in a natural way. We exploit the conditional independence structure of multi-layer CIFs to build the required auxiliary inference models, which we show empirically yield low-variance estimators of the model evidence. We then demonstrate the advantages of CIFs over baseline flows in VI problems when the posterior distribution of interest possesses a complicated topology, obtaining improved results in both the Bayesian inference and surrogate maximum likelihood settings.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/caterini21a.html
https://proceedings.mlr.press/v161/caterini21a.htmlTime-variant variational transfer for value functionsIn most transfer learning approaches to reinforcement learning (RL), the distribution over the tasks is assumed to be stationary. Therefore, the target and source tasks are i.i.d. samples of the same distribution. Unfortunately, this assumption rarely holds in real-world conditions, e.g., due to seasonality or periodicity, evolution in the environment or faults in the sensors/actuators. In the context of this work, we consider the problem of transferring value functions through a variational method when the distribution that generates the tasks is time-variant, proposing a solution that leverages this temporal structure inherent in the task generating process. Furthermore, by means of a finite-sample analysis, the previously mentioned solution is theoretically compared to its time-invariant version. Finally, the experimental evaluation of the proposed technique is carried out on the Lake Como water system representing a real-world scenario and on three different RL environments with three distinct temporal dynamics.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/canonaco21a.html
https://proceedings.mlr.press/v161/canonaco21a.htmlProceedings of the thirty-seventh conference on Uncertainty in Artificial Intelligence — PrefaceThe Conference on Uncertainty in Artificial Intelligence (UAI) is a premier international conference on research related to representation, inference, learning and decision making in the presence of uncertainty within the field of Artificial Intelligence. This volume contains all papers that were accepted for the Thirty-seventh UAI Conference, held Online from July 27 to 30, 2021.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/campos21a.html
https://proceedings.mlr.press/v161/campos21a.htmlAn optimization and generalization analysis for max-pooling networksMax-Pooling operations are a core component of deep learning architectures. In particular, they are part of most convolutional architectures used in machine vision, since pooling is a natural approach to pattern detection problems. However, these architectures are not well understood from a theoretical perspective. For example, we do not understand when they can be globally optimized, or what effect over-parameterization has on generalization. Here we perform a theoretical analysis of a convolutional max-pooling architecture, proving that it can be globally optimized, and can generalize well even for highly over-parameterized models. Our analysis focuses on a data generating distribution inspired by a pattern detection problem, where a “discriminative” pattern needs to be detected among “spurious” patterns. We empirically validate that CNNs significantly outperform fully connected networks in our setting, as predicted by our theoretical results.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/brutzkus21a.html
https://proceedings.mlr.press/v161/brutzkus21a.htmlLearning in Multi-Player Stochastic GamesWe consider the problem of simultaneous learning in stochastic games with many players in the finite-horizon setting. While the typical target solution for a stochastic game is a Nash equilibrium, this is intractable with many players. We instead focus on variants of <em>correlated equilibria</em>, such as those studied for extensive-form games. We begin with a hardness result for the adversarial MDP problem: even for a horizon of 3, obtaining sublinear regret against the best non-stationary policy is NP-hard when both rewards and transitions are adversarial. This implies that convergence to even the weakest natural solution concept (normal-form coarse correlated equilibrium) is not possible via black-box reduction to a no-regret algorithm even in stochastic games with constant horizon (unless $NP\subseteq BPP$). Instead, we turn to a different target: algorithms which <em>generate</em> an equilibrium when they are used by all players. Our main result is an algorithm which generates an <em>extensive-form</em> correlated equilibrium, whose runtime is exponential in the horizon but polynomial in all other parameters. We give a similar algorithm which is polynomial in all parameters for “fast-mixing” stochastic games. We also show a method for efficiently reaching normal-form coarse correlated equilibria in “single-controller” stochastic games which follows the traditional no-regret approach. When shared randomness is available, the two generative algorithms can be extended to give simultaneous regret bounds and converge in the traditional sense.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/brown21a.html
https://proceedings.mlr.press/v161/brown21a.htmlFaster lifting for two-variable logic using cell graphsWe consider the weighted first-order model counting (WFOMC) task, a problem with important applications to inference and learning in structured graphical models. Bringing together earlier work [Van den Broeck et al., 2011, 2014], a formal proof was given by Beame et al. [2015] showing that the two-variable fragment of first-order logic, FO^2, is domain-liftable, meaning it admits an algorithm for WFOMC whose runtime is polynomial in the given domain size. However, applying this theoretical upper bound is often impractical for real-world problem instances. We show how to adapt their proof into a fast algorithm for lifted inference in FO^2, using only off-the-shelf tools for knowledge compilation, and several careful optimizations involving the cell graph of the input sentence, a novel construct we define that encodes the interactions between the cells of the sentence. Experimental results show that, despite our approach being largely orthogonal to that of Forclift [Van den Broeck et al., 2011], our algorithm often outperforms it, scaling to larger domain sizes on more complex input sentences.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/bremen21a.html
https://proceedings.mlr.press/v161/bremen21a.htmlOptimized auxiliary particle filters: adapting mixture proposals via convex optimizationAuxiliary particle filters (APFs) are a class of sequential Monte Carlo (SMC) methods for Bayesian inference in state-space models. In their original derivation, APFs operate in an extended state space using an auxiliary variable to improve inference. In this work, we propose <em>optimized auxiliary particle filters</em>, a framework where the traditional APF auxiliary variables are interpreted as weights in an importance sampling mixture proposal. Under this interpretation, we devise a mechanism for proposing the mixture weights that is inspired by recent advances in multiple and adaptive importance sampling. In particular, we propose to select the mixture weights by formulating a convex optimization problem, with the aim of approximating the filtering posterior at each timestep. Further, we propose a weighting scheme that generalizes previous results on the APF (Pitt et al. 2012), proving unbiasedness and consistency of our estimators. Our framework demonstrates significantly improved estimates on a range of metrics compared to state-of-the-art particle filters at similar computational complexity in challenging and widely used dynamical models.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/branchini21a.html
https://proceedings.mlr.press/v161/branchini21a.htmlFinite-time theory for momentum Q-learning Existing studies indicate that momentum ideas in conventional optimization can be used to improve the performance of Q-learning algorithms. However, the finite-time analysis for momentum-based Q-learning algorithms is only available for the tabular case without function approximation. This paper analyzes a class of momentum-based Q-learning algorithms with finite-time convergence guarantees. Specifically, we propose the MomentumQ algorithm, which integrates Nesterov’s and Polyak’s momentum schemes, and generalizes the existing momentum-based Q-learning algorithms. For the infinite state-action space case, we establish the convergence guarantee for MomentumQ with linear function approximation under Markovian sampling. In particular, we characterize a finite-time convergence rate which is provably faster than that of vanilla Q-learning. This is the first finite-time analysis for momentum-based Q-learning algorithms with function approximation. For the tabular case under synchronous sampling, we also obtain a finite-time convergence rate that is slightly better than that of SpeedyQ (Azar et al., NIPS 2011). Finally, we demonstrate through various experiments that the proposed MomentumQ outperforms other momentum-based Q-learning algorithms.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/bowen21a.html
https://proceedings.mlr.press/v161/bowen21a.htmlMin/max stability and box distributionsIn representation learning, capturing correlations between the represented elements is paramount. A recent line of work introduces the notion of learning region-based representations, with the objective of being able to better capture these correlations as set interactions. Box models use regions which are products of intervals on $[0,1]$ (i.e., "boxes"), representing joint probability distributions via Lebesgue measure. To mitigate issues with training, a recent work models the endpoints of these intervals using Gumbel distributions, chosen due to their min/max-stability. In this work we analyze min/max-stability on a bounded domain and provide a specific family of such distributions which, replacing Gumbel, allow for stochastic boxes embedded in a finite measure space. This allows for a latent noise model which is a probability measure. Furthermore, we demonstrate an equivalence between this region-based representation and a density representation, where intersection is given by products of densities. We compare our model to previous region-based probability models, and demonstrate that it can be trained effectively to model correlations.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/boratko21a.html
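The max-stability of Gumbel distributions mentioned in the abstract above is a standard fact: the maximum of independent Gumbel variables with a common scale is again Gumbel, with a log-sum-exp location. The following is a minimal numerical check of that fact only; it is not taken from the paper, it does not implement the bounded-domain family the paper proposes, and the function names and parameter values are hypothetical.

```python
import math

def gumbel_cdf(x, mu, beta):
    """CDF of a Gumbel(mu, beta) distribution: exp(-exp(-(x - mu) / beta))."""
    return math.exp(-math.exp(-(x - mu) / beta))

def max_location(mus, beta):
    """Location of max_i X_i for independent X_i ~ Gumbel(mu_i, beta)."""
    return beta * math.log(sum(math.exp(mu / beta) for mu in mus))

# Illustrative (assumed) locations and a shared scale.
mus, beta = [0.1, 0.4, -0.3], 0.5
mu_max = max_location(mus, beta)

for x in (-1.0, 0.0, 0.7, 2.0):
    # P(max_i X_i <= x) is the product of the individual CDFs ...
    product = math.prod(gumbel_cdf(x, mu, beta) for mu in mus)
    # ... and it matches a single Gumbel CDF with the shifted location.
    assert abs(product - gumbel_cdf(x, mu_max, beta)) < 1e-12
```

The same argument applied to the negated variables gives min-stability, which is why both endpoints of an interval can be modeled this way.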
https://proceedings.mlr.press/v161/boratko21a.htmlA Bayesian nonparametric conditional two-sample test with an application to Local Causal DiscoveryFor a continuous random variable $Z$, testing conditional independence $X \perp\!\!\!\perp Y \mid Z$ is known to be a particularly hard problem. It constitutes a key ingredient of many constraint-based causal discovery algorithms. These algorithms are often applied to datasets containing binary variables, which indicate the ‘context’ of the observations, e.g. a control or treatment group within an experiment. In these settings, conditional independence testing with $X$ or $Y$ binary (and the other continuous) is paramount to the performance of the causal discovery algorithm. To our knowledge no nonparametric ‘mixed’ conditional independence test currently exists, and in practice tests that assume all variables to be continuous are used instead. In this paper we aim to fill this gap, as we combine elements of Holmes et al. (2015) and Teymur and Filippi (2020) to propose a novel Bayesian nonparametric conditional two-sample test. Applied to the Local Causal Discovery algorithm, we investigate its performance on both synthetic and real-world data, and compare with state-of-the-art conditional independence tests.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/boeken21a.html
https://proceedings.mlr.press/v161/boeken21a.htmlImproving approximate optimal transport distances using quantizationOptimal transport (OT) is a popular tool in machine learning to compare probability measures geometrically, but it comes with substantial computational burden. Linear programming algorithms for computing OT distances scale cubically in the size of the input, making OT impractical in the large-sample regime. We introduce a practical algorithm, which relies on a quantization step, to estimate OT distances between measures given cheap sample access. We also provide a variant of our algorithm to improve the performance of approximate solvers, focusing on those for entropy-regularized transport. We give theoretical guarantees on the benefits of this quantization step and display experiments showing that it behaves well in practice, providing a practical approximation algorithm that can be used as a drop-in replacement for existing OT estimators.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/beugnot21a.html
https://proceedings.mlr.press/v161/beugnot21a.htmlSequential core-set Monte CarloSequential Monte Carlo (SMC) is a general-purpose methodology for recursive Bayesian inference, and is widely used in state space modeling and probabilistic programming. Its resample-move variant reduces the variance of posterior estimates by interleaving Markov chain Monte Carlo (MCMC) steps for particle “rejuvenation”; but this requires accessing all past observations and leads to linearly growing memory size and quadratic computation cost. Under the assumption of exchangeability, we introduce sequential core-set Monte Carlo (SCMC), which achieves constant space and linear time by rejuvenating based on sparse, weighted subsets of past data. In contrast to earlier approaches, which uniformly subsample or throw away observations, SCMC uses a novel online version of a state-of-the-art Bayesian core-set algorithm to incrementally construct a nonparametric, data- and model-dependent variational representation of the unnormalized target density. Experiments demonstrate significantly reduced approximation errors at negligible additional cost.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/beronov21a.html
https://proceedings.mlr.press/v161/beronov21a.htmlA kernel two-sample test with selection biasHypothesis testing can help decision-making by quantifying distributional differences between two populations from observational data. However, these tests may inherit biases embedded in the data collection mechanism (some instances often being systematically more likely included in our sample) and consistently reproduce biased decisions. We propose a two-sample test that adjusts for selection bias by accounting for differences in marginal distributions of confounding variables. Our test statistic is a weighted distance between samples embedded in a reproducing kernel Hilbert space, whose balancing weights provably correct for bias. We establish the asymptotic distributions under null and alternative hypotheses, and prove the consistency of empirical approximations to the underlying population quantity. We conclude with performance evaluations on artificial data and experiments on treatment effect studies from economics.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/bellot21b.html
https://proceedings.mlr.press/v161/bellot21b.htmlApplication of kernel hypothesis testing on set-valued dataWe present a general framework for kernel hypothesis testing on distributions of sets of individual examples. Sets may represent many common data sources such as groups of observations in time series, collections of words in text or a batch of images of a given phenomenon. This observation pattern, however, differs from the common assumptions required for hypothesis testing: each set differs in size, may have differing levels of noise, and also may incorporate nuisance variability, irrelevant for the analysis of the phenomenon of interest; all features that bias test decisions if not accounted for. In this paper, we propose to interpret sets as independent samples from a collection of latent probability distributions, and introduce kernel two-sample and independence tests in this latent space of distributions. We prove the consistency of these tests and observe them to outperform in a wide range of synthetic and real data experiments, where previously heuristics were needed for feature extraction and testing.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/bellot21a.html
https://proceedings.mlr.press/v161/bellot21a.htmlAction redundancy in reinforcement learningMaximum Entropy (MaxEnt) reinforcement learning is a powerful learning paradigm which seeks to maximize return under entropy regularization. However, action entropy does not necessarily coincide with state entropy, e.g., when multiple actions produce the same transition. Instead, we propose to maximize the transition entropy, i.e., the entropy of next states. We show that transition entropy can be described by two terms; namely, model-dependent transition entropy and <b>action redundancy</b>. Particularly, we explore the latter in both deterministic and stochastic settings and develop tractable approximation methods in a near model-free setup. We construct algorithms to minimize action redundancy and demonstrate their effectiveness on a synthetic environment with multiple redundant actions as well as contemporary benchmarks in Atari and Mujoco. Our results suggest that action redundancy is a fundamental problem in reinforcement learning.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/baram21a.html
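The gap between action entropy and next-state entropy described in the abstract above can be seen in a toy example. The sketch below assumes a single state with deterministic transitions in which three of four actions lead to the same next state; the state and action names are hypothetical, and the code only compares the two entropies rather than reproducing the paper's decomposition or algorithms.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Assumed toy setup: a uniform policy over four actions in one state.
policy = {"left": 0.25, "up": 0.25, "down": 0.25, "right": 0.25}
# Deterministic transitions; three actions are redundant (same next state).
next_state = {"left": "s1", "up": "s1", "down": "s1", "right": "s2"}

# Push the policy forward through the transitions to get next-state probabilities.
next_state_probs = defaultdict(float)
for action, p in policy.items():
    next_state_probs[next_state[action]] += p

action_entropy = entropy(policy.values())                # entropy over actions
transition_entropy = entropy(next_state_probs.values())  # entropy over next states
```

Here the policy has maximal action entropy, yet the next-state distribution is (0.75, 0.25), so its entropy is strictly smaller. A policy placing probability 0.5 on "right" would have lower action entropy but maximal next-state entropy.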
https://proceedings.mlr.press/v161/baram21a.htmlUnsupervised constrained community detection via self-expressive graph neural networkGraph neural networks (GNNs) are able to achieve promising performance on multiple graph downstream tasks such as node classification and link prediction. Comparatively less work has been done to design GNNs which can operate directly for community detection on graphs. Traditionally, GNNs are trained on a semi-supervised or self-supervised loss function and then clustering algorithms are applied to detect communities. However, such decoupled approaches are inherently sub-optimal. Designing an unsupervised loss function to train a GNN and extract communities in an integrated manner is a fundamental challenge. To tackle this problem, we combine the principle of self-expressiveness with the framework of self-supervised graph neural networks for unsupervised community detection, for the first time in the literature. Our solution is trained in an end-to-end fashion and achieves state-of-the-art community detection performance on multiple publicly available datasets.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/bandyopadhyay21a.html
https://proceedings.mlr.press/v161/bandyopadhyay21a.htmlReZero is all you need: fast convergence at large depthDeep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. [2017] used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on an image recognition task. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/bachlechner21a.html
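The gating described in the abstract above, a residual connection scaled by a single zero-initialized parameter per layer, can be written as x_{i+1} = x_i + alpha_i * F(x_i). The following is a minimal pure-Python sketch of the forward pass only. The toy sublayer `layer_fn` (tanh of a dense map), the dimensions, and all names are assumptions for illustration, not the authors' implementation; training of the gates is not shown.

```python
import math
import random

def layer_fn(x, weights):
    """Toy sublayer F: tanh of a dense linear map (stand-in for any block)."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in weights]

def rezero_forward(x, layers, alphas):
    """Apply x <- x + alpha * F(x) per layer, with one scalar gate per layer."""
    for weights, alpha in zip(layers, alphas):
        fx = layer_fn(x, weights)
        x = [xi + alpha * fi for xi, fi in zip(x, fx)]
    return x

random.seed(0)
dim, depth = 4, 1000  # assumed sizes, chosen only to show large depth
layers = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(depth)]
alphas = [0.0] * depth  # zero-initialized gates

x = [1.0, -2.0, 0.5, 3.0]
out = rezero_forward(x, layers, alphas)  # every gate is zero, so each layer is the identity
```

With every gate at zero the whole stack is the identity map at initialization, regardless of depth, so the input signal is passed through unchanged. The gates are then learned along with the other parameters.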
https://proceedings.mlr.press/v161/bachlechner21a.htmlConstrained labeling for weakly supervised learningCuration of large fully supervised datasets has become one of the major roadblocks for machine learning. Weak supervision provides an alternative to supervised learning by training with cheap, noisy, and possibly correlated labeling functions from varying sources. The key challenge in weakly supervised learning is combining the different weak supervision signals while navigating misleading correlations in their errors. In this paper, we propose a simple data-free approach for combining weak supervision signals by defining a constrained space for the possible labels of the weak signals and training with a random labeling within this constrained space. Our method is efficient and stable, converging after a few iterations of gradient descent. We prove theoretical conditions under which the worst-case error of the randomized label decreases with the rank of the linear constraints. We show experimentally that our method outperforms other weak supervision methods on various text- and image-classification tasks.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/arachie21a.html
https://proceedings.mlr.press/v161/arachie21a.htmlIdentifying regions of trusted predictionsQuantifying the probability of a label prediction being correct on a given test point or a given sub-population enables users to better decide how to use and when to trust machine learning derived predictors. In this work, combining aspects of prior work on conformal predictions and selective classification, we provide a unifying framework for confidence requirements that allows for distinguishing between various sources of uncertainty in the learning process as well as various region specifications. We then consider a set of common prior assumptions on the data generating process and show how these allow learning justifiably trusted predictors.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ananthakrishnan21a.html
https://proceedings.mlr.press/v161/ananthakrishnan21a.htmlMarkov equivalence of max-linear Bayesian networksMax-linear Bayesian networks have emerged as highly applicable models for causal inference from extreme value data. However, conditional independence (CI) for max-linear Bayesian networks behaves differently than for classical Gaussian Bayesian networks. We establish the parallel between the two theories via tropicalization, and establish the surprising result that the Markov equivalence classes for max-linear Bayesian networks coincide with the ones obtained by regular CI. Our paper opens up many open problems at the intersection of extreme value statistics, causal inference and tropical geometry.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/amendola21a.html
https://proceedings.mlr.press/v161/amendola21a.htmlSparse linear networks with a fixed butterfly structure: theory and practiceA butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on FJLT with high probability approximates the action of any matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network by an architecture based on the butterfly network. The proposed architecture reduces the quadratic number of weights required in a standard dense layer to nearly linear, with little compromise in expressibility of the resulting operator. In a wide variety of experiments, including supervised prediction on both NLP and vision data, we show that this not only produces results that match and at times outperform existing well-known architectures, but it also offers faster training and prediction in deployment. To understand the optimization problems posed by neural networks with a butterfly network, we also study the optimization landscape of the encoder-decoder network, where the encoder is replaced by a butterfly network followed by a dense linear layer in smaller dimension. The theoretical result presented in the paper explains why the training speed and outcome are not compromised by our proposed approach.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ailon21a.html
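The structure described in the abstract above, logarithmically many layers each with a linear number of non-zero weights, can be sketched as follows. This assumes the common butterfly layout in which layer l mixes coordinate j with coordinate j XOR 2^l, for an input length that is a power of two. The exact layout in the paper may differ, and all function names are hypothetical.

```python
def butterfly_apply(x, layers):
    """Apply a butterfly network to x.

    layers[l][j] = (w_self, w_partner); output j of layer l mixes
    x[j] and x[j ^ 2**l], so each layer has 2 * len(x) non-zero weights.
    """
    for level, weights in enumerate(layers):
        stride = 1 << level
        x = [w_self * x[j] + w_partner * x[j ^ stride]
             for j, (w_self, w_partner) in enumerate(weights)]
    return x

def hadamard_layers(n):
    """Fixed weights for which the butterfly computes the unnormalized
    Walsh-Hadamard transform (n must be a power of two)."""
    levels = n.bit_length() - 1
    return [[(-1.0, 1.0) if j & (1 << level) else (1.0, 1.0) for j in range(n)]
            for level in range(levels)]

def weight_count(layers):
    """Number of non-zero weights in the butterfly network."""
    return sum(2 * len(weights) for weights in layers)

n = 8
layers = hadamard_layers(n)
y = butterfly_apply([1.0] + [0.0] * (n - 1), layers)  # first column of the transform
sparse_weights = weight_count(layers)                 # 2 * n * log2(n)
dense_weights = n * n                                 # weights in a dense n x n layer
```

For n = 8 the butterfly uses 48 weights against 64 for a dense layer; the gap widens quickly, since 2 n log2(n) grows nearly linearly while n^2 grows quadratically. In a trainable layer the pairs `(w_self, w_partner)` would be learned rather than fixed to the Hadamard pattern used here.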
https://proceedings.mlr.press/v161/ailon21a.htmlConditionally independent data generationConditional independence (CI) is a fundamental concept with wide applications in machine learning and causal inference. Although the problems of testing CI and estimating divergences have been extensively studied, the complementary problem of generating data that satisfies CI has received much less attention. A special case of the generation problem is to produce conditionally independent predictions. Given samples from an input data distribution, we formulate the problem of generating samples from a distribution that is close to the input distribution and satisfies CI. We establish a characterization of CI in terms of a general divergence identity. Based on one version of this identity, an architecture is proposed that leverages the capabilities of generative adversarial networks (GANs) to enforce CI in an end-to-end differentiable manner. As one illustration of the problem formulation and architecture, we consider applications to notions of fairness that can be written as CIs, specifically equalized odds and conditional statistical parity. We demonstrate conditionally independent prediction that trades off adherence to fairness criteria against classification accuracy.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/ahuja21a.html
https://proceedings.mlr.press/v161/ahuja21a.htmlKnown unknowns: Learning novel concepts using reasoning-by-eliminationPeople can learn new visual concepts without any samples, from information given by language or by deductive reasoning. For instance, people can use <em>elimination</em> to infer the meaning of novel labels from their context. While recognizing novel concepts was intensively studied in zero-shot learning with semantic descriptions, training models to learn by elimination is much less studied. Here we describe the first approach to train an agent to reason-by-elimination, by providing instructions that contain both familiar concepts and unfamiliar ones (“pick the red box and the green wambim”). In our framework, the agent combines a perception module with a reasoning module that includes internal memory. It uses reinforcement learning to construct a reasoning policy that, by considering all available items in a room, can make a correct inference even for never-seen objects or concepts. Furthermore, it can then perform one-shot learning and use newly learned concepts for inferring additional novel concepts. We evaluate this approach in a new set of environments, showing that agents successfully learn to reason by elimination, and can also learn novel concepts and use them for further reasoning. This approach paves the way to handle open-world environments by extending the abundant supervised learning approaches with reasoning frameworks that can handle novel concepts.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/agrawal21a.html
https://proceedings.mlr.press/v161/agrawal21a.htmlTowards a unified framework for fair and stable graph representation learningAs the representations output by Graph Neural Networks (GNNs) are increasingly employed in real-world applications, it becomes important to ensure that these representations are fair and stable. In this work, we establish a key connection between counterfactual fairness and stability and leverage it to propose a novel framework, NIFTY (uNIfying Fairness and stabiliTY), which can be used with any GNN to learn fair and stable representations. We introduce a novel objective function that simultaneously accounts for fairness and stability and develop a layer-wise weight normalization using the Lipschitz constant to enhance neural message passing in GNNs. In doing so, we enforce fairness and stability both in the objective function and in the GNN architecture. Further, we show theoretically that our layer-wise weight normalization promotes counterfactual fairness and stability in the resulting representations. We introduce three new graph datasets comprising high-stakes decisions in criminal justice and financial lending domains. Extensive experimentation with the above datasets demonstrates the efficacy of our framework.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/agarwal21b.html
https://proceedings.mlr.press/v161/agarwal21b.htmlCommunication efficient parallel reinforcement learningWe consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions using reinforcement learning for $T$ rounds. The agents share their data with a central server to minimize their regret. We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds. We provide dist-UCRL, which runs at each agent, and prove that the total cumulative regret of $M$ agents is upper bounded as $\tilde{O}(DS\sqrt{MAT})$ for a Markov Decision Process with diameter $D$, number of states $S$, and number of actions $A$. The agents synchronize after their visitations to any state-action pair exceed a certain threshold. Using this, we obtain a bound of $O\left(MSA\log(MT)\right)$ on the total number of communication rounds. Finally, we evaluate the algorithm against multiple environments and demonstrate that the proposed algorithm performs on par with an always-communicating version of the UCRL2 algorithm, with significantly less communication.Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/agarwal21a.html
Possibilistic preference elicitation by minimax regret
Identifying the preferences of a given user through elicitation is a central part of multi-criteria decision aid (MCDA) or preference learning tasks. Two classical ways to perform this elicitation are to use either a robust or a Bayesian approach. However, both have their shortcomings: the robust approach has strong guarantees through very strong hypotheses but cannot integrate uncertain information, whereas the Bayesian approach can integrate uncertainties but sacrifices the previous guarantees and asks for stronger model assumptions. In this paper, we propose and test a method based on possibility theory, which keeps the guarantees of the robust approach without needing its strong hypotheses. Among other things, we show that it can detect user errors as well as model misspecification.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/adam21a.html
Statistical mechanical analysis of neural network pruning
Deep learning architectures with a huge number of parameters are often compressed using <em>pruning</em> techniques to ensure computational efficiency of inference during deployment. Despite a multitude of empirical advances, there is a lack of theoretical understanding of the effectiveness of different pruning methods. We inspect different pruning techniques under the statistical mechanics formulation of a teacher-student framework and derive their generalization error (GE) bounds. It has been shown that the <em>Determinantal Point Process</em> (DPP) based <em>node</em> pruning method is notably superior to competing approaches when tested on real datasets. Using GE bounds in the aforementioned setup, we provide theoretical guarantees for these empirical observations. Another consistent finding in the literature is that sparse neural networks (<em>edge pruned</em>) generalize better than dense neural networks (<em>node pruned</em>) for a fixed number of parameters. We use our theoretical setup to prove this finding and show that even the baseline <em>random edge pruning</em> method performs better than the <em>DPP node pruning</em> method. We also validate this empirically on real datasets.
Wed, 01 Dec 2021 00:00:00 +0000
https://proceedings.mlr.press/v161/acharyya21a.html