Proceedings of Machine Learning Research

Proceedings of Machine Learning Research Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence Held in Eindhoven, The Netherlands on 01-05 August 2022 Published as Volume 180 by the Proceedings of Machine Learning Research on 17 August 2022. Volume Edited by: James Cussens Kun Zhang Series Editors: Neil D. Lawrence https://proceedings.mlr.press/v180/ Wed, 08 Feb 2023 10:38:37 +0000 Wed, 08 Feb 2023 10:38:37 +0000 Jekyll v3.9.3 Causal inference with treatment measurement error: a nonparametric instrumental variable approach We propose a kernel-based nonparametric estimator for the causal effect when the cause is corrupted by error. We do so by generalizing estimation in the instrumental variable setting. Despite significant work on regression with measurement error, additionally handling unobserved confounding in the continuous setting is non-trivial: we have seen little prior work. As a by-product of our investigation, we clarify a connection between mean embeddings and characteristic functions, and how learning one simultaneously allows one to learn the other. This opens the way for kernel method research to leverage existing results in characteristic function estimation. Finally, we empirically show that our proposed method, MEKIV, improves over baselines and is robust under changes in the strength of measurement error and to the type of error distributions. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhu22a.html https://proceedings.mlr.press/v180/zhu22a.html Information design for multiple independent and self-interested defenders: Work less, pay off more This paper studies the problem of information design in a general security game setting in which multiple independent self-interested defenders attempt to provide protection simultaneously on the same set of important targets against an unknown attacker. A principal, who can be one of the defenders, has access to certain private information (i.e., attacker type) whereas other defenders do not. We investigate the question of how that principal, with additional private information, can influence the decisions of the defenders by partially and strategically revealing her information. We focus on the algorithmic study of information design for private signaling in this game setting. In particular, we develop a polynomial-time ellipsoid algorithm to compute an optimal private signaling scheme. Our key finding is that the separation oracle in the ellipsoid approach can be carefully reduced to bipartite matching. Furthermore, we introduce a compact representation of any ex-ante persuasive signaling schemes by exploiting intrinsic security resource allocation structures, enabling us to compute an optimal scheme significantly faster. Our experiment results show that by strategically revealing private information, the principal can significantly enhance the protection effectiveness on the targets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhou22c.html https://proceedings.mlr.press/v180/zhou22c.html Convergence Analysis of Linear Coupling with Inexact Proximal Operator Linear coupling is recently proposed to accelerate first-order algorithms by linking gradient descent and mirror descent together, which is able to achieve the accelerated convergence rate for first-order algorithms. This work focuses on the convergence analysis of linear coupling for convex composite minimization when the proximal operator cannot be exactly computed. It is of particular interest to study the convergence of linear coupling because it not only achieves the accelerated convergence rate for first-order algorithm but also works for generic norms. We present convergence analysis of linear coupling by allowing the proximal operator to be computed up to a certain precision. Our analysis illustrates that the accelerated convergence rate of linear coupling with inexact proximal operator can be preserved if the error sequence of inexact proximal operator decreases in a sufficiently fast rate. More importantly, our analysis leads to better bounds than existing works on inexact proximal operator. Experiment results on several real-world datasets verify our theoretical results. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhou22b.html https://proceedings.mlr.press/v180/zhou22b.html Causal discovery with heterogeneous observational data We consider the problem of causal discovery (structure learning) from heterogeneous observational data. Most existing methods assume homogeneous sampling scheme and causal mechanism, which may lead to misleading conclusions when violated. We propose a novel approach that exploits data heterogeneity to infer possibly cyclic causal structures from causally insufficient systems. The core idea is to model the direct causal effects as functions of exogenous covariates that help explain sampling and causal heterogeneity. We investigate the structure identifiability properties of the proposed model. Structure learning is carried out in a fully Bayesian fashion, which provides natural uncertainty quantification. We demonstrate its utility through extensive simulations and two real-world applications. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhou22a.html https://proceedings.mlr.press/v180/zhou22a.html Research on video adversarial attack with long living cycle In recent years, the vulnerability of networks has attracted the attention of researchers. However, in these methods, the impact of video compression coding on the added adversarial perturbation, i.e., the robustness of the video adversarial example, is not considered. When an adversarial sample is just generated, its attack capability is the strongest. However, with multiple video encoding and video decoding in Internet transmission, the added adversarial disturbance will be continuously eliminated, eventually leading to the attack on the adversarial sample performance disappearing. We define this phenomenon as the decay of the lifetime of adversarial examples. We propose an adversarial attack method based on optimized integer space to resist this performance degradation. The robustness of anti-coding, the visual concealment, and the attack success rate are all considered during the attack process. In addition, we have also reduced the rounding loss caused by normalization in the deep neural network model process. The contributions of our methods are 1) We show the performance degradation caused by video compression coding on existing video adversarial attack methods, which seems an effective way for detecting of defending video adversarial examples. 2) A robust video adversarial attack method is proposed to resist video compression coding. The experiment shows that our method performs better on the robustness of anti-coding, visual concealment, and attack success rate. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhao22a.html https://proceedings.mlr.press/v180/zhao22a.html Stability of SGD: Tightness analysis and improved bounds Stochastic Gradient Descent (SGD) based methods have been widely used for training large-scale machine learning models that also generalize well in practice. Several explanations have been offered for this generalization performance, a prominent one being algorithmic stability Hardt et al [2016]. However, there are no known examples of smooth loss functions for which the analysis can be shown to be tight. Furthermore, apart from properties of the loss function, data distribution has also been shown to be an important factor in generalization performance. This raises the question: is the stability analysis of Hardt et al [2016] tight for smooth functions, and if not, for what kind of loss functions and data distributions can the stability analysis be improved? In this paper we first settle open questions regarding tightness of bounds in the data-independent setting: we show that for general datasets, the existing analysis for convex and strongly-convex loss functions is tight, but it can be improved for non-convex loss functions. Next, we give novel and improved data-dependent bounds: we show stability upper bounds for a large class of convex regularized loss functions, with negligible regularization parameters, and improve existing data-dependent bounds in the non-convex setting. We hope that our results will initiate further efforts to better understand the data-dependent setting under non-convex loss functions, leading to an improved understanding of the generalization abilities of deep networks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhang22b.html https://proceedings.mlr.press/v180/zhang22b.html Distributed adversarial training to robustify deep neural networks at scale Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification. To defend against such attacks, an effective and popular approach, known as adversarial training (AT), has been shown to mitigate the negative impact of adversarial attacks by virtue of a min-max robust training method. While effective, it remains unclear whether it can successfully be adapted to the distributed learning context. The power of distributed optimization over multiple machines enables us to scale up robust training over large models and datasets. Spurred by that, we propose distributed adversarial training (DAT), a large-batch adversarial training framework implemented over multiple machines. We show that DAT is general, which supports training over labeled and unlabeled data, multiple types of attack generation methods, and gradient compression operations favored for distributed optimization. Theoretically, we provide, under standard conditions in the optimization theory, the convergence rate of DAT to the first-order stationary points in general non-convex settings. Empirically, we demonstrate that DAT either matches or outperforms state-of-the-art robust accuracies and achieves a graceful training speedup (e.g., on ResNet-50 under ImageNet). Codes are available at https://github.com/dat-2022/dat. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhang22a.html https://proceedings.mlr.press/v180/zhang22a.html Asymptotic optimality for active learning processes Active Learning (AL) aims to optimize basic learned model(s) iteratively by selecting and annotating unlabeled data samples that are deemed to best maximise the model performance with minimal required data. However, the learned model is easy to overfit due to the biased distribution (sampling bias and dataset shift) formed by non-uniform sampling used in AL. Considering AL as an iterative sequential optimization process, we first provide a perspective on AL in terms of statistical properties, i.e., asymptotic unbiasedness, consistency and asymptotic efficiency, with respect to basic estimators when the sample size (size of labeled set) becomes large, and in the limit as sample size tends to infinity. We then discuss how biases affect AL. Finally, we proposed a flexible AL framework that aims to mitigate the impact of bias in AL by minimizing generalization error and importance-weighted training loss simultaneously. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/zhan22a.html https://proceedings.mlr.press/v180/zhan22a.html Principle of relevant information for graph sparsification Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector w. We provide both theoretical and empirical justifications on the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code of Graph-PRI is available at https://github.com/SJYuCNEL/PRI-Graphs. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yu22c.html https://proceedings.mlr.press/v180/yu22c.html Predictive Whittle networks for time series Recent developments have shown that modeling in the spectral domain improves the accuracy in time series forecasting. However, state-of-the-art neural spectral forecasters do not generally yield trustworthy predictions. In particular, they lack the means to gauge predictive likelihoods and provide uncertainty estimates. We propose predictive Whittle networks to bridge this gap, which exploit both the advances of neural forecasting in the spectral domain and leverage tractable likelihoods of probabilistic circuits. For this purpose, we propose a novel Whittle forecasting loss that makes use of these predictive likelihoods to guide the training of the neural forecasting component. We demonstrate how predictive Whittle networks improve real-world forecasting accuracy, while also allowing a transformation back into the time domain, in order to provide the necessary feedback of when the model’s prediction may become erratic. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yu22b.html https://proceedings.mlr.press/v180/yu22b.html Learning binary multi-scale games on networks Network games are a natural modeling framework for strategic interactions of agents whose actions have local impact on others. Recently, a multi-scale network game model has been proposed to capture local effects at multiple network scales, such as among both individuals and groups. We propose a framework to learn the utility functions of binary multi-scale games from agents’ behavioral data. Departing from much prior work in this area, we model agent behavior as following logit-response dynamics, rather than acting according to a Nash equilibrium. This defines a generative time-series model of joint behavior of both agents and groups, which enables us to naturally cast the learning problem as maximum likelihood estimation (MLE). We show that in the important special case of multi-scale linear-quadratic games, this MLE problem is convex. Extensive experiments using both synthetic and real data demonstrate that our proposed modeling and learning approach is effective in both game parameter estimation as well as prediction of future behavior, even when we learn the game from only a single behavior time series. Furthermore, we show how to use our framework to develop a statistical test for the existence of multi-scale structure in the game, and use it to demonstrate that real time-series data indeed exhibits such structure. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yu22a.html https://proceedings.mlr.press/v180/yu22a.html Cross-domain adaptive transfer reinforcement learning based on state-action correspondence Despite the impressive success achieved in various domains, deep reinforcement learning (DRL) is still faced with the sample inefficiency problem. Transfer learning (TL), which leverages prior knowledge from different but related tasks to accelerate the target task learning, has emerged as a promising direction to improve RL efficiency. The majority of prior work considers TL across tasks with the same state-action spaces, while transferring across domains with different state-action spaces is relatively unexplored. Furthermore, such existing cross-domain transfer approaches only enable transfer from a single source policy, leaving open the important question of how to best transfer from multiple source policies. This paper proposes a novel framework called Cross-domain Adaptive Transfer (CAT) to accelerate DRL. CAT learns the state-action correspondence from each source task to the target task and adaptively transfers knowledge from multiple source task policies to the target policy. CAT can be easily combined with existing DRL algorithms and experimental results show that CAT significantly accelerates learning and outperforms other cross-domain transfer methods on multiple continuous action control tasks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/you22a.html https://proceedings.mlr.press/v180/you22a.html Active learning with label comparisons Supervised learning typically relies on manual annotation of the true labels. When there are many potential classes, searching for the best one can be prohibitive for a human annotator. On the other hand, comparing two candidate labels is often much easier. We focus on this type of pairwise supervision and ask how it can be used effectively in learning, and in particular in active learning. We obtain several insightful results in this context. In principle, finding the best of k labels can be done with k-1 active queries. We show that there is a natural class where this approach is sub-optimal, and that there is a more comparison-efficient active learning scheme. A key element in our analysis is the “label neighborhood graph” of the true distribution, which has an edge between two classes if they share a decision boundary. We also show that in the PAC setting, pairwise comparisons cannot provide improved sample complexity in the worst case. We complement our theoretical results with experiments, clearly demonstrating the effect of the neighborhood graph on sample complexity. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yona22a.html https://proceedings.mlr.press/v180/yona22a.html Offline stochastic shortest path: Learning, evaluation and towards optimality Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such an online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design the simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study could help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yin22b.html https://proceedings.mlr.press/v180/yin22b.html Superposing many tickets into one: A performance booster for sparse neural network training Recent works on sparse neural network training have shown that a compelling trade-off between performance and efficiency can be achieved. Existing sparse training methods usually strive to find the best sparse subnetwork possible in one single run, without involving any expensive dense or pre-training steps. For instance, dynamic sparse training (DST), as one of the most prominent directions, is capable of reaching a competitive performance of dense training by iteratively evolving the sparse topology during the course of training. In this paper, we argue that it is better to allocate the limited resources to create multiple low-loss sparse subnetworks and superpose them into a stronger one, instead of allocating all resources entirely to find an individual subnetwork. To achieve this, two desiderata are required: (1) efficiently producing many low-loss subnetworks, the so-called cheap tickets, within one training process limited to the standard training time used in dense training; (2) effectively superposing these cheap tickets into one stronger subnetwork without going over the constrained parameter budget. To corroborate our conjecture, we present a novel sparse training approach, termed \textbf{Sup-tickets}, which can satisfy the above two desiderata concurrently in a single sparse-to-sparse training process. Across various models on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with the existing sparse training methods and demonstrates consistent performance improvement. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yin22a.html https://proceedings.mlr.press/v180/yin22a.html Accelerating training of batch normalization: A manifold perspective Batch normalization (BN) has become a critical component across diverse deep neural networks. The network with BN is invariant to positively linear re-scale transformation, which makes there exist infinite functionally equivalent networks with different scales of weights. However, optimizing these equivalent networks with the first-order method such as stochastic gradient descent will obtain a series of iterates converging to different local optima owing to their different gradients across training. To obviate this, we propose a quotient manifold PSI manifold, in which all the equivalent weights of the network with BN are regarded as the same element. Next, we construct gradient descent and stochastic gradient descent on the proposed PSI manifold to train the network with BN. The two algorithms guarantee that every group of equivalent weights (caused by positively re-scaling) converge to the equivalent optima. Besides that, we give convergence rates of the proposed algorithms on the PSI manifold. The results show that our methods accelerate training compared with the algorithms on the Euclidean weight space. Finally, empirical results verify that our algorithms consistently improve the existing methods in both convergence rate and generalization ability under various experimental settings. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yi22a.html https://proceedings.mlr.press/v180/yi22a.html Future gradient descent for adapting the temporal shifting data distribution in online recommendation systems One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes the mismatch between the training and testing data distribution and hence domain generalization error. To overcome, we propose to learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training so that the recommendation model can be trained as if we were able to look ahead at the future of its deployment. Compared with Batch Update, a widely used paradigm, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ye22b.html https://proceedings.mlr.press/v180/ye22b.html Pareto navigation gradient descent: a first-order algorithm for optimization in pareto set Many modern machine learning applications, such as multi-task learning, require finding optimal model parameters to trade-off multiple objective functions that may conflict with each other. The notion of the Pareto set allows us to focus on the set of (often infinite number of) models that cannot be strictly improved. But it does not provide an actionable procedure for picking one or a few special models to return to practical users. In this paper, we consider optimization in Pareto set (OPT-in-Pareto), the problem of finding Pareto models that optimize an extra reference criterion function within the Pareto set. This function can either encode a specific preference from the users, or represent a generic diversity measure for obtaining a set of diversified Pareto models that are representative of the whole Pareto set. Unfortunately, despite being a highly useful framework, efficient algorithms for OPT-in-Pareto have been largely missing, especially for large-scale, non-convex, and non-linear objectives in deep learning. A naive approach is to apply Riemannian manifold gradient descent on the Pareto set, which yields a high computational cost due to the need for eigen-calculation of Hessian matrices. We propose a first-order algorithm that approximately solves OPT-in-Pareto using only gradient information, with both high practical efficiency and theoretically guaranteed convergence property. Empirically, we demonstrate that our method works efficiently for a variety of challenging multi-task-related problems. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ye22a.html https://proceedings.mlr.press/v180/ye22a.html Noisy L0-sparse subspace clustering on dimensionality reduced data Sparse subspace clustering methods with sparsity induced by L0-norm, such as L0-Sparse Subspace Clustering (L0-SSC), are demonstrated to be more effective than its L1 counterpart such as Sparse Subspace Clustering (SSC). However, the theoretical analysis of L0-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and they may lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy L0-SSC achieves subspace detection property (SDP), a key element with which data from different subspaces are separated, under deterministic and semi-random model. Our results provide theoretical guarantee on the correctness of noisy L0-SSC in terms of SDP on noisy data for the first time, which reveals the advantage of noisy L0-SSC in terms of much less restrictive condition on subspace affinity. In order to improve the efficiency of noisy L0-SSC, we propose Noisy-DR-L0-SSC which provably recovers the subspaces on dimensionality reduced data. Noisy-DR-L0-SSC first projects the data onto a lower dimensional space by random projection, then performs noisy L0-SSC on the dimensionality reduced data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-L0-SSC. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yang22e.html https://proceedings.mlr.press/v180/yang22e.html CoSPA: An improved masked language model with copy mechanism for Chinese spelling correction Existing BERT-based models for Chinese spelling correction (CSC) have three issues. 1) Bert tends to rectify a correct low-frequency collocation into a high-frequency one and leads to over-correcting. 2) It fails to completely detect phonic or morphological errors by the current learned similarity knowledge between Chinese characters, and the recall rate still has room to improve. 3) Two-dimensional glyph information of Chinese characters is overlooked and some morphological misused characters may be difficult to detect. This paper proposes a hybrid approach, CoSPA, to address these issues. 1) This paper proposes an alterable copy mechanism to alleviate over-correcting by jointly learning to copy a correct character from input sentence, or generate a character from BERT. No method has used copy mechanism in BERT for CSC. 2) The attention mechanism is further applied on the phonic and shape representation of each character at the output layer. 3) Shape representation is enhanced by mining character glyph with ResNet, and fused with stroke representation via an adaptive gating unit. The experimental results show that CoSPA outperforms the previous state-of-the-art methods on SIGHAN2015 datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yang22d.html https://proceedings.mlr.press/v180/yang22d.html Robust textual embedding against word-level adversarial attacks We attribute the vulnerability of natural language processing models to the fact that similar inputs are converted to dissimilar representations in the embedding space, leading to inconsistent outputs, and we propose a novel robust training method, termed \textit{Fast Triplet Metric Learning (FTML)}. Specifically, we argue that the original sample should have similar representation with its adversarial counterparts and distinguish its representation from other samples for better robustness. To this end, we adopt the triplet metric learning into the standard training to pull words closer to their positive samples (\textit{i.e.}, synonyms) and push away their negative samples (\textit{i.e.}, non-synonyms) in the embedding space. Extensive experiments demonstrate that FTML can significantly promote the model robustness against various advanced adversarial attacks while keeping competitive classification accuracy on original samples. Besides, our method is efficient as it only needs to adjust the embedding and introduces very little overhead on the standard training. Our work shows great potential of improving the textual robustness through robust word embedding. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yang22c.html https://proceedings.mlr.press/v180/yang22c.html Self-supervised representations for multi-view reinforcement learning Learning policies from raw, pixel images are quite important for the real-world application of deep reinforcement learning (RL). Standard model-free RL algorithms focus on single-view settings and unify the representation learning and policy learning into an end-to-end training process. However, such a learning paradigm is sample-inefficiency and sensitive to hyper-parameters when supervised merely by the reward signals. Based on this, we present Self-Supervised Representations (S2R) for multi-view reinforcement learning, a sample-efficient representation learning method for learning features from high-dimensional images. In S2R, we introduce a representation learning framework and define a novel multi-view auxiliary objective based on the multi-view image states and Conditional Entropy Bottleneck (CEB) principle. We integrate S2R with the deep RL agent to learn robust representations that preserve task-relevant information while discarding task-irrelevant information and find optimal policies that maximize the expected return. Empirically, we demonstrate the effectiveness of S2R in the visual DeepMind Control (DMControl) suite and show its better performance on the default DMControl tasks and their variants by replacing the tasks’ default background with a random image or natural video. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yang22b.html https://proceedings.mlr.press/v180/yang22b.html Differentially private SGDA for minimax problems Stochastic gradient descent ascent (SGDA) and its variants have been the workhorse for solving minimax problems. However, in contrast to the well-studied stochastic gradient descent (SGD) with differential privacy (DP) constraints, there is little work on understanding the generalization (utility) of SGDA with DP constraints. In this paper, we use the algorithmic stability approach to establish the generalization (utility) of DP-SGDA in different settings. In particular, for the convex-concave setting, we prove that the DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual population risk in both smooth and non-smooth cases. To our best knowledge, this is the first-ever-known result for DP-SGDA in the non-smooth case. We further provide its utility analysis in the nonconvex-strongly-concave setting which is the first-ever-known result in terms of the primal population risk. The convergence and generalization results for this nonconvex setting are new even in the non-private setting. Finally, numerical experiments are conducted to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex cases. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yang22a.html https://proceedings.mlr.press/v180/yang22a.html Addressing token uniformity in transformers via singular value transformation Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after going through stacked multiple self-attention layers in a transformer. In this paper, we propose to use the distribution of singular values of outputs of each transformer layer to characterise the phenomenon of token uniformity and empirically illustrate that a less skewed singular value distribution can alleviate the token uniformity problem. Base on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yan22b.html https://proceedings.mlr.press/v180/yan22b.html Finite-horizon equilibria for neuro-symbolic concurrent stochastic games We present novel techniques for neuro-symbolic concurrent stochastic games, a recently proposed modelling formalism to represent a set of probabilistic agents operating in a continuous-space environment using a combination of neural network based perception mechanisms and traditional symbolic methods. To date, only zero-sum variants of the model were studied, which is too restrictive when agents have distinct objectives. We formalise notions of equilibria for these models and present algorithms to synthesise them. Focusing on the finite-horizon setting, and (global) social welfare subgame-perfect optimality, we consider two distinct types: Nash equilibria and correlated equilibria. We first show that an exact solution based on backward induction may yield arbitrarily bad equilibria. We then propose an approximation algorithm called frozen subgame improvement, which proceeds through iterative solution of nonlinear programs. We develop a prototype implementation and demonstrate the benefits of our approach on two case studies: an automated car-parking system and an aircraft collision avoidance system. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/yan22a.html https://proceedings.mlr.press/v180/yan22a.html Deterministic policy gradient: Convergence analysis The deterministic policy gradient (DPG) method proposed in Silver et al. [2014] has been demonstrated to exhibit superior performance particularly for applications with multi-dimensional and continuous action spaces. However, it remains unclear whether DPG converges, and if so, how fast it converges and whether it converges as efficiently as other PG methods. In this paper, we provide a theoretical analysis of DPG to answer those questions. We study the single timescale DPG (often the case in practice) in both on-policy and off-policy settings, and show that both algorithms attain an $\epsilon$-accurate stationary policy with a sample complexity of $\mathcal{O}(\epsilon^{-2})$. Moreover, we establish the convergence rate for DPG under Gaussian noise exploration, which is widely adopted in practice to improve the performance of DPG. To our best knowledge, this is the first non-asymptotic convergence characterization for DPG methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/xiong22a.html https://proceedings.mlr.press/v180/xiong22a.html Fine-Grained matching with multi-perspective similarity modeling for cross-modal retrieval Cross-modal retrieval relies on learning inter-modal correspondences. Most existing approaches focus on learning global or local correspondence and fail to explore fine-grained multi-level alignments. Moreover, it remains to be investigated how to infer more accurate similarity scores. In this paper, we propose a novel fine-grained matching with Multi-Perspective Similarity Modeling (MPSM) Network for cross-modal retrieval. Specifically, the Knowledge Graph Iterative Dissemination (KGID) module is designed to iteratively broadcast global semantic knowledge, enabling domain information to be integrated and relevant nodes to be associated, resulting in fine-grained modality representations. Subsequently, vector-based similarity representations are learned from multiple perspectives to model multi-level alignments comprehensively. The Relation Graph Reconstruction (SRGR) module is further developed to enhance cross-modal correspondence by constructing similarity relation graphs and adaptively reconstructing them. Extensive experiments on the Flickr30K and MSCOCO datasets validate that our model significantly outperforms several state-of-the-art baselines. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/xie22a.html https://proceedings.mlr.press/v180/xie22a.html Partial likelihood Thompson sampling We consider the problem of deciding how best to target and prioritize existing vaccines that may offer protection against new variants of an infectious disease. Sequential experiments are a promising approach; however, challenges due to delayed feedback and the overall ebb and flow of disease prevalence make available methods inapplicable for this task. We present a method, partial likelihood Thompson sampling, that can handle these challenges. Our method involves running Thompson sampling with belief updates determined by partial likelihood each time we observe an event. To test our approach, we ran a semi-synthetic experiment based on 200 days of COVID-19 infection data in the US. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wu22c.html https://proceedings.mlr.press/v180/wu22c.html Differentially private multi-party data release for linear regression Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects. However the majority of prior work has focused on scenarios where a single party owns all the data. In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects. Within the context of linear regression that allow all parties to train models on the complete data without the ability to infer private attributes or identities of individuals, we start with directly applying Gaussian mechanism and show it has the small eigenvalue problem. We further propose our novel method and prove it asymptotically converges to the optimal (non-private) solutions with increasing dataset size. We substantiate the theoretical results through experiments on both artificial and real-world datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wu22b.html https://proceedings.mlr.press/v180/wu22b.html Residual bootstrap exploration for stochastic linear bandit We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wu22a.html https://proceedings.mlr.press/v180/wu22a.html A new constructive criterion for Markov equivalence of MAGs Ancestral graphs are an important tool for encoding causal knowledge as they represent uncertainty about the presence of latent confounding and selection bias, and they can be inferred from data. As for other graphical models, several maximal ancestral graphs (MAGs) may encode the same statistical information in the form of conditional independencies. Such MAGs are said to be Markov equivalent. This work concerns graphical characterizations and computational aspects of Markov equivalence between MAGs. These issues have been studied in past years leading to several criteria and methods to test Markov equivalence. The state-of-the-art algorithm, provided by Hu and Evans [UAI 2020], runs in time $O(n^5)$ for instances with $n$ vertices. We propose a new constructive graphical criterion for the Markov equivalence of MAGs, which allows us to develop a practically effective equivalence test with worst-case runtime $O(n^3)$. Additionally, our criterion is expressed in terms of natural graphical concepts, which is of independent value. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wienobst22a.html https://proceedings.mlr.press/v180/wienobst22a.html Causal discovery under a confounder blanket Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering the causal order among a subgraph of variables known to descend from some (possibly large) set of confounding covariates, i.e. a $\textit{confounder blanket}$. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing background information. Under a structural assumption called the $\textit{confounder blanket principle}$, which we argue is essential for tractable causal discovery in high dimensions, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We present a structure learning algorithm that is provably sound and complete with respect to a so-called $\textit{lazy oracle}$. We design inference procedures with finite sample error control for linear and nonlinear systems, and demonstrate our approach on a range of simulated and real-world datasets. An accompanying $\texttt{R}$ package, $\texttt{cbl}$, is available from $\texttt{CRAN}$. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/watson22a.html https://proceedings.mlr.press/v180/watson22a.html Generalized Bayesian quadrature with spectral kernels Bayesian probabilistic integration, or Bayesian quadrature (BQ), has arisen as a popular means of numerical integral estimation with quantified uncertainty for problems where computational cost limits data availability. BQ leverages flexible Gaussian processes (GPs) to model an integrand which can be subsequently analytically integrated through properties of Gaussian distributions. However, BQ is inherently limited by the fact that the method relies on the use of a strict set of kernels for use in the GP model of the integrand, reducing the flexibility of the method in modeling varied integrand types. In this paper, we present spectral Bayesian quadrature, a form of Bayesian quadrature that allows for the use of any shift-invariant kernel in the integrand GP model while still maintaining the analytical tractability of the integral posterior, increasing the flexibility of BQ methods to address varied problem settings. Additionally our method enables integration with respect to a uniform expectation, effectively computing definite integrals of challenging integrands. We derive the theory and error bounds for this model, as well as demonstrate GBQ’s improved accuracy, flexibility, and data efficiency, compared to traditional BQ and other numerical integration methods, on a variety of quadrature problems. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/warren22a.html https://proceedings.mlr.press/v180/warren22a.html Toward learning human-aligned cross-domain robust models by countering misaligned features Machine learning has demonstrated remarkable prediction accuracy over i.i.d data, but the accuracy often drops when tested with data from another distribution. In this paper, we aim to offer another view of this problem in a perspective assuming the reason behind this accuracy drop is the reliance of models on the features that are not aligned well with how a data annotator considers similar across these two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup with the knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in robust machine learning literature. We also compared the empirical strength of these methods demonstrated the performance when these previous techniques are combined, with implementation available. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wang22d.html https://proceedings.mlr.press/v180/wang22d.html ST-MAML : A stochastic-task based method for task-heterogeneous meta-learning Optimization-based meta-learning typically assumes tasks are sampled from a single distribution - an assumption that oversimplifies and limits the diversity of tasks that meta-learning can model. Handling tasks from multiple distributions is challenging for meta-learning because it adds ambiguity to task identities. This paper proposes a novel method, ST-MAML, that empowers model-agnostic meta-learning (MAML) to learn from multiple task distributions. ST-MAML encodes tasks using a stochastic neural network module, that summarizes every task with a stochastic representation. The proposed Stochastic Task (ST) strategy learns a distribution of solutions for an ambiguous task and allows a meta-model to self-adapt to the current task. ST-MAML also propagates the task representation to enhance input variable encodings. Empirically, we demonstrate that ST-MAML outperforms the state-of-the-art on two few-shot image classification tasks, one curve regression benchmark, one image completion problem, and a real-world temperature prediction application. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wang22c.html https://proceedings.mlr.press/v180/wang22c.html Detecting textual adversarial examples through randomized substitution and vote A line of work has shown that natural text processing models are vulnerable to adversarial examples. Correspondingly, various defense methods are proposed to mitigate the threat of textual adversarial examples, \textit{e.g.} adversarial training, input transformations, detection, \textit{etc}. In this work, we treat the optimization process for synonym substitution based textual adversarial attacks as a specific sequence of word replacement, in which each word mutually influences other words. We identify that we could destroy such mutual interaction and eliminate the adversarial perturbation by randomly substituting a word with its synonyms. Based on this observation, we propose a novel textual adversarial example detection method, termed \textit{Randomized Substitution and Vote} (RS&V), which votes the prediction label by accumulating the logits of $k$ samples generated by randomly substituting the words in the input text with synonyms. The proposed RS&V is generally applicable to any existing neural networks without modification on the architecture or extra training, and it is orthogonal to prior work on making the classification network itself more robust. Empirical evaluations on three benchmark datasets demonstrate that our RS&V could detect the textual adversarial examples more successfully than the existing detection methods while maintaining the high classification accuracy on benign samples. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wang22b.html https://proceedings.mlr.press/v180/wang22b.html Meta-learning without data via Wasserstein distributionally-robust model fusion Existing meta-learning works assume that each task has available training and testing data. However, there are many available pre-trained models without accessing their training data in practice. We often need a single model to solve different tasks simultaneously as this is much more convenient to deploy the models. Our work aims to meta-learn a model initialization from these pre-trained models without using corresponding training data. We name this challenging problem setting as Data-Free Learning To Learn (DFL2L). We propose a distributionally robust optimization (DRO) framework to learn a black-box model to fuse and compress all the pre-trained models into a single network to address this problem. To encourage good generalization to the unseen new tasks, the proposed DRO framework diversifies the learned task embedding associated with each pre-trained model to cover the diversity in the underlying training task distributions. A model initialization is sampled from the black-box network during meta-testing as the meta learned initialization. Extensive experiments on offline and online DFL2L settings and several real image datasets demonstrate the effectiveness of the proposed methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wang22a.html https://proceedings.mlr.press/v180/wang22a.html Bias aware probabilistic Boolean matrix factorization Boolean matrix factorization (BMF) is a combinatorial problem arising from a wide range of applications including recommendation system, collaborative filtering, and dimensionality reduction. Currently, the noise model of existing BMF methods is often assumed to be homoscedastic; however, in real world data scenarios, the deviations of observed data from their true values are almost surely diverse due to stochastic noises, making each data point not equally suitable for fitting a model. In this case, it is not ideal to treat all data points as equally distributed. Motivated by such observations, we introduce a probabilistic BMF model that recognizes the object- and feature-wise bias distribution respectively, called bias aware BMF (BABF). To the best of our knowledge, BABF is the first approach for Boolean decomposition with consideration of the feature-wise and object-wise bias in binary data. We conducted experiments on datasets with different levels of background noise, bias level, and sizes of the signal patterns, to test the effectiveness of our method in various scenarios. We demonstrated that our model outperforms the state-of-the-art factorization methods in both accuracy and efficiency in recovering the original datasets, and the inferred bias level is highly significantly correlated with true existing bias in both simulated and real world datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/wan22a.html https://proceedings.mlr.press/v180/wan22a.html Bayesian federated estimation of causal effects from observational data We propose a Bayesian framework for estimating causal effects from federated observational data sources. Bayesian causal inference is an important approach to learning the distribution of the causal estimands and understanding the uncertainty of causal effects. Our framework estimates the posterior distributions of the causal effects to compute the higher-order statistics that capture the uncertainty. We integrate local causal effects from different data sources without centralizing them. We then estimate the treatment effects from observational data using a non-parametric reformulation of the classical potential outcomes framework. We model the potential outcomes as a random function distributed by Gaussian processes, with defining parameters that can be efficiently learned from multiple data sources. Our method avoids exchanging raw data among the sources, thus contributing towards privacy-preserving causal learning. The promise of our approach is demonstrated through a set of simulated and real-world examples. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/vo22a.html https://proceedings.mlr.press/v180/vo22a.html Intervention target estimation in the presence of latent variables This paper considers the problem of estimating unknown intervention targets in causal directed acyclic graphs from observational and interventional data in the presence of latent variables. The focus is on linear structural equation models with soft interventions. The existing approaches to this problem involve performing extensive conditional independence tests, and they estimate the unknown intervention targets alongside learning the structure of the causal model in its entirety. This joint learning approach results in algorithms that are not scalable as graph sizes grow. This paper proposes an approach that does not necessitate learning the entire causal model and focuses on learning only the intervention targets. The key idea of this approach is leveraging the property that interventions impose sparse changes in the precision matrix of a linear model. The proposed framework consists of a sequence of precision difference estimation steps. Furthermore, the necessary knowledge to refine an observational Markov equivalence class (MEC) to an interventional MEC is inferred. Simulation results are provided to illustrate the scalability of the proposed algorithm and compare it with those of the existing approaches. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/varici22a.html https://proceedings.mlr.press/v180/varici22a.html Causal forecasting: generalization bounds for autoregressive models Despite the increasing relevance of forecasting methods, causal implications of these algorithms remain largely unexplored. This is concerning considering that, even under simplifying assumptions such as causal sufficiency, the statistical risk of a model can differ significantly from its causal risk. Here, we study the problem of causal generalization—generalizing from the observational to interventional distributions—in forecasting. Our goal is to find answers to the question: How does the efficacy of an autoregressive (VAR) model in predicting statistical associations compare with its ability to predict under interventions? To this end, we introduce the framework of causal learning theory for forecasting. Using this framework, we obtain a characterization of the difference between statistical and causal risks, which helps identify sources of divergence between them. Under causal sufficiency, the problem of causal generalization amounts to learning under covariate shifts albeit with additional structure (restriction to interventional distributions under the VAR model). This structure allows us to obtain uniform convergence bounds on causal generalizability for the class of VAR models. To the best of our knowledge, this is the first work that provides theoretical guarantees for causal generalization in the time-series setting. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/vankadara22a.html https://proceedings.mlr.press/v180/vankadara22a.html Towards unsupervised open world semantic segmentation For the semantic segmentation of images, state-of-the-art deep neural networks (DNNs) achieve high segmentation accuracy if that task is restricted to a closed set of classes. However, as of now DNNs have limited ability to operate in an open world, where they are tasked to identify pixels belonging to unknown objects and eventually to learn novel classes, incrementally. Humans have the capability to say: “I don’t know what that is, but I’ve already seen something like that”. Therefore, it is desirable to perform such an incremental learning task in an unsupervised fashion. We introduce a method where unknown objects are clustered based on visual similarity. Those clusters are utilized to define new classes and serve as training data for unsupervised incremental learning. More precisely, the connected components of a predicted semantic segmentation are assessed by a segmentation quality estimate. Connected components with a low estimated prediction quality are candidates for a subsequent clustering. Additionally, the component-wise quality assessment allows for obtaining predicted segmentation masks for the image regions potentially containing unknown objects. The respective pixels of such masks are pseudo-labeled and afterwards used for re-training the DNN, i.e., without the use of ground truth generated by humans. In our experiments we demonstrate that, without access to ground truth and even with few data, a DNN’s class space can be extended by a novel class, achieving considerable segmentation accuracy. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/uhlemeyer22a.html https://proceedings.mlr.press/v180/uhlemeyer22a.html Multi-source domain adaptation via weighted joint distributions optimal transport This work addresses the problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets. Most current approaches tackle this problem by searching for an embedding that is invariant across source and target domains, which corresponds to searching for a universal classifier that works well on all domains. In this paper, we address this problem from a new perspective: instead of crushing diversity of the source distributions, we exploit it to adapt better to the target distribution. Our method, named Multi-Source Domain Adaptation via Weighted Joint Distribution Optimal Transport (MSDA-WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoret- ical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of- the-art performance on simulated and real datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/turrisi22a.html https://proceedings.mlr.press/v180/turrisi22a.html Learning linear non-Gaussian polytree models In the context of graphical causal discovery, we adapt the versatile framework of linear non-Gaussian acyclic models (LiNGAMs) to propose new algorithms to efficiently learn graphs that are polytrees. Our approach combines the Chow–Liu algorithm, which first learns the undirected tree structure, with novel schemes to orient the edges. The orientation schemes assess algebraic relations among moments of the data-generating distribution and are computationally inexpensive. We establish high-dimensional consistency results for our approach and compare different algorithmic versions in numerical experiments. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/tramontano22a.html https://proceedings.mlr.press/v180/tramontano22a.html Efficient inference for dynamic topic modeling with large vocabularies Dynamic topic modeling is a well established tool for capturing the temporal dynamics of the topics of a corpus. In this work, we develop a scalable dynamic topic model by utilizing the correlation among the words in the vocabulary. By correlating previously independent temporal processes for words, our new model allows us to reliably estimate the topic representations containing less frequent words. We develop an amortised variational inference method with self-normalised importance sampling approximation to the word distribution that dramatically reduces the computational complexity and the number of variational parameters in order to handle large vocabularies. With extensive experiments on text datasets, we show that our method significantly outperforms the previous works by modeling word correlations, and it is able to handle real world data with a large vocabulary which could not be processed by previous continuous dynamic topic models. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/tomasi22a.html https://proceedings.mlr.press/v180/tomasi22a.html Semi-supervised novelty detection using ensembles with regularized disagreement Deep neural networks often predict samples with high confidence even when they come from unseen classes and should instead be flagged for expert evaluation. Current novelty detection algorithms cannot reliably identify such near OOD points unless they have access to labeled data that is similar to these novel samples. In this paper, we develop a new ensemble-based procedure for semi-supervised novelty detection (SSND) that successfully leverages a mixture of unlabeled ID and novel-class samples to achieve good detection performance. In particular, we show how to achieve disagreement only on OOD data using early stopping regularization. While we prove this fact for a simple data distribution, our extensive experiments suggest that it holds true for more complex scenarios: our approach significantly outperforms state-of-the-art SSND methods on standard image data sets (SVHN/CIFAR-10/CIFAR-100) and medical image data sets with only a negligible increase in computation cost. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/tifrea22a.html https://proceedings.mlr.press/v180/tifrea22a.html Contrastive latent variable models for neural text generation Deep latent variable models such as variational autoencoders and energy-based models are widely used for neural text generation. Most of them focus on matching the prior distribution with the posterior distribution of the latent variable for text reconstruction. In addition to instance-level reconstruction, this paper aims to integrate contrastive learning in the latent space, forcing the latent variables to learn high-level semantics by exploring inter-instance relationships. Experiments on various text generation benchmarks show the effectiveness of our proposed method. We also empirically show that our method can mitigate the posterior collapse issue for latent variable based text generation models. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/teng22a.html https://proceedings.mlr.press/v180/teng22a.html High-probability bounds for robust stochastic Frank-Wolfe algorithm We develop and analyze robust Stochastic Frank-Wolfe type algorithms for projection-free stochastic convex optimization problems with heavy-tailed stochastic gradients. Existing works on the oracle complexity of such algorithms require a uniformly bounded variance assumption, and hold only in expectation. We develop tight high-probability bounds for robust versions of Stochastic Frank-Wolfe type algorithm under heavy-tailed assumptions, including infinite variance, on the stochastic gradient. Our methodological construction of the robust Stochastic Frank-Wolfe type algorithms leverage techniques from the robust statistic literature. Our theoretical analysis highlights the need to utilize robust versions of Stochastic Frank-Wolfe type algorithm for dealing with heavy-tailed data arising in practice. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/tang22a.html https://proceedings.mlr.press/v180/tang22a.html Marginal MAP estimation for inverse RL under occlusion with observer noise We consider the problem of learning the behavioral preferences of an expert engaged in a task from noisy and partially-observable demonstrations. This is motivated by real-world applications such as a line robot learning from observing a human worker, where some observations are occluded by environmental elements. Furthermore, robotic perception tends to be imperfect and noisy. Previous techniques for inverse reinforcement learning (IRL) take the approach of either omitting the missing portions or inferring it as part of expectation-maximization, which tends to be slow and prone to local optima. We present a new method that generalizes the well-known Bayesian maximum-a-posteriori (MAP) IRL method by marginalizing the occluded portions of the trajectory. This is then extended with an observation model to account for perception noise. This novel application of marginal MAP (MMAP) to IRL significantly improves on the previous IRL technique under occlusion in both formative evaluations on a toy problem and in a summative evaluation on a produce sorting line task by a physical robot. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/suresh22a.html https://proceedings.mlr.press/v180/suresh22a.html Simplified and unified analysis of various learning problems by reduction to Multiple-Instance Learning In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationship among them has not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems including all the problems mentioned above with some of new problems can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/suehiro22a.html https://proceedings.mlr.press/v180/suehiro22a.html A robustness test for estimating total effects with covariate adjustment Suppose we want to estimate a total effect with covariate adjustment in a linear structural equation model. We have a causal graph to decide what covariates to adjust for, but are uncertain about the graph. Here, we propose a testing procedure, that exploits the fact that there are multiple valid adjustment sets for the target total effect in the causal graph, to perform a robustness check on the graph. If the test rejects, it is a strong indication that we should not rely on the graph. We discuss what mistakes in the graph our testing procedure can detect and which ones it cannot and develop two strategies on how to select a list of valid adjustment sets for the procedure. We also connect our result to the related econometrics literature on coefficient stability tests. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/su22a.html https://proceedings.mlr.press/v180/su22a.html SMT-based weighted model integration with structure awareness Weighted Model Integration (WMI) is a popular formalism aimed at unifying approaches for probabilistic inference in hybrid domains, involving logical and algebraic constraints. Despite a considerable amount of recent work, allowing WMI algorithms to scale with the complexity of the hybrid problem is still a challenge. In this paper we highlight some substantial limitations of existing state-of-the-art solutions, and develop an algorithm that combines SMT-based enumeration, an efficient technique in formal verification, with an effective encoding of the problem structure. This allows our algorithm to avoid generating redundant models, resulting in substantial computational savings. An extensive experimental evaluation on both synthetic and real-world datasets confirms the advantage of the proposed solution over existing alternatives. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/spallitta22a.html https://proceedings.mlr.press/v180/spallitta22a.html Mutual information based Bayesian graph neural network for few-shot learning In the deep neural network based few-shot learning, the limited training data may make the neural network extract ineffective features, which leads to inaccurate results. By Bayesian graph neural network (BGNN), the probability distributions on hidden layers imply useful features, and the few-shot learning could improved by establishing the correlation among features. Thus, in this paper, we incorporate mutual information (MI) into BGNN to describe the correlation, and propose an innovative framework by adopting the Bayesian network with continuous variables (BNCV) for effective calculation of MI. First, we build the BNCV simultaneously when calculating the probability distributions of features from the Dropout in hidden layers of BGNN. Then, we approximate the MI values efficiently by probabilistic inferences over BNCV. Finally, we give the correlation based loss function and training algorithm of our BGNN model. Experimental results show that our MI based BGNN framework is effective for few-shot learning and outperforms some state-of-the-art competitors by large margins on accuracy. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/song22a.html https://proceedings.mlr.press/v180/song22a.html Fast inference and transfer of compositional task structures for few-shot task generalization We tackle real-world problems with complex structures beyond the pixel-based game or simulator. We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph that defines a set of subtasks and their dependencies that are unknown to the agent. Different from the previous meta-RL methods trying to directly infer the unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks, and use it as a prior to improve the task inference in testing. Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/sohn22a.html https://proceedings.mlr.press/v180/sohn22a.html Enhanced adaptive optics control with image to image translation We aim to significantly enhance the science return of astronomical observatories, and in particular giant terrestrial optical telescopes. Observatories employ Adaptive Optics (AO) systems in order to acquire high sensitivity diffraction limited images of the sky. The incumbent “workhorse” for control of AO systems employs a linear real-time controller in a closed loop, with sensing of state performed via a (Shack-Hartmann) wavefront sensor (WFS). The actuators of a deformable mirror (DM) are driven, with the action performed in each iteration having a continuous representation as an array of DC voltages. The typical control regime is practical and scalable, nonetheless, there remains a residual uncompensated turbulence that leads to optical aberrations limiting the class of scientific assets that can be acquired. We have developed and trained a translational GAN model that accurately estimates residual perturbations from WFS images. Model inference occurs in 0.34 milliseconds using off-the-shelf GPU hardware, and is applicable for use in AO control where the control loop might be running at 500Hz. We develop an AO control regime with a second controller stage actuating a second DM controlled in an open loop according to the estimated residual turbulence. Using the open-source COMPASS tool for simulation, we are able to significantly improve the performance using our new regime. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/smith22a.html https://proceedings.mlr.press/v180/smith22a.html VQ-Flows: Vector quantized local normalizing flows Normalizing flows provide an elegant approach to generative modeling that allows for efficient sampling and exact density evaluation of unknown data distributions. However, current techniques have significant limitations in their expressivity when the data distribution is supported on a low-dimensional manifold or has a non-trivial topology. We introduce a novel statistical framework for learning a mixture of local normalizing flows as “chart maps” over the data manifold. Our framework augments the expressivity of recent approaches while preserving the signature property of normalizing flows, that they admit exact density evaluation. We learn a suitable atlas of charts for the data manifold via a vector quantized auto-encoder (VQ-AE) and the distributions over them using a conditional flow. We validate experimentally that our probabilistic framework enables existing approaches to better model data distributions over complex manifolds. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/sidheekh22a.html https://proceedings.mlr.press/v180/sidheekh22a.html PAC-Bayesian domain adaptation bounds for multiclass learners Multiclass neural networks are a common tool in modern unsupervised domain adaptation, yet an appropriate theoretical description for their non-uniform sample complexity is lacking in the adaptation literature. To fill this gap, we propose the first PAC-Bayesian adaptation bounds for multiclass learners. We facilitate practical use of our bounds by also proposing the first approximation techniques for the multiclass distribution divergences we consider. For divergences dependent on a Gibbs predictor, we propose additional PAC-Bayesian adaptation bounds which remove the need for inefficient Monte-Carlo estimation. Empirically, we test the efficacy of our proposed approximation techniques as well as some novel design-concepts which we include in our bounds. Finally, we apply our bounds to analyze a common adaptation algorithm that uses neural networks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/sicilia22a.html https://proceedings.mlr.press/v180/sicilia22a.html Shifted compression framework: generalizations and improvements Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the square norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no a-priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed {\sf SGD}, which has adverse effects on the convergence speed. Due to this issue, several more elaborate and seemingly very different algorithms have been proposed recently, with the goal of circumventing this issue. These methods are based on the idea of compressing the {\em difference} between the vector we would normally wish to compress and some auxiliary vector which changes throughout the iterative process. In this work we take a step back, and develop a unified framework for studying such methods, conceptually, and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. Finally, we performed several numerical experiments which illustrate and support our theoretical findings. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/shulgin22a.html https://proceedings.mlr.press/v180/shulgin22a.html Neural ensemble search via Bayesian sampling Recently, neural architecture search (NAS) has been applied to automate the design of neural networks in real-world applications. A large number of algorithms have been developed to improve the search cost or the performance of the final selected architectures in NAS. Unfortunately, these NAS algorithms aim to select only one single well-performing architecture from their search spaces and thus have overlooked the capability of neural network ensemble (i.e., an ensemble of neural networks with diverse architectures) in achieving improved performance over a single final selected architecture. To this end, we introduce a novel neural ensemble search algorithm, called neural ensemble search via Bayesian sampling (NESBS), to effectively and efficiently select well-performing neural network ensembles from a NAS search space. In our extensive experiments, NESBS algorithm is shown to be able to achieve improved performance over state-of-the-art NAS algorithms while incurring a comparable search cost, thus indicating the superior performance of our NESBS algorithm over these NAS algorithms in practice. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/shu22a.html https://proceedings.mlr.press/v180/shu22a.html Conditional simulation using diffusion Schrödinger bridges Denoising diffusion models have recently emerged as a powerful class of generative models. They provide state-of-the-art results, not only for unconditional simulation, but also when used to solve conditional simulation problems arising in a wide range of inverse problems. A limitation of these models is that they are computationally intensive at generation time as they require simulating a diffusion process over a long time horizon. When performing unconditional simulation, a Schr{ö}dinger bridge formulation of generative modeling leads to a theoretically grounded algorithm shortening generation time which is complementary to other proposed acceleration techniques. We extend the Schrödinger bridge framework to conditional simulation. We demonstrate this novel methodology on various applications including image super-resolution, optimal filtering for state-space models and the refinement of pre-trained networks. Our code can be found at https://github.com/vdeborto/cdsb. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/shi22a.html https://proceedings.mlr.press/v180/shi22a.html Reframed GES with a neural conditional dependence measure In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/shen22a.html https://proceedings.mlr.press/v180/shen22a.html SymNet 2.0: Effectively handling Non-Fluents and Actions in Generalized Neural Policies for RDDL Relational MDPs Relational MDPs (RMDPs) compactly represent an infinite set of MDPs with an unbounded number of objects. Solving an RMDP requires a generalized policy that applies to all instances of a domain. Recently, Garg et al. proposed SymNet for this task– it constructs a graph neural network that shares parameters across all instances in a domain, thus making it applicable to any instance in a zero-shot manner. Our analysis of SymNet reveals that it performs no better than random on 1/4th of planning competition domains. The key reasons are its design choices: it misses important information during graph construction, leading to (1) poor generalizability, and (2) potential non-identifiability of different actions. In response, our solution, SymNet2.0, substantially augments SymNet’s graph construction approach by introducing additional nodes and edges which allow a better transfer of important information about a domain. It also improves SymNet’s action decoders with relevant information from objects to make different actions identifiable during scoring. Extensive experiments on twelve competition domains, where we use imitation learning over data generated from the PROST planner, demonstrate that SymNet2.0 performs vastly better than SymNet. Interestingly, even though SymNet2.0 is trained over data from PROST, it outperforms the planner on several test instances due to former’s ability to scale to large instances in a zero-shot manner. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/sharma22a.html https://proceedings.mlr.press/v180/sharma22a.html Learning functions on multiple sets using multi-set transformers We propose a general deep architecture for learning functions on multiple permutation-invariant sets. We also show how to generalize this architecture to sets of elements of any dimension by dimension equivariance. We demonstrate that our architecture is a universal approximator of these functions, and show superior results to existing methods on a variety of tasks including counting tasks, alignment tasks, distinguishability tasks and statistical distance measurements. This last task is quite important in Machine Learning. Although our approach is quite general, we demonstrate that it can generate approximate estimates of KL divergence and mutual information that are more accurate than previous techniques that are specifically designed to approximate those statistical distances. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/selby22a.html https://proceedings.mlr.press/v180/selby22a.html Probabilistic spatial transformer networks Spatial Transformer Networks (STNs) estimate image transformations that can improve downstream tasks by ‘zooming in’ on relevant regions in an image. However, STNs are hard to train and sensitive to mis-predictions of transformations. To circumvent these limitations, we propose a probabilistic extension that estimates a stochastic transformation rather than a deterministic one. Marginalizing transformations allows us to consider each image at multiple poses, which makes the localization task easier and the training more robust. As an additional benefit, the stochastic transformations act as a localized, learned data augmentation that improves the downstream tasks. We show across standard imaging benchmarks and on a challenging real-world dataset that these two properties lead to improved classification performance, robustness and model calibration. We further demonstrate that the approach generalizes to non-visual domains by improving model performance on time-series data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/schwobel22a.html https://proceedings.mlr.press/v180/schwobel22a.html How unfair is private learning? As machine learning algorithms are deployed on sensitive data in critical decision making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and results in higher accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR-10 and CelebA), and tabular (Law School) datasets and learning algorithms. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/sanyal22a.html https://proceedings.mlr.press/v180/sanyal22a.html Robust identifiability in linear structural equation models of causal inference We consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). Under various conditions on LSEMs and the model parameters the prior work provides efficient algorithms to recover the parameters. However, these results are often about generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should not affect the parameters by a huge amount. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, a limitation of their work is that their results only apply to a small sub-class of LSEMs, called “bow-free paths.” In this work, we show that for any “bow-free model”, in all but $\frac{1}{\poly(n)}$-measure of instances robust identifiability holds. Moreover, whenever an instance is robustly identifiable, the algorithm proposed in Foygel et al., (2012) can be used to recover the parameters in a robust fashion. In contrast, for generic identifiability Foygel et al., (2012) proved that with measure $1$, instances are generically identifiable. Thus, we show that robust identifiability is a strictly harder problem than generic identifiability. Finally, we validate our results on both simulated and real-world datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/sankararaman22a.html https://proceedings.mlr.press/v180/sankararaman22a.html Feature learning and random features in standard finite-width convolutional neural networks: An empirical study The Neural Tangent Kernel is an important milestone in the ongoing effort to build a theory for deep learning. Its prediction that sufficiently wide neural networks behave as kernel methods, or equivalently as random feature models arising from linearized networks, has been confirmed empirically for certain wide architectures. In this paper, we compare the performance of two common finite-width convolutional neural networks, LeNet and AlexNet, to their linearizations on common benchmark datasets like MNIST and modified versions of it, CIFAR-10 and an ImageNet subset. We demonstrate empirically that finite-width neural networks, generally, greatly outperform the finite-width linearization of these architectures. When increasing the problem difficulty of the classification task, we observe a larger gap which is in line with common intuition that finite-width neural networks perform feature learning which finite-width linearizations cannot. At the same time, finite-width linearizations improve dramatically with width, approaching the behavior of the wider standard networks which in turn perform slightly better than their standard width counterparts. Therefore, it appears that feature learning for non-wide standard networks is important but becomes less significant with increasing width. We furthermore identify cases where both standard and linearized networks match in performance, in agreement with NTK theory, and a case where a wide linearization outperforms its standard width counterpart. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/samarin22a.html https://proceedings.mlr.press/v180/samarin22a.html Resolving label uncertainty with implicit posterior models We propose a method for jointly inferring labels across a collection of data samples, where each sample consists of an observation and a prior belief about the label. By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs. This formulation unifies various machine learning settings; the weak beliefs can come in the form of noisy or incomplete labels, likelihoods given by a different prediction mechanism on auxiliary input, or common-sense priors reflecting knowledge about the structure of the problem at hand. We demonstrate the proposed algorithms on diverse problems: classification with negative training examples, learning from rankings, weakly and self-supervised aerial imagery segmentation, co-segmentation of video frames, and coarsely supervised text classification. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/rolf22a.html https://proceedings.mlr.press/v180/rolf22a.html Quantum perceptron revisited: Computational-statistical tradeoffs Quantum machine learning algorithms could provide significant speed-ups over their classical counterparts; however, whether they could also achieve good generalization remains unclear. Recently, two quantum perceptron models which give a quadratic improvement over the classical perceptron algorithm using Grover’s search have been proposed by Wiebe et al. While the first model reduces the complexity with respect to the size of the training set, the second one improves the bound on the number of mistakes made by the perceptron. In this paper, we introduce a hybrid quantum-classical perceptron algorithm with lower complexity and better generalization ability than the classical perceptron. We show a quadratic improvement over the classical perceptron in both the number of samples and the margin of the data. We derive a bound on the expected error of the hypothesis returned by our algorithm, which compares favorably to the one obtained with the classical online perceptron. We use numerical experiments to illustrate the trade-off between computational complexity and statistical accuracy in quantum perceptron learning and discuss some of the key practical issues surrounding the implementation of quantum perceptron models into near-term quantum devices, whose practical implementation represents a serious challenge due to inherent noise. However, the potential benefits make correcting this worthwhile. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/roget22a.html https://proceedings.mlr.press/v180/roget22a.html A free lunch from the noise: Provable and practical exploration for representation learning Representation learning lies at the heart of the empirical success of deep learning for dealing with the curse of dimensionality. However, the power of representation learning has not been fully exploited yet in reinforcement learning (RL), due to i), the trade-off between expressiveness and tractability; and ii), the coupling between exploration and representation learning. In this paper, we first reveal the fact that under some noise assumption in the stochastic control model, we can obtain the linear spectral feature of its corresponding Markov transition operator in closed-form for free. Based on this observation, we propose Spectral Dynamics Embedding (SPEDE), which breaks the tradeoff and completes optimistic exploration for representation learning by exploiting the structure of the noise. We provide rigorous theoretical analysis of SPEDE, and demonstrate the practical superior performance over the existing state-of-the-art empirical algorithms on several benchmarks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ren22a.html https://proceedings.mlr.press/v180/ren22a.html Expectation programming: Adapting probabilistic programming systems to estimate expectations efficiently We show that the standard computational pipeline of probabilistic programming systems (PPSs) can be inefficient for estimating expectations and introduce the concept of expectation programming to address this. In expectation programming, the aim of the backend inference engine is to directly estimate expected return values of programs, as opposed to approximating their conditional distributions. This distinction, while subtle, allows us to achieve substantial performance improvements over the standard PPS computational pipeline by tailoring computation to the expectation we care about. We realize a particular instance of our expectation programming concept, Expectation Programming in Turing (EPT), by extending the PPS Turing to allow so-called target-aware inference to be run automatically. We then verify the statistical soundness of EPT theoretically, and show that it provides substantial empirical gains in practice. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/reichelt22a.html https://proceedings.mlr.press/v180/reichelt22a.html Learning in Markov games: Can we exploit a general-sum opponent? In this paper, we study the learning problem in two-player general-sum Markov Games. We consider the online setting where we control a single player, playing against an arbitrary opponent to minimize the regret. Previous works only consider the zero-sum Markov Games setting, in which the two agents are completely adversarial. However, in some cases, the two agents may have different reward functions without having conflicting objectives. This involves a stronger notion of regret than the one used in previous works. This class of games, called general-sum Markov Games is far to be well understood and studied. We show that the new regret minimization problem is significantly harder than in standard Markov Decision Processes and zero-sum Markov Games. To do this, we derive a lower bound on the expected regret of any “good” learning strategy which shows the constant dependencies with the number of deterministic policies, which is not present in zerosum Markov Games and Markov Decision Processes. Then we propose a novel optimistic algorithm that nearly matches the proposed lower bound. Proving these results requires overcoming several new challenges that are not present in Markov Decision Processes or zero-sum Markov Games. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ramponi22a.html https://proceedings.mlr.press/v180/ramponi22a.html Clustering a union of linear subspaces via matrix factorization and innovation search This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed-form algorithms for the subspace clustering algorithm. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm which was shown to be able to outperform the other spectral clustering based methods with a notable margin especially when the span of clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit) and it is shown that both algorithms can be robust to notable intersections between the span of clusters. Importantly, in contrast to the theoretical guarantees of other algorithms which emphasized on the distance between the subspaces as the key performance factor and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/rahmani22a.html https://proceedings.mlr.press/v180/rahmani22a.html Voronoi density estimator for high-dimensional data: Computation, compactification and convergence The Voronoi Density Estimator (VDE) is an established density estimation technique that adapts to the local geometry of data. However, its applicability has been so far limited to problems in two and three dimensions. This is because Voronoi cells rapidly increase in complexity as dimensions grow, making the necessary explicit computations infeasible. We define a variant of the VDE deemed Compactified Voronoi Density Estimator (CVDE), suitable for higher dimensions. We propose computationally efficient algorithms for numerical approximation of the CVDE and formally prove convergence of the estimated density to the original one. We implement and empirically validate the CVDE through a comparison with the Kernel Density Estimator (KDE). Our results indicate that the CVDE outperforms the KDE on sound and image data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/polianskii22a.html https://proceedings.mlr.press/v180/polianskii22a.html Using hierarchies to efficiently combine evidence with Dempster’s rule of combination Dempster’s rule of combination allows us to combine various independent pieces of evidence that each have a certain degree of uncertainty. This provides a useful way for dealing with uncertain evidence, but the rule is computationally intractable. In this paper, we analyze the complexity of this rule for differently structured bodies of evidence and we consider a known algorithm by Shafer and Logan to compute this rule efficiently over a hierarchical set of evidence. We show that one can check in polynomial time whether an arbitrary set of evidence has a hierarchical shape, enabling the use of Shafer and Logan’s algorithm. Moreover, we consider two different approaches to deal with non-hierarchical sets of evidence: (i) considering hierarchical subsets and (ii) taking advantage of internal hierarchical structures in the overall set. For the former case, we conclude that getting different hierarchies from an arbitrary set of pieces of evidence corresponds to the VERTEX COVER problem and we present algorithms for obtaining these hierarchies based on this correspondence. For the latter case, we present a fixed-parameter tractable algorithm which computes the belief function of any piece of evidence included in the set. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/pinto-prieto22a.html https://proceedings.mlr.press/v180/pinto-prieto22a.html Bayesian quantile and expectile optimisation Bayesian optimisation (BO) is widely used to optimise stochastic black box functions. While most BO approaches focus on optimising conditional expectations, many applications require risk-averse strategies and alternative criteria accounting for the distribution tails need to be considered. In this paper, we propose new variational models for Bayesian quantile and expectile regression that are well-suited for heteroscedastic noise settings. Our models consist of two latent Gaussian processes accounting respectively for the conditional quantile (or expectile) and the scale parameter of an asymmetric likelihood functions. Furthermore, we propose two BO strategies based on max-value entropy search and Thompson sampling, that are tailored to such models and that can accommodate large batches of points. Contrary to existing BO approaches for risk-averse optimisation, our strategies can directly optimise for the quantile and expectile, without requiring replicating observations or assuming a parametric form for the noise. As illustrated in the experimental section, the proposed approach clearly outperforms the state of the art in the heteroscedastic, non-Gaussian case. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/picheny22a.html https://proceedings.mlr.press/v180/picheny22a.html Identifiability of sparse causal effects using instrumental variables Exogenous heterogeneity, for example, in the form of instrumental variables can help us learn a system’s underlying causal structure and predict the outcome of unseen intervention experiments. In this paper, we consider linear models in which the causal effect from covariates X on a response Y is sparse. We provide conditions under which the causal coefficient becomes identifiable from the observed distribution. These conditions can be satisfied even if the number of instruments is as small as the number of causal parents. We also develop graphical criteria under which identifiability holds with probability one if the edge coefficients are sampled randomly from a distribution that is absolutely continuous with respect to Lebesgue measure and $Y$ is childless. As an estimator, we propose spaceIV and prove that it consistently estimates the causal effect if the model is identifiable and evaluate its performance on simulated data. If identifiability does not hold, we show that it may still be possible to recover a subset of the causal parents. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/pfister22a.html https://proceedings.mlr.press/v180/pfister22a.html AND/OR branch-and-bound for computational protein design optimizing K* The importance of designing proteins, such as high affinity antibodies, has become ever more apparent. Computational Protein Design can cast such design problems as optimization tasks with the objective of maximizing K*, an approximation of binding affinity. Here we lay out a graphical model framework for K* optimization that enables use of compact AND/OR search algorithms. We designed an AND/OR branch-and-bound algorithm, AOBB-K*, for optimizing K* that is guided by a new K* heuristic and can incorporate specialized performance improvements with theoretical guarantees. As AOBB-K* is inspired by algorithms from the well studied task of Marginal MAP, this work provides a foundation for harnessing advancements in state-of-the-art mixed inference schemes and adapting them to protein design. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/pezeshki22a.html https://proceedings.mlr.press/v180/pezeshki22a.html Learning large Bayesian networks with expert constraints We propose a new score-based algorithm for learning the structure of a Bayesian Network (BN). It is the first algorithm that simultaneously supports the requirements of (i) learning a BN of bounded treewidth, (ii) satisfying expert constraints, including positive and negative ancestry properties between nodes, and (iii) scaling up to BNs with several thousand nodes. The algorithm operates in two phases. In Phase 1, we utilize a modified version of an existing BN structure learning algorithm, modified to generate an initial Directed Acyclic Graph (DAG) that supports a portion of the given constraints. In Phase 2, we follow the BN-SLIM framework, introduced by Peruvemba Ramaswamy and Szeider (AAAI 2021). We improve the initial DAG by repeatedly running a MaxSAT solver on selected local parts. The MaxSAT encoding entails local versions of the expert constraints as hard constraints. We evaluate a prototype implementation of our algorithm on several standard benchmark sets. The encouraging results demonstrate the power and flexibility of the BN-SLIM framework. It boosts the score while increasing the number of satisfied expert constraints. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/peruvemba-ramaswamy22a.html https://proceedings.mlr.press/v180/peruvemba-ramaswamy22a.html Attribution of predictive uncertainties in classification models Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/perez22a.html https://proceedings.mlr.press/v180/perez22a.html Robust learning of tractable probabilistic models Tractable probabilistic models (TPMs) compactly represent a joint probability distribution over a large number of random variables and admit polynomial time computation of (1) exact likelihoods; (2) marginal probability distributions over a small subset of variables given evidence; and (3) in some cases most probable explanations over all non-observed variables given observations. In this paper, we leverage these tractability properties to solve the robust maximum likelihood parameter estimation task in TPMs under the assumption that a TPM structure and complete training data is provided as input. Specifically, we show that TPMs learned by optimizing the likelihood perform poorly when data is subject to adversarial attacks/noise/perturbations/corruption and we can address this issue by optimizing robust likelihood. To this end, we develop an efficient approach for constructing uncertainty sets that model data corruption in TPMs and derive an efficient gradient-based local search method for learning TPMs that are robust against these uncertainty sets. We empirically demonstrate the efficacy of our proposed approach on a collection of benchmark datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/peddi22a.html https://proceedings.mlr.press/v180/peddi22a.html Understanding and mitigating the limitations of prioritized experience replay Prioritized Experience Replay (ER) has been empirically shown to improve sample efficiency across many domains and attracted great attention; however, there is little theoretical understanding of why such prioritized sampling helps and its limitations. In this work, we take a deep look at the prioritized ER. In a supervised learning setting, we show the equivalence between the error-based prioritized sampling method for minimizing mean squared error and the uniform sampling for cubic power loss. We then provide theoretical insight into why error-based prioritized sampling improves convergence rate upon uniform sampling when minimizing mean squared error during early learning. Based on the insight, we further point out two limitations of the prioritized ER method: 1) outdated priorities and 2) insufficient coverage of the sample space. To mitigate the limitations, we propose our model-based stochastic gradient Langevin dynamics sampling method. We show that our method does provide states distributed close to an ideal prioritized sampling distribution estimated by the brute-force method, which does not suffer from the two limitations. We conduct experiments on both discrete and continuous control problems to show our approach’s efficacy and examine the practical implication of our method in an autonomous driving application. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/pan22a.html https://proceedings.mlr.press/v180/pan22a.html Learning invariant weights in neural networks Assumptions about invariances or symmetries in data can significantly increase the predictive power of statistical models. Many commonly used machine learning models are constraint to respect certain symmetries, such as translation equivariance in convolutional neural networks, and incorporating other symmetry types is actively being studied. Yet, learning invariances from the data itself remains an open research problem. It has been shown that the marginal likelihood offers a principled way to learn invariances in Gaussian Processes. We propose a weight-space equivalent to this approach, by minimizing a lower bound on the marginal likelihood to learn invariances in neural networks, resulting in naturally higher performing models. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ouderaa22a.html https://proceedings.mlr.press/v180/ouderaa22a.html Evaluating high-order predictive distributions in deep learning Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate uncertainty from those that do not. We establish that the predictive distribution order required for such differentiation increases greatly with input dimension, rendering these methods impractical. To accommodate high-dimensional inputs, we introduce dyadic sampling, which focuses on predictive distributions associated with random pairs of inputs. We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/osband22a.html https://proceedings.mlr.press/v180/osband22a.html An explore-then-commit algorithm for submodular maximization under full-bandit feedback We investigate the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit feedback, where no extra information other than the reward of selected action at each time step $t$ is observed. We propose a simple algorithm, Explore-Then-Commit Greedy (ETCG) and prove that it achieves a $(1-1/e)$-regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and real-world data that the ETCG empirically outperforms other full-bandit methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nie22a.html https://proceedings.mlr.press/v180/nie22a.html Ordinal causal discovery Causal discovery for purely observational, categorical data is a long-standing challenging problem. Unlike continuous data, the vast majority of existing methods for categorical data focus on inferring the Markov equivalence class only, which leaves the direction of some causal relationships undetermined. This paper proposes an identifiable ordinal causal discovery method that exploits the ordinal information contained in many real-world applications to uniquely identify the causal structure. The proposed method is applicable beyond ordinal data via data discretization. Through real-world and synthetic experiments, we demonstrate that the proposed ordinal causal discovery method combined with simple score-and-search algorithms has favorable and robust performance compared to state-of-the-art alternative methods in both ordinal categorical and non-categorical data. An accompanied R package OCD is freely available at the first author’s website. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ni22a.html https://proceedings.mlr.press/v180/ni22a.html Cycle class consistency with distributional optimal transport and knowledge distillation for unsupervised domain adaptation Unsupervised domain adaptation (UDA) aims to transfer knowledge from a model trained on a labeled source domain to an unlabeled target domain. To this end, we propose in this paper a novel cycle class-consistent model based on optimal transport (OT) and knowledge distillation. The model consists of two agents, a teacher and a student cooperatively working in a cycle process under the guidance of the distributional optimal transport and distillation manner. The OT distance is designed to bridge the gap between the distribution of the target data and a distribution over the source class-conditional distributions. The optimal probability matrix then provides pseudo labels to learn a teacher that achieves a good classification performance on the target domain. Knowledge distillation is performed in the next step in which the teacher distills and transfers its knowledge to the student. And finally, the student produces its prediction for the optimal transport step. This process forms a closed cycle in which the teacher and student networks are simultaneously trained to conduct transfer learning from the source to the target domain. Extensive experiments show that our proposed method outperforms existing methods, especially the class-aware and OT-based ones on benchmark datasets including Office-31, Office-Home, and ImageCLEF-DA. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nguyen22c.html https://proceedings.mlr.press/v180/nguyen22c.html Efficient and accurate top-k recovery from choice data The intersection of learning to rank and choice modeling is an active area of research with applications in e-commerce, information retrieval and the social sciences. In some applications such as recommendation systems, the statistician is primarily interested in recovering the set of the top ranked items from a large pool of items as efficiently as possible using passively collected discrete choice data, i.e., the user picks one item from a set of multiple items. Motivated by this practical consideration, we propose the choice-based Borda count algorithm as a fast and accurate ranking algorithm for top $K$-recovery i.e., correctly identifying all of the top $K$ items. We show that the choice-based Borda count algorithm has optimal sample complexity for top-$K$ recovery under a broad class of random utility models. We prove that in the limit, the choice-based Borda count algorithm produces the same top-$K$ estimate as the commonly used Maximum Likelihood Estimate method but the former’s speed and simplicity brings considerable advantages in practice. Experiments on both synthetic and real datasets show that the counting algorithm is competitive with commonly used ranking algorithms in terms of accuracy while being several orders of magnitude faster. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nguyen22b.html https://proceedings.mlr.press/v180/nguyen22b.html Robust Bayesian recourse Algorithmic recourse aims to recommend an informative feedback to overturn an unfavorable machine learning decision. We introduce in this paper the Bayesian recourse, a model-agnostic recourse that minimizes the posterior probability odds ratio. Further, we present its min-max robust counterpart with the goal of hedging against future changes in the machine learning model parameters. The robust counterpart explicitly takes into account possible perturbations of the data in a Gaussian mixture ambiguity set prescribed using the optimal transport (Wasserstein) distance. We show that the resulting worst-case objective function can be decomposed into solving a series of two-dimensional optimization subproblems, and the min-max recourse finding problem is thus amenable to a gradient descent algorithm. Contrary to existing methods for generating robust recourses, the robust Bayesian recourse does not require a linear approximation step. The numerical experiment demonstrates the effectiveness of our proposed robust Bayesian recourse facing model shifts. Our code is available at https://github.com/VinAIResearch/robust-bayesian-recourse. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nguyen22a.html https://proceedings.mlr.press/v180/nguyen22a.html CounteRGAN: Generating counterfactuals for real-time recourse and interpretability using residual GANs Model interpretability, fairness, and recourse for end users have increased as machine learning models have become increasingly popular in areas including criminal justice, finance, healthcare, and job marketplaces. This work presents a novel method of addressing these issues by producing meaningful counterfactuals that are aimed at providing recourse to users and highlighting potential model biases. A meaningful counterfactual is a reasonable alternative scenario that illustrates how input data perturbations can influence the model’s output. The CounteRGAN method generates meaningful counterfactuals for a target classifier by utilizing a novel Residual Generative Adversarial Network (RGAN). We compare our method against leading state-of-the-art approaches on image and tabular datasets over a variety of performance metrics. The results indicate a significant improvement over existing techniques in combined metric performance, with a latency reduction of 2 to 7 orders of magnitude which enables providing real-time recourse to users. The code for reproducibility can be found here: https://github.com/gan-counterfactuals/countergan. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nemirovsky22a.html https://proceedings.mlr.press/v180/nemirovsky22a.html Linearizing contextual bandits with latent state dynamics In many real-world applications of multi-armed bandit problems, both rewards and contexts are often influenced by confounding latent variables which evolve stochastically over time. While the observed contexts and rewards are nonlinearly related, we show that prior knowledge of latent causal structure can be used to reduce the problem to the linear bandit setting. We develop two algorithms, Latent Linear Thompson Sampling (L2TS) and Latent Linear UCB (L2UCB), which use online EM algorithms for hidden Markov models to learn the latent transition model and maintain a posterior belief over the latent state, and then use the resulting posteriors as context features in a linear bandit problem. We upper bound the error in reward estimation in the presence of a dynamical latent state, and derive a novel problem-dependent regret bound for linear Thompson sampling with non-stationarity and unconstrained reward distributions, which we apply to L2TS under certain conditions. Finally, we demonstrate the superiority of our algorithms over related bandit algorithms through experiments. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nelson22a.html https://proceedings.mlr.press/v180/nelson22a.html Efficient learning of sparse and decomposable PDEs using random projection Learning physics models in the form of Partial Differential Equations (PDEs) is carried out through back-propagation to match the simulations of the physics model with experimental observations. Nevertheless, such matching involves computation over billions of elements, presenting a significant computational overhead. We notice many PDEs in real world problems are sparse and decomposable, where the temporal updates and the spatial features are sparsely concentrated on small interface regions. We propose RAPID-PDE, an algorithm to expedite the learning of sparse and decomposable PDEs. Our RAPID-PDE first uses random projection to compress the high dimensional sparse updates and features into low dimensional representations and then use these compressed signals during learning. Crucially, such a conversion is only carried out once prior to learning and the entire learning process is conducted in the compressed space. Theoretically, we derive a constant factor approximation between the projected loss function and the original one with logarithmic number of projected dimensions. Empirically, we demonstrate RAPID-PDE with data compressed to 0.05% of its original size learns similar models compared with uncompressed algorithms in learning a set of phase-field models which govern the spatial-temporal dynamics of nano-scale structures in metallic materials. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nasim22a.html https://proceedings.mlr.press/v180/nasim22a.html Partially adaptive regularized multiple regression analysis for estimating linear causal effects This paper assumes that cause-effect relationships among variables can be described with a linear structural equation model. Then, a situation is considered where a set of observed covariates satisfies the back-door criterion but the ordinary least squares method cannot be applied to estimate linear causal effects because of multicollinearity/high-dimensional data problems. In this situation, we propose a novel regression approach, the “partially adaptive L$_p$-regularized multiple regression analysis” (PAL$_p$MA) method for estimating the total effects. Different from standard regularized regression analysis, PAL$_p$MA provides a consistent or less-biased estimator of the linear causal effect. PAL$_p$MA is also applicable to evaluating direct effects through the single-door criterion. Given space constraints, the proofs, some numerical experiments, and an industrial case study on setting up painting conditions of car bodies are provided in the Supplementary Material. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nanmo22a.html https://proceedings.mlr.press/v180/nanmo22a.html Semiparametric causal sufficient dimension reduction of multidimensional treatments Cause-effect relationships are typically evaluated by comparing outcome responses to binary treatment values, representing two arms of a hypothetical randomized controlled trial. However, in certain applications, treatments of interest are continuous and multidimensional. For example, understanding the causal relationship between severity of radiation therapy, summarized by a multidimensional vector of radiation exposure values and post-treatment side effects is a problem of clinical interest in radiation oncology. An appropriate strategy for making interpretable causal conclusions is to reduce the dimension of treatment. If individual elements of a multidimensional treatment vector weakly affect the outcome, but the overall relationship between treatment and outcome is strong, careless approaches to dimension reduction may not preserve this relationship. Further, methods developed for regression problems do not directly transfer to causal inference due to confounding complications. In this paper, we use semiparametric inference theory for structural models to give a general approach to causal sufficient dimension reduction of a multidimensional treatment such that the cause-effect relationship between treatment and outcome is preserved. We illustrate the utility of our proposals through simulations and a real data application in radiation oncology. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nabi22a.html https://proceedings.mlr.press/v180/nabi22a.html Data augmentation in Bayesian neural networks and the cold posterior effect Bayesian neural networks that incorporate data augmentation implicitly use a “randomly perturbed log-likelihood [which] does not have a clean interpretation as a valid likelihood function” (Izmailov et al. 2021). Here, we provide several approaches to developing principled Bayesian neural networks incorporating data augmentation. We introduce a “finite orbit” setting which allows valid likelihoods to be computed exactly, and for the more usual “full orbit” setting we derive multi-sample bounds tighter than those used previously. These models cast light on the origin of the cold posterior effect. In particular, we find that the cold posterior effect persists even in these principled models incorporating data augmentation. This suggests that the cold posterior effect cannot be dismissed as an artifact of data augmentation using incorrect likelihoods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/nabarro22a.html https://proceedings.mlr.press/v180/nabarro22a.html Probabilistic surrogate networks for simulators with unbounded randomness We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates’ ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/munk22a.html https://proceedings.mlr.press/v180/munk22a.html ReVar: Strengthening policy evaluation via reduced variance sampling This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a \textit{target} policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle exploration strategy that uses knowledge of the variance of the reward distributions. We then introduce the \textbf{Re}duced \textbf{Var}iance Sampling (\rev\!) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that \rev leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/mukherjee22b.html https://proceedings.mlr.press/v180/mukherjee22b.html Safety aware changepoint detection for piecewise i.i.d. bandits In this paper, we consider the setting of piecewise i.i.d. bandits under a safety constraint. In this piecewise i.i.d. setting, there exists a finite number of changepoints where the mean of some or all arms change simultaneously. We introduce the safety constraint studied in Wu et al. (2016) to this setting such that at any round the cumulative reward is above a constant factor of the default action reward. We propose two actively adaptive algorithms for this setting that satisfy the safety constraint, detect changepoints, and restart without the knowledge of the number of changepoints or their locations. We provide regret bounds for our algorithms and show that the bounds are comparable to their counterparts from the safe bandit and piecewise i.i.d. bandit literature. We also provide the first matching lower bounds for this setting. Empirically, we show that our safety-aware algorithms match the performance of the state-of-the-art actively adaptive algorithms that do not satisfy the safety constraint. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/mukherjee22a.html https://proceedings.mlr.press/v180/mukherjee22a.html Set-valued prediction in hierarchical classification with constrained representation complexity Set-valued prediction is a well-known concept in multi-class classification. When a classifier is uncertain about the class label for a test instance, it can predict a set of classes instead of a single class. In this paper, we focus on hierarchical multi-class classification problems, where valid sets (typically) correspond to internal nodes of the hierarchy. We argue that this is a very strong restriction, and we propose a relaxation by introducing the notion of representation complexity for a predicted set. In combination with probabilistic classifiers, this leads to a challenging inference problem for which specific combinatorial optimization algorithms are needed. We propose three methods and evaluate them on benchmark datasets: a naïve approach that is based on matrix-vector multiplication, a reformulation as a knapsack problem with conflict graph, and a recursive tree search method. Experimental results demonstrate that the last method is computationally more efficient than the other two approaches, due to a hierarchical factorization of the conditional class distribution. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/mortier22a.html https://proceedings.mlr.press/v180/mortier22a.html Monotonicity regularization: Improved penalties and novel applications to disentangled representation learning and robust classification We study settings where gradient penalties are used alongside risk minimization with the goal of obtaining predictors satisfying different notions of monotonicity. Specifically, we present two sets of contributions. In the first part of the paper, we show that different choices of penalties define the regions of the input space where the property is observed. As such, previous methods result in models that are monotonic only in a small volume of the input space. We thus propose an approach that uses mixtures of training instances and random points to populate the space and enforce the penalty in a much larger region. As a second set of contributions, we introduce regularization strategies that enforce other notions of monotonicity in different settings. In this case, we consider applications, such as image classification and generative modeling, where monotonicity is not a hard constraint but can help improve some aspects of the model. Namely, we show that inducing monotonicity can be beneficial in applications such as: (1) allowing for controllable data generation, (2) defining strategies to detect anomalous data, and (3) generating explanations for predictions. Our proposed approaches do not introduce relevant computational overhead while leading to efficient procedures that provide extra benefits over baseline models. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/monteiro22a.html https://proceedings.mlr.press/v180/monteiro22a.html Can mean field control (mfc) approximate cooperative multi agent reinforcement learning (marl) with non-uniform interaction? Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-field ‘seen’ by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of state and action spaces respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL with an error $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/mondal22a.html https://proceedings.mlr.press/v180/mondal22a.html Forget-me-not! Contrastive critics for mitigating posterior collapse Variational autoencoders (VAEs) suffer from posterior collapse, where the powerful neural networks used for modeling and inference optimize the objective without meaningfully using the latent representation. We introduce inference critics that detect and incentivize against posterior collapse by requiring correspondence between latent variables and the observations. By connecting the critic’s objective to the literature in self-supervised contrastive representation learning, we show both theoretically and empirically that optimizing inference critics increases the mutual information between observations and latents, mitigating posterior collapse. This approach is straightforward to implement and requires significantly less training time than prior methods, yet obtains competitive results on three established datasets. Overall, the approach lays the foundation to bridge the previously disconnected frameworks of contrastive learning and probabilistic modeling with variational autoencoders, underscoring the benefits both communities may find at their intersection. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/menon22a.html https://proceedings.mlr.press/v180/menon22a.html Multistate analysis with infinite mixtures of Markov chains Driven by applications in clinical medicine and business, we address the problem of modeling trajectories over multiple states. We build on well-known methods from survival analysis and introduce a family of sequence models based on localized Bayesian Markov chains. We develop inference and prediction algorithms, and we apply the model to real-world data, demonstrating favorable empirical results. Our approach provides a practical and effective alternative to plain Markov chains and to existing (finite) mixture models; It retains the simplicity and computational benefits of the former while matching or exceeding the predictive performance of the latter. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/maystre22a.html https://proceedings.mlr.press/v180/maystre22a.html Case-based off-policy evaluation using prototype learning Importance sampling (IS) is often used to perform off-policy evaluation but it is prone to several issues—especially when the behavior policy is unknown and must be estimated from data. Significant differences between target and behavior policies can result in uncertain value estimates due to, for example, high variance. Standard practices such as inspecting IS weights may be insufficient to diagnose such problems and determine for which type of inputs the policies differ in suggested actions and resulting values. To address this, we propose estimating the behavior policy for IS using prototype learning. The learned prototypes provide a condensed summary of the input-action space, which allows for describing differences between policies and assessing the support for evaluating a certain target policy. In addition, we can describe a value estimate in terms of prototypes to understand which parts of the target policy have the most impact on the estimate. We find that this provides new insights in the examination of a learned policy for sepsis management. Moreover, we study the bias resulting from restricting models to use prototypes, how bias propagates to IS weights and estimated values and how this varies with history length. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/matsson22a.html https://proceedings.mlr.press/v180/matsson22a.html A causal bandit approach to learning good atomic interventions in presence of unobserved confounders We study the problem of determining the best atomic intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where interventions on CBN correspond to arms of the bandit instance. First, we propose a simple regret minimization algorithm that takes as input a causal graph with observable and unobservable nodes and in $T$ exploration rounds achieves $\tilde{O}(\sqrt{m(\mathcal{C})/T})$ expected simple regret. Here $m(\mathcal{C})$ is a parameter dependent on the input CBN $\mathcal{C}$ and could be much smaller than the number of arms. We also show that this is almost optimal for CBNs whose causal graphs have an $n$-ary tree structure. Next, we propose a cumulative regret minimization algorithm that takes as input a causal graph with observable nodes and performs better than the optimal MAB algorithms that do not use causal side-information. We experimentally compare both our algorithms with the best known algorithms in the literature. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/maiti22a.html https://proceedings.mlr.press/v180/maiti22a.html Perturbation type categorization for multiple adversarial perturbation robustness Recent works in adversarial robustness have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, these methods still suffer significant trade-offs compared to the ones specifically trained to be robust against a single perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their perturbation types. We first theoretically show on a toy task that adversarial examples of different perturbation types constitute different distributions—making it possible to distinguish them. We support these arguments with experimental validation on multiple l_p attacks and common corruptions. Instead of training a single classifier, we propose PROTECTOR, a two-stage pipeline that first categorizes the perturbation type of the input, and then makes the final prediction using the classifier specifically trained against the predicted perturbation type. We theoretically show that at test time the adversary faces a natural trade-off between fooling the perturbation classifier and the succeeding classifier optimized with perturbation-specific adversarial training. This makes it challenging for an adversary to plant strong attacks against the whole pipeline. Experiments on MNIST and CIFAR-10 show that PROTECTOR outperforms prior adversarial training-based defenses by over 5% when tested against the union of l_1, l_2, l_inf attacks. Additionally, our method extends to a more diverse attack suite, also showing large robustness gains against multiple l_p, spatial and recolor attacks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/maini22a.html https://proceedings.mlr.press/v180/maini22a.html Low-precision arithmetic for fast Gaussian processes Low precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory and energy requirements. However, despite their promise, low precision operations have received little attention for Gaussian process (GP) training, largely because GPs require sophisticated linear algebra routines that are unstable in low precision. We study the different failure modes that can occur when training GPs in half-precision. To circumvent these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, compact kernels, and preconditioners. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low precision over a wide range of settings, and reduces the runtime of 1.8 million data points to 10 hours on a single GPU, without requiring any sparse approximations. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/maddox22a.html https://proceedings.mlr.press/v180/maddox22a.html Data sampling affects the complexity of online SGD over dependent data Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study on how different stochastic data sampling schemes affect the sample complexity of online stochastic gradient descent (SGD) over highly dependent data. Specifically, with a $\phi$-mixing process of data, we show that online SGD with proper periodic data-subsampling achieves an improved sample complexity over the standard online SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence of online SGD over highly dependent data. Moreover, we show that online SGD with mini-batch sampling can further substantially improve the sample complexity over online SGD with periodic data-subsampling over highly dependent data. Numerical experiments validate our theoretical results. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ma22a.html https://proceedings.mlr.press/v180/ma22a.html Local calibration: metrics and recalibration Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability \textit{on average} across the entire population. However, it is in general impossible to measure the reliability of an \textit{individual} prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is a novel {\textbf{lo}cal \textbf{re}calibration} method \method{}, to improve confidence scores for individual predictions and decrease the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/luo22a.html https://proceedings.mlr.press/v180/luo22a.html Nonparametric exponential family graph embeddings for multiple representation learning In graph data, each node often serves multiple functionalities. However, most graph embedding models assume that each node can only possess one representation. We address this issue by proposing a nonparametric graph embedding model. The model allows each node to learn multiple representations where they are needed to represent the complexity of random walks in the graph. It extends the Exponential family graph embedding model with two nonparametric prior settings, the Dirichlet process and the uniform process. The model combines the ability of Exponential family graph embedding to take the number of occurrences of context nodes into account with nonparametric priors giving it the flexibility to learn more than one latent representation for each node. The learned embeddings outperform other state of the art approaches in link prediction and node classification tasks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lu22a.html https://proceedings.mlr.press/v180/lu22a.html Data poisoning attacks on off-policy policy evaluation methods Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible, unethical, or expensive. However, the extent to which such methods can be trusted under adversarial threats to data quality is largely unexplored. In this work, we make the first attempt at investigating the sensitivity of OPE methods to marginal adversarial perturbations to the data. We design a generic data poisoning attack framework leveraging influence functions from robust statistics to carefully construct perturbations that maximize error in the policy value estimates. We carry out extensive experimentation with multiple healthcare and control datasets. Our results demonstrate that many existing OPE methods are highly prone to generating value estimates with large errors when subject to data poisoning attacks, even for small adversarial perturbations. These findings question the reliability of policy values derived using OPE methods and motivate the need for developing OPE methods that are statistically robust to train-time data poisoning attacks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lobo22a.html https://proceedings.mlr.press/v180/lobo22a.html Offline policy optimization with eligible actions Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a common used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous attempts to this approach. We test our algorithm in a healthcare-inspired simulator, a logged dataset collected from real hospitals and continuous control tasks. These experiments show the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/liu22d.html https://proceedings.mlr.press/v180/liu22d.html SASH: Efficient secure aggregation based on SHPRG for federated learning To prevent private training data leakage in Federated Learning systems, we propose a novel secure aggregation scheme based on seed homomorphic pseudo-random generator (SHPRG), named SASH. SASH leverages the homomorphic property of SHPRG to simplify the masking and demasking scheme, which for each of the clients and for the server, entails a overhead linear w.r.t model size and constant w.r.t number of clients. We prove that even against worst-case colluding adversaries, SASH preserves training data privacy, while being resilient to dropouts without extra overhead. We experimentally demonstrate SASH significantly improves the efficiency to 20× over baseline, especially in the more realistic case where the numbers of clients and model size become large, and a certain percentage of clients drop out from the system. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/liu22c.html https://proceedings.mlr.press/v180/liu22c.html PathFlow: A normalizing flow generator that finds transition paths Sampling from a Boltzmann distribution to calculate important macro statistics is one of the central tasks in the study of large atomic and molecular systems. Recently, a one-shot configuration sampler, the Boltzmann generator [Noé et al., 2019], is introduced. Though a Boltzmann generator can directly generate independent metastable states, it lacks the ability to find transition pathways and describe the whole transition process. In this paper, we propose PathFlow that can function as a one-shot generator as well as a transition pathfinder. More specifically, a normalizing flow model is constructed to map the base distribution and linear interpolated path in the latent space to the Boltzmann distribution and a minimum (free) energy path in the configuration space simultaneously. PathFlow can be trained by standard gradient-based optimizers using the proposed gradient estimator with a theoretical guarantee. PathFlow, validated with the extensively studied examples including a synthetic Müller potential and Alanine dipeptide, shows a remarkable performance. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/liu22b.html https://proceedings.mlr.press/v180/liu22b.html Federated online clustering of bandits Contextual multi-armed bandit (MAB) is an important sequential decision-making problem in recommendation systems. A line of works, called the clustering of bandits (CLUB), utilize the collaborative effect over users and dramatically improve the recommendation quality. Owing to the increasing application scale and public concerns about privacy, there is a growing demand to keep user data decentralized and push bandit learning to the local server side. Existing CLUB algorithms, however, are designed under the centralized setting where data are available at a central server. We focus on studying the federated online clustering of bandit (FCLUB) problem, which aims to minimize the total regret while satisfying privacy and communication considerations. We design a new phase-based scheme for cluster detection and a novel asynchronous communication protocol for cooperative bandit learning for this problem. To protect users’ privacy, previous differential privacy (DP) definitions are not very suitable, and we propose a new DP notion that acts on the user cluster level. We provide rigorous proofs to show that our algorithm simultaneously achieves (clustered) DP, sublinear communication complexity and sublinear regret. Finally, experimental evaluations show our superior performance compared with benchmark algorithms. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/liu22a.html https://proceedings.mlr.press/v180/liu22a.html Dimension reduction for high-dimensional small counts with KL divergence Dimension reduction for high-dimensional count data with a large proportion of zeros is an important task in various applications. As a large number of dimension reduction methods rely on the proximity measure, we develop a dissimilarity measure that is well-suited for small counts based on the Kullback-Leibler divergence. We compare the proposed measure with other widely used dissimilarity measures and show that the proposed one has superior discriminative ability when applied to high-dimensional count data having an excess of zeros. Extensive empirical results, on both simulated and publicly-available real-world datasets that contain many zeros, demonstrate that the proposed dissimilarity measure can improve a wide range of dimension reduction methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ling22a.html https://proceedings.mlr.press/v180/ling22a.html Laplace approximated Gaussian process state-space models Gaussian process state-space models describe time series data in a probabilistic and non-parametric manner using a Gaussian process transition function. As inference is intractable, recent methods use variational inference and either rely on simplifying independence assumptions on the approximate posterior or learn the temporal states iteratively. The latter hampers optimization since the posterior over the presence can only be learned once the posterior governing the past has converged. We present a novel inference scheme that applies stochastic variational inference for the Gaussian process posterior and the Laplace approximation on the temporal states. This approach respects the conditional dependencies in the model and, through the Laplace approximation, treats the temporal states jointly, thereby avoiding their sequential learning. Our method is computationally efficient and leads to better calibrated predictions compared to state-of-the art alternatives on synthetic data and on a range of benchmark datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lindinger22a.html https://proceedings.mlr.press/v180/lindinger22a.html AdaCat: Adaptive categorical discretization for autoregressive models Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that the categorical distribution can easily express multiple modes and are straightforward to optimize. However, such approximation cannot express sharp changes in density without using significantly more bins, which makes it parameter inefficient. We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a simple add-on to any discretization-based distribution estimator. In experiments, AdaCat improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22h.html https://proceedings.mlr.press/v180/li22h.html $\ell_∞$-Bounds of the MLE in the BTL Model under General Comparison Graphs The Bradley-Terry-Luce (BTL) model is a popular statistical approach for estimating the global ranking of a collection of items using pairwise comparisons. To ensure accurate ranking, it is essential to obtain precise estimates of the model parameters in the $\ell_{\infty}$-loss. The difficulty of this task depends crucially on the topology of the pairwise comparison graph over the given items. However, beyond very few well-studied cases, such as the complete and Erd{ö}s-R{é}nyi comparison graphs, little is known about the performance of the maximum likelihood estimator (MLE) of the BTL model parameters in the $\ell_{\infty}$-loss under more general graph topologies. In this paper, we derive novel, general upper bounds on the $\ell_{\infty}$ estimation error of the BTL MLE that depend explicitly on the algebraic connectivity of the comparison graph, the maximal performance gap across items and the sample complexity. We demonstrate that the derived bounds perform well and in some cases are sharper compared to known results obtained using different loss functions and more restricted assumptions and graph topologies. We carefully compare our results to Yan et al. (2012), which is closest in spirit to our work. We further provide minimax lower bounds under $\ell_{\infty}$-error that nearly match the upper bounds over a class of sufficiently regular graph topologies. Finally, we study the implications of our $\ell_{\infty}$-bounds for efficient (offline) tournament design. We illustrate and discuss our findings through various examples and simulations. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22g.html https://proceedings.mlr.press/v180/li22g.html A label efficient two-sample test Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed bimodal query is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at https://github.com/wayne0908/Label-Efficient-Two-Sample. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22f.html https://proceedings.mlr.press/v180/li22f.html Efficient resource allocation with fairness constraints in restless multi-armed bandits Restless Multi-Armed Bandits (RMAB) is an apt model to represent decision-making problems in public health interventions (e.g., tuberculosis, maternal, and child care), anti-poaching planning, sensor monitoring, personalized recommendations and many more. Existing research in RMAB has contributed mechanisms and theoretical results to a wide variety of settings, where the focus is on maximizing expected value. In this paper, we are interested in ensuring that RMAB decision making is also fair to different arms while maximizing expected value. In the context of public health settings, this would ensure that different people and/or communities are fairly represented while making public health intervention decisions. To achieve this goal, we formally define the fairness constraints in RMAB and provide planning and learning methods to solve RMAB in a fair manner. We demonstrate key theoretical properties of fair RMAB and experimentally demonstrate that our proposed methods handle fairness constraints without sacrificing significantly on solution quality. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22e.html https://proceedings.mlr.press/v180/li22e.html Proportional allocation of indivisible resources under ordinal and uncertain preferences. We study a fair resource allocation problem with indivisible items. The agents’ preferences over items are assumed to be ordinal and have uncertainties. We adopt stochastic dominance proportionality as our fairness notion and study a sequence of problems related to finding allocations that are fair with a high probability. We provide complexity analysis for each problem and efficient algorithms for some problems. Finally, we propose several heuristic algorithms to find an allocation that is fair with the highest probability. We thoroughly evaluate the performance of the algorithms on both synthetic and real datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22d.html https://proceedings.mlr.press/v180/li22d.html Deep Dirichlet process mixture models In this paper we propose the deep Dirichlet process mixture (DDPM) model, which is an unsupervised method that simultaneously performs clustering and feature learning. The traditional Dirichlet process mixture model can infer the number of mixture components, but its flexibility is restricted since the clustering is performed in the raw feature space. Our method alleviates this limitation by using the flow-based deep neural network to learn more expressive features. DDPM unifies Dirichlet processes and the flow-based model with Monte Carlo expectation-maximization, and uses Gibbs sampling to sample from the posterior. This combination allows our method to exploit the mutually beneficial relation between clustering and feature learning. The effectiveness of DDPM is demonstrated by thorough experiments in various synthetic and real-world datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22c.html https://proceedings.mlr.press/v180/li22c.html PDQ-Net: Deep probabilistic dual quaternion network for absolute pose regression on $SE(3)$ Accurate absolute pose regression is one of the key challenges in robotics and computer vision. Existing direct regression methods suffer from two limitations. First, some noisy scenarios such as poor illumination conditions are likely to result in the uncertainty of pose estimation. Second, the output n-dimensional feature vector in the Euclidean space $\mathbb{R}^n$ cannot be well mapped to $SE(3)$ manifold. In this work, we propose a deep dual quaternion network that performs the absolute pose regression on $SE(3)$. We first develop an antipodally symmetric probability distribution over the unit dual quaternion on $SE(3)$ to model uncertainties and then propose an intermediary differential representation space to replace the final output pose, which avoids the mapping problem from $\mathbb{R}^n$ to $SE(3)$. In addition, we introduce a backpropagation method that considers the continuousness and differentiability of the proposed intermediary space. Extensive experiments on the camera re-localization task on the Cambridge Landmarks and 7-Scenes datasets demonstrate that our method greatly improves the accuracy of the pose as well as the robustness in dealing with uncertainty and ambiguity, compared to the state-of-the-art. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22b.html https://proceedings.mlr.press/v180/li22b.html Solving structured hierarchical games using differential backward induction From large-scale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of structured hierarchical games (SHGs) that formally capture such hierarchical strategic interactions. In an SHG, each player is a node in a tree, and strategic choices of players are sequenced from root to leaves, with root moving first, followed by its children, then followed by their children, and so on until the leaves. A player’s utility in an SHG depends on its own decision, and on the choices of its parent and all the tree leaves. SHGs thus generalize simultaneous-move games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves as well as payoff dependence to develop a gradient-based back propagation-style algorithm, which we call Differential Backward Induction (DBI), for approximating equilibria of SHGs. We provide a sufficient condition for convergence of DBI and demonstrate its efficacy in finding approximate equilibrium solutions to several SHG models of hierarchical policy-making problems. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/li22a.html https://proceedings.mlr.press/v180/li22a.html Recursive Monte Carlo and variational inference with auxiliary variables A key design constraint when implementing Monte Carlo and variational inference algorithms is that it must be possible to cheaply and exactly evaluate the marginal densities of proposal distributions and variational families. This takes many interesting proposals off the table, such as those based on involved simulations or stochastic optimization. This paper broadens the design space, by presenting a framework for applying Monte Carlo and variational inference algorithms when proposal densities cannot be exactly evaluated. Our framework, recursive auxiliary-variable inference (RAVI), instead approximates the necessary densities using meta-inference: an additional layer of Monte Carlo or variational inference, that targets the proposal, rather than the model. RAVI generalizes and uniﬁes several existing methods for inference with expressive approximating families, which we show correspond to speciﬁc choices of meta-inference algorithm, and provides new theory for analyzing their bias and variance. We illustrate RAVI’s design framework and theorems by using them to analyze and improve upon Salimans et al.’s Markov Chain Variational Inference, and to design a novel sampler for Dirichlet process mixtures, achieving state-of-the-art results on a standard benchmark dataset from astronomy and on a challenging datacleaning task with Medicare hospital data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lew22a.html https://proceedings.mlr.press/v180/lew22a.html Fixing the Bethe approximation: How structural modifications in a graph improve belief propagation Belief propagation is an iterative method for inference in probabilistic graphical models. Its well-known relationship to a classical concept from statistical physics, the Bethe free energy, puts it on a solid theoretical foundation. If belief propagation fails to approximate the marginals, then this is often due to a failure of the Bethe approximation. In this work, we show how modifications in a graphical model can be a great remedy for fixing the Bethe approximation. Specifically, we analyze how the removal of edges influences and improves belief propagation, and demonstrate that this positive effect is particularly distinct for dense graphs. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/leisenberger22a.html https://proceedings.mlr.press/v180/leisenberger22a.html Systematized event-aware learning for multi-object tracking We propose an end-to-end online multi-object tracking (MOT) framework with a systematized event-aware loss, which is designed to control possible occurrences in an online MOT situation and compel the tracker to take appropriate actions when such events occur. Training samples from real candidates using a simulation tracker are generated, and a systematized event-aware association matrix is constructed for every frame to enable the tracker to learn the ideal action in a running environment. Several experiments, including ablation studies on various public MOT benchmark datasets, are conducted. The experimental results verify that each event affecting the tracking measure can be controlled, and the proposed method presents optimal results compared with recent state-of-the-art MOT methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lee22a.html https://proceedings.mlr.press/v180/lee22a.html Interpolating between sampling and variational inference with infinite stochastic mixtures Sampling and Variational Inference (VI) are two large families of methods for approximate inference that have complementary strengths. Sampling methods excel at approximating arbitrary probability distributions, but can be inefficient. VI methods are efficient, but may misrepresent the true distribution. Here, we develop a general framework where approximations are stochastic mixtures of simple component distributions. Both sampling and VI can be seen as special cases: in sampling, each mixture component is a delta-function and is chosen stochastically, while in standard VI a single component is chosen to minimize divergence. We derive a practical method that interpolates between sampling and VI by analytically solving an optimization problem over a mixing distribution. Intermediate inference methods then arise by varying a single parameter. Our method provably improves on sampling (reducing variance) and on VI (reducing bias+variance despite increasing variance). We demonstrate our method’s bias/variance trade-off in practice on reference problems, and we compare outcomes to commonly used sampling and VI methods. This work takes a step towards a highly flexible yet simple family of inference methods that combines the complementary strengths of sampling and VI. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lange22a.html https://proceedings.mlr.press/v180/lange22a.html Greedy relaxations of the sparsest permutation algorithm There has been an increasing interest in methods that exploit permutation reasoning to search for directed acyclic causal models, including the “Ordering Search’’ of Teyssier and Kohler and GSP of Solus, Wang and Uhler. We extend the methods of the latter by a permutation-based operation tuck, and develop a class of algorithms, namely GRaSP, that are computationally efficient and pointwise consistent under increasingly weaker assumptions than faithfulness. The most relaxed form of GRaSP outperforms many state-of-the-art causal search algorithms in simulation, allowing efficient and accurate search even for dense graphs and graphs with more than 100 variables. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/lam22a.html https://proceedings.mlr.press/v180/lam22a.html Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy. A robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy compared to a standard classifier trained via vanilla ERM. In this paper, we find that a simple approach of ensembling the standard and robust models, after calibrating on only ID data, outperforms prior state-of-the-art both ID and OOD. On ten natural distribution shift datasets, ID-calibrated ensembles get the best of both worlds: strong ID accuracy of the standard model and OOD accuracy of the robust model. We analyze this method in stylized settings, and identify two important conditions for ensembles to perform well on both ID and OOD: (1) standard and robust models should be calibrated (on ID data, because OOD data is unavailable), (2) OOD has no anticorrelated spurious features. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kumar22a.html https://proceedings.mlr.press/v180/kumar22a.html Hitting times for continuous-time imprecise-Markov chains We study the problem of characterizing the expected hitting times for a robust generalization of continuous-time Markov chains. This generalization is based on the theory of imprecise probabilities, and the models with which we work essentially constitute sets of stochastic processes. Their inferences are tight lower- and upper bounds with respect to variation within these sets. We consider three distinct types of these models, corresponding to different levels of generality and structural independence assumptions on the constituent processes. Our main results are twofold; first, we demonstrate that the hitting times for all three types are equivalent. Moreover, we show that these inferences are described by a straightforward generalization of a well-known linear system of equations that characterizes expected hitting times for traditional time-homogeneous continuous-time Markov chains. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/krak22a.html https://proceedings.mlr.press/v180/krak22a.html Revisiting the general identifiability problem We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add positivity assumption of observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus and consequently the proposed algorithm in [Lee et al., 2019] are not sound. Moreover, adding the assumption will cause the completeness proof in [Lee et al., 2019] to fail. Under positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] through decomposing the general identifiability problem into a series of classical identifiability sub-problems. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kivva22a.html https://proceedings.mlr.press/v180/kivva22a.html On the effectiveness of adversarial training against common corruptions The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve the performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, Lp adversarial training can serve as a strong baseline against common corruptions improving both accuracy and calibration. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the sigma-overfitting phenomenon when Gaussian augmentation overfits to a particular standard deviation used for training which has a significant detrimental effect on common corruption accuracy. We discuss how to alleviate this problem and then how to further enhance Lp adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach does not only improve the Lp adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, DeepAugment, ANT, and SIN, leading to state-of-the-art performance on common corruptions. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kireev22a.html https://proceedings.mlr.press/v180/kireev22a.html Combinatorial Bayesian optimization with random mapping functions to convex polytopes Bayesian optimization is a popular method for solving the problem of global optimization of an expensive-to-evaluate black-box function. It relies on a probabilistic surrogate model of the objective function, upon which an acquisition function is built to determine where next to evaluate the objective function. In general, Bayesian optimization with Gaussian process regression operates on a continuous space. When input variables are categorical or discrete, an extra care is needed. A common approach is to use one-hot encoded or Boolean representation for categorical variables which might yield a combinatorial explosion problem. In this paper we present a method for Bayesian optimization in a combinatorial space, which can operate well in a large combinatorial space. The main idea is to use a random mapping which embeds the combinatorial space into a convex polytope in a continuous space, on which all essential process is performed to determine a solution to the black-box optimization in the combinatorial space. We describe our combinatorial Bayesian optimization algorithm and present its regret analysis. Numerical experiments demonstrate that our method shows satisfactory performance compared to existing methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kim22a.html https://proceedings.mlr.press/v180/kim22a.html Restless and uncertain: Robust policies for restless bandits via deep multi-agent reinforcement learning We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant uncertainty, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret–robust policies for RMABs. Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary “$\lambda$-network” in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest—we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/killian22a.html https://proceedings.mlr.press/v180/killian22a.html Dynamic relocation in ridesharing via fixpoint construction To address spatial imbalances in the supply and demand of drivers, ridesharing platforms can make use of policies to direct driver relocation. We study a simple model of this problem, which allows us to give a constructive characterization of the unique fixpoint of system dynamics. Using this construction, we design a dynamic policy that provides stronger, than previous work, guarantees about its rate of convergence to the fixpoint. Simulations demonstrate the benefits of our approach. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kash22a.html https://proceedings.mlr.press/v180/kash22a.html Improved feature importance computation for tree models based on the Banzhaf value The Shapley value – a fundamental game-theoretic solution concept – has recently become one of the main tools used to explain predictions of tree ensemble models. Another well-known game-theoretic solution concept is the Banzhaf value. Although the Banzhaf value is closely related to the Shapley value, its properties w.r.t. feature attribution have not been understood equally well. This paper shows that, for tree ensemble models, the Banzhaf value offers some crucial advantages over the Shapley value while providing similar feature attributions. In particular, we first give an optimal O(TL + n) time algorithm for computing the Banzhaf value-based attribution of a tree ensemble model’s output. Here, T is the number of trees, L is the maximum number of leaves in a tree, and n is the number of features. In comparison, the state-of-the-art Shapley value-based algorithm runs in O(TLD^2 + n) time, where D denotes the maximum depth of a tree in the ensemble. Next, we experimentally compare the Banzhaf and Shapley values for tree ensemble models. Both methods deliver essentially the same average importance scores for the studied datasets using two different tree ensemble models (the sklearn implementation of Decision Trees or xgboost implementation of Gradient Boosting Decision Trees). However, our results indicate that, on top of being computable faster, the Banzhaf is more numerically robust than the Shapley value. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/karczmarz22a.html https://proceedings.mlr.press/v180/karczmarz22a.html Test for non-negligible adverse shifts Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences when there is in fact adequate sample coverage and predictive performance. We propose instead a framework to detect adverse shifts based on outlier scores, D-SOS for short. D-SOS holds that the new (test) sample is not substantively worse than the reference (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates at varying weighted thresholds. Users can define what worse means in terms of relevant notions of outlyingness, including proxies for predictive performance. Compared to tests of equal distribution, our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation. We show how versatile and practical D-SOS is on a wide range of real and simulated data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kamulete22a.html https://proceedings.mlr.press/v180/kamulete22a.html Optimal control of partially observable Markov decision processes with finite linear temporal logic constraints Autonomous agents often operate in environments where the state is partially observed. In addition to maximizing their cumulative reward, agents must execute complex tasks with rich temporal and logical structures. These tasks can be expressed using temporal logic languages like finite linear temporal logic. This paper, for the first time, provides a structured framework for designing agent policies that maximize the reward while ensuring that the probability of satisfying the temporal logic specification is sufficiently high. We reformulate the problem as a constrained partially observable Markov decision process (POMDP) and provide a novel approach that can leverage off-the-shelf unconstrained POMDP solvers for solving it. Our approach guarantees approximate optimality and constraint satisfaction with high probability. We demonstrate its effectiveness by implementing it on several models of interest. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kalagarla22a.html https://proceedings.mlr.press/v180/kalagarla22a.html Decision-theoretic planning with communication in open multiagent systems In open multiagent systems, the set of agents operating in the environment changes over time and in ways that are nontrivial to predict. For example, if collaborative robots were tasked with fighting wildfires, they may run out of suppressants and be temporarily unavailable to assist their peers. Because an agent’s optimal action depends on the actions of others, each agent must not only predict the actions of its peers, but, before that, reason whether they are even present to perform an action. Addressing openness thus requires agents to model each other’s presence, which can be enhanced through agents communicating about their presence in the environment. At the same time, communicative acts can also incur costs (e.g., consuming limited bandwidth), and thus an agent must tradeoff the benefits of enhanced coordination with the costs of communication. We present a new principled, decision-theoretic method in the context provided by the recent communicative interactive POMDP framework for planning in open agent settings that balances this tradeoff. Simulations of multiagent wildfire suppression problems demonstrate how communication can improve planning in open agent environments, as well as how agents tradeoff the benefits and costs of communication under different scenarios. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/kakarlapudi22a.html https://proceedings.mlr.press/v180/kakarlapudi22a.html If you’ve trained one you’ve trained them all: inter-architecture similarity increases with robustness Previous work has shown that commonly-used metrics for comparing representations between neural networks overestimate similarity due to correlations between data points. We show that intra-example feature correlations also causes significant overestimation of network similarity and propose an image inversion technique to analyze only the features used by a network. With this technique, we find that similarity across architectures is significantly lower than commonly understood, but we surprisingly find that similarity between models with different architectures increases as the adversarial robustness of the models increase. Our findings indicate that robust networks tend toward a universal set of representations, regardless of architecture, and that the robust training criterion is a strong prior constraint on the functions that can be learned by diverse modern architectures. We also find that the representations learned by a robust network of any architecture have an asymmetric overlap with non-robust networks of many architectures, indicating that the representations used by robust neural networks are highly entangled with the representations used by non-robust networks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/jones22a.html https://proceedings.mlr.press/v180/jones22a.html Orthogonal Gromov-Wasserstein discrepancy with efficient lower bound Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspirations from the connection with the quadratic assignment problem, and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient and closed-form lower bound with O(n^3) complexity, and directly extends to the fused Gromov-Wasserstein distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/jin22a.html https://proceedings.mlr.press/v180/jin22a.html Fedvarp: Tackling the variance due to partial client participation in federated learning Data-heterogeneous federated learning (FL) systems suffer from two significant sources of convergence error: 1) client drift error caused by performing multiple local optimization steps at clients, and 2) partial client participation error caused by the fact that only a small subset of the edge clients participate in every training round. We find that among these, only the former has received significant attention in the literature. To remedy this, we propose FedVARP, a novel variance reduction algorithm applied at the server that eliminates error due to partial client participation. To do so, the server simply maintains in memory the most recent update for each client and uses these as surrogate updates for the non-participating clients in every round. Further, to alleviate the memory requirement at the server, we propose a novel clustering-based variance reduction algorithm ClusterFedVARP. Unlike previously proposed methods, both FedVARP and ClusterFedVARP do not require additional computation at clients or communication of additional optimization parameters. Through extensive experiments, we show that FedVARP outperforms state-of-the-art methods, and ClusterFedVARP achieves performance comparable to FedVARP with much less memory requirements. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/jhunjhunwala22a.html https://proceedings.mlr.press/v180/jhunjhunwala22a.html Towards painless policy optimization for constrained MDPs We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework to use coin-betting algorithms and propose the \textbf{Coin Betting Politex (CBP)} algorithm. Assuming that the action-value functions are $\epsilon_{\text{\tiny{b}}}$-close to the span of the $d$-dimensional state-action features and no sampling errors, we prove that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - \gamma)^3 \sqrt{T}} + \frac{\epsilon_{\text{\tiny{b}}} \sqrt{d}}{(1 - \gamma)^2} \right)$ reward sub-optimality and an $O\left(\frac{1}{(1 - \gamma)^2 \sqrt{T}} + \frac{\epsilon_{\text{\tiny{b}}} \sqrt{d}}{1 - \gamma} \right)$ constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/jain22a.html https://proceedings.mlr.press/v180/jain22a.html Balancing utility and scalability in metric differential privacy Metric differential privacy (mDP) is a modification of differential privacy that is more suitable when records can be represented in a general metric space, such as text data represented as word embed- dings or geographical coordinates on a map. We consider the task of releasing elements of the metric space under metric differential privacy where utility is measured as the distance of the released element to the original element. Linear programming (LP) can be used to construct a mechanism that achieves the optimal utility for a particular mDP constraint. However, these LPs suffer from a polynomial explosion of variables and constraints that render them impractical for solving real-world problems. An important question is how to design rigorous mDP mechanisms that balance the utility- scalability tradeoff. Our main contribution is a new method for reducing the LP size used to generate mDP mechanisms by constraining the search space such that certain input and output pairs have transition probabilities derived from the exponential mechanism. Our method produces mDP mechanisms whose LPs are smaller that all prior work in this area. We also provide a lower bound on the best possible mechanism utility. Our experiments on real-world metric spaces highlight the superior utility-scalability tradeoff of our mechanism. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/imola22a.html https://proceedings.mlr.press/v180/imola22a.html Binary independent component analysis: a non-stationarity-based approach We consider independent component analysis of binary data. While fundamental in practice, this case has been much less developed than ICA for continuous data. We start by assuming a linear mixing model in a continuous-valued latent space, followed by a binary observation model. Importantly, we assume that the sources are non-stationary; this is necessary since any non-Gaussianity would essentially be destroyed by the binarization. Interestingly, the model allows for closed-form likelihood by employing the cumulative distribution function of the multivariate Gaussian distribution. In stark contrast to the continuous-valued case, we prove non-identifiability of the model with few observed variables; our empirical results imply identifiability when the number of observed variables is higher. We present a practical method for binary ICA that uses only pairwise marginals, which are faster to compute than the full multivariate likelihood. Experiments give insight into the requirements for the number of observed variables, segments, and latent sources that allow the model to be estimated. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hyttinen22a.html https://proceedings.mlr.press/v180/hyttinen22a.html Quantification of Credal Uncertainty in Machine Learning: A Critical Analysis and Empirical Comparison The representation and quantification of uncertainty has received increasing attention in machine learning in the recent past. The formalism of credal sets provides an interesting alternative in this regard, especially as it combines the representation of epistemic (lack of knowledge) and aleatoric (statistical) uncertainty in a rather natural way. In this paper, we elaborate on uncertainty measures for credal sets from the perspective of machine learning. More specifically, we provide an overview of proposals, discuss existing measures in a critical way, and also propose a new measure that is more tailored to the machine learning setting. Based on an experimental study, we conclude that theoretically well-justified measures also lead to better performance in practice. Besides, we corroborate the difficulty of the disaggregation problem, that is, of decomposing the amount of total uncertainty into aleatoric and epistemic uncertainty in a sound manner, thereby complementing theoretical findings with empirical evidence. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hullermeier22a.html https://proceedings.mlr.press/v180/hullermeier22a.html A mutually exciting latent space Hawkes process model for continuous-time networks Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretable fits than existing models. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/huang22b.html https://proceedings.mlr.press/v180/huang22b.html Uncertainty-aware pseudo-labeling for quantum calculations Machine learning models have recently shown promise in predicting molecular quantum chemical properties. However, the path to real-life adoption requires (1) learning under low-resource constraints and (2) out-of-distribution generalization to unseen, structurally diverse molecules. We observe that these two challenges can be addressed via abundant labels, which is often not the case in quantum chemistry. We hypothesize that pseudo-labeling on a vast array of unlabeled molecules can serve as gold-label proxies to expand the training labeled dataset significantly. The challenge in pseudo-labeling is to prevent the bad pseudo-labels from biasing the model. Motivated by the entropy minimization framework, we develop a simple and effective strategy Pseudo that can assign pseudo-labels, detect bad pseudo-labels through evidential uncertainty, and prevent them from biasing the model using adaptive weighting. Empirically, Pseudo improves quantum calculations accuracy in full data, low data, and out-of-distribution settings. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/huang22a.html https://proceedings.mlr.press/v180/huang22a.html Near-optimal Thompson sampling-based algorithms for differentially private stochastic bandits We address differentially private stochastic bandits. We present two (near)-optimal Thompson Sampling-based learning algorithms: DP-TS and Lazy-DP-TS. The core idea in achieving optimality is the principle of optimism in the face of uncertainty. We reshape the posterior distribution in an optimistic way as compared to the non-private Thompson Sampling. Our DP-TS achieves a $\sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\} )} \log \left(\frac{\log(T)}{\epsilon \cdot \Delta_j} \right) \right)$ regret bound, where $\mathcal{A}$ is the arm set, $\Delta_j$ is the sub-optimality gap of a sub-optimal arm $j$, and $\epsilon$ is the privacy parameter. Our Lazy-DP-TS gets rid of the extra $\log$ factor by using the idea of dropping observations. The regret of Lazy-DP-TS is $ \sum\limits_{j \in \mathcal{A}: \Delta_j > 0} O \left(\frac{\log(T)}{\min \left\{\epsilon, \Delta_j \right\}} \right)$, which matches the regret lower bound. Additionally, we conduct experiments to compare the empirical performance of our proposed algorithms with the existing optimal algorithms for differentially private stochastic bandits. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hu22a.html https://proceedings.mlr.press/v180/hu22a.html CIGMO: Categorical invariant representations in a deep generative framework Data of general object images have two most common structures: (1) each object of a given shape can be rendered in multiple different views, and (2) shapes of objects can be categorized in such a way that the diversity of shapes is much larger across categories than within a category. Existing deep generative models can typically capture either structure, but not both. In this work, we introduce a novel deep generative model, called CIGMO, that can learn to represent category, shape, and view factors from image data. The model is comprised of multiple modules of shape representations that are each specialized to a particular category and disentangled from view representation, and can be learned using a group-based weakly supervised learning method. By empirical investigation, we show that our model can effectively discover categories of object shapes despite large view variation and quantitatively supersede various previous methods including the state-of-the-art invariant clustering algorithm. Further, we show that our approach using category-specialization can enhance the learned shape representation to better perform down-stream tasks such as one-shot object identification as well as shape-view disentanglement. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hosoya22a.html https://proceedings.mlr.press/v180/hosoya22a.html Fast predictive uncertainty for classification with Bayesian deep networks In Bayesian Deep Learning, distributions over the output of classification neural networks are often approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hobbhahn22a.html https://proceedings.mlr.press/v180/hobbhahn22a.html Quadratic metric elicitation for fairness and beyond Metric elicitation is a recent framework for eliciting classification performance metrics that best reflect implicit user preferences based on the task and context. However, available elicitation strategies have been limited to linear (or quasi-linear) functions of predictive rates, which can be practically restrictive for many applications including fairness. This paper develops a strategy for eliciting more flexible multiclass metrics defined by quadratic functions of rates, designed to reflect human preferences better. We show its application in eliciting quadratic violation-based group-fair metrics. Our strategy requires only relative preference feedback, is robust to noise, and achieves near-optimal query complexity. We further extend this strategy to eliciting polynomial metrics – thus broadening the use cases for metric elicitation. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hiranandani22a.html https://proceedings.mlr.press/v180/hiranandani22a.html Learning sparse representations of preferences within Choquet expected utility theory This paper deals with preference elicitation within Choquet Expected Utility (CEU) theory for decision making under uncertainty. We consider the Savage’s framework with a finite set of states and assume that preferences of the Decision Maker over acts are observable. The CEU model involves two parameters that must be tuned to the value system of the decision maker: a set function (capacity) modeling weights attached to events, of size exponential in the number of states, and a utility function defined on the space of outcomes. Our aim is to learn a sparse representation of the CEU model from preference data. We propose and test a preference learning approach based on a spline representation of utilities and the sparse learning of capacities to obtain CEU models achieving a good tradeoff between the aim of sparsity and the expressivity required by preference data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/herin22a.html https://proceedings.mlr.press/v180/herin22a.html Variational multiple shooting for Bayesian ODEs with Gaussian processes Recent machine learning advances have proposed black-box estimation of \textit{unknown continuous-time system dynamics} directly from data. However, earlier works are based on approximative solutions or point estimates. We propose a novel Bayesian nonparametric model that uses Gaussian processes to infer posteriors of unknown ODE systems directly from data. We derive sparse variational inference with decoupled functional sampling to represent vector field posteriors. We also introduce a probabilistic shooting augmentation to enable efficient inference from arbitrarily long trajectories. The method demonstrates the benefit of computing vector field posteriors, with predictive uncertainty scores outperforming alternative methods on multiple ODE learning tasks. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hegde22a.html https://proceedings.mlr.press/v180/hegde22a.html Reinforcement learning in many-agent settings under partial observability Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep RL, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity to a recently presented actor-critic based MARL algorithm, interactive A2C. We introduce a Dirichlet-multinomial model for maintaining beliefs over the agent population when agents’ actions are not perfectly observable. We show that the posterior is a mixture of Dirichlet distributions that we approximate as a single component for tractability. We also show that the prediction accuracy of this method increases with more agents. Finally we show empirically that our method can learn optimal behaviors in two recently introduced pragmatic domains with large agent population, and demonstrates robustness in partially observable environments. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/he22a.html https://proceedings.mlr.press/v180/he22a.html Generalizing off-policy learning under sample selection bias Learning personalized decision policies that generalize to the target population is of great relevance. Since training data is often not representative of the target population, standard policy learning methods may yield policies that do not generalize target population. To address this challenge, we propose a novel framework for learning policies that generalize to the target population. For this, we characterize the difference between the training data and the target population as a sample selection bias using a selection variable. Over an uncertainty set around this selection variable, we optimize the minimax value of a policy to achieve the best worst-case policy value on the target population. In order to solve the minimax problem, we derive an efficient algorithm based on a convex-concave procedure and prove convergence for parametrized spaces of policies such as logistic policies. We prove that, if the uncertainty set is well-specified, our policies generalize to the target population as they can not do worse than on the training data. Using simulated data and a clinical trial, we demonstrate that, compared to standard policy learning methods, our framework improves the generalizability of policies substantially. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hatt22a.html https://proceedings.mlr.press/v180/hatt22a.html Modeling extremes with $d$-max-decreasing neural networks We propose a neural network architecture that enables non-parametric calibration and generation of multivariate extreme value distributions (MEVs). MEVs arise from Extreme Value Theory (EVT) as the necessary class of models when extrapolating a distributional fit over large spatial and temporal scales based on data observed in intermediate scales. In turn, EVT dictates that $d$-max-decreasing, a stronger form of convexity, is an essential shape constraint in the characterization of MEVs. As far as we know, our proposed architecture provides the first class of non-parametric estimators for MEVs that preserve these essential shape constraints. We show that the architecture approximates the dependence structure encoded by MEVs at parametric rate. Moreover, we present a new method for sampling high-dimensional MEVs using a generative model. We demonstrate our methodology on a wide range of experimental settings, ranging from environmental sciences to financial mathematics and verify that the structural properties of MEVs are retained compared to existing methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/hasan22a.html https://proceedings.mlr.press/v180/hasan22a.html Learning a neural Pareto manifold extractor with constraints Multi-objective optimization (MOO) problems require balancing competing objectives, often under constraints. The Pareto optimal solution set defines all possible optimal trade-offs over such objectives. In this work, we present a novel method for Pareto-front learning: inducing the full Pareto manifold at train-time so users can pick any desired optimal trade-off point at run-time. Our key insight is to exploit Fritz-John Conditions for a novel guided double gradient descent strategy. Evaluation on synthetic benchmark problems allows us to vary MOO problem difficulty in controlled fashion and measure accuracy \vs known analytic solutions. We further test scalability and generalization in learning optimal neural model parameterizations for Multi-Task Learning (MTL) on image classification. Results show consistent improvement in accuracy and efficiency over prior MTL methods as well as techniques from operations research. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/gupta22a.html https://proceedings.mlr.press/v180/gupta22a.html Efficient and transferable adversarial examples from bayesian neural networks An established way to improve the transferability of black-box evasion attacks is to craft the adversarial examples on an ensemble-based surrogate to increase diversity. We argue that transferability is fundamentally related to uncertainty. Based on a state-of-the-art Bayesian Deep Learning technique, we propose a new method to efficiently build a surrogate by sampling approximately from the posterior distribution of neural network weights, which represents the belief about the value of each parameter. Our extensive experiments on ImageNet, CIFAR-10 and MNIST show that our approach improves the success rates of four state-of-the-art attacks significantly (up to 83.2 percentage points), in both intra-architecture and inter-architecture transferability. On ImageNet, our approach can reach 94% of success rate while reducing training computations from 11.6 to 2.4 exaflops, compared to an ensemble of independently trained DNNs. Our vanilla surrogate achieves 87.5% of the time higher transferability than three test-time techniques designed for this purpose. Our work demonstrates that the way to train a surrogate has been overlooked, although it is an important element of transfer-based attacks. We are, therefore, the first to review the effectiveness of several training methods in increasing transferability. We provide new directions to better understand the transferability phenomenon and offer a simple but strong baseline for future work. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/gubri22a.html https://proceedings.mlr.press/v180/gubri22a.html Robust expected information gain for optimal Bayesian experimental design using ambiguity sets The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model’s prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze Robust Expected Information Gain(REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a "log-sum-exp" stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/go22a.html https://proceedings.mlr.press/v180/go22a.html Do Bayesian variational autoencoders know what they don’t know? The problem of detecting the Out-of-Distribution (OoD) inputs is of paramount importance for Deep Neural Networks. It has been previously shown that even Deep Generative Models that allow estimating the density of the inputs may not be reliable and often tend to make over-confident predictions for OoDs, assigning to them a higher density than to the in-distribution data. This over-confidence in a single model can be potentially mitigated with Bayesian inference over the model parameters that take into account epistemic uncertainty. This paper investigates three approaches to Bayesian inference: stochastic gradient Markov chain Monte Carlo, Bayes by Backpropagation, and Stochastic Weight Averaging-Gaussian. The inference is implemented over the weights of the deep neural networks that parameterize the likelihood of the Variational Autoencoder. We empirically evaluate the approaches against several benchmarks that are often used for OoD detection: estimation of the marginal likelihood utilizing sampled model ensemble, typicality test, disagreement score, and Watanabe-Akaike Information Criterion. Finally, we introduce two simple scores that demonstrate the state-of-the-art performance. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/glazunov22a.html https://proceedings.mlr.press/v180/glazunov22a.html Neural-progressive hedging: Enforcing constraints in reinforcement learning with stochastic programming We propose a framework, called neural-progressive hedging (NP), that leverages stochastic programming during the online phase of executing a reinforcement learning (RL) policy. The goal is to ensure feasibility with respect to constraints and risk-based objectives such as conditional value-at-risk (CVaR) during the execution of the policy, using probabilistic models of the state transitions to guide policy adjustments. The framework is particularly amenable to the class of sequential resource allocation problems since feasibility with respect to typical resource constraints cannot be enforced in a scalable manner. The NP framework provides an alternative that adds modest overhead during the online phase. Experimental results demonstrate the efficacy of the NP framework on two continuous real-world tasks: (i) the portfolio optimization problem with liquidity constraints for financial planning, characterized by non-stationary state distributions; and (ii) the dynamic repositioning problem in bike sharing systems, that embodies the class of supply-demand matching problems. We show that the NP framework produces policies that are better than deep RL and other baseline approaches, adapting to non-stationarity, whilst satisfying structural constraints and accommodating risk measures in the resulting policies. Additional benefits of the NP framework are ease of implementation and better explainability of the policies. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ghosh22a.html https://proceedings.mlr.press/v180/ghosh22a.html Mitigating statistical bias within differentially private synthetic data Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ghalebikesabi22a.html https://proceedings.mlr.press/v180/ghalebikesabi22a.html Estimating transfer entropy under long ranged dependencies Estimating Transfer Entropy (TE) between time series is a highly impactful problem in fields such as finance and neuroscience. The well-known nearest neighbor estimator of TE potentially fails if temporal dependencies are noisy and long ranged, primarily because it estimates TE indirectly relying on the estimation of joint entropy terms in high dimensions, which is a hard problem in itself. Other estimators, such as those based on Copula entropy or conditional mutual information have similar limitations. Leveraging the successes of modern discriminative models that operate in high dimensional (noisy) feature spaces, we express TE as a difference of two conditional entropy terms, which we directly estimate from conditional likelihoods computed in-sample from any discriminator (timeseries forecaster) trained per maximum likelihood principle. To ensure that the in-sample log likelihood estimates are not overfit to the data, we propose a novel perturbation model based on locality sensitive hash (LSH) functions, which regularizes a discriminative model to have smooth functional outputs within local neighborhoods of the input space. Our estimator is consistent, and its variance reduces linearly in sample size. We also demonstrate its superiority w.r.t. state-of-the-art estimators through empirical evaluations on a synthetic as well as real world datasets from the neuroscience and finance domains. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/garg22a.html https://proceedings.mlr.press/v180/garg22a.html Sequential algorithmic modification with test data reuse After initial release of a machine learning algorithm, the model can be fine-tuned by retraining on subsequently gathered data, adding newly discovered features, or more. Each modification introduces a risk of deteriorating performance and must be validated on a test dataset. It may not always be practical to assemble a new dataset for testing each modification, especially when most modifications are minor or are implemented in rapid succession. Recent work has shown how one can repeatedly test modifications on the same dataset and protect against overfitting by (i) discretizing test results along a grid and (ii) applying a Bonferroni correction to adjust for the total number of modifications considered by an adaptive developer. However, the standard Bonferroni correction is overly conservative when most modifications are beneficial and/or highly correlated. This work investigates more powerful approaches using alpha-recycling and sequentially-rejective graphical procedures (SRGPs). We introduce two novel extensions that account for correlation between adaptively chosen algorithmic modifications: the first leverages the correlation between consecutive modifications using flexible fixed sequence tests, and the second leverages the correlation between the proposed modifications and those generated by a hypothetical prespecified model updating procedure. In empirical analyses, both SRGPs control the error rate of approving deleterious modifications and approve significantly more beneficial modifications than previous approaches. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/feng22a.html https://proceedings.mlr.press/v180/feng22a.html Self-distribution distillation: efficient uncertainty estimation Deep learning is increasingly being applied in safety-critical domains. For these scenarios it is important to know the level of uncertainty in a model’s prediction to ensure appropriate decisions are made by the system. Deep ensembles are the de-facto standard approach to obtaining various measures of uncertainty. However, ensembles often significantly increase the resources required in the training and/or deployment phases. Approaches have been developed that typically address the costs in one of these phases. In this work we propose a novel training approach, self-distribution distillation (S2D), which is able to efficiently train a single model that can estimate uncertainties. Furthermore it is possible to build ensembles of these models and apply hierarchical ensemble distillation approaches. Experiments on CIFAR-100 showed that S2D models outperformed standard models and Monte-Carlo dropout. Additional out-of-distribution detection experiments on LSUN, Tiny ImageNet, SVHN showed that even a standard deep ensemble can be outperformed using S2D based ensembles and novel distilled models. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/fathullah22a.html https://proceedings.mlr.press/v180/fathullah22a.html Implicit kernel meta-learning using kernel integral forms Meta-learning algorithms have made significant progress in the context of meta-learning for image classification but less attention has been given to the regression setting. In this paper we propose to learn the probability distribution representing a random feature kernel that we wish to use within kernel ridge regression (KRR). We introduce two instances of this meta-learning framework, learning a neural network pushforward for a translation-invariant kernel and an affine pushforward for a neural network random feature kernel, both mapping from a Gaussian latent distribution. We learn the parameters of the pushforward by minimizing a meta-loss associated to the KRR objective. Since the resulting kernel does not admit an analytical form, we adopt a random feature sampling approach to approximate it. We call the resulting method Implicit Kernel Meta-Learning (IKML). We derive a meta-learning bound for IKML, which shows the role played by the number of tasks $T$, the task sample size $n$, and the number of random features $M$. In particular the bound implies that $M$ can be the chosen independently of $T$ and only mildly dependent on $n$. We introduce one synthetic and two real-world meta-learning regression benchmark datasets. Experiments on these datasets show that IKML Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/falk22a.html https://proceedings.mlr.press/v180/falk22a.html Temporal abstractions-augmented temporally contrastive learning: An alternative to the Laplacian in RL In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are no so effective. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/erraqabi22a.html https://proceedings.mlr.press/v180/erraqabi22a.html SENTINEL: taming uncertainty with ensemble based distributional reinforcement learning In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrapping and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better, and while used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/eriksson22a.html https://proceedings.mlr.press/v180/eriksson22a.html Learning explainable templated graphical models Templated graphical models (TGMs) encode model structure using rules that capture recurring relationships between multiple random variables. While the rules in TGMs are interpretable, it is not clear how they can be used to generate explanations for the individual predictions of the model. Further, learning these rules from data comes with high computational costs: it typically requires an expensive combinatorial search over the space of rules and repeated optimization over rule weights. In this work, we propose a new structure learning algorithm, Explainable Structured Model Search (ESMS), that learns a templated graphical model and an explanation framework for its predictions. ESMS uses a novel search procedure to efficiently search the space of models and discover models that trade-off predictive accuracy and explainability. We introduce the notion of relational stability and prove that our proposed explanation framework is stable. Further, our proposed piecewise pseudolikelihood (PPLL) objective does not require re-optimizing the rule weights across models during each iteration of the search. In our empirical evaluation on three realworld datasets, we show that our proposed approach not only discovers models that are explainable, but also significantly outperforms existing state-out-the-art structure learning approaches. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/embar22a.html https://proceedings.mlr.press/v180/embar22a.html ResIST: Layer-wise decomposition of ResNets for distributed training We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/dun22a.html https://proceedings.mlr.press/v180/dun22a.html Improving sign-random-projection via count sketch Computing the angular similarity between pairs of vectors is a core part of various machine learning algorithms. The seminal work of Charikar (a.k.a. Sign-Random-Projection (SRP) or SimHash) provides an unbiased estimate for the same. However, SRP suffers from the following limitations: (i) large variance in the similarity estimation, (ii) and high running time while computing the sketch. There are improved variants that address these limitations. However, they are known to improve on only one aspect in their proposal, for e.g. Yu et al. suggest a faster algorithm, Ji et al., Kang and Wong, provide estimates with a smaller variance. In this work, we propose a sketching algorithm that addresses both aspects in one algorithm – a faster algorithm along with a smaller variance in the similarity estimation. Moreover, our algorithm is space-efficient as well. We present a rigorous theoretical analysis of our proposal and complement it via experiments on synthetic and real-world datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/dubey22a.html https://proceedings.mlr.press/v180/dubey22a.html X-MEN: guaranteed XOR-maximum entropy constrained inverse reinforcement learning Inverse Reinforcement Learning (IRL) is a powerful way of learning from demonstrations. In this paper, we address IRL problems with the availability of prior knowledge that optimal policies will never violate certain constraints. Conventional approaches ignoring these constraints need many demonstrations to converge. We propose XOR-Maximum Entropy Constrained Inverse Reinforcement Learning (X-MEN), which is guaranteed to converge to the global optimal reward function in linear rate w.r.t. the number of learning iterations. X-MEN embeds XOR-sampling – a provable sampling approach which transforms the #-P complete sampling problem into queries to NP oracles – into the framework of maximum entropy IRL. X-MEN also guarantees the learned IRL agent will never generate trajectories that violate constraints. Empirical results in navigation demonstrate that X-MEN converges faster to the optimal rewards compared to baseline approaches and always generates trajectories that satisfy multi-state combinatorial constraints. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ding22a.html https://proceedings.mlr.press/v180/ding22a.html Revisiting DP-Means: fast scalable algorithms via parallelism and delayed cluster creation DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly-efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/dinari22b.html https://proceedings.mlr.press/v180/dinari22b.html Variational- and metric-based deep latent space for out-of-distribution detection One popular deep-learning approach for the task of Out-Of-Distribution (OOD) detection is based on thresholding the values of per-class Gaussian likelihood of deep features. However, two issues arise with that approach: first, the distributions are often far from being Gaussian; second, many OOD data points fall within the effective support of the known classes’ Gaussians. Thus, either way it is hard to find a good threshold. In contrast, our proposed solution for OOD detection is based on a new latent space where: 1) each known class is well captured by a nearly-isotropic Gaussian; 2) those Gaussians are far from each other and from the origin of the space (together, these properties effectively leave the area around the origin free for OOD data). Concretely, given a (possibly-trained) backbone deep net of choice, we use it to train a conditional variational model via a Kullback Leibler loss, a triplet loss, and a new distancing loss that pushes classes away from each other. During inference, the class-dependent log-likelihood values of a deep feature ensemble of the test point are also weighted based on reconstruction errors, improving further the decision rule. Experiments on popular benchmarks show that our method yields state-of-the-art results, a feat achieved despite the fact that, unlike some competitors, we make no use of OOD data for training or hyperparameter tuning. Our code is available at \url{https://github.com/BGU-CS-VIL/vmdls}. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/dinari22a.html https://proceedings.mlr.press/v180/dinari22a.html Balancing adaptability and non-exploitability in repeated games We study the problem of adaptability in repeated games: simultaneously guaranteeing low regret for several classes of opponents. We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some “fair” value. Our solution is an expert algorithm (LAFF), which searches within a set of sub-algorithms that are optimal for each opponent class, and punishes evidence of exploitation by switching to a policy that enforces a fair solution. With benchmarks that depend on the opponent class, we first show that LAFF has sublinear regret uniformly over these classes. Second, we show that LAFF discourages exploitation, because exploitative opponents have linear regret. To our knowledge, this work is the first to provide guarantees for both regret and non-exploitability in multi-agent learning. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/digiovanni22a.html https://proceedings.mlr.press/v180/digiovanni22a.html Multiclass classification for Hawkes processes We investigate the multiclass classification prob- lem where the features are event sequences. More precisely, the data are assumed to be generated by a mixture of simple linear Hawkes processes. In this new setting, the classes are discriminated by various triggering kernels. A challenge is then to build an efficient classification procedure. We de- rive the optimal Bayes rule and provide a two-step estimation procedure of the Bayes classifier. In the first step, the weights of the mixture are estimated; in the second step, an empirical risk minimization procedure is performed to estimate the parameters of the Hawkes processes. We establish the consis- tency of the resulting procedure and derive rates of convergence. Finally, the numerical properties of the data-driven algorithm are illustrated through a simulation study where the triggering kernels are assumed to belong to the popular parametric expo- nential family. It highlights the accuracy and the robustness of the proposed algorithm. In particular, even if the underlying kernels are misspecified, the procedure exhibits good performance. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/denis22a.html https://proceedings.mlr.press/v180/denis22a.html Bayesian spillover graphs for dynamic networks We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter h allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/deng22a.html https://proceedings.mlr.press/v180/deng22a.html Bayesian structure learning with generative flow networks In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/deleu22a.html https://proceedings.mlr.press/v180/deleu22a.html Multi-objective Bayesian optimization over high-dimensional search spaces Many real world scientific and industrial applications require optimizing multiple competing black-box objectives. When the objectives are expensive-to-evaluate, multi-objective Bayesian optimization (BO) is a popular approach because of its high sample efficiency. However, even with recent methodological advances, most existing multi-objective BO methods perform poorly on search spaces with more than a few dozen parameters and rely on global surrogate models that scale cubically with the number of observations. In this work we propose MORBO, a scalable method for multi-objective BO over high-dimensional search spaces. MORBO identifies diverse globally optimal solutions by performing BO in multiple local regions of the design space in parallel using a coordinated strategy. We show that MORBO significantly advances the state-of-the-art in sample efficiency for several high-dimensional synthetic problems and real world applications, including an optical display design problem and a vehicle design problem with 146 and 222 parameters, respectively. On these problems, where existing BO algorithms fail to scale and perform well, MORBO provides practitioners with order-of-magnitude improvements in sample efficiency over the current approach. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/daulton22a.html https://proceedings.mlr.press/v180/daulton22a.html Faster non-convex federated learning via global and local momentum We propose \texttt{FedGLOMO}, a novel federated learning (FL) algorithm with an iteration complexity of $\mathcal{O}(\epsilon^{-1.5})$ to converge to an $\epsilon$-stationary point (i.e., $\mathbb{E}[\|\nabla f(x)\|^2] \leq \epsilon$) for smooth non-convex functions – under arbitrary client heterogeneity and compressed communication – compared to the $\mathcal{O}(\epsilon^{-2})$ complexity of most prior works. Our key algorithmic idea that enables achieving this improved complexity is based on the observation that the convergence in FL is hampered by two sources of high variance: (i) the global server aggregation step with multiple local updates, exacerbated by client heterogeneity, and (ii) the noise of the local client-level stochastic gradients. The first issue is particularly detrimental to FL algorithms that perform plain averaging at the server. By modeling the server aggregation step as a generalized gradient-type update, we propose a variance-reducing momentum-based global update at the server, which when applied in conjunction with variance-reduced local updates at the clients, enables \texttt{FedGLOMO} to enjoy an improved convergence rate. Our experiments illustrate the intrinsic variance reduction effect of \texttt{FedGLOMO}, which implicitly suppresses client-drift in heterogeneous data distribution settings and promotes communication efficiency. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/das22b.html https://proceedings.mlr.press/v180/das22b.html Individual fairness in feature-based pricing for monopoly markets We study fairness in the context of feature-based price discrimination in monopoly markets. We propose a new notion of individual fairness, namely, \alpha-fairness, which guarantees that individuals with similar features face similar prices. First, we study discrete valuation space and give an analytical solution for optimal fair feature-based pricing. We show that the cost of fair pricing is defined as the ratio of expected revenue in an optimal feature-based pricing to the expected revenue in an optimal fair feature-based pricing (CoF) can be arbitrarily large in general. When the revenue function is continuous and concave with respect to the prices, we show that one can achieve CoF strictly less than 2, irrespective of the model parameters. Finally, we provide an algorithm to compute fair feature-based pricing strategy that achieves this CoF. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/das22a.html https://proceedings.mlr.press/v180/das22a.html On provably robust meta-Bayesian optimization Bayesian optimization (BO) has become popular for sequential optimization of black-box functions. When BO is used to optimize a target function, we often have access to previous evaluations of potentially related functions. This begs the question as to whether we can leverage these previous experiences to accelerate the current BO task through meta-learning (meta-BO), while ensuring robustness against potentially harmful dissimilar tasks that could sabotage the convergence of BO. This paper introduces two scalable and provably robust meta-BO algorithms: robust meta-Gaussian process-upper confidence bound (RM-GP-UCB) and RM-GP-Thompson sampling (RM-GP-TS). We prove that both algorithms are asymptotically no-regret even when some or all previous tasks are dissimilar to the current task, and show that RM-GP-UCB enjoys a better theoretical robustness than RM-GP-TS. We also exploit the theoretical guarantees to optimize the weights assigned to individual previous tasks through regret minimization via online learning, which diminishes the impact of dissimilar tasks and hence further enhances the robustness. Empirical evaluations show that (a) RM-GP-UCB performs effectively and consistently across various applications, and (b) RM-GP-TS, despite being less robust than RM-GP-UCB both in theory and in practice, performs competitively in some scenarios with less dissimilar tasks and is more computationally efficient. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/dai22a.html https://proceedings.mlr.press/v180/dai22a.html Variational message passing neural network for Maximum-A-Posteriori (MAP) inference Maximum-A-Posteriori (MAP) inference is a fundamental task in probabilistic inference and belief propagation (BP) is a widely used algorithm for MAP inference. Though BP has been applied successfully to many different fields, it offers no performance guarantee and often performs poorly on loopy graphs. To improve the performance on loopy graphs and to scale up to large graphs, we propose a variational message passing neural network (V-MPNN), where we leverage both the power of neural networks in modeling complex functions and the well-established algorithmic theories on variational belief propagation. Instead of relying on a hand-crafted variational assumption, we propose a neural-augmented free energy where a general variational distribution is parameterized through a neural network. A message passing neural network is utilized for the minimization of neural-augmented free energy. Training of the MPNN is thus guided by neural-augmented free energy, without requiring exact MAP configurations as annotations. We empirically demonstrate the effectiveness of the proposed V-MPNN by comparing against both state-of-the-art training-free methods and training-based methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/cui22a.html https://proceedings.mlr.press/v180/cui22a.html Counterfactual inference of second Opinions Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources—they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/corvelo-benz22a.html https://proceedings.mlr.press/v180/corvelo-benz22a.html Greedy equivalence search in the presence of latent confounders We investigate Greedy PAG Search (GPS) for score-based causal discovery over equivalence classes, similar to the famous Greedy Equivalence Search algorithm, except now in the presence of latent confounders. It is based on a novel characterization of Markov equivalence classes for MAGs, that not only improves state-of-the-art identification of Markov equivalence between MAGs to linear time complexity for sparse graphs, but also allows for efficient traversal over equivalence classes in the space of all MAGs. The resulting GPS algorithm is evaluated against several existing alternatives and found to show promising performance, both in terms of speed and accuracy. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/claassen22a.html https://proceedings.mlr.press/v180/claassen22a.html Cyclic test time augmentation with entropy weight method In the recent studies of data augmentation of neural networks, the application of test time augmentation has been studied to extract optimal transformation policies to enhance performance with minimum cost. The policy search method with the best level of input data dependency involves training a loss predictor network to estimate suitable transformations for each of the given input image in independent manner, resulting in instance-level transformation extraction. In this work, we propose a method to utilize and modify the loss prediction pipeline to further improve the performance with the cyclic search for suitable transformations and the use of the entropy weight method. The cyclic usage of the loss predictor allows refining each input image with multiple transformations with a more flexible transformation magnitude. For cases where multiple augmentations are generated, we implement the entropy weight method to reflect the data uncertainty of each augmentation to force the final result to focus on augmentations with low uncertainty. The experimental results show convincing qualitative outcomes and robust performance for the corrupted conditions of data. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chun22a.html https://proceedings.mlr.press/v180/chun22a.html A geometric method for improved uncertainty estimation in real-time Machine learning classifiers are probabilistic in nature, and thus inevitably involve uncertainty. Predicting the probability of a specific input to be correct is called uncertainty (or confidence) estimation and is crucial for risk management.Post-hoc model calibrations can improve models’ uncertainty estimations without the need for retraining, and without changing the model.Our work puts forward a geometric-based approach for uncertainty estimation. Roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty and then calibrate that signal (instead of the model’s estimation) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimations than recently proposed approaches by extensively evaluating multiple datasets and models. In addition, we also demonstrate the possibility of performing our approach in near real-time applications. Our code is available at our Github: https://github.com/NoSleepDeveloper/Geometric-Calibrator Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chouraqui22a.html https://proceedings.mlr.press/v180/chouraqui22a.html Combating the instability of mutual information-based losses via regularization Notable progress has been made in numerous fields of machine learning based on neural network-driven mutual information (MI) bounds. However, utilizing the conventional MI-based losses is often challenging due to their practical and mathematical limitations. In this work, we first identify the symptoms behind their instability: (1) the neural network not converging even after the loss seemed to converge, and (2) saturating neural network outputs causing the loss to diverge. We mitigate both issues by adding a novel regularization term to the existing losses. We theoretically and experimentally demonstrate that added regularization stabilizes training. Finally, we present a novel benchmark that evaluates MI-based losses on both the MI estimation power and its capability on the downstream tasks, closely following the pre-existing supervised and contrastive learning settings. We evaluate six different MI-based losses and their regularized counterparts on multiple benchmarks to show that our approach is simple yet effective. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/choi22a.html https://proceedings.mlr.press/v180/choi22a.html Feature selection for discovering distributional treatment effect modifiers Finding the features relevant to the difference in treatment effects is essential to unveil the underlying causal mechanisms. Existing methods seek such features by measuring how greatly the feature attributes affect the degree of the {\it conditional average treatment effect} (CATE). However, these methods may overlook important features because CATE, a measure of the average treatment effect, cannot detect differences in distribution parameters other than the mean (e.g., variance). To resolve this weakness of existing methods, we propose a feature selection framework for discovering {\it distributional treatment effect modifiers}. We first formulate a feature importance measure that quantifies how strongly the feature attributes influence the discrepancy between potential outcome distributions. Then we derive its computationally efficient estimator and develop a feature selection algorithm that can control the type I error rate to the desired level. Experimental results show that our framework successfully discovers important features and outperforms the existing mean-based method. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chikahara22a.html https://proceedings.mlr.press/v180/chikahara22a.html Greedy modality selection via approximate submodular maximization Multimodal learning considers learning from multi-modality data, aiming to fuse heterogeneous sources of information. However, it is not always feasible to leverage all available modalities due to memory constraints. Further, training on all the modalities may be inefficient when redundant information exists within data, such as different subsets of modalities providing similar performance. In light of these challenges, we study modality selection, intending to efficiently select the most informative and complementary modalities under certain computational constraints. We formulate a theoretical framework for optimizing modality selection in multimodal learning and introduce a utility measure to quantify the benefit of selecting a modality. For this optimization problem, we present efficient algorithms when the utility measure exhibits monotonicity and approximate submodularity. We also connect the utility measure with existing Shapley-value-based feature importance scores. Last, we demonstrate the efficacy of our algorithm on synthetic (Patch-MNIST) and real-world (PEMS-SF, CMU-MOSI) datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/cheng22a.html https://proceedings.mlr.press/v180/cheng22a.html Offline reinforcement learning under value and density-ratio realizability: The power of gaps We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work which we compare in detail). Under an additional gap assumption, we provide guarantees to a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22g.html https://proceedings.mlr.press/v180/chen22g.html On the definition and computation of causal treewidth Causal treewidth is a recently introduced notion allowing one to speed up Bayesian network inference and to bound its complexity in the presence of functional dependencies (causal mechanisms) whose identities are unknown. Causal treewidth is no greater than treewidth and can be bounded even when treewidth is unbounded. The utility of causal treewidth has been illustrated recently in the context of causal inference and model-based supervised learning. However, the current definition of causal treewidth is descriptive rather than perspective, therefore limiting its full exploitation in a practical setting. We provide an extensive study of causal treewidth in this paper which moves us closer to realizing the full computational potential of this notion both theoretically and practically. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22f.html https://proceedings.mlr.press/v180/chen22f.html Shoring up the foundations: fusing model embeddings and weak supervision Foundation models offer an exciting new paradigm for constructing models with out-of-the-box embeddings and a few labeled examples. However, it is not clear how to best apply foundation models without labeled data. A potential approach is to fuse foundation models with weak supervision frameworks, which use weak label sources—pre-trained models, heuristics, crowd-workers—to construct pseudolabels. The challenge is building a combination that best exploits the signal available in both foundation models and weak sources. We propose LIGER, a combination that uses foundation model embeddings to improve two crucial elements of existing weak supervision techniques. First, we produce finer estimates of weak source quality by partitioning the embedding space and learning per-part source accuracies. Second, we improve source coverage by extending source votes in embedding space. Despite the black-box nature of foundation models, we prove results characterizing how our approach improves performance and show that lift scales with the smoothness of label distributions in embedding space. On six benchmark NLP and video tasks, LIGER outperforms vanilla weak supervision by 14.1 points, weakly-supervised kNN and adapters by 11.8 points, and kNN and adapters supervised by traditional hand labels by 7.2 points. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22e.html https://proceedings.mlr.press/v180/chen22e.html Sublinear time algorithms for greedy selection in high dimensions Greedy selection is a widely used idea for solving many machine learning problems. But greedy selection algorithms often have high complexities and thus may be prohibitive for large-scale data. In this paper, we consider two fundamental optimization problems in machine learning: k-center clustering and convex hull approximation, where they both can be solved via greedy selection. We propose sublinear time algorithms for them through combining the strategies of randomization and greedy selection. Our results are similar in spirit to the linear time stochastic greedy selection algorithms for submodular maximization, but with several important differences. Our runtimes are independent of the number of input data items n. In particular, our runtime for k-center clustering significantly improves upon that of the uniform sampling approach, especially when the dimensionality is high. Our sublinear algorithms can also reduce the computational complexities for various applications, such as data selection and compression, active learning, and topic modeling, etc. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22d.html https://proceedings.mlr.press/v180/chen22d.html Knowledge representation combining quaternion path integration and depth-wise atrous circular convolution Knowledge models endeavor to improve representation and feature extraction capabilities while keeping low computational cost. Firstly, existing embedding models in hypercomplex spaces of non-Abelian group are optimized. Then a method for fast quaternion multiplication is proposed with proof, with which path semantics are computed and further integrated with the attention mechanism based on the idea semantic extraction of relation sequences could be regarded as a multiple rotational blending problem. A depth-wise atrous circular convolution framework is set up for better feature extraction. Experiments including Link Prediction and Path Query are conducted on benchmark datasets verifying our model holds advantages over state-of-the-art models like Rotate3D. Moreover, the model is tested on a biomedical dataset simulating real-world applications. An ablation study is also performed to explore the effectiveness of different components. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22c.html https://proceedings.mlr.press/v180/chen22c.html Stackmix: a complementary mix algorithm Techniques combining multiple images as input/output have proven to be effective data augmentations for training convolutional neural networks. In this paper, we present StackMix: each input is presented as a concatenation of two images, and the label is the mean of the two one-hot labels. On its own, StackMix rivals other widely used methods in the “Mix” line of work. More importantly, unlike previous work, significant gains across a variety of benchmarks are achieved by combining StackMix with existing Mix augmentation, effectively mixing more than two images. E.g., by combining StackMix with CutMix, test error in the supervised setting is improved across a variety of settings over CutMix, including 0.8% on ImageNet, 3% on Tiny ImageNet, 2% on CIFAR-100, 0.5% on CIFAR-10, and 1.5% on STL-10. Similar results are achieved with Mixup. We further show that gains hold for robustness to common input corruptions and perturbations at varying severities with a 0.7% improvement on CIFAR-100-C, by combining StackMix with AugMix over AugMix. On its own, improvements with StackMix hold across different number of labeled samples on CIFAR-100, maintaining approximately a 2% gap in test accuracy –down to using only 5% of the whole dataset– and is effective in the semi-supervised setting with a 2% improvement with the standard benchmark Pi-model. Finally, we perform an extensive ablation study to better understand the proposed methodology. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22b.html https://proceedings.mlr.press/v180/chen22b.html A competitive analysis of online failure-aware assignment Motivated by a new generation of Internet advertising that has emerged in the live streaming e-commerce markets (e.g., Tiktok) over the past five years, we study a variant of online bipartite matching problem: advertisers send ad requests to influencers (aka, key opinion leaders) on a social media platform. Each influencer has a maximum number of ad requests she can accommodate. We assign a fixed number of influencers to an advertiser when she enters the platform. The advertiser then matches with each of the assigned influencers with a probability, which can be thought of as a set of negotiations between the advertiser and the set of assigned influencers. Unlike the standard online assignment problems, the outcome of any of these matches is not revealed throughout the session (negotiations take time). Our goal is to maximize the expected number of matches between advertisers and influencers. We put forward a new deterministic algorithm with a competitive ratio of $1/2$ and prove that no deterministic algorithm can achieve a better competitive ratio. We also show that the competitive ratio can be improved when randomness is allowed. We then study a setting where a match is successful with either probability 0 or a fixed $p$. We present an optimal randomized algorithm that achieves a competitive ratio of $1-1/e$ in this setting. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chen22a.html https://proceedings.mlr.press/v180/chen22a.html The optimal noise in noise-contrastive learning is not what you think Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet, such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion. This setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution is different from the data’s and even from a different family. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chehab22a.html https://proceedings.mlr.press/v180/chehab22a.html Privacy-aware compression for federated data analysis Federated data analytics is a framework for distributed data analysis where a server compiles noisy responses from a group of distributed low-bandwidth user devices to estimate aggregate statistics. Two major challenges in this framework are privacy, since user data is often sensitive, and compression, since the user devices have low network bandwidth. Prior work has addressed these challenges separately by combining standard compression algorithms with known privacy mechanisms. In this work, we take a holistic look at the problem and design a family of privacy-aware compression mechanisms that work for any given communication budget. We first propose a mechanism for transmitting a single real number that has optimal variance under certain conditions. We then show how to extend it to metric differential privacy for location privacy use-cases, as well as vectors, for application to federated learning. Our experiments illustrate that our mechanism can lead to better utility vs. compression trade-offs for the same privacy loss in a number of settings. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chaudhuri22a.html https://proceedings.mlr.press/v180/chaudhuri22a.html Capturing actionable dynamics with structured latent ordinary differential equations End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/chapfuwa22a.html https://proceedings.mlr.press/v180/chapfuwa22a.html Active approximately metric-fair learning Existing studies on individual fairness focus on the passive setting and typically require $O(\frac{1}{\varepsilon^2})$ labeled instances to achieve an $\varepsilon$ bias budget. In this paper, we build on the elegant Approximately Metric-Fair (AMF) learning framework and propose an active AMF learner that can provably achieve the same budget with only $O(\log \frac{1}{\varepsilon})$ labeled instances. To our knowledge, this is a first and substantial improvement of the existing sample complexity for achieving individual fairness. Through experiments on three data sets, we show the proposed active AMF learner improves fairness on linear and non-linear models more efficiently than its passive counterpart as well as state-of-the-art active learners, while maintaining a comparable accuracy. To facilitate algorithm design and analysis, we also design a provably equivalent form of the approximate metric fairness based on uniform continuity instead of the existing almost Lipschitz continuity. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/cao22a.html https://proceedings.mlr.press/v180/cao22a.html AUTM flow: atomic unrestricted time machine for monotonic normalizing flows Nonlinear monotone transformations are used extensively in normalizing flows to construct invertible triangular mappings from simple distributions to complex ones. In existing literature, monotonicity is usually enforced by restricting function classes or model parameters and the inverse transformation is often approximated by root-finding algorithms as a closed-form inverse is unavailable. In this paper, we introduce a new integral-based approach termed: Atomic Unrestricted Time Machine (AUTM), equipped with unrestricted integrands and easy-to-compute explicit inverse. AUTM offers a versatile and efficient way to the design of normalizing flows with explicit inverse and unrestricted function classes or parameters. Theoretically, we present a constructive proof that AUTM is universal: all monotonic normalizing flows can be viewed as limits of AUTM flows. We provide a concrete example to show how to approximate any given monotonic normalizing flow using AUTM flows with guaranteed convergence. The result implies that AUTM can be used to transform an existing flow into a new one equipped with explicit inverse and unrestricted parameters. The performance of the new approach is evaluated on high dimensional density estimation, variational inference and image generation. Experiments demonstrate superior speed and memory efficiency of AUTM. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/cai22a.html https://proceedings.mlr.press/v180/cai22a.html On the inductive bias of neural networks for learning read-once DNFs Learning functions over Boolean variables is a fundamental problem in machine learning. But not much is known about learning such functions using neural networks. Here we focus on learning read-once disjunctive normal forms (DNFs) under the uniform distribution with a convex neural network and gradient methods. We first observe empirically that gradient methods converge to compact solutions with neurons that are aligned with the terms of the DNF. This is despite the fact that there are many zero training error networks that do not have this property. Thus, the learning process has a clear inductive bias towards such logical formulas. Following recent results which connect the inductive bias of gradient flow (GF) to Karush-Kuhn-Tucker (KKT) points of minimum norm problems, we study these KKT points in our setting. We prove that zero training error solutions that memorize training points are not KKT points and therefore GF cannot converge to them. On the other hand, we prove that globally optimal KKT points correspond exactly to networks that are aligned with the DNF terms. These results suggest a strong connection between the inductive bias of GF and solutions that align with the DNF. We conclude with extensive experiments which verify our findings. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bronstein22a.html https://proceedings.mlr.press/v180/bronstein22a.html On-the-fly adaptation of patrolling strategies in changing environments We consider the problem of efficient patrolling strategy adaptation in a changing environment where the topology of Defender’s moves and the importance of guarded targets change unpredictably. The Defender must instantly switch to a new strategy optimized for the new environment, not disrupting the ongoing patrolling task, and the new strategy must be computed promptly under all circumstances. Since strategy switching may cause unintended security risks compromising the achieved protection, our solution includes mechanisms for detecting and mitigating this problem. The efficiency of our framework is evaluated experimentally. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/brazdil22a.html https://proceedings.mlr.press/v180/brazdil22a.html Lifting in multi-agent systems under uncertainty A decentralised partially observable Markov decision problem (DecPOMDP) formalises collaborative multi-agent decision making. A solution to a DecPOMDP is a joint policy for the agents, fulfilling an optimality criterion such as maximum expected utility. A crux is that the problem is intractable regarding the number of agents. Inspired by lifted inference, this paper examines symmetries within the agent set for a potential tractability. Specifically, this paper contributes (i) specifications of counting and isomorphic symmetries, (ii) a compact encoding of symmetric DecPOMDPs as partitioned DecPOMDPs, and (iii) a formal analysis of complexity and tractability. This works allows tractability in terms of agent numbers and a new query type for isomorphic DecPOMDPs. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/braun22a.html https://proceedings.mlr.press/v180/braun22a.html Information theoretic approach to detect collusion in multi-agent games Collusion in a competitive multi-agent game occurs when two or more agents co-operate covertly to the disadvantage of others. Most competitive multi-agent games do not allow players to share information and explicitly prohibit collusion. In this paper, we present a novel way of detecting collusion using a domain-independent information-theoretic approach. Specifically, we show that the use of mutual information between actions of the agents provides a good indication of collusive behavior. Our experiments show that our method can detect varying levels of collusion in repeated simultaneous games like iterated Rock Paper Scissors. We further extend the detection to partially observable sequential games like poker and show the effectiveness of our methodology. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bonjour22a.html https://proceedings.mlr.press/v180/bonjour22a.html Robustness of model predictions under extension Mathematical models of the real world are simplified representations of complex systems. A caveat to using mathematical models is that predicted causal effects and conditional independences may not be robust under model extensions, limiting applicability of such models. In this work, we consider conditions under which qualitative model predictions are preserved when two models are combined. Under mild assumptions, we show how to use the technique of causal ordering to efficiently assess the robustness of qualitative model predictions. We also characterize a large class of model extensions that preserve qualitative model predictions. For dynamical systems at equilibrium, we demonstrate how novel insights help to select appropriate model extensions and to reason about the presence of feedback loops. We illustrate our ideas with a viral infection model with immune responses. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/blom22a.html https://proceedings.mlr.press/v180/blom22a.html On testability of the front-door model via Verma constraints The front-door criterion can be used to identify and compute causal effects despite the existence of unmeasured confounders between a treatment and outcome. However, the key assumptions – (i) the existence of a variable (or set of variables) that fully mediates the effect of the treatment on the outcome, and (ii) which simultaneously does not suffer from similar issues of confounding as the treatment-outcome pair – are often deemed implausible. This paper explores the testability of these assumptions. We show that under mild conditions involving an auxiliary variable, the assumptions encoded in the front-door model (and simple extensions of it) may be tested via generalized equality constraints a.k.a Verma constraints. We propose two goodness-of-fit tests based on this observation, and evaluate the efficacy of our proposal on real and synthetic data. We also provide theoretical and empirical comparisons to instrumental variable approaches to handling unmeasured confounding. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bhattacharya22a.html https://proceedings.mlr.press/v180/bhattacharya22a.html Offline change detection under contamination In this work, we propose a non-parametric and robust change detection algorithm to detect multiple change points in time series data under non-adversarial contamination. The algorithm is designed for the offline setting, where the objective is to detect changes when all data are received. We only make weak moment assumptions on the inliers (uncorrupted data) to handle a large class of distributions. The robust scan statistic in the change detection algorithm is fashioned using mean estimators based on influence functions. We establish the consistency of the estimated change point indexes as the number of samples increases, and provide empirical evidence to support the consistency results. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bhatt22a.html https://proceedings.mlr.press/v180/bhatt22a.html Identifying near-optimal decisions in linear-in-parameter bandit models with continuous decision sets We consider an online optimization problem in a bandit setting in which a learner chooses decisions from a continuous decision set at discrete decision epochs, and receives noisy rewards from the environment in response. While the noise samples are assumed to be independent and sub-Gaussian, the mean reward at each epoch is a fixed but unknown linear function of a feature vector, which depends on the decision through a known (and possibly nonlinear) feature map. We study the problem within the framework of best-arm identification with fixed confidence, and provide a template algorithm for approximately learning the optimal decision in a probably approximately correct (PAC) setting. More precisely, the template algorithm samples the decision space till a stopping condition is met, and returns a subset of decisions such that, with the required confidence, every element of the subset is approximately optimal for the unknown mean reward function. We provide a sample complexity bound for the template algorithm and then specialize it to the case where the mean-reward function is a univariate polynomial of a single decision variable. We provide an implementable algorithm for this case by explicitly instantiating all the steps in the template algorithm. Finally, we provide experimental results to demonstrate the efficacy of our algorithms. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bhat22a.html https://proceedings.mlr.press/v180/bhat22a.html Learning soft interventions in complex equilibrium systems Complex systems often contain feedback loops that can be described as cyclic causal models. Intervening in such systems may lead to counterintuitive effects, which cannot be inferred directly from the graph structure. After establishing a framework for differentiable soft interventions based on Lie groups, we take advantage of modern automatic differentiation techniques and their application to implicit functions in order to optimize interventions in cyclic causal models. We illustrate the use of this framework by investigating scenarios of transition to sustainable economies. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/besserve22a.html https://proceedings.mlr.press/v180/besserve22a.html On early extinction and the effect of travelling in the SIR model We consider a population protocol version of the SIR model. In every round, an individual is chosen uniformly at random. If the individual is susceptible, then it becomes infected w.p. $\beta I_t/N$, where $I_t$ is the number of infections at time $t$ and $N$ is the total number of individuals. If the individual is infected, then it recovers w.p. $\gamma$, whereas, if the individual is already recovered, nothing happens. We prove sharp bounds on the probability of the disease becoming pandemic vs extinguishing early (dying out quickly). The probability of extinguishing early, $\Pr{\mathcal{E}_{ext}}$, is typically neglected in prior work since most use (deterministic) differential equations. Leveraging on this, using $\Pr{\mathcal{E}_{ext}}$, we proceed by bounding the expected size of the population that contracts the disease $\mathbf{E}\left[R_\infty\right]$. Prior work only calculated $\mathbf{E}\left[R_\infty | \overline{\mathcal{E}_{ext}}\right]$, or obtained non-closed form solutions. We then study the two-country model also accounting for the role of $\Pr{\mathcal{E}_{ext}}$. We assume that both countries have different infection rates $\beta^{(i)}$, but share the same recovery rate $\gamma$. In this model, each round has two steps: First, an individual is chosen u.a.r. and travels w.p. $p_{travel}$ to the other country. Afterwards, the process continues as before with the respective infection rates. Finally, using simulations, we characterise the influence of $p_{travel}$ on the total number of infections. Our simulations show that, depending on the $\beta^{(i)}$, increasing $p_{travel}$ can decrease or increase the expected total number of infections $\mathbf{E}\left[R_\infty\right]$. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/berenbrink22a.html https://proceedings.mlr.press/v180/berenbrink22a.html Empirical bayes approach to truth discovery problems When aggregating information from conflicting sources, one’s goal is to find the truth. Most real-value truth discovery (TD) algorithms try to achieve this goal by estimating the competence of each source and then aggregating the conflicting information by weighing each source’s answer proportionally to her competence. However, each of those algorithms requires more than a single source for such estimation and usually does not consider different estimation methods other than a weighted mean. Therefore, in this work we formulate, prove, and empirically test the conditions for an Empirical Bayes Estimator (EBE) to dominate the weighted mean aggregation. Our main result demonstrates that EBE, under mild conditions, can be used as a second step of any TD algorithm in order to reduce the expected error. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ben-shabat22a.html https://proceedings.mlr.press/v180/ben-shabat22a.html Equilibrium aggregation: encoding sets via optimization Processing sets or other unordered, potentially variable-sized inputs in neural networks is usually handled by aggregating a number of input tensors into a single representation. While a number of aggregation methods already exist from simple sum pooling to multi-head attention, they are limited in their representational power both from theoretical and empirical perspectives. On the search of a principally more powerful aggregation strategy, we propose an optimization-based method called Equilibrium Aggregation. We show that many existing aggregation methods can be recovered as special cases of Equilibrium Aggregation and that it is provably more efficient in some important cases. Equilibrium Aggregation can be used as a drop-in replacement in many existing architectures and applications. We validate its efficiency on three different tasks: median estimation, class counting, and molecular property prediction. In all experiments, Equilibrium Aggregation achieves higher performance than the other aggregation techniques we test. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bartunov22a.html https://proceedings.mlr.press/v180/bartunov22a.html Byzantine-tolerant distributed multiclass sparse linear discriminant analysis Communication cost and security issues are both important in large-scale distributed machine learning. In this paper, we investigate a multiclass sparse classification problem under two distributed systems. We propose two distributed multiclass sparse discriminant analysis algorithms based on mean-aggregation and median-aggregation under the normal distributed system or Byzantine failure system. Both of them are computation and communication efficient. Several theoretical results, including estimation error bounds, and support recovery, are established. With moderate initial estimators, our iterative estimators achieve a (near-)optimal rate and exact support recovery after a constant number of rounds. Experiments on both synthetic and real datasets are provided to demonstrate the effectiveness of our proposed methods. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bao22b.html https://proceedings.mlr.press/v180/bao22b.html Physics guided neural networks for spatio-temporal super-resolution of turbulent flows Direct numerical simulation (DNS) of turbulent flows is computationally expensive and cannot be applied to flows with large Reynolds numbers. Low-resolution large eddy simulation (LES) is a popular alternative, but it is unable to capture all of the scales of turbulent transport accurately. Reconstructing DNS from low-resolution LES is critical for large-scale simulation in many scientific and engineering disciplines, but it poses many challenges to existing super-resolution methods due to the complexity of turbulent flows and computational cost of generating frequent LES data. We propose a physics-guided neural network for reconstructing frequent DNS from sparse LES data by enhancing its spatial resolution and temporal frequency. Our proposed method consists of a partial differential equation (PDE)-based recurrent unit for capturing underlying temporal processes and a physics-guided super-resolution model that incorporates additional physical constraints. We demonstrate the effectiveness of both components in reconstructing the Taylor-Green Vortex using sparse LES data. Moreover, we show that the proposed recurrent unit can preserve the physical characteristics of turbulent flows by leveraging the physical relationships in the Navier-Stokes equation. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/bao22a.html https://proceedings.mlr.press/v180/bao22a.html Asymmetric DQN for partially observable reinforcement learning Offline training in simulated partially observable environments allows reinforcement learning methods to exploit privileged state information through a mechanism known as asymmetry. Such privileged information has the potential to greatly improve the optimal convergence properties, if used appropriately. However, current research in asymmetric reinforcement learning is often heuristic in nature, with few connections to underlying theory or theoretical guarantees, and is primarily tested through empirical evaluation. In this work, we develop the theory of Asymmetric Policy Iteration, an exact model-based dynamic programming solution method, and then apply relaxations which eventually result in Asymmetric DQN, a model-free deep reinforcement learning algorithm. Our theoretical findings are complemented and validated by empirical experimentation performed in environments which exhibit significant amounts of partial observability, and require both information gathering strategies and memorization. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/baisero22a.html https://proceedings.mlr.press/v180/baisero22a.html Discovery of extended summary graphs in time series This study addresses the problem of learning an extended summary causal graph from time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behaviour of our method is illustrated through several experiments. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/assaad22a.html https://proceedings.mlr.press/v180/assaad22a.html Inductive synthesis of finite-state controllers for POMDPs We present a novel learning framework to obtain finite-state controllers (FSCs) for partially observable Markov decision processes and illustrate its applicability for indefinite-horizon specifications. Our framework builds on oracle-guided inductive synthesis to explore a design space compactly representing available FSCs. The inductive synthesis approach consists of two stages: The outer stage determines the design space, i.e., the set of FSC candidates, while the inner stage efficiently explores the design space. This framework is easily generalisable and shows promising results when compared to existing approaches. Experiments indicate that our technique is (i) competitive to state-of-the-art belief-based approaches for indefinite-horizon properties, (ii) yields smaller FSCs than existing methods for several POMDP models, and (iii) naturally treats multi-objective specifications. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/andriushchenko22a.html https://proceedings.mlr.press/v180/andriushchenko22a.html Multi-winner approval voting goes epistemic Epistemic voting interprets votes as noisy signals about a ground truth. We consider contexts where the truth consists of a set of objective winners, knowing a lower and upper bound on its cardinality. A prototypical problem for this setting is the aggregation of multi-label annotations with prior knowledge on the size of the ground truth. We posit noise models, for which we define rules that output an optimal set of winners. We report on experiments on multi-label annotations (which we collected). Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/allouche22a.html https://proceedings.mlr.press/v180/allouche22a.html Data dependent randomized smoothing Randomized smoothing is a recent technique that achieves state-of-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for certification, the parameters of these distributions are always set as global hyper parameters independent from the input data on which a network is certified. In this work, we revisit Gaussian randomized smoothing and show that the variance of the Gaussian distribution can be optimized at each input so as to maximize the certification radius for the construction of the smooth classifier. Since the data dependent classifier does not directly enjoy sound certification with existing approaches, we propose a memory-enhanced data dependent smooth classifier that is certifiable by construction. This new approach is generic, parameter-free, and easy to implement. In fact, we show that our data dependent framework can be seamlessly incorporated into 3 randomized smoothing approaches, leading to consistent improved certified accuracy. When this framework is used in the training routine of these approaches followed by a data dependent certification, we achieve 9% and 6% improvement over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/alfarra22a.html https://proceedings.mlr.press/v180/alfarra22a.html Non-parametric inference of relational dependence Independence testing plays a central role in statistical and causal inference from observational data. Standard independence tests assume that the data samples are independent and identically distributed (i.i.d.) but that assumption is violated in many real-world datasets and applications centered on relational systems. This work examines the problem of estimating independence in data drawn from relational systems by defining sufficient representations for the sets of observations influencing individual instances. Specifically, we define marginal and conditional independence tests for relational data by considering the kernel mean embedding as a flexible aggregation function for relational variables. We propose a consistent, non-parametric, scalable kernel test to operationalize the relational independence test for non-i.i.d. observational data under a set of structural assumptions. We empirically evaluate our proposed method on a variety of synthetic and semi-synthetic networks and demonstrate its effectiveness compared to state-of-the-art kernel-based independence tests. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ahsan22a.html https://proceedings.mlr.press/v180/ahsan22a.html Neuro-symbolic entropy regularization In structured output prediction, the goal is to jointly predict several output variables that together encode a structured object – a path in a graph, an entity-relation triple, or an ordering of objects. Such a large output space makes learning hard and requires vast amounts of labeled data. Different approaches leverage alternate sources of supervision. One approach – entropy regularization – posits that decision boundaries should lie in low-probability regions. It extracts supervision from unlabeled examples, but remains agnostic to the structure of the output space. Conversely, neuro-symbolic approaches exploit the knowledge that not every prediction corresponds to a valid structure in the output space. Yet, they do not further restrict the learned output distribution.This paper introduces a framework that unifies both approaches. We propose a loss, neuro-symbolic entropy regularization, that encourages the model to confidently predict a valid object. It is obtained by restricting entropy regularization to the distribution over only the valid structures. This loss can be computed efficiently when the output constraint is expressed as a tractable logic circuit. Moreover, it seamlessly integrates with other neuro-symbolic losses that eliminate invalid predictions. We demonstrate the efficacy of our approach on a series of semi-supervised and fully-supervised structured-prediction experiments, where it leads to models whose predictions are more accurate as well as more likely to be valid. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/ahmed22a.html https://proceedings.mlr.press/v180/ahmed22a.html GNN2GNN: Graph neural networks to generate neural networks The success of neural networks (NNs) is tightly linked with their architectural design—a complex problem by itself. We here introduce a novel framework leveraging Graph Neural Networks to Generate Neural Networks (GNN2GNN) where powerful NN architectures can be learned out of a set of available architecture-performance pairs. GNN2GNN relies on a three-way adversarial training of GNN, to optimise a generator model capable of producing predictions about powerful NN architectures. Unlike Neural Architecture Search (NAS) techniques proposing efficient searching algorithms over a set of NN architec- tures, GNN2GNN relies on learning NN architectural design criteria. GNN2GNN learns to propose NN architectures in a single step – i.e., training of the generator –, overcoming the recursive approach characterising NAS. Therefore, GNN2GNN avoids the expensive and inflexible search of efficient structures typical of NAS approaches. Extensive experiments over two state-of-the-art datasets prove the strength of our framework, showing that it can generate powerful architectures with high probability. Moreover, GNN2GNN outperforms possible counterparts for generating NN architectures, and shows flexibility against dataset quality degradation. Finally, GNN2GNN paves the way towards generalisation between datasets. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/agiollo22a.html https://proceedings.mlr.press/v180/agiollo22a.html Regret guarantees for model-based reinforcement learning with long-term average constraints We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward and incurs $K$ costs. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than a certain threshold. In this paper, we propose \NAM, a posterior sampling based algorithm using which the agent can learn optimal policies to interact with the CMDP. We show that with the assumption of slackness, characterized by $\kappa$, the optimization problem is feasible for the sampled MDPs. Further, for MDP with $S$ states, $A$ actions, and mixing time $T_M$, we prove that following \NAM{} algorithm, the agent can bound the regret of not accumulating rewards from an optimal policy by $\Tilde{O}(T_MS\sqrt{AT})$. Further, we show that the violations for any of the $K$ constraints is also bounded by $\Tilde{O}(T_MS\sqrt{AT})$. To the best of our knowledge, this is the first work that obtains a $\Tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling method. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/agarwal22b.html https://proceedings.mlr.press/v180/agarwal22b.html NeuroBE: Escalating neural network approximations of Bucket Elimination A major limiting factor in graphical model inference is the complexity of computing the partition function. Exact message-passing algorithms such as Bucket Elimination (BE) require exponential memory to compute the partition function; therefore, approximations are necessary. In this paper, we build upon a recently introduced methodology called Deep Bucket Elimination (DBE) that uses classical Neural Networks to approximate messages generated by BE for large buckets. The main feature of our new scheme, renamed NeuroBE, is that it customizes the architecture of the neural networks, their learning process and in particular, adapts the loss function to the internal form or distribution of messages. Our experiments demonstrate significant improvements in accuracy and time compared with the earlier DBE scheme. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/agarwal22a.html https://proceedings.mlr.press/v180/agarwal22a.html Mutation-driven follow the regularized leader for last-iterate convergence in zero-sum games In this study, we consider a variant of the Follow the Regularized Leader (FTRL) dynamics in two-player zero-sum games. FTRL is guaranteed to converge to a Nash equilibrium when time-averaging the strategies, while a lot of variants suffer from the issue of limit cycling behavior, i.e., lack the last-iterate convergence guarantee. To this end, we propose mutant FTRL (M-FTRL), an algorithm that introduces mutation for the perturbation of action probabilities. We then investigate the continuous-time dynamics of M-FTRL and provide the strong convergence guarantees toward stationary points that approximate Nash equilibria under full-information feedback. Furthermore, our simulation demonstrates that M-FTRL can enjoy faster convergence rates than FTRL and optimistic FTRL under full-information feedback and surprisingly exhibits clear convergence under bandit feedback. Wed, 17 Aug 2022 00:00:00 +0000 https://proceedings.mlr.press/v180/abe22a.html https://proceedings.mlr.press/v180/abe22a.html