Proceedings of Machine Learning ResearchProceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
Held in Pittsburgh, PA, USA on 31 July to 04 August 2023
Published as Volume 216 by the Proceedings of Machine Learning Research on 02 July 2023.
Volume Edited by:
Robin J. Evans
Ilya Shpitser
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/v216/
Fri, 11 Aug 2023 08:33:18 +0000Fri, 11 Aug 2023 08:33:18 +0000Jekyll v3.9.3Regularized online DR-submodular optimizationThe utilization of online optimization techniques is prevalent in many fields of artificial intelligence, enabling systems to continuously learn and adjust to their surroundings. This paper outlines a regularized online optimization problem, where the regularizer is defined on the average of the actions taken. The objective is to maximize the sum of rewards and the regularizer value while adhering to resource constraints, where the reward function is assumed to be DR-submodular. Both concave and DR-submodular regularizers are analyzed. Concave functions are useful in describing the impartiality of decisions, while DR-submodular functions can be employed to represent the overall effect of decisions on all relevant parties. We have developed two algorithms for each of the concave and DR-submodular regularizers. These algorithms are easy to implement, efficient, and produce sublinear regret in both cases. The performance of the proposed algorithms and regularizers has been verified through numerical experiments in the context of internet advertising.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zuo23a.html
https://proceedings.mlr.press/v216/zuo23a.htmlMixupE: Understanding and improving Mixup from directional derivative perspectiveMixup is a popular data augmentation technique for training deep neural networks where additional samples are generated by linearly interpolating pairs of inputs and their labels. This technique is known to improve the generalization performance in many learning paradigms and applications. In this work, we first analyze Mixup and show that it implicitly regularizes infinitely many directional derivatives of all orders. Based on this new insight, we propose an improved version of Mixup, theoretically justified to deliver better generalization performance than the vanilla Mixup. To demonstrate the effectiveness of the proposed method, we conduct experiments across various domains such as images, tabular data, speech, and graphs. Our results show that the proposed method improves Mixup across multiple datasets using a variety of architectures, for instance, exhibiting an improvement over Mixup by 0.8% in ImageNet top-1 accuracy.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zou23a.html
https://proceedings.mlr.press/v216/zou23a.htmlPrivate Prediction Strikes Back! Private Kernelized Nearest Neighbors with Individual Rényi FilterMost existing approaches of differentially private (DP) machine learning focus on <em>private training</em>. Despite its many advantages, <em>private training</em> lacks the flexibility in adapting to incremental changes to the training dataset such as deletion requests from exercising GDPR’s right to be forgotten. We revisit a long-forgotten alternative, known as <em>private prediction</em>, and propose a new algorithm named <em>Individual Kernelized Nearest Neighbor</em> (Ind-KNN). Ind-KNN is easily updatable over dataset changes and it allows precise control of the Rényi DP at an individual user level — a user’s privacy loss is measured by the exact amount of her contribution to predictions; and a user is removed if her prescribed privacy budget runs out. Our results show that Ind-KNN consistently improves the accuracy over existing private prediction methods for a wide range of epsilon on four vision and language tasks. We also illustrate several cases under which Ind-KNN is preferable over private training with NoisySGD.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhu23b.html
https://proceedings.mlr.press/v216/zhu23b.htmlAUC Maximization in Imbalanced Lifelong LearningImbalanced data is ubiquitous in machine learning, such as medical or fine-grained image datasets. The existing continual learning methods employ various techniques such as balanced sampling to improve classification accuracy in this setting. However, classification accuracy is not a suitable metric for imbalanced data, and hence these methods may not obtain a good classifier as measured by other metrics (e.g., Area under the ROC Curve). In this paper, we propose a solution to enable efficient imbalanced continual learning by designing an algorithm to effectively maximize one widely used metric in an imbalanced data setting: Area Under the ROC Curve (AUC). We find that simply replacing accuracy with AUC will cause <em>gradient interference problem</em> due to the imbalanced data distribution. To address this issue, we propose a new algorithm, namely DIANA, which performs a novel synthesis of model <b>D</b>ecoupl<b>I</b>ng <b>AN</b>d <b>A</b>lignment. In particular, the algorithm updates two models simultaneously: one focuses on learning the current knowledge while the other concentrates on reviewing previously-learned knowledge, and the two models gradually align during training. The results show that the proposed DIANA achieves state-of-the-art performance on all the imbalanced datasets compared with several competitive baselines.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhu23a.html
https://proceedings.mlr.press/v216/zhu23a.htmlConvergence rates for localized actor-critic in networked Markov potential gamesWe introduce a class of networked Markov potential games where agents are associated with nodes in a network. Each agent has its own local potential function, and the reward of each agent depends only on the states and actions of agents within a neighborhood. In this context, we propose a localized actor-critic algorithm. The algorithm is scalable since each agent uses only local information and does not need access to the global state. Further, the algorithm overcomes the curse of dimensionality through the use of function approximation. Our main results provide finite-sample guarantees up to a localization error and a function approximation error. Specifically, we achieve an $\tilde{\mathcal{O}}(\tilde{\epsilon}^{-4})$ sample complexity measured by the averaged Nash regret. This is the first finite-sample bound for multi-agent competitive games that does not depend on the number of agents.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhou23b.html
https://proceedings.mlr.press/v216/zhou23b.htmlLearning robust representation for reinforcement learning with distractions by reward sequence predictionReinforcement learning algorithms have achieved remarkable success in acquiring behavioral skills directly from pixel inputs. However, their application in real-world scenarios presents challenges due to their sensitivity to visual distractions (e.g., changes in viewpoint and light). A key factor contributing to this challenge is that the learned representations often suffer from overfitting task-irrelevant information. By comparing several representation learning methods, we find that the key to alleviating overfitting in representation learning is to choose proper prediction targets. Motivated by our comparison, we propose a novel representation learning approach—namely, reward sequence prediction (RSP)—that uses reward sequences or their transforms (e.g., discrete time Fourier transform) as prediction targets. RSP can efficiently learn robust representations as reward sequences rarely contain task-irrelevant information while providing a large number of supervised signals to accelerate representation learning. An appealing feature is that RSP makes no assumption about the type of distractions and thus can improve performance even when multiple types of distractions exist. We evaluate our approach in Distracting Control Suite. Experiments show that our method achieves state-of-the-art sample efficiency and generalization ability in tasks with distractions.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhou23a.html
https://proceedings.mlr.press/v216/zhou23a.htmlRDM-DC: Poisoning Resilient Dataset Condensation with Robust Distribution MatchingDataset condensation aims to condense the original training dataset into a small synthetic dataset for data-efficient learning. The recently proposed dataset condensation techniques allow the model trainers with limited resources to learn acceptable deep learning models on a small amount of synthetic data. However, in an adversarial environment, given the original dataset as a poisoned dataset, dataset condensation may encode the poisoning information into the condensed synthetic dataset. To explore the vulnerability of dataset condensation to data poisoning, we revisit the state-of-the-art targeted data poisoning method and customize a targeted data poisoning algorithm for dataset condensation. By executing the two poisoning methods, we demonstrate that, when the synthetic dataset is condensed from a poisoned dataset, the models trained on the synthetic dataset may predict the targeted sample as the attack-targeted label. To defend against data poisoning, we introduce the concept of poisoned deviation to quantify the poisoning effect. We further propose a poisoning-resilient dataset condensation algorithm with a calibration method to reduce poisoned deviation. Extensive evaluations demonstrate that our proposed algorithm can protect the synthetic dataset from data poisoning with minor performance drop.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zheng23a.html
https://proceedings.mlr.press/v216/zheng23a.htmlConditionally optimistic exploration for cooperative deep multi-agent reinforcement learningEfficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation scheme. The high-level intuition is that to perform optimism-based exploration, agents would explore cooperative strategies if each agent’s optimism estimate captures a structured dependency relationship with other agents. Assuming agents compute actions following a sequential order at <em>each environment timestep</em>, we provide a perspective to view MARL as tree search iterations by considering agents as nodes at different depths of the search tree. Inspired by the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees), we develop a method called Conditionally Optimistic Exploration (COE). COE augments each agent’s state-action value estimate with an action-conditioned optimistic bonus derived from the visitation count of the global state and joint actions of preceding agents. COE is performed during training and disabled at deployment, making it compatible with any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhao23b.html
https://proceedings.mlr.press/v216/zhao23b.htmlConditional counterfactual causal effect for individual attributionIdentifying the causes of an event, also termed as causal attribution, is a commonly encountered task in many application problems. Available methods, mostly in Bayesian or causal inference literature, suffer from two main drawbacks: 1) cannot attribute for individuals, and 2) attributing one single cause at a time and cannot deal with the interaction effect among multiple causes. In this paper, based on our proposed new measurement, called conditional counterfactual causal effect (CCCE), we introduce an individual causal attribution method, which is able to utilize the individual observation as the evidence and consider common influence and interaction effect of multiple causes simultaneously. We discuss the identifiability of CCCE and also give the identification formulas under proper assumptions. Finally, we conduct experiments on simulated and real data to illustrate the effectiveness of CCCE and the results show that our proposed method outperforms significantly state-of-the-art methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhao23a.html
https://proceedings.mlr.press/v216/zhao23a.htmlGreed is good: correspondence recovery for unlabeled linear regressionWe consider the unlabeled linear regression reading as $\mathbf{Y} = \mathbf{\Pi}^{*}\mathbf{X}\mathbf{B}^* + \mathbf{W}$, where $\mathbf{\Pi}^{*}, \mathbf{B}^*$ and $\mathbf{W}$ represents missing (or incomplete) correspondence information, signals, and additive noise, respectively. Our goal is to perform data alignment between $\mathbf{Y}$ and $\mathbf{X}$, or equivalently, reconstruct the correspondence information encoded by $\mathbf{\Pi}^*$. Based on whether signal $\mathbf{B}^*$ is given a prior, we separately propose two greedy-selection-based estimators, which both reach the mini-max optimality. Compared with previous works, our work $(i)$ supports partial recovery of the correspondence information; and $(ii)$ applies to a general matrix family rather than the permutation matrices, to put more specifically, selection matrices, where multiple rows of $\mathbf{X}$ can correspond to the same row in $\mathbf{Y}$. Moreover, numerical experiments are provided to corroborate our claims.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhang23e.html
https://proceedings.mlr.press/v216/zhang23e.htmlGraph Self-supervised Learning via Proximity Distribution MinimizationSelf-supervised learning (SSL) for graphs is an essential problem since graph data are ubiquitous and labeling can be costly. We argue that existing SSL approaches for graphs have two limitations. First, they rely on corruption techniques such as node attribute perturbation and edge dropping to generate graph views for contrastive learning. These unnatural corruption techniques require extensive tuning efforts and provide marginal improvements. Second, the current approaches require the computation of multiple graph views, which is memory and computationally inefficient. These shortcomings of graph SSL call for a corruption-free single-view learning approach, but the strawman approach of using neighboring nodes as positive examples suffers two problems: it ignores the strength of connections between nodes implied by the graph structure on a macro level, and cannot deal with the high noise in real-world graphs. We propose Proximity Divergence Minimization (PDM), a corruption-free single-view graph SSL approach that overcomes these problems by leveraging node proximity to measure connection strength and denoise the graph structure. Through extensive experiments, we show that PDM achieves up to 4.55% absolute improvement in ROC-AUC on graph SSL tasks over state-of-the-art approaches while being more memory efficient. Moreover, PDM even outperforms supervised training on node classification tasks of ogbn-proteins dataset. Our code is publicly available.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhang23d.html
https://proceedings.mlr.press/v216/zhang23d.htmlProvably efficient representation selection in Low-rank Markov Decision Processes: from online to offline RLThe success of deep reinforcement learning (DRL) lies in its ability to learn a representation that is well-suited for the exploration and exploitation task. To understand how the choice of representation can improve the efficiency of reinforcement learning (RL), we study representation selection for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline RL. Specifically, we show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection, and achieves a strictly better constant regret if the representation function class has a "coverage" property over the entire state-action space. For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class can cover the state-action space and achieves gap-dependent sample complexity. This is the first result with constant sample complexity for representation learning in offline RL.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhang23c.html
https://proceedings.mlr.press/v216/zhang23c.htmlEnergy-based Predictive Representations for Partially Observed Reinforcement LearningIn real-world applications, handling partial observability is a common requirement for reinforcement learning algorithms, which is not captured by a Markov decision process (MDP). Although partially observable Markov decision processes (POMDPs) have been specifically designed to address this requirement, they present significant computational and statistical challenges in learning and planning. In this work, we introduce the <em>Energy-based Predictive Representation (EPR)</em> to provide a unified approach for designing practical reinforcement learning algorithms in both the MDP and POMDP settings. This framework enables coherent handling of <em>learning, exploration, and planning</em> tasks. The proposed framework leverages a powerful neural energy-based model to extract an adequate representation, allowing for efficient approximation of Q-functions. This representation facilitates the efficient computation of confidence, enabling the implementation of optimism or pessimism in planning when faced with uncertainty. Consequently, it effectively manages the trade-off between exploration and exploitation. Experimental investigations demonstrate that the proposed algorithm achieves state-of-the-art performance in both MDP and POMDP settings.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhang23b.html
https://proceedings.mlr.press/v216/zhang23b.htmlFast Teammate Adaptation in the Presence of Sudden Policy Change Cooperative multi-agent reinforcement learning (MARL), where agents coordinates with teammate(s) for a shared goal, may sustain non-stationary caused by the policy change of teammates. Prior works mainly concentrate on the policy change cross episodes, ignoring the fact that teammates may suffer from sudden policy change within an episode, which might lead to miscoordination and poor performance. We formulate the problem as an open Dec-POMDP, where we control some agents to coordinate with uncontrolled teammates, whose policies could be changed within one episode. Then we develop a new framework \textit{\textbf{Fas}t \textbf{t}eammates \textbf{a}da\textbf{p}tation (\textbf{Fastap})} to address the problem. Concretely, we first train versatile teammates’ policies and assign them to different clusters via the Chinese Restaurant Process (CRP). Then, we train the controlled agent(s) to coordinate with the sampled uncontrolled teammates by capturing their identifications as context for fast adaptation. Finally, each agent applies its local information to anticipate the teammates’ context for decision-making accordingly. This process proceeds alternately, leading to a robust policy that can adapt to any teammates during the decentralized execution phase. We show in multiple multi-agent benchmarks that Fastap can achieve superior performance than multiple baselines in stationary and non-stationary scenarios. Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/zhang23a.html
https://proceedings.mlr.press/v216/zhang23a.htmlOnline estimation of similarity matrices with incomplete dataThe similarity matrix measures pairwise similarities between a set of data points and is an essential concept in data processing, routinely used in practical applications. Obtaining a similarity matrix is typically straightforward when data points are completely observed. However, incomplete observations can make it challenging to obtain a high-quality similarity matrix, which becomes even more complex in online data. To address this challenge, we propose matrix correction algorithms that leverage the positive semi-definiteness (PSD) of the similarity matrix to improve similarity estimation in both offline and online scenarios. Our approaches have a solid theoretical guarantee of performance and excellent potential for parallel execution on large-scale data. Empirical evaluations demonstrate their high effectiveness and efficiency with significantly improved results over classical imputation-based methods, benefiting downstream applications with superior performance. Our code is available at \url{https://github.com/CUHKSZ-Yu/OnMC}.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/yu23a.html
https://proceedings.mlr.press/v216/yu23a.htmlMonte-Carlo Search for an Equilibrium in Dec-POMDPsDecentralized partially observable Markov decision processes (Dec-POMDPs) formalize the problem of designing individual controllers for a group of collaborative agents under stochastic dynamics and partial observability. Seeking a global optimum is difficult (NEXP complete), but seeking a Nash equilibrium - each agent policy being a best response to the other agents - is more accessible, and allowed addressing infinite-horizon problems with solutions in the form of finite state controllers. In this paper, we show that this approach can be adapted to cases where only a generative model (a simulator) of the Dec-POMDP is available. This requires relying on a simulation-based POMDP solver to construct an agent’s FSC node by node. A related process is used to heuristically derive initial FSCs. Experiment with benchmarks shows that MC-JESP is competitive with existing Dec-POMDP solvers, even better than many offline methods using explicit models.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/you23a.html
https://proceedings.mlr.press/v216/you23a.htmlTowards Physically Reliable Molecular Representation LearningEstimating the energetic properties of molecular systems is a critical task in material design. Machine learning has shown remarkable promise on this task over classical force fields, but a fully data-driven approach suffers from limited labeled data; not just the amount of available data lacks, but the distribution of labeled examples is highly skewed to stable states. In this work, we propose a molecular representation learning method that extrapolates well beyond the training distribution, powered by physics-driven parameter estimation from classical energy equations and self-supervised learning inspired from masked language modeling. To ensure reliability of the proposed model, we introduce a series of novel evaluation schemes in multifaceted ways, beyond the energy or force accuracy that has been dominantly used. From extensive experiments, we demonstrate that the proposed method is effective in discovering molecular structures, outperforming other baselines. Furthermore, we extrapolate it to the chemical reaction pathways beyond stable states, taking a step towards physically reliable molecular representation learning.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/yi23a.html
https://proceedings.mlr.press/v216/yi23a.htmlMitigating Transformer Overconfidence via Lipschitz RegularizationThough Transformers have achieved promising results in many computer vision tasks, they tend to be over-confident in predictions, as the standard Dot Product Self-Attention (DPSA) can barely preserve distance for the unbounded input domain. In this work, we fill this gap by proposing a novel Lipschitz Regularized Transformer (LRFormer). Specifically, we present a new similarity function with the distance within Banach Space to ensure the Lipschitzness and also regularize the term by a contractive Lipschitz Bound. The proposed method is analyzed with a theoretical guarantee, providing a rigorous basis for its effectiveness and reliability. Extensive experiments conducted on standard vision benchmarks demonstrate that our method outperforms the state-of-the-art single forward pass approaches in prediction, calibration, and uncertainty estimation.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ye23a.html
https://proceedings.mlr.press/v216/ye23a.htmlMMEL: A Joint Learning Framework for Multi-Mention Entity LinkingEntity linking, bridging mentions in the contexts with their corresponding entities in the knowledge bases, has attracted wide attention due to many potential applications. Recently, plenty of multimodal entity linking approaches have been proposed to take full advantage of the visual information rather than solely the textual modality. Although feasible, these methods mainly focus on the single-mention scenarios and neglect the scenarios where multiple mentions exist simultaneously in the same context, which limits the performance. In fact, such multi-mention scenarios are pretty common in public datasets and real-world applications. To solve this challenge, we first propose a joint feature extraction module to learn the representations of context and entity candidates, from both the visual and textual perspectives. Then, we design a pairwise training scheme (for training) and a multi-mention collaborative ranking method (for testing) to model the potential connections between different mentions. We evaluate our method on a public dataset and a self-constructed dataset, NYTimes-MEL, under both text-only and multimodal scenarios. The experimental results demonstrate that our method can largely outperform the state-of-the-art methods, especially in multi-mention scenarios. Our dataset and source code are publicly available at https://github.com/ycm094/MMEL-main.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/yang23d.html
https://proceedings.mlr.press/v216/yang23d.htmlMulti-modal differentiable unsupervised feature selectionMulti-modal high throughput biological data presents a great scientific opportunity and a significant computational challenge. In multi-modal measurements, every sample is observed simultaneously by two or more sets of sensors. In such settings, many observed variables in both modalities are often nuisance and do not carry information about the phenomenon of interest. Here, we propose a multi-modal unsupervised feature selection framework: identifying informative variables based on coupled high-dimensional measurements. Our method is designed to identify features associated with two types of latent low-dimensional structures: (i) shared structures that govern the observations in both modalities, and (ii) differential structures that appear in only one modality. To that end, we propose two Laplacian-based scoring operators. We incorporate the scores with differentiable gates that mask nuisance features and enhance the accuracy of the structure captured by the graph Laplacian. The performance of the new scheme is illustrated using synthetic and real datasets, including an extended biological application to single-cell multi-omics.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/yang23c.html
https://proceedings.mlr.press/v216/yang23c.htmlMixture of Normalizing Flows for European Option PricingWe present a mixture of normalizing flows (MoNF) approach to European option pricing with guarantees that its estimations are free from static arbitrage. In contrast to many existing methods that meet economic rationality constraints (e.g., non-arbitrage) by introducing auxiliary losses, our solution meets those constraints exactly by design. To achieve this, we propose to build a model for risk neutral density using normalizing flows, which results in a pricing model, instead of modelling the option pricing function directly. First, we convert the constraints for direct pricing models to the constraints for models backed by risk neutral density estimation, then we design a specific NF architecture that meets these constraints. Furthermore, we find that employing a mixture of such normalizing flows improves the performance significantly, compared to using a deeper single NF. Finally, we present a mechanism to regularise the proposed model, and this regularisation can serve as a bridge between our method and any sample-based mathematical finance method. The evaluations on five option datasets show superiority of our method compared to mathematical finance solutions and some other neural networks based methods. The code is available at \url{https://github.com/qmfin/MoNF}.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/yang23b.html
https://proceedings.mlr.press/v216/yang23b.htmlPessimistic Model Selection for Offline Deep Reinforcement LearningDeep Reinforcement Learning (DRL) has demonstrated great potentials in solving sequential decision making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task as there is no ground truth available for performance demonstration, in contrast with the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of DRL model in identifying the optimal policy. Numerical studies demonstrated the superior performance of our approach over existing methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/yang23a.html
https://proceedings.mlr.press/v216/yang23a.htmlProvably Efficient Adversarial Imitation Learning with Unknown TransitionsImitation learning (IL) has proven to be an effective method for learning good policies from expert demonstrations. Adversarial imitation learning (AIL), a subset of IL methods, is particularly promising, but its theoretical foundation in the presence of unknown transitions has yet to be fully developed. This paper explores the theoretical underpinnings of AIL in this context, where the stochastic and uncertain nature of environment transitions presents a challenge. We examine the expert sample complexity and interaction complexity required to recover good policies. To this end, we establish a framework connecting reward-free exploration and AIL, and propose an algorithm, MB-TAIL, that achieves the minimax optimal expert sample complexity of $\widetilde{\mathcal{O}} (H^{3/2} |\mathcal{S}|/\varepsilon)$ and interaction complexity of $\widetilde{\mathcal{O}} (H^{3} |\mathcal{S}|^2 |\mathcal{A}|/\varepsilon^2)$. Here, $H$ represents the planning horizon, $|\mathcal{S}|$ is the state space size, $|\mathcal{A}|$ is the action space size, and $\varepsilon$ is the desired imitation gap. MB-TAIL is the first algorithm to achieve this level of expert sample complexity in the unknown transition setting and improves upon the interaction complexity of the best-known algorithm, OAL, by $\mathcal{O} (H)$. Additionally, we demonstrate the generalization ability of MB-TAIL by extending it to the function approximation setting and proving that it can achieve expert sample and interaction complexity independent of $|\mathcal{S}|$.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/xu23c.html
https://proceedings.mlr.press/v216/xu23c.html$E(2)$-Equivariant Vision TransformerVision Transformer (ViT) has achieved remarkable performance in computer vision. However, positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in data. Ini- tial attempts have been made on designing equiv- ariant ViT but are proved defective in some cases in this paper. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding opera- tor. We prove that GE-ViT meets all the theoreti- cal requirements of an equivariant neural network. Comprehensive experiments are conducted on standard benchmark datasets, demonstrating that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/xu23b.html
https://proceedings.mlr.press/v216/xu23b.htmlConformal Risk Control for Ordinal ClassificationAs a natural extension to the standard conformal prediction method, several conformal risk control methods have been recently developed and applied to various learning problems. In this work, we seek to control the conformal risk in expectation for ordinal classification tasks, which have broad applications to many real problems. For this purpose, we firstly formulated the ordinal classification task in the conformal risk control framework, and provided theoretic risk bounds of the risk control method. Then we proposed two types of loss functions specially designed for ordinal classification tasks, and developed corresponding algorithms to determine the prediction set for each case to control their risks at a desired level. We demonstrated the effectiveness of our proposed methods, and analyzed the difference between the two types of risks on three different datasets, including a simulated dataset, the UTKFace dataset and the diabetic retinopathy detection dataset.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/xu23a.html
https://proceedings.mlr.press/v216/xu23a.htmlTwo-stage holistic and contrastive explanation of image classificationThe need to explain the output of a deep neural network classifier is now widely recognized. While previous methods typically explain a single class in the output, we advocate explaining the whole output, which is a probability distribution over multiple classes. A whole-output explanation can help a human user gain an overall understanding of model behaviour instead of only one aspect of it. It can also provide a natural framework where one can examine the evidence used to discriminate between competing classes, and thereby obtain contrastive explanations. In this paper, we propose a contrastive whole-output explanation (CWOX) method for image classification, and evaluate it using quantitative metrics and through human subject studies. The source code of CWOX is available at https://github.com/vaynexie/CWOX.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/xie23a.html
https://proceedings.mlr.press/v216/xie23a.htmlA one-sample decentralized proximal algorithm for non-convex stochastic composite optimizationWe focus on decentralized stochastic non-convex optimization, where $n$ agents work together to optimize a composite objective function which is a sum of a smooth term and a non-smooth convex term. To solve this problem, we propose two single-time scale algorithms: \texttt{Prox-DASA} and \texttt{Prox-DASA-GT}. These algorithms can find $\epsilon$-stationary points in $\mathcal{O}(n^{-1}\epsilon^{-2})$ iterations using constant batch sizes (i.e., $\mathcal{O}(1)$). Unlike prior work, our algorithms achieve comparable complexity without requiring large batch sizes, more complex per-iteration operations (such as double loops), or stronger assumptions. Our theoretical findings are supported by extensive numerical experiments, which demonstrate the superiority of our algorithms over previous approaches. Our code is available at \url{https://github.com/xuxingc/ProxDASA}.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/xiao23a.html
https://proceedings.mlr.press/v216/xiao23a.htmlRobust Quickest Change Detection for Unnormalized ModelsDetecting an abrupt and persistent change in the underlying distribution of online data streams is an important problem in many applications. This paper proposes a new robust score-based algorithm called RSCUSUM, which can be applied to unnormalized models and addresses the issue of unknown post-change distributions. RSCUSUM replaces the Kullback-Leibler divergence with the Fisher divergence between pre- and post-change distributions for computational efficiency in unnormalized statistical models and introduces a notion of the “least favorable” distribution for robust change detection. The algorithm and its theoretical analysis are demonstrated through simulation studies.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wu23c.html
https://proceedings.mlr.press/v216/wu23c.htmlUniform-PAC Guarantees for Model-Based RL with Bounded Eluder DimensionRecently, there has been remarkable progress in reinforcement learning (RL) with general function approximation. However, all these works only provide regret or sample complexity guarantees. It is still an open question if one can achieve stronger performance guarantees, i.e., the uniform probably approximate correctness (Uniform-PAC) guarantee that can imply both a sub-linear regret bound and a polynomial sample complexity for any target learning accuracy. We study this problem by proposing algorithms for both nonlinear bandits and model-based episodic RL using the general function class with a bounded eluder dimension. The key idea of the proposed algorithms is to assign each action to different levels according to its width with respect to the confidence set. The achieved Uniform-PAC sample complexity is tight in the sense that it matches the state-of-the-art regret bounds or sample complexity guarantees when reduced to the linear case. To the best of our knowledge, this is the first work for Uniform-PAC guarantees on bandit and RL that goes beyond linear cases.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wu23b.html
https://proceedings.mlr.press/v216/wu23b.htmlLearning To Invert: Simple Adaptive Attacks for Gradient Inversion in Federated LearningGradient inversion attack enables recovery of training samples from model gradients in federated learning (FL), and constitutes a serious threat to data privacy. To mitigate this vulnerability, prior work proposed both principled defenses based on differential privacy, as well as heuristic defenses based on gradient compression as countermeasures. These defenses have so far been very effective, in particular those based on gradient compression that allow the model to maintain high accuracy while greatly reducing the effectiveness of attacks. In this work, we argue that such findings underestimate the privacy risk in FL. As a counterexample, we show that existing defenses can be broken by a simple adaptive attack, where a model trained on auxiliary data is able to invert gradients on both vision and language tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wu23a.html
https://proceedings.mlr.press/v216/wu23a.htmlQuantifying aleatoric and epistemic uncertainty in machine learning: Are conditional entropy and mutual information appropriate measures?The quantification of aleatoric and epistemic uncertainty in terms of conditional entropy and mutual information, respectively, has recently become quite common in machine learning. While the properties of these measures, which are rooted in information theory, seem appealing at first glance, we identify various incoherencies that call their appropriateness into question. In addition to the measures themselves, we critically discuss the idea of an additive decomposition of total uncertainty into its aleatoric and epistemic constituents. Experiments across different computer vision tasks support our theoretical findings and raise concerns about current practice in uncertainty quantification.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wimmer23a.html
https://proceedings.mlr.press/v216/wimmer23a.htmlBidirectional Attention as a Mixture of Continuous Word ExpertsBidirectional attention - composed of the neural network architecture of self-attention with positional encodings, together with the masked language model (MLM) objective - has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wibisono23a.html
https://proceedings.mlr.press/v216/wibisono23a.htmlOn the Role of Generalization in Transferability of Adversarial ExamplesBlack-box adversarial attacks designing adversarial examples for unseen deep neural networks (DNNs) have received great attention over the past years. However, the underlying factors driving the transferability of black-box adversarial examples still lack a thorough understanding. In this paper, we aim to demonstrate the role of the generalization behavior of the substitute classifier used for generating adversarial examples in the transferability of the attack scheme to unobserved DNN classifiers. To do this, we apply the max-min adversarial example game framework and show the importance of the generalization properties of the substitute DNN from training to test data in the success of the black-box attack scheme in application to different DNN classifiers. We prove theoretical generalization bounds on the difference between the attack transferability rates on training and test samples. Our bounds suggest that operator norm-based regularization methods could improve the transferability of the designed adversarial examples. We support our theoretical results by performing several numerical experiments showing the role of the substitute network’s generalization in generating transferable adversarial examples. Our empirical results indicate the power of Lipschitz regularization and early stopping methods in improving the transferability of designed adversarial examples.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23g.html
https://proceedings.mlr.press/v216/wang23g.htmlA constrained Bayesian approach to out-of-distribution predictionConsider the problem of out-of-distribution prediction given data from multiple environments. While a sufficiently diverse collection of training environments will facilitate the identification of an invariant predictor, with an optimal generalization performance, many applications only provide us with a limited number of environments. It is thus necessary to consider adapting to distribution shift using a handful of labeled test samples. We propose a constrained Bayesian approach for this task, which restricts to models with a worst-group training loss above a prespecified threshold. Our method avoids a pathology of the standard Bayesian posterior, which occurs when spurious correlations improve in-distribution prediction. We also show that on certain high-dimensional linear problems, constrained modeling improves the sample efficiency of adaptation. Synthetic and real-world experiments demonstrate the robust performance of our approach.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23f.html
https://proceedings.mlr.press/v216/wang23f.htmlRobust distillation for worst-class performance: on the interplay between teacher and student objectivesKnowledge distillation is a popular technique that has been shown to produce remarkable gains in average accuracy. However, recent work has shown that these gains are not uniform across subgroups in the data, and can often come at the cost of accuracy on rare subgroups and classes. Robust optimization is a common remedy to improve worst-class accuracy in standard learning settings, but in distillation it is unknown whether it is best to apply robust objectives when training the teacher, the student, or both. This work studies the interplay between robust objectives for the teacher and student. Empirically, we show that that jointly modifying the teacher and student objectives can lead to better worst-class student performance and even Pareto improvement in the trade-off between worst-class and overall performance. Theoretically, we show that the <em>per-class calibration</em> of teacher scores is key when training a robust student. Both the theory and experiments support the surprising finding that applying a robust teacher training objective does not always yield a more robust student.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23e.html
https://proceedings.mlr.press/v216/wang23e.htmlA trajectory is worth three sentences: multimodal transformer for offline reinforcement learningTransformers hold tremendous promise in solving offline reinforcement learning (RL) by formulating it as a sequence modeling problem inspired by language modeling (LM). Prior works using transformers model a sample (trajectory) of RL as one sequence analogous to a sequence of words (one sentence) in LM, despite the fact that each trajectory includes tokens from three diverse modalities: state, action, and reward, while a sentence contains words only. Rather than taking a modality-agnostic approach which uniformly models the tokens from different modalities as one sequence, we propose a multimodal sequence modeling approach in which a trajectory (one “sentence”) of three modalities (state, action, reward) is disentangled into three unimodal ones (three “sentences”). We investigate the correlation of different modalities during sequential decision-making and use the insights to design a multimodal transformer, named Decision Transducer (DTd). DTd outperforms prior art in offline RL on the conducted D4RL benchmarks and enjoys better sample efficiency and algorithm flexibility. Our code is made publicly here.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23d.html
https://proceedings.mlr.press/v216/wang23d.htmlDiversity-enhanced probabilistic ensemble for uncertainty estimationEnsemble methods combine multiple individual models for prediction, which have demonstrated their effectiveness in accurate uncertainty quantification (UQ) and strong robustness. Obtaining a diverse ensemble set of model parameters results in better model averaging performance and better approximation of the true posterior distribution of these parameters. In this paper, we propose the diversity-enhanced probabilistic ensemble method with the adaptive uncertainty-guided ensemble learning strategy for better quantifying uncertainty and further improving the model robustness. Specifically, we construct the probabilistic ensemble model by building a Gaussian distribution of the model parameters for each ensemble component using Laplacian approximation in a post-processing manner. Then a mixture of Gaussian model is established with learnable and refinable parameters in an EM-like algorithm. During ensemble training, we leverage the uncertainty estimated from previous models as guidance when training the next one such that the new model will focus more on the less explored regions by previous models. Various experiments including out-of-distribution detection and image classification under distributional shifts have demonstrated better uncertainty estimation and improved model generalization ability of our proposed method.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23c.html
https://proceedings.mlr.press/v216/wang23c.htmlEfficient Privacy-Preserving Stochastic Nonconvex OptimizationWhile many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while matching the best-known utility guarantee. Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23b.html
https://proceedings.mlr.press/v216/wang23b.htmlExploration for Free: How Does Reward Heterogeneity Improve Regret in Cooperative Multi-agent Bandits?This paper studies a cooperative multi-agent bandit scenario in which the rewards observed by agents are heterogeneous—one agent’s meat can be another agent’s poison. Specifically, the total reward observed by each agent is the sum of two values: an arm-specific reward, capturing the intrinsic value of the arm, and a privately-known agent-specific reward, which captures the personal preference/limitations of the agent. This heterogeneity in total reward leads to different local optimal arms for agents but creates an opportunity for \textit{free exploration} in a cooperative setting—an agent can freely explore its local optimal arm with no regret and share this free observation with some other agents who would suffer regrets if they pull this arm since the arm is not optimal for them. We first characterize a regret lower bound that captures free exploration, i.e., arms that can be freely explored have no contribution to the regret lower bound. Then, we present a cooperative bandit algorithm that takes advantage of free exploration and achieves a near-optimal regret upper bound which tightly matches the regret lower bound up to a constant factor. Lastly, we run numerical simulations to compare our algorithm with various baselines without free exploration.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wang23a.html
https://proceedings.mlr.press/v216/wang23a.htmlBirds of an odd feather: guaranteed out-of-distribution (OOD) novel category detectionIn this work, we solve the problem of novel category detection under distribution shift. This problem is critical to ensuring the safety and efficacy of machine learning models, particularly in domains such as healthcare where timely detection of novel subgroups of patients is crucial. To address this problem, we propose a method based on constrained learning. Our approach is guaranteed to detect a novel category under a relatively weak assumption, namely that rare events in past data have bounded frequency under the shifted distribution. Prior works on the problem do not provide such guarantees, as they either attend to very specific types of distribution shift or make stringent assumptions that limit their guarantees. We demonstrate favorable performance of our method on challenging novel category detection problems over real world datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/wald23a.html
https://proceedings.mlr.press/v216/wald23a.htmlA policy gradient approach for optimization of smooth risk measuresWe propose policy gradient algorithms for solving a risk-sensitive reinforcement learning (RL) problem in on-policy as well as off-policy settings. We consider episodic Markov decision processes, and model the risk using the broad class of smooth risk measures of the cumulative discounted reward. We propose two template policy gradient algorithms that optimize a smooth risk measure in on-policy and off-policy RL settings, respectively. We derive non-asymptotic bounds that quantify the rate of convergence of our proposed algorithms to a stationary point of the smooth risk measure. As special cases, we establish that our algorithms apply to optimization of mean-variance and distortion risk measures, respectively.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/vijayan23a.html
https://proceedings.mlr.press/v216/vijayan23a.htmlProbabilistic circuits that know what they don’t knowProbabilistic circuits (PCs) are models that allow exact and tractable probabilistic inference. In contrast to neural networks, they are often assumed to be well-calibrated and robust to out-of-distribution (OOD) data. In this paper, we show that PCs are in fact not robust to OOD data, i.e., they don’t know what they don’t know. We then show how this challenge can be overcome by model uncertainty quantification. To this end, we propose tractable dropout inference (TDI), an inference procedure to estimate uncertainty by deriving an analytical solution to Monte Carlo dropout (MCD) through variance propagation. Unlike MCD in neural networks, which comes at the cost of multiple network evaluations, TDI provides tractable sampling-free uncertainty estimates in a single forward pass. TDI improves the robustness of PCs to distribution shift and OOD data, demonstrated through a series of experiments evaluating the classification confidence and uncertainty estimates on real-world data.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ventola23a.html
https://proceedings.mlr.press/v216/ventola23a.htmlBandits with costly reward observationsMany machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit’s action. We show that the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/tucker23a.html
https://proceedings.mlr.press/v216/tucker23a.htmlSPDF: Sparse Pre-training and Dense Fine-tuning for Large Language ModelsThe pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also lead to highly prohibitive computational costs. Pre-training LLMs often require orders of magnitude more FLOPs than fine-tuning and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/thangarasa23a.html
https://proceedings.mlr.press/v216/thangarasa23a.htmlFairness-aware class imbalanced learning on multiple subgroupsWe present a novel Bayesian-based optimization framework that addresses the challenge of generalization in overparameterized models when dealing with imbalanced subgroups and limited samples per subgroup. Our proposed tri-level optimization framework utilizes <em>local</em> predictors, which are trained on a small amount of data, as well as a fair and class-balanced predictor at the middle and lower levels. To effectively overcome saddle points for minority classes, our lower-level formulation incorporates sharpness-aware minimization. Meanwhile, at the upper level, the framework dynamically adjusts the loss function based on validation loss, ensuring a close alignment between the <em>global</em> predictor and local predictors. Theoretical analysis demonstrates the framework’s ability to enhance classification and fairness generalization, potentially resulting in improvements in the generalization bound. Empirical results validate the superior performance of our tri-level framework compared to existing state-of-the-art approaches. The source code can be found at \url{https://github.com/PennShenLab/FACIMS}.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/tarzanagh23a.html
https://proceedings.mlr.press/v216/tarzanagh23a.htmlLow-rank matrix recovery with unknown correspondenceWe study a matrix recovery problem with unknown correspondence: given the observation matrix $M_o=[A,\tilde P B]$, where $\tilde P$ is an unknown permutation matrix, we aim to recover the underlying matrix $M=[A,B]$. Such problem commonly arises in many applications where heterogeneous data are utilized and the correspondence among them are unknown, e.g., due to data mishandling or privacy concern. We show that, in some applications, it is possible to recover $M$ via solving a nuclear norm minimization problem. Moreover, under a proper low-rank condition on $M$, we derive a non-asymptotic error bound for the recovery of $M$. We propose an algorithm, $\text{M}^3\text{O}$ (Matrix recovery via Min-Max Optimization) which recasts this combinatorial problem as a continuous minimax optimization problem and solves it by proximal gradient with a Max-Oracle. $\text{M}^3\text{O}$ can also be applied to a more general scenario where we have missing entries in $M_o$ and multiple groups of data with distinct unknown correspondence. Experiments on simulated data, the MovieLens 100K dataset and Yale B database show that $\text{M}^3\text{O}$ achieves state-of-the-art performance over several baselines and can recover the ground-truth correspondence with high accuracy.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/tang23a.html
https://proceedings.mlr.press/v216/tang23a.htmlTwo-stage Kernel Bayesian Optimization in High DimensionsBayesian optimization is a popular method for optimizing expensive black-box functions. Yet it oftentimes struggles in high dimensions, where the computation could be prohibitively heavy. While a complex kernel with many length scales is prone to overfitting and expensive to train, a simple coarse kernel with too few length scales cannot effectively capture the variations of the high dimensional function in different directions. To alleviate this problem, we introduce CobBO: a Bayesian optimization algorithm with two-stage kernels and a coordinate backoff stopping rule. It adaptively selects a promising low dimensional subspace and projects past measurements into it using a computational efficient coarse kernel. Within the subspace, the computational cost of conducting Bayesian optimization with a more flexible and accurate kernel becomes affordable and thus a sequence of consecutive observations in the same subspace are collected until a stopping rule is met. Extensive evaluations show that CobBO finds solutions comparable to or better than other state-of-the-art methods for dimensions ranging from tens to hundreds, while reducing both the trial complexity and computational costs.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/tan23a.html
https://proceedings.mlr.press/v216/tan23a.htmlExploiting Inferential Structure in Neural ProcessesNeural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set. This set is encoded by a latent variable, which is often assumed to follow a simple distribution. However, in real-word settings, the context set may be drawn from richer distributions having multiple modes, heavy tails, etc. In this work, we provide a framework that allows NPs’ latent variable to be given a rich prior defined by a graphical model. These distributional assumptions directly translate into an appropriate aggregation strategy for the context set. Moreover, we describe a message-passing procedure that still allows for end-to-end optimization with stochastic gradients. We demonstrate the generality of our framework by using mixture and Student-t assumptions that yield improvements in function modelling and test-time robustness.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/tailor23a.html
https://proceedings.mlr.press/v216/tailor23a.htmlWhy Out-of-Distribution detection experiments are not reliable - subtle experimental details muddle the OOD detector rankingsReliable detection of out-of-distribution (OOD) instances is becoming a critical requirement for machine learning systems deployed in safety-critical applications. Recently, many OOD detectors have been developed in the literature, and their performance has been evaluated using empirical studies based on well-established benchmark datasets. However, these studies do not provide a conclusive recommendation because the performance of OOD detection depends on the benchmark datasets. In this work, we want to question the reliability of the OOD detection performance numbers obtained from many of these empirical experiments. We report several experimental conditions that are not controlled and lead to significant changes in OOD detector performance and rankings of OOD methods. These include the technicalities related to how the DNN was trained (such as seed, train/test split, etc.), which do not change the accuracy of closed-set DNN models but may significantly change the performance of OOD detection methods that rely on representation from these DNNs. We performed extensive sensitivity studies in image and text domains to quantify the instability of OOD performance measures due to unintuitive experimental factors. These factors need to be more rigorously controlled and accounted for in many current OOD experiments. Experimental studies in OOD detection should improve methodological standards regarding experiment control and replication.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/szyc23a.html
https://proceedings.mlr.press/v216/szyc23a.htmlLocally Regularized Sparse Graph by Fast Proximal Gradient DescentSparse graphs built by sparse representation has been demonstrated to be effective in clustering high-dimensional data. Albeit the compelling empirical performance, the vanilla sparse graph ignores the geometric information of the data by performing sparse representation for each datum separately. In order to obtain a sparse graph aligned with the local geometric structure of data, we propose a novel Support Regularized Sparse Graph, abbreviated as SRSG, for data clustering. SRSG encourages local smoothness on the neighborhoods of nearby data points by a well-defined support regularization term. We propose a fast proximal gradient descent method to solve the non-convex optimization problem of SRSG with the convergence matching the Nesterov’s optimal convergence rate of first-order methods on smooth and convex objective function with Lipschitz continuous gradient. Extensive experimental results on various real data sets demonstrate the superiority of SRSG over other competing clustering methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sun23c.html
https://proceedings.mlr.press/v216/sun23c.htmlPandering in a (flexible) representative democracyIn representative democracies, regular election cycles are supposed to prevent misbehavior by elected officials, hold them accountable, and subject them to the “will of the people." Pandering, or dishonest preference reporting by candidates campaigning for election, undermines this democratic idea. Much of the work on Computational Social Choice to date has investigated strategic actions in only a single election. We introduce a novel formal model of pandering and examine the resilience of two voting systems, Representative Democracy (RD) and Flexible Representative Democracy (FRD), to pandering within a single election and across multiple rounds of elections. For both voting systems, our analysis centers on the types of strategies candidates employ and how voters update their views of candidates based on how the candidates have pandered in the past. We provide theoretical results on the complexity of pandering in our setting for a single election, formulate our problem for multiple cycles as a Markov Decision Process, and use reinforcement learning to study the effects of pandering by single candidates and groups of candidates over many rounds.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sun23b.html
https://proceedings.mlr.press/v216/sun23b.htmlMeta-learning Control Variates: Variance Reduction with Limited DataControl variates can be a powerful tool to reduce the variance of Monte Carlo estimators, but constructing effective control variates can be challenging when the number of samples is small. In this paper, we show that when a large number of related integrals need to be computed, it is possible to leverage the similarity between these integration tasks to improve performance even when the number of samples per task is very small. Our approach, called meta learning CVs (Meta-CVs), can be used for up to hundreds or thousands of tasks. Our empirical assessment indicates that Meta-CVs can lead to significant variance reduction in such settings, and our theoretical analysis establishes general conditions under which Meta-CVs can be successfully trained.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sun23a.html
https://proceedings.mlr.press/v216/sun23a.htmlOn the informativeness of supervision signalsSupervised learning typically focuses on learning transferable representations from training examples annotated by humans. While rich annotations (like soft labels) carry more information than sparse annotations (like hard labels), they are also more expensive to collect. For example, while hard labels only provide information about the closest class an object belongs to (e.g., “this is a dog”), soft labels provide information about the object’s relationship with multiple classes (e.g., “this is most likely a dog, but it could also be a wolf or a coyote”). We use information theory to compare how a number of commonly-used supervision signals contribute to representation-learning performance, as well as how their capacity is affected by factors such as the number of labels, classes, dimensions, and noise. Our framework provides theoretical justification for using hard labels in the big-data regime, but richer supervision signals for few-shot learning and out-of-distribution generalization. We validate these results empirically in a series of experiments with over 1 million crowdsourced image annotations and conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sucholutsky23a.html
https://proceedings.mlr.press/v216/sucholutsky23a.htmlDifferentially Private Stochastic Convex Optimization in (Non)-Euclidean Space RevisitedIn this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) in Euclidean and general $\ell_p^d$ spaces. Specifically, we focus on three settings that are still far from well understood: (1) DP-SCO over a constrained and bounded (convex) set in Euclidean space; (2) unconstrained DP-SCO in $\ell_p^d$ space; (3) DP-SCO with heavy-tailed data over a constrained and bounded set in $\ell_p^d$ space. For problem (1), for both convex and strongly convex loss functions, we propose methods whose outputs could achieve (expected) excess population risks that are only dependent on the Gaussian width of the constraint set, rather than the dimension of the space. Moreover, we also show the bound for strongly convex functions is optimal up to a logarithmic factor. For problems (2) and (3), we propose several novel algorithms and provide the first theoretical results for both cases when $1<p<2$ and $2\leq p\leq \infty$.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/su23b.html
https://proceedings.mlr.press/v216/su23b.htmlSolving multi-model MDPs by coordinate ascent and dynamic programmingMulti-model Markov decision process (MMDP) is a promising framework for computing policies that are robust to parameter uncertainty in MDPs. MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models. Because MMDPs are NP-hard to solve, most methods resort to approximations. In this paper, we derive the policy gradient of MMDPs and propose CADP, which combines a coordinate ascent method and a dynamic programming algorithm for solving MMDPs. The main innovation of CADP compared with earlier algorithms is to take the coordinate ascent perspective to adjust model weights iteratively to guarantee monotone policy improvements to a local maximum. A theoretical analysis of CADP proves that it never performs worse than previous dynamic programming algorithms like WSU. Our numerical results indicate that CADP substantially outperforms existing methods on several benchmark problems.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/su23a.html
https://proceedings.mlr.press/v216/su23a.htmlFast Heterogeneous Federated Learning with Hybrid Client SelectionClient selection schemes are widely adopted to handle the communication-efficient problems in recent studies of Federated Learning (FL). However, the large variance of the model updates aggregated from the randomly-selected unrepresentative subsets directly slows the FL convergence. We present a novel clustering-based client selection scheme to accelerate the FL convergence by variance reduction. Simple yet effective schemes are designed to improve the clustering effect and control the effect fluctuation, therefore, generating the client subset with certain representativeness of sampling. Theoretically, we demonstrate the improvement of the proposed scheme in variance reduction. We also present the tighter convergence guarantee of the proposed method thanks to the variance reduction. Experimental results confirm the exceed efficiency of our scheme compared to alternatives.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/song23b.html
https://proceedings.mlr.press/v216/song23b.htmlViBid: Linear Vision Transformer with Bidirectional NormalizationThe vision transformer has achieved state-of-the-art performance in various vision tasks; however, the memory consumption is larger than those of previous convolutional neural network based models because of $O(N^2)$ time and memory complexity of the general self-attention models. Many approaches aim to change the complexity to $O(N)$ to solve this problem; however, they stack deep convolutional layers to retain locality or complicate the architecture as seen in window attention, to compensate for the performance degradation. To solve these problems, we propose ViBid algorithm, which resolves the complexity problem of $O(N^2)$ by replacing Softmax with bidirectional normalization (BiNorm). In addition, it has a much simpler architecture than the existing transformer model with $O(N)$ complexity. Owing to our simple architecture, we were able to use larger resolutions for training, and we obtained a lighter and superior GPU throughput model with competitive performance. ViBid can be used with any transformer method that uses queries, keys, and values (QKV) because of BiNorm, and it is quite universal due to its simple architectural structure.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/song23a.html
https://proceedings.mlr.press/v216/song23a.htmlAligned Diffusion Schrödinger BridgesDiffusion Schrödinger bridges (DSBs) have recently emerged as a powerful framework for recovering stochastic dynamics via their marginal observations at different time points. Despite numerous successful applications, existing algorithms for solving DSBs have so far failed to utilize the structure of aligned data, which naturally arises in many biological phenomena. In this paper, we propose a novel algorithmic framework that, for the first time, solves DSBs while respecting the data alignment. Our approach hinges on a combination of two decades-old ideas: The classical Schrödinger bridge theory and Doob’s $h$-transform. Compared to prior methods, our approach leads to a simpler training procedure with lower variance, which we further augment with principled regularization schemes. This ultimately leads to sizeable improvements across experiments on synthetic and real data, including the tasks of predicting conformational changes in proteins and temporal evolution of cellular differentiation processes.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/somnath23a.html
https://proceedings.mlr.press/v216/somnath23a.htmlOn the limitations of Markovian rewards to express multi-objective, risk-sensitive, and modal tasksIn this paper, we study the expressivity of scalar, Markovian reward functions in Reinforcement Learning (RL), and identify several limitations to what they can express. Specifically, we look at three classes of RL tasks; multi-objective RL, risk-sensitive RL, and modal RL. For each class, we derive necessary and sufficient conditions that describe when a problem in this class can be expressed using a scalar, Markovian reward. Moreover, we find that scalar, Markovian rewards are unable to express most of the instances in each of these three classes. We thereby contribute to a more complete understanding of what standard reward functions can and cannot express. In addition to this, we also call attention to modal problems as a new class of problems, since they have so far not been given any systematic treatment in the RL literature. We also briefly outline some approaches for solving some of the problems we discuss, by means of bespoke RL algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/skalse23a.html
https://proceedings.mlr.press/v216/skalse23a.htmlProbabilistic Flow Circuits: Towards Unified Deep Models for Tractable Probabilistic InferenceWe consider the problem of increasing the expressivity of probabilistic circuits by augmenting them with the successful generative models of normalizing flows. To this effect, we theoretically establish the requirement of decomposability for such combinations to retain tractability of the learned models. Our model, called Probabilistic Flow Circuits, essentially extends circuits by allowing for normalizing flows at the leaves. Our empirical evaluation clearly establishes the expressivity and tractability of this new class of probabilistic circuits.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sidheekh23a.html
https://proceedings.mlr.press/v216/sidheekh23a.htmlA Bayesian approach for bandit online optimization with switching costAs a classical problem, online optimization with switching cost has been studied for a long time due to its wide applications in various areas. However, few works have investigated the bandit setting where both the forms of the main cost function $f(x)$ evaluated at state $x$ and the switching cost function $c(x, y)$ of transitioning from state $x$ to $y$ are unknown. In this paper, we consider the situation when $\left(f(x_t)+\varepsilon_t,\,{c}(x_t, x_{t-1})\right)$ can be observed with noise $\varepsilon_t$ after making a decision $x_t$ at time $t$, aiming to minimize the expected total cost within a time horizon. To solve this problem, we propose two algorithms from a Bayesian approach, named Greedy Search and Alternating Search, respectively. They have different theoretical guarantees of competitive ratios under mild regularity conditions, and the latter algorithm achieves a faster running speed. Using simulations of two classical black-box optimization problems, we demonstrate the superior performance of our algorithms compared with the classical method.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/shi23b.html
https://proceedings.mlr.press/v216/shi23b.htmlLearning Nonlinear Causal Effect via Kernel Anchor RegressionLearning causal effects is a fundamental problem in science. Anchor regression has been developed to address this problem for a large class of causal graphical models, though the relationships between the variables are assumed to be linear. In this work, we tackle the nonlinear setting by proposing kernel anchor regression (KAR). Beyond a classic two-stage least square (2SLS) estimator, we also study an improved variant that involves nonparametric kernel regression in three separate stages. We provide convergence results for the proposed KAR estimators and the identifiability conditions for KAR to learn the nonlinear structural equation models (SEM). Experimental results demonstrate the superior performances of the proposed KAR estimators over existing baselines.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/shi23a.html
https://proceedings.mlr.press/v216/shi23a.htmlRisk-limiting financial audits via weighted sampling without replacementWe introduce the notion of risk-limiting financial audits (RLFA): procedures that manually evaluate a subset of $N$ financial transactions to check the validity of a claimed assertion $\mathcal{A}$ about the transactions. More specifically, RLFA satisfy two properties: (i) if $\mathcal{A}$ is false, they correctly disprove it with probability at least $1-\delta$, and (ii) they validate the correctness of $\mathcal{A}$ with probability $1$, if it is true. We propose a general RLFA strategy, by constructing new confidence sequences (CSs) for the weighted average of $N$ unknown values, based on samples drawn without replacement from a (randomized) weighted sampling scheme. Next, we develop methods to improve the quality of CSs by incorporating side information about the unknown values. We show that when the side information is sufficiently accurate, it can directly drive the sampling. For the case where the accuracy is unknown <em>a priori</em>, we introduce an alternative approach using control variates. Crucially, our construction adapts to the quality of side information by strongly leveraging the side information if it is highly predictive, and learning to ignore it if it is uninformative. Our methods also recover the state-of-the-art bounds for the special case of uniformly sampled observations with no side information, which has already found applications in election auditing. The harder weighted case with general side information solves the more challenging problem of AI-assisted financial auditing.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/shekhar23a.html
https://proceedings.mlr.press/v216/shekhar23a.htmlSymNet 3.0: Exploiting Long-Range Influences in Learning Generalized Neural Policies for Relational MDPsWe focus on the learning of generalized neural policies for Relational Markov Decision Processes (RMDPs) expressed in RDDL. Recent work first converts the instances of a relational domain into an instance graph, and then trains a Graph Attention Network (GAT) of fixed depth with parameters shared across instances to learn a state representation, which can be decoded to get the policy [sharma et al., 22]. Unfortunately, this approach struggles to learn policies that exploit long-range dependencies – a fact we formally prove in this paper. As a remedy, we first construct a novel influence graph characterized by edges capturing one-step influence (dependence) between nodes based on the transition model. We then define influence distance between two nodes as the shortest path between them in this graph – a feature we exploit to represent long-range dependencies. We show that our architecture, referred to as Symbolic Influence Network (SymNet3.0), with its distance-based features, does not suffer from the representational issues faced by earlier approaches. Extensive experimentation demonstrates that we are competitive with existing baselines on 12 standard IPPC domains, and perform significantly better on six additional domains (including IPPC variants), designed to test a model’s capability in capturing long-range dependencies. Further analysis shows that SymNet3.0 automatically learns to focus on nodes that have key information for representing policies that capture long-range dependencies.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sharma23c.html
https://proceedings.mlr.press/v216/sharma23c.htmlCounting Background Knowledge Consistent Markov Equivalent Directed Acyclic GraphsWe study the problem of counting the number of directed acyclic graphs in a Markov equivalence class (MEC) that are consistent with <em>background knowledge</em> specified in the form of the directions of some additional edges in the MEC. A polynomial-time algorithm for the special case of the problem, when no background knowledge constraints are specified, was given by Wienöbst, Bannach, and Liskiewicz (AAAI 2021), who also showed that the general case is NP-hard (in fact, #P-hard). In this paper, we show that the problem is nevertheless tractable in an interesting class of instances, by establishing that it is “fixed-parameter tractable”: we give an algorithm that runs in time $O(k! k^2 n^4)$, where $n$ is the number of nodes in the MEC and $k$ is the maximum number of nodes in any maximal clique of the MEC that participate in the specified background knowledge constraints. In particular, our algorithm runs in polynomial time in the well-studied special case of MECs of bounded tree-width or bounded maximum clique size.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sharma23b.html
https://proceedings.mlr.press/v216/sharma23b.htmlEfficiently learning the graph for semi-supervised learningComputational efficiency is a major bottleneck in using classic graph-based approaches for semi-supervised learning on datasets with a large number of unlabeled examples. Known techniques to improve efficiency typically involve an approximation of the graph regularization objective, but suffer two major drawbacks - first the graph is assumed to be known or constructed with heuristic hyperparameter values, second they do not provide a principled approximation guarantee for learning over the full unlabeled dataset. Building on recent work on learning graphs for semi-supervised learning from multiple datasets for problems from the same domain, and leveraging techniques for fast approximations for solving linear systems in the graph Laplacian matrix, we propose algorithms that overcome both the above limitations. We show a formal separation in the learning-theoretic complexity of sparse and dense graph families. We further show how to approximately learn the best graphs from the sparse families efficiently using the conjugate gradient method. Our approach can also be used to learn the graph efficiently online with sub-linear regret, under mild smoothness assumptions. Our online learning results are stated generally, and may be useful for approximate and efficient parameter tuning in other problems. We implement our approach and demonstrate significant ($\sim$10-100x) speedups over prior work on semi-supervised learning with learned graphs on benchmark datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sharma23a.html
https://proceedings.mlr.press/v216/sharma23a.htmlImplicit Training of Inference Network Models for Structured PredictionMost research in deep learning has predominantly focused on the development of new models and training procedures. In contrast, the exploration of training objectives has received considerably less attention, often limited to combinations of standard losses. When dealing with complex structured outputs, the effectiveness of conventional objectives as proxies for the true objective becomes can be questionable. In this study, we propose that existing inference network-based methods for structured prediction, as observed in previous works [Tu and Gimpel, 2018, Tu et al., 2020a], indirectly learn to optimize a dynamic loss objective parameterized by the energy model. Based on this insight, we propose a method that treats the energy network as a trainable loss function and employs an implicit-gradient-based technique to learn the corresponding dynamic objective. We experiment with multiple tasks such as multi-label classification, entity recognition etc., and find significant performance improvements over baseline approaches. Our results demonstrate that implicitly learning a dynamic loss landscape proves to be an effective approach for enhancing model performance in structured prediction tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/shankar23a.html
https://proceedings.mlr.press/v216/shankar23a.htmlMnemonist: Locating Model Parameters that Memorize Training ExamplesRecent work has shown that an adversary can reconstruct training examples given access to the parameters of a deep learning image classification model. We show that the quality of reconstruction depends heavily on the type of activation functions used. In particular, we show that ReLU activations lead to much lower quality reconstructions compared to smooth activation functions. We explore if this phenomenon is a fundamental property of models with ReLU activations, or if it is a weakness of current attack strategies. We first study the training dynamics of small MLPs with ReLU activations and identify redundant model parameters that do not memorise training examples. Building on this, we propose our Mnemonist method, which is able to detect redundant model parameters, and then guide current attacks to focus on informative parameters to improve the quality of reconstructions of training examples from ReLU models.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/shahin-shamsabadi23a.html
https://proceedings.mlr.press/v216/shahin-shamsabadi23a.htmlMDPose: real-time multi-person pose estimation via mixture density modelOne of the major challenges in multi-person pose estimation is instance-aware keypoint estimation. Previous methods address this problem by leveraging an off-the-shelf detector, heuristic post-grouping process or explicit instance identification process, hindering further improvements in the inference speed which is an important factor for practical applications. From the statistical point of view, those additional processes for identifying instances are necessary to bypass learning the high-dimensional joint distribution of human keypoints, which is a critical factor for another major challenge, the occlusion scenario. In this work, we propose a novel framework of single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints with a mixture density model, termed as MDPose. Our MDPose estimates the distribution of human keypoints’ coordinates using a mixture density model with an instance-aware keypoint head consisting simply of 8 convolutional layers. It is trained by minimizing the negative log-likelihood of the ground truth keypoints. Also, we propose a simple yet effective training strategy, Random Keypoint Grouping (RKG), which significantly alleviates the underflow problem leading to successful learning of relations between keypoints. On OCHuman dataset, which consists of images with highly occluded people, our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints. Furthermore, our MDPose shows significant improvement in inference speed with a competitive accuracy on MS COCO, a widely-used human keypoint dataset, thanks to the proposed much simpler single-stage pipeline.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/seo23a.html
https://proceedings.mlr.press/v216/seo23a.htmlLearning from Low Rank Tensor Data: A Random Tensor Theory PerspectiveUnder a simplified data model, this paper provides a theoretical analysis of learning from data that have an underlying low-rank tensor structure in both supervised and unsupervised settings. For the supervised setting, we provide an analysis of a Ridge classifier (with high regularization parameter) with and without knowledge of the low-rank structure of the data. Our results quantify analytically the gain in misclassification errors achieved by exploiting the low-rank structure for denoising purposes, as opposed to treating data as mere vectors. We further provide a similar analysis in the context of clustering, thereby quantifying the exact performance gap between tensor methods and standard approaches which treat data as simple vectors.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/seddik23a.html
https://proceedings.mlr.press/v216/seddik23a.htmlLifelong bandit optimization: no prior and no regretMachine learning algorithms are often repeatedly. applied to problems with similar structure over and over again. We focus on solving a sequence of bandit optimization tasks and develop LIBO, an algorithm which adapts to the environment by learning from past experience and becomes more sample-efficient in the process. We assume a kernelized structure where the kernel is unknown but shared across all tasks. LIBO sequentially meta-learns a kernel that approximates the true kernel and solves the incoming tasks with the latest kernel estimate. Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees oracle optimal performance, meaning that as more tasks are solved, the regret of LIBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a sublinear bandit algorithm, LIBO yields a sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret. We propose F-LIBO, which solves the lifelong problem in a federated manner.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/schur23a.html
https://proceedings.mlr.press/v216/schur23a.htmlLocal Message Passing on Frustrated SystemsMessage passing on factor graphs is a powerful framework for probabilistic inference, which finds important applications in various scientific domains. The most wide-spread message passing scheme is the sum-product algorithm (SPA) which gives exact results on trees but often fails on graphs with many small cycles. We search for an alternative message passing algorithm that works particularly well on such cyclic graphs. Therefore, we challenge the extrinsic principle of the SPA, which loses its objective on graphs with cycles. We further replace the local SPA message update rule at the factor nodes of the underlying graph with a generic mapping, which is optimized in a data-driven fashion. These modifications lead to a considerable improvement in performance while preserving the simplicity of the SPA. We evaluate our method for two classes of cyclic graphs: the 2x2 fully connected Ising grid and factor graphs for symbol detection on linear communication channels with inter-symbol interference. To enable the method for large graphs as they occur in practical applications, we develop a novel loss function that is inspired by the Bethe approximation from statistical physics and allows for training in an unsupervised fashion.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/schmid23a.html
https://proceedings.mlr.press/v216/schmid23a.htmlLearning good interventions in causal graphs via coveringWe study the causal bandit problem that entails identifying a near-optimal intervention from a specified set A of (possibly non-atomic) interventions over a given causal graph. Here, an optimal intervention in A is one that maximizes the expected value for a designated reward variable in the graph, and we use the standard notion of simple regret to quantify near optimality. Considering Bernoulli random variables and for causal graphs on N vertices with constant in-degree, prior work has achieved a worst case guarantee of O(N/sqrt(T)) for simple regret. The current work utilizes the idea of covering interventions (which are not necessarily contained within A) and establishes a simple regret guarantee of O(sqrt(N/T)). Notably, and in contrast to prior work, our simple regret bound depends only on explicit parameters of the problem instance. We also go beyond prior work and achieve a simple regret guarantee for causal graphs with unobserved variables. Further, we perform experiments to show improvements over baselines in this setting.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sawarni23a.html
https://proceedings.mlr.press/v216/sawarni23a.htmlOnline Heavy-tailed Change-point detectionWe study algorithms for online change-point detection (OCPD), where samples that are potentially heavy-tailed, are presented one at a time and a change in the underlying mean must be detected as early as possible. We present an algorithm based on clipped Stochastic Gradient Descent (SGD), that works even if we only assume that the second moment of the data generating process is bounded. We derive guarantees on worst-case, finite-sample false-positive rate (FPR) over the family of all distributions with bounded second moment. Thus, our method is the first OCPD algorithm that guarantees finite-sample FPR, even if the data is high dimensional and the underlying distributions are heavy-tailed. The technical contribution of our paper is to show that clipped-SGD can estimate the mean of a random vector and simultaneously provide confidence bounds at all confidence values. We combine this robust estimate with a union bound argument and construct a sequential change-point algorithm with finite-sample FPR guarantees. We show empirically that our algorithm works well in a variety of situations, whether the underlying data are heavy-tailed, light-tailed, high dimensional or discrete. No other algorithm achieves bounded FPR theoretically or empirically, over all settings we study simultaneously.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sankararaman23a.html
https://proceedings.mlr.press/v216/sankararaman23a.htmlHeteroskedastic Geospatial Tracking with Distributed Camera NetworksVisual object tracking has seen significant progress in recent years. However, the vast majority of this work focuses on tracking objects within the image plane of a single camera and ignores the uncertainty associated with predicted object locations. In this work, we focus on the geospatial object tracking problem using data from a distributed camera network. The goal is to predict an object’s track in geospatial coordinates along with uncertainty over the object’s location while respecting communication constraints that prohibit centralizing raw image data. We present a novel single-object geospatial tracking data set that includes high-accuracy ground truth object locations and video data from a network of four cameras. We present a modeling framework for addressing this task including a novel backbone model and explore how uncertainty calibration and fine-tuning through a differentiable tracker affect performance.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/samplawski23a.html
https://proceedings.mlr.press/v216/samplawski23a.htmlIs the volume of a credal set a good measure for epistemic uncertainty?Adequate uncertainty representation and quantification have become imperative in various scientific disciplines, especially in machine learning and artificial intelligence. As an alternative to representing uncertainty via one single probability measure, we consider credal sets (convex sets of probability measures). The geometric representation of credal sets as d-dimensional polytopes implies a geometric intuition about (epistemic) uncertainty. In this paper, we show that the volume of the geometric representation of a credal set is a meaningful measure of epistemic uncertainty in the case of binary classification, but less so for multi-class classification. Our theoretical findings highlight the crucial role of specifying and employing uncertainty measures in machine learning in an appropriate way, and for being aware of possible pitfalls.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/sale23a.html
https://proceedings.mlr.press/v216/sale23a.htmlSemi-supervised learning of partial differential operators and dynamical flowsThe evolution of many dynamical systems is generically governed by nonlinear partial differential equations (PDEs), whose solution, in a simulation framework, requires vast amounts of computational resources. In this work, we present a novel method that combines a hyper-network solver with a Fourier Neural Operator architecture. Our method treats time and space separately and as a result, it successfully propagates initial conditions in continuous time steps by employing the general composition properties of the partial differential operators. Following previous works, supervision is provided at a specific time point. We test our method on various time evolution PDEs, including nonlinear fluid flows in one, two, or three spatial dimensions. The results show that the new method improves the learning accuracy at the time of the supervision point, and can interpolate the solutions to any intermediate time.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/rotman23a.html
https://proceedings.mlr.press/v216/rotman23a.htmlHallucinated adversarial control for conservative offline policy evaluationWe study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, collected by other agents, we seek to obtain a (tight) lower bound on a policy’s performance. This is crucial when deciding whether a given policy satisfies certain minimal performance/safety criteria before it can be deployed in the real world. To this end, we introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics. To form a conservative estimate of the policy’s performance, HAMBO hallucinates worst-case trajectories that the policy may take, within the margin of the models’ epistemic confidence regions. We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return. Finally, we discuss scalable variants of our approach based on Bayesian Neural Networks and empirically demonstrate that they yield reliable and tight lower bounds in various continuous control environments.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/rothfuss23a.html
https://proceedings.mlr.press/v216/rothfuss23a.htmlApproximately Bayes-optimal pseudo-label selectionSemi-supervised learning by self-training heavily relies on pseudo-label selection (PLS). This selection often depends on the initial model fit on labeled data. Early overfitting might thus be propagated to the final model by selecting instances with overconfident but erroneous predictions, often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this selection criterion by proving Bayes-optimality of the posterior predictive of pseudo-samples. We further overcome computational hurdles by approximating the criterion analytically. Its relation to the marginal likelihood allows us to come up with an approximation based on Laplace’s method and the Gaussian integral. We empirically assess BPLS on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/rodemann23a.html
https://proceedings.mlr.press/v216/rodemann23a.htmlThe past does matter: correlation of subsequent states in trajectory predictions of Gaussian Process modelsComputing the distribution of trajectories from a Gaussian Process model of a dynamical system is an important challenge in utilizing such models. Motivated by the computational cost of sampling-based approaches, we consider approximations of the model’s output and trajectory distribution. We show that previous work on uncertainty propagation, focussed on discrete state-space models, incorrectly included an independence assumption between subsequent states of the predicted trajectories. Expanding these ideas to continuous ordinary differential equation models, we illustrate the implications of this assumption and propose a novel piecewise linear approximation of Gaussian Processes to mitigate them.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ridderbusch23a.html
https://proceedings.mlr.press/v216/ridderbusch23a.htmlInference for probabilistic dependency graphsProbabilistic dependency graphs (PDGs) are a flexible class of probabilistic graphical models, subsuming Bayesian Networks and Factor Graphs. They can also capture inconsistent beliefs, and provide a way of measuring the degree of this inconsistency. We present the first tractable inference algorithm for PDGs with discrete variables, making the asymptotic complexity of PDG inference similar that of the graphical models they generalize. The key components are: (1) the observation that PDG inference can be reduced to convex optimization with exponential cone constraints, (2) a construction that allows us to express these problems compactly for PDGs of boundeed treewidth, for which we needed to further develop the theory of PDGs, and (3) an appeal to interior point methods that can solve such problems in polynomial time. We verify the correctness and time complexity of our approach, and provide an implementation of it. We then evaluate our implementation, and demonstrate that it outperforms baseline approaches.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/richardson23a.html
https://proceedings.mlr.press/v216/richardson23a.htmlValidation of composite systems by discrepancy propagationAssessing the validity of a real-world system with respect to given quality criteria is a common yet costly task in industrial applications due to the vast number of required real-world tests. Validating such systems by means of simulation offers a promising and less expensive alternative, but requires an assessment of the simulation accuracy and therefore end-to-end measurements. Additionally, covariate shifts between simulations and actual usage can cause difficulties for estimating the reliability of such systems. In this work, we present a validation method that propagates bounds on distributional discrepancy measures through a composite system, thereby allowing us to derive an upper bound on the failure probability of the real system from potentially inaccurate simulations. Each propagation step entails an optimization problem, where – for measures such as maximum mean discrepancy (MMD) – we develop tight convex relaxations based on semidefinite programs. We demonstrate that our propagation method yields valid and useful bounds for composite systems exhibiting a variety of realistic effects. In particular, we show that the proposed method can successfully account for data shifts within the experimental design as well as model inaccuracies within the simulation.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/reeb23a.html
https://proceedings.mlr.press/v216/reeb23a.htmlContrastive learning for supervised graph matchingDeep graph matching techniques have shown promising results in recent years. In this work, we cast deep graph matching as a contrastive learning task and introduce a new objective function for contrastive mapping to exploit the relationships between matches and non-matches. To this end, we develop a hardness attention mechanism to select negative samples which captures the relatedness and informativeness of positive and negative samples. Further, we propose a novel deep graph matching framework, Stable Graph Matching (StableGM), which incorporates Sinkhorn ranking into a stable marriage algorithm to efficiently compute one-to-one node correspondences between graphs. We prove that the proposed objective function for contrastive matching is both positive and negative informative, offering theoretical guarantees to achieve dual-optimality in graph matching. We empirically verify the effectiveness of our proposed approach by conducting experiments on standard graph matching benchmarks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ratnayaka23a.html
https://proceedings.mlr.press/v216/ratnayaka23a.htmlUSIM-DAL: Uncertainty-aware Statistical Image Modeling-based Dense Active Learning for Super-resolutionDense regression is a widely used approach in computer vision for tasks such as image super-resolution, enhancement, depth estimation, etc. However, the high cost of annotation and labeling makes it challenging to achieve accurate results. We propose incorporating active learning into dense regression models to address this problem. Active learning allows models to select the most informative samples for labeling, reducing the overall annotation cost while improving performance. Despite its potential, active learning has not been widely explored in high-dimensional computer vision regression tasks like super-resolution. We address this research gap and propose a new framework called USIM-DAL that leverages the statistical properties of colour images to learn informative priors using probabilistic deep neural networks that model the heteroscedastic predictive distribution allowing uncertainty quantification. Moreover, the aleatoric uncertainty from the network serves as a proxy for error that is used for active learning. Our experiments on a wide variety of datasets spanning applications in natural images (visual genome, BSD100), medical imaging (histopathology slides), and remote sensing (satellite images) demonstrate the efficacy of the newly proposed USIM-DAL and superiority over several dense regression active learning methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/rangnekar23a.html
https://proceedings.mlr.press/v216/rangnekar23a.htmlJana: Jointly amortized neural approximation of complex Bayesian modelsThis work proposes “jointly amortized neural approximation” (JANA) of intractable likelihood functions and posterior densities arising in Bayesian surrogate modeling and simulation-based inference. We train three complementary networks in an end-to-end fashion: 1) a summary network to compress individual data points, sets, or time series into informative embedding vectors; 2) a posterior network to learn an amortized approximate posterior; and 3) a likelihood network to learn an amortized approximate likelihood. Their interaction opens a new route to amortized marginal likelihood and posterior predictive estimation – two important ingredients of Bayesian workflows that are often too expensive for standard methods. We benchmark the fidelity of JANA on a variety of simulation models against state of-the-art Bayesian methods and propose a powerful and interpretable diagnostic for joint calibration. In addition, we investigate the ability of recurrent likelihood networks to emulate complex time series models without resorting to hand-crafted summary statistics.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/radev23a.html
https://proceedings.mlr.press/v216/radev23a.htmlSplit, count, and share: a differentially private set intersection cardinality estimation protocolWe describe a simple two-party protocol in which each party contributes a set as input. The output of the protocol is an estimate of the cardinality of the intersection of the two input sets. We show that our protocol is efficient and secure. We show that the space complexity and communication complexity are constant, the time complexity for each party is proportional to the size of their input set, and that our protocol is differentially private. We also analyze the distribution of the output of the protocol, deriving both its asymptotic distribution and finite-sample bounds on its tail probabilities. These analyses show that, when the input sets are large, our protocol produces accurate set intersection cardinality estimates. We claim that our protocol is an attractive alternative to traditional private set intersection cardinality (PSI-CA) protocols when the input sets are large, exact precision is not required, and differential privacy on its own can provide sufficient protection to the underlying sensitive data.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/purcell23a.html
https://proceedings.mlr.press/v216/purcell23a.htmlExact Count of Boundary Pieces of ReLU Classifiers: Towards the Proper Complexity Measure for ClassificationClassic learning theory suggests that proper regularization is the key to good generalization and robustness. In classification, current training schemes only target the complexity of the classifier itself, which can be misleading and ineffective. Instead, we advocate directly measuring the complexity of the decision boundary. Existing literature is limited in this area with few well-established definitions of boundary complexity. As a proof of concept, we start by analyzing ReLU neural networks, whose boundary complexity can be conveniently characterized by the number of affine pieces. With the help of tropical geometry, we develop a novel method that can explicitly count the exact number of boundary pieces, and as a by-product, the exact number of total affine pieces. Numerical experiments are conducted and distinctive properties of our boundary complexity are uncovered. First, the boundary piece count appears largely independent of other measures, e.g., total piece count, and $l_2$ norm of weights, during the training process. Second, the boundary piece count is negatively correlated with robustness, where popular robust training techniques, e.g., adversarial training or random noise injection, are found to reduce the number of boundary pieces.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/piwek23a.html
https://proceedings.mlr.press/v216/piwek23a.htmlBoosting AND/OR-based computational protein design: dynamic heuristics and generalizable UFOScientific computing has experienced a surge empowered by advancements in technologies such as neural networks. However, certain important tasks are less amenable to these technologies, benefiting from innovations to traditional inference schemes. One such task is protein re-design. Recently a new re-design algorithm, {AOBB-K\textsuperscript{*}}, was introduced and was competitive with state-of-the-art {BBK\textsuperscript{*}} on small protein re-design problems. However, {AOBB-K\textsuperscript{*}} did not scale well. In this work, we focus on scaling up {AOBB-K\textsuperscript{*}} and introduce three new versions: {AOBB-K\textsuperscript{*}}-b (boosted), {AOBB-K\textsuperscript{*}}-{DH} (with dynamic heuristics), and {AOBB-K\textsuperscript{*}}-{UFO} (with underflow optimization) that significantly enhance scalability.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/pezeshki23a.html
https://proceedings.mlr.press/v216/pezeshki23a.htmlCopula for Instance-wise Feature Selection and RankInstance-wise feature selection and ranking methods can achieve a good selection of task-friendly features for each sample in the context of neural networks. However, existing approaches that assume feature subsets to be independent are imperfect when considering the dependency between features. To address this limitation, we propose to incorporate the Gaussian copula, a powerful mathematical technique for capturing correlations between variables, into the current feature selection framework with no additional changes needed. Experimental results on both synthetic and real datasets, in terms of performance comparison and interpretability, demonstrate that our method is capable of capturing meaningful correlations.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/peng23a.html
https://proceedings.mlr.press/v216/peng23a.htmlMulti-View Independent Component Analysis with Shared and Individual SourcesIndependent component analysis (ICA) is a blind source separation method for linear disentanglement of independent latent sources from observed data. We investigate the special setting of noisy linear ICA, referred to as ShIndICA, where the observations are split among different views, each receiving a mixture of shared and individual sources. We prove that the corresponding linear structure is identifiable and the sources distribution can be recovered. To computationally estimate the sources, we optimize a constrained form of the joint log-likelihood of the observed data among all views. Furthermore, we propose a model selection procedure for recovering the number of shared sources. Finally, we empirically demonstrate the advantages of our model over baselines. We apply ShIndICA in a challenging real-life task, using three transcriptome datasets provided by three different labs (three different views). The recovered sources were used for a downstream graph inference task, facilitating the discovery of a plausible representation of the data’s underlying graph structure.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/pandeva23a.html
https://proceedings.mlr.press/v216/pandeva23a.htmlStochastic Generative Flow NetworksGenerative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of “inference as control”. They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/pan23a.html
https://proceedings.mlr.press/v216/pan23a.htmlMaximizing submodular functions under submodular constraintsWe consider the problem of maximizing submodular functions under submodular constraints by formulating the problem in two ways: SCSKC and DiffC. Given two submodular functions f and g where f is monotone, the objective of SCSKC problem is to find a set S of size at most k that maximizes f(S) under the constraint that g(S) < theta, for a given value of theta. The problem of DiffC focuses on finding a set S of size at most k such that h(S) = f(S)-g(S) is maximized. It is known that these problems are highly inapproximable and do not admit any constant factor multiplicative approximation algorithms unless NP is easy. Known approximation algorithms involve data-dependent approximation factors that are not efficiently computable. We initiate a study of the design of approximation algorithms where the approximation factors are efficiently computable. For the problem of SCSKC, we prove that the greedy algorithm produces a solution whose value is at least (1-1/e)f(OPT) - A, where A is the data-dependent additive error. For the DiffC problem, we design an algorithm that uses the SCSKC greedy algorithm as a subroutine. This algorithm produces a solution whose value is at least (1-1/e)h(OPT)-B, where B is also a data-dependent additive error. A salient feature of our approach is that the additive error terms can be computed efficiently, thus enabling us to ascertain the quality of the solutions produced.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/padmanabhan23a.html
https://proceedings.mlr.press/v216/padmanabhan23a.htmlBaysian numerical integration with neural networksBayesian probabilistic numerical methods for numerical integration offer significant advantages over their non-Bayesian counterparts: they can encode prior information about the integrand, and can quantify uncertainty over estimates of an integral. However, the most popular algorithm in this class, Bayesian quadrature, is based on Gaussian process models and is therefore associated with a high computational cost. To improve scalability, we propose an alternative approach based on Bayesian neural networks which we call Bayesian Stein networks. The key ingredients are a neural network architecture based on Stein operators, and an approximation of the Bayesian posterior based on the Laplace approximation. We show that this leads to orders of magnitude speed-ups on the popular Genz functions benchmark, and on challenging problems arising in the Bayesian analysis of dynamical systems, and the prediction of energy production for a large-scale wind farm.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ott23a.html
https://proceedings.mlr.press/v216/ott23a.htmlStructure-aware robustness certificates for graph classificationCertifying the robustness of a graph-based machine learning model poses a critical challenge for safety. Current robustness certificates for graph classifiers guarantee output invariance with respect to the total number of node pair flips (edge addition or edge deletion), which amounts to an {$l_{0}$} ball centred on the adjacency matrix. Although theoretically attractive, this type of isotropic structural noise can be too restrictive in practical scenarios where some node pairs are more critical than others in determining the classifier’s output. The certificate, in this case, gives a pessimistic depiction of the robustness of the graph model. To tackle this issue, we develop a randomised smoothing method based on adding an anisotropic noise distribution to the input graph structure. We show that our process generates structural-aware certificates for our classifiers, whereby the magnitude of robustness certificates can vary across different pre-defined structures of the graph. We demonstrate the benefits of these certificates in both synthetic and real-world experiments.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/osselin23a.html
https://proceedings.mlr.press/v216/osselin23a.htmlApproximate Thompson Sampling via Epistemic Neural NetworksThompson sampling (TS) is a popular heuristic for action selection, but it requires sampling from a posterior distribution. Unfortunately, this can become computationally intractable in complex environments, such as those modeled using neural networks. Approximate posterior samples can produce effective actions, but only if they reasonably approximate joint predictive distributions of outputs across inputs. Notably, accuracy of marginal predictive distributions does not suffice. Epistemic neural networks (ENNs) are designed to produce accurate joint predictive distributions. We compare a range of ENNs through computational experiments that assess their performance in approximating TS across bandit and reinforcement learning environments. The results indicate that ENNs serve this purpose well and illustrate how the quality of joint predictive distributions drives performance. Further, we demonstrate that the <em>epinet</em> – a small additive network that estimates uncertainty – matches the performance of large ensembles at orders of magnitude lower computational cost. This enables effective application of TS with computation that scales gracefully to complex environments.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/osband23a.html
https://proceedings.mlr.press/v216/osband23a.htmlGraph classification Gaussian processes via spectral featuresGraph classification aims to categorise graphs based on their structure and node attributes. In this work, we propose to tackle this task using tools from graph signal processing by deriving spectral features, which we then use to design two variants of Gaussian process models for graph classification. The first variant uses spectral features based on the distribution of energy of a node feature signal over the spectrum of the graph. We show that even such a simple approach, having no learned parameters, can yield competitive performance compared to strong neural network and graph kernel baselines. A second, more sophisticated variant is designed to capture multi-scale and localised patterns in the graph by learning spectral graph wavelet filters, obtaining improved performance on synthetic and real-world data sets. Finally, we show that both models produce well calibrated uncertainty estimates, enabling reliable decision making based on the model predictions.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/opolka23a.html
https://proceedings.mlr.press/v216/opolka23a.htmlIncentivizing honest performative predictions with proper scoring rulesProper scoring rules incentivize experts to accurately report beliefs, assuming predictions cannot influence outcomes. We relax this assumption and investigate incentives when predictions are performative, i.e., when they can influence the outcome of the prediction, such as when making public predictions about the stock market. We say a prediction is a fixed point if it accurately reflects the expert’s beliefs after that prediction has been made. We show that in this setting, reports maximizing expected score generally do not reflect an expert’s beliefs, and we give bounds on the inaccuracy of such reports. We show that, for binary predictions, if the influence of the expert’s prediction on outcomes is bounded, it is possible to define scoring rules under which optimal reports are arbitrarily close to fixed points. However, this is impossible for predictions over more than two outcomes. We also perform numerical simulations in a toy setting, showing that our bounds are tight in some situations and that prediction error is often substantial (greater than 5-10%). Lastly, we discuss alternative notions of optimality, including performative stability, and show that they incentivize reporting fixed points.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/oesterheld23a.html
https://proceedings.mlr.press/v216/oesterheld23a.htmlAn improved variational approximate posterior for the deep Wishart processDeep kernel processes are a recently introduced class of deep Bayesian models that have the flexibility of neural networks, but work entirely with Gram matrices. They operate by alternately sampling a Gram matrix from a distribution over positive semi-definite matrices, and applying a deterministic transformation. When the distribution is chosen to be Wishart, the model is called a deep Wishart process (DWP). This particular model is of interest because its prior is equivalent to a deep Gaussian process (DGP) prior, but at the same time it is invariant to rotational symmetries, leading to a simpler posterior distribution. Practical inference in the DWP was made possible in recent work (“A variational approximate posterior for the deep Wishart process” Ober and Aitchison, 2021a) where the authors used a generalisation of the Bartlett decomposition of the Wishart distribution as the variational approximate posterior. However, predictive performance in that paper was less impressive than one might expect, with the DWP only beating a DGP on a few of the UCI datasets used for comparison. In this paper, we show that further generalising their distribution to allow linear combinations of rows and columns in the Bartlett decomposition results in better predictive performance, while incurring negligible additional computation cost.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ober23a.html
https://proceedings.mlr.press/v216/ober23a.htmlSize-constrained k-submodular maximization in near-linear timeWe investigate the problems of maximizing k-submodular functions over total size constraints and over individual size constraints. k-submodularity is a generalization of submodularity beyond just picking items of a ground set, instead associating one of k types to chosen items. For sensor selection problems, for instance, this enables modeling of which type of sensor to put at a location, not simply whether to put a sensor or not. We propose and analyze threshold-greedy algorithms for both types of constraints. We prove that our proposed algorithms achieve the best known approximation ratios for both constraint types, up to a user-chosen parameter that balances computational complexity and the approximation ratio, while only using a number of function evaluations that depends linearly (up to poly-logarithmic terms) on the number of elements n, the number of types k, and the inverse of the user chosen parameter. Other algorithms that achieve the best-known deterministic approximation ratios require a number of function evaluations that depends linearly on the budget B, while our methods do not. We empirically demonstrate our algorithms’ performance in applications of sensor placement with k types and influence maximization with k topics.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nie23a.html
https://proceedings.mlr.press/v216/nie23a.htmlEfficient Failure Pattern Identification of Predictive AlgorithmsGiven a (machine learning) classifier and a collection of unlabeled data, how can we efficiently identify misclassification patterns presented in this dataset? To address this problem, we propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The recommendation algorithm is conceptualized as a stochastic sampler that, in each round, queries the annotators a subset of samples for their true labels and obtains the feedback information on whether the samples are misclassified. The sampling mechanism needs to balance between discovering new patterns of misclassification (exploration) and confirming the potential patterns of classification (exploitation). We construct a determinantal point process, whose intensity balances the exploration-exploitation trade-off through the weighted update of the posterior at each round to form the generator of the stochastic sampler. The numerical results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nguyen23c.html
https://proceedings.mlr.press/v216/nguyen23c.htmlProbabilistic Multi-Dimensional ClassificationMulti-dimensional classification (MDC) can be employed in a range of applications where one needs to predict multiple class variables for each given instance. Many existing MDC methods suffer from at least one of inaccuracy, scalability, limited use to certain types of data, hardness of interpretation or lack of probabilistic (uncertainty) estimations. This paper is an attempt to address all these disadvantages simultaneously. We propose a formal framework for probabilistic MDC in which learning an optimal multi-dimensional classifier can be decomposed, without loss of generality, into learning a set of (smaller) single-variable multi-class probabilistic classifiers and a directed acyclic graph. Current and future developments of both probabilistic classification and graphical model learning can directly enhance our framework, which is flexible and provably optimal. A collection of experiments is conducted to highlight the usefulness of this MDC framework.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nguyen23b.html
https://proceedings.mlr.press/v216/nguyen23b.htmlSimple Transferability Estimation for Regression TasksWe consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nguyen23a.html
https://proceedings.mlr.press/v216/nguyen23a.htmlKeep-Alive Caching for the Hawkes processWe study the design of caching policies in applications such as serverless computing where there is not a fixed size cache to be filled, but rather there is a cost associated with the time an item stays in the cache. We present a model for such caching policies which captures the trade-off between this cost and the cost of cache misses. We characterize optimal caching policies in general and apply this characterization by deriving a closed form for Hawkes processes. Since optimal policies for Hawkes processes depend on the history of arrivals, we also develop history-independent policies which achieve near-optimal average performance. We evaluate the performances of the optimal policy and approximate polices using simulations and a data trace of Azure Functions, Microsoft’s FaaS (Function as a Service) platform for serverless computingSun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/narayana23a.html
https://proceedings.mlr.press/v216/narayana23a.htmlTwo-phase Attacks in Security GamesA standard model of a security game assumes a one-off assault during which the attacker cannot update their strategy even if new actionable insights are gained in the process. In this paper, we propose a version of a security game that takes into account a possibility of a two-phase attack. Specifically, in the first phase, the attacker makes a preliminary move to gain extra information about this particular instance of the game. Based on this information, the attacker chooses an optimal concluding move. We derive a compact-form mixed-integer linear program that computes an optimal strategy of the defender. Our simulation shows that this strategy mitigates serious losses incurred to the defender by a two-phase attack while still protecting well against less sophisticated attackers.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nagorko23a.html
https://proceedings.mlr.press/v216/nagorko23a.htmlActive metric learning and classification using similarity queriesActive learning is commonly used to train label-efficient models by adaptively selecting the most informative queries. However, most active learning strategies are designed to either learn a representation of the data (e.g., embedding or metric learning) or perform well on a task (e.g., classification) on the data. However, many machine learning tasks involve a combination of both representation learning and a task-specific goal. Motivated by this, we propose a novel unified query framework that can be applied to any problem in which a key component is learning a representation of the data that reflects similarity. Our approach builds on similarity or nearest neighbor (NN) queries which seek to select samples that result in improved embeddings. The queries consist of a reference and a set of objects, with an oracle selecting the object most similar (i.e., nearest) to the reference. In order to reduce the number of solicited queries, they are chosen adaptively according to an information theoretic criterion. We demonstrate the effectiveness of the proposed strategy on two tasks - active metric learning and active classification - using a variety of synthetic and real world datasets. In particular, we demonstrate that actively selected NN queries outperform recently developed active triplet selection methods in a deep metric learning setting. Further, we show that in classification, actively selecting class labels can be reformulated as a process of selecting the most informative NN query, allowing direct application of our method.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nadagouda23a.html
https://proceedings.mlr.press/v216/nadagouda23a.htmlOn Testability and Goodness of Fit Tests in Missing Data ModelsSignificant progress has been made in developing identification and estimation techniques for missing data problems where modeling assumptions can be described via a directed acyclic graph. The validity of results using such techniques rely on the assumptions encoded by the graph holding true; however, verification of these assumptions has not received sufficient attention in prior work. In this paper, we provide new insights on the testable implications of three broad classes of missing data graphical models, and design goodness-of-fit tests for them. The classes of models explored are: sequential missing-at-random and missing-not-at-random models which can be used for modeling longitudinal studies with dropout/censoring, and a no self-censoring model which can be applied to cross-sectional studies and surveys.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/nabi23a.html
https://proceedings.mlr.press/v216/nabi23a.htmlComposing Efficient, Robust Tests for Policy SelectionModern reinforcement learning systems produce many high-quality policies throughout the learning process. However, to choose which policy to actually deploy in the real world, they must be tested under an intractable number of environmental conditions. We introduce RPOSST, an algorithm to select a small set of test cases from a larger pool based on a relatively small number of sample evaluations. RPOSST treats the test case selection problem as a two-player game and optimizes a solution with provable $k$-of-$N$ robustness, bounding the error relative to a test that used all the test cases in the pool. Empirical results demonstrate that RPOSST finds a small set of test cases that identify high quality policies in a toy one-shot game, poker datasets, and a high-fidelity racing simulator.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/morrill23a.html
https://proceedings.mlr.press/v216/morrill23a.htmlAdaptive Conditional Quantile Neural ProcessesNeural processes are a family of probabilistic models that inherit the flexibility of neural networks to parameterize stochastic processes. Despite providing well-calibrated predictions, especially in regression problems, and quick adaptation to new tasks, the Gaussian assumption that is commonly used to represent the predictive likelihood fails to capture more complicated distributions such as multimodal ones. To overcome this limitation, we propose Conditional Quantile Neural Processes (CQNPs), a new member of the neural processes family, which exploits the attractive properties of quantile regression in modeling the distributions irrespective of their form. By introducing an extension of quantile regression where the model learns to focus on estimating <em>informative</em> quantiles, we show that the sampling efficiency and prediction accuracy can be further enhanced. Our experiments with real and synthetic datasets demonstrate substantial improvements in predictive performance compared to the baselines, and better modeling of heterogeneous distributions’ characteristics such as multimodality.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/mohseni23a.html
https://proceedings.mlr.press/v216/mohseni23a.htmlOn Minimizing the Impact of Dataset Shifts on Actionable ExplanationsThe Right to Explanation is an important regulatory principle that allows individuals to request actionable explanations for algorithmic decisions. However, several technical challenges arise when providing such actionable explanations in practice. For instance, models are periodically retrained to handle dataset shifts. This process may invalidate some of the previously prescribed explanations, thus rendering them unactionable. But, it is unclear if and when such invalidations occur, and what factors determine explanation stability i.e., if an explanation remains unchanged amidst model retraining due to dataset shifts. In this paper, we address the aforementioned gaps and provide one of the first theoretical and empirical characterizations of the factors influencing explanation stability. To this end, we conduct rigorous theoretical analysis to demonstrate that model curvature, weight decay parameters while training, and the magnitude of the dataset shift are key factors that determine the extent of explanation (in)stability. Extensive experimentation with real-world datasets not only validates our theoretical results, but also demonstrates that the aforementioned factors dramatically impact the stability of explanations produced by various state-of-the-art methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/meyer23a.html
https://proceedings.mlr.press/v216/meyer23a.htmlOn the Relation between Policy Improvement and Off-Policy Minimum-Variance Policy EvaluationOff-policy methods are the basis of a large number of effective Policy Optimization (PO) algorithms. In this setting, Importance Sampling (IS) is typically employed for off-policy evaluation, with the goal of estimating the performance of a target policy, given samples collected with a different behavioral policy. However, in Monte Carlo simulation, IS represents a variance minimization approach. In this field, a suitable behavioral distribution is employed for sampling, allowing diminishing the variance of the estimator below the one achievable when sampling from the target distribution. In this paper, we analyze IS in these two guises in the context of PO. We provide a novel view of off-policy PO, showing a connection between the policy improvement and variance minimization objectives. Then, we illustrate how minimizing the off-policy variance can, in some circumstances, lead to a policy improvement, with the advantage, compared with direct off-policy learning, of implicitly enforcing a trust region. Finally, we present numerical simulations on continuous RL benchmarks, with a particular focus on the robustness to small batch sizes.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/metelli23a.html
https://proceedings.mlr.press/v216/metelli23a.htmlKrADagrad: Kronecker approximation-domination gradient preconditioned stochastic optimizationSecond order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a Kronecker factored preconditioner to reduce these requirements: it is used for large deep models [Anil et al., 2020] and in production [Anil et al., 2022]. However, it takes inverse matrix roots of ill-conditioned matrices. This requires 64-bit precision, imposing strong hardware constraints. In this paper, we propose a novel factorization, Kronecker Approximation-Domination (KrAD). Using KrAD, we update a matrix that directly approximates the inverse empirical Fisher matrix (like full matrix AdaGrad), avoiding inversion and hence 64-bit precision. We then propose KrADagrad , with similar computational costs to Shampoo and the same regret. Synthetic ill-conditioned experiments show improved performance over Shampoo for 32-bit precision, while for several real datasets we have comparable or better generalization.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/mei23a.html
https://proceedings.mlr.press/v216/mei23a.htmlCausal information splitting: Engineering proxy features for robustness to distribution shiftsStatistical prediction models are often trained on data that is drawn from different probability distributions than their eventual use cases. One approach to proactively prepare for these shifts harnesses the intuition that causal mechanisms should remain invariant between environments. Here we focus on a challenging setting in which the causal and anticausal variables of the target are unobserved. Leaning on information theory, we develop feature selection and engineering techniques for the observed downstream variables that act as proxies. We identify proxies that help to build stable models and moreover utilize auxiliary training tasks to extract stability-enhancing information from proxies. We demonstrate the effectiveness of our techniques on synthetic and real data.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/mazaheri23a.html
https://proceedings.mlr.press/v216/mazaheri23a.htmlTCE: A Test-Based Approach to Measuring Calibration ErrorThis paper proposes a new metric to measure the calibration error of probabilistic binary classifiers, called test-based calibration error (TCE). TCE incorporates a novel loss function based on a statistical test to examine the extent to which model predictions differ from probabilities estimated from data. It offers (i) a clear interpretation, (ii) a consistent scale that is unaffected by class imbalance, and (iii) an enhanced visual representation with respect to the standard reliability diagram. In addition, we introduce an optimality criterion for the binning procedure of calibration error metrics based on a minimal estimation error of the empirical probabilities. We provide a novel computational algorithm for optimal bins under bin-size constraints. We demonstrate properties of TCE through a range of experiments, including multiple real-world imbalanced datasets and ImageNet 1000.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/matsubara23a.html
https://proceedings.mlr.press/v216/matsubara23a.htmlKnowledge Intensive Learning of Cutset NetworksCutset networks (CNs) are interpretable probabilistic representations that combine probability trees and tree Bayesian networks, to model and reason about large multi-dimensional probability distributions. Motivated by high-stakes applications in domains such as healthcare where (a) rich domain knowledge in the form of qualitative influences is readily available and (b) use of interpretable models that the user can efficiently probe and infer over is often necessary, we focus on learning CNs in the presence of qualitative influences. We propose a penalized objective function that uses the influences as constraints, and develop a gradient-based learning algorithm, KICN. We show that because CNs are tractable, KICN is guaranteed to converge to a local maximum of the penalized objective function. Our experiments on several benchmark data sets show that our new algorithm is superior to the state-of-the-art, especially when the data is scarce or noisy.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/mathur23a.html
https://proceedings.mlr.press/v216/mathur23a.htmlPartial identification of dose responses with hidden confoundersInferring causal effects of continuous-valued treatments from observational data is a crucial task promising to better inform policy- and decision-makers. A critical assumption needed to identify these effects is that all confounding variables—causal parents of both the treatment and the outcome—are included as covariates. Unfortunately, given observational data alone, we cannot know with certainty that this criterion is satisfied. Sensitivity analyses provide principled ways to give bounds on causal estimates when confounding variables are hidden. While much attention is focused on sensitivity analyses for discrete-valued treatments, much less is paid to continuous-valued treatments. We present novel methodology to bound both average and conditional average continuous-valued treatment-effect estimates when they cannot be point identified due to hidden confounding. A semi-synthetic benchmark on multiple datasets shows our method giving tighter coverage of the true dose-response curve than a recently proposed continuous sensitivity model and baselines. Finally, we apply our method to a real-world observational case study to demonstrate the value of identifying dose-dependent causal effects.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/marmarelis23a.html
https://proceedings.mlr.press/v216/marmarelis23a.htmlThe Shrinkage-Delinkage Trade-off: an Analysis of Factorized Gaussian Approximations for Variational InferenceWhen factorized approximations are used for variational inference (VI), they tend to underestimate the uncertainty—as measured in various ways—of the distributions they are meant to approximate. We consider two popular ways to measure the uncertainty deficit of VI: (i) the degree to which it underestimates the componentwise variance, and (ii) the degree to which it underestimates the entropy. To better understand these effects, and the relationship between them, we examine an informative setting where they can be explicitly (and elegantly) analyzed: the approximation of a Gaussian, $p$, with a dense covariance matrix, by a Gaussian, $q$, with a diagonal covariance matrix. We prove that $q$ always underestimates both the componentwise variance and the entropy of $p$, <em>though not necessarily to the same degree</em>. Moreover we demonstrate that the entropy of $q$ is determined by the trade-off of two competing forces: it is decreased by the shrinkage of its componentwise variances (our first measure of uncertainty) but it is increased by the factorized approximation which delinks the nodes in the graphical model of $p$. We study various manifestations of this trade-off, notably one where, as the dimension of the problem grows, the per-component entropy gap between $p$ and $q$ becomes vanishingly small even though $q$ underestimates every componentwise variance by a constant multiplicative factor. We also use the shrinkage-delinkage trade-off to bound the entropy gap in terms of the problem dimension and the condition number of the correlation matrix of $p$. Finally we present empirical results on both Gaussian and non-Gaussian targets, the former to validate our analysis and the latter to explore its limitations.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/margossian23a.html
https://proceedings.mlr.press/v216/margossian23a.htmlRandom Reshuffling with Variance Reduction: New Analysis and Better RatesVirtually all state-of-the-art methods for training supervised machine learning models are variants of Stochastic Gradient Descent (SGD), enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. However, one of the most basic questions in the design of successful SGD methods, one that is orthogonal to the aforementioned tricks, is the choice of the next training data point to be learning from. Standard variants of SGD employ a sampling with replacement strategy, which means that the next training data point is sampled from the entire data set, often independently of all previous samples. While standard SGD is well understood theoretically, virtually all widely used machine learning software is based on sampling without replacement as this is often empirically superior. That is, the training data is randomly shuffled/permuted, either only once at the beginning, strategy known as random shuffling (RS), or before every epoch, strategy known as random reshuffling (RR), and training proceeds in the data order dictated by the shuffling. RS and RR strategies have for a long time remained beyond the reach of theoretical analysis that would satisfactorily explain their success. However, very recently, Mishchenko et al. [2020] provided tight sublinear convergence rates through a novel analysis, and showed that these strategies can improve upon standard SGD in certain regimes. Inspired by these results, we seek to further improve the rates of shuffling-based methods. In particular, we show that it is possible to enhance them with a variance reduction mechanism, obtaining linear convergence rates. To the best of our knowledge, our linear convergence rates are the best for any method based on sampling without replacement.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/malinovsky23a.html
https://proceedings.mlr.press/v216/malinovsky23a.htmlFederated learning of models pre-trained on different features with consensus graphsLearning an effective global model on private and decentralized datasets has become an increasingly important challenge of machine learning when applied in practice. Existing distributed learning paradigms, such as Federated Learning, enable this via model aggregation which enforces a strong form of modeling homogeneity and synchronicity across clients. This is however not suitable to many practical scenarios. For example, in distributed sensing, heterogeneous sensors reading data from different views of the same phenomenon would need to use different models for different data modalities. Local learning therefore happens in isolation but inference requires merging the local models to achieve consensus. To enable consensus among local models, we propose a feature fusion approach that extracts local representations from local models and incorporates them into a global representation that improves the prediction performance. Achieving this requires addressing two non-trivial problems. First, we need to learn an alignment between similar feature components which are arbitrarily arranged across clients to enable representation aggregation. Second, we need to learn a consensus graph that captures the high-order interactions between local feature spaces and how to combine them to achieve a better prediction. This paper presents solutions to these problems and demonstrates them in real-world applications on time series data such as power grids and traffic networks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ma23b.html
https://proceedings.mlr.press/v216/ma23b.htmlDeepGD3: Unknown-Aware Deep Generative/Discriminative Hybrid Defect Detector for PCB Soldering InspectionWe present a novel approach for detecting soldering defects in Printed Circuit Boards (PCBs) composed mainly of Surface Mount Technology (SMT) components, using advanced computer vision and deep learning techniques. The main challenge addressed is the detection of soldering defects in new components for which only samples of good soldering are available at the model training phase. To address this, we design a system composed of generative and discriminative models to leverage the knowledge gained from the soldering samples of old components to detect the soldering defects of new components. To meet industrial quality standards, we keep the leakage rate (i.e., miss detection rate) low by making the system "unknown-aware" with a low unknown rate. We evaluated the method on a real-world dataset from an electronics company. It significantly reduces the leakage rate from 1.827% $\pm$ 3.063% and 1.942% $\pm$ 1.337% to 0.063% $\pm$ 0.075% with an unknown rate of 3.706% $\pm$ 2.270% compared to the discriminative and generative approaches, respectively.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ma23a.html
https://proceedings.mlr.press/v216/ma23a.htmlPractical privacy-preserving Gaussian process regression via secret sharingGaussian process regression (GPR) is a non-parametric model that has been used in many real-world applications that involve sensitive personal data (e.g., healthcare, finance, etc.) from multiple data owners. To fully and securely exploit the value of different data sources, this paper proposes a privacy-preserving GPR method based on secret sharing (SS), a secure multi-party computation (SMPC) technique. In contrast to existing studies that protect the data privacy of GPR via homomorphic encryption, differential privacy, or federated learning, our proposed method is more practical and can be used to preserve the data privacy of both the model inputs and outputs for various data-sharing scenarios (e.g., horizontally/vertically-partitioned data). However, it is non-trivial to directly apply SS on the conventional GPR algorithm, as it includes some operations whose accuracy and/or efficiency have not been well-enhanced in the current SMPC protocol. To address this issue, we derive a new SS-based exponentiation operation through the idea of “confusion-correction” and construct an SS-based matrix inversion algorithm based on Cholesky decomposition. More importantly, we theoretically analyze the communication cost and the security of the proposed SS-based operations. Empirical results show that our proposed method can achieve reasonable accuracy and efficiency under the premise of preserving data privacy.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/luo23a.html
https://proceedings.mlr.press/v216/luo23a.htmlBenefits of monotonicity in safe exploration with Gaussian processesWe consider the problem of sequentially maximising an unknown function over a set of actions while ensuring that every sampled point has a function value below a given safety threshold. We model the function using kernel-based and Gaussian process methods, while differing from previous works in our assumption that the function is monotonically increasing with respect to a safety variable. This assumption is motivated by various practical applications such as adaptive clinical trial design and robotics. Taking inspiration from the GP-UCB and SAFEOPT algorithms, we propose an algorithm, monotone safe UCB (M-SafeUCB) for this task. We show that M-SafeUCB enjoys theoretical guarantees in terms of safety, a suitably-defined regret notion, and approximately finding the entire safe boundary. In addition, we illustrate that the monotonicity assumption yields significant benefits in terms of the guarantees obtained, as well as algorithmic simplicity and efficiency. We support our theoretical findings by performing empirical evaluations on a variety of functions, including a simulated clinical trial experiment.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/losalka23a.html
https://proceedings.mlr.press/v216/losalka23a.htmlNo-Regret Linear Bandits beyond RealizabilityWe study linear bandits when the underlying reward function is <em>not</em> linear. Existing work relies on a uniform misspecification parameter $\epsilon$ that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever $\epsilon > 0$. We describe a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm — designed for the realizable case — is automatically robust against such gap-adjusted misspecification. It achieves a near-optimal $\sqrt{T}$ regret for problems that the best-known regret is almost linear in time horizon $T$. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/liu23c.html
https://proceedings.mlr.press/v216/liu23c.htmlResidual-based error bound for physics-informed neural networksNeural networks are universal approximators and are studied for their use in solving differential equations. However, a major criticism is the lack of error bounds for obtained solutions. This paper proposes a technique to rigorously evaluate the error bound of Physics-Informed Neural Networks (PINNs) on most linear ordinary differential equations (ODEs), certain nonlinear ODEs, and first-order linear partial differential equations (PDEs). The error bound is based purely on equation structure and residual information and does not depend on assumptions of how well the networks are trained. We propose algorithms that bound the error efficiently. Some proposed algorithms provide tighter bounds than others at the cost of longer run time.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/liu23b.html
https://proceedings.mlr.press/v216/liu23b.htmlAccelerating Voting by Quantum ComputationStudying the computational complexity and designing fast algorithms for determining winners under voting rules are classical and fundamental questions in computational social choice. In this paper, we accelerate voting by leveraging quantum computation: we propose a quantum-accelerated voting algorithm that can be applied to any anonymous voting rule. We show that our algorithm can be quadratically faster than any classical algorithm (based on sampling with replacement) under a wide range of common voting rules, including positional scoring rules, Copeland, and single transferable voting (STV). Precisely, our quantum-accelerated voting algorithm outputs the correct winner with high probability in $\Theta\left(\frac{n}{\text{MOV}}\right)$ time, where $n$ is the number of votes and $\text{MOV}$ is <em>margin of victory</em>, the smallest number of voters to change the winner. In contrast, any classical voting algorithm based on sampling with replacement requires $\Omega\left(\frac{n^2}{\text{MOV}^2}\right)$ time under a large class of voting rules. Our theoretical results are supported by experiments under plurality, Borda, Copeland, and STV.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/liu23a.html
https://proceedings.mlr.press/v216/liu23a.htmlBISCUIT: Causal Representation Learning from Binary InteractionsIdentifying the causal variables of an environment and how to intervene on them is of core value in applications such as robotics and embodied AI. While an agent can commonly interact with the environment and may implicitly perturb the behavior of some of these causal variables, often the targets it affects remain unknown. In this paper, we show that causal variables can still be identified for many common setups, e.g., additive Gaussian noise models, if the agent’s interactions with a causal variable can be described by an unknown binary variable. This happens when each causal variable has two different mechanisms, e.g., an observational and an interventional one. Using this identifiability result, we propose BISCUIT, a method for simultaneously learning causal variables and their corresponding binary interaction variables. On three robotic-inspired datasets, BISCUIT accurately identifies causal variables and can even be scaled to complex, realistic environments for embodied AI.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/lippe23a.html
https://proceedings.mlr.press/v216/lippe23a.htmlCUE: An Uncertainty Interpretation Framework for Text Classifiers Built on Pre-Trained Language ModelsText classifiers built on Pre-trained Language Models (PLMs) have achieved remarkable progress in various tasks including sentiment analysis, natural language inference, and question-answering. However, the occurrence of uncertain predictions by these classifiers poses a challenge to their reliability when deployed in practical applications. Much effort has been devoted to designing various probes in order to understand what PLMs capture. But few studies have delved into factors influencing PLM-based classifiers’ predictive uncertainty. In this paper, we propose a novel framework, called CUE, which aims to interpret uncertainties inherent in the predictions of PLM-based models. In particular, we first map PLM-encoded representations to a latent space via a variational auto-encoder. We then generate text representations by perturbing the latent space which causes fluctuation in predictive uncertainty. By comparing the difference in predictive uncertainty between the perturbed and the original text representations, we are able to identify the latent dimensions responsible for uncertainty and subsequently trace back to the input features that contribute to such uncertainty. Our extensive experiments on four benchmark datasets encompassing linguistic acceptability classification, emotion classification, and natural language inference show the feasibility of our proposed framework. Our source code is available at https://github.com/lijiazheng99/CUE.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/li23d.html
https://proceedings.mlr.press/v216/li23d.htmlGaussian Process Surrogate Models for Neural NetworksNot being able to understand and predict the behavior of deep learning systems makes it hard to decide what architecture and algorithm to use for a given problem. In science and engineering, modeling is a methodology used to understand complex systems whose internal processes are opaque. Modeling replaces a complex system with a simpler, more interpretable surrogate. Drawing inspiration from this, we construct a class of surrogate models for neural networks using Gaussian processes. Rather than deriving kernels for infinite neural networks, we learn kernels empirically from the naturalistic behavior of finite neural networks. We demonstrate our approach captures existing phenomena related to the spectral bias of neural networks, and then show that our surrogate models can be used to solve practical problems such as identifying which points most influence the behavior of specific neural networks and predicting which architectures and algorithms will generalize well for specific datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/li23c.html
https://proceedings.mlr.press/v216/li23c.htmlNonconvex stochastic scaled gradient descent and generalized eigenvector problemsMotivated by the problem of online canonical correlation analysis, we propose the <em>Stochastic Scaled-Gradient Descent</em> (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of $\sqrt{1/T}$, and show that this rate is minimax optimal, up to a polylogarithmic factor of relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis are also provided for synthetic data.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/li23b.html
https://proceedings.mlr.press/v216/li23b.htmlMemory Mechanism for Unsupervised Anomaly DetectionUnsupervised anomaly detection is a binary classification that detects anomalies in unseen samples given only unlabeled normal data. Reconstruction-based approaches are widely used, which perform reconstruction error minimization on training data to learn normal patterns and quantify the degree of anomalies by reconstruction errors on testing data. However, this approach tends to miss anomalies when the normal data has multi-pattern. Because the model generalizes unrestrictedly beyond normal patterns even to include anomaly patterns. In this paper, we proposed a memory mechanism that memorizes typical normal patterns through a capacity-controlled external differentiable matrix so that the generalization of the model to anomalies is limited by the retrieval of the matrix. We achieved state-of-the-art performance on several public benchmarks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/li23a.html
https://proceedings.mlr.press/v216/li23a.htmlWhen are post-hoc conceptual explanations identifiable?Interest in understanding and factorizing learned embedding spaces through conceptual explanations is steadily growing. When no human concept labels are available, concept discovery methods search trained embedding spaces for interpretable concepts like object shape or color that can provide post-hoc explanations for decisions. Unlike previous work, we argue that concept discovery should be identifiable, meaning that a number of known concepts can be provably recovered to guarantee reliability of the explanations. As a starting point, we explicitly make the connection between concept discovery and classical methods like Principal Component Analysis and Independent Component Analysis by showing that they can recover independent concepts under non-Gaussian distributions. For dependent concepts, we propose two novel approaches that exploit functional compositionality properties of image-generating processes. Our provably identifiable concept discovery methods substantially outperform competitors on a battery of experiments including hundreds of trained models and dependent concepts, where they exhibit up to 29 % better alignment with the ground truth. Our results highlight the strict conditions under which reliable concept discovery without human labels can be guaranteed and provide a formal foundation for the domain. Our code is available online.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/leemann23a.html
https://proceedings.mlr.press/v216/leemann23a.htmlFinding Invariant Predictors Efficiently via Causal StructureOne fundamental problem in machine learning is out-of-distribution generalization. A method named the surgery estimator incorporates the causal structure in the form of a directed acyclic graph (DAG) to find predictors that are invariant across target domains using distributional invariances via Pearl’s do-calculus. However, finding a surgery estimator can take exponential time as the current methods need to search through all possible predictors. In this work, we first provide a graphical characterization of the identifiability of conditional causal queries. Next, we leverage this characterization together with a greedy search step to develop a polynomial-time algorithm for finding invariant predictors using the causal graph. Given the correct causal graph, our method is guaranteed to find at least one invariant predictor, if it exists. We show that our proposed algorithm can significantly reduce the run-time both in simulated and semi-synthetic data experiments and have predictive performance that is comparable to the existing work that runs in exponential time.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/lee23a.html
https://proceedings.mlr.press/v216/lee23a.htmlTowards better certified segmentation via diffusion modelsThe robustness of image segmentation has been an important research topic in the past few years as segmentation models have reached production-level accuracy. However, like classification models, segmentation models can be vulnerable to adversarial perturbations, which hinders their use in critical-decision systems like healthcare or autonomous driving. Recently, randomized smoothing has been proposed to certify segmentation predictions by adding Gaussian noise to the input to obtain theoretical guarantees. However, this method exhibits a trade-off between the amount of added noise and the level of certification achieved. In this paper, we address the problem of certifying segmentation prediction using a combination of randomized smoothing and diffusion models. Our experiments show that combining randomized smoothing and diffusion models significantly improves certified robustness, with results indicating a mean improvement of 21 points in accuracy compared to previous state-of-the-art methods on Pascal-Context and Cityscapes public datasets. Our method is independent of the selected segmentation model and does not need any additional specialized training procedure.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/laousy23a.html
https://proceedings.mlr.press/v216/laousy23a.htmlVariable importance matching for causal inferenceOur goal is to produce methods for observational causal inference that are auditable, easy to troubleshoot, yield accurate treatment effect estimates, and scalable to high-dimensional data. We describe a general framework called Model-to-Match that achieves these goals by (i) learning a distance metric via outcome modeling, (ii) creating matched groups using the distance metric, and (iii) using the matched groups to estimate treatment effects. Model-to-Match uses variable importance measurements to construct a distance metric, making it a flexible framework that can be adapted to various applications. Concentrating on the scalability of the problem in the number of potential confounders, we operationalize the Model-to-Match framework with LASSO. We derive performance guarantees for settings where LASSO outcome modeling consistently identifies all confounders (importantly without requiring the linear model to be correctly specified). We also provide experimental results demonstrating the auditability of matches, as well as extensions to more general nonparametric outcome modeling.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/lanners23a.html
https://proceedings.mlr.press/v216/lanners23a.htmlFixed-Budget Best-Arm Identification with Heterogeneous Reward VariancesWe study the problem of best-arm identification (BAI) in the fixed-budget setting with heterogeneous reward variances. We propose two variance-adaptive BAI algorithms for this setting: SHVar for known reward variances and SHAdaVar for unknown reward variances. Our algorithms rely on non-uniform budget allocations among the arms where the arms with higher reward variances are pulled more often than those with lower variances. The main algorithmic novelty is in the design of SHAdaVar, which allocates budget greedily based on overestimating the unknown reward variances. We bound probabilities of misidentifying the best arms in both SHVar and SHAdaVar. Our analyses rely on novel lower bounds on the number of pulls of an arm that do not require closed-form solutions to the budget allocation problem. Since one of our budget allocation problems is analogous to the optimal experiment design with unknown variances, we believe that our results are of a broad interest. Our experiments validate our theory, and show that SHVar and SHAdaVar outperform algorithms from prior works with analytical guarantees.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/lalitha23a.html
https://proceedings.mlr.press/v216/lalitha23a.htmlOptimal Budget Allocation for Crowdsourcing Labels for GraphsCrowdsourcing is an effective and efficient paradigm for obtaining labels for unlabeled corpus employing crowd workers. This work considers the budget allocation problem for a generalized setting on a graph of instances to be labeled where edges encode instance dependencies. Specifically, given a graph and a labeling budget, we propose an optimal policy to allocate the budget among the instances to maximize the overall labeling accuracy. We formulate the problem as a Bayesian Markov Decision Process (MDP), where we define our task as an optimization problem that maximizes the overall label accuracy under budget constraints. Then, we propose a novel stage-wise reward function that considers the effect of worker labels on the whole graph at each timestamp. This reward function is utilized to find an optimal policy for the optimization problem. Theoretically, we show that our proposed policies are consistent when the budget is infinite. We conduct extensive experiments on five real-world graph datasets and demonstrate the effectiveness of the proposed policies to achieve a higher label accuracy under budget constraints.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kulkarni23a.html
https://proceedings.mlr.press/v216/kulkarni23a.htmlDifferentially private synthetic data using KD-treesCreation of a synthetic dataset that faithfully represents the data distribution and simultaneously preserves privacy is a major research challenge. Many space partitioning based approaches have emerged in recent years for answering statistical queries in a differentially private manner. However, for synthetic data generation problem, recent research has been mainly focused on deep generative models. In contrast, we exploit space partitioning techniques together with noise perturbation and thus achieve intuitive and transparent algorithms. We propose both data independent and data dependent algorithms for $\epsilon$-differentially private synthetic data generation whose kernel density resembles that of the real dataset. Additionally, we provide theoretical results on the utility-privacy trade-offs and show how our data dependent approach overcomes the curse of dimensionality and leads to a scalable algorithm. We show empirical utility improvements over the prior work, and discuss performance of our algorithm on a downstream classification task on a real dataset.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kreacic23a.html
https://proceedings.mlr.press/v216/kreacic23a.htmlRisk-aware curriculum generation for heavy-tailed task distributionsAutomated curriculum generation for reinforcement learning (RL) aims to speed up learning by designing a sequence of tasks of increasing difficulty. Such tasks are usually drawn from probability distributions with exponentially bounded tails, such as uniform or Gaussian distributions. However, existing approaches overlook heavy-tailed distributions. Under such distributions, current methods may fail to learn optimal policies in rare and risky tasks, which fall under the tails and yield the lowest returns, respectively. We address this challenge by proposing a risk-aware curriculum generation algorithm that simultaneously creates two curricula: 1) a primary curriculum that aims to maximize the expected discounted return with respect to a distribution over target tasks, and an auxiliary curriculum that identifies and over-samples rare and risky tasks observed in the primary curriculum. Our empirical results evidence that the proposed algorithm achieves significantly higher returns in frequent as well as rare tasks compared to the state-of-the-art methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/koprulu23b.html
https://proceedings.mlr.press/v216/koprulu23b.htmlReward-machine-guided, self-paced reinforcement learningSelf-paced reinforcement learning (RL) aims to improve the data efficiency of learning by automatically creating sequences, namely curricula, of probability distributions over contexts. However, existing techniques for self-paced RL fail in long-horizon planning tasks that involve temporally extended behaviors. We hypothesize that taking advantage of prior knowledge about the underlying task structure can improve the effectiveness of self-paced RL. We develop a self-paced RL algorithm guided by reward machines, i.e., a type of finite-state machine that encodes the underlying task structure. The algorithm integrates reward machines in 1) the update of the policy and value functions obtained by any RL algorithm of choice, and 2) the update of the automated curriculum that generates context distributions. Our empirical results evidence that the proposed algorithm achieves optimal behavior reliably even in cases in which existing baselines cannot make any meaningful progress. It also decreases the curriculum length and reduces the variance in the curriculum generation process by up to one-fourth and four orders of magnitude, respectively.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/koprulu23a.html
https://proceedings.mlr.press/v216/koprulu23a.htmlMolecule Design by Latent Space Energy-Based Modeling and Gradual Distribution ShiftingGeneration of molecules with desired chemical and biological properties such as high drug-likeness, high binding affinity to target proteins, is critical for drug discovery. In this paper, we propose a probabilistic generative model to capture the joint distribution of molecules and their properties. Our model assumes an energy-based model (EBM) in the latent space. Conditional on the latent vector, the molecule and its properties are modeled by a molecule generation model and a property regression model respectively. To search for molecules with desired properties, we propose a sampling with gradual distribution shifting (SGDS) algorithm, so that after learning the model initially on the training data of existing molecules and their properties, the proposed algorithm gradually shifts the model distribution towards the region supported by molecules with desired values of properties. Our experiments show that our method achieves very strong performances on various molecule design tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kong23a.html
https://proceedings.mlr.press/v216/kong23a.htmlUniversal Graph Contrastive Learning with a Novel Laplacian PerturbationGraph Contrastive Learning (GCL) is an effective method for discovering meaningful patterns in graph data. By evaluating diverse augmentations of the graph, GCL learns discriminative representations and provides a flexible and scalable mechanism for various graph mining tasks. This paper proposes a novel contrastive learning framework by introducing Laplacian perturbation. The proposed framework offers a distinct advantage by employing an indirect perturbation method, which provides a more stable approach while maintaining the perturbation effects. Moreover, it exhibits a wide range of applicability by not being restricted to specific graph types. We demonstrate that a spectral graph convolution based on the Laplacian successfully extracts representations from diverse graph types. Our extensive experiments on a variety of real-world datasets, covering multiple graph types, show that the proposed model outperforms state-of-the-art baselines in both node classification and link sign prediction tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ko23a.html
https://proceedings.mlr.press/v216/ko23a.htmlCausal effect estimation from observational and interventional data through matrix weighted linear estimatorsWe study causal effect estimation from a mixture of observational and interventional data in a confounded linear regression model with multivariate treatments. We show that the statistical efficiency in terms of expected squared error can be improved by combining estimators arising from both the observational and interventional setting. To this end, we derive methods based on matrix weighted linear estimators and prove that our methods are asymptotically unbiased in the infinite sample limit. This is an important improvement compared to the pooled estimator using the union of interventional and observational data, for which the bias only vanishes if the ratio of observational to interventional data tends to zero. Studies on synthetic data confirm our theoretical findings. In settings where confounding is substantial and the ratio of observational to interventional data is large, our estimators outperform a Stein-type estimator and various other baselines.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kladny23a.html
https://proceedings.mlr.press/v216/kladny23a.htmlOn Identifiability of Conditional Causal EffectsWe address the problem of identifiability of an arbitrary <em>conditional causal effect</em> given both the causal graph and a set of any observational and/or interventional distributions of the form $Q[S]:=P(S|do(V\setminus S))$, where $V$ denotes the set of all observed variables and $S\subseteq V$. We call this problem conditional generalized identifiability (<b>c-gID</b> in short) and prove the completeness of Pearl’s $do$-calculus for the c-gID problem by providing sound and complete algorithm for the c-gID problem. This work revisited the c-gID problem in Lee et al. [2020], Correa et al. [2021] by adding explicitly the positivity assumption which is crucial for identifiability. It extends the results of [Lee et al., 2019, Kivva et al., 2022] on general identifiability (gID) which studied the problem for <em>unconditional</em> causal effects and Shpitser and Pearl [2006b] on identifiability of conditional causal effects given <em>merely</em> the observational distribution $P(\mathbf{V})$ as our algorithm generalizes the algorithms proposed in [Kivva et al., 2022] and [Shpitser and Pearl, 2006b].Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kivva23a.html
https://proceedings.mlr.press/v216/kivva23a.htmlPhase-shifted adversarial trainingAdversarial training (AT) has been considered an imperative component for safely deploying neural network-based applications. However, it typically comes with slow convergence and worse performance on clean samples (i.e., non-adversarial samples). In this work, we analyze the behavior of neural networks during learning with adversarial samples through the lens of response frequency. Interestingly, we observe that AT causes neural networks to converge slowly to high-frequency information, resulting in highly oscillatory predictions near each data point. To learn high-frequency content efficiently, we first prove that a universal phenomenon, the frequency principle (i.e., lower frequencies are learned first), still holds in AT. Building upon this theoretical foundation, we present a novel approach to AT, which we call phase-shifted adversarial training (PhaseAT). In PhaseAT, the high-frequency components, which are a contributing factor to slow convergence, are adaptively shifted into the low-frequency range where faster convergence occurs. For evaluation, we conduct extensive experiments on CIFAR-10 and ImageNet, using an adaptive attack that is carefully designed for reliable evaluation. Comprehensive results show that PhaseAT substantially improves convergence for high-frequency information, thereby leading to improved adversarial robustness.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kim23b.html
https://proceedings.mlr.press/v216/kim23b.htmlHow to use dropout correctly on residual networks with batch normalizationFor the stable optimization of deep neural networks, regularization methods such as dropout and batch normalization have been used in various tasks. Nevertheless, the correct position to apply dropout has rarely been discussed, and different positions have been employed depending on the practitioners. In this study, we investigate the correct position to apply dropout. We demonstrate that for a residual network with batch normalization, applying dropout at certain positions increases the performance, whereas applying dropout at other positions decreases the performance. Based on theoretical analysis, we provide the following guideline for the correct position to apply dropout: apply one dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and demonstrate them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments using different datasets and models.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kim23a.html
https://proceedings.mlr.press/v216/kim23a.htmlAssessing the Impact of Context Inference Error and Partial Observability on RL Methods for Just-In-Time Adaptive InterventionsJust-in-Time Adaptive Interventions (JITAIs) are a class of personalized health interventions developed within the behavioral science community. JITAIs aim to provide the right type and amount of support by iteratively selecting a sequence of intervention options from a pre-defined set of components in response to each individual’s time varying state. In this work, we explore the application of reinforcement learning methods to the problem of learning intervention option selection policies. We study the effect of context inference error and partial observability on the ability to learn effective policies. Our results show that the propagation of uncertainty from context inferences is critical to improving intervention efficacy as context uncertainty increases, while policy gradient algorithms can provide remarkable robustness to partially observed behavioral state information.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/karine23a.html
https://proceedings.mlr.press/v216/karine23a.htmlFed-LAMB: Layer-wise and Dimension-wise Locally Adaptive Federated LearningIn the emerging paradigm of Federated Learning (FL), large amount of clients such as mobile devices are used to train possibly high-dimensional models on their respective data. Combining (dimension-wise) adaptive gradient methods (e.g., Adam, AMSGrad) with FL has been an active direction, which is shown to outperform traditional SGD based FL in many cases. In this paper, we focus on the problem of training federated deep neural networks, and propose a novel FL framework which further introduces layer-wise adaptivity to the local model updates to accelerate the convergence of adaptive FL methods. Our framework includes two variants based on two recent locally adaptive federated learning algorithms. Theoretically, we provide a convergence analysis of our layer-wise FL methods, coined Fed-LAMB and Mime-LAMB, which match the convergence rate of state-of-the-art results in adaptive FL and exhibits linear speedup in terms of the number of workers. Experimental results on various datasets and models, under both IID and non-IID local data settings, show that both Fed-LAMB and Mime-LAMB achieve faster convergence speed and better generalization performance, compared to various recent adaptive FL methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/karimi23a.html
https://proceedings.mlr.press/v216/karimi23a.htmlHeavy-tailed linear bandit with Huber regressionLinear bandit algorithms have been extensively studied and have shown successful in sequential decision tasks despite their simplicity. Many algorithms however work under the assumption that the reward is the sum of linear function of observed contexts and a sub-Gaussian error. In practical applications, errors can be heavy-tailed, especially in financial data. In such reward environments, algorithms designed for sub-Gaussian error may underexplore, resulting in suboptimal regret. In this paper, we relax the reward assumption and propose a novel linear bandit algorithm which works well under heavy-tailed errors as well. The proposed algorithm utilizes Huber regression. When contexts are stochastic with positive definite covariance matrix and the $(1+\delta)$-th moment of the error is bounded by a constant, we show that the high-probability upper bound of the regret is $O(\sqrt{d}T^{\frac{1}{1+\delta}}(\log dT)^{\frac{\delta}{1+\delta}})$, where $d$ is the dimension of context variables, $T$ is the time horizon, and $\delta\in (0,1]$. This bound improves on the state-of-the-art regret bound of the Median of Means and Truncation algorithm by a factor of $\sqrt{\log T}$ and $\sqrt{d}$ for the case where the time horizon $T$ is unknown. We also remark that when $\delta=1$, the order is the same as the regret bound of linear bandit algorithms designed for sub-Gaussian errors. We support our theoretical findings with synthetic experiments.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kang23a.html
https://proceedings.mlr.press/v216/kang23a.htmlCausal Discovery with Hidden Confounders using the Algorithmic Markov ConditionCausal sufficiency is a cornerstone assumption in causal discovery. It is, however, both unlikely to hold in practice as well as unverifiable. When it does not hold, existing methods struggle to return meaningful results. In this paper, we show how to discover the causal network over both observed and unobserved variables. Moreover, we show that the causal model is identifiable in the sparse linear Gaussian case. More generally, we extend the algorithmic Markov condition to include latent confounders. We propose a consistent score based on the Minimum Description Length principle to discover the full causal network, including latent confounders. Based on this score, we develop an effective algorithm that finds those sets of nodes for which the addition of a confounding factor $Z$ is most beneficial, then fits a new causal network over both observed as well as inferred latent variables.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kaltenpoth23a.html
https://proceedings.mlr.press/v216/kaltenpoth23a.htmlNyström $M$-Hilbert-Schmidt independence criterionKernel techniques are among the most popular and powerful approaches of data science. Among the key features that make kernels ubiquitous are (i) the number of domains they have been designed for, (ii) the Hilbert structure of the function class associated to kernels facilitating their statistical analysis, and (iii) their ability to represent probability distributions without loss of information. These properties give rise to the immense success of Hilbert-Schmidt independence criterion (HSIC) which is able to capture joint independence of random variables under mild conditions, and permits closed-form estimators with quadratic computational complexity (w.r.t. the sample size). In order to alleviate the quadratic computational bottleneck in large-scale applications, multiple HSIC approximations have been proposed, however these estimators are restricted to $M=2$ random variables, do not extend naturally to the $M\ge 2$ case, and lack theoretical guarantees. In this work, we propose an alternative Nyström-based HSIC estimator which handles the $M\ge 2$ case, prove its consistency, and demonstrate its applicability in multiple contexts, including synthetic examples, dependency testing of media annotations, and causal discovery.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/kalinke23a.html
https://proceedings.mlr.press/v216/kalinke23a.htmlBayesian inference for vertex-series-parallel partial ordersPartial orders are a natural model for the social hierarchies that may constrain “queue-like” rank-order data. However, the computational cost of counting the linear extensions of a general partial order on a ground set with more than a few tens of elements is prohibitive. Vertex-series-parallel partial orders (VSPs) are a subclass of partial orders which admit rapid counting and represent the sorts of relations we expect to see in a social hierarchy. However, no Bayesian analysis of VSPs has been given to date. We construct a marginally consistent family of priors over VSPs with a parameter controlling the prior distribution over VSP depth. The prior for VSPs is given in closed form. We extend an existing observation model for queue-like rank-order data to represent noise in our data and carry out Bayesian inference on “Royal Acta” data and Formula 1 race data. Model comparison shows our model is a better fit to the data than Plackett-Luce mixtures, Mallows mixtures, and “bucket order” models and competitive with more complex models fitting general partial orders.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jiang23b.html
https://proceedings.mlr.press/v216/jiang23b.htmlMulti-view graph contrastive learning for solving vehicle routing problemsRecently, neural heuristics based on deep learning have reported encouraging results for solving vehicle routing problems (VRPs), especially on independent and identically distributed (i.i.d.) instances, e.g. <em>uniform</em>. However, in the presence of a distribution shift for the testing instances, their performance becomes considerably inferior. In this paper, we propose a multi-view graph contrastive learning (MVGCL) approach to enhance the generalization across different distributions, which exploits a graph pattern learner in a self-supervised fashion to facilitate a neural heuristic equipped with an active search scheme. Specifically, our MVGCL first leverages graph contrastive learning to extract transferable patterns from VRP graphs to attain the generalizable multi-view (i.e. node and graph) representation. Then it adopts the learnt node embedding and graph embedding to assist the neural heuristic and the active search (during inference) for route construction, respectively. Extensive experiments on randomly generated VRP instances of various distributions, and the ones from TSPLib and CVRPLib show that our MVGCL is superior to the baselines in boosting the cross-distribution generalization performance.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jiang23a.html
https://proceedings.mlr.press/v216/jiang23a.htmlContent Sharing Design for Social Welfare in Networked Disclosure GameThis work models the costs and benefits of personal information sharing, or self-disclosure, in online social networks as a networked disclosure game. In a networked population where edges represent visibility amongst users, we assume a leader can influence network structure through content promotion, and we seek to optimize social welfare through network design. Our approach considers user interaction non-homogeneously, where pairwise engagement amongst users can involve or not involve sharing personal information. We prove that this problem is NP-hard. As a solution, we develop a Mixed-integer Linear Programming algorithm, which can achieve an exact solution, and also develop a time-efficient heuristic algorithm that can be used at scale. We conduct numerical experiments to demonstrate the properties of the algorithms and map theoretical results to a dataset of posts and comments in 2020 and 2021 in a COVID-related Subreddit community where privacy risks and sharing tradeoffs were particularly pronounced.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jia23b.html
https://proceedings.mlr.press/v216/jia23b.htmlIncentivising Diffusion while Preserving Differential PrivacyDiffusion auction refers to an emerging paradigm of online marketplace where an auctioneer utilises a social network to attract potential buyers. Diffusion auction poses significant privacy risks. From the auction outcome, it is possible to infer hidden, and potentially sensitive, preferences of buyers. To mitigate such risks, we initiate the study of differential privacy (DP) in diffusion auction mechanisms. DP is a well-established notion of privacy that protects a system against inference attacks. Achieving DP in diffusion auctions is non-trivial as the well-designed auction rules are required to incentivise the buyers to truthfully report their neighbourhood. We study the single-unit case and design two differentially private diffusion mechanisms (DPDMs): recursive DPDM and layered DPDM. We prove that these mechanisms guarantee differential privacy, incentive compatibility and individual rationality for both valuations and neighbourhood. We then empirically compare their performance on real and synthetic datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jia23a.html
https://proceedings.mlr.press/v216/jia23a.htmlNoisy adversarial representation learning for effective and efficient image obfuscationRecent real-world applications of deep learning have led to the development of machine learning as a service (MLaaS). However, the scenario of client-server inference presents privacy concerns, where the server processes raw data sent from the user’s client device. One solution to this issue is to provide an obfuscator function to the client device using Adversarial Representation Learning (ARL). Prior works have primarily focused on the privacy-utility trade-off while overlooking the computational cost and memory burden on the client side. In this paper, we propose an effective and efficient ARL method that incorporates feature noise into the ARL pipeline. We evaluated our approach on various datasets, comparing it with state-of-the-art ARL techniques. Our experimental results indicate that our method achieves better accuracy, lower computation and memory overheads, and improved resistance to information leakage and reconstruction attacks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jeong23a.html
https://proceedings.mlr.press/v216/jeong23a.htmlRobust statistical comparison of random variables with locally varying scale of measurementSpaces with locally varying scale of measurement, like multidimensional structures with differently scaled dimensions, are pretty common in statistics and machine learning. Nevertheless, it is still understood as an open question how to exploit the entire information encoded in them properly. We address this problem by considering an order based on (sets of) expectations of random variables mapping into such non-standard spaces. This order contains stochastic dominance and expectation order as extreme cases when no, or respectively perfect, cardinal structure is given. We derive a (regularized) statistical test for our proposed generalized stochastic dominance (GSD) order, operationalize it by linear optimization, and robustify it by imprecise probability models. Our findings are illustrated with data from multidimensional poverty measurement, finance, and medicine.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jansen23a.html
https://proceedings.mlr.press/v216/jansen23a.htmlInvestigating a Generalization of Probabilistic Material Implication and Bayesian ConditionalsProbabilistic "if A then B" rules are typically formalized as Bayesian conditionals P(B|A), as many (e.g., Pearl) have argued that Bayesian conditionals are the correct way to think about such rules. However, there are challenges with standard inferences such as modus ponens and modus tollens that might make probabilistic material implication a better candidate at times for rule-based systems employing forward-chaining; and arguably material implication is still suitable when information about prior or conditional probabilities is not available at all. We investigate a generalization of probabilistic material implication and Bayesian conditionals that combines the advantages of both formalisms in a systematic way and prove basic properties of the generalized rule, in particular, for inference chains in graphs.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jahn23a.html
https://proceedings.mlr.press/v216/jahn23a.htmlPosterior sampling-based online learning for the stochastic shortest path modelWe consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes if either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any of the state-action pairs is doubled. We establish a Bayesian regret bound of $\tilde{\mathcal{O}}(B_{\ast} S\sqrt{AK})$, where $B_{\ast}$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm only requires the knowledge of the prior distribution, and has no hyper-parameters to tune. It is the first such posterior sampling algorithm and outperforms numerically previously proposed optimism-based algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/jafarnia-jahromi23a.html
https://proceedings.mlr.press/v216/jafarnia-jahromi23a.htmlA near-optimal high-probability swap-Regret upper bound for multi-agent bandits in unknown general-sum gamesIn this paper, we study a multi-agent bandit problem in an unknown general-sum game repeated for a number of rounds (i.e., learning in a black-box game with bandit feedback), where a set of agents have no information about the underlying game structure and cannot observe each other’s actions and rewards. In each round, each agent needs to play an arm (i.e., action) from a (possibly different) arm set (i.e., action set), and only receives the reward of the played arm that is affected by other agents’ actions. The objective of each agent is to minimize her own cumulative swap regret, where the swap regret is a generic performance measure for online learning algorithms. We are the first to give a near-optimal high-probability swap-regret upper bound based on a refined martingale analysis for the exponential-weighting-based algorithms with the implicit exploration technique, which can further bound the expected swap regret instead of the pseudo-regret studied in the literature. It is also guaranteed that correlated equilibria can be achieved in a polynomial number of rounds if the algorithm is played by all agents. Furthermore, we conduct numerical experiments to verify the performance of the studied algorithm.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/huang23b.html
https://proceedings.mlr.press/v216/huang23b.htmlASTRA: Understanding the practical impact of robustness for probabilistic programsWe present the first systematic study of effectiveness of robustness transformations on a diverse set of 24 probabilistic programs representing generalized linear models, mixture models, and time-series models. We evaluate five robustness transformations from literature on each model. We quantify and present insights on (1) the improvement of the posterior prediction accuracy and (2) the execution time overhead of the robustified programs, in the presence of three input noise models. To automate the evaluation of various robustness transformations, we developed ASTRA - a novel framework for quantifying the robustness of probabilistic programs and exploring the trade-offs between robustness and execution time. Our experimental results indicate that the existing transformations are often suitable only for specific noise models, can significantly increase execution time, and have non-trivial interaction with the inference algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/huang23a.html
https://proceedings.mlr.press/v216/huang23a.htmlOptimistic Thompson Sampling-based algorithms for episodic reinforcement learningWe propose two Thompson Sampling-like, model-based learning algorithms for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea for the original O-TS is to clip the posterior distribution in an optimistic way to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a $\widetilde{O} \left(\sqrt{AS^2H^4T} \right)$ regret bound, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the number of time-steps per episode and $T$ is the number of episodes. Our second algorithm, Optimistic Thompson Sampling plus for MDPs (O-TS-MDP$^+$), achieves the (near)-optimal $\widetilde{O} \left(\sqrt{ASH^3T} \right)$ regret bound by taking a more aggressive clipping strategy. Since O-TS was only empirically studied previously, we derive regret bounds of O-TS for stochastic bandits. In addition, we propose, O-TS-Bandit$^+$, a randomized version of UCB1 [Auer et al., 2002], for stochastic bandits. Both O-TS and O-TS-Bandit$^+$ achieve the optimal $O\left(\frac{A\ln(T)}{\Delta} \right)$ problem-dependent regret bound, where $\Delta$ denotes the sub-optimality gap.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/hu23a.html
https://proceedings.mlr.press/v216/hu23a.htmlIncreasing effect sizes of pairwise conditional independence tests between random vectorsA simple approach to test for conditional independence of two random vectors given a third random vector is to simultaneously test for conditional independence of every pair of components of the two random vectors given the third random vector. In this work, we show that conditioning on additional components of the two random vectors that are independent given the third one increases the tests’ effect sizes while leaving the validity of the overall approach unchanged. We leverage this result to derive a practical pairwise testing algorithm that first chooses tests with a relatively large effect size and then does the actual testing. We show both numerically and theoretically that our algorithm outperforms standard pairwise independence testing and other existing methods if the dependence within the two random vectors is sufficiently high.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/hochsprung23a.html
https://proceedings.mlr.press/v216/hochsprung23a.htmlMassively parallel reweighted wake-sleepReweighted wake-sleep (RWS) is a machine learning method for performing Bayesian inference in a very general class of models. RWS draws $K$ samples from an underlying approximate posterior, then uses importance weighting to provide a better estimate of the true posterior. RWS then updates its approximate posterior towards the importance-weighted estimate of the true posterior. However, recent work [Chattergee and Diaconis, 2018] indicates that the number of samples required for effective importance weighting is exponential in the number of latent variables. Attaining such a large number of importance samples is intractable in all but the smallest models. Here, we develop massively parallel RWS, which circumvents this issue by drawing $K$ samples of all $n$ latent variables, and individually reasoning about all $K^n$ possible combinations of samples. While reasoning about $K^n$ combinations might seem intractable, the required computations can be performed in polynomial time by exploiting conditional independencies in the generative model. We show considerable improvements over standard “global” RWS, which draws $K$ samples from the full joint.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/heap23a.html
https://proceedings.mlr.press/v216/heap23a.htmlScalable and robust tensor ring decomposition for large-scale dataTensor ring (TR) decomposition has recently received increased attention due to its superior expressive performance for high-order tensors. However, the applicability of traditional TR decomposition algorithms to real-world applications is hindered by prevalent large data sizes, missing entries, and corruption with outliers. In this work, we propose a scalable and robust TR decomposition algorithm capable of handling large-scale tensor data with missing entries and gross corruptions. We first develop a novel auto-weighted steepest descent method that can adaptively fill the missing entries and identify the outliers during the decomposition process. Further, taking advantage of the tensor ring model, we develop a novel fast Gram matrix computation (FGMC) approach and a randomized subtensor sketching (RStS) strategy which yield significant reduction in storage and computational complexity. Experimental results demonstrate that the proposed method outperforms existing TR decomposition methods in the presence of outliers, and runs significantly faster than existing robust tensor completion algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/he23b.html
https://proceedings.mlr.press/v216/he23b.htmlLoosely consistent emphatic temporal-difference learningThere has been significant interest in searching for off-policy Temporal-Difference (TD) algorithms that find the same solution that would have been obtained in the on-policy regime. An important property of such algorithms is that their expected update has the same fixed point as that of On-policy TD($\lambda$), which we call <em>loose consistency</em>. Notably, Full-IS-TD($\lambda$) is the only existing loosely consistent method under general linear function approximation but, unfortunately, has a high variance and is scarcely practical. This notorious high variance issue motivates the introduction of ETD($\lambda$), which tames down the variance but has a biased fixed point. Inspired by these two methods, we propose a new loosely consistent algorithm called <em>Average Emphatic TD</em> (AETD($\lambda$)) with a transient bias, which strikes a balance between bias and variance. Further, we unify AETD($\lambda$) with existing methods and obtain a new family of loosely consistent algorithms called <em>Loosely Consistent Emphatic TD</em> (LC-ETD($\lambda$, $\beta$, $\nu$)), which can control a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through experiments on illustrative examples, we show the effectiveness and practicality of LC-ETD($\lambda$, $\beta$, $\nu$).Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/he23a.html
https://proceedings.mlr.press/v216/he23a.htmlInference and sampling of point processes from diffusion excursionsPoint processes often have a natural interpretation with respect to a continuous process. We propose a point process construction that describes arrival time observations in terms of the state of a latent diffusion process. In this framework, we relate the return times of a diffusion in a continuous path space to new arrivals of the point process. This leads to a continuous sample path that is used to describe the underlying mechanism generating the arrival distribution. These models arise in many disciplines, such as financial settings where actions in a market are determined by a hidden continuous price or in neuroscience where a latent stimulus generates spike trains. Based on the developments in Itô’s excursion theory, we propose methods for inferring and sampling from the point process derived from the latent diffusion process. We illustrate the approach with numerical examples using both simulated and real data. The proposed methods and framework provide a basis for interpreting point processes through the lens of diffusions.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/hasan23a.html
https://proceedings.mlr.press/v216/hasan23a.htmlOn inference and learning with probabilistic generating circuitsProbabilistic generating circuits (PGCs) are economical representations of multivariate probability generating polynomials (PGPs). They unify and extend decomposable probabilistic circuits and determinantal point processes, admitting tractable computation of marginal probabilities. However, the need for addition and multiplication of high-degree polynomials incurs a significant additional factor in the complexity of inference. Here, we give a new inference algorithm that eliminates this extra factor. Specifically, we show that it suffices to keep track of the highest degree coefficients of the computed polynomials, rendering the algorithm linear in the circuit size. In addition, we show that determinant-based circuits need not be expanded to division-free circuits, but can be handled by division-based fast algorithms. While these advances enhance the appeal of PGCs, we also discover an obstacle to learning them from data: it is NP-hard to recognize whether a given PGC encodes a PGP. We discuss the implications of our ambivalent findings and sketch a method, in which learning is restricted to PGCs that are composed of moderate-size subcircuits.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/harviainen23b.html
https://proceedings.mlr.press/v216/harviainen23b.htmlRevisiting Bayesian network learning with small vertex coverThe problem of structure learning in Bayesian networks asks for a directed acyclic graph (DAG) that maximizes a given scoring function. Since the problem is NP-hard, research effort has been put into discovering restricted classes of DAGs for which the search problem can be solved in polynomial time. Here, we initiate investigation of questions that have received less attention thus far: Are the known polynomial algorithms close to the best possible, or is there room for significant improvements? If the interest is in Bayesian learning, that is, in sampling or weighted counting of DAGs, can we obtain similar complexity results? Focusing on DAGs with bounded vertex cover number—a class studied in Korhonen and Parviainen’s seminal work (NIPS 2015)—we answer the questions in the affirmative. We also give, apparently the first, proof that the counting problem is $#$P-hard in general. In addition, we show that under the vertex-cover constraint counting is $#$W[1]-hard.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/harviainen23a.html
https://proceedings.mlr.press/v216/harviainen23a.htmlOn the Convergence of Continual Learning with Adaptive MethodsOne of the objectives of continual learning is to prevent catastrophic forgetting in learning multiple tasks sequentially, and the existing solutions have been driven by the conceptualization of the plasticity-stability dilemma. However, the convergence of continual learning for each sequential task is less studied so far. In this paper, we provide a convergence analysis of memory-based continual learning with stochastic gradient descent and empirical evidence that training current tasks causes the cumulative degradation of previous tasks. We propose an adaptive method for nonconvex continual learning (NCCL), which adjusts step sizes of both previous and current tasks with the gradients. The proposed method can achieve the same convergence rate as the SGD method when the catastrophic forgetting term which we define in the paper is suppressed at each iteration. Further, we demonstrate that the proposed algorithm improves the performance of continual learning over existing methods for several image classification tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/han23a.html
https://proceedings.mlr.press/v216/han23a.htmlDifferentiable user modelsProbabilistic user modeling is essential for building machine learning systems in the ubiquitous cases with humans in the loop. However, modern advanced user models, often designed as cognitive behavior simulators, are incompatible with modern machine learning pipelines and computationally prohibitive for most practical applications. We address this problem by introducing widely-applicable differentiable surrogates for bypassing this computational bottleneck; the surrogates enable computationally efficient inference with modern cognitive models. We show experimentally that modeling capabilities comparable to the only available solution, existing likelihood-free inference methods, are achievable with a computational cost suitable for online applications. Finally, we demonstrate how AI-assistants can now use cognitive models for online interaction in a menu-search task, which has so far required hours of computation during interaction.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/hamalainen23a.html
https://proceedings.mlr.press/v216/hamalainen23a.htmlInterpretable differencing of machine learning modelsUnderstanding the differences between machine learning (ML) models is of interest in scenarios ranging from choosing amongst a set of competing models, to updating a deployed model with new training data. In these cases, we wish to go beyond differences in overall metrics such as accuracy to identify where in the feature space do the differences occur. We formalize this problem of model differencing as one of predicting a dissimilarity function of two ML models’ outputs, subject to the representation of the differences being human-interpretable. Our solution is to learn a Joint Surrogate Tree (JST), which is composed of two conjoined decision tree surrogates for the two models. A JST provides an intuitive representation of differences and places the changes in the context of the models’ decision logic. Context is important as it helps users to map differences to an underlying mental model of an AI system. We also propose a refinement procedure to increase the precision of a JST. We demonstrate, through an empirical evaluation, that such contextual differencing is concise and can be achieved with no loss in fidelity over naive approaches.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/haldar23a.html
https://proceedings.mlr.press/v216/haldar23a.htmlSufficient identification conditions and semiparametric estimation under missing not at random mechanismsConducting valid statistical analyses is challenging in the presence of missing-not-at-random (MNAR) data, where the missingness mechanism is dependent on the missing values themselves even conditioned on the observed data. Here, we consider a MNAR model that generalizes several prior popular MNAR models in two ways: first, it is less restrictive in terms of statistical independence assumptions imposed on the underlying joint data distribution, and second, it allows for all variables in the observed sample to have missing values. This MNAR model corresponds to a so-called criss-cross structure considered in the literature on graphical models of missing data that prevents nonparametric identification of the entire missing data model. Nonetheless, part of the complete-data distribution remains nonparametrically identifiable. By exploiting this fact and considering a rich class of exponential family distributions, we establish sufficient conditions for identification of the complete-data distribution as well as the entire missingness mechanism. We then propose methods for testing the independence restrictions encoded in such models using odds ratio as our parameter of interest. We adopt two semiparametric approaches for estimating the odds ratio parameter and establish the corresponding asymptotic theories: one involves maximizing a conditional likelihood with order statistics and the other uses estimating equations. The utility of our methods is illustrated via simulation studies.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/guo23a.html
https://proceedings.mlr.press/v216/guo23a.htmlCausal Discovery for time series from multiple datasets with latent contextsCausal discovery from time series data is a typical problem setting across the sciences. Often, multiple datasets of the same system variables are available, for instance, time series of river runoff from different catchments. The local catchment systems then share certain causal parents, such as time-dependent large-scale weather over all catchments, but differ in other catchment-specific drivers, such as the altitude of the catchment. These drivers can be called temporal and spatial contexts, respectively, and are often partially unobserved. Pooling the datasets and considering the joint causal graph among system, context, and certain auxiliary variables enables us to overcome such latent confounding of system variables. In this work, we present a non-parametric time series causal discovery method, <em>J(oint)-PCMCI$^+$</em>, that efficiently learns such joint causal time series graphs when both observed and latent contexts are present, including time lags. We present asymptotic consistency results and numerical experiments demonstrating the utility and limitations of the method.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gunther23a.html
https://proceedings.mlr.press/v216/gunther23a.htmlFunctional causal Bayesian optimizationWe propose functional causal Bayesian optimization (fCBO), a method for finding interventions that optimize a target variable in a known causal graph. fCBO extends the CBO family of methods to enable functional interventions, which set a variable to be a deterministic function of other variables in the graph. fCBO models the unknown objectives with Gaussian processes whose inputs are defined in a reproducing kernel Hilbert space, thus allowing to compute distances among vector-valued functions. In turn, this enables to sequentially select functions to explore by maximizing an expected improvement acquisition functional while keeping the typical computational tractability of standard BO settings. We introduce graphical criteria that establish when considering functional interventions allows attaining better target effects, and conditions under which selected interventions are also optimal for conditional target effects. We demonstrate the benefits of the method in a synthetic and in a real-world causal graph.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gultchin23a.html
https://proceedings.mlr.press/v216/gultchin23a.htmlConcurrent Misclassification and Out-of-Distribution Detection for Semantic Segmentation via Energy-Based Normalizing FlowRecent semantic segmentation models accurately classify test-time examples that are similar to a training dataset distribution. However, their discriminative closed-set approach is not robust in practical data setups with distributional shifts and out-of-distribution (OOD) classes. As a result, the predicted probabilities can be very imprecise when used as confidence scores at test time. To address this, we propose a generative model for concurrent in-distribution misclassification (IDM) and OOD detection that relies on a normalizing flow framework. The proposed flow-based detector with an energy-based inputs (FlowEneDet) can extend previously deployed segmentation models without their time-consuming retraining. Our FlowEneDet results in a low-complexity architecture with marginal increase in the memory footprint. FlowEneDet achieves promising results on Cityscapes, Cityscapes-C, FishyScapes and SegmentMeIfYouCan benchmarks in IDM/OOD detection when applied to pretrained DeepLabV3+ and SegFormer semantic segmentation models.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gudovskiy23a.html
https://proceedings.mlr.press/v216/gudovskiy23a.htmlStochastic Graphical Bandits with Heavy-Tailed RewardsWe consider stochastic graphical bandits, where after pulling an arm, the decision maker observes rewards of not only the chosen arm but also its neighbors in a feedback graph. Most of existing work assumes that the rewards are drawn from bounded or at least sub-Gaussian distributions, which however may be violated in many practical scenarios such as social advertising and financial markets. To settle this issue, we investigate stochastic graphical bandits with heavy-tailed rewards, where the distributions have finite moments of order $1+\epsilon$, for some $\epsilon\in(0, 1]$. Firstly, we develop one UCB-type algorithm, whose expected regret is upper bounded by a sum of gap-based quantities over the <em>clique covering</em> of the feedback graph. The key idea is to estimate the reward means of the selected arm’s neighbors by more refined robust estimators, and to construct a graph-based upper confidence bound for selecting candidates. Secondly, we design another elimination-based strategy and improve the regret bound to a gap-based sum with size controlled by the <em>independence number</em> of the feedback graph. For benign graphs, the <em>independence number</em> could be smaller than the size of the <em>clique covering</em>, resulting in tighter regret bounds. Finally, we conduct experiments on synthetic data to demonstrate the effectiveness of our methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gou23a.html
https://proceedings.mlr.press/v216/gou23a.htmlA scalable Walsh-Hadamard regularizer to overcome the low-degree spectral bias of neural networksDespite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards “simpler” functions. Various notions of simplicity have been introduced to characterize this behavior. Here, we focus on the case of neural networks with discrete (zero-one) inputs through the lens of their Fourier (Walsh-Hadamard) transforms, where the notion of simplicity can be captured through the degree of the Fourier coefficients. We empirically show that neural networks have a tendency to learn lower-degree frequencies. We show how this spectral bias towards simpler features can in fact hurt the neural network’s generalization on real-world datasets. To remedy this we propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies. Our regularizer also helps avoid erroneous identification of low-degree frequencies, which further improves generalization. We extensively evaluate our regularizer on synthetic datasets to gain insights into its behavior. Finally, we show significantly improved generalization on four different datasets compared to standard neural networks and other relevant baselines.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gorji23a.html
https://proceedings.mlr.press/v216/gorji23a.htmlPiecewise Deterministic Markov Processes for Bayesian Neural NetworksInference on modern Bayesian Neural Networks (BNNs) often relies on a variational inference treatment, imposing violated assumptions of independence and the form of the posterior. Traditional MCMC approaches avoid these assumptions at the cost of increased computation due to its incompatibility to subsampling of the likelihood. New Piecewise Deterministic Markov Process (PDMP) samplers permit subsampling, though introduce a model-specific inhomogenous Poisson Process (IPPs) which is difficult to sample from. This work introduces a new generic and adaptive thinning scheme for sampling from these IPPs, and demonstrates how this approach can accelerate the application of PDMPs for inference in BNNs. Experimentation illustrates how inference with these methods is computationally feasible, can improve predictive accuracy, MCMC mixing performance, and provide informative uncertainty measurements when compared against other approximate inference schemes.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/goan23a.html
https://proceedings.mlr.press/v216/goan23a.htmlVacant holes for unsupervised detection of the outliers in compact latent representationDetection of the outliers is pivotal for any machine learning model deployed and operated in real-world. It is essential for the Deep Neural Networks that were shown to be overconfident with such inputs. Moreover, even deep generative models that allow estimation of the probability density of the input fail in achieving this task. In this work, we concentrate on the specific type of these models: Variational Autoencoders (VAEs). First, we unveil a significant theoretical flaw in the assumption of the classical VAE model. Second, we enforce an accommodating topological property to the image of the deep neural mapping to the latent space: compactness to alleviate the flaw and obtain the means to provably bound the image within the determined limits by squeezing both inliers and outliers together. We enforce compactness using two approaches: $(i)$ Alexandroff extension and $(ii)$ fixed Lipschitz continuity constant on the mapping of the encoder of the VAEs. Finally and most importantly, we discover that the anomalous inputs predominantly tend to land on the vacant latent holes within the compact space, enabling their successful identification. For that reason, we introduce a specifically devised score for hole detection and evaluate the solution against several baseline benchmarks achieving promising results.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/glazunov23a.html
https://proceedings.mlr.press/v216/glazunov23a.htmlFast and scalable score-based kernel calibration testsWe introduce the Kernel Calibration Conditional Stein Discrepancy test (KCCSD test), a nonparametric, kernel-based test for assessing the calibration of probabilistic models with well-defined scores. In contrast to previous methods, our test avoids the need for possibly expensive expectation approximations while providing control over its type-I error. We achieve these improvements by using a new family of kernels for score-based probabilities that can be estimated without probability density samples, and by using a Conditional Goodness of Fit criterion for the KCCSD test’s U-statistic. We demonstrate the properties of our test on various synthetic settings.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/glaser23a.html
https://proceedings.mlr.press/v216/glaser23a.htmlProbabilistically robust conformal predictionConformal prediction (CP) is a framework to quantify uncertainty of machine learning classifiers including deep neural networks. Given a testing example and a trained classifier, CP produces a prediction set of candidate labels with a user-specified coverage (i.e., true class label is contained with high probability). Almost all the existing work on CP assumes clean testing data and there is not much known about the robustness of CP algorithms w.r.t natural/adversarial perturbations to testing examples. This paper studies the problem of probabilistically robust conformal prediction (PRCP) which ensures robustness to most perturbations around clean input examples. PRCP generalizes the standard CP (cannot handle perturbations) and adversarially robust CP (ensures robustness w.r.t worst-case perturbations) to achieve better trade-offs between nominal performance and robustness. We propose a novel adaptive PRCP (aPRCP) algorithm to achieve probabilistically robust coverage. The key idea behind aPRCP is to determine two parallel thresholds, one for data samples and another one for the perturbations on data (aka "quantile-of-quantile” design). We provide theoretical analysis to show that aPRCP algorithm achieves robust coverage. Our experiments on CIFAR-10, CIFAR-100, and ImageNet datasets using deep neural networks demonstrate that aPRCP achieves better trade-offs than state-of-the-art CP and adversarially robust CP algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ghosh23a.html
https://proceedings.mlr.press/v216/ghosh23a.htmlCopula-based deep survival models for dependent censoringA survival dataset describes a set of instances (e.g. patients) and provides, for each, either the time until an event (e.g. death), or the censoring time (e.g. when lost to follow-up - which is a lower bound on the time until the event). We consider the challenge of survival prediction: learning, from such data, a predictive model that can produce an individual survival distribution for a novel instance. Many contemporary methods of survival prediction implicitly assume that the event and censoring distributions are independent conditional on the instance’s covariates - a strong assumption that is difficult to verify (as we observe only one outcome for each instance) and which can induce significant bias when it does not hold. This paper presents a parametric model of survival that extends modern non-linear survival analysis by relaxing the assumption of conditional independence. On synthetic and semi-synthetic data, our approach significantly improves estimates of survival distributions compared to the standard that assumes conditional independence in the data.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gharari23a.html
https://proceedings.mlr.press/v216/gharari23a.htmlQuasi-Bayesian nonparametric density estimation via autoregressive predictive updatesBayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior. %, which serves to counteract overfitting. In the context of density estimation, the standard nonparametric Bayesian approach is to target the posterior predictive of the Dirichlet process mixture model. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of quasi-Bayesian predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they struggle on non-smooth data distributions. This is due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and a Gaussian process prior. While the predictive update of such a model is typically intractable, we derive a quasi-Bayesian update that achieves state-of-the-art results in small-data regimes.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ghalebikesabi23a.html
https://proceedings.mlr.press/v216/ghalebikesabi23a.htmlA Data-Driven State Aggregation Approach for Dynamic Discrete Choice ModelsIn dynamic discrete choice models, a commonly studied problem is estimating parameters of agent reward functions (also known as ’structural’ parameters) using agent behavioral data. This task is also known as inverse reinforcement learning. Maximum likelihood estimation for such models requires dynamic programming, which is limited by the curse of dimensionality [Bellman, 1957]. In this work, we present a novel algorithm that provides a data-driven method for selecting and aggregating states, which lowers the computational and sample complexity of estimation. Our method works in two stages. First, we estimate agent Qfunctions, and leverage them alongside a clustering algorithm to select a subset of states that are most pivotal for driving changes in Q-functions. Second, with these selected "aggregated" states, we conduct maximum likelihood estimation using a popular nested fixed-point algorithm [Rust, 1987]. The proposed two-stage approach mitigates the curse of dimensionality by reducing the problem dimension. Theoretically, we derive finite-sample bounds on the associated estimation error, which also characterize the trade-off of computational complexity, estimation error, and sample complexity. We demonstrate the empirical performance of the algorithm in two classic dynamic discrete choice estimation applications.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/geng23a.html
https://proceedings.mlr.press/v216/geng23a.htmlIn- or out-of-distribution detection via dual divergence estimationDetecting out-of-distribution (OOD) samples is a problem of practical importance for a reliable use of deep neural networks (DNNs) in production settings. The corollary to this problem is the detection in-distribution (ID) samples, which is applicable to domain adaptation scenarios for augmenting a train set with ID samples from other data sets, or to continual learning for replay from the past. For both ID or OOD detection, we propose a principled yet simple approach of (empirically) estimating KL-Divergence, in its dual form, for a given test set w.r.t. a known set of ID samples in order to quantify the contribution of each test sample individually towards the divergence measure and accordingly detect it as OOD or ID. Our approach is compute-efficient and enjoys strong theoretical guarantees. For WideResnet101 and ViT-L-16, by considering ImageNet-1k dataset as the ID benchmark, we evaluate the proposed OOD detector on 51 test (OOD) datasets, and observe drastically and consistently lower false positive rates w.r.t. all the competitive methods. Moreover, the proposed ID detector is evaluated, using ECG and stock price datasets, for the task of data augmentation in domain adaptation and continual learning settings, and we observe higher efficacy compared to relevant baselines.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/garg23b.html
https://proceedings.mlr.press/v216/garg23b.htmlInformation theoretic clustering via divergence maximization among clustersInformation-theoretic clustering is one of the most promising and principled approaches to finding clusters with minimal apriori assumptions. The key criterion therein is to maximize the mutual information between the data points and their cluster labels. Such an approach, however, does not explicitly promote any type of inter-cluster behavior. We instead propose to maximize the Kullback-Leibler divergence between the underlying data distributions associated to clusters (referred to as cluster distributions). We show it to entail the mutual information criterion along with maximizing cross entropy between the cluster distributions. For practical efficiency, we propose to empirically estimate the objective of KL-D between clusters in its dual form leveraging deep neural nets as a dual function approximator. Remarkably, our theoretical analysis establishes that estimating the divergence measure in its dual form simplifies the problem of clustering to one of optimally finding k-1 cut points for k clusters in the 1-D dual functional space. Overall, our approach enables linear-time clustering algorithms with theoretical guarantees of near-optimality, owing to the submodularity of the objective. We show the empirical superiority of our approach w.r.t. current state-of-the-art methods on the challenging task of clustering noisy timeseries as observed in domains such as neuroscience, healthcare, financial markets, spatio-temporal environmental dynamics, etc.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/garg23a.html
https://proceedings.mlr.press/v216/garg23a.htmlTime-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and PredictionWhen perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when an object is completely occluded from certain viewpoints. Meanwhile, humans are able to imagine novel views after observing multiple viewpoints. Recent remarkable advances in multi-view object-centric learning still leaves some unresolved problems: 1) The shapes of partially or completely occluded objects can not be well reconstructed. 2) The novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views, where the latent representations of time-conditioned views are jointly inferred with a Transformer and then are input to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors of view latent variables for video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can make object-centric video decomposition, reconstruct the complete shapes of occluded objects, and make novel-view predictions.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/gao23a.html
https://proceedings.mlr.press/v216/gao23a.htmlDoes Momentum Help in Stochastic Optimization? A Sample Complexity Analysis.Stochastic Heavy Ball (SHB) and Nesterov’s Accelerated Stochastic Gradient (ASG) are popular momentum methods in optimization. While the benefits of these acceleration ideas in deterministic settings are well understood, their advantages in stochastic optimization are unclear. Several works have recently claimed that SHB and ASG always help in stochastic optimization. Our work shows that i.) these claims are either flawed or one-sided (e.g., consider only the bias term but not the variance), and ii.) when both these terms are accounted for, SHB and ASG do not always help. Specifically, for <em>any</em> quadratic optimization, we obtain a lower bound on the sample complexity of SHB and ASG, accounting for both bias and variance, and show that the vanilla SGD can achieve the same bound.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ganesh23a.html
https://proceedings.mlr.press/v216/ganesh23a.htmlOn the role of model uncertainties in Bayesian optimisationBayesian Optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and acquisition functions, and compare them across both synthetic and real-world experiments. Our results show that Gaussian Processes, and more surprisingly, Deep Ensembles are strong surrogate models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of surrogate model in the analysis. We also study the effect of recalibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/foldager23a.html
https://proceedings.mlr.press/v216/foldager23a.htmlLogit-based ensemble distribution distillation for robust autoregressive sequence uncertaintiesEfficiently and reliably estimating uncertainty is an important objective in deep learning. It is especially pertinent to autoregressive sequence tasks, where training and inference costs are typically very high. However, existing research has predominantly focused on tasks with static data such as image classification. In this work, we investigate Ensemble Distribution Distillation (EDD) applied to large-scale natural language sequence-to-sequence data. EDD aims to compress the superior uncertainty performance of an expensive (teacher) ensemble into a cheaper (student) single model. Importantly, the ability to separate knowledge (epistemic) and data (aleatoric) uncertainty is retained. Existing probability-space approaches to EDD, however, are difficult to scale to large vocabularies. We show, for modern transformer architectures on large-scale translation tasks, that modelling the ensemble <em>logits</em>, instead of softmax probabilities, leads to significantly better students. Moreover, the students surprisingly even <em>outperform Deep Ensembles</em> by up to $\sim$10% AUROC on out-of-distribution detection, whilst matching them at in-distribution translation.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/fathullah23a.html
https://proceedings.mlr.press/v216/fathullah23a.htmlGenerating Synthetic Datasets by Interpolating along Generalized GeodesicsData for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain —where the model will ultimately be used— is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as ‘combinations’ of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and —notably— can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate this is a promising new approach for targeted on-demand dataset synthesis.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/fan23b.html
https://proceedings.mlr.press/v216/fan23b.htmlPersonalized federated domain adaptation for item-to-item recommendationItem-to-Item (I2I) recommendation is an important function that suggests replacement or complement options for an item based on their functional similarities or synergies. To capture such item relationships effectively, the recommenders need to understand why subsets of items are co-viewed or co-purchased by the customers. Graph-based models, such as graph neural networks (GNNs), provide a natural framework to combine, ingest and extract valuable insights from such high-order item relationships. However, learning GNNs effectively for I2I requires ingesting a large amount of relational data, which might not always be available, especially in new, emerging market segments. To mitigate this data bottleneck, we postulate that recommendation patterns learned from existing market segments (with private data) could be adapted to build effective warm-start models for emerging ones. To achieve this, we introduce a personalized graph adaptation model based on GNNs to summarize, assemble and adapt recommendation patterns across market segments with heterogeneous customer behaviors into effective local models.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/fan23a.html
https://proceedings.mlr.press/v216/fan23a.htmlDeep Gaussian mixture ensemblesThis work introduces a novel probabilistic deep learning technique called deep Gaussian mixture ensembles (DGMEs), which enables accurate quantification of both epistemic and aleatoric uncertainty. By assuming the data generating process follows that of a Gaussian mixture, DGMEs are capable of approximating complex probability distributions, such as heavy-tailed or multimodal distributions. Our contributions include the derivation of an expectation-maximization (EM) algorithm used for learning the model parameters, which results in an upper-bound on the log-likelihood of training data over that of standard deep ensembles. Additionally, the proposed EM training procedure allows for learning of mixture weights, which is not commonly done in ensembles. Our experimental results demonstrate that DGMEs outperform state-of-the-art uncertainty quantifying deep learning models in handling complex predictive densities.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/el-laham23a.html
https://proceedings.mlr.press/v216/el-laham23a.htmlStudying the Effect of GNN Spatial Convolutions On The Embedding Space’s GeometryBy recursively summing node features over entire neighborhoods, spatial graph convolution operators have been heralded as key to the success of Graph Neural Networks (GNNs). Yet, despite the multiplication of GNN methods across tasks and applications, the effect of this aggregation operation has yet to be analyzed. In fact, while most recent efforts in the GNN community have focused on optimizing the architecture of the neural network, fewer works have attempted to characterize <em>(a) the different classes of spatial convolution operators</em>, <em>(b) their impact on the geometry of the embedding space</em>, and <em>(c) how the choice of a particular convolution should relate to properties of the data</em>. In this paper, we propose to answer all three questions by dividing existing operators into two main classes (<em>symmetrized</em> vs. <em>row-normalized</em> spatial convolutions), and show how these correspond to different implicit biases on the data. Finally, we show that this convolution operator is in fact tunable, and explicit regimes in which certain choices of convolutions — and therefore, embedding geometries — might be more appropriate.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/donnat23a.html
https://proceedings.mlr.press/v216/donnat23a.htmlNeural probabilistic logic programming in discrete-continuous domainsNeural-symbolic AI (NeSy) allows neural networks to exploit symbolic background knowledge in the form of logic. It has been shown to aid learning in the limited data regime and to facilitate inference on out-of-distribution data. Probabilistic NeSy focuses on integrating neural networks with both logic and probability theory, which additionally allows learning under uncertainty. A major limitation of current probabilistic NeSy systems, such as DeepProbLog, is their restriction to finite probability distributions, i.e., discrete random variables. In contrast, deep probabilistic programming (DPP) excels in modelling and optimising continuous probability distributions. Hence, we introduce DeepSeaProbLog, a neural probabilistic logic programming language that incorporates DPP techniques into NeSy. Doing so results in the support of inference and learning of both discrete and continuous probability distributions under logical constraints. Our main contributions are 1) the semantics of DeepSeaProbLog and its corresponding inference algorithm, 2) a proven asymptotically unbiased learning algorithm, and 3) a series of experiments that illustrate the versatility of our approach.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/de-smet23a.html
https://proceedings.mlr.press/v216/de-smet23a.htmlDirichlet Proportions Model for Hierarchically Coherent Probabilistic ForecastingProbabilistic, hierarchically coherent forecasting is a key problem in many practical forecasting applications – the goal is to obtain coherent probabilistic predictions for a large number of time series arranged in a pre-specified tree hierarchy. In this paper, we present an end-to-end deep probabilistic model for hierarchical forecasting that is motivated by a classical top-down strategy. It jointly learns the distribution of the root time series, and the (dirichlet) proportions according to which each parent time-series is split among its children at any point in time. The resulting forecasts are naturally coherent, and provide probabilistic predictions over all time series in the hierarchy. We experiment on several public datasets and demonstrate significant improvements of up to 26% on most datasets compared to state-of-the-art baselines. Finally, we also provide theoretical justification for the superiority of our top-down approach compared to the more traditional bottom-up modeling.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/das23b.html
https://proceedings.mlr.press/v216/das23b.htmlCrysMMNet: Multimodal Representation for Crystal Property PredictionMachine Learning models have emerged as a powerful tool for fast and accurate prediction of different crystalline properties. Exiting state-of-the-art models rely on a single modality of crystal data i.e crystal graph structure, where they construct multi-graph by establishing edges between nearby atoms in 3D space and apply GNN to learn materials representation. Thereby, they encode local chemical semantics around the atoms successfully but fail to capture important global periodic structural information like space group number, crystal symmetry, rotational information etc, which influence different crystal properties. In this work, we leverage textual descriptions of materials to model global structural information into graph structure and learn a more robust and enriched representation of crystalline materials. To this effect, we first curate a textual dataset for crystalline material databases containing descriptions of each material. Further, we propose CrysMMNet, a simple multi-modal framework, which fuses both structural and textual representation together to generate a joint multimodal representation of crystalline materials. We conduct extensive experiments on two benchmark datasets across ten different properties to show that CrysMMNet outperforms existing state-of-the-art baseline methods with a good margin. We also observe that fusing the textual representation with crystal graph structure provides consistent improvement for all the SOTA GNN models compared to their own vanilla versions. We have shared the textual dataset, that we have curated for both the benchmark material databases, with the community for future use..Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/das23a.html
https://proceedings.mlr.press/v216/das23a.htmlImprovable Gap Balancing for Multi-Task LearningIn multi-task learning (MTL), gradient balancing has recently attracted more research interest than loss balancing since it often leads to better performance. However, loss balancing is much more efficient than gradient balancing, and thus it is still worth further exploration in MTL. Note that prior studies typically ignore that there exist varying improvable gaps across multiple tasks, where the improvable gap per task is defined as the distance between the current training progress and desired final training progress. Therefore, after loss balancing, the performance imbalance still arises in many cases. In this paper, following the loss balancing framework, we propose two novel improvable gap balancing (IGB) algorithms for MTL: one takes a simple heuristic, and the other (for the first time) deploys deep reinforcement learning for MTL. Particularly, instead of directly balancing the losses in MTL, both algorithms choose to dynamically assign task weights for improvable gap balancing. Moreover, we combine IGB and gradient balancing to show the complementarity between the two types of algorithms. Extensive experiments on two benchmark datasets demonstrate that our IGB algorithms lead to the best results in MTL via loss balancing and achieve further improvements when combined with gradient balancing. Code is available at https://github.com/YanqiDai/IGB4MTL.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/dai23a.html
https://proceedings.mlr.press/v216/dai23a.htmlConditional abstraction trees for sample-efficient reinforcement learningIn many real-world problems, the learning agent needs to learn a problem’s abstractions and solution simultaneously. However, most such abstractions need to be designed and refined by hand for different problems and domains of application. This paper presents a novel top-down approach for constructing state abstractions while carrying out reinforcement learning (RL). Starting with state variables and a simulator, it presents a novel domain-independent approach for dynamically computing an abstraction based on the dispersion of temporal difference errors in abstract states as the agent continues acting and learning. Extensive empirical evaluation on multiple domains and problems shows that this approach automatically learns semantically rich abstractions that are finely-tuned to the problem, yield strong sample efficiency, and result in the RL agent significantly outperforming existing approaches.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/dadvar23a.html
https://proceedings.mlr.press/v216/dadvar23a.htmlBlackbox optimization of unimodal functionsWe provide an intuitive new algorithm for blackbox stochastic optimization of unimodal functions, a function class that we observe empirically can capture hyperparameter-tuning loss surfaces. Our method’s convergence guarantee automatically adapts to Lipschitz constants and other problem difficulty parameters, recovering and extending prior results. We complement our theoretical development with experimental validation on hyperparameter tuning tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/cutkosky23a.html
https://proceedings.mlr.press/v216/cutkosky23a.htmlLearning to reason about contextual knowledge for planning under uncertaintySequential decision-making (SDM) methods enable AI agents to compute an action policy toward achieving long-term goals under uncertainty. Existing research has shown that contextual knowledge in declarative forms can be used for improving the performance of SDM methods. However, the contextual knowledge from people tends to be incomplete and sometimes inaccurate, which greatly limits the applicability of knowledge-based SDM methods. In this paper, we develop a novel algorithm for knowledge-based SDM, called PERIL, that learns from interaction experience to reason about contextual knowledge, as applied to urban driving scenarios. Experiments have been conducted using CARLA, a widely used autonomous driving simulator. Results demonstrate PERIL’s superiority in comparison to existing knowledge-based SDM baselines.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/cui23a.html
https://proceedings.mlr.press/v216/cui23a.htmlHuman-in-the-Loop MixupAligning model representations to humans has been found to improve robustness and generalization. However, such methods often focus on standard observational data. Synthetic data is proliferating and powering many advances in machine learning; yet, it is not always clear whether synthetic labels are perceptually aligned to humans – rendering it likely model representations are not human aligned. We focus on the synthetic data used in mixup: a powerful regularizer shown to improve model robustness, generalization, and calibration. We design a comprehensive series of elicitation interfaces, which we release as HILL MixE Suite, and recruit 159 participants to provide perceptual judgments along with their uncertainties, over mixup examples. We find that human perceptions do not consistently align with the labels traditionally used for synthetic points, and begin to demonstrate the applicability of these findings to potentially increase the reliability of downstream models, particularly when incorporating human uncertainty. We release all elicited judgments in a new data hub we call H-Mix.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/collins23a.html
https://proceedings.mlr.press/v216/collins23a.htmlExpectation consistency for calibration of neural networksDespite their incredible performance, it is well reported that deep neural networks tend to be overoptimistic about their prediction confidence. Finding effective and efficient calibration methods for neural networks is therefore an important endeavour towards better uncertainty quantification in deep learning. In this manuscript, we introduce a novel calibration technique named expectation consistency (EC), consisting of a post-training rescaling of the last layer weights by enforcing that the average validation confidence coincides with the average proportion of correct labels. First, we show that the EC method achieves similar calibration performance to temperature scaling (TS) across different neural network architectures and data sets, all while requiring similar validation samples and computational resources. However, we argue that EC provides a principled method grounded on a Bayesian optimality principle known as the Nishimori identity. Next, we provide an asymptotic characterization of both TS and EC in a synthetic setting and show that their performance crucially depends on the target function. In particular, we discuss examples where EC significantly outperforms TS.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/clarte23a.html
https://proceedings.mlr.press/v216/clarte23a.htmlEstablishing Markov equivalence in cyclic directed graphsWe present a new, efficient procedure to establish Markov equivalence between directed graphs that may or may not contain cycles. It is based on the Cyclic Equivalence Theorem (CET) in the seminal works on cyclic models by Thomas Richardson in the mid ’90s, but now rephrased from an ancestral perspective. The resulting characterization leads to a procedure for establishing Markov equivalence between graphs that no longer requires explicit tests for $d$-separation, leading to a significantly reduced algorithmic complexity. The conceptually simplified characterization may help to reinvigorate theoretical research towards sound and complete cyclic discovery in the presence of latent confounders.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/claassen23a.html
https://proceedings.mlr.press/v216/claassen23a.htmlFinite-sample guarantees for Nash Q-learning with linear function approximationNash Q-learning may be considered one of the first and most known algorithms in multi-agent reinforcement learning (MARL) for learning policies that constitute a Nash equilibrium of an underlying general-sum Markov game. Its original proof provided asymptotic guarantees and was for the tabular case. Recently, finite-sample guarantees have been provided using more modern RL techniques for the tabular case. Our work analyzes Nash Q-learning using linear function approximation – a representation regime introduced when the state space is large or continuous – and provides finite-sample guarantees that indicate its sample efficiency. We find that the obtained performance nearly matches an existing efficient result for single-agent RL under the same representation and has a polynomial gap when compared to the best-known result for the tabular case.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/cisneros-velarde23a.html
https://proceedings.mlr.press/v216/cisneros-velarde23a.htmlParity calibrationIn a sequential regression setting, a decision-maker may be primarily concerned with whether the future observation will increase or decrease compared to the current one, rather than the actual value of the future observation. In this context, we introduce the notion of parity calibration, which captures the goal of calibrated forecasting for the increase-decrease (or “parity") event in a timeseries. Parity probabilities can be extracted from a forecasted distribution for the output, but we show that such a strategy leads to theoretical unpredictability and poor practical performance. We then observe that although the original task was regression, parity calibration can be expressed as binary calibration. Drawing on this connection, we use an online binary calibration method to achieve parity calibration. We demonstrate the effectiveness of our approach on real-world case studies in epidemiology, weather forecasting, and model-based control in nuclear fusion.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chung23a.html
https://proceedings.mlr.press/v216/chung23a.htmlCombinatorial categorized bandits with expert rankingsMany real-world systems such as e-commerce websites and content-serving platforms employ two-stage recommendation — in the first stage, multiple nominators (experts) provide ranked lists of items (one nominator per category, e.g., sports and political news articles), and in the second stage, an aggregator filters across the lists and outputs a single (short) list of K items to the users. The aggregation stage can be posed as a combinatorial multi-armed bandit problem, with the additional structure that the arms are grouped into categories (disjoint sets of items) and the ranking of arms within each category is known. We propose algorithms for selecting top K items in this setting under two learning objectives, namely minimizing regret over rounds and identifying the top K items within a fixed number of rounds. For each of the objectives, we provide sharp regret/error analysis using carefully defined notion of “gap” that exploits our problem structure. The resulting regret/error bounds strictly improve over prior work in combinatorial bandits literature. We also provide supporting evidence from simulations on synthetic and semi-synthetic problems.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chowdhury23a.html
https://proceedings.mlr.press/v216/chowdhury23a.htmlAdaptivity Complexity for Causal Graph DiscoveryCausal discovery from interventional data is an important problem, where the task is to design an interventional strategy that learns the hidden ground truth causal graph $G(V,E)$ on $|V| = n$ nodes while minimizing the number of performed interventions. Most prior interventional strategies broadly fall into two categories: non-adaptive and adaptive. Non-adaptive strategies decide on a single fixed set of interventions to be performed while adaptive strategies can decide on which nodes to intervene on sequentially based on past interventions. While adaptive algorithms may use exponentially fewer interventions than their non-adaptive counterparts, there are practical concerns that constrain the amount of adaptivity allowed. Motivated by this trade-off, we study the problem of $r$-adaptivity, where the algorithm designer recovers the causal graph under a total of $r$ sequential rounds whilst trying to minimize the total number of interventions. For this problem, we provide a $r$-adaptive algorithm that achieves $O(\min\{r,\log n\} \cdot n^{1/\min\{r,\log n\}})$ approximation with respect to the verification number, a well-known lower bound for adaptive algorithms. Furthermore, for every $r$, we show that our approximation is tight. Our definition of $r$-adaptivity interpolates nicely between the non-adaptive ($r=1$) and fully adaptive ($r=n$) settings where our approximation simplifies to $O(n)$ and $O(\log n)$ respectively, matching the best-known approximation guarantees for both extremes. Our results also extend naturally to the bounded size interventions.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/choo23a.html
https://proceedings.mlr.press/v216/choo23a.htmlEnhancing Treatment Effect Estimation: A Model Robust Approach Integrating Randomized Experiments and External Controls using the Double Penalty Integration EstimatorRandomized experiments (REs) are the cornerstone for treatment effect evaluation. However, due to practical considerations, REs may encounter difficulty recruiting sufficient patients. External controls (ECs) can supplement REs to boost estimation efficiency. Yet, there may be incomparability between ECs and concurrent controls (CCs), resulting in misleading treatment effect evaluation. We introduce a novel bias function to measure the difference in the outcome mean functions between ECs and CCs. We show that the ANCOVA model augmented by the bias function for ECs renders a consistent estimator of the average treatment effect, regardless of whether or not the ANCOVA model is correct. To accommodate possibly different structures of the ANCOVA model and the bias function, we propose a double penalty integration estimator (DPIE) with different penalization terms for the two functions. With an appropriate choice of penalty parameters, our DPIE ensures consistency, oracle property, and asymptotic normality even in the presence of model misspecification. DPIE is at least as efficient as the estimator derived from REs alone, validated through theoretical and experimental results.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/cheng23a.html
https://proceedings.mlr.press/v216/cheng23a.htmlDetection of Short-Term Temporal Dependencies in Hawkes Processes with Heterogeneous Background DynamicsMany kinds of simultaneously-observed <em>event sequences</em> exhibit mutually exciting or inhibiting patterns. Reliable detection of such temporal dependencies is crucial for scientific investigation. A common model is the Multivariate Hawkes Process (MHP), whose impact function naturally encodes a causal structure in Granger causality. However, the vast majority of existing methods use a transformed <em>standard</em> MHP intensity with a constant baseline, which may be inconsistent with real-world data. On the other hand, modeling irregular and unknown background dynamics directly is a challenge, as one struggles to distinguish the effect of mutual interaction from that of fluctuations in background dynamics. In this paper, we address the short-term temporal dependency detection issue. We show that maximum likelihood estimation (MLE) for cross-impact from MHP has an error that can not be eliminated, but may be reduced by an order of magnitude using a heterogeneous intensity not for the target HP but for the interacting HP. Then we propose a robust and computationally-efficient modification of MLE that does not rely on the prior estimation of the heterogeneous intensity and is thus applicable in a data-limited regime (e.g., few-shot, unrepeated observations). Extensive experiments on various datasets show that our method outperforms existing ones by notable margins, with highlighted novel applications in neuroscience.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23g.html
https://proceedings.mlr.press/v216/chen23g.htmlCausal inference with outcome-dependent missingness and self-censoringWe consider missingness in the context of causal inference when the outcome of interest may be missing. If the outcome directly affects its own missingness status, i.e., it is “self-censoring”, this may lead to severely biased causal effect estimates. Miao et al. (2015) proposed the shadow variable method to correct for bias due to self-censoring; however, verifying the required model assumptions can be difficult. Here, we propose a test based on a randomized incentive variable offered to encourage reporting of the outcome that can be used to verify identification assumptions that are sufficient to correct for both self-censoring and confounding bias. Concretely, the test confirms whether a given set of pre-treatment covariates is sufficient to block all backdoor paths between the treatment and outcome as well as all paths between the treatment and missingness indicator after conditioning on the outcome. We show that under these conditions, the causal effect is identified by using the treatment as a shadow variable, and it leads to an intuitive inverse probability weighting estimator that uses a product of the treatment and response weights. We evaluate the efficacy of our test and downstream estimator via simulations.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23f.html
https://proceedings.mlr.press/v216/chen23f.htmlDifferential Privacy in Cooperative Multiagent PlanningPrivacy-aware multiagent systems must protect agents’ sensitive data while simultaneously ensuring that agents accomplish their shared objectives. Towards this goal, we propose a framework to privatize inter-agent communications in cooperative multiagent decision-making problems. We study sequential decision-making problems formulated as cooperative Markov games with reach-avoid objectives. We apply a differential privacy mechanism to privatize agents’ communicated symbolic state trajectories, and analyze tradeoffs between the strength of privacy and the team’s performance. For a given level of privacy, this tradeoff is shown to depend critically upon the total correlation among agents’ state-action processes. We synthesize policies that are robust to privacy by reducing the value of the total correlation. Numerical experiments demonstrate that the team’s performance under these policies decreases by only 6 percent when comparing private versus non-private implementations of communication. By contrast, the team’s performance decreases by 88 percent when using baseline policies that ignore total correlation and only optimize team performance.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23e.html
https://proceedings.mlr.press/v216/chen23e.htmlMFA: Multi-layer Feature-aware Attack for Object DetectionPhysical adversarial attacks can mislead detectors in real-world scenarios and have attracted increasing attention. However, most existing works manipulate the detector’s final outputs as attack targets while ignoring the inherent characteristics of objects. This can result in attacks being trapped in model-specific local optima and reduced transferability. To address this issue, we propose a <em>Multi-layer Feature-aware Attack</em> (MFA) that considers the importance of multi-layer features and disrupts critical object-aware features that dominate decision-making across different models. Specifically, we leverage the location and category information of detector outputs to assign attribution scores to different feature layers. Then, we weight each feature according to their attribution results and design a pixel-level loss function in the opposite optimized direction of object detection to generate adversarial camouflages. We conduct extensive experiments in both digital and physical worlds on ten outstanding detection models and demonstrate the superior performance of MFA in terms of attacking capability and transferability. Our code is available at: \url{https://github.com/ChenWen1997/MFA}.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23d.html
https://proceedings.mlr.press/v216/chen23d.htmlAn effective negotiating agent framework based on deep offline reinforcement learningLearning is crucial for automated negotiation, and recent years have witnessed a remarkable achievement in application of reinforcement learning (RL) for various negotiation tasks. Conventional RL methods focus generally on learning from active interactions with opposing negotiators. However, collecting online data is expensive in many realistic negotiation scenarios. While previous studies partially mitigate this problem through the use of opponent simulators (i.e., agents following known strategies), in reality it is usually hard to fully capture an opponent’s negotiation strategy. Moreover, a further challenge lies in an agent’s capability of adapting to dynamic variations of an opponent’s preferences or strategies, which may happen from time to time for different reasons in subsequent negotiations. In response to these challenges, this article proposes a novel Deep Offline Reinforcement learning Negotiating Agent framework that allows to learn an effective strategy using previously collected negotiation datasets without requiring interaction with an opponent. This is in contrast to existing RL-based negotiation approaches that all rely on active interaction with opponents. Furthermore, the strategy fine-tuning mechanism is included to adjust the learned strategy in response to the preferences or strategy changes of the opponent. The performance of the proposed framework is evaluated based on a diverse set of state-of-the-art baselines under different settings. Experimental results show that the framework allows to learn effective strategies exclusively with offline datasets, and is also capable of effectively adapting to changes of an opponent’s preferences or strategy.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23c.html
https://proceedings.mlr.press/v216/chen23c.htmlBenign Overfitting in Adversarially Robust Linear ClassificationBenign overfitting, where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification for over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting can occur in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples, on subGaussian mixture data. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under Lp adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23b.html
https://proceedings.mlr.press/v216/chen23b.htmlModified Retrace for Off-Policy Temporal Difference LearningOff-policy learning is a key to extend reinforcement learning as it allows to learn a target policy from a different behavior policy that generates the data. However, it is well known as “the deadly triad” when combined with bootstrapping and function approximation. Retrace is an efficient and convergent off-policy algorithm with tabular value functions which employs truncated importance sampling ratios. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chen23a.html
https://proceedings.mlr.press/v216/chen23a.htmlLearning in online MDPs: is there a price for handling the communicating case?It is a remarkable fact that the same $O(\sqrt{T})$ regret rate can be achieved in both the Experts Problem and the Adversarial Multi-Armed Bandit problem albeit with a worse dependence on number of actions in the latter case. In contrast, it has been shown that handling online MDPs with communicating structure and bandit information incurs $\Omega(T^{2/3})$ regret even in the case of deterministic transitions. Is this the price we pay for handling communicating structure or is it because we also have bandit feedback? In this paper we show that with full information, online MDPs can still be learned at an $O(\sqrt{T})$ rate even in the presence of communicating structure. We first show this by proposing an efficient follow the perturbed leader (FPL) algorithm for the deterministic transition case. We then extend our scope to consider stochastic transitions where we first give an inefficient $O(\sqrt{T})$-regret algorithm (with a mild additional condition on the dynamics). Then we show how to achieve $O\left(\sqrt{\frac{T}{\alpha}}\right)$ regret rate using an oracle-efficient algorithm but with the additional restriction that the starting state distribution has mass at least $\alpha$ on each state.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chandrasekaran23a.html
https://proceedings.mlr.press/v216/chandrasekaran23a.htmlScalable nonparametric Bayesian learning for dynamic velocity fieldsLearning and understanding heterogeneous patterns in complex spatio-temporal data is an important and challenging task across domains in science and engineering. In this work, we develop a model for learning heterogeneous and dynamic patterns of velocity field data, motivated by applications in the transportation domain. We draw from basic nonparametric Bayesian modeling elements such as the infinite hidden Markov model and Gaussian process and focus on making the learning of such a stochastic model scalable for voluminous and streaming data. This is achieved by employing sequential MAP estimates from the infinite HMM model, an efficient sequential sparse GP posterior computation, and refinement of the estimates using the Viterbi algorithm, which is shown to work effectively on a careful simulation study. We demonstrate the efficacy of our techniques to the NGSIM dataset of complex multi-vehicle interactions.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/chakraborty23a.html
https://proceedings.mlr.press/v216/chakraborty23a.htmlHuman Control: Definitions and AlgorithmsHow can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. We also analyse the related concepts of non-obstruction and shutdown alignment, three previously proposed algorithms for human control, and one new algorithm.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/carey23a.html
https://proceedings.mlr.press/v216/carey23a.htmlScaling integer arithmetic in probabilistic programsDistributions on integers are ubiquitous in probabilistic modeling but remain challenging for many of today’s probabilistic programming languages (PPLs). The core challenge comes from discrete structure: many of today’s PPL inference strategies rely on enumeration, sampling, or differentiation in order to scale, which fail for high-dimensional complex discrete distributions involving integers. Our insight is that there is structure in arithmetic that these approaches are not using. We present a binary encoding strategy for discrete distributions that exploits the rich logical structure of integer operations like summation and comparison. We leverage this structured encoding with knowledge compilation to perform exact probabilistic inference, and show that this approach scales to much larger integer distributions with arithmetic.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/cao23b.html
https://proceedings.mlr.press/v216/cao23b.htmlOvercoming Language Priors for Visual Question Answering via Loss Rebalancing Label and Global ContextDespite the advances in Visual Question Answering (VQA), many VQA models currently suffer from language priors (i.e. generating answers directly from questions without using images), which severely reduces their robustness in real-world scenarios. We propose a novel training strategy called Loss Rebalancing Label and Global Context (LRLGC) to alleviate the above problem. Specifically, the Loss Rebalancing Label (LRL) is dynamically constructed based on the degree of sample bias to accurately adjust losses across samples and ensure a more balanced form of total losses in VQA. In addition, the Global Context (GC) provides the model with valid global information to assist the model in predicting answers more accurately. Finally, the model is trained through an ensemble-based approach that retains the beneficial effects of biased samples on the model while reducing their importance. Our approach is model-agnostic and enables end-to-end training. Extensive experimental results show that LRLGC (1) improves performance for various VQA models and (2) performs competitively in the VQA-CP v2 benchmark test.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/cao23a.html
https://proceedings.mlr.press/v216/cao23a.htmlTesting conventional wisdom (of the crowd)Do common assumptions about the way that crowd workers make mistakes in microtask (labeling) applications manifest in real crowdsourcing data? Prior work only addresses this question indirectly. Instead, it primarily focuses on designing new label aggregation algorithms, seeming to imply that better performance justifies any additional assumptions. However, empirical evidence in past instances has raised significant challenges to common assumptions. We continue this line of work, using crowdsourcing data itself as directly as possible to interrogate several basic assumptions about workers and tasks. We find strong evidence that the assumption that workers respond correctly to each task with a constant probability, which is common in theoretical work, is implausible in real data. We also illustrate how heterogeneity among tasks and workers can take different forms, which have different implications for the design and evaluation of label aggregation algorithms.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/burrell23a.html
https://proceedings.mlr.press/v216/burrell23a.htmlInference for mark-censored temporal point processesMarked temporal point processes (MTPPs) are a general class of stochastic models for modeling the evolution of events of different types (“marks”) in continuous time. These models have broad applications in areas such as medical data monitoring, financial prediction, user modeling, and communication networks. Of significant practical interest in such problems is the issue of missing or censored data over time. In this paper, we focus on the specific problem of inference for a trained MTPP model when events of certain types are not observed over a period of time during prediction. We introduce the concept of mark-censored sub-processes and use this framework to develop a novel marginalization technique for inference in the presence of censored marks. The approach is model-agnostic and applicable to any MTPP model with a well-defined intensity function. We illustrate the flexibility and utility of the method in the context of both parametric and neural MTPP models, with results across a range of datasets including data from simulated Hawkes processes, self-correcting processes, and multiple real-world event datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/boyd23a.html
https://proceedings.mlr.press/v216/boyd23a.htmlApproximating probabilistic explanations via supermodular minimizationExplaining in accurate and intelligible terms the predictions made by classifiers is a key challenge of eXplainable Artificial Intelligence (XAI). To this end, an abductive explanation for the predicted label of some data instance is a subset-minimal collection of features such that the restriction of the instance to these features is sufficient to determine the prediction. However, due to cognitive limitations, abductive explanations are often too large to be interpretable. In those cases, we need to reduce the size of abductive explanations, while still determining the predicted label with high probability. In this paper, we show that finding such probabilistic explanations is NP-hard, even for decision trees. In order to circumvent this issue, we investigate the approximability of probabilistic explanations through the lens of supermodularity. We examine both greedy descent and greedy ascent approaches for supermodular minimization, whose approximation guarantees depend on the curvature of the “unnormalized” error function that evaluates the precision of the explanation. Based on various experiments for explaining decision tree predictions, we show that our greedy algorithms provide an efficient alternative to the state-of-the-art constraint optimization method.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bounia23a.html
https://proceedings.mlr.press/v216/bounia23a.htmlEfficient Learning of Minimax Risk Classifiers in High DimensionsHigh-dimensional data is common in multiple areas, such as health care and genomics, where the number of features can be tens of thousands. In such scenarios, the large number of features often leads to inefficient learning. Constraint generation methods have recently enabled efficient learning of L1-regularized support vector machines (SVMs). In this paper, we leverage such methods to obtain an efficient learning algorithm for the recently proposed minimax risk classifiers (MRCs). The proposed iterative algorithm also provides a sequence of worst-case error probabilities and performs feature selection. Experiments on multiple high-dimensional datasets show that the proposed algorithm is efficient in high-dimensional scenarios. In addition, the worst-case error probability provides useful information about the classifier performance, and the features selected by the algorithm are competitive with the state-of-the-art.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bondugula23a.html
https://proceedings.mlr.press/v216/bondugula23a.htmlCorrecting for selection bias and missing response in regression using privileged informationWhen estimating a regression model, we might have data where some labels are missing, or our data might be biased by a selection mechanism. When the response or selection mechanism is ignorable (i.e., independent of the response variable given the features) one can use off-the-shelf regression methods; in the nonignorable case one typically has to adjust for bias. We observe that privileged information (i.e. information that is only available during training) might render a nonignorable selection mechanism ignorable, and we refer to this scenario as Privilegedly Missing at Random (PMAR). We propose a novel imputation-based regression method, named repeated regression, that is suitable for PMAR. We also consider an importance weighted regression method, and a doubly robust combination of the two. The proposed methods are easy to implement with most popular out-of-the-box regression algorithms. We empirically assess the performance of the proposed methods with extensive simulated experiments and on a synthetically augmented real-world dataset. We conclude that repeated regression can appropriately correct for bias, and can have considerable advantage over weighted regression, especially when extrapolating to regions of the feature space where response is never observed.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/boeken23a.html
https://proceedings.mlr.press/v216/boeken23a.htmlAmortized Inference for Gaussian Process Hyperparameters of Structured KernelsLearning the kernel parameters for Gaussian processes is often the computational bottleneck in applications such as online learning, Bayesian optimization, or active learning. Amortizing parameter inference over different datasets is a promising approach to dramatically speed up training time. However, existing methods restrict the amortized inference procedure to a fixed kernel structure. The amortization network must be redesigned manually and trained again in case a different kernel is employed, which leads to a large overhead in design time and training time. We propose amortizing kernel parameter inference over a complete kernel-structure-family rather than a fixed kernel structure. We do that via defining an amortization network over pairs of datasets and kernel structures. This enables fast kernel inference for each element in the kernel family without retraining the amortization network. As a by-product, our amortization network is able to do fast ensembling over kernel structures. In our experiments, we show drastically reduced inference time combined with competitive test performance for a large set of kernels and datasets.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bitzer23a.html
https://proceedings.mlr.press/v216/bitzer23a.htmlBeliefPPG: Uncertainty-aware heart rate estimation from PPG signals via belief propagationWe present a novel learning-based method that achieves state-of-the-art performance on several heart rate estimation benchmarks extracted from photoplethysmography signals (PPG). We consider the evolution of the heart rate in the context of a discrete-time stochastic process that we represent as a hidden Markov model. We derive a distribution over possible heart rate values for a given PPG signal window through a trained neural network. Using belief propagation, we incorporate the statistical distribution of heart rate changes to refine these estimates in a temporal context. From this, we obtain a quantized probability distribution over the range of possible heart rate values that captures a meaningful and well-calibrated estimate of the inherent predictive uncertainty. We show the robustness of our method on eight public datasets with three different cross-validation experiments.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bieri23a.html
https://proceedings.mlr.press/v216/bieri23a.htmlBest arm identification in rare eventsWe consider the Best Arm Identification (BAI) problem in the stochastic multi-armed bandit framework, where each arm has a small probability of realizing large rewards, while with overwhelming probability, the reward is zero. A key application of this framework is in online advertising, where click rates of advertisements could be a fraction of a single percent and final conversion to sales, while highly profitable, may again be a small fraction of the click rates. Lately, algorithms for BAI problems have been developed that minimise sample complexity while providing statistical guarantees on the correct arm selection. As we observe, these algorithms can be computationally prohibitive. We exploit the fact that the reward process for each arm is well approximated by a Compound Poisson process and arrive at algorithms that are faster, with a small increase in sample complexity. We analyze the problem in an asymptotic regime as rarity of reward occurrence reduces to zero, and reward amounts increase to infinity. This helps illustrate the benefits of the proposed algorithm. It also sheds light on the underlying structure of the optimal BAI algorithms in the rare event setting.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bhattacharjee23a.html
https://proceedings.mlr.press/v216/bhattacharjee23a.htmlInference of a rumor’s source in the independent cascade modelWe consider the so-called Independent Cascade Model for rumor spreading or epidemic processes popularized by Kempe et al. (2003). In this model, a node of a network is the source of a rumor – it is informed. In discrete time steps, each informed node “infects” each of its uninformed neighbors with probability p. While many facets of this process are studied in the literature, less is known about the inference problem: given a number of infected nodes in a network, can we learn the source of the rumor? In the context of epidemiology this problem is often referred to as patient zero problem. It belongs to a broader class of problems where the goal is to infer parameters of the underlying spreading model. In this work we present a maximum likelihood estimator for the rumor’s source, given a snapshot of the process in terms of a set of active nodes X after t steps. Our results show that, for acyclic graphs, the likelihood estimator undergoes a phase transition as a function of $t$. We provide a rigorous analysis for two prominent classes of acyclic network, namely d-regular trees and Galton-Watson trees, and verify empirically that our heuristics work well in various general networks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/berenbrink23a.html
https://proceedings.mlr.press/v216/berenbrink23a.htmlLearning Choice Functions with Gaussian ProcessesIn consumer theory, ranking available objects by means of preference relations yields the most common description of individual choices. However, preference-based models assume that individuals: (1) give their preferences only between pairs of objects; (2) are always able to pick the best preferred object. In many situations, they may be instead choosing out of a set with more than two elements and, because of lack of information and/or incomparability (objects with contradictory characteristics), they may not be able to select a single most preferred object. To address these situations, we need a choice model which allows an individual to express a set-valued choice. Choice functions provide such a mathematical framework. We propose a Gaussian Process model to learn choice functions from choice data. The model assumes a multiple utility representation of a choice function based on the concept of Pareto rationalization, and derives a strategy to learn both the number and the values of these latent multiple utilities. Simulation experiments demonstrate that the proposed model outperforms the state-of-the-art methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/benavoli23a.html
https://proceedings.mlr.press/v216/benavoli23a.htmlSample Boosting Algorithm (SamBA) - An interpretable greedy ensemble classifier based on local expertise for fat dataEnsemble methods are a very diverse family of algorithms with a wide range of applications. One of the most commonly used is boosting, with the prominent Adaboost. Adaboost relies on greedily learning base classifiers that rectify the error from previous iterations. Then, it combines them through a weighted majority vote, based on their quality on the entire learning set. In this paper, we propose a supervised binary classification framework that propagates the local knowledge acquired during the boosting iterations to the prediction function. Based on this general framework, we introduce SamBA, an interpretable greedy ensemble method designed for fat datasets, with a large number of dimensions and a small number of samples. SamBA learns local classifiers and combines them, using a similarity function, to optimize its efficiency in data extraction. We provide a theoretical analysis of SamBA, yielding convergence and generalization guarantees. In addition, we highlight SamBA’s empirical behavior in an extensive experimental analysis on both real biological and generated datasets, comparing it to state-of-the-art ensemble methods and similarity-based approaches.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bauvin23a.html
https://proceedings.mlr.press/v216/bauvin23a.htmlDo we become wiser with time? On causal equivalence with tiered background knowledgeEquivalence classes of DAGs (represented by CPDAGs) may be too large to provide useful causal information. Here, we address incorporating tiered background knowledge yielding restricted equivalence classes represented by ‘tiered MPDAGs’. Tiered knowledge leads to considerable gains in informativeness and computational efficiency: We show that construction of tiered MPDAGs only requires application of Meeks 1st rule, and that tiered MPDAGs (unlike general MPDAGs) are chain graphs with chordal components. This entails simplifications e.g. of determining valid adjustment sets for causal effect estimation. Further, we characterise when one tiered ordering is more informative than another, providing insights into useful aspects of background knowledge.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/bang23a.html
https://proceedings.mlr.press/v216/bang23a.htmlNeural tangent kernel at initialization: linear width sufficesIn this paper we study the problem of lower bounding the minimum eigenvalue of the neural tangent kernel (NTK) at initialization, an important quantity for the theoretical analysis of training in neural networks. We consider feedforward neural networks with smooth activation functions. Without any distributional assumptions on the input, we present a novel result: we show that for suitable initialization variance, $\widetilde{\Omega}(n)$ width, where $n$ is the number of training samples, suffices to ensure that the NTK at initialization is positive definite, improving prior results for smooth activations under our setting. Prior to our work, the sufficiency of linear width has only been shown either for networks with ReLU activation functions, and sublinear width has been shown for smooth networks but with additional conditions on the distribution of the data. The technical challenge in the analysis stems from the layerwise inhomogeneity of smooth activation functions and we handle the challenge using {\em generalized} Hermite series expansion of such activations.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/banerjee23a.html
https://proceedings.mlr.press/v216/banerjee23a.htmlBayesian inference approach for entropy regularized reinforcement learning with stochastic dynamicsWe develop a novel approach to determine the optimal policy in entropy-regularized reinforcement learning (RL) with stochastic dynamics. For deterministic dynamics, the optimal policy can be derived using Bayesian inference in the control-as-inference framework; however, for stochastic dynamics, the direct use of this approach leads to risk-taking optimistic policies. To address this issue, current approaches in entropy-regularized RL involve a constrained optimization procedure which fixes system dynamics to the original dynamics, however this approach is not consistent with the unconstrained Bayesian inference framework. In this work we resolve this inconsistency by developing an exact mapping from the constrained optimization problem in entropy-regularized RL to a different optimization problem which can be solved using the unconstrained Bayesian inference approach. We show that the optimal policies are the same for both problems, thus our results lead to the exact solution for the optimal policy in entropy-regularized RL with stochastic dynamics through Bayesian inference.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/arriojas23a.html
https://proceedings.mlr.press/v216/arriojas23a.htmlQuantifying lottery tickets under label noise: accuracy, calibration, and complexityPruning deep neural networks is a widely used strategy to alleviate the computational burden in machine learning. Overwhelming empirical evidence suggests that pruned models retain very high accuracy even with a tiny fraction of parameters. However, relatively little work has gone into characterising the small pruned networks obtained, beyond a measure of their accuracy. In this paper, we use the sparse double descent approach to identify univocally and characterise pruned models associated with classification tasks. We observe empirically that, for a given task, iterative magnitude pruning (IMP) tends to converge to networks of comparable sizes even when starting from full networks with sizes ranging over orders of magnitude. We analyse the best pruned models in a controlled experimental setup and show that their number of parameters reflects task difficulty and that they are much better than full networks at capturing the true conditional probability distribution of the labels. On real data, we similarly observe that pruned models are less prone to overconfident predictions. Our results suggest that pruned models obtained via IMP not only have advantageous computational properties but also provide a better representation of uncertainty in learning.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/arora23a.html
https://proceedings.mlr.press/v216/arora23a.htmlTwo Sides of Miscalibration: Identifying Over and Under-Confidence Prediction for Network CalibrationProper confidence calibration of deep neural networks is essential for reliable predictions in safety-critical tasks. Miscalibration can lead to model over-confidence and/or under-confidence; i.e., the model’s confidence in its prediction can be greater or less than the model’s accuracy. Recent studies have highlighted the over-confidence issue by introducing calibration techniques and demonstrated success on various tasks. However, miscalibration through under-confidence has not yet to receive much attention. In this paper, we address the necessity of paying attention to the under-confidence issue. We first introduce a novel metric, a miscalibration score, to identify the overall and class-wise calibration status, including being over or under-confident. Our proposed metric reveals the pitfalls of existing calibration techniques, where they often overly calibrate the model and worsen under-confident predictions. Then we utilize the class-wise miscalibration score as a proxy to design a calibration technique that can tackle both over and under-confidence. We report extensive experiments that show our proposed methods substantially outperforming existing calibration techniques. We also validate our proposed calibration technique on an automatic failure detection task with a risk-coverage curve, reporting that our methods improve failure detection as well as trustworthiness of the model. The code are available at \url{https://github.com/AoShuang92/miscalibration_TS}.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/ao23a.html
https://proceedings.mlr.press/v216/ao23a.htmlRobust Gaussian process regression with the trimmed marginal likelihoodAccurate outlier detection is not only a necessary preprocessing step, but can itself give important insights into the data. However, especially, for non-linear regression the detection of outliers is non-trivial, and actually ambiguous. We propose a new method that identifies outliers by finding a subset of data points T such that the marginal likelihood of all remaining data points S is maximized. Though the idea is more general, it is particular appealing for Gaussian processes regression, where the marginal likelihood has an analytic solution. While maximizing the marginal likelihood for hyper-parameter optimization is a well established non-convex optimization problem, optimizing the set of data points S is not. Indeed, even a greedy approximation is computationally challenging due to the high cost of evaluating the marginal likelihood. As a remedy, we propose an efficient projected gradient descent method with provable convergence guarantees. Moreover, we also establish the breakdown point when jointly optimizing hyper-parameters and S. For various datasets and types of outliers, our experiments demonstrate that the proposed method can improve outlier detection and robustness when compared with several popular alternatives like the student-t likelihood.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/andrade23a.html
https://proceedings.mlr.press/v216/andrade23a.htmlTransfer learning for individual treatment effect estimationThis work considers the problem of transferring causal knowledge between tasks for Individual Treatment Effect (ITE) estimation. To this end, we theoretically assess the feasibility of transferring ITE knowledge and present a practical framework for efficient transfer. A lower bound is introduced on the ITE error of the target task to demonstrate that ITE knowledge transfer is challenging due to the absence of counterfactual information. Nevertheless, we establish generalization upper bounds on the counterfactual loss and ITE error of the target task, demonstrating the feasibility of ITE knowledge transfer. Subsequently, we introduce a framework with a new Causal Inference Task Affinity (CITA) measure for ITE knowledge transfer. Specifically, we use CITA to find the closest source task to the target task and utilize it for ITE knowledge transfer. Empirical studies are provided, demonstrating the efficacy of the proposed method. We observe that ITE knowledge transfer can significantly (up to 95%) reduce the amount of data required for ITE estimation.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/aloui23a.html
https://proceedings.mlr.press/v216/aloui23a.htmlFLASH: Automating federated learning using CASHIn this paper, we present FLASH, a framework which addresses for the first time the central AutoML problem of Combined Algorithm Selection and HyperParameter (HP) Optimization (CASH) in the context of Federated Learning (FL). To limit training cost, FLASH incrementally adapts the set of algorithms to train based on their projected loss rates, while supporting decentralized (federated) implementation of the embedded hyper-parameter optimization (HPO), model selection and loss calculation problems. We provide a theoretical analysis of the training and validation loss under FLASH, and their tradeoff with the training cost measured as the data wasted in training sub-optimal algorithms. The bounds depend on the degree of dissimilarity between the datasets of the clients, a result of FL restriction that client datasets remain private. Through extensive experimental investigation on several datasets, we evaluate three variants of FLASH, and show that FLASH performs close to centralized CASH methods.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/alam23a.html
https://proceedings.mlr.press/v216/alam23a.htmlA decoder suffices for query-adaptive variational inferenceDeep generative models like variational autoencoders (VAEs) are widely used for density estimation and dimensionality reduction, but infer latent representations via amortized inference algorithms, which require that all data dimensions are observed. VAEs thus lack a key strength of probabilistic graphical models: the ability to infer posteriors for test queries with arbitrary structure. We demonstrate that many prior methods for imputation with VAEs are costly and ineffective, and achieve superior performance via query-adaptive variational inference (QAVI) algorithms based directly on the generative decoder. By analytically marginalizing arbitrary sets of missing features, and optimizing expressive posteriors including mixtures and density flows, our non-amortized QAVI algorithms achieve excellent performance while avoiding expensive model retraining. On standard image and tabular datasets, our approach substantially outperforms prior methods in the plausibility and diversity of imputations. We also show that QAVI effectively generalizes to recent hierarchical VAE models for high-dimensional images.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/agarwal23a.html
https://proceedings.mlr.press/v216/agarwal23a.htmlBounding the optimal value function in compositional reinforcement learningIn the field of reinforcement learning (RL), agents are often tasked with solving a variety of problems differing only in their reward functions. In order to quickly obtain solutions to unseen problems with new reward functions, a popular approach involves functional composition of previously solved tasks. However, previous work using such functional composition has primarily focused on specific instances of composition functions, whose limiting assumptions allow for exact zero-shot composition. Our work unifies these examples and provides a more general framework for compositionality in both standard and entropy-regularized RL. We find that, for a broad class of functions, the optimal solution for the composite task of interest can be related to the known primitive task solutions. Specifically, we present double-sided inequalities relating the optimal composite value function to the value functions for the primitive tasks. We also show that the regret of using a zero-shot policy can be bounded for this class of functions. The derived bounds can be used to develop clipping approaches for reducing uncertainty during training, allowing agents to quickly adapt to new tasks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/adamczyk23a.html
https://proceedings.mlr.press/v216/adamczyk23a.htmlGuided Deep Kernel LearningCombining Gaussian processes with the expressive power of deep neural networks is commonly done nowadays through deep kernel learning (DKL). Unfortunately, due to the kernel optimization process, this often results in losing their Bayesian benefits. In this study, we present a novel approach for learning deep kernels by utilizing infinite-width neural networks. We propose to use the Neural Network Gaussian Process (NNGP) model as a guide to the DKL model in the optimization process. Our approach harnesses the reliable uncertainty estimation of the NNGPs to adapt the DKL target confidence when it encounters novel data points. As a result, we get the best of both worlds, we leverage the Bayesian behavior of the NNGP, namely its robustness to overfitting, and accurate uncertainty estimation, while maintaining the generalization abilities, scalability, and flexibility of deep kernels. Empirically, we show on multiple benchmark datasets of varying sizes and dimensionality, that our method is robust to overfitting, has good predictive performance, and provides reliable uncertainty estimations.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/achituve23a.html
https://proceedings.mlr.press/v216/achituve23a.htmlSubMix: Learning to Mix Graph Sampling HeuristicsSampling subgraphs for training Graph Neural Networks (GNNs) is receiving much attention from the GNN community. While a variety of methods have been proposed, each method samples the graph according to its own heuristic. However, there has been little work in mixing these heuristics in an end-to-end trainable manner. In this work, we design a generative framework for graph sampling. Our method, SubMix, parameterizes subgraph sampling as a convex combination of heuristics. We show that a continuous relaxation of the discrete sampling process allows us to efficiently obtain analytical gradients for training the sampling parameters. Our experimental results illustrate the usefulness of learning graph sampling in three scenarios: (1) robust training of GNNs by automatically learning to discard noisy edge sources; (2) improving model performance by trainable and online edge subset selection; and (3) by integrating our framework into decoupled GNN models improves their performance on standard benchmarks.Sun, 02 Jul 2023 00:00:00 +0000
https://proceedings.mlr.press/v216/abu-el-haija23a.html
https://proceedings.mlr.press/v216/abu-el-haija23a.html