Proceedings of Machine Learning ResearchProceedings of The 27th International Conference on Artificial Intelligence and Statistics
Held in Palau de Congressos, Valencia, Spain on 02-04 May 2024
Published as Volume 238 by the Proceedings of Machine Learning Research on 18 April 2024.
Volume Edited by:
Sanjoy Dasgupta
Stephan Mandt
Yingzhen Li
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/v238/
Fri, 19 Jul 2024 09:26:27 +0000Fri, 19 Jul 2024 09:26:27 +0000Jekyll v3.9.5Near Optimal Adversarial Attacks on Stochastic Bandits and Defenses with Smoothed ResponsesI study adversarial attacks against stochastic bandit algorithms. At each round, the learner chooses an arm, and a stochastic reward is generated. The adversary strategically adds corruption to the reward, and the learner is only able to observe the corrupted reward at each round. Two sets of results are presented in this paper. The first set studies the optimal attack strategies for the adversary. The adversary has a target arm he wishes to promote, and his goal is to manipulate the learner into choosing this target arm $T - o(T)$ times. I design attack strategies against UCB and Thompson Sampling that only spends $\widehat{O}(\sqrt{\log T})$ cost. Matching lower bounds are presented, and the vulnerability of UCB, Thompson sampling and $\varepsilon$-greedy are exactly characterized. The second set studies how the learner can defend against the adversary. Inspired by literature on smoothed analysis and behavioral economics, I present two simple algorithms that achieve a competitive ratio arbitrarily close to 1.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zuo24a.html
https://proceedings.mlr.press/v238/zuo24a.htmlMulti-Dimensional Hyena for Spatial Inductive BiasThe advantage of Vision Transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer’s self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is attached as supplementary.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zimerman24a.html
https://proceedings.mlr.press/v238/zimerman24a.htmlRobust Offline Reinforcement Learning with Heavy-Tailed RewardsThis paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods on the logged dataset exhibits heavy-tailed reward distributions. The implementation of the proposal is available at \url{https://github.com/Mamba413/ROOM}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhu24a.html
https://proceedings.mlr.press/v238/zhu24a.htmlTiming as an Action: Learning When to Observe and ActIn standard reinforcement learning setups, the agent receives observations and performs actions at evenly spaced intervals. However, in many real-world settings, observations are expensive, forcing agents to commit to courses of action for designated periods of time. Consider that doctors, after each visit, typically set not only a treatment plan but also a follow-up date at which that plan might be revised. In this work, we formalize the setup of timing-as-an-action. Through theoretical analysis in the tabular setting, we show that while the choice of delay intervals could be naively folded in as part of a composite action, these actions have a special structure and handling them intelligently yields statistical advantages. Taking a model-based perspective, these gains owe to the fact that delay actions do not add any parameters to the underlying model. For model estimation, we provide provable sample-efficiency improvements, and our experiments demonstrate empirical improvements in both healthcare simulators and classical reinforcement learning environments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhou24c.html
https://proceedings.mlr.press/v238/zhou24c.htmlReward-Relevance-Filtered Linear Offline Reinforcement LearningThis paper studies offline reinforcement learning with linear function approximation in a setting with decision-theoretic, but not estimation sparsity. The structural restrictions of the data-generating process presume that the transitions factor into a sparse component that affects the reward and could affect additional exogenous dynamics that do not affect the reward. Although the minimally sufficient adjustment set for estimation of full-state transition properties depends on the whole state, the optimal policy and therefore state-action value function depends only on the sparse component: we call this causal/decision-theoretic sparsity. We develop a method for reward-filtering the estimation of the state-action value function to the sparse component by a modification of thresholded lasso in least-squares policy evaluation. We provide theoretical guarantees for our reward-filtered linear fitted-Q-iteration, with sample complexity depending only on the size of the sparse component.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhou24b.html
https://proceedings.mlr.press/v238/zhou24b.htmlOn the Theoretical Expressive Power and the Design Space of Higher-Order Graph TransformersGraph transformers have recently received significant attention in graph learning, partly due to their ability to capture more global interaction via self-attention. Nevertheless, while higher-order graph neural networks have been reasonably well studied, the exploration of extending graph transformers to higher-order variants is just starting. Both theoretical understanding and empirical results are limited. In this paper, we provide a systematic study of the theoretical expressive power of order-$k$ graph transformers and sparse variants. We first show that, an order-$k$ graph transformer without additional structural information is less expressive than the $k$-Weisfeiler Lehman ($k$-WL) test despite its high computational cost. We then explore strategies to both sparsify and enhance the higher-order graph transformers, aiming to improve both their efficiency and expressiveness. Indeed, sparsification based on neighborhood information can enhance the expressive power, as it provides additional information about input graph structures. In particular, we show that a natural neighborhood-based sparse order-$k$ transformer model is not only computationally efficient, but also expressive – as expressive as $k$-WL test. We further study several other sparse graph attention models that are computationally efficient and provide their expressiveness analysis. Finally, we provide experimental results to show the effectiveness of the different sparsification strategies.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhou24a.html
https://proceedings.mlr.press/v238/zhou24a.htmlAccelerating Approximate Thompson Sampling with Underdamped Langevin Monte CarloApproximate Thompson sampling with Langevin Monte Carlo broadens its reach from Gaussian posterior sampling to encompass more general smooth posteriors. However, it still encounters scalability issues in high-dimensional problems when demanding high accuracy. To address this, we propose an approximate Thompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where the latter is the go-to workhorse for simulations of high-dimensional posteriors. Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling using a specific potential function. This design improves the sample complexity for realizing logarithmic regrets from $\mathcal{\tilde O}(d)$ to $\mathcal{\tilde O}(\sqrt{d})$. The scalability and robustness of our algorithm are also empirically validated through synthetic experiments in high-dimensional bandit problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zheng24b.html
https://proceedings.mlr.press/v238/zheng24b.htmlBetter Batch for Deep Probabilistic Time Series ForecastingDeep probabilistic time series forecasting has gained attention for its ability to provide nonlinear approximation and valuable uncertainty quantification for decision-making. However, existing models often oversimplify the problem by assuming a time-independent error process and overlooking serial correlation. To overcome this limitation, we propose an innovative training method that incorporates error autocorrelation to enhance probabilistic forecasting accuracy. Our method constructs a mini-batch as a collection of D consecutive time series segments for model training. It explicitly learns a time-varying covariance matrix over each mini-batch, encoding error correlation among adjacent time steps. The learned covariance matrix can be used to improve prediction accuracy and enhance uncertainty quantification. We evaluate our method on two different neural forecasting models and multiple public datasets. Experimental results confirm the effectiveness of the proposed approach in improving the performance of both models across a range of datasets, resulting in notable improvements in predictive accuracy.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zheng24a.html
https://proceedings.mlr.press/v238/zheng24a.htmlSelf-Supervised Quantization-Aware Knowledge DistillationQuantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/SQAKD.git.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhao24d.html
https://proceedings.mlr.press/v238/zhao24d.htmlDHMConv: Directed Hypergraph Momentum Convolution FrameworkDue to its capability to capture high-order information, the hypergraph model has shown greater potential than the graph model in various scenarios. Real-world entity relations frequently involve directionality, in order to express high-order information while capturing directional information in relationships, we present a directed hypergraph spatial convolution framework that is designed to acquire vertex embeddings of directed hypergraphs. The framework characterizes the information propagation of directed hypergraphs through two stages: hyperedge information aggregation and hyperedge information broadcasting. During the hyperedge information aggregation stage, we optimize the acquisition of hyperedge information using attention mechanisms. In the hyperedge information broadcasting stage, we leverage a directed hypergraph momentum encoder to capture the directional information of directed hyperedges. Experimental results on five publicly available directed graph datasets of three different categories demonstrate that our proposed DHMConv outperforms various commonly used graph and hypergraph models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhao24c.html
https://proceedings.mlr.press/v238/zhao24c.htmlOn Feynman-Kac training of partial Bayesian neural networksRecently, partial Bayesian neural networks (pBNNs), which only consider a subset of the parameters to be stochastic, were shown to perform competitively with full Bayesian neural networks. However, pBNNs are often multi-modal in the latent variable space and thus challenging to approximate with parametric models. To address this problem, we propose an efficient sampling-based training strategy, wherein the training of a pBNN is formulated as simulating a Feynman-Kac model. We then describe variations of sequential Monte Carlo samplers that allow us to simultaneously estimate the parameters and the latent posterior distribution of this model at a tractable computational cost. Using various synthetic and real-world datasets we show that our proposed training scheme outperforms the state of the art in terms of predictive performance.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhao24b.html
https://proceedings.mlr.press/v238/zhao24b.htmlPositivity-free Policy Learning with Observational DataPolicy learning utilizing observational data is pivotal across various domains, with the objective of learning the optimal treatment assignment policy while adhering to specific constraints such as fairness, budget, and simplicity. This study introduces a novel positivity-free (stochastic) policy learning framework designed to address the challenges posed by the impracticality of the positivity assumption in real-world scenarios. This framework leverages incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments. We characterize these incremental propensity score policies and establish identification conditions, employing semiparametric efficiency theory to propose efficient estimators capable of achieving rapid convergence rates, even when integrated with advanced machine learning algorithms. This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework’s finite-sample performance through comprehensive numerical experiments, ensuring the identification of causal effects from observational data is both robust and reliable.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhao24a.html
https://proceedings.mlr.press/v238/zhao24a.htmlMulti-resolution Time-Series Transformer for Long-term ForecastingThe performance of transformers for time-series forecasting has improved significantly. Recent architectures learn complex temporal patterns by segmenting a time-series into patches and using the patches as tokens. The patch size controls the ability of transformers to learn the temporal patterns at different frequencies: shorter patches are effective for learning localized, high-frequency patterns, whereas mining long-term seasonalities and trends requires longer patches. Inspired by this observation, we propose a novel framework, Multi-resolution Time-Series Transformer (MTST), which consists of a multi-branch architecture for simultaneous modeling of diverse temporal patterns at different resolutions. In contrast to many existing time-series transformers, we employ relative positional encoding, which is better suited for extracting periodic components at different scales. Extensive experiments on several real-world datasets demonstrate the effectiveness of MTST in comparison to state-of-the-art forecasting techniques.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24l.html
https://proceedings.mlr.press/v238/zhang24l.htmlMembership Testing in Markov Equivalence Classes via Independence QueriesUnderstanding causal relationships between variables is a fundamental problem with broad impact in numerous scientific fields. While extensive research has been dedicated to \emph{learning} causal graphs from data, its complementary concept of \emph{testing} causal relationships has remained largely unexplored. While \emph{learning} involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, the \emph{testing} counterpart addresses the following critical question: \emph{Given a specific MEC and observational data from some causal graph, can we determine if the data-generating causal graph belongs to the given MEC?} We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\exp(\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\exp(O(s))$ tests, matching our lower bound. Compared to the \emph{learning} problem, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, this shows that \emph{testing} is relatively easier. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes. Additionally, using the DAG associahedron, we provide a geometric interpretation of testing versus learning and discuss how our testing result can aid learning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24k.html
https://proceedings.mlr.press/v238/zhang24k.htmlFast and Accurate Estimation of Low-Rank Matrices from Noisy Measurements via Preconditioned Non-Convex Gradient DescentNon-convex gradient descent is a common approach for estimating a low-rank $n\times n$ ground truth matrix from noisy measurements, because it has per-iteration costs as low as $O(n)$ time, and is in theory capable of converging to a minimax optimal estimate. However, the practitioner is often constrained to just tens to hundreds of iterations, and the slow and/or inconsistent convergence of non-convex gradient descent can prevent a high-quality estimate from being obtained. Recently, the technique of \emph{preconditioning} was shown to be highly effective at accelerating the local convergence of non-convex gradient descent when the measurements are noiseless. In this paper, we describe how preconditioning should be done for noisy measurements to accelerate local convergence to minimax optimality. For the symmetric matrix sensing problem, our proposed preconditioned method is guaranteed to locally converge to minimax error at a linear rate that is immune to ill-conditioning and/or over-parameterization. Using our proposed preconditioned method, we perform a 60 megapixel medical image denoising task, and observe significantly reduced noise levels compared to previous approaches.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24j.html
https://proceedings.mlr.press/v238/zhang24j.htmlFormal Verification of Unknown Stochastic Systems via Non-parametric EstimationA novel data-driven method for formal verification is proposed to study complex systems operating in safety-critical domains. The proposed approach is able to formally verify discrete-time stochastic dynamical systems against temporal logic specifications only using observation samples and without the knowledge of the model, and provide a probabilistic guarantee on the satisfaction of the specification. We first propose the theoretical results for using non-parametric estimation to estimate an asymptotic upper bound for the \emph{Lipschitz constant} of the stochastic system, which can determine a finite abstraction of the system. Our results prove that the asymptotic convergence rate of the estimation is $O(n^{-\frac{1}{3+d}})$, where $d$ is the dimension of the system and n is the data scale. We then construct interval Markov decision processes using two different data-driven methods, namely non-parametric estimation and empirical estimation of transition probabilities, to perform formal verification against a given temporal logic specification. Multiple case studies are presented to validate the effectiveness of the proposed methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24i.html
https://proceedings.mlr.press/v238/zhang24i.htmlDiscriminant Distance-Aware Representation on Deterministic Uncertainty Quantification MethodsUncertainty estimation is a crucial aspect of deploying dependable deep learning models in safety-critical systems. In this study, we introduce a novel and efficient method for deterministic uncertainty estimation called Discriminant Distance-Awareness Representation (DDAR). Our approach involves constructing a DNN model that incorporates a set of prototypes in its latent representations, enabling us to analyze valuable feature information from the input data. By leveraging a distinction maximization layer over optimal trainable prototypes, DDAR can learn a discriminant distance-awareness representation. We demonstrate that DDAR overcomes feature collapse by relaxing the Lipschitz constraint that hinders the practicality of deterministic uncertainty methods (DUMs) architectures. Our experiments show that DDAR is a flexible and architecture-agnostic method that can be easily integrated as a pluggable layer with distance-sensitive metrics, outperforming state-of-the-art uncertainty estimation methods on multiple benchmark problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24h.html
https://proceedings.mlr.press/v238/zhang24h.htmlMultivariate Time Series Forecasting By Graph Attention Networks With Theoretical GuaranteesMultivariate time series forecasting (MTSF) aims to predict future values of multiple variables based on past values of multivariate time series, and has been applied in fields including traffic flow prediction, stock price forecasting, and anomaly detection. Capturing the inter-dependencies among multiple series poses one significant challenge to MTSF. Recent works have considered modeling the correlated series as graph nodes and using graph neural network (GNN)-based approaches with attention mechanisms added to improve the test prediction accuracy, however, none of them have theoretical guarantees regarding the generalization error. In this paper, we develop a new norm-bounded graph attention network (GAT) for MTSF by upper-bounding the Frobenius norm of weights in each layer of the GAT model to enhance performance. We theoretically establish that the generalization error bound for our model is associated with various components of GAT models: the number of attention heads, the maximum number of neighbors, the upper bound of the Frobenius norm of the weight matrix in each layer, and the norm of the input features. Empirically, we investigate the impact of different components of GAT models on the generalization performance of MTSF on real data. Our experiment verifies our theoretical findings. We compare with multiple prior frequently cited graph-based methods for MTSF using real data sets and the experiment results show our method can achieve the best performance for MTSF. Our method provides novel perspectives for improving the generalization performance of MTSF, and our theoretical guarantees give substantial implications for designing graph-based methods with attention mechanisms for MTSF.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24g.html
https://proceedings.mlr.press/v238/zhang24g.htmlOnline Learning in Contextual Second-Price Pay-Per-Click AuctionsWe study online learning in contextual pay-per-click auctions where at each of the $T$ rounds, the learner receives some context along with a set of ads and needs to make an estimate on their click-through rate (CTR) in order to run a second-price pay-per-click auction. The learner’s goal is to minimize her regret, defined as the gap between her total revenue and that of an oracle strategy that always makes perfect CTR predictions. We first show that $\sqrt{T}$-regret is obtainable via a computationally inefficient algorithm and that it is unavoidable since our algorithm is no easier than the classical multi-armed bandit problem. A by-product of our results is a $\sqrt{T}$-regret bound for the simpler non-contextual setting, improving upon a recent work of [Feng et al., 2023] by removing the inverse CTR dependency that could be arbitrarily large. Then, borrowing ideas from recent advances on efficient contextual bandit algorithms, we develop two practically efficient contextual auction algorithms: the first one uses the exponential weight scheme with optimistic square errors and maintains the same $\sqrt{T}$-regret bound, while the second one reduces the problem to online regression via a simple epsilon-greedy strategy, albeit with a worse regret bound. Finally, we conduct experiments on a synthetic dataset to showcase the effectiveness and superior performance of our algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24f.html
https://proceedings.mlr.press/v238/zhang24f.htmlMulticlass Learning from Noisy Labels for Non-decomposable Performance MeasuresThere has been much interest in recent years in learning good classifiers from data with noisy labels. Most work on learning from noisy labels has focused on standard loss-based performance measures. However, many machine learning problems require using non-decomposable performance measures which cannot be expressed as the expectation or sum of a loss on individual examples; these include for example the H-mean, Q-mean and G-mean in class imbalance settings, and the Micro F1 in information retrieval. In this paper, we design algorithms to learn from noisy labels for two broad classes of multiclass non-decomposable performance measures, namely, monotonic convex and ratio-of-linear, which encompass all the above examples. Our work builds on the Frank-Wolfe and Bisection based methods of Narasimhan et al. (2015). In both cases, we develop noise-corrected versions of the algorithms under the widely studied family of class-conditional noise models. We provide regret (excess risk) bounds for our algorithms, establishing that even though they are trained on noisy data, they are Bayes consistent in the sense that their performance converges to the optimal performance w.r.t. the clean (non-noisy) distribution. Our experiments demonstrate the effectiveness of our algorithms in handling label noise.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24e.html
https://proceedings.mlr.press/v238/zhang24e.htmlRestricted Isometry Property of Rank-One Measurements with Random Unit-Modulus VectorsThe restricted isometry property (RIP) is essential for the linear map to guarantee the successful recovery of low-rank matrices. The existing works show that the linear map generated by the measurement matrices with independent and identically distributed (i.i.d.) entries satisfies RIP with high probability. However, when dealing with non-i.i.d. measurement matrices, such as the rank-one measurements, the RIP compliance may not be guaranteed. In this paper, we show that the RIP can still be achieved with high probability, when the rank-one measurement matrix is constructed by the random unit-modulus vectors. Compared to the existing works, we first address the challenge of establishing RIP for the linear map in non-i.i.d. scenarios. As validated in the experiments, this linear map is memory-efficient, and not only satisfies the RIP but also exhibits similar recovery performance of the low-rank matrices to that of conventional i.i.d. measurement matrices.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24d.html
https://proceedings.mlr.press/v238/zhang24d.htmlGeneralization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax OptimizationThis paper studies the generalization performance of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $\epsilon$-generalization are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension of the primal variable and $\kappa$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between stability and generalization. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) and the more general sampling-determined algorithms (SDA).Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24c.html
https://proceedings.mlr.press/v238/zhang24c.htmlOptimal Sparse Survival TreesInterpretability is crucial for doctors, hospitals, pharmaceutical companies and biotechnology corporations to analyze and make decisions for high stakes problems that involve human health. Tree-based methods have been widely adopted for survival analysis due to their appealing interpretablility and their ability to capture complex relationships. However, most existing methods to produce survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably-optimal sparse survival tree models, frequently in only a few seconds.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24b.html
https://proceedings.mlr.press/v238/zhang24b.htmlHintMiner: Automatic Question Hints Mining From Q&A Web Posts with Language Model via Self-Supervised LearningUsers often need ask questions and seek answers online. The Question - Answering (QA) forums such as Stack Overflow cannot always respond to the questions timely and properly. In this paper, we propose HintMiner, a novel automatic question hints mining tool for users to help them find answers. HintMiner leverages the machine comprehension and sequence generation techniques to automatically generate hints for users’ questions. It firstly retrieve many web Q&A posts and then extract some hints from the posts using MiningNet that is built via a language model. Using the huge amount of online Q&A posts, we design a self-supervised objective to train the MiningNet that is a neural encoder-decoder model based on the transformer and copying mechanisms. We have evaluated HintMiner on 60,000 Stack Overflow questions. The experiment results show that the proposed approach is effective. For example, HintMiner achieves an average BLEU score of 36.17% and an average ROUGE-2 score of 36.29%. Our tool and experimental data are publicly available at \url{https://github.com/zhangzhenyu13/HintMiner}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhang24a.html
https://proceedings.mlr.press/v238/zhang24a.htmlDeep Learning-Based Alternative Route ComputationAlgorithms for the computation of alternative routes in road networks power many geographic navigation systems. A good set of alternative routes offers meaningful options to the user of the system and can support applications such as routing that is robust to failures (e.g., road closures, extreme traffic congestion, etc.) and routing with diverse preferences and objective functions. Algorithmic techniques for alternative route computation include the penalty method, via-node type algorithms (which deploy bidirectional search and finding plateaus), and, more recently, electrical-circuit based algorithms. In this work we focus on the practically important family of via-node type algorithms and aim to produce high quality alternative routes for road networks using a novel deep learning-based approach that learns a representation of the underlying road network. We show that this approach can support natural objectives, such as the uniformly bounded stretch, that are difficult and computationally expensive to support through traditional algorithmic techniques. Moreover, we achieve this in a practical system based on the Customizable Route Planning (CRP) hierarchical routing architecture. Our training methodology uses the hierarchical partition of the graph and trains a model to predict which boundary nodes in the partition should be crossed by the alternative routes. We describe our methods in detail and evaluate them against previously studied baselines, showing quality improvements in the road networks of Seattle, Paris, and Bangalore.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhai24b.html
https://proceedings.mlr.press/v238/zhai24b.htmlLearning Sampling Policy to Achieve Fewer Queries for Zeroth-Order OptimizationZeroth-order (ZO) methods, which use the finite difference of two function evaluations (also called ZO gradient) to approximate first-order gradient, have attracted much attention recently in machine learning because of their broad applications. The accuracy of the ZO gradient highly depends on how many finite differences are averaged, which are intrinsically determined by the number of perturbations randomly drawn from a distribution. Existing ZO methods try to learn a data-driven distribution for sampling the perturbations to improve the efficiency of ZO optimization (ZOO) algorithms. In this paper, we explore a new and parallel direction, \textit{i.e.}, learn an optimal sampling policy instead of using a totally random strategy to generate perturbations based on the techniques of reinforcement learning (RL), which makes it possible to approximate the gradient with only two function evaluations. Specifically, we first formulate the problem of learning a sampling policy as a Markov decision process. Then, we propose our ZO-RL algorithm, \textit{i.e.}, using deep deterministic policy gradient, an actor-critic RL algorithm to learn a sampling policy that can guide the generation of perturbed vectors in getting ZO gradients as accurately as possible. Importantly, the existing ZOO algorithms for learning a distribution can be plugged in to improve the exploration of ZO-RL. Experimental results with different ZO estimators show that our ZO-RL algorithm can effectively reduce the query complexity of ZOO algorithms and converge faster than existing ZOO algorithms, especially in the later stage of the optimization process.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zhai24a.html
https://proceedings.mlr.press/v238/zhai24a.htmlSDMTR: A Brain-inspired Transformer for Relation InferenceDeep learning has seen a movement towards the concepts of modularity, module coordination and sparse interactions to fit the working principles of biological systems. Inspired by Global Workspace Theory and long-term memory system in human brain, both are instrumental in constructing biologically plausible artificial intelligence systems, we introduce the shared dual-memory Transformers (SDMTR)— a model that builds upon Transformers. The proposed approach includes the shared long-term memory and workspace with finite capacity in which different specialized modules compete to write information. Later, crucial information from shared workspace is inscribed into long-term memory through outer product attention mechanism to reduce information conflict and build a knowledge reservoir, thereby facilitating subsequent inference, learning and problem-solving. We apply SDMTR to multi-modality question-answering and reasoning challenges, including text-based bAbI-20k, visual Sort-of-CLEVR and triangle relations detection tasks. The results demonstrate that our SDMTR significantly outperforms the vanilla Transformer and its recent improvements. Additionally, visualization analyses indicate that the presence of memory positively correlates with model effectiveness on inference tasks. This research provides novel insights and empirical support to advance biologically plausible deep learning frameworks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zeng24a.html
https://proceedings.mlr.press/v238/zeng24a.htmlCommunication-Efficient Federated Learning With Data and Client HeterogeneityFederated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally. However, executing FL at scale comes with inherent practical challenges: 1) heterogeneity of the local node data distributions, 2) heterogeneity of node computational speeds (asynchrony), but also 3) constraints in the amount of communication between the clients and the server. In this work, we present the first variant of the classic federated averaging (FedAvg) algorithm which, at the same time, supports data heterogeneity, partial client asynchrony, and communication compression. Our algorithm comes with a novel, rigorous analysis showing that, in spite of these system relaxations, it can provide similar convergence to FedAvg in interesting parameter regimes. Experimental results in the rigorous LEAF benchmark on setups of up to $300$ nodes show that our algorithm ensures fast convergence for standard federated tasks, improving upon prior quantized and asynchronous approaches.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/zakerinia24a.html
https://proceedings.mlr.press/v238/zakerinia24a.htmlRiemannian Laplace Approximation with the Fisher MetricLaplace’s method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yu24a.html
https://proceedings.mlr.press/v238/yu24a.htmlExplanation-based Training with Differentiable Insertion/Deletion Metric-aware RegularizersThe quality of explanations for the predictions made by complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how accurately the explanations reflect the predictor’s behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while maintaining their predictive accuracy. Because the original insertion and deletion metrics are non-differentiable with respect to the explanations and directly unavailable for gradient-based optimization, we extend the metrics so that they are differentiable and use them to formalize insertion and deletion metric-based regularizers. Our experimental results on image and tabular datasets show that the deep neural network-based predictors that are fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easier-to-interpret explanations while maintaining high predictive accuracy. The code is available at https://github.com/yuyay/idexpo.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yoshikawa24a.html
https://proceedings.mlr.press/v238/yoshikawa24a.htmlGraph Machine Learning through the Lens of Bilevel OptimizationBilevel optimization refers to scenarios whereby the optimal solution of a lower-level energy function serves as input features to an upper-level objective of interest. These optimal features typically depend on tunable parameters of the lower-level energy in such a way that the entire bilevel pipeline can be trained end-to-end. Although not generally presented as such, this paper demonstrates how a variety of graph learning techniques can be recast as special cases of bilevel optimization or simplifications thereof. In brief, building on prior work we first derive a more flexible class of energy functions that, when paired with various descent steps (e.g., gradient descent, proximal methods, momentum, etc.), form graph neural network (GNN) message-passing layers; critically, we also carefully unpack where any residual approximation error lies with respect to the underlying constituent message-passing functions. We then probe several simplifications of this framework to derive close connections with non-GNN-based graph learning approaches, including knowledge graph embeddings, various forms of label propagation, and efficient graph-regularized MLP models. And finally, we present supporting empirical results that demonstrate the versatility of the proposed bilevel lens, which we refer to as BloomGML, referencing that BiLevel Optimization Offers More Graph Machine Learning. Our code is available at \url{https://github.com/amberyzheng/BloomGML}. Let graph ML bloom.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yijia-zheng24a.html
https://proceedings.mlr.press/v238/yijia-zheng24a.htmlFilter, Rank, and Prune: Learning Linear Cyclic Gaussian Graphical ModelsCausal structures in the real world often exhibit cycles naturally due to equilibrium, homeostasis, or feedback. However, causal discovery from observational studies regarding cyclic models has not been investigated extensively because the underlying structure of a linear cyclic structural equation model (SEM) cannot be determined solely from observational data. Inspired by the Bayesian information Criterion (BIC), we construct a score function that assesses both accuracy and sparsity of the structure to determine which linear Gaussian SEM is the best when only observational data is given. Then, we formulate a causal discovery problem as an optimization problem of the measure and propose the Filter, Rank, and Prune (FRP) method for solving it. We empirically demonstrate that our method outperforms competitive cyclic causal discovery baselines.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yi24a.html
https://proceedings.mlr.press/v238/yi24a.htmlSmoothness-Adaptive Dynamic Pricing with Nonparametric Demand LearningWe study the dynamic pricing problem where the demand function is nonparametric and Hölder smooth, and we focus on adaptivity to the unknown Hölder smoothness parameter $\beta$ of the demand function. Traditionally the optimal dynamic pricing algorithm heavily relies on the knowledge of $\beta$ to achieve a minimax optimal regret of $\widetilde{O}(T^{\frac{\beta+1}{2\beta+1}})$. However, we highlight the challenge of adaptivity in this dynamic pricing problem by proving that no pricing policy can adaptively achieve this minimax optimal regret without knowledge of $\beta$. Motivated by the impossibility result, we propose a self-similarity condition to enable adaptivity. Importantly, we show that the self-similarity condition does not compromise the problem’s inherent complexity since it preserves the regret lower bound $\Omega(T^{\frac{\beta+1}{2\beta+1}})$. Furthermore, we develop a smoothness-adaptive dynamic pricing algorithm and theoretically prove that the algorithm achieves this minimax optimal regret bound without the prior knowledge $\beta$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ye24b.html
https://proceedings.mlr.press/v238/ye24b.htmlEnhancing Hypergradients Estimation: A Study of Preconditioning and ReparameterizationBilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to compute the so-called hypergradient of the outer problem is to use the Implicit Function Theorem (IFT). As a function of the error of the inner problem resolution, we study the error of the IFT method. We analyze two strategies to reduce this error: preconditioning the IFT formula and reparameterizing the inner problem. We give a detailed account of the impact of these two modifications on the error, highlighting the role played by higher-order derivatives of the functionals at stake. Our theoretical findings explain when super efficiency, namely reaching an error on the hypergradient that depends quadratically on the error on the inner problem, is achievable and compare the two approaches when this is impossible. Numerical evaluations on hyperparameter tuning for regression problems substantiate our theoretical findings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ye24a.html
https://proceedings.mlr.press/v238/ye24a.htmlSimulation-Based StackingSimulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a consistency guarantee, we present a general posterior stacking framework to make use of all available approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias of the posterior approximation at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yao24b.html
https://proceedings.mlr.press/v238/yao24b.htmlMinimizing Convex Functionals over Space of Probability Measures via KL Divergence Gradient FlowMotivated by the computation of the non-parametric maximum likelihood estimator (NPMLE) and the Bayesian posterior in statistics, this paper explores the problem of convex optimization over the space of all probability distributions. We introduce an implicit scheme, called the implicit KL proximal descent (IKLPD) algorithm, for discretizing a continuous-time gradient flow relative to the Kullback–Leibler (KL) divergence for minimizing a convex target functional. We show that IKLPD converges to a global optimum at a polynomial rate from any initialization; moreover, if the objective functional is strongly convex relative to the KL divergence, for example, when the target functional itself is a KL divergence as in the context of Bayesian posterior computation, IKLPD exhibits globally exponential convergence. Computationally, we propose a numerical method based on normalizing flow to realize IKLPD. Conversely, our numerical method can also be viewed as a new approach that sequentially trains a normalizing flow for minimizing a convex functional with a strong theoretical guarantee.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yao24a.html
https://proceedings.mlr.press/v238/yao24a.htmlHodge-Compositional Edge Gaussian ProcessesWe propose principled Gaussian processes (GPs) for modeling functions defined over the edge set of a simplicial 2-complex, a structure similar to a graph in which edges may form triangular faces. This approach is intended for learning flow-type data on networks where edge flows can be characterized by the discrete divergence and curl. Drawing upon the Hodge decomposition, we first develop classes of divergence-free and curl-free edge GPs, suitable for various applications. We then combine them to create Hodge-compositional edge GPs that are expressive enough to represent any edge function. These GPs facilitate direct and independent learning for the different Hodge components of edge functions, enabling us to capture their relevance during hyperparameter optimization. To highlight their practical potential, we apply them for flow data inference in currency exchange, ocean currents and water supply networks, comparing them to alternative models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yang24e.html
https://proceedings.mlr.press/v238/yang24e.htmlLearning Unknown Intervention Targets in Structural Causal Models from Heterogeneous DataWe study the problem of identifying the unknown intervention targets in structural causal models where we have access to heterogeneous data collected from multiple environments. The unknown intervention targets are the set of endogenous variables whose corresponding exogenous noises change across the environments. We propose a two-phase approach which in the first phase recovers the exogenous noises corresponding to unknown intervention targets whose distributions have changed across environments. In the second phase, the recovered noises are matched with the corresponding endogenous variables. For the recovery phase, we provide sufficient conditions for learning these exogenous noises up to some component-wise invertible transformation. For the matching phase, under the causal sufficiency assumption, we show that the proposed method uniquely identifies the intervention targets. In the presence of latent confounders, the intervention targets among the observed variables cannot be determined uniquely. We provide a candidate intervention target set which is a superset of the true intervention targets. Our approach improves upon the state of the art as the returned candidate set is always a subset of the target set returned by previous work. Moreover, we do not require restrictive assumptions such as linearity of the causal model or performing invariance tests to learn whether a distribution is changing across environments which could be highly sample inefficient. Our experimental results show the effectiveness of our proposed algorithm in practice.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yang24d.html
https://proceedings.mlr.press/v238/yang24d.htmlIdentifying Spurious Biases Early in Training through the Lens of Simplicity BiasNeural networks trained with (stochastic) gradient descent have an inductive bias towards learning simpler solutions. This makes them highly prone to learning spurious correlations in the training data, that may not hold at test time. In this work, we provide the first theoretical analysis of the effect of simplicity bias on learning spurious correlations. Notably, we show that examples with spurious features are provably separable based on the model’s output early in training. We further illustrate that if spurious features have a small enough noise-to-signal ratio, the network’s output on majority of examples is almost exclusively determined by the spurious features, leading to poor worst-group test accuracy. Finally, we propose SPARE, which identifies spurious correlations early in training, and utilizes importance sampling to alleviate their effect. Empirically, we demonstrate that SPARE outperforms state-of-the-art methods by up to 21.1% in worst-group accuracy, while being up to 12x faster. We also show the applicability of SPARE, as a highly effective but lightweight method, to discover spurious correlations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yang24c.html
https://proceedings.mlr.press/v238/yang24c.htmlOrthogonal Gradient Boosting for Simpler Additive Rule EnsemblesGradient boosting of prediction rules is an efficient approach to learn potentially interpretable yet accurate probabilistic models. However, actual interpretability requires to limit the number and size of the generated rules, and existing boosting variants are not designed for this purpose. Though corrective boosting refits all rule weights in each iteration to minimise prediction risk, the included rule conditions tend to be sub-optimal, because commonly used objective functions fail to anticipate this refitting. Here, we address this issue by a new objective function that measures the angle between the risk gradient vector and the projection of the condition output vector onto the orthogonal complement of the already selected conditions. This approach correctly approximates the ideal update of adding the risk gradient itself to the model and favours the inclusion of more general and thus shorter rules. As we demonstrate using a wide range of prediction tasks, this significantly improves the comprehensibility/accuracy trade-off of the fitted ensemble. Additionally, we show how objective values for related rule conditions can be computed incrementally to avoid any substantial computational overhead of the new method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yang24b.html
https://proceedings.mlr.press/v238/yang24b.htmlNeural McKean-Vlasov Processes: Distributional Dependence in Diffusion ProcessesMcKean-Vlasov stochastic differential equations (MV-SDEs) provide a mathematical description of the behavior of an infinite number of interacting particles by imposing a dependence on the particle density. We study the influence of explicitly including distributional information in the parameterization of the SDE. We propose a series of semi-parametric methods for representing MV-SDEs, and corresponding estimators for inferring parameters from data based on the properties of the MV-SDE. We analyze the characteristics of the different architectures and estimators, and consider their applicability in relevant machine learning problems. We empirically compare the performance of the different architectures and estimators on real and synthetic datasets for time series and probabilistic modeling. The results suggest that explicitly including distributional dependence in the parameterization of the SDE is effective in modeling temporal data with interaction under an exchangeability assumption while maintaining strong performance for standard Itô-SDEs due to the richer class of probability flows associated with MV-SDEs.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yang24a.html
https://proceedings.mlr.press/v238/yang24a.htmlCausal Bandits with General Causal Models and InterventionsThis paper considers causal bandits (CBs) for the sequential design of interventions in a causal system. The objective is to optimize a reward function via minimizing a measure of cumulative regret with respect to the best sequence of interventions in hindsight. The paper advances the results on CBs in three directions. First, the structural causal models (SCMs) are assumed to be unknown and drawn arbitrarily from a general class $\mathcal{F}$ of Lipschitz-continuous functions. Existing results are often focused on (generalized) linear SCMs. Second, the interventions are assumed to be generalized soft with any desired level of granularity, resulting in an infinite number of possible interventions. The existing literature, in contrast, generally adopts atomic and hard interventions. Third, we provide general upper and lower bounds on regret. The upper bounds subsume (and improve) known bounds for special cases. The lower bounds are generally hitherto unknown. These bounds are characterized as functions of the (i) graph parameters, (ii) eluder dimension of the space of SCMs, denoted by $\mathrm{dim}(\mathcal{F})$, and (iii) the covering number of the function space, denoted by $\mathrm{cn}(\mathcal{F})$. Specifically, the cumulative achievable regret over horizon $T$ is $\mathcal{O}(K d^{L-1}\sqrt{T\,\mathrm{dim}(\mathcal{F}) \log(\mathrm{cn}(\mathcal{F}))})$, where $K$ is related to the Lipschitz constants, $d$ is the graph’s maximum in-degree, and $L$ is the length of the longest causal path. The upper bound is further refined for special classes of SCMs (neural network, polynomial, and linear), and their corresponding lower bounds are provided.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yan24a.html
https://proceedings.mlr.press/v238/yan24a.htmlLearning Fair Division from Bandit FeedbackThis work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents’ values or utilities. Departing from conventional online algorithms, the planner here relies on noisy, estimated values obtained after allocating items. We introduce wrapper algorithms utilizing dual averaging, enabling gradual learning of both the type distribution of arriving items and agents’ values through bandit feedback. This approach enables the algorithms to asymptotically achieve optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We also empirically verify the performance of the proposed algorithms across synthetic and empirical datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/yamada24a.html
https://proceedings.mlr.press/v238/yamada24a.htmlBLIS-Net: Classifying and Analyzing Signals on GraphsGraph neural networks (GNNs) have emerged as a powerful tool for tasks such as node classification and graph classification. However, much less work has been done on signal classification, where the data consists of many functions (referred to as signals) defined on the vertices of a single graph. These tasks require networks designed differently from those designed for traditional GNN tasks. Indeed, traditional GNNs rely on localized low-pass filters, and signals of interest may have intricate multi-frequency behavior and exhibit long range interactions. This motivates us to introduce the BLIS-Net (Bi-Lipschitz Scattering Net), a novel GNN that builds on the previously introduced geometric scattering transform. Our network is able to capture both local and global signal structure and is able to capture both low-frequency and high-frequency information. We make several crucial changes to the original geometric scattering architecture which we prove increase the ability of our network to capture information about the input signal and show that BLIS-Net achieves superior performance on both synthetic and real-world data sets based on traffic flow and fMRI data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xu24c.html
https://proceedings.mlr.press/v238/xu24c.htmlUncertainty-aware Continuous Implicit Neural Representations for Remote Sensing Object CountingMany existing object counting methods rely on density map estimation (DME) of the discrete grid representation by decoding extracted image semantic features from designed convolutional neural networks (CNNs). Relying on discrete density maps not only leads to information loss dependent on the original image resolution, but also has a scalability issue when analyzing high-resolution images with cubically increasing memory complexity. Furthermore, none of the existing methods can offer reliable uncertainty quantification (UQ) for the derived count estimates. To overcome these limitations, we design UNcertainty-aware, hypernetwork-based Implicit neural representations for Counting (UNIC) to assign probabilities and the corresponding counting confidence over continuous spatial coordinates. We derive a sampling-based Bayesian counting loss function and develop the corresponding model training algorithm. UNIC outperforms existing methods on the Remote Sensing Object Counting (RSOC) dataset with reliable UQ and improved interpretability of the derived count estimates. Our code is available at https://github.com/SiyuanXu-tamu/UNIC.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xu24b.html
https://proceedings.mlr.press/v238/xu24b.htmlOnline multiple testing with e-valuesA scientist tests a continuous stream of hypotheses over time in the course of her investigation — she does not test a predetermined, fixed number of hypotheses. The scientist wishes to make as many discoveries as possible while ensuring the number of false discoveries is controlled — a well recognized way for accomplishing this is to control the false discovery rate (FDR). Prior methods for FDR control in the online setting have focused on formulating algorithms when specific dependency structures are assumed to exist between the test statistics of each hypothesis. However, in practice, these dependencies often cannot be known beforehand or tested after the fact. Our algorithm, e-LOND, provides FDR control under arbitrary, possibly unknown, dependence. We show that our method is more powerful than existing approaches to this problem through simulations. We also formulate extensions of this algorithm to utilize randomization for increased power and for constructing confidence intervals in online selective inference.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xu24a.html
https://proceedings.mlr.press/v238/xu24a.htmlA/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarityWe investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\left\lbrace\theta_t\right\rbrace_{t=1}^{T}$, an algorithm will aim to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}}x^\top\sum_{t=1}^{T}\theta_t$ with probability as high as possible. Prior work has addressed the stationary setting where $\theta_t = \theta_1$ for all $t$ and demonstrated that the error probability decreases as $\exp(-T /\rho^*)$ for a problem-dependent constant $\rho^*$. But in many real-world $A/B/n$ multivariate testing scenarios that motivate our work, the environment is non-stationary and an algorithm expecting a stationary setting can easily fail. For robust identification, it is well-known that if arms are chosen randomly and non-adaptively from a G-optimal design over $\mathcal{X}$ at each time then the error probability decreases as $\exp(-T\Delta^2_{(1)}/d)$, where $\Delta_{(1)} = \min_{x \neq x^*} (x^* - x)^\top \frac{1}{T}\sum_{t=1}^T \theta_t$. As there exist environments where $\Delta_{(1)}^2/ d \ll 1/ \rho^*$, we are motivated to propose a novel algorithm P1-RAGE that aims to obtain the best of both worlds: robustness to non-stationarity and fast rates of identification in benign settings. We characterize the error probability of P1-RAGE and demonstrate empirically that the algorithm indeed never performs worse than G-optimal design but compares favorably to the best algorithms in the stationary setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xiong24a.html
https://proceedings.mlr.press/v238/xiong24a.htmlBetter Representations via Adversarial Training in Pre-Training: A Theoretical PerspectivePre-training is known to generate universal representations for downstream tasks in large-scale deep learning such as large language models. Existing literature, e.g., Kim et al. (2020), empirically observe that the downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justifications for this robustness inheritance phenomenon. Our theoretical results reveal that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks in two-layer neural networks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) feature; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation is valid for both supervised pre-training and contrastive learning. With purified nodes, it turns out that clean training is enough to achieve adversarial robustness in downstream tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xing24a.html
https://proceedings.mlr.press/v238/xing24a.htmlDistributionally Robust Quickest Change Detection using Wasserstein Uncertainty SetsThe problem of quickest detection of a change in the distribution of streaming data is considered. It is assumed that the pre-change distribution is known, while the only information about the post-change is through a (small) set of labeled data. This post-change data is used in a data-driven minimax robust framework, where an uncertainty set for the post-change distribution is constructed. The robust change detection problem is studied in an asymptotic setting where the mean time to false alarm goes to infinity. It is shown that the least favorable distribution (LFD) is an exponentially tilted version of the pre-change density and can be obtained efficiently. A Cumulative Sum (CuSum) test based on the LFD, which is referred to as the distributionally robust (DR) CuSum test, is then shown to be asymptotically robust. The results are extended to the case with multiple post-change uncertainty sets and validated using synthetic and real data examples.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xie24a.html
https://proceedings.mlr.press/v238/xie24a.htmlA Neural Architecture Predictor based on GNN-Enhanced TransformerNeural architecture performance predictor is an efficient approach for architecture estimation in Neural Architecture Search (NAS). However, existing predictors based on Graph Neural Networks (GNNs) are deficient in modeling long-range interactions between operation nodes and prone to the problem of over-smoothing, which limits their ability to learn neural architecture representation. Furthermore, some Transformer-based predictors use simple position encodings to improve performance via self-attention mechanism, but they fail to fully exploit the subgraph structure information of the graph. To solve this problem, we propose a novel method to enhance the graph representation of neural architectures by combining GNNs and Transformer blocks. We evaluate the effectiveness of our predictor on NAS-Bench-101 and NAS-bench-201 benchmarks, the discovered architecture on DARTS search space achieves an accuracy of 97.61% on CIFAR-10 dataset, which outperforms traditional position encoding methods such as adjacency and Laplacian matrices. The code of our work is available at \url{https://github.com/GNET}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/xiang24a.html
https://proceedings.mlr.press/v238/xiang24a.htmlSurrogate Active Subspaces for Jump-Discontinuous FunctionsSurrogate modeling and active subspaces have emerged as powerful paradigms in computational science and engineering. Porting such techniques to computational models in the social sciences brings into sharp relief their limitations in dealing with discontinuous simulators, such as Agent-Based Models, which have discrete outputs. Nevertheless, prior applied work has shown that surrogate estimates of active subspaces for such estimators can yield interesting results. But given that active subspaces are defined by way of gradients, it is not clear what quantity is being estimated when this methodology is applied to a discontinuous simulator. We begin this article by showing some pathologies that can arise when conducting such an analysis. This motivates an extension of active subspaces to discontinuous functions, clarifying what is actually being estimated in such analyses. We also conduct numerical experiments on synthetic test functions to compare Gaussian process estimates of active subspaces on continuous and discontinuous functions. Finally, we deploy our methodology on Flee, an agent-based model of refugee movement, yielding novel insights into which parameters of the simulation are most important across 8 displacement crises in Africa and the Middle East.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wycoff24a.html
https://proceedings.mlr.press/v238/wycoff24a.htmlUnsupervised Change Point Detection in Multivariate Time SeriesWe consider the challenging problem of unsupervised change point detection in multivariate time series when the number of change points is unknown. Our method eliminates the user’s need for careful parameter tuning, enhancing its practicality and usability. Our approach identifies time series segments with similar empirically estimated distributions, coupled with a novel greedy algorithm guided by the minimum description length principle. We provide theoretical guarantees and, through experiments on synthetic and real-world data, provide empirical evidence for its improved performance in identifying meaningful change points in practical settings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24g.html
https://proceedings.mlr.press/v238/wu24g.htmlOn the estimation of persistence intensity functions and linear representations of persistence diagramsPersistence diagrams are one of the most popular types of data summaries used in Topological Data Analysis. The prevailing statistical approach to analyzing persistence diagrams is concerned with filtering out topological noise. In this paper, we adopt a different viewpoint and aim at estimating the actual distribution of a random persistence diagram, which captures both topological signal and noise. To that effect, Chazel et al., (2018) proved that, under general conditions, the expected value of a random persistence diagram is a measure admitting a Lebesgue density, called the persistence intensity function. In this paper, we are concerned with estimating the persistence intensity function and a novel, normalized version of it – called the persistence density function. We present a class of kernel-based estimators based on an i.i.d. sample of persistence diagrams and derive estimation rates in the supremum norm. As a direct corollary, we obtain uniform consistency rates for estimating linear representations of persistence diagrams, including Betti numbers and persistence surfaces. Interestingly, the persistence density function delivers stronger statistical guarantees.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24f.html
https://proceedings.mlr.press/v238/wu24f.htmlPosterior Uncertainty Quantification in Neural Networks using Data AugmentationIn this paper, we approach the problem of uncertainty quantification in deep learning through a predictive framework, which captures uncertainty in model parameters by specifying our assumptions about the predictive distribution of unseen future data. Under this view, we show that deep ensembling (Lakshminarayanan et al., 2017) is a fundamentally mis-specified model class, since it assumes that future data are supported on existing observations only—a situation rarely encountered in practice. To address this limitation, we propose MixupMP, a method that constructs a more realistic predictive distribution using popular data augmentation techniques. MixupMP operates as a drop-in replacement for deep ensembles, where each ensemble member is trained on a random simulation from this predictive distribution. Grounded in the recently-proposed framework of Martingale posteriors (Fong et al., 2023), MixupMP returns samples from an implicitly defined Bayesian posterior. Our empirical analysis showcases that MixupMP achieves superior predictive per- formance and uncertainty quantification on various image classification datasets, when compared with existing Bayesian and non-Bayesian approaches.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24e.html
https://proceedings.mlr.press/v238/wu24e.htmlLarge-Scale Gaussian Processes via Alternating ProjectionTraining and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24d.html
https://proceedings.mlr.press/v238/wu24d.htmlRobust Data Clustering with Outliers via Transformed Tensor Low-Rank RepresentationRecently, tensor low-rank representation (TLRR) has become a popular tool for tensor data recovery and clustering, due to its empirical success and theoretical guarantees. However, existing TLRR methods consider Gaussian or gross sparse noise, inevitably leading to performance degradation when the tensor data are contaminated by outliers or sample-specific corruptions. This paper develops an outlier-robust tensor low-rank representation (OR-TLRR) method that provides outlier detection and tensor data clustering simultaneously based on the t-SVD framework. For tensor observations with arbitrary outlier corruptions, OR-TLRR has provable performance guarantee for exactly recovering the row space of clean data and detecting outliers under mild conditions. Moreover, an extension of OR-TLRR is proposed to handle the case when parts of the data are missing. Finally, extensive experimental results on synthetic and real data demonstrate the effectiveness of the proposed algorithms. We release our code at \url{https://github.com/twugithub/2024-AISTATS-ORTLRR}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24c.html
https://proceedings.mlr.press/v238/wu24c.htmlData-Adaptive Probabilistic Likelihood Approximation for Ordinary Differential EquationsEstimating the parameters of ordinary differential equations (ODEs) is of fundamental importance in many scientific applications. While ODEs are typically approximated with deterministic algorithms, new research on probabilistic solvers indicates that they produce more reliable parameter estimates by better accounting for numerical errors. However, many ODE systems are highly sensitive to their parameter values. This produces deep local maxima in the likelihood function – a problem which existing probabilistic solvers have yet to resolve. Here we present a novel probabilistic ODE likelihood approximation, DALTON, which can dramatically reduce parameter sensitivity by learning from noisy ODE measurements in a data-adaptive manner. Our approximation scales linearly in both ODE variables and time discretization points, and is applicable to ODEs with both partially-unobserved components and non-Gaussian measurement models. Several examples demonstrate that DALTON produces more accurate parameter estimates via numerical optimization than existing probabilistic ODE solvers, and even in some cases than the exact ODE likelihood itself.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24b.html
https://proceedings.mlr.press/v238/wu24b.htmlLearning Granger Causality from Instance-wise Self-attentive Hawkes ProcessesWe address the problem of learning Granger causality from asynchronous, interdependent, multi-type event sequences. In particular, we are interested in discovering instance-level causal structures in an unsupervised manner. Instance-level causality identifies causal relationships among individual events, providing more fine-grained information for decision-making. Existing work in the literature either requires strong assumptions, such as linearity in the intensity function, or heuristically defined model parameters that do not necessarily meet the requirements of Granger causality. We propose Instance-wise Self-Attentive Hawkes Processes (ISAHP), a novel deep learning framework that can directly infer the Granger causality at the event instance level. ISAHP is the first neural point process model that meets the requirements of Granger causality. It leverages the self-attention mechanism of the transformer to align with the principles of Granger causality. We empirically demonstrate that ISAHP is capable of discovering complex instance-level causal structures that cannot be handled by classical models. We also show that ISAHP achieves state-of-the-art performance in proxy tasks involving type-level causal discovery and instance-level event type prediction.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wu24a.html
https://proceedings.mlr.press/v238/wu24a.htmlAn Analytic Solution to Covariance Propagation in Neural NetworksUncertainty quantification of neural networks is critical to measuring the reliability and robustness of deep learning systems. However, this often involves costly or inaccurate sampling methods and approximations. This paper presents a sample-free moment propagation technique that propagates mean vectors and covariance matrices across a network to accurately characterize the input-output distributions of neural networks. A key enabler of our technique is an analytic solution for the covariance of random variables passed through nonlinear activation functions, such as Heaviside, ReLU, and GELU. The wide applicability and merits of the proposed technique are shown in experiments analyzing the input-output distributions of trained neural networks and training Bayesian neural networks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wright24a.html
https://proceedings.mlr.press/v238/wright24a.htmlQuantized Fourier and Polynomial Features for more Expressive Tensor Network ModelsIn the context of kernel machines, polynomial and Fourier features are commonly used to provide a nonlinear extension to linear models by mapping the data to a higher-dimensional space. Unless one considers the dual formulation of the learning problem, which renders exact large-scale learning unfeasible, the exponential increase of model parameters in the dimensionality of the data caused by their tensor-product structure prohibits to tackle high-dimensional problems. One of the possible approaches to circumvent this exponential scaling is to exploit the tensor structure present in the features by constraining the model weights to be an underparametrized tensor network. In this paper we quantize, i.e. further tensorize, polynomial and Fourier features. Based on this feature quantization we propose to quantize the associated model weights, yielding quantized models. We show that, for the same number of model parameters, the resulting quantized models have a higher bound on the VC-dimension as opposed to their non-quantized counterparts, at no additional computational cost while learning from identical features. We verify experimentally how this additional tensorization regularizes the learning problem by prioritizing the most salient features in the data and how it provides models with increased generalization capabilities. We finally benchmark our approach on large regression task, achieving state-of-the-art results on a laptop computer.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wesel24a.html
https://proceedings.mlr.press/v238/wesel24a.htmlTensor-view Topological Graph Neural NetworkGraph classification is an important learning task for graph-structured data. Graph neural networks (GNNs) have recently gained growing attention in graph learning and shown significant improvements on many important graph problems. Despite their state-of-the-art performances, existing GNNs only use local information from a very limited neighborhood around each node, suffering from loss of multi-modal information and overheads of excessive computation. To address these issues, we propose a novel Tensor-view Topological Graph Neural Network (TTG-NN), a class of simple yet effective topological deep learning built upon persistent homology, graph convolution, and tensor operations. This new method incorporates tensor learning to simultaneously capture {\it Tensor-view Topological} (TT), as well as Tensor-view Graph (TG) structural information on both local and global levels. Computationally, to fully exploit graph topology and structure, we propose two flexible TT and TG representation learning modules which disentangles feature tensor aggregation and transformation, and learns to preserve multi-modal structure with less computation. Theoretically, we derive high probability bounds on both the out-of-sample and in-sample mean squared approximation errors for our proposed Tensor Transformation Layer (TTL). Real data experiments show that the proposed TTG-NN outperforms 20 state-of-the-art methods on various graph benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wen24a.html
https://proceedings.mlr.press/v238/wen24a.htmlFast Fourier Bayesian QuadratureIn numerical integration, Bayesian quadrature (BQ) excels at producing estimates with quantified uncertainties, particularly in sparse data settings. However, its computational scalability and kernel learning capabilities have lagged behind modern advances in Gaussian process research. To bridge this gap, we recast the BQ posterior integral as a convolution operation, which enables efficient computation via fast Fourier transform of low-rank matrices. We introduce two new methods enabled by recasting BQ as a convolution: fast Fourier Bayesian quadrature and sparse spectrum Bayesian quadrature. These methods enhance the computational scalability of BQ and expand kernel flexibility, enabling the use of \textit{any} stationary kernel in the BQ setting. We empirically validate the efficacy of our approach through a range of integration tasks, substantiating the benefits of the proposed methodology.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/warren24a.html
https://proceedings.mlr.press/v238/warren24a.htmlNear-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and GeneralizationWe study the generalization capability of nearly-interpolating linear regressors: ${\beta}$’s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix ${\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, ${\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|{{\beta}}\|_{2}^{2}] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the exponent of the eigendecay, i.e., $\lambda_i({\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24l.html
https://proceedings.mlr.press/v238/wang24l.htmlOn cyclical MCMC samplingCyclical MCMC is a novel MCMC framework recently proposed by Zhang et al. (2019) to address the challenge posed by high- dimensional multimodal posterior distributions like those arising in deep learning. The algorithm works by generating a nonhomogeneous Markov chain that tracks —cyclically in time— tempered versions of the target distribution. We show in this work that cyclical MCMC converges to the desired probability distribution in settings where the Markov kernels used are fast mixing, and sufficiently long cycles are employed. However in the far more common settings of slow mixing kernels, the algorithm may fail to produce samples from the desired distribution. In particular, in a simple mixture example with unequal variance we show by simulation that cyclical MCMC fails to converge to the desired limit. Finally, we show that cyclical MCMC typically estimates well the local shape of the target distribution around each mode, even when we do not have convergence to the target.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24k.html
https://proceedings.mlr.press/v238/wang24k.htmlOn the Effect of Key Factors in Spurious Correlation: A theoretical PerspectiveSpurious correlations arise when irrelevant patterns in input data are mistakenly associated with labels, compromising the generalizability of machine learning models. While these models may be confident during the training stage, they often falter in real-world testing scenarios due to the shift of these misleading correlations. Current solutions to this problem typically involve altering the correlations or regularizing latent representations. However, while these methods show promise in experiments, a rigorous theoretical understanding of their effectiveness and the underlying factors of spurious correlations is lacking. In this work, we provide a comprehensive theoretical analysis, supported by empirical evidence, to understand the intricacies of spurious correlations. Drawing on our proposed theorems, we investigate the behaviors of classifiers when confronted with spurious features, and present our findings on how various factors influence these correlations and their impact on model performances, including the Mahalanobis distance of groups, and training/testing spurious correlation ratios. Additionally, by aligning empirical outcomes with our theoretical discoveries, we highlight the feasibility of assessing the degree of separability of intertwined real-world features. This research paves the way for a nuanced comprehension of spurious correlations, laying a solid theoretical groundwork that promises to steer future endeavors toward crafting more potent mitigation techniques.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24j.html
https://proceedings.mlr.press/v238/wang24j.htmlOptimal estimation of Gaussian (poly)treesWe develop optimal algorithms for learning undirected Gaussian trees and directed Gaussian polytrees from data. We consider both problems of distribution learning (i.e. in KL distance) and structure learning (i.e. exact recovery). The first approach is based on the Chow-Liu algorithm, and learns an optimal tree-structured distribution efficiently. The second approach is a modification of the PC algorithm for polytrees that uses partial correlation as a conditional independence tester for constraint-based structure learning. We derive explicit finite-sample guarantees for both approaches, and show that both approaches are optimal by deriving matching lower bounds. Additionally, we conduct numerical experiments to compare the performance of various algorithms, providing further insights and empirical evidence.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24i.html
https://proceedings.mlr.press/v238/wang24i.htmlDon’t Be Pessimistic Too Early: Look K Steps Ahead!Offline reinforcement learning (RL) considers to train highly rewarding policies exclusively from existing data, showing great real-world impacts. Pessimism, \emph{i.e.}, avoiding uncertain states or actions during decision making, has long been the main theme for offline RL. However, existing works often lead to overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of \emph{lookahead pessimism} within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle asks to terminate whenever the RL agent reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model, and to make a move by properties of the lookahead set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis on the performance lower bound, which monotonically improves with the lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate a solid performance improvement over baseline methods on widely used offline RL benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24h.html
https://proceedings.mlr.press/v238/wang24h.htmlModel-based Policy Optimization under Approximate Bayesian InferenceModel-based reinforcement learning algorithms (MBRL) present an exceptional potential to enhance sample efficiency within the realm of online reinforcement learning (RL). Nevertheless, a substantial proportion of prevalent MBRL algorithms fail to adequately address the dichotomy of exploration and exploitation. Posterior sampling reinforcement learning (PSRL) emerges as an innovative strategy adept at balancing exploration and exploitation, albeit its theoretical assurances are contingent upon exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by the analysis, we propose an improved factorization for the posterior distribution of polices by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass baseline methods by a significant margin on both dense rewards and sparse rewards tasks from the Deepmind control suite, OpenAI Gym and Metaworld benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24g.html
https://proceedings.mlr.press/v238/wang24g.htmlInvariant Aggregator for Defending against Federated Backdoor AttacksFederated learning enables training high-utility models across several clients without directly sharing their private data. As a downside, the federated setting makes the model vulnerable to various adversarial attacks in the presence of malicious clients. Despite the theoretical and empirical success in defending against attacks that aim to degrade models’ utility, defense against backdoor attacks that increase model accuracy on backdoor samples exclusively without hurting the utility on other samples remains challenging. To this end, we first analyze the failure modes of existing defenses over a flat loss landscape, which is common for well-designed neural networks such as Resnet (He et al., 2015) but is often overlooked by previous works. Then, we propose an invariant aggregator that redirects the aggregated update to invariant directions that are generally useful via selectively masking out the update elements that favor few and possibly malicious clients. Theoretical results suggest that our approach provably mitigates backdoor attacks and remains effective over flat loss landscapes. Empirical results on three datasets with different modalities and varying numbers of clients further demonstrate that our approach mitigates a broad class of backdoor attacks with a negligible cost on the model utility.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24f.html
https://proceedings.mlr.press/v238/wang24f.htmlEfficient Data Shapley for Weighted Nearest Neighbor AlgorithmsThis work aims to address an open problem in data valuation literature concerning the efficient computation of Data Shapley for weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley’s computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24e.html
https://proceedings.mlr.press/v238/wang24e.htmlNear-Optimal Convex Simple Bilevel Optimization with a Bisection MethodThis paper studies a class of simple bilevel optimization problems where we minimize a composite convex function at the upper-level subject to a composite convex lower-level problem. Existing methods either provide asymptotic guarantees for the upper-level objective or attain slow sublinear convergence rates. We propose a bisection algorithm to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. In each iteration, the binary search narrows the interval by assessing inequality system feasibility. Under mild conditions, the total operation complexity of our method is ${{\mathcal{O}}}\left(\max\{\sqrt{L_{f_1}/\epsilon_f},\sqrt{L_{g_1}/\epsilon_g}\} \right)$. Here, a unit operation can be a function evaluation, gradient evaluation, or the invocation of the proximal mapping, $L_{f_1}$ and $L_{g_1}$ are the Lipschitz constants of the upper- and lower-level objectives’ smooth components, and ${\mathcal{O}}$ hides logarithmic terms. Our approach achieves a near-optimal rate in unconstrained smooth or composite convex optimization when disregarding logarithmic terms. Numerical experiments demonstrate the effectiveness of our method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24d.html
https://proceedings.mlr.press/v238/wang24d.htmlJoint control variate for faster black-box variational inferenceBlack-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24c.html
https://proceedings.mlr.press/v238/wang24c.htmlOn Counterfactual Metrics for Social Welfare: Incentives, Ranking, and Information AsymmetryFrom the social sciences to machine learning, it is well documented that metrics do not always align with social welfare. In healthcare, Dranove et al. (2003) showed that publishing surgery mortality metrics actually harmed sicker patients by increasing provider selection behavior. Using a principal-agent model, we analyze the incentive misalignments that arise from such average treated outcome metrics, and show that the incentives driving treatment decisions would align with maximizing total patient welfare if the metrics (i) accounted for counterfactual untreated outcomes and (ii) considered total welfare instead of averaging over treated patients. Operationalizing this, we show how counterfactual metrics can be modified to behave reasonably in patient-facing ranking systems. Extending to realistic settings when providers observe more about patients than the regulatory agencies do, we bound the decay in performance by the degree of information asymmetry between principal and agent. In doing so, our model connects principal-agent information asymmetry with unobserved heterogeneity in causal inference.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24b.html
https://proceedings.mlr.press/v238/wang24b.htmlOracle-Efficient Pessimism: Offline Policy Optimization In Contextual BanditsWe consider offline policy optimization (OPO) in contextual bandits, where one is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations thereof are either specialized or computationally inefficient. We present the first \emph{general} oracle-efficient algorithm for pessimistic OPO: it reduces to supervised learning, leading to broad applicability. We obtain statistical guarantees analogous to those for prior pessimistic approaches. We instantiate our approach for both discrete and continuous actions and perform experiments in both settings, showing advantage over unregularized OPO across a wide range of configurations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wang24a.html
https://proceedings.mlr.press/v238/wang24a.htmlInterpretability Guarantees with Merlin-Arthur ClassifiersWe propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/waldchen24a.html
https://proceedings.mlr.press/v238/waldchen24a.htmlDiagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models via Reparameterisation and SmoothingIt is well-known that the reparameterisation gradient estimator, which exhibits low variance in practice, is biased for non-differentiable models. This may compromise correctness of gradient-based optimisation methods such as stochastic gradient descent (SGD). We introduce a simple syntactic framework to define non-differentiable functions piecewisely and present a systematic approach to obtain smoothings for which the reparameterisation gradient estimator is unbiased. Our main contribution is a novel variant of SGD, Diagonalisation Stochastic Gradient Descent, which progressively enhances the accuracy of the smoothed approximation during optimisation, and we prove convergence to stationary points of the unsmoothed (original) objective. Our empirical evaluation reveals benefits over the state of the art: our approach is simple, fast, stable and attains orders of magnitude reduction in work-normalised variance.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/wagner24a.html
https://proceedings.mlr.press/v238/wagner24a.htmlAnalysis of Privacy Leakage in Federated Large Language ModelsWith the rapid adoption of Federated Learning (FL) as the training and tuning protocol for applications utilizing Large Language Models (LLMs), recent research highlights the need for significant modifications to FL to accommodate the large-scale of LLMs. While substantial adjustments to the protocol have been introduced as a response, comprehensive privacy analysis for the adapted FL protocol is currently lacking. To address this gap, our work delves into an extensive examination of the privacy analysis of FL when used for training LLMs, both from theoretical and practical perspectives. In particular, we design two active membership inference attacks with guaranteed theoretical success rates to assess the privacy leakages of various adapted FL configurations. Our theoretical findings are translated into practical attacks, revealing substantial privacy vulnerabilities in popular LLMs, including BERT, RoBERTa, DistilBERT, and OpenAI’s GPTs, across multiple real-world language datasets. Additionally, we conduct thorough experiments to evaluate the privacy leakage of these models when data is protected by state-of-the-art differential privacy (DP) mechanisms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/vu24a.html
https://proceedings.mlr.press/v238/vu24a.htmlLEDetection: A Simple Framework for Semi-Supervised Few-Shot Object DetectionFew-shot object detection (FSOD) is a challenging problem aimed at detecting novel concepts from few exemplars. Existing approaches to FSOD all assume abundant base labels to adapt to novel objects. This paper studies the new task of semi-supervised FSOD by considering a realistic scenario in which both base and novel labels are simultaneously scarce. We explore the utility of unlabeled data within our proposed label-efficient detection framework and discover its remarkable ability to boost semi-supervised FSOD by way of region proposals. Motivated by this finding, we introduce SoftER Teacher, a robust detector combining pseudo-labeling with consistency learning on region proposals, to harness unlabeled data for improved FSOD without relying on abundant labels. Rigorous experiments show that SoftER Teacher surpasses the novel performance of a strong supervised detector using only 10% of required base labels, without catastrophic forgetting observed in prior approaches. Our work also sheds light on a potential relationship between semi-supervised and few-shot detection suggesting that a stronger semi-supervised detector leads to a more effective few-shot detector.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/vu-tran24a.html
https://proceedings.mlr.press/v238/vu-tran24a.htmlTaming False Positives in Out-of-Distribution Detection with Human FeedbackRobustness to out-of-distribution (OOD) samples is crucial for the safe deployment of machine learning models in the open world. Recent works have focused on designing scoring functions to quantify OOD uncertainty. Setting appropriate thresholds for these scoring functions for OOD detection is challenging as OOD samples are often unavailable up front. Typically, thresholds are set to achieve a desired true positive rate (TPR), e.g., $95%$ TPR. However, this can lead to very high false positive rates (FPR), ranging from 60 to 96%, as observed in the Open-OOD benchmark. In safety critical real-life applications, e.g., medical diagnosis, controlling the FPR is essential when dealing with various OOD samples dynamically. To address these challenges, we propose a mathematically grounded OOD detection framework that leverages expert feedback to \emph{safely} update the threshold on the fly. We provide theoretical results showing that it is guaranteed to meet the FPR constraint at all times while minimizing the use of human feedback. Another key feature of our framework is that it can work with any scoring function for OOD uncertainty quantification. Empirical evaluation of our system on synthetic and benchmark OOD datasets shows that our method can maintain FPR at most $5%$ while maximizing TPR.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/vishwakarma24a.html
https://proceedings.mlr.press/v238/vishwakarma24a.htmlAsymptotic Characterisation of the Performance of Robust Linear Regression in the Presence of OutliersWe study robust linear regression in high-dimension, when both the dimension $d$ and the number of data points $n$ diverge with a fixed ratio $\alpha=n/d$, and study a data model that includes outliers. We provide exact asymptotics for the performances of the empirical risk minimisation (ERM) using $\ell_2$-regularised $\ell_2$, $\ell_1$, and Huber losses, which are the standard approach to such problems. We focus on two metrics for the performance: the generalisation error to similar datasets with outliers, and the estimation error of the original, unpolluted function. Our results are compared with the information theoretic Bayes-optimal estimation bound. For the generalization error, we find that optimally-regularised ERM is asymptotically consistent in the large sample complexity limit if one perform a simple calibration, and compute the rates of convergence. For the estimation error however, we show that due to a norm calibration mismatch, the consistency of the estimator requires an oracle estimate of the optimal norm, or the presence of a cross-validation set not corrupted by the outliers. We examine in detail how performance depends on the loss function and on the degree of outlier corruption in the training set and identify a region of parameters where the optimal performance of the Huber loss is identical to that of the $\ell_2$ loss, offering insights into the use cases of different loss functions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/vilucchio24a.html
https://proceedings.mlr.press/v238/vilucchio24a.htmlLeveraging PAC-Bayes Theory and Gibbs Distributions for Generalization Bounds with Complexity MeasuresIn statistical learning theory, a generalization bound usually involves a complexity measure imposed by the considered theoretical framework. This limits the scope of such bounds, as other forms of capacity measures or regularizations are used in algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayes bounds to derive a general generalization bound instantiable with arbitrary complexity measures. One trick to prove such a result involves considering a commonly used family of distributions: the Gibbs distributions. Our bound stands in probability jointly over the hypothesis and the learning sample, which allows the complexity to be adapted to the generalization gap as it can be customized to fit both the hypothesis class and the task.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/viallard24a.html
https://proceedings.mlr.press/v238/viallard24a.htmlVariational Gaussian Process Diffusion ProcessesDiffusion processes are a class of stochastic differential equations (SDEs) providing a rich family of expressive models that arise naturally in dynamic modelling tasks. Probabilistic inference and learning under generative models with latent processes endowed with a non-linear diffusion process prior are intractable problems. We build upon work within variational inference, approximating the posterior process as a linear diffusion process, and point out pathologies in the approach. We propose an alternative parameterization of the Gaussian variational process using a site-based exponential family description. This allows us to trade a slow inference algorithm with fixed-point iterations for a fast algorithm for convex optimization akin to natural gradient descent, which also provides a better objective for learning model parameters.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/verma24a.html
https://proceedings.mlr.press/v238/verma24a.htmlOptimal Budgeted Rejection Sampling for Generative ModelsRejection sampling methods have recently been proposed to improve the performance of discriminator-based generative models. However, these methods are only optimal under an unlimited sampling budget, and are usually applied to a generator trained independently of the rejection procedure. We first propose an Optimal Budgeted Rejection Sampling (OBRS) scheme that is provably optimal with respect to \textit{any} $f$-divergence between the true distribution and the post-rejection distribution, for a given sampling budget. Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model’s overall performance. Through experiments and supporting theory, we show that the proposed methods are effective in significantly improving the quality and diversity of the samples.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/verine24a.html
https://proceedings.mlr.press/v238/verine24a.htmlLearning Sparse Codes with Entropy-Based ELBOsStandard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) the novel objective is fully analytic; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytic solutions, which leads to the fully analytic objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning; and (2) using the entropy objective, we derive a fully analytic ELBO objective for the standard sparse coding generative model.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/velychko24a.html
https://proceedings.mlr.press/v238/velychko24a.htmlStochastic Methods in Variational Inequalities: Ergodicity, Bias and RefinementsFor min-max optimization and variational inequalities problems (VIPs), Stochastic Extragradient (SEG) and Stochastic Gradient Descent Ascent (SGDA) have emerged as preeminent algorithms. Constant step-size versions of SEG/SGDA have gained popularity due to several appealing benefits, but their convergence behaviors are complicated even in rudimentary bilinear models. Our work elucidates the probabilistic behavior of these algorithms and their projected variants, for a wide range of monotone and non-monotone VIPs with potentially biased stochastic oracles. By recasting them as time-homogeneous Markov Chains, we establish geometric convergence to a unique invariant distribution and aymptotic normality. Specializing to min-max optimization, we characterize the relationship between the step-size and the induced bias with respect to the global solution, which in turns allows for bias refinement via the Richardson-Romberg scheme. Our theoretical analysis is corroborated by numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/vasileios-vlatakis-gkaragkounis24a.html
https://proceedings.mlr.press/v238/vasileios-vlatakis-gkaragkounis24a.htmlGeneral Identifiability and Achievability for Causal Representation LearningThis paper focuses on causal representation learning (CRL) under a general nonparametric latent causal model and a general transformation model that maps the latent data to the observational data. It establishes identifiability and achievability results using two hard uncoupled interventions per node in the latent causal graph. Notably, one does not know which pair of intervention environments have the same node intervened (hence, uncoupled). For identifiability, the paper establishes that perfect recovery of the latent causal model and variables is guaranteed under uncoupled interventions. For achievability, an algorithm is designed that uses observational and interventional data and recovers the latent causal model and variables with provable guarantees. This algorithm leverages score variations across different environments to estimate the inverse of the transformer and, subsequently, the latent variables. The analysis, additionally, recovers the identifiability result for two hard coupled interventions, that is when metadata about the pair of environments that have the same node intervened is known. This paper also shows that when observational data is available, additional faithfulness assumptions that are adopted by the existing literature are unnecessary.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/varici24a.html
https://proceedings.mlr.press/v238/varici24a.htmlOrdinal Potential-based Player RatingIt was recently observed that Elo ratings fail at preserving transitive relations among strategies and therefore cannot correctly extract the transitive component of a game. We provide a characterization of transitive games as a weak variant of ordinal potential games and show that Elo ratings actually do preserve transitivity when computed in the right space, using suitable invertible mappings. Leveraging this insight, we introduce a new game decomposition of an arbitrary game into transitive and cyclic components that is learnt using a neural network-based architecture and that prioritises capturing the sign pattern of the game, namely transitive and cyclic relations among strategies. We link our approach to the known concept of sign-rank, and evaluate our methodology using both toy examples and empirical data from real-world games.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/vadori24a.html
https://proceedings.mlr.press/v238/vadori24a.htmlProvable Mutual Benefits from Federated Learning in Privacy-Sensitive DomainsCross-silo federated learning (FL) allows data owners to train accurate machine learning models by benefiting from each others private datasets. Unfortunately, the model accuracy benefits of collaboration are often undermined by privacy defenses. Therefore, to incentivize client participation in privacy-sensitive domains, a FL protocol should strike a delicate balance between privacy guarantees and end-model accuracy. In this paper, we study the question of when and how a server could design a FL protocol provably beneficial for all participants. First, we provide necessary and sufficient conditions for the existence of mutually beneficial protocols in the context of mean estimation and convex stochastic optimization. We also derive protocols that maximize the total clients’ utility, given symmetric privacy preferences. Finally, we design protocols maximizing end-model accuracy and demonstrate their benefits in synthetic experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tsoy24a.html
https://proceedings.mlr.press/v238/tsoy24a.htmlManifold-Aligned Counterfactual Explanations for Neural NetworksWe study the problem of finding optimal manifold-aligned counterfactual explanations for neural networks. Existing approaches that involve solving a complex mixed-integer optimization (MIP) problem frequently suffer from scalability issues, limiting their practical usefulness. Furthermore, the solutions are not guaranteed to follow the data manifold, resulting in unrealistic counterfactual explanations. To address these challenges, we first present a MIP formulation where we explicitly enforce manifold alignment by reformulating the highly nonlinear Local Outlier Factor (LOF) metric as mixed-integer constraints. To address the computational challenge, we leverage the geometry of a trained neural network and propose an efficient decomposition scheme that reduces the initial large, hard-to-solve optimization problem into a series of significantly smaller, easier-to-solve problems by constraining the search space to “live” polytopes, i.e., regions that contain at least one actual data point. Experiments on real-world datasets demonstrate the efficacy of our approach in producing both optimal and realistic counterfactual explanations, and computational traceability.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tsiourvas24a.html
https://proceedings.mlr.press/v238/tsiourvas24a.htmlProxy Methods for Domain AdaptationWe study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional “concept” variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tsai24b.html
https://proceedings.mlr.press/v238/tsai24b.htmlFast Minimization of Expected Logarithmic Loss via Stochastic Dual AveragingConsider the problem of minimizing an expected logarithmic loss over either the probability simplex or the set of quantum density matrices. This problem includes tasks such as solving the Poisson inverse problem, computing the maximum-likelihood estimate for quantum state tomography, and approximating positive semi-definite matrix permanents with the currently tightest approximation ratio. Although the optimization problem is convex, standard iteration complexity guarantees for first-order methods do not directly apply due to the absence of Lipschitz continuity and smoothness in the loss function. In this work, we propose a stochastic first-order algorithm named $B$-sample stochastic dual averaging with the logarithmic barrier. For the Poisson inverse problem, our algorithm attains an $\varepsilon$-optimal solution in $\smash{\tilde{O}}(d^2/\varepsilon^2)$ time, matching the state of the art, where $d$ denotes the dimension. When computing the maximum-likelihood estimate for quantum state tomography, our algorithm yields an $\varepsilon$-optimal solution in $\smash{\tilde{O}}(d^3/\varepsilon^2)$ time. This improves on the time complexities of existing stochastic first-order methods by a factor of $d^{\omega-2}$ and those of batch methods by a factor of $d^2$, where $\omega$ denotes the matrix multiplication exponent. Numerical experiments demonstrate that empirically, our algorithm outperforms existing methods with explicit complexity guarantees.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tsai24a.html
https://proceedings.mlr.press/v238/tsai24a.htmlSequence Length Independent Norm-Based Generalization Bounds for TransformersThis paper provides norm-based generalization bounds for the Transformer architecture that do not depend on the input sequence length. We employ a covering number based approach to prove our bounds. We use three novel covering number bounds for the function class of bounded linear mappings to upper bound the Rademacher complexity of the Transformer. Furthermore, we show this generalization bound applies to the common Transformer training technique of masking and then predicting the masked word. We also run a simulated study on a sparse majority data set that empirically validates our theoretical findings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/trauger24a.html
https://proceedings.mlr.press/v238/trauger24a.htmlSimulation-Free Schrödinger Bridges via Score and Flow MatchingWe present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired samples drawn from arbitrary source and target distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schrödinger bridge problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data. Our code is available in the TorchCFM package at \url{https://github.com/atong01/conditional-flow-matching}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tong24a.html
https://proceedings.mlr.press/v238/tong24a.htmlOn Convergence in Wasserstein Distance and f-divergence Minimization ProblemsThe zero-sum game in generative adversarial networks (GANs) for learning the distribution of observed data is known to reduce to the minimization of a divergence measure between the underlying and generative models. However, the current theoretical understanding of the role of the target divergence in the characteristics of GANs’ generated samples remains largely inadequate. In this work, we aim to analyze the influence of the divergence measure on the local optima and convergence properties of divergence minimization problems in learning a multi-modal data distribution. We show a mode-seeking f-divergence, e.g. the Jensen-Shannon (JS) divergence in the vanilla GAN, could lead to poor locally optimal solutions missing some underlying modes. On the other hand, we demonstrate that the optimization landscape of 1-Wasserstein distance in Wasserstein GANs does not suffer from such suboptimal local minima. Furthermore, we prove that a randomly-initialized gradient-based optimization of the Wasserstein distance will, with high probability, capture all the existing modes. We present numerical results on standard image datasets, revealing the success of Wasserstein GANs compared to JS-GANs in avoiding suboptimal local optima under a mixture model.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ting-li24a.html
https://proceedings.mlr.press/v238/ting-li24a.htmlScalable Meta-Learning with Gaussian ProcessesMeta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tighineanu24a.html
https://proceedings.mlr.press/v238/tighineanu24a.htmlGenerative Flow Networks as Entropy-Regularized RLThe recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to a general case. We demonstrate how the task of learning a generative flow network can be efficiently redefined as an entropy-regularized RL problem with a specific reward and regularizer structure. Furthermore, we illustrate the practical efficiency of this reformulation by applying standard soft RL algorithms to GFlowNet training across several probabilistic modeling tasks. Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods. This perspective opens a direct path for integrating RL principles into the realm of generative flow networks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tiapkin24a.html
https://proceedings.mlr.press/v238/tiapkin24a.htmlContextual Directed Acyclic GraphsEstimating the structure of directed acyclic graphs (DAGs) from observational data remains a significant challenge in machine learning. Most research in this area concentrates on learning a single DAG for the entire population. This paper considers an alternative setting where the graph structure varies across individuals based on available "contextual" features. We tackle this contextual DAG problem via a neural network that maps the contextual features to a DAG, represented as a weighted adjacency matrix. The neural network is equipped with a novel projection layer that ensures the output matrices are sparse and satisfy a recently developed characterization of acyclicity. We devise a scalable computational framework for learning contextual DAGs and provide a convergence guarantee and an analytical gradient for backpropagating through the projection layer. Our experiments suggest that the new approach can recover the true context-specific graph where existing approaches fail.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/thompson24a.html
https://proceedings.mlr.press/v238/thompson24a.htmlEfficiently Computable Safety Bounds for Gaussian Processes in Active LearningActive learning of physical systems must commonly respect practical safety constraints, which restricts the exploration of the design space. Gaussian Processes (GPs) and their calibrated uncertainty estimations are widely used for this purpose. In many technical applications the design space is explored via continuous trajectories, along which the safety needs to be assessed. This is particularly challenging for strict safety requirements in GP methods, as it employs computationally expensive Monte Carlo sampling of high quantiles. We address these challenges by providing provable safety bounds based on the adaptively sampled median of the supremum of the posterior GP. Our method significantly reduces the number of samples required for estimating high safety probabilities, resulting in faster evaluation without sacrificing accuracy and exploration speed. The effectiveness of our safe active learning approach is demonstrated through extensive simulations and validated using a real-world engine example.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tebbe24a.html
https://proceedings.mlr.press/v238/tebbe24a.htmlLearning Populations of Preferences via Pairwise Comparison QueriesIdeal point based preference learning using pairwise comparisons of type "Do you prefer a or b?" has emerged as a powerful tool for understanding how we make preferences. Existing preference learning approaches assume homogeneity and focus on learning preference on average over the population or require a large number of queries per individual to localize individual preferences. However, in practical scenarios with heterogeneous preferences and limited availability of responses, these approaches are impractical. Therefore, we introduce the problem of learning the distribution of preferences over a population via pairwise comparisons using only one response per individual. Due to binary answers from comparison queries, we focus on learning the mass of the underlying distribution in the regions created by the intersection of bisecting hyperplanes between queried item pairs. We investigate this fundamental question in both 1-D and higher dimensional settings with noiseless response to comparison queries. We show that the problem is identifiable in 1-D setting and provide recovery guarantees. We show that the problem is not identifiable for higher dimensional settings in general and establish sufficient condition for identifiability. We propose using a regularized recovery, and provide guarantees on the total variation distance between the true mass and the learned distribution. We validate our findings through simulations and experiments on real datasets. We also introduce a new dataset for this task collected on a real crowdsourcing platform.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tatli24a.html
https://proceedings.mlr.press/v238/tatli24a.htmlStochastic Multi-Armed Bandits with Strongly Reward-Dependent DelaysThere has been increasing interest in applying multi-armed bandits to adaptive designs in clinical trials. However, most literature assumes that a previous patient’s survival response of a treatment is known before the next patient is treated, which is unrealistic. The inability to account for response delays is cited frequently as one of the problems in using adaptive designs in clinical trials. More critically, the “delays” in observing the survival response are the same as the rewards rather than being external stochastic noise. We formalize this problem as a novel stochastic multi-armed bandit (MAB) problem with reward-dependent delays, where the delay at each round depends on the reward generated on the same round. For general reward/delay distributions with finite expectation, our proposed censored-UCB algorithm achieves near-optimal regret in terms of both problem-dependent and problem-independent bounds. With bounded or sub-Gaussian reward distributions, the upper bounds are optimal with a matching lower bound. Our theoretical results and the algorithms’ effectiveness are validated by empirical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tang24c.html
https://proceedings.mlr.press/v238/tang24c.htmlSolving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient ViewpointSolving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task specific model fine-tuning. To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method by viewing the intermediate noisy images as policies and the target image as the states selected by the policy. Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting into a higher image restoration quality on FFHQ, ImageNet and LSUN datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tang24b.html
https://proceedings.mlr.press/v238/tang24b.htmlAdaptivity of Diffusion Models to Manifold StructuresEmpirical studies have demonstrated the effectiveness of (score-based) diffusion models in generating high-dimensional data, such as texts and images, which typically exhibit a low-dimensional manifold nature. These empirical successes raise the theoretical question of whether score-based diffusion models can optimally adapt to low-dimensional manifold structures. While recent work has validated the minimax optimality of diffusion models when the target distribution admits a smooth density with respect to the Lebesgue measure of the ambient data space, these findings do not fully account for the ability of diffusion models in avoiding the the curse of dimensionality when estimating high-dimensional distributions. This work considers two common classes of diffusion models: Langevin diffusion and forward-backward diffusion. We show that both models can adapt to the intrinsic manifold structure by showing that the convergence rate of the inducing distribution estimator depends only on the intrinsic dimension of the data. Moreover, our considered estimator does not require knowing or explicitly estimating the manifold. We also demonstrate that the forward-backward diffusion can achieve the minimax optimal rate under the Wasserstein metric when the target distribution possesses a smooth density with respect to the volume measure of the low-dimensional manifold.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tang24a.html
https://proceedings.mlr.press/v238/tang24a.htmlInformative Path Planning with Limited AdaptivityWe consider the informative path planning (IPP) problem in which a robot interacts with an uncertain environment and gathers information by visiting locations. The goal is to minimize its expected travel cost to cover a given submodular function. Adaptive solutions, where the robot incorporates all available information to select the next location to visit, achieve the best objective. However, such a solution is resource-intensive as it entails recomputing after every visited location. A more practical approach is to design solutions with a small number of adaptive "rounds", where the robot recomputes only once at the start of each round. In this paper, we design an algorithm for IPP parameterized by the number k of adaptive rounds, and prove a smooth tradeoff between k and the solution quality (relative to fully adaptive solutions). We validate our theoretical results by experiments on a real road network, where we observe that a few rounds of adaptivity suffice to obtain solutions of cost almost as good as fully-adaptive ones.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tan24a.html
https://proceedings.mlr.press/v238/tan24a.htmlData-Driven Confidence Intervals with Optimal Rates for the Mean of Heavy-Tailed DistributionsEstimating the expected value is one of the key problems of statistics, and it serves as a backbone for countless methods in machine learning. In this paper we propose a new algorithm to build non-asymptotically exact confidence intervals for the mean of a symmetric distribution based on an independent, identically distributed sample. The method combines resampling with median-of-means estimates to ensure optimal subgaussian bounds for the sizes of the confidence intervals under mild, heavy-tailed moment conditions. The scheme is completely data-driven: the construction does not need any information about the moments, yet it manages to build exact confidence regions which shrink at the optimal rate. We also show how to generalize the approach to higher dimensions and prove dimension-free, subgaussian PAC bounds for the exclusion probabilities of false candidates. Finally, we illustrate the method and its properties for heavy-tailed distributions with numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tamas24a.html
https://proceedings.mlr.press/v238/tamas24a.htmlModel-Based Best Arm Identification for Decreasing BanditsWe study the problem of reliably identifying the best (lowest loss) arm in a stochastic multi-armed bandit when the expected loss of each arm is monotone decreasing as a function of its pull count. This models, for instance, scenarios where each arm itself represents an optimization algorithm for finding the minimizer of a common function, and there is a limited time available to test the algorithms before committing to one of them. We assume that the decreasing expected loss of each arm depends on the number of its pulls as a (inverse) polynomial with unknown coefficients. We propose two fixed-budget best arm identification algorithms – one for the case of sparse polynomial decay models and the other for general polynomial models – along with bounds on the identification error probability. We also derive algorithm-independent lower bounds on the error probability. These bounds are seen to be factored into the product of the usual problem complexity and the model complexity that only depends on the parameters of the model. This indicates that our methods can identify the best arm even when the budget is smaller. We conduct empirical studies of our algorithms to complement our theoretical findings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/takemori24a.html
https://proceedings.mlr.press/v238/takemori24a.htmlLearning to Defer to a Population: A Meta-Learning ApproachThe learning to defer (L2D) framework allows autonomous systems to be safe and robust by allocating difficult decisions to a human expert. All existing work on L2D assumes that each expert is well-identified, and if any expert were to change, the system should be re-trained. In this work, we alleviate this constraint, formulating an L2D system that can cope with never-before-seen experts at test-time. We accomplish this by using meta-learning, considering both optimization- and model-based variants. Given a small context set to characterize the currently available expert, our framework can quickly adapt its deferral policy. For the model-based approach, we employ an attention mechanism that is able to look for points in the context set that are similar to a given test point, leading to an even more precise assessment of the expert’s abilities. In the experiments, we validate our methods on image recognition, traffic sign detection, and skin lesion diagnosis benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/tailor24a.html
https://proceedings.mlr.press/v238/tailor24a.htmlUnderstanding Progressive Training Through the Framework of Randomized Coordinate DescentWe propose a Randomized Progressive Training algorithm (RPT)—a stochastic proxy for the well-known Progressive Training method (PT) (Karras et al., 2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was proposed as a heuristic, with no convergence analysis even for the simplest objective functions. On the contrary, to the best of our knowledge, RPT is the first PT-type algorithm with rigorous and sound theoretical guarantees for general smooth objective functions. We cast our method into the established framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtarik & Takac, 2014), for which (as a by-product of our investigations) we also propose a novel, simple and general convergence analysis encapsulating strongly-convex, convex and nonconvex objectives. We then use this framework to establish a convergence theory for RPT. Finally, we validate the effectiveness of our method through extensive computational experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/szlendak24a.html
https://proceedings.mlr.press/v238/szlendak24a.htmlSampling-based Safe Reinforcement Learning for Nonlinear Dynamical SystemsWe develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/suttle24a.html
https://proceedings.mlr.press/v238/suttle24a.htmlSharpened Lazy Incremental Quasi-Newton MethodThe problem of minimizing the sum of $n$ functions in $d$ dimensions is ubiquitous in machine learning and statistics. In many applications where the number of observations $n$ is large, it is necessary to use incremental or stochastic methods, as their per-iteration cost is independent of $n$. Of these, Quasi-Newton (QN) methods strike a balance between the per-iteration cost and the convergence rate. Specifically, they exhibit a superlinear rate with $O(d^2)$ cost in contrast to the linear rate of first-order methods with $O(d)$ cost and the quadratic rate of second-order methods with $O(d^3)$ cost. However, existing incremental methods have notable shortcomings: Incremental Quasi-Newton (IQN) only exhibits asymptotic superlinear convergence. In contrast, Incremental Greedy BFGS (IGS) offers explicit superlinear convergence but suffers from poor empirical performance and has a per-iteration cost of $O(d^3)$. To address these issues, we introduce the Sharpened Lazy Incremental Quasi-Newton Method (SLIQN) that achieves the best of both worlds: an explicit superlinear convergence rate, and superior empirical performance at a per-iteration $O(d^2)$ cost. SLIQN features two key changes: first, it incorporates a hybrid strategy of using both classic and greedy BFGS updates, allowing it to empirically outperform both IQN and IGS. Second, it employs a clever constant multiplicative factor along with a lazy propagation strategy, which enables it to have a cost of $O(d^2)$. Additionally, our experiments demonstrate the superiority of SLIQN over other incremental and stochastic Quasi-Newton variants and establish its competitiveness with second-order incremental methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sunil-lahoti24a.html
https://proceedings.mlr.press/v238/sunil-lahoti24a.htmlDistributionally Robust Model-based Reinforcement Learning with Large State SpacesThree major challenges in reinforcement learning are the complex dynamical systems with large state spaces, the costly data acquisition processes, and the deviation of real-world dynamics from the training environment deployment. To overcome these issues, we study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback-Leibler, chi-square, and total variation uncertainty sets. We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics, leveraging access to a generative model (i.e., simulator). We further demonstrate the statistical sample complexity of the proposed method for different uncertainty sets. These complexity bounds are independent of the number of states and extend beyond linear dynamics, ensuring the effectiveness of our approach in identifying near-optimal distributionally-robust policies. The proposed method can be further combined with other model-free distributionally robust reinforcement learning methods to obtain a near-optimal robust policy. Experimental results demonstrate the robustness of our algorithm to distributional shifts and its superior performance in terms of the number of samples needed.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sundhar-ramesh24a.html
https://proceedings.mlr.press/v238/sundhar-ramesh24a.htmlSparse and Faithful Explanations Without Sparse ModelsEven if a model is not globally sparse, it is possible for decisions made from that model to be accurately and faithfully described by a small number of features. For instance, an application for a large loan might be denied to someone because they have no credit history, which overwhelms any evidence towards their creditworthiness. In this work, we introduce the Sparse Explanation Value (SEV), a new way of measuring sparsity in machine learning models. In the loan denial example above, the SEV is 1 because only one factor is needed to explain why the loan was denied. SEV is a measure of decision sparsity rather than overall model sparsity, and we are able to show that many machine learning models – even if they are not sparse – actually have low decision sparsity, as measured by SEV. SEV is defined using movements over a hypercube, allowing SEV to be defined consistently over various model classes, with movement restrictions reflecting real-world constraints. Our algorithms reduce SEV without sacrificing accuracy, providing sparse and completely faithful explanations, even without globally sparse models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sun24b.html
https://proceedings.mlr.press/v238/sun24b.htmlUnderstanding Generalization of Federated Learning via Stability: Heterogeneity MattersGeneralization performance is a key metric in evaluating machine learning models when applied to real-world applications. Good generalization indicates the model can predict unseen data correctly when trained under a limited number of data. Federated learning (FL), which has emerged as a popular distributed learning framework, allows multiple devices or clients to train a shared model without violating privacy requirements. While the existing literature has studied extensively the generalization performances of centralized machine learning algorithms, similar analysis in the federated settings is either absent or with very restrictive assumptions on the loss functions. In this paper, we aim to analyze the generalization performances of federated learning by means of algorithmic stability, which measures the change of the output model of an algorithm when perturbing one data point. Three widely-used algorithms are studied, including FedAvg, SCAFFOLD, and FedProx, under convex and non-convex loss functions. Our analysis shows that the generalization performances of models trained by these three algorithms are closely related to the heterogeneity of clients’ datasets as well as the convergence behaviors of the algorithms. Particularly, in the i.i.d. setting, our results recover the classical results of stochastic gradient descent (SGD).Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sun24a.html
https://proceedings.mlr.press/v238/sun24a.htmlMINTY: Rule-based models that minimize the need for imputing features with missing valuesRule models are often preferred in prediction tasks with tabular inputs as they can be easily interpreted using natural language and provide predictive performance on par with more complex models. However, most rule models’ predictions are undefined or ambiguous when some inputs are missing, forcing users to rely on statistical imputation models or heuristics like zero imputation, undermining the interpretability of the models. In this work, we propose fitting concise yet precise rule models that learn to avoid relying on features with missing values and, therefore, limit their reliance on imputation at test time. We develop MINTY, a method that learns rules in the form of disjunctions between variables that act as replacements for each other when one or more is missing. This results in a sparse linear rule model, regularized to have small dependence on features with missing values, that allows a trade-off between goodness of fit, interpretability, and robustness to missing values at test time. We demonstrate the value of MINTY in experiments using synthetic and real-world data sets and find its predictive performance comparable or favorable to baselines, with smaller reliance on features with missing values.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/stempfle24a.html
https://proceedings.mlr.press/v238/stempfle24a.htmlFixed-kinetic Neural Hamiltonian Flows for enhanced interpretability and reduced complexityNormalizing Flows (NF) are Generative models which transform a simple prior distribution into the desired target. They however require the design of an invertible mapping whose Jacobian determinant has to be computable. Recently introduced, Neural Hamiltonian Flows (NHF) are Hamiltonian dynamics-based flows, which are continuous, volume-preserving and invertible and thus make for natural candidates for robust NF architectures. In particular, their similarity to classical Mechanics could lead to easier interpretability of the learned mapping. In this paper, we show that the current NHF architecture may still pose a challenge to interpretability. Inspired by Physics, we introduce a fixed-kinetic energy version of the model. This approach improves interpretability and robustness while requiring fewer parameters than the original model. We illustrate that on a 2D Gaussian mixture and on the MNIST and Fashion-MNIST datasets. Finally, we show how to adapt NHF to the context of Bayesian inference and illustrate the method on an example from cosmology.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/souveton24a.html
https://proceedings.mlr.press/v238/souveton24a.htmlFast Dynamic Sampling for Determinantal Point Processesn this work, we provide fast dynamic algorithms for repeatedly sampling from distributions characterized by Determinantal Point Processes (DPPs) and Nonsymmetric Determinantal Point Processes (NDPPs). DPPs are a very well-studied class of distributions on subsets of items drawn from a ground set of cardinality $n$ characterized by a symmetric $n \times n$ kernel matrix $L$ such that the probability of any subset is proportional to the determinant of its corresponding principal submatrix. Recent work has shown that the kernel symmetry constraint can be relaxed, leading to NDPPs, which can better model data in several machine learning applications. Given a low-rank kernel matrix ${\cal L}=L+L^\top\in \mathbb{R}^{n\times n}$ and its corresponding eigendecomposition specified by $\{\lambda_i, u_i \}_{i=1}^d$ where $d\leq n$ is the rank, we design a data structure that uses $O(nd)$ space and preprocesses data in $O(nd^{\omega-1})$ time where $\omega\approx 2.37$ is the exponent of matrix multiplication. The data structure can generate a sample according to DPP distribution in time $O(|E|^3\log n+|E|^{\omega-1}d^2)$ or according to NDPP distribution in time $O((|E|^3 \log n+ |E|^{\omega-1}d^2)(1+w)^d)$ for $E$ being the sampled indices and $w$ is a data-dependent parameter. This improves upon the space and preprocessing time over prior works, and achieves a state-of-the-art sampling time when the sampling set is relatively dense. At the heart of our data structure is an efficient sampling tree that can leverage batch initialization and fast inner product query simultaneously.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/song24b.html
https://proceedings.mlr.press/v238/song24b.htmlSolving Attention Kernel Regression Problem via Pre-conditionerAttention mechanism is the key to large language models, and attention matrix serves as an algorithmic and computational bottleneck for such a scheme. In this paper, we define two problems, motivated by designing fast algorithms for \emph{proxy} of attention matrix and solving regressions against them. Given an input matrix $A\in \mathbb{R}^{n\times d}$ with $n\gg d$ and a response vector $b$, we first consider the matrix exponential of the matrix $A^\top A$ as a proxy, and we in turn design algorithms for two types of regression problems: $\min_{x\in \mathbb{R}^d}\|(A^\top A)^jx-b\|_2$ and $\min_{x\in \mathbb{R}^d}\|A(A^\top A)^jx-b\|_2$ for any positive integer $j$. Studying algorithms for these regressions is essential, as matrix exponential can be approximated term-by-term via these smaller problems. The second proxy is applying exponential entrywise to the Gram matrix, denoted by $\exp(AA^\top)$ and solving the regression $\min_{x\in \mathbb{R}^n}\|\exp(AA^\top)x-b \|_2$. We call this problem the \emph{attention kernel regression} problem, as the matrix $\exp(AA^\top)$ could be viewed as a kernel function with respect to $A$. We design fast algorithms for these regression problems, based on sketching and preconditioning. We hope these efforts will provide an alternative perspective of studying efficient approximation of attention matrices.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/song24a.html
https://proceedings.mlr.press/v238/song24a.htmlFair Supervised Learning with A Simple Random Sampler of Sensitive AttributesAs the data-driven decision process becomes dominating for industrial applications, fairness-aware machine learning arouses great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty is able to handle versatile formats of the sensitive attributes, so it is more extensively applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize estimation errors and loss of utility of the proposed neural-penalized risk minimization problem.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sohn24a.html
https://proceedings.mlr.press/v238/sohn24a.htmlOptimising Distributions with Natural Gradient SurrogatesNatural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient has a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/so24a.html
https://proceedings.mlr.press/v238/so24a.htmlOnline Distribution Learning with Local Privacy ConstraintsWe study the problem of online conditional distribution estimation with \emph{unbounded} label sets under local differential privacy. The problem may be succinctly stated as follows. Let $\mathcal{F}$ be a distribution-valued function class with an unbounded label set. Our aim is to estimate an \emph{unknown} function $f\in \mathcal{F}$ in an online fashion. More precisely, at time $t$, given a sample ${\mathbf{x}}_t$, we generate an estimate of $f({\mathbf{x}}_t)$ using only a \emph{privatized} version of the true \emph{labels} sampled from $f({\mathbf{x}}_t)$. The objective is to minimize the cumulative KL-risk of a finite horizon $T$. We show that under $(\epsilon,0)$-local differential privacy for the labels, the KL-risk equals $\tilde{\Theta}(\frac{1}{\epsilon}\sqrt{KT}),$ up to poly-logarithmic factors, where $K=|\mathcal{F}|$. This result significantly differs from the $\tilde{\Theta}(\sqrt{T\log K})$ bound derived in Wu et al., (2023a) for \emph{bounded} label sets. As a side-result, our approach recovers a nearly tight upper bound for the hypothesis selection problem of Gopi et al., (2020), which has only been established for the \emph{batch} setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sima24a.html
https://proceedings.mlr.press/v238/sima24a.htmlUnified Transfer Learning in High-Dimensional Linear RegressionTransfer learning plays a key role in modern data analysis when: (1) the target data are scarce but the source data are sufficient; (2) the distributions of the source and target data are heterogeneous. This paper develops an interpretable unified transfer learning model, termed as UTrans, which can detect both transferable variables and source data. More specifically, we establish the estimation error bounds and prove that our bounds are lower than those with target data only. Besides, we propose a source detection algorithm based on hypothesis testing to exclude the nontransferable data. We evaluate and compare UTrans to the existing algorithms in multiple experiments. It is shown that UTrans attains much lower estimation and prediction errors than the existing methods, while preserving interpretability. We finally apply it to the US intergenerational mobility data and compare our proposed algorithms to the classical machine learning algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shuo-liu24a.html
https://proceedings.mlr.press/v238/shuo-liu24a.htmlDiffRed: Dimensionality reduction guided by stable rankIn this work, we propose a novel dimensionality reduction technique, \textit{DiffRed}, which first projects the data matrix, A, along first $k_1$ principal components and the residual matrix $A^{*}$ (left after subtracting its $k_1$-rank approximation) along $k_2$ Gaussian random vectors. We evaluate \emph{M1}, the distortion of mean-squared pair-wise distance, and \emph{Stress}, the normalized value of RMS of distortion of the pairwise distances. We rigorously prove that \textit{DiffRed} achieves a general upper bound of $O\left(\sqrt{\frac{1-p}{k_2}}\right)$ on \emph{Stress} and $O\left(\frac{1-p}{\sqrt{k_2*\rho(A^{*})}}\right)$ on \emph{M1} where $p$ is the fraction of variance explained by the first $k_1$ principal components and $\rho(A^{*})$ is the \textit{stable rank} of $A^{*}$. These bounds are tighter than the currently known results for Random maps. Our extensive experiments on a variety of real-world datasets demonstrate that \textit{DiffRed} achieves near zero \emph{M1} and much lower values of \emph{Stress} as compared to the well-known dimensionality reduction techniques. In particular, \textit{DiffRed} can map a 6 million dimensional dataset to 10 dimensions with 54% lower \emph{Stress} than PCA.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shukla24a.html
https://proceedings.mlr.press/v238/shukla24a.htmlBayesian Online Learning for Consensus PredictionGiven a pre-trained classifier and multiple human experts, we investigate the task of online classification where model predictions are provided for free but querying humans incurs a cost. In this practical but under-explored setting, oracle ground truth is not available. Instead, the prediction target is defined as the consensus vote of all experts. Given that querying full consensus can be costly, we propose a general framework for online Bayesian consensus estimation, leveraging properties of the multivariate hypergeometric distribution. Based on this framework, we propose a family of methods that dynamically estimate expert consensus from partial feedback by producing a posterior over expert and model beliefs. Analyzing this posterior induces an interpretable trade-off between querying cost and classification performance. We demonstrate the efficacy of our framework against a variety of baselines on CIFAR-10H and ImageNet-16H, two large-scale crowdsourced datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/showalter24a.html
https://proceedings.mlr.press/v238/showalter24a.htmlIdentification and Estimation of “Causes of Effects” using Covariate-Mediator InformationIn this paper, we deal with the evaluation problem of "causes of effects" (CoE), which focuses on the likelihood that one event was the cause of another. To assess this likelihood, three types of probabilities of causation have been utilized: probability of necessity, probability of sufficiency, and probability of necessity and sufficiency. However, these usually cannot be estimated, even if "effects of causes" (EoC) is estimable from statistical data, regardless of how large the data is. To solve this problem, we propose novel identification conditions for CoE, using an intermediate variable together with covariate information. Additionally, we also propose a new method for estimating CoE that is applicable whenever they are identifiable through the proposed identification conditions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shingaki24a.html
https://proceedings.mlr.press/v238/shingaki24a.htmlAdaptive and non-adaptive minimax rates for weighted Laplacian-Eigenmap based nonparametric regressionWe show both adaptive and non-adaptive minimax rates of convergence for a family of weighted Laplacian-Eigenmap based nonparametric regression methods, when the true regression function belongs to a Sobolev space and the sampling density is bounded from above and below. The adaptation methodology is based on extensions of Lepski’s method and is over both the smoothness parameter ($s\in\mathbb{N}_{+}$) and the norm parameter ($M>0$) determining the constraints on the Sobolev space. Our results extend the non-adaptive result in Green et al., (2023), established for a specific normalized graph Laplacian, to a wide class of weighted Laplacian matrices used in practice, including the unnormalized Laplacian and random walk Laplacian.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shi24b.html
https://proceedings.mlr.press/v238/shi24b.htmlLearning Cartesian Product Graphs with Laplacian ConstraintsGraph Laplacian learning, also known as network topology inference, is a problem of great interest to multiple communities. In Gaussian graphical models (GM), graph learning amounts to endowing covariance selection with the Laplacian structure. In graph signal processing (GSP), it is essential to infer the unobserved graph from the outputs of a filtering system. In this paper, we study the problem of learning Cartesian product graphs under Laplacian constraints. The Cartesian graph product is a natural way for modeling higher-order conditional dependencies and is also the key for generalizing GSP to multi-way tensors. We establish statistical consistency for the penalized maximum likelihood estimation (MLE) of a Cartesian product Laplacian, and propose an efficient algorithm to solve the problem. We also extend our method for efficient joint graph learning and imputation in the presence of structural missing values. Experiments on synthetic and real-world datasets demonstrate that our method is superior to previous GSP and GM methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shi24a.html
https://proceedings.mlr.press/v238/shi24a.htmlStochastic Smoothed Gradient Descent Ascent for Federated Minimax OptimizationIn recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved successful in centralized nonconvex minimax optimization, how and whether smoothing techniques could be helpful in a federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly applied to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the practical efficiency of FESS-GDA in practical federated learning tasks of training generative adversarial networks (GANs) and fair classification.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shen24c.html
https://proceedings.mlr.press/v238/shen24c.htmlEfficient Variational Sequential Information ControlWe develop a family of fast variational methods for sequential control in dynamic settings where an agent is incentivized to maximize information gain. We consider the case of optimal control in continuous nonlinear dynamical systems that prohibit exact evaluation of the mutual information (MI) reward. Our approach couples efficient message-passing inference with variational bounds on the MI objective under Gaussian projections. We also develop a Gaussian mixture approximation that enables exact MI evaluation under constraints on the component covariances. We validate our methodology in nonlinear systems with superior and faster control compared to standard particle-based methods. We show our approach improves the accuracy and efficiency of one-shot robotic learning with intrinsic MI rewards. Furthermore, we demonstrate that our method is applicable to a wider range of contexts, e.g., the active information acquisition problem.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shen24b.html
https://proceedings.mlr.press/v238/shen24b.htmlContinual Domain Adversarial Adaptation via Double-Head DiscriminatorsDomain adversarial adaptation in a continual setting poses significant challenges due to the limitations of accessing previous source domain data. Despite extensive research in continual learning, adversarial adaptation cannot be effectively accomplished using only a small number of stored source domain data, a standard setting in memory replay approaches. This limitation arises from the erroneous empirical estimation of $\mathcal{H}$-divergence with few source domain samples. To tackle this problem, we propose a double-head discriminator algorithm by introducing an addition source-only domain discriminator trained solely on the source learning phase. We prove that by introducing a pre-trained source-only domain discriminator, the empirical estimation error of $\mathcal{H}$-divergence related adversarial loss is reduced from the source domain side. Further experiments on existing domain adaptation benchmarks show that our proposed algorithm achieves more than 2$%$ improvement on all categories of target domain adaptation tasks while significantly mitigating the forgetting of the source domain.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shen24a.html
https://proceedings.mlr.press/v238/shen24a.htmlStrategic Usage in a Multi-Learner SettingReal-world systems often involve some pool of users choosing between a set of services. With the increase in popularity of online learning algorithms, these services can now self-optimize, leveraging data collected on users to maximize some reward such as service quality. On the flipside, users may strategically choose which services to use in order to pursue their own reward functions, in the process wielding power over which services can see and use their data. Extensive prior research has been conducted on the effects of strategic users in single-service settings, with strategic behavior manifesting in the manipulation of observable features to achieve a desired classification; however, this can often be costly or unattainable for users and fails to capture the full behavior of multi-service dynamic systems. As such, we analyze a setting in which strategic users choose among several available services in order to pursue positive classifications, while services seek to minimize loss functions on their observations. We focus our analysis on realizable settings, and show that naive retraining can still lead to oscillation even if all users are observed at different times; however, if this retraining uses memory of past observations, convergent behavior can be guaranteed for certain loss function classes. We provide results obtained from synthetic and real-world data to empirically validate our theoretical findings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shekhtman24a.html
https://proceedings.mlr.press/v238/shekhtman24a.htmlTuning-Free Maximum Likelihood Training of Latent Variable Models via Coin BettingWe introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sharrock24a.html
https://proceedings.mlr.press/v238/sharrock24a.htmlNonparametric Automatic Differentiation Variational Inference with Spline ApproximationAutomatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded support. Compared with widely-used nonparametric variational inference methods, the proposed method is easy to implement and adaptive to various data structures. By adopting the spline approximation, we derive a lower bound of the importance weighted autoencoder and establish the asymptotic consistency. Experiments demonstrate the efficiency of the proposed method in approximating complex posterior distributions and improving the performance of generative models with incomplete data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shao24a.html
https://proceedings.mlr.press/v238/shao24a.htmlA/B testing under Interference with Partial Network InformationA/B tests are often required to be conducted on subjects that might have social connections. For e.g., experiments on social media, or medical and social interventions to control the spread of an epidemic. In such settings, the SUTVA assumption for randomized-controlled trials is violated due to network interference, or spill-over effects, as treatments to group A can potentially also affect the control group B. When the underlying social network is known exactly, prior works have demonstrated how to conduct A/B tests adequately to estimate the global average treatment effect (GATE). However, in practice, it is often impossible to obtain knowledge about the exact underlying network. In this paper, we present UNITE: a novel estimator that relax this assumption and can identify GATE while only relying on knowledge of the superset of neighbors for any subject in the graph. Through theoretical analysis and extensive experiments, we show that the proposed approach performs better in comparison to standard estimators.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shankar24a.html
https://proceedings.mlr.press/v238/shankar24a.htmlWeight-Sharing RegularizationWeight-sharing is ubiquitous in deep learning. Motivated by this, we propose a “weight-sharing regularization” penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\mathrm{prox}_{\mathcal{R}}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled while convolutional neural networks fail in this setting. Our code is available on \href{https://github.com/motahareh-sohrabi/weight-sharing-regularization}{github}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shakerinava24a.html
https://proceedings.mlr.press/v238/shakerinava24a.htmlInformation Theoretically Optimal Sample Complexity of Learning Dynamical Directed Acyclic GraphsIn this article, the optimal sample complexity of learning the underlying interactions or dependencies of a Linear Dynamical System (LDS) over a Directed Acyclic Graph (DAG) is studied. We call such a DAG underlying an LDS as dynamical DAG (DDAG). In particular, we consider a DDAG where the nodal dynamics are driven by unobserved exogenous noise sources that are wide-sense stationary (WSS) in time but are mutually uncorrelated, and have the same power spectral density (PSD). Inspired by the static DAG setting, a metric and an algorithm based on the PSD matrix of the observed time series are proposed to reconstruct the DDAG. It is shown that the optimal sample complexity (or length of state trajectory) needed to learn the DDAG is $n=\Theta(q\log(p/q))$, where $p$ is the number of nodes and $q$ is the maximum number of parents per node. To prove the sample complexity upper bound, a concentration bound for the PSD estimation is derived, under two different sampling strategies. A matching min-max lower bound using generalized Fano’s inequality also is provided, thus showing the order optimality of the proposed algorithm. The codes used in the paper are available at \url{https://github.com/Mishfad/Learning-Dynamical-DAGs}Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/shaikh-veedu24a.html
https://proceedings.mlr.press/v238/shaikh-veedu24a.htmlEstimation of partially known Gaussian graphical models with score-based structural priorsWe propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or maximum a posteriori approach using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments in different setups demonstrate the benefits of our approach.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sevilla24a.html
https://proceedings.mlr.press/v238/sevilla24a.htmlExtended Deep Adaptive Input Normalization for Preprocessing Time Series Data for Neural NetworksData preprocessing is a crucial part of any machine learning pipeline, and it can have a significant impact on both performance and training efficiency. This is especially evident when using deep neural networks for time series prediction and classification: real-world time series data often exhibit irregularities such as multi-modality, skewness and outliers, and the model performance can degrade rapidly if these characteristics are not adequately addressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input Normalization) layer, a novel adaptive neural layer that learns how to appropriately normalize irregular time series data for a given task in an end-to-end fashion, instead of using a fixed normalization scheme. This is achieved by optimizing its unknown parameters simultaneously with the deep neural network using back-propagation. Our experiments, conducted using synthetic data, a credit default prediction dataset, and a large-scale limit order book benchmark dataset, demonstrate the superior performance of the EDAIN layer when compared to conventional normalization methods and existing adaptive time series preprocessing layers.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/september24a.html
https://proceedings.mlr.press/v238/september24a.htmlMulti-Level Symbolic Regression: Function Structure Learning for Multi-Level DataSymbolic Regression (SR) is an approach which learns a closed-form function relating the predictors to the outcome in a dataset. Datasets are often multi-level (MuL), meaning that certain features can be used to split data into groups for analysis (we refer to these features as levels). The advantage of viewing datasets as MuL is that we can exploit the high similarity of data within a group. SR is well-suited for MuL datasets, in which the learnt function structure serves as ‘shared information’ between the groups while the learnt parameter values capture the unique relationships within each group. In this context, this paper makes three contributions: (i) We design an algorithm, Multi-level Symbolic Regression (MSR), which runs multiple parallel SR processes for each group and merges them to produce a single function structure. (ii) To tackle datasets that are not explicitly MuL, we develop a metric termed MLICC to select the best feature to serve as a level. (iii) We also release MSRBench, a database of MuL datasets (synthetic and real-world) which we developed and collated, that can be used to evaluate MSR. Our results and ablation studies demonstrate that MSR achieves a higher recovery rate and lower error on MSRBench compared to SOTA methods for SR and MuL datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sen-fong24a.html
https://proceedings.mlr.press/v238/sen-fong24a.htmlStructured Transforms Across Spaces with Cost-Regularized Optimal TransportMatching a source to a target probability measure is often solved by instantiating a linear optimal transport (OT) problem, parameterized by a ground cost function that quantifies discrepancy between points. When these measures live in the same metric space, the ground cost often defaults to its distance. When instantiated across two different spaces, however, choosing that cost in the absence of aligned data is a conundrum. As a result, practitioners often resort to solving instead a quadratic Gromow-Wasserstein (GW) problem. We exploit in this work a parallel between GW and cost-regularized OT, the regularized minimization of a linear OT objective parameterized by a ground cost. We use this cost-regularized formulation to match measures across two different Euclidean spaces, where the cost is evaluated between transformed source points and target points. We show that several quadratic OT problems fall in this category, and consider enforcing structure in linear transform (e.g., sparsity), by introducing structure-inducing regularizers. We provide a proximal algorithm to extract such transforms from unaligned data, and demonstrate its applicability to single-cell spatial transcriptomics/multiomics matching tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sebbouh24a.html
https://proceedings.mlr.press/v238/sebbouh24a.htmlAdaptive Quasi-Newton and Anderson Acceleration Framework with Explicit Global (Accelerated) Convergence RatesDespite the impressive numerical performance of the quasi-Newton and Anderson/nonlinear acceleration methods, their global convergence rates have remained elusive for over 50 years. This study addresses this long-standing issue by introducing a framework that derives novel, adaptive quasi-Newton and nonlinear/Anderson acceleration schemes. Under mild assumptions, the proposed iterative methods exhibit explicit, non-asymptotic convergence rates that blend those of the gradient descent and Cubic Regularized Newton’s methods. The proposed approach also includes an accelerated version for convex functions. Notably, these rates are achieved adaptively without prior knowledge of the function’s parameters. The framework presented in this study is generic, and its special cases include algorithms such as Newton’s method with random subspaces, finite differences, or lazy Hessian. Numerical experiments demonstrated the efficiency of the proposed framework, even compared to the l-BFGS algorithm with Wolfe line-search. The code used in the experiments is available on \url{https://github.com/windows7lover/QN_With_Guarantees}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/scieur24a.html
https://proceedings.mlr.press/v238/scieur24a.htmlMinimax Excess Risk of First-Order Methods for Statistical Learning with Data-Dependent OraclesIn this paper, our aim is to analyse the generalization capabilities of first-order methods for statistical learning in multiple, different yet related, scenarios including supervised learning, transfer learning, robust learning and federated learning. To do so, we provide sharp upper and lower bounds for the minimax excess risk of strongly convex and smooth statistical learning when the gradient is accessed through partial observations given by a data-dependent oracle. This novel class of oracles can query the gradient with any given data distribution, and is thus well suited to scenarios in which the training data distribution does not match the target (or test) distribution. In particular, our upper and lower bounds are proportional to the smallest mean square error achievable by gradient estimators, thus allowing us to easily derive multiple sharp bounds in the aforementioned scenarios using the extensive literature on parameter estimation.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/scaman24a.html
https://proceedings.mlr.press/v238/scaman24a.htmlError bounds for any regression model using Gaussian processes with gradient informationWe provide an upper bound for the expected quadratic loss on new data for any regression model. We derive the bound by modelling the underlying function by a Gaussian process (GP). Instead of a single kernel or family of kernels of the same form, we consider all GPs with translation-invariant and continuously twice differentiable kernels having a bounded signal variance and prior covariance of the gradient. To obtain a bound for the expected posterior loss, we present bounds for the posterior variance and squared bias. The squared bias bound depends on the regression model used, which can be arbitrary and not based on GPs. The bounds scale well with data size, in contrast to computing the GP posterior by a Cholesky factorisation of a large matrix. More importantly, our bounds do not require strong prior knowledge as we do not specify the exact kernel form. We validate our theoretical findings by numerical experiments and show that the bounds have applications in uncertainty estimation and concept drift detection.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/savvides24a.html
https://proceedings.mlr.press/v238/savvides24a.htmlImplicit Bias in Noisy-SGD: With Applications to Differentially Private TrainingTraining Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) often results in superior test performance compared to larger batches. This implicit bias is attributed to the specific noise structure inherent to SGD. When ensuring Differential Privacy (DP) in DNNs’ training, DP-SGD adds Gaussian noise to the clipped gradients. However, large-batch training still leads to a significant performance decrease, posing a challenge as strong DP guarantees necessitate the use of massive batches. Our study first demonstrates that this phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity, not the clipping, is responsible for this implicit bias, even with additional isotropic Gaussian noise. We then theoretically analyze the solutions obtained with continuous versions of Noisy-SGD for the Linear Least Square and Diagonal Linear Network settings. Our analysis reveals that the additional noise indeed amplifies the implicit bias. It suggests that the performance issues of private training stem from the same underlying principles as SGD, offering hope for improvements in large batch training strategies.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/sander24a.html
https://proceedings.mlr.press/v238/sander24a.htmlCausally Inspired Regularization Enables Domain General RepresentationsGiven a causal graph representing the data-generating process shared across different domains/distributions, enforcing sufficient graph-implied conditional independencies can identify domain-general (non-spurious) feature representations. For the standard input-output predictive setting, we categorize the set of graphs considered in the literature into two distinct groups: (i) those in which the empirical risk minimizer across training domains gives domain-general representations and (ii) those where it does not. For the latter case (ii), we propose a novel framework with regularizations, which we demonstrate are sufficient for identifying domain-general feature representations without a priori knowledge (or proxies) of the spurious features. Empirically, our proposed method is effective for both (semi) synthetic and real-world data, outperforming other state-of-the-art methods in average and worst-domain transfer accuracy.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/salaudeen24a.html
https://proceedings.mlr.press/v238/salaudeen24a.htmlTesting exchangeability by pairwise bettingIn this paper, we address the problem of testing exchangeability of a sequence of random variables, $X_1, X_2,\cdots$. This problem has been studied under the recently popular framework of \emph{testing by betting}. But the mapping of testing problems to game is not one to one: many games can be designed for the same test. Past work established that it is futile to play single game betting on every observation: test martingales in the data filtration are powerless. Two avenues have been explored to circumvent this impossibility: betting in a reduced filtration (wealth is a test martingale in a coarsened filtration), or playing many games in parallel (wealth is an e-process in the data filtration). The former has proved to be difficult to theoretically analyze, while the latter only works for binary or discrete observation spaces. Here, we introduce a different approach that circumvents both drawbacks. We design a new (yet simple) game in which we observe the data sequence in pairs. Even though betting on individual observations is futile, we show that betting on pairs of observations is not. To elaborate, we prove that our game leads to a nontrivial test martingale, which is interesting because it has been obtained by shrinking the filtration very slightly. We show that our test controls type-1 error despite continuous monitoring, and is consistent for both binary and continuous observations, under a broad class of alternatives. Due to the shrunk filtration, optional stopping is only allowed at even stopping times: a relatively minor price. We provide a variety of simulations that align with our theoretical findings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/saha24b.html
https://proceedings.mlr.press/v238/saha24b.htmlFaster Convergence with MultiWay PreferencesWe address the problem of convex optimization with preference feedback, where the goal is to minimize a convex function given a weaker form of comparison queries. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. Here we consider the sign-function-based comparison feedback model and analyze the convergence rates with batched and multiway (argmin of a set queried points) comparisons. Our main goal is to understand the improved convergence rates owing to parallelization in sign-feedback-based optimization problems. Our work is the first to study the problem of convex optimization with multiway preferences and analyze the optimal convergence rates. Our first contribution lies in designing efficient algorithms with a convergence rate of $\smash{\widetilde O}(\frac{d}{\min\{m,d\} \epsilon})$ for $m$-batched preference feedback where the learner can query $m$-pairs in parallel. We next study a $m$-multiway comparison (‘battling’) feedback, where the learner can get to see the argmin feedback of $m$-subset of queried points and show a convergence rate of $\smash{\widetilde O}(\frac{d}{ \min\{\log m,d\}\epsilon })$. We show further improved convergence rates with an additional assumption of strong convexity. Finally, we also study the convergence lower bounds for batched preferences and multiway feedback optimization showing the optimality of our convergence rates w.r.t. $m$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/saha24a.html
https://proceedings.mlr.press/v238/saha24a.htmlScalable Higher-Order Tensor Product Spline ModelsIn the current era of vast data and transparent machine learning, it is essential for techniques to operate at a large scale while providing a clear mathematical comprehension of the internal workings of the method. Although there already exist interpretable semi-parametric regression methods for large-scale applications that take into account non-linearity in the data, the complexity of the models is still often limited. One of the main challenges is the absence of interactions in these models, which are left out for the sake of better interpretability but also due to impractical computational costs. To overcome this limitation, we propose a new approach using a factorization method to derive a highly scalable higher-order tensor product spline model. Our method allows for the incorporation of all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We further develop a meaningful penalization scheme and examine the induced optimization problem. We conclude by evaluating the predictive and estimation performance of our method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ruegamer24a.html
https://proceedings.mlr.press/v238/ruegamer24a.htmlMind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware PriorsMachine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance—even when only retraining the final layer of a previously trained non-robust model. Group aware-priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/rudner24a.html
https://proceedings.mlr.press/v238/rudner24a.htmlElectronic Medical Records Assisted Digital Clinical Trial DesignRandomized controlled trials (RCTs) are gold standards for assessing intervention efficacy. Yet, generalizing evidence from classical RCTs can be challenging and sometimes problematic due to their limited external validity under stringent eligibility criteria and inadequate statistical power resulting from limited sample sizes under budgetary constraints. "Digital clinical trial," which utilizes digital technology and electronic medical records (EMRs) to expand eligibility criteria and enhance data collection efficiency, offers a promising concept for solving the above-mentioned conundrums encountered in classical RCTs. In this paper, we propose two novel digital clinical trial design strategies assisted by EMRs collected from diverse patient populations. On the one hand, leveraging digital technologies, our design strategies adaptively modify both the eligibility criteria and treatment assignment mechanism to enhance data collection efficiency. As a result, evidence gathered from our design can possess greater statistical power. On the other hand, since EMRs capture diverse patient populations and provide large sample sizes, our design not only broadens the trial’s eligibility criteria but also enhances its statistical power, enabling us to collect more generalizable evidence with boosted statistical power for evaluating intervention efficacy than classical RCTs. We demonstrate the validity and merit of the proposed designs with detailed theoretical investigation, simulation studies, and a synthetic case study.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ruan24a.html
https://proceedings.mlr.press/v238/ruan24a.htmlIntegrating Uncertainty Awareness into Conformalized Quantile RegressionConformalized Quantile Regression (CQR) is a recently proposed method for constructing prediction intervals for a response $Y$ given covariates $X$, without making distributional assumptions. However, existing constructions of CQR can be ineffective for problems where the quantile regressors perform better in certain parts of the feature space than others. The reason is that the prediction intervals of CQR do not distinguish between two forms of uncertainty: first, the variability of the conditional distribution of $Y$ given $X$ (i.e., aleatoric uncertainty), and second, our uncertainty in estimating this conditional distribution (i.e., epistemic uncertainty). This can lead to intervals that are overly narrow in regions where epistemic uncertainty is high. To address this, we propose a new variant of the CQR methodology, Uncertainty-Aware CQR (UACQR), that explicitly separates these two sources of uncertainty to adjust quantile regressors differentially across the feature space. Compared to CQR, our methods enjoy the same distribution-free theoretical coverage guarantees, while demonstrating in our experiments stronger conditional coverage properties in simulated settings and real-world data sets alike.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/rossellini24a.html
https://proceedings.mlr.press/v238/rossellini24a.htmlIntrinsic Gaussian Vector Fields on ManifoldsVarious applications ranging from robotics to climate science require modeling signals on non-Euclidean domains, such as the sphere. Gaussian process models on manifolds have recently been proposed for such tasks, in particular when uncertainty quantification is needed. In the manifold setting, vector-valued signals can behave very differently from scalar-valued ones, with much of the progress so far focused on modeling the latter. The former, however, are crucial for many applications, such as modeling wind speeds or force fields of unknown dynamical systems. In this paper, we propose novel Gaussian process models for vector-valued signals on manifolds that are intrinsically defined and account for the geometry of the space in consideration. We provide computational primitives needed to deploy the resulting Hodge-Matérn Gaussian vector fields on the two-dimensional sphere and the hypertori. Further, we highlight two generalization directions: discrete two-dimensional meshes and "ideal" manifolds like hyperspheres, Lie groups, and homogeneous spaces. Finally, we show that our Gaussian vector fields constitute considerably more refined inductive biases than the extrinsic fields proposed before.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/robert-nicoud24a.html
https://proceedings.mlr.press/v238/robert-nicoud24a.htmlSimulating weighted automata over sequences and trees with transformersTransformers are ubiquitous models in the natural language processing (NLP) community and have shown impressive empirical successes in the past few years. However, little is understood about how they reason and the limits of their computational capabilities. These models do not process data sequentially, and yet outperform sequential neural models such as RNNs. Recent work has shown that these models can compactly simulate the sequential reasoning abilities of deterministic finite automata (DFAs). This leads to the following question: can transformers simulate the reasoning of more complex finite state machines? In this work, we show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTA), a generalization of weighted automata to tree structured inputs. We prove these claims formally and provide upper bounds on the size of the transformer models needed as a function of the number of states of the target automata. Empirically, we perform synthetic experiments showing that transformers are able to learn these compact solutions via standard gradient-based training.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/rizvi-martel24a.html
https://proceedings.mlr.press/v238/rizvi-martel24a.htmlConstant or Logarithmic Regret in Asynchronous Multiplayer Bandits with Limited CommunicationMultiplayer bandits have recently garnered significant attention due to their relevance in cognitive radio networks. While the existing body of literature predominantly focuses on synchronous players, real-world radio networks, such as those in IoT applications, often feature asynchronous (i.e., randomly activated) devices. This highlights the need for addressing the more challenging asynchronous multiplayer bandits problem. Our first result shows that a natural extension of UCB achieves a minimax regret of $\mathcal{O}(\sqrt{T\log(T)})$ in the centralized setting. More significantly, we introduce Cautious Greedy, which uses $\mathcal{O}(\log(T))$ communications and whose instance-dependent regret is constant if the optimal policy assigns at least one player to each arm (a situation proven to occur when arm means are sufficiently close). Otherwise, the regret is, as usual, $\log(T)$ times the sum of some inverse sub-optimality gaps. We substantiate the optimality of Cautious Greedy through lower-bound analysis based on data-dependent terms. Therefore, we establish a strong baseline for asynchronous multiplayer bandits, at least with $\mathcal{O}(\log(T))$ communications.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/richard24a.html
https://proceedings.mlr.press/v238/richard24a.htmlSinkhorn Flow as Mirror Flow: A Continuous-Time Framework for Generalizing the Sinkhorn AlgorithmMany problems in machine learning can be formulated as solving entropy-regularized optimal transport on the space of probability measures. The canonical approach involves the Sinkhorn iterates, renowned for their rich mathematical properties. Recently, the Sinkhorn algorithm has been recast within the mirror descent framework, thus benefiting from classical optimization theory insights. Here, we build upon this result by introducing a continuous-time analogue of the Sinkhorn algorithm. This perspective allows us to derive novel variants of Sinkhorn schemes that are robust to noise and bias. Moreover, our continuous-time dynamics offers a unified perspective on several recently discovered dynamics in machine learning and mathematics, such as the "Wasserstein mirror flow" of Deb et al. (2023) or the "mean-field Schrödinger equation" of Claisse et al. (2023).Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/reza-karimi24a.html
https://proceedings.mlr.press/v238/reza-karimi24a.htmlUnderstanding the Generalization Benefits of Late Learning Rate DecayWhy do neural networks trained with large learning rates for longer time often lead to better generalization? In this paper, we delve into this question by examining the relation between training and testing loss in neural networks. Through visualization of these losses, we note that the training trajectory with a large learning rate navigates through the minima manifold of the training loss, finally nearing the neighborhood of the testing loss minimum. Motivated by these findings, we introduce a nonlinear model whose loss landscapes mirror those observed for real neural networks. Upon investigating the training process using SGD on our model, we demonstrate that an extended phase with a large learning rate steers our model towards the minimum norm solution of the training loss, which may achieve near-optimal generalization, thereby affirming the empirically observed benefits of late learning rate decay.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ren24c.html
https://proceedings.mlr.press/v238/ren24c.htmlMulti-objective Optimization via Wasserstein-Fisher-Rao Gradient FlowMulti-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives with widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a “dominance potential” to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts of complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ren24b.html
https://proceedings.mlr.press/v238/ren24b.htmlLearning Adaptive Kernels for Statistical Independence TestsWe propose a novel framework for kernel-based statistical independence tests that enable adaptatively learning parameterized kernels to maximize test power. Our framework can effectively address the pitfall inherent in the existing signal-to-noise ratio criterion by modeling the change of the null distribution during the learning process. Based on the proposed framework, we design a new class of kernels that can adaptatively focus on the significant dimensions of variables to judge independence, which makes the tests more flexible than using simple kernels that are adaptive only in length-scale, and especially suitable for high-dimensional complex data. Theoretically, we demonstrate the consistency of our independence tests, and show that the non-convex objective function used for learning fits the L-smoothing condition, thus benefiting the optimization. Experimental results on both synthetic and real data show the superiority of our method. The source code and datasets are available at \url{https://github.com/renyixin666/HSIC-LK.git}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ren24a.html
https://proceedings.mlr.press/v238/ren24a.htmlEM for Mixture of Linear Regression with Clustered DataModern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from m batches of dependent samples each containing n measurements. Discarding the clustered structure in the mixture model, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)}$). In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as m grows up as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/reisizadeh24a.html
https://proceedings.mlr.press/v238/reisizadeh24a.htmlBeyond Bayesian Model Averaging over Paths in Probabilistic Programs with Stochastic SupportThe posterior in probabilistic programs with stochastic support decomposes as a weighted sum of the local posterior distributions associated with each possible program path. We show that making predictions with this full posterior implicitly performs a Bayesian model averaging (BMA) over paths. This is potentially problematic, as BMA weights can be unstable due to model misspecification or inference approximations, leading to sub-optimal predictions in turn. To remedy this issue, we propose alternative mechanisms for path weighting: one based on stacking and one based on ideas from PAC-Bayes. We show how both can be implemented as a cheap post-processing step on top of existing inference engines. In our experiments, we find them to be more robust and lead to better predictions compared to the default BMA weights.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/reichelt24a.html
https://proceedings.mlr.press/v238/reichelt24a.htmlCategorical Generative Model Evaluation via Synthetic Distribution CoarseningAs we expect to see a rapid integration of generative models in our day to day lives, the development of rigorous methods of evaluation and analysis for generative models has never been more pressing. Multiple works have highlighted the shortcomings of widely used metrics and exposed how they fail to behave as expected in some settings. So far, the response has been to use a variety of metrics that target different desirable and interpretable properties such as fidelity, diversity, and authenticity, to obtain a clearer picture of a generative model’s capabilities. These methods mainly focus on ordinal data and they all suffer from the same unavoidable issues stemming from estimating quantities of high-dimensional data from a limited number of samples. We propose to take an alternative approach and to return to the synthetic data setting where the ground truth is explicit and known. We focus on nominal categorical data and introduce an evaluation method that can scale to the high-dimensional settings often encountered in practice. Our method involves successively binning the large space to obtain smaller probability spaces and coarser distributions where meaningful statistical estimates can be obtained. This allows us to provide probabilistic guarantees and sample complexities and we illustrate how our method can be applied to distinguish between the capabilities of several state-of-the-art categorical models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/regol24a.html
https://proceedings.mlr.press/v238/regol24a.htmlDifferentially Private Reward Estimation with Preference FeedbackLearning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests. Instead of relying on numerical rewards, the generative models are trained using reinforcement learning with human feedback (RLHF). These approaches first solicit feedback from human labelers typically in the form of pairwise comparisons between two possible actions, then estimate a reward model using these comparisons, and finally employ a policy based on the estimated reward model. An adversarial attack in any step of the above pipeline might reveal private and sensitive information of human labelers. In this work, we adopt the notion of \emph{label differential privacy} (DP) and focus on the problem of reward estimation from preference-based feedback while protecting privacy of each individual labelers. Specifically, we consider the parametric Bradley-Terry-Luce (BTL) model for such pairwise comparison feedback involving a latent reward parameter $\theta^* \in \mathbb{R}^d$. Within a standard minimax estimation framework, we provide tight upper and lower bounds on the error in estimating $\theta^*$ under both \emph{local} and \emph{central} models of DP. We show, for a given privacy budget $\epsilon$ and number of samples $n$, that the additional cost to ensure label-DP under local model is $\Theta \big(\frac{1}{ e^\epsilon-1}\sqrt{\frac{d}{n}}\big)$, while it is $\Theta\big(\frac{\sqrt{d}}{\epsilon n} \big)$ under the weaker central model. We perform simulations on synthetic data that corroborate these theoretical results.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ray-chowdhury24a.html
https://proceedings.mlr.press/v238/ray-chowdhury24a.htmlCylindrical Thompson Sampling for High-Dimensional Bayesian OptimizationMany industrial and scientific applications require optimization of one or more objectives by tuning dozens or hundreds of input parameters. While Bayesian optimization has been a popular approach for the efficient optimization of blackbox functions, its performance decreases drastically as the dimensionality of the search space increases (i.e., above twenty). Recent advancements in high-dimensional Bayesian optimization (HDBO) seek to mitigate this issue through techniques such as adaptive local search with trust regions or dimensionality reduction using random embeddings. In this paper, we provide a close examination of these advancements and show that sampling strategy plays a prominent role and is key to tackling the curse-of-dimensionality. We then propose cylindrical Thompson sampling (CTS), a novel strategy that can be integrated into single- and multi-objective HDBO algorithms. We demonstrate this by integrating CTS as a modular component in state-of-the-art HDBO algorithms. We verify the effectiveness of CTS on both synthetic and real-world high-dimensional problems, and show that CTS largely enhances existing HDBO methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/rashidi24a.html
https://proceedings.mlr.press/v238/rashidi24a.htmlPreventing Arbitrarily High Confidence on Far-Away Data in Point-Estimated Discriminative Neural NetworksDiscriminatively trained, deterministic neural networks are the de facto choice for classification problems. However, even though they achieve state-of-the-art results on in-domain test sets, they tend to be overconfident on out-of-distribution (OOD) data. For instance, ReLU networks—a popular class of neural network architectures—have been shown to almost always yield high confidence predictions when the test data are far away from the training set, even when they are trained with OOD data. We overcome this problem by adding a term to the output of the neural network that corresponds to the logit of an extra class, that we design to dominate the logits of the original classes as we move away from the training data. This technique provably prevents arbitrarily high confidence on far-away test data while maintaining a simple discriminative point-estimate training. Evaluation on various benchmarks demonstrates strong performance against competitive baselines on both far-away and realistic OOD data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/rashid24a.html
https://proceedings.mlr.press/v238/rashid24a.htmlBlockBoost: Scalable and Efficient Blocking through BoostingAs datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ramos24a.html
https://proceedings.mlr.press/v238/ramos24a.htmlCommunication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved RatesByzantine robustness is an essential feature of algorithms for certain distributed optimization problems, typically encountered in collaborative/federated learning. These problems are usually huge-scale, implying that communication compression is also imperative for their resolution. These factors have spurred recent algorithmic and theoretical developments in the literature of Byzantine-robust learning with compression. In this paper, we contribute to this research area in two main directions. First, we propose a new Byzantine-robust method with compression – Byz-DASHA-PAGE – and prove that the new method has better convergence rate (for non-convex and Polyak-Lojasiewicz smooth optimization problems), smaller neighborhood size in the heterogeneous case, and tolerates more Byzantine workers under over-parametrization than the previous method with SOTA theoretical convergence guarantees (Byz-VR-MARINA). Secondly, we develop the first Byzantine-robust method with communication compression and error feedback – Byz-EF21 – along with its bi-directional compression version – Byz-EF21-BC – and derive the convergence rates for these methods for non-convex and Polyak-Lojasiewicz smooth case. We test the proposed methods and illustrate our theoretical findings in the numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/rammal24a.html
https://proceedings.mlr.press/v238/rammal24a.htmlSubmodular Minimax Optimization: Finding Effective SetsDespite the rich existing literature about minimax optimization in continuous settings, only very partial results of this kind have been obtained for combinatorial settings. In this paper, we fill this gap by providing a characterization of submodular minimax optimization, the problem of finding a set (for either the min or the max player) that is effective against every possible response. We show when and under what conditions we can find such sets. We also demonstrate how minimax submodular optimization provides robust solutions for downstream machine learning applications such as (i) prompt engineering in large language models, (ii) identifying robust waiting locations for ride-sharing, (iii) kernelization of the difficulty of instances of the last setting, and (iv) finding adversarial images. Our experiments show that our proposed algorithms consistently outperform other baselines.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/raed-mualem24a.html
https://proceedings.mlr.press/v238/raed-mualem24a.htmlA General Algorithm for Solving Rank-one Matrix SensingMatrix sensing has many real-world applications in science and engineering, such as system control, distance embedding, and computer vision. The goal of matrix sensing is to recover a matrix $A_\star \in \mathbb{R}^{n \times n}$, based on a sequence of measurements $(u_i,b_i) \in \mathbb{R}^{n} \times \mathbb{R}$ such that $u_i^\top A_\star u_i = b_i$. Previous work (Zhong et al., 2015) focused on the scenario where matrix $A_{\star}$ has a small rank, e.g. rank-$k$. Their analysis heavily relies on the RIP assumption, making it unclear how to generalize to high-rank matrices. In this paper, we relax that rank-$k$ assumption and solve a much more general matrix sensing problem. Given an accuracy parameter $\delta \in (0,1)$, we can compute $A \in \mathbb{R}^{n \times n}$ in $\widetilde{O}(m^{3/2} n^2 \delta^{-1} )$, such that $ |u_i^\top A u_i - b_i| \leq \delta$ for all $i \in [m]$. We design an efficient algorithm with provable convergence guarantees using stochastic gradient descent for this problem.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/qin24a.html
https://proceedings.mlr.press/v238/qin24a.htmlExploring the Power of Graph Neural Networks in Solving Linear Optimization ProblemsRecently, machine learning, particularly message-passing graph neural networks (MPNNs), has gained traction in enhancing exact optimization algorithms. For example, MPNNs speed up solving mixed-integer optimization problems by imitating computational intensive heuristics like strong branching, which entails solving multiple linear optimization problems (LPs). Despite the empirical success, the reasons behind MPNNs’ effectiveness in emulating linear optimization remain largely unclear. Here, we show that MPNNs can simulate standard interior-point methods for LPs, explaining their practical success. Furthermore, we highlight how MPNNs can serve as a lightweight proxy for solving LPs, adapting to a given problem instance distribution. Empirically, we show that MPNNs solve LP relaxations of standard combinatorial optimization problems close to optimality, often surpassing conventional solvers and competing approaches in solving time.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/qian24a.html
https://proceedings.mlr.press/v238/qian24a.htmlImproving Robustness via Tilted Exponential Layer: A Communication-Theoretic PerspectiveState-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization with suitable data augmentation. In this paper, we propose a complementary approach motivated by communication theory, aimed at enhancing the signal-to-noise ratio at the output of a neural network layer via neural competition during learning and inference. In addition to standard empirical risk minimization, neurons compete to sparsely represent layer inputs by maximization of a tilted exponential (TEXP) objective function for the layer. TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. Inference in a TEXP layer is accomplished by replacing batch norm by a tilted softmax, which can be interpreted as computation of posterior probabilities for the competing signaling hypotheses represented by each neuron. After providing insights via simplified models, we show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise and other common corruptions, without requiring data augmentation. Further cumulative gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with data augmentation techniques. The code for all our experiments is available at \url{https://github.com/bhagyapuranik/texp_for_robustness}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/puranik24a.html
https://proceedings.mlr.press/v238/puranik24a.htmlBreaking the Heavy-Tailed Noise Barrier in Stochastic Optimization ProblemsWe consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $O(K^{-2(\alpha - 1) / \alpha})$, when the stochastic gradients have finite $\alpha$-th moment, $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/puchkin24a.html
https://proceedings.mlr.press/v238/puchkin24a.htmlConsistent and Asymptotically Unbiased Estimation of Proper Calibration ErrorsProper scoring rules evaluate the quality of probabilistic predictions, playing an essential role in the pursuit of accurate and well-calibrated models. Every proper score decomposes into two fundamental components – proper calibration error and refinement – utilizing a Bregman divergence. While uncertainty calibration has gained significant attention, current literature lacks a general estimator for these quantities with known statistical properties. To address this gap, we propose a method that allows consistent, and asymptotically unbiased estimation of all proper calibration errors and refinement terms. In particular, we introduce Kullback-Leibler calibration error, induced by the commonly used cross-entropy loss. As part of our results, we prove the relation between refinement and f-divergences, which implies information monotonicity in neural networks, regardless of which proper scoring rule is optimized. Our experiments validate empirically the claimed properties of the proposed estimator and suggest that the selection of a post-hoc calibration method should be determined by the particular calibration error of interest.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/popordanoska24a.html
https://proceedings.mlr.press/v238/popordanoska24a.htmlEfficient Conformal Prediction under Data HeterogeneityConformal prediction (CP) stands out as a robust framework for uncertainty quantification, which is crucial for ensuring the reliability of predictions. However, common CP methods heavily rely on the data exchangeability, a condition often violated in practice. Existing approaches for tackling non-exchangeability lead to methods that are not computable beyond the simplest examples. In this work, we introduce a new efficient approach to CP that produces provably valid confidence sets for fairly general non-exchangeable data distributions. We illustrate the general theory with applications to the challenging setting of federated learning under data heterogeneity between agents. Our method allows constructing provably valid personalized prediction sets for agents in a fully federated way. The effectiveness of the proposed method is demonstrated in a series of experiments on real-world datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/plassier24a.html
https://proceedings.mlr.press/v238/plassier24a.htmlOn the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing ProblemWe introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that universal (adversary-unaware) backdoor detection is impossible, except for very small alphabet sizes. Thus, we argue, that backdoor detection methods need to be either explicitly, or implicitly adversary-aware. However, our work does not imply that backdoor detection cannot work in specific scenarios, as evidenced by successful backdoor detection methods in the scientific literature. Furthermore, we connect our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pichler24a.html
https://proceedings.mlr.press/v238/pichler24a.htmlImportance Matching Lemma for Lossy Compression with Side InformationWe propose two extensions to existing importance sampling based methods for lossy compression. First, we introduce an importance sampling based compression scheme that is a variant of ordered random coding (Theis and Ahmed, 2022) and is amenable to direct evaluation of the achievable compression rate for a finite number of samples. Our second and major contribution is the \emph{importance matching lemma}, which is a finite proposal counterpart of the recently introduced {Poisson matching lemma} (Li and Anantharam, 2021). By integrating with deep learning, we provide a new coding scheme for distributed lossy compression with side information at the decoder. We demonstrate the effectiveness of the proposed scheme through experiments involving synthetic Gaussian sources, distributed image compression with MNIST and vertical federated learning with CIFAR-10.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/phan24a.html
https://proceedings.mlr.press/v238/phan24a.htmlTraining a Tucker Model With Shared Factors: a Riemannian Optimization ApproachFactorization of a matrix into a product of two rectangular factors, is a classic tool in various machine learning applications. Tensor factorizations generalize this concept to more than two dimensions. In applications, where some of the tensor dimensions have the same size or encode the same objects (e.g., knowledge graphs with entity-relation-entity 3D tensors), it can also be beneficial for the respective factors to be shared. In this paper, we consider a well-known Tucker tensor factorization and study its properties under the shared factor constraint. We call it a shared-factor Tucker factorization (SF-Tucker). Since sharing factors breaks polylinearity of classical tensor factorizations, common optimization schemes such as alternating least squares become inapplicable. Nevertheless, as we show in this paper, a set of fixed-rank SF-Tucker tensors preserves a Riemannian manifold structure. Therefore, we develop efficient algorithms for the main ingredients of Riemannian optimization on the SF-Tucker manifold and implement a Riemannian optimization method with momentum. We showcase the benefits of our approach on several machine learning tasks including knowledge graph completion and compression of neural networks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/peshekhonov24a.html
https://proceedings.mlr.press/v238/peshekhonov24a.htmlEquivariant bootstrapping for uncertainty quantification in imaging inverse problemsScientific imaging problems are often severely ill-posed and hence have significant intrinsic uncertainty. Accurately quantifying the uncertainty in the solutions to such problems is therefore critical for the rigorous interpretation of experimental results as well as for reliably using the reconstructed images as scientific evidence. Unfortunately, existing imaging methods are unable to quantify the uncertainty in the reconstructed images in a way that is robust to experiment replications. This paper presents a new uncertainty quantification methodology based on an equivariant formulation of the parametric bootstrap algorithm that leverages symmetries and invariance properties commonly encountered in imaging problems. Additionally, the proposed methodology is general and can be easily applied with any image reconstruction technique, including unsupervised training strategies that can be trained from observed data alone, thus enabling uncertainty quantification in situations where there is no ground truth data available. We demonstrate the proposed approach with a series of experiments and comparisons with alternative state-of-the-art uncertainty quantification strategies. In all our experiments, the proposed equivariant bootstrap delivers remarkably accurate high-dimensional confidence regions and outperforms the competing approaches in terms of estimation accuracy, uncertainty quantification accuracy, and computing time. These empirical findings are supported by a detailed theoretical analysis of equivariant bootstrap for linear estimators.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pereyra24a.html
https://proceedings.mlr.press/v238/pereyra24a.htmlVector Quantile Regression on ManifoldsQuantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on an Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate and geological phenomena), and tori (dihedral angles in proteins). By leveraging optimal transport theory and c-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets and likelihoods. We demonstrate the approach’s efficacy and provide insights regarding the meaning of non-Euclidean quantiles through synthetic and real data experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pegoraro24a.html
https://proceedings.mlr.press/v238/pegoraro24a.htmlOn learning history-based policies for controlling Markov decision processesReinforcement learning (RL) folklore suggests that methods of function approximation based on history, such as recurrent neural networks or state abstractions that include past information, outperform those without memory, because function approximation in Markov decision processes (MDP) can lead to a scenario akin to dealing with a partially observable MDP (POMDP). However, formal analysis of history-based algorithms has been limited, with most existing frameworks concentrating on features without historical context. In this paper, we introduce a theoretical framework to examine the behaviour of RL algorithms that control an MDP using feature abstraction mappings based on historical data. Additionally, we leverage this framework to develop a practical RL algorithm and assess its performance across various continuous control tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/patil24b.html
https://proceedings.mlr.press/v238/patil24b.htmlFailures and Successes of Cross-Validation for Early-Stopped Gradient DescentWe analyze the statistical properties of generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient descent (GD) in high-dimensional least squares regression. We prove that GCV is generically inconsistent as an estimator of the prediction risk of early-stopped GD, even for a well-specified linear model with isotropic features. In contrast, we show that LOOCV converges uniformly along the GD trajectory to the prediction risk. Our theory requires only mild assumptions on the data distribution and does not require the underlying regression function to be linear. Furthermore, by leveraging the individual LOOCV errors, we construct consistent estimators for the entire prediction error distribution along the GD trajectory and consistent estimators for a wide class of error functionals. This in particular enables the construction of pathwise prediction intervals based on GD iterates that have asymptotically correct nominal coverage conditional on the training data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/patil24a.html
https://proceedings.mlr.press/v238/patil24a.htmlConformal Contextual Robust OptimizationData-driven approaches to predict-then-optimize decision-making problems seek to mitigate the risk of uncertainty region misspecification in safety-critical settings. Current approaches, however, suffer from considering overly conservative uncertainty regions, often resulting in suboptimal decision-making. To this end, we propose Conformal-Predict-Then-Optimize (CPO), a framework for leveraging highly informative, nonconvex conformal prediction regions over high-dimensional spaces based on conditional generative models, which have the desired distribution-free coverage guarantees. Despite guaranteeing robustness, such black-box optimization procedures alone inspire little confidence owing to the lack of explanation of why a particular decision was found to be optimal. We, therefore, augment CPO to additionally provide semantically meaningful visual summaries of the uncertainty regions to give qualitative intuition for the optimal decision. We highlight the CPO framework by demonstrating results on a suite of simulation-based inference benchmark tasks and a vehicle routing task based on probabilistic weather prediction.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/patel24a.html
https://proceedings.mlr.press/v238/patel24a.htmlSum-max Submodular BanditsMany online decision-making problems correspond to maximizing a sequence of submodular functions. In this work, we introduce sum-max functions, a subclass of monotone submodular functions capturing several interesting problems, including best-of-$K$-bandits, combinatorial bandits, and the bandit versions on $M$-medians and hitting sets. We show that all functions in this class satisfy a key property that we call pseudo-concavity. This allows us to prove $\big(1 - \frac{1}{e}\big)$-regret bounds for bandit feedback in the nonstochastic setting of the order of $\sqrt{MKT}$ (ignoring log factors), where $T$ is the time horizon and $M$ is a cardinality constraint. This bound, attained by a simple and efficient algorithm, significantly improves on the $\widetilde{\mathcal{O}}\big(T^{2/3}\big)$ regret bound for online monotone submodular maximization with bandit feedback. We also extend our results to a bandit version of the facility location problem.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pasteris24a.html
https://proceedings.mlr.press/v238/pasteris24a.htmlCousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine LearningMeasuring diversity accurately is important for many scientific fields, including machine learning (ML), ecology, and chemistry. The Vendi Score was introduced as a generic similarity-based diversity metric that extends the Hill number of order $q=1$ by leveraging ideas from quantum statistical mechanics. Contrary to many diversity metrics in ecology, the Vendi Score accounts for similarity and does not require knowledge of the prevalence of the categories in the collection to be evaluated for diversity. However, the Vendi Score treats each item in a given collection with a level of sensitivity proportional to the item’s prevalence. This is undesirable in settings where there is a significant imbalance in item prevalence. In this paper, we extend the other Hill numbers using similarity to provide flexibility in allocating sensitivity to rare or common items. This leads to a family of diversity metrics–Vendi scores with different levels of sensitivity controlled by the order $q$–that can be used in a variety of applications. We study the properties of the scores in a synthetic controlled setting where the ground truth diversity is known. We then test the utility of the Vendi scores in improving molecular simulations via Vendi Sampling. Finally, we use the scores to better understand the behavior of image generative models in terms of memorization, duplication, diversity, and sample quality.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pasarkar24a.html
https://proceedings.mlr.press/v238/pasarkar24a.htmlDensity Uncertainty Layers for Reliable Uncertainty EstimationAssessing the predictive uncertainty of deep neural networks is crucial for safety-related applications of deep learning. Although Bayesian deep learning offers a principled framework for estimating model uncertainty, the common approaches that approximate the parameter posterior often fail to deliver reliable estimates of predictive uncertainty. In this paper, we propose a novel criterion for reliable predictive uncertainty: a model’s predictive variance should be grounded in the empirical density of the input. That is, the model should produce higher uncertainty for inputs that are improbable in the training data and lower uncertainty for inputs that are more probable. To operationalize this criterion, we develop the density uncertainty layer, a stochastic neural network architecture that satisfies the density uncertain criterion by design. We study density uncertainty layers on the UCI and CIFAR-10/100 uncertainty benchmarks. Compared to existing approaches, density uncertainty layers provide more reliable uncertainty estimates and robust out-of-distribution detection performance.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/park24a.html
https://proceedings.mlr.press/v238/park24a.htmlSafe and Interpretable Estimation of Optimal Treatment RegimesRecent advancements in statistical and reinforcement learning methods have contributed to superior patient care strategies. However, these methods face substantial challenges in high-stakes contexts, including missing data, stochasticity, and the need for interpretability and patient safety. Our work operationalizes a safe and interpretable approach for optimizing treatment regimes by matching patients with similar medical and pharmacological profiles. This allows us to construct optimal policies via interpolation. Our comprehensive simulation study demonstrates our method’s effectiveness in complex scenarios. We use this approach to study seizure treatment in critically ill patients, advocating for personalized strategies based on medical history and pharmacological features. Our findings recommend reducing medication doses for mild, brief seizure episodes and adopting aggressive treatment strategies for severe cases, leading to improved outcomes.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/parikh24a.html
https://proceedings.mlr.press/v238/parikh24a.htmlLeveraging Continuous Time to Understand Momentum When Training Diagonal Linear NetworksIn this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/papazov24a.html
https://proceedings.mlr.press/v238/papazov24a.htmlDeep anytime-valid hypothesis testingWe propose a general framework for constructing powerful, sequential hypothesis tests for a large class of nonparametric testing problems. The null hypothesis for these problems is defined in an abstract form using the action of two known operators on the data distribution. This abstraction allows for a unified treatment of several classical tasks, such as two-sample testing, independence testing, and conditional-independence testing, as well as modern problems, such as testing for adversarial robustness of machine learning (ML) models. Our proposed framework has the following advantages over classical batch tests: 1) it continuously monitors online data streams and efficiently aggregates evidence against the null, 2) it provides tight control over the type I error without the need for multiple testing correction, 3) it adapts the sample size requirement to the unknown hardness of the problem. We develop a principled approach of leveraging the representation capability of ML models within the testing-by-betting framework, a game-theoretic approach for designing sequential tests. Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines on several tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pandeva24a.html
https://proceedings.mlr.press/v238/pandeva24a.htmlAn Online Bootstrap for Time SeriesResampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications. This method is based on an autoregressive sequence of increasingly dependent resampling weights. We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/palm24a.html
https://proceedings.mlr.press/v238/palm24a.htmlSample-Efficient Personalization: Modeling User Parameters as Low Rank Plus Sparse ComponentsPersonalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation systems. Standard personalization approaches involve learning a user/domain specific \emph{embedding} that is fed into a fixed global model which can be limiting. On the other hand, personalizing/fine-tuning model itself for each user/domain — a.k.a meta-learning — has high storage/infrastructure cost. Moreover, rigorous theoretical studies of scalable personalization approaches have been very limited. To address the above issues, we propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse components. This captures common information from multiple individuals/users together in the low-rank part while sparse part captures user-specific idiosyncrasies. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose a computationally efficient alternating minimization method with iterative hard thresholding — AMHT-LRS — to learn the low-rank and sparse part. Theoretically, for the realizable Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal sample complexity. Finally, a significant challenge in personalization is ensuring privacy of each user’s sensitive data. We alleviate this problem by proposing a differentially private variant of our method that also is equipped with strong generalization guarantees.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/pal24a.html
https://proceedings.mlr.press/v238/pal24a.htmlThompson Sampling Itself is Differentially PrivateIn this work we first show that the classical Thompson sampling algorithm for multi-arm bandits is differentially private as-is, without any modification. We provide per-round privacy guarantees as a function of problem parameters and show composition over $T$ rounds; since the algorithm is unchanged, existing $O(\sqrt{NT\log N})$ regret bounds still hold and there is no loss in performance due to privacy. We then show that simple modifications – such as pre-pulling all arms a fixed number of times, increasing the sampling variance – can provide tighter privacy guarantees. We again provide privacy guarantees that now depend on the new parameters introduced in the modification, which allows the analyst to tune the privacy guarantee as desired. We also provide a novel regret analysis for this new algorithm, and show how the new parameters also impact expected regret. Finally, we empirically validate and illustrate our theoretical findings in two parameter regimes and demonstrate that tuning the new parameters substantially improve the privacy-regret tradeoff.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ou24a.html
https://proceedings.mlr.press/v238/ou24a.htmlLow-rank MDPs with Continuous Action SpacesLow-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/oprescu24a.html
https://proceedings.mlr.press/v238/oprescu24a.htmlThink Global, Adapt Local: Learning Locally Adaptive K-Nearest Neighbor Kernel Density EstimatorsKernel density estimation (KDE) is a powerful technique for non-parametric density estimation, yet practical use of KDE-based methods remains limited by insufficient representational flexibility, especially for higher-dimensional data. Contrary to KDE, K-nearest neighbor (KNN) density estimation procedures locally adapt the density based on the K-nearest neighborhood, but unfortunately only provide asymptotically correct density estimates. We present the KNN-KDE method introducing observation-specific kernels for KDE that are locally adapted through priors defined by the covariance of the K-nearest neighborhood, forming a fully Bayesian model with exact density estimates. We further derive a scalable inference procedure that infers parameters through variational inference by optimizing the predictive likelihood exploiting sparsity, batched optimization, and parallel computation for massive inference speedups. We find that KNN-KDE provides valid density estimates superior to conventional KDE and KNN density estimation on both synthetic and real data sets. We further observe that the bayesian KNN-KDE even outperforms recent neural density estimation procedures on two of the five considered real data sets. The KNN-KDE unifies conventional kernel and KNN density estimation providing a scalable, generic and accurate framework for density estimation.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/olsen24a.html
https://proceedings.mlr.press/v238/olsen24a.htmlOn the connection between Noise-Contrastive Estimation and Contrastive DivergenceNoise-contrastive estimation (NCE) is a popular method for estimating unnormalised probabilistic models, such as energy-based models, which are effective for modelling complex data distributions. Unlike classical maximum likelihood (ML) estimation that relies on importance sampling (resulting in ML-IS) or MCMC (resulting in contrastive divergence, CD), NCE uses a proxy criterion to avoid the need for evaluating an often intractable normalisation constant. Despite apparent conceptual differences, we show that two NCE criteria, ranking NCE (RNCE) and conditional NCE (CNCE), can be viewed as ML estimation methods. Specifically, RNCE is equivalent to ML estimation combined with conditional importance sampling, and both RNCE and CNCE are special cases of CD. These findings bridge the gap between the two method classes and allow us to apply techniques from the ML-IS and CD literature to NCE, offering several advantageous extensions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/olmin24a.html
https://proceedings.mlr.press/v238/olmin24a.htmlFaster Recalibration of an Online Predictor via ApproachabilityPredictive models in ML need to be trustworthy and reliable, which often at the very least means outputting calibrated probabilities. This can be particularly difficult to guarantee in the online prediction setting when the outcome sequence can be generated adversarially. In this paper we introduce a technique using Blackwell’s approachability theorem for taking an online predictive model which might not be calibrated and transforming its predictions to calibrated predictions without much increase to the loss of the original model. Our proposed algorithm achieves calibration and accuracy at a faster rate than existing techniques (Kuleshov and Ermon, 2017) and is the first algorithm to offer a flexible tradeoff between calibration error and accuracy in the online setting. We demonstrate this by characterizing the space of jointly achievable calibration and regret using our technique.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/okoroafor24a.html
https://proceedings.mlr.press/v238/okoroafor24a.htmlFair Machine Unlearning: Data Removal while Mitigating DisparitiesThe Right to be Forgotten is a core principle outlined by regulatory frameworks such as the EU’s General Data Protection Regulation (GDPR). This principle allows individuals to request that their personal data be deleted from deployed machine learning models. While "forgetting" can be naively achieved by retraining on the remaining dataset, it is computationally expensive to do to so with each new request. As such, several machine unlearning methods have been proposed as efficient alternatives to retraining. These methods aim to approximate the predictive performance of retraining, but fail to consider how unlearning impacts other properties critical to real-world applications such as fairness. In this work, we demonstrate that most efficient unlearning methods cannot accommodate popular fairness interventions, and we propose the first fair machine unlearning method that can efficiently unlearn data instances from a fair objective. We derive theoretical results which demonstrate that our method can provably unlearn data and provably maintain fairness performance. Extensive experimentation with real-world datasets highlight the efficacy of our method at unlearning data instances while preserving fairness. Code is provided at https://github.com/AI4LIFE-GROUP/fair-unlearning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/oesterling24a.html
https://proceedings.mlr.press/v238/oesterling24a.htmlLeveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection BiasSelf-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, \texttt{softmax} prediction probabilities are often used as a confidence measure, although they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraints. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities. The code is available at https://github.com/ambroiseodt/tsim.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/odonnat24a.html
https://proceedings.mlr.press/v238/odonnat24a.htmlALAS: Active Learning for Autoconversion Rates Prediction from Satellite DataHigh-resolution simulations, such as the ICOsahedral Non-hydrostatic Large-Eddy Model (ICON-LEM), provide valuable insights into the complex interactions among aerosols, clouds, and precipitation, which are the major contributors to climate change uncertainty. However, due to their exorbitant computational costs, they can only be employed for a limited period and geographical area. To address this, we propose a more cost-effective method powered by an emerging machine learning approach to better understand the intricate dynamics of the climate system. Our approach involves active learning techniques by leveraging high-resolution climate simulation as an oracle that is queried based on an abundant amount of unlabeled data drawn from satellite observations. In particular, we aim to predict autoconversion rates, a crucial step in precipitation formation, while significantly reducing the need for a large number of labeled instances. In this study, we present novel methods: custom fusion query strategies for labeling instances – weight fusion (WiFi) and merge fusion (MeFi) – along with active feature selection based on SHAP. These methods are designed to tackle real-world challenges – in this case, climate change, with a specific focus on the prediction of autoconversion rates – due to their simplicity and practicality in application.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/novitasari24a.html
https://proceedings.mlr.press/v238/novitasari24a.htmlWhy is parameter averaging beneficial in SGD? An objective smoothing perspectiveIt is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nitanda24a.html
https://proceedings.mlr.press/v238/nitanda24a.htmlCorruption-Robust Offline Two-Player Zero-Sum Markov GamesWe study data corruption robustness in offline two-player zero-sum Markov games. Given a dataset of realized trajectories of two players, an adversary is allowed to modify an $\epsilon$-fraction of it. The learner’s goal is to identify an approximate Nash Equilibrium policy pair from the corrupted data. We consider this problem in linear Markov games under different degrees of data coverage and corruption. We start by providing an information-theoretic lower bound on the suboptimality gap of any learner. Next, we propose robust versions of the Pessimistic Minimax Value Iteration algorithm (Zhong et al., 2022), both under coverage on the corrupted data and under coverage only on the clean data, and show that they achieve (near)-optimal suboptimality gap bounds with respect to $\epsilon$. We note that we are the first to provide such a characterization of the problem of learning approximate Nash Equilibrium policies in offline two-player zero-sum Markov games under data corruption.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nika24a.html
https://proceedings.mlr.press/v238/nika24a.htmlMixture-of-Linear-Experts for Long-term Time Series ForecastingLong-term time series forecasting (LTSF) aims to predict future values of a time series given the past values. The current state-of-the-art (SOTA) on this problem is attained in some cases by linear-centric models, which primarily feature a linear mapping layer. However, due to their inherent simplicity, they are not able to adapt their prediction rules to periodic changes in time series patterns. To address this challenge, we propose a Mixture-of-Experts-style augmentation for linear-centric models and propose Mixture-of-Linear-Experts (MoLE). Instead of training a single model, MoLE trains multiple linear-centric models (i.e., experts) and a router model that weighs and mixes their outputs. While the entire framework is trained end-to-end, each expert learns to specialize in a specific temporal pattern, and the router model learns to compose the experts adaptively. Experiments show that MoLE reduces forecasting error of linear-centric models, including DLinear, RLinear, and RMLP, in over 78% of the datasets and settings we evaluated. By using MoLE existing linear-centric models can achieve SOTA LTSF results in 68% of the experiments that PatchTST reports and we compare to, whereas existing single-head linear-centric models achieve SOTA results in only 25% of cases.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ni24a.html
https://proceedings.mlr.press/v238/ni24a.htmlNear-optimal Per-Action Regret Bounds for Sleeping BanditsWe derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts’ confidences. We prove a lower bound, showing that for any minimax optimal algorithms, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nguyen24d.html
https://proceedings.mlr.press/v238/nguyen24d.htmlFrom Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based ApproachWe propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Under the view of coupled oscillators, we first show the connection between Kuramoto model and basic GNN and then over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to still reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nguyen24c.html
https://proceedings.mlr.press/v238/nguyen24c.htmlTowards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of ExpertsOriginally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, a satisfactory level of theoretical understanding of the MoE model is far from complete. To shed new light on this problem, we provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model. The main challenge of that analysis comes from the inclusion of covariates in the Gaussian gating functions and expert networks, which leads to their intrinsic interaction via some partial differential equations with respect to their parameters. We tackle these issues by designing novel Voronoi loss functions among parameters to accurately capture the heterogeneity of parameter estimation rates. Our findings reveal that the MLE has distinct behaviors under two complement settings of location parameters of the Gaussian gating functions, namely when all these parameters are non-zero versus when at least one among them vanishes. Notably, these behaviors can be characterized by the solvability of two different systems of polynomial equations. Finally, we conduct a simulation study to empirically verify our theoretical results.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nguyen24b.html
https://proceedings.mlr.press/v238/nguyen24b.htmlOn Parameter Estimation in Deviated Gaussian Mixture of ExpertsWe consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y| X)+ \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},\sigma_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, \sigma_{i}^{\ast})$ for $1 \leq i \leq k^{\ast}$ are unknown parameters of the Gaussian mixture of experts. This problem arises from the goodness-of-fit test when we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or they are generated from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein, a loss function being commonly used for estimating parameters in the Gaussian mixture of experts.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nguyen24a.html
https://proceedings.mlr.press/v238/nguyen24a.htmlCAD-DA: Controllable Anomaly Detection after Domain Adaptation by Statistical InferenceWe propose a novel statistical method for testing the results of anomaly detection (AD) under domain adaptation (DA), which we call CAD-DA—controllable AD under DA. The distinct advantage of the CAD-DA lies in its ability to control the probability of misidentifying anomalies under a pre-specified level $\alpha$ (e.g., 0.05). The challenge within this DA setting is the necessity to account for the influence of DA to ensure the validity of the inference results. We overcome the challenge by leveraging the concept of Selective Inference to handle the impact of DA. To our knowledge, this is the first work capable of conducting a valid statistical inference within the context of DA. We evaluate the performance of the CAD-DA method on both synthetic and real-world datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nguyen-le-duy24a.html
https://proceedings.mlr.press/v238/nguyen-le-duy24a.htmlConfident Feature RankingMachine learning models are widely applied in various fields. Stakeholders often use post-hoc feature importance methods to better understand the input features’ contribution to the models’ predictions. The interpretation of the importance values provided by these methods is frequently based on the relative order of the features (their ranking) rather than the importance values themselves. Since the order may be unstable, we present a framework for quantifying the uncertainty in global importance values. We propose a novel method for the post-hoc interpretation of feature importance values that is based on the framework and pairwise comparisons of the feature importance values. This method produces simultaneous confidence intervals for the features’ ranks, which include the “true” (infinite sample) ranks with high probability, and enables the selection of the set of top-k important features.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/neuhof24a.html
https://proceedings.mlr.press/v238/neuhof24a.htmlStochastic Frank-Wolfe: Unified Analysis and Zoo of Special CasesThe Conditional Gradient (or Frank-Wolfe) method is one of the most well-known methods for solving constrained optimization problems appearing in various machine learning tasks. The simplicity of iteration and applicability to many practical problems helped the method to gain popularity in the community. In recent years, the Frank-Wolfe algorithm received many different extensions, including stochastic modifications with variance reduction and coordinate sampling for training of huge models or distributed variants for big data problems. In this paper, we present a unified convergence analysis of the Stochastic Frank-Wolfe method that covers a large number of particular practical cases that may have completely different nature of stochasticity, intuitions and application areas. Our analysis is based on a key parametric assumption on the variance of the stochastic gradients. But unlike most works on unified analysis of other methods, such as SGD, we do not assume an unbiasedness of the real gradient estimation. We conduct analysis for convex and non-convex problems due to the popularity of both cases in machine learning. With this general theoretical framework, we not only cover rates of many known methods, but also develop numerous new methods. This shows the flexibility of our approach in developing new algorithms based on the Conditional Gradient approach. We also demonstrate the properties of the new methods through numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nazykov24a.html
https://proceedings.mlr.press/v238/nazykov24a.htmlOn the Misspecification of Linear Assumptions in Synthetic ControlsThe synthetic control (SC) method is popular for estimating causal effects from observational panel data. It rests on a crucial assumption that we can write the treated unit as a linear combination of the untreated units. In practice, this assumption may not hold, and when violated, the resulting SC estimates are incorrect. This paper examines two questions: (1) How large can the misspecification error be? (2) How can we minimize it? First, we provide theoretical bounds to quantify the misspecification error. The bounds are comforting: small misspecifications induce small errors. With these bounds in hand, we develop new SC estimators specially designed to minimize misspecification error. The estimators are based on additional data about each unit. (E.g., if the units are countries, it might be demographic information about each.) We study our estimators on synthetic data; we find they produce more accurate causal estimates than standard SC. We then re-analyze the California tobacco program data of the original SC paper, now including additional data from the US census about per-state demographics. Our estimators show that the observations in the pre-treatment period lie within the bounds of misspecification error and that observations post-treatment lie outside of those bounds. This is evidence that our SC methods have uncovered a true effect.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nazaret24a.html
https://proceedings.mlr.press/v238/nazaret24a.htmlTime to Cite: Modeling Citation Networks using the Dynamic Impact Single-Event Embedding ModelUnderstanding the structure and dynamics of scientific research, i.e., the science of science (SciSci), has become an important area of research in order to address imminent questions including how scholars interact to advance science, how disciplines are related and evolve, and how research impact can be quantified and predicted. Central to the study of SciSci has been the analysis of citation networks. Here, two prominent modeling methodologies have been employed: one is to assess the citation impact dynamics of papers using parametric distributions, and the other is to embed the citation networks in a latent space optimal for characterizing the static relations between papers in terms of their citations. Interestingly, citation networks are a prominent example of single-event dynamic networks, i.e., networks for which each dyad only has a single event (i.e., the point in time of citation). We presently propose a novel likelihood function for the characterization of such single-event networks. Using this likelihood, we propose the Dynamic Impact Single-Event Embedding model (DISEE). The DISEE model characterizes the scientific interactions in terms of a latent distance model in which random effects account for citation heterogeneity while the time-varying impact is characterized using existing parametric representations for assessment of dynamic impact. We highlight the proposed approach on several real citation networks finding that DISEE well reconciles static latent distance network embedding approaches with classical dynamic impact assessments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nakis24a.html
https://proceedings.mlr.press/v238/nakis24a.htmlWarped Diffusion for Latent Differentiation InferenceThis paper proposes a Bayesian nonparametric diffusion model with a black-box warping function represented by a Gaussian process to infer potential diffusion structures latent in observed data, such as differentiation mechanisms of living cells and phylogenetic evolution processes of media information. In general, the task of inferring latent differentiation structures is very difficult to handle due to two interrelated settings. One is that the conversion mechanism between hidden structure and often higher dimensional observations is unknown (and is a complex mechanism). The other is that the topology of the hidden diffuse structure itself is unknown. Therefore, in this paper, we propose a BNP-based strategy as a natural way to deal with these two challenging settings simultaneously. Specifically, as an extension of the Gaussian process latent variable model, we propose a model in which the black box transformation from latent variable space to observed data space is represented by a Gaussian process, and introduce a BNP diffusion model for the latent variable space. We show its application to the visualization of the diffusion structure of media information and to the task of inferring cell differentiation structure from single-cell gene expression levels.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nakano24a.html
https://proceedings.mlr.press/v238/nakano24a.htmlFixed-Budget Real-Valued Combinatorial Pure Exploration of Multi-Armed BanditWe study the real-valued combinatorial pure exploration of the multi-armed bandit in the fixed-budget setting. We first introduce an algorithm named the Combinatorial Successive Asign (CSA) algorithm, which is the first algorithm that can identify the best action even when the size of the action class is exponentially large with respect to the number of arms. We show that the upper bound of the probability of error of the CSA algorithm matches a lower bound up to a logarithmic factor in the exponent. Then, we introduce another algorithm named the Minimax Combinatorial Successive Accepts and Rejects (Minimax-CombSAR) algorithm for the case where the size of the action class is polynomial, and show that it is optimal, which matches a lower bound. Finally, we experimentally compare the algorithms with previous methods and show that our algorithm performs better.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/nakamura24a.html
https://proceedings.mlr.press/v238/nakamura24a.htmlNon-vacuous Generalization Bounds for Adversarial Risk in Stochastic Neural NetworksAdversarial examples are manipulated samples used to deceive machine learning models, posing a serious threat in safety-critical applications. Existing safety certificates for machine learning models are limited to individual input examples, failing to capture generalization to unseen data. To address this limitation, we propose novel generalization bounds based on the PAC-Bayesian and randomized smoothing frameworks, providing certificates that predict the model’s performance and robustness on unseen test samples based solely on the training data. We present an effective procedure to train and compute the first non-vacuous generalization bounds for neural networks in adversarial settings. Experimental results on the widely recognized MNIST and CIFAR-10 datasets demonstrate the efficacy of our approach, yielding the first robust risk certificates for stochastic convolutional neural networks under the $L_2$ threat model. Our method offers valuable tools for evaluating model susceptibility to real-world adversarial risks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mustafa24a.html
https://proceedings.mlr.press/v238/mustafa24a.htmlLP-based Construction of DC Decompositions for Efficient Inference of Markov Random FieldsThe success of the convex-concave procedure (CCCP), a widely used technique for non-convex optimization, crucially depends on finding a decomposition of the objective function as a difference of convex functions (dcds). Despite the widespread applicability of CCCP, finding such dcds has attracted little attention in machine learning. For graphical models with polynomial potentials, existing methods for finding dcds require solving a Sum-of-Squares (SOS) program, which is often prohibitively expensive. In this work, we leverage tools from algebraic geometry certifying the positivity of polynomials, to derive LP-based constructions of dcds of polynomials which are particularly suited for graphical model inference. Our experiments demonstrate that using our LP-based technique constructs dcds for polynomial potentials of Markov random fields significantly faster compared to SOS-based approaches used in previous works.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/murti24a.html
https://proceedings.mlr.press/v238/murti24a.htmlSPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic BanditsIn this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a \textit{target} policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting with the knowledge of noise variances. This design minimizes the mean squared error (MSE) of the estimated value of the target policy and is termed the oracle design. Since the noise variance is typically unknown, we then introduce a novel algorithm, SPEED (\textbf{S}tructured \textbf{P}olicy \textbf{E}valuation \textbf{E}xperimental \textbf{D}esign), that tracks the oracle design and derive its regret with respect to the oracle design. We show that regret scales as $\widetilde{O}_{}(d^3 n^{-3/2})$ and prove a matching lower bound of $\Omega(d^2 n^{-3/2})$. Finally, we evaluate SPEED on a set of policy evaluation tasks and demonstrate that it achieves MSE comparable to an optimal oracle and much lower than simply running the target policy.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mukherjee24a.html
https://proceedings.mlr.press/v238/mukherjee24a.htmlScalable Algorithms for Individual Preference Stable ClusteringIn this paper, we study the individual preference (IP) stability, which is an notion capturing individual fairness and stability in clustering. Within this setting, a clustering is $\alpha$-IP stable when each data point’s average distance to its cluster is no more than $\alpha$ times its average distance to any other cluster. In this paper, we study the natural local search algorithm for IP stable clustering. Our analysis confirms a $O(\log n)$-IP stability guarantee for this algorithm, where $n$ denotes the number of points in the input. Furthermore, by refining the local search approach, we show it runs in an almost linear time, $\tilde{O}(nk)$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mosenzon24a.html
https://proceedings.mlr.press/v238/mosenzon24a.htmlOn the price of exact truthfulness in incentive-compatible online learning with bandit feedback: a regret lower bound for WSU-UXIn one view of the classical game of prediction with expert advice with binary outcomes, in each round, each expert maintains an adversarially chosen belief and honestly reports this belief. We consider a recently introduced, strategic variant of this problem with selfish (reputation-seeking) experts, where each expert strategically reports in order to maximize their expected future reputation based on their belief. In this work, our goal is to design an algorithm for the selfish experts problem that is incentive-compatible (IC, or \emph{truthful}), meaning each expert’s best strategy is to report truthfully, while also ensuring the algorithm enjoys sublinear regret with respect to the expert with the best belief. Freeman et al. (2020) recently studied this problem in the full information and bandit settings and obtained truthful, no-regret algorithms by leveraging prior work on wagering mechanisms. While their results under full information match the minimax rate for the classical ("honest experts") problem, the best-known regret for their bandit algorithm WSU-UX is $O(T^{2/3})$, which does not match the minimax rate for the classical ("honest bandits") setting. It was unclear whether the higher regret was an artifact of their analysis or a limitation of WSU-UX. We show, via explicit construction of loss sequences, that the algorithm suffers a worst-case $\Omega(T^{2/3})$ lower bound. Left open is the possibility that a different IC algorithm obtains $O(\sqrt{T})$ regret. Yet, WSU-UX was a natural choice for such an algorithm owing to the limited design room for IC algorithms in this setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mortazavi24a.html
https://proceedings.mlr.press/v238/mortazavi24a.htmlDifferentiable Rendering with Reparameterized Volume SamplingIn view synthesis, a neural radiance field approximates underlying density and radiance fields based on a sparse set of scene pictures. To generate a pixel of a novel view, it marches a ray through the pixel and computes a weighted sum of radiance emitted from a dense set of ray points. This rendering algorithm is fully differentiable and facilitates gradient-based optimization of the fields. However, in practice, only a tiny opaque portion of the ray contributes most of the radiance to the sum. We propose a simple end-to-end differentiable sampling algorithm based on inverse transform sampling. It generates samples according to the probability distribution induced by the density field and picks non-transparent points on the ray. We utilize the algorithm in two ways. First, we propose a novel rendering approach based on Monte Carlo estimates. This approach allows for evaluating and optimizing a neural radiance field with just a few radiance field calls per ray. Second, we use the sampling algorithm to modify the hierarchical scheme proposed in the original NeRF work. We show that our modification improves reconstruction quality of hierarchical models, at the same time simplifying the training procedure by removing the need for auxiliary proposal network losses.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/morozov24a.html
https://proceedings.mlr.press/v238/morozov24a.htmlEfficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror DescentMany machine learning tasks can be solved by minimizing a convex function of an occupancy measure over the policies that generate them. These include reinforcement learning, imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning problem (CURL). Since CURL invalidates classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to achieve convergence guarantees and a simple closed-form solution, eliminating the need for computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, the online version Greedy MD-CURL benefits from low computational complexity, while guaranteeing sub-linear or even logarithmic regret, depending on the level of information available on the underlying dynamics.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/moreno24a.html
https://proceedings.mlr.press/v238/moreno24a.htmlSDEs for Minimax OptimizationMinimax optimization problems have attracted a lot of attention over the past few years, with applications ranging from economics to machine learning. While advanced optimization methods exist for such problems, characterizing their dynamics in stochastic scenarios remains notably challenging. In this paper, we pioneer the use of stochastic differential equations (SDEs) to analyze and compare Minimax optimizers. Our SDE models for Stochastic Gradient Descent-Ascent, Stochastic Extragradient, and Stochastic Hamiltonian Gradient Descent are provable approximations of their algorithmic counterparts, clearly showcasing the interplay between hyperparameters, implicit regularization, and implicit curvature-induced noise. This perspective also allows for a unified and simplified analysis strategy based on the principles of Itô calculus. Finally, our approach facilitates the derivation of convergence conditions and closed-form solutions for the dynamics in simplified settings, unveiling further insights into the behavior of different optimizers.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/monzio-compagnoni24a.html
https://proceedings.mlr.press/v238/monzio-compagnoni24a.htmlImproved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision ProcessesWe consider the problem of designing sample efficient learning algorithms for infinite horizon discounted reward Markov Decision Process. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}({\epsilon^{-2}})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mondal24a.html
https://proceedings.mlr.press/v238/mondal24a.htmlFederated Learning For Heterogeneous Electronic Health Records Utilising Augmented Temporal Graph Attention NetworksThe proliferation of decentralised electronic healthcare records (EHRs) across medical institutions requires innovative federated learning strategies for collaborative data analysis and global model training, prioritising data privacy. A prevalent issue during decentralised model training is the data-view discrepancies across medical institutions that arises from differences or availability of healthcare services, such as blood test panels. The prevailing way to handle this issue is to select a common subset of features across institutions to make data-views consistent. This approach, however, constrains some institutions to shed some critical features that may play a significant role in improving the model performance. This paper introduces a federated learning framework that relies on augmented graph attention networks to address data-view heterogeneity. The proposed framework utilises an alignment augmentation layer over self-attention mechanisms to weigh the importance of neighbouring nodes when updating a node’s embedding irrespective of the data-views. Furthermore, our framework adeptly addresses both the temporal nuances and structural intricacies of EHR datasets. This dual capability not only offers deeper insights but also effectively encapsulates EHR graphs’ time-evolving nature. Using diverse real-world datasets, we show that the proposed framework significantly outperforms conventional FL methodology for dealing with heterogeneous data-views.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/molaei24a.html
https://proceedings.mlr.press/v238/molaei24a.htmlMaximum entropy GFlowNets with soft Q-learningGenerative Flow Networks (GFNs) have emerged as a powerful tool for sampling discrete objects from unnormalized distributions, offering a scalable alternative to Markov Chain Monte Carlo (MCMC) methods. While GFNs draw inspiration from maximum entropy reinforcement learning (RL), the connection between the two has largely been unclear and seemingly applicable only in specific cases. This paper addresses the connection by constructing an appropriate reward function, thereby establishing an exact relationship between GFNs and maximum entropy RL. This construction allows us to introduce maximum entropy GFNs, which achieve the maximum entropy attainable by GFNs without constraints on the state space, in contrast to GFNs with uniform backward policy.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mohammadpour24a.html
https://proceedings.mlr.press/v238/mohammadpour24a.htmlFaithful graphical representations of local independenceGraphical models use graphs to represent conditional independence structure in the distribution of a random vector. In stochastic processes, graphs may represent so-called local independence or conditional Granger causality. Under some regularity conditions, a local independence graph implies a set of independences using a graphical criterion known as delta-separation, or using its generalization, mu-separation. This is a stochastic process analogue of d-separation in DAGs. However, there may be more independences than implied by this graph and this is a violation of so-called faithfulness. We characterize faithfulness in local independence graphs and give a method to construct a faithful graph from any local independence model such that the output equals the true graph when Markov and faithfulness assumptions hold. We discuss various assumptions that are weaker than faithfulness, and we explore different structure learning algorithms and their properties under varying assumptions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mogensen24a.html
https://proceedings.mlr.press/v238/mogensen24a.htmlAn Impossibility Theorem for Node EmbeddingWith the increasing popularity of graph-based methods for dimensionality reduction and representation learning, node embedding functions have become important objects of study in the literature. In this paper, we take an axiomatic approach to understanding node embedding methods. Motivated by desirable properties of node embeddings for encoding the role of a node in the structure of a network, we first state three properties for embedding dissimilarity networks. We then prove that no node embedding method can satisfy all three properties at once, reflecting fundamental difficulties inherent to the task. Having identified these difficulties, we show that mild relaxations of these axioms allow for certain node embedding methods to be admissible.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mitchell-roddenberry24a.html
https://proceedings.mlr.press/v238/mitchell-roddenberry24a.htmlLength independent PAC-Bayes bounds for Simple RNNsWhile the practical interest of Recurrent neural networks (RNNs) is attested, much remains to be done to develop a thorough theoretical understanding of their abilities, particularly in what concerns their learning capacities. A powerful framework to tackle this question is the one of PAC-Bayes theory, which allows one to derive bounds providing guarantees on the expected performance of learning models on unseen data. In this paper, we provide an extensive study on the conditions leading to PAC-Bayes bounds for non-linear RNNs that are independent of the length of the data. The derivation of our results relies on a perturbation analysis on the weights of the network. We prove bounds that hold for \emph{$\beta$-saturated} and \emph{DS $\beta$-saturated} SRNs, classes of RNNs we introduce to formalize saturation regimes of RNNs. The first regime corresponds to the case where the values of the hidden state of the SRN are always close to the boundaries of the activation functions. The second one, closely related to practical observations, only requires that it happens at least once in each component of the hidden state on a sliding window of a given size.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mitarchuk24a.html
https://proceedings.mlr.press/v238/mitarchuk24a.htmlEfficient Low-Dimensional Compression of Overparameterized ModelsIn this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involve decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/min-kwon24a.html
https://proceedings.mlr.press/v238/min-kwon24a.htmlFALCON: FLOP-Aware Combinatorial Optimization for Neural Network PruningThe increasing computational demands of modern neural networks present deployment challenges on resource-constrained devices. Network pruning offers a solution to reduce model size and computational cost while maintaining performance. However, current pruning methods focus primarily on improving sparsity by reducing the number of nonzero parameters, often neglecting other deployment costs such as inference time, which are closely related to the number of floating-point operations (FLOPs). In this paper, we propose {FALCON}, a novel combinatorial-optimization-based framework for network pruning that jointly takes into account model accuracy (fidelity), FLOPs, and sparsity constraints. A main building block of our approach is an integer linear program (ILP) that simultaneously handles FLOP and sparsity constraints. We present a novel algorithm to approximately solve the ILP. We propose a novel first-order method for our optimization framework which makes use of our ILP solver. Using problem structure (e.g., the low-rank structure of approx. Hessian), we can address instances with millions of parameters. Our experiments demonstrate that {FALCON} achieves superior accuracy compared to other pruning approaches within a fixed FLOP budget. For instance, for ResNet50 with 20% of the total FLOPs retained, our approach improves the accuracy by 48% relative to state-of-the-art. Furthermore, in gradual pruning settings with re-training between pruning steps, our framework outperforms existing pruning methods, emphasizing the significance of incorporating both FLOP and sparsity constraints for effective network pruning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/meng24a.html
https://proceedings.mlr.press/v238/meng24a.htmlEfficient Graph Laplacian Estimation by Proximal NewtonThe Laplacian-constrained Gaussian Markov Random Field (LGMRF) is a common multivariate statistical model for learning a weighted sparse dependency graph from given data. This graph learning problem can be formulated as a maximum likelihood estimation (MLE) of the precision matrix, subject to Laplacian structural constraints, with a sparsity-inducing penalty term. This paper aims to solve this learning problem accurately and efficiently. First, since the commonly used $\ell_1$-norm penalty is inappropriate in this setting and may lead to a complete graph, we employ the nonconvex minimax concave penalty (MCP), which promotes sparse solutions with lower estimation bias. Second, as opposed to existing first-order methods for this problem, we develop a second-order proximal Newton approach to obtain an efficient solver, utilizing several algorithmic features, such as using conjugate gradients, preconditioning, and splitting to active/free sets. Numerical experiments demonstrate the advantages of the proposed method in terms of both computational complexity and graph learning accuracy compared to existing methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/medvedovsky24a.html
https://proceedings.mlr.press/v238/medvedovsky24a.htmlSequential Monte Carlo for Inclusive KL Minimization in Amortized Variational InferenceFor training an encoder network to perform amortized variational inference, the Kullback-Leibler (KL) divergence from the exact posterior to its approximation, known as the inclusive or forward KL, is an increasingly popular choice of variational objective due to the mass-covering property of its minimizer. However, minimizing this objective is challenging. A popular existing approach, Reweighted Wake-Sleep (RWS), suffers from heavily biased gradients and a circular pathology that results in highly concentrated variational distributions. As an alternative, we propose SMC-Wake, a procedure for fitting an amortized variational approximation that uses likelihood-tempered sequential Monte Carlo samplers to estimate the gradient of the inclusive KL divergence. We propose three gradient estimators, all of which are asymptotically unbiased in the number of iterations and two of which are strongly consistent. Our method interleaves stochastic gradient updates, SMC samplers, and iterative improvement to an estimate of the normalizing constant to reduce bias from self-normalization. In experiments with both simulated and real datasets, SMC-Wake fits variational distributions that approximate the posterior more accurately than existing methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mcnamara24a.html
https://proceedings.mlr.press/v238/mcnamara24a.htmlAnytime-Constrained Reinforcement LearningWe introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an arbitrarily accurate approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or the absolute budget. Given our hardness results, our approximation guarantees are the best possible under worst-case analysis.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mcmahan24a.html
https://proceedings.mlr.press/v238/mcmahan24a.htmlAn Improved Algorithm for Learning Drifting Discrete DistributionsWe present a new adaptive algorithm for learning discrete distributions under distribution drift. In this setting, we observe a sequence of independent samples from a discrete distribution that is changing over time, and the goal is to estimate the current distribution. Since we have access to only a single sample for each time step, a good estimation requires a careful choice of the number of past samples to use. To use more samples, we must resort to samples further in the past, and we incur a drift error due to the bias introduced by the change in distribution. On the other hand, if we use a small number of past samples, we incur a large statistical error as the estimation has a high variance. We present a novel adaptive algorithm that can solve this trade-off without any prior knowledge of the drift. Unlike previous adaptive results, our algorithm characterizes the statistical error using data-dependent bounds. This technicality enables us to overcome the limitations of the previous work that require a fixed finite support whose size is known in advance and that cannot change over time. Additionally, we can obtain tighter bounds depending on the complexity of the drifting distribution, and also consider distributions with infinite support.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mazzetto24a.html
https://proceedings.mlr.press/v238/mazzetto24a.htmlRobust Approximate Sampling via Stochastic Gradient Barker DynamicsStochastic Gradient (SG) Markov Chain Monte Carlo algorithms (MCMC) are popular algorithms for Bayesian sampling in the presence of large datasets. However, they come with little theoretical guarantees and assessing their empirical performances is non-trivial. In such context, it is crucial to develop algorithms that are robust to the choice of hyperparameters and to gradients heterogeneity since, in practice, both the choice of step-size and behaviour of target gradients induce hard-to-control biases in the invariant distribution. In this work we introduce the stochastic gradient Barker dynamics (SGBD) algorithm, extending the recently developed Barker MCMC scheme, a robust alternative to Langevin-based sampling algorithms, to the stochastic gradient framework. We characterize the impact of stochastic gradients on the Barker transition mechanism and develop a bias-corrected version that, under suitable assumptions, eliminates the error due to the gradient noise in the proposal. We illustrate the performance on a number of high-dimensional examples, showing that SGBD is more robust to hyperparameter tuning and to irregular behavior of the target gradients compared to the popular stochastic gradient Langevin dynamics algorithm.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mauri24a.html
https://proceedings.mlr.press/v238/mauri24a.htmlAcceleration and Implicit Regularization in Gaussian Phase RetrievalWe study accelerated optimization methods in the Gaussian phase retrieval problem. In this setting, we prove that gradient methods with Polyak or Nesterov momentum have similar implicit regularization to gradient descent. This implicit regularization ensures that the algorithms remain in a nice region, where the cost function is strongly convex and smooth despite being nonconvex in general. This ensures that these accelerated methods achieve faster rates of convergence than gradient descent. Experimental evidence demonstrates that the accelerated methods converge faster than gradient descent in practice.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/maunu24a.html
https://proceedings.mlr.press/v238/maunu24a.htmlAchieving Group Distributional Robustness and Minimax Group Fairness with Interpolating ClassifiersGroup distributional robustness optimization methods (GDRO) learn models that guarantee performance across a broad set of demographics. GDRO is often framed as a minimax game where an adversary proposes data distributions under which the model performs poorly; importance weights are used to mimic the adversarial distribution on finite samples. Prior work has show that applying GDRO with interpolating classifiers requires strong regularization to generalize to unseen data. Moreover, these classifiers are not responsive to importance weights in the asymptotic training regime. In this work we propose Bi-level GDRO, a provably convergent formulation that decouples the adversary’s and model learner’s objective and improves generalization guarantees. To address non-responsiveness of importance weights, we combine Bi-level GDRO with a learner that optimizes a temperature-scaled loss that can provably trade off performance between demographics, even on interpolating classifiers. We experimentally demonstrate the effectiveness of our proposed method on learning minimax classifiers on a variety of datasets. Code is available at github.com/MartinBertran/BiLevelGDRO.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/martinez24a.html
https://proceedings.mlr.press/v238/martinez24a.htmlOn the Impact of Overparameterization on the Training of a Shallow Neural Network in High DimensionsWe study the training dynamics of a shallow neural network with quadratic activation functions and quadratic cost in a teacher-student setup. In line with previous works on the same neural architecture, the optimization is performed following the gradient flow on the population risk, where the average over data points is replaced by the expectation over their distribution, assumed to be Gaussian. We first derive convergence properties for the gradient flow and quantify the overparameterization that is necessary to achieve a strong signal recovery. Then, assuming that the teachers and the students at initialization form independent orthonormal families, we derive a high-dimensional limit for the flow and show that the minimal overparameterization is sufficient for strong recovery. We verify by numerical experiments that these results hold for more general initializations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/martin24a.html
https://proceedings.mlr.press/v238/martin24a.htmlPolicy Learning for Localized Interventions from Observational DataA largely unaddressed problem in causal inference is that of learning reliable policies in continuous, high-dimensional treatment variables from observational data. Especially in the presence of strong confounding, it can be infeasible to learn the entire heterogeneous response surface from treatment to outcome. It is also not particularly useful, when there are practical constraints on the size of the interventions altering the observational treatments. Since it tends to be easier to learn the outcome for treatments near existing observations, we propose a new framework for evaluating and optimizing the effect of small, tailored, and localized interventions that nudge the observed treatment assignments. Our doubly robust effect estimator plugs into a policy learner that stays within the interventional scope by optimal transport. Consequently, the error of the total policy effect is restricted to prediction errors nearby the observational distribution, rather than the whole response surface.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/marmarelis24a.html
https://proceedings.mlr.press/v238/marmarelis24a.htmlTheoretically Grounded Loss Functions and Algorithms for Score-Based Multi-Class AbstentionLearning with abstention is a key scenario where the learner can abstain from making a prediction at some cost. In this paper, we analyze the score-based formulation of learning with abstention in the multi-class classification setting. We introduce new families of surrogate losses for the abstention loss function, which include the state-of-the-art surrogate losses in the single-stage setting and a novel family of loss functions in the two-stage setting. We prove strong non-asymptotic and hypothesis set-specific consistency guarantees for these surrogate losses, which upper-bound the estimation error of the abstention loss function in terms of the estimation error of the surrogate loss. Our bounds can help compare different score-based surrogates and guide the design of novel abstention algorithms by minimizing the proposed surrogate losses. We experimentally evaluate our new algorithms on CIFAR-10, CIFAR-100, and SVHN datasets and the practical significance of our new surrogate losses and two-stage abstention algorithms. Our results also show that the relative performance of the state-of-the-art score-based surrogate losses can vary across datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mao24a.html
https://proceedings.mlr.press/v238/mao24a.htmlConsistent Optimal Transport with Empirical Conditional MeasuresGiven samples from two joint distributions, we consider the problem of Optimal Transportation (OT) between them when conditioned on a common variable. We focus on the general setting where the conditioned variable may be continuous, and the marginals of this variable in the two joint distributions may not be the same. In such settings, standard OT variants cannot be employed, and novel estimation techniques are necessary. Since the main challenge is that the conditional distributions are not explicitly available, the key idea in our OT formulation is to employ kernelized-least-squares terms computed over the joint samples, which implicitly match the transport plan’s marginals with the empirical conditionals. Under mild conditions, we prove that our estimated transport plans, as a function of the conditioned variable, are asymptotically optimal. For finite samples, we show that the deviation in terms of our regularized objective is bounded by $O(m^{-1/4})$, where $m$ is the number of samples. We also discuss how the conditional transport plan could be modelled using explicit probabilistic models as well as using implicit generative ones. We empirically verify the consistency of our estimator on synthetic datasets, where the optimal plan is analytically known. When employed in applications like prompt learning for few-shot classification and conditional-generation in the context of predicting cell responses to treatment, our methodology improves upon state-of-the-art methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/manupriya24a.html
https://proceedings.mlr.press/v238/manupriya24a.htmlA Cubic-regularized Policy Newton Algorithm for Reinforcement LearningWe consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/maniyar24a.html
https://proceedings.mlr.press/v238/maniyar24a.htmlDensity-Regression: Efficient and Distance-aware Deep Regressor for Uncertainty Estimation under Distribution ShiftsMorden deep ensembles technique achieves strong uncertainty estimation performance by going through multiple forward passes with different models. This is at the price of a high storage space and a slow speed in the inference (test) time. To address this issue, we propose Density-Regression, a method that leverages the density function in uncertainty estimation and achieves fast inference by a single forward pass. We prove it is distance aware on the feature space, which is a necessary condition for a neural network to produce high-quality uncertainty estimation under distribution shifts. Empirically, we conduct experiments on regression tasks with the cubic toy dataset, benchmark UCI, weather forecast with time series, and depth estimation under real-world shifted applications. We show that Density-Regression has competitive uncertainty estimation performance under distribution shifts with modern deep regressors while using a lower model size and a faster inference speed.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/manh-bui24a.html
https://proceedings.mlr.press/v238/manh-bui24a.htmlIdentifying Confounding from Causal Mechanism ShiftsCausal discovery methods commonly assume that all data is independently and identically distributed (i.i.d.) and that there are no unmeasured confounding variables. In practice, neither is likely to hold, and detecting confounding in non-i.i.d. settings poses a significant challenge. Motivated by this, we explore how to discover confounders from data in multiple environments with causal mechanism shifts. We show that the mechanism changes of observed variables can reveal which variable sets are confounded. Based on this idea, we propose an empirically testable criterion based on mutual information, show under which conditions it can identify confounding, and introduce CoCo to discover confounders from data in multiple contexts. In our experiments, we show that CoCo works well on synthetic and real-world data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mameche24a.html
https://proceedings.mlr.press/v238/mameche24a.htmlNear-Optimal Pure Exploration in Matrix Games: A Generalization of Stochastic Bandits & Dueling BanditsWe study the sample complexity of identifying the pure strategy Nash equilibrium (PSNE) in a two-player zero-sum matrix game with noise. Formally, we are given a stochastic model where any learner can sample an entry $(i,j)$ of the input matrix $A\in [-1,1]^{n\times m}$ and observe $A_{i,j}+\eta$ where $\eta$ is a zero-mean $1$-sub-Gaussian noise. The aim of the learner is to identify the PSNE of $A$, whenever it exists, with high probability while taking as few samples as possible. Zhou et al., (2017) presents an instance-dependent sample complexity lower bound that depends only on the entries in the row and column in which the PSNE lies. We design a near-optimal algorithm whose sample complexity matches the lower bound, up to log factors. The problem of identifying the PSNE also generalizes the problem of pure exploration in stochastic multi-armed bandits and dueling bandits, and our result matches the optimal bounds, up to log factors, in both the settings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/maiti24a.html
https://proceedings.mlr.press/v238/maiti24a.htmlHolographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware DetectionMalware detection is an interesting and valuable domain to work in because it has significant real-world impact and unique machine-learning challenges. We investigate existing long-range techniques and benchmarks and find that they’re not very suitable in this problem area. In this paper, we introduce Holographic Global Convolutional Networks (HGConv) that utilize the properties of Holographic Reduced Representations (HRR) to encode and decode features from sequence elements. Unlike other global convolutional methods, our method does not require any intricate kernel computation or crafted kernel design. HGConv kernels are defined as simple parameters learned through backpropagation. The proposed method has achieved new SOTA results on Microsoft Malware Classification Challenge, Drebin, and EMBER malware benchmarks. With log-linear complexity in sequence length, the empirical results demonstrate substantially faster run-time by HGConv compared to other methods achieving far more efficient scaling even with sequence length $\geq 100,000$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/mahmudul-alam24a.html
https://proceedings.mlr.press/v238/mahmudul-alam24a.htmlMulti-Agent Learning in Contextual Games under Unknown ConstraintsWe consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraints functions. Such games arise in applications where each agent’s action needs to belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents’ policies are a function of the unknown dynamics and hence, are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach that exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds, and the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $\epsilon$-c.z.CCEs can be approached whenever players follow a no-regret no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/maddux24a.html
https://proceedings.mlr.press/v238/maddux24a.htmlDirected Hypergraph Representation Learning for Link PredictionLink prediction is a critical problem in network structure processing. With the prevalence of deep learning, graph-based learning pattern in link prediction has been well-proven to successfully apply. However, existing representation-based computing paradigms retain some lack in processing complex networks: most methods only consider low-order pairwise information or eliminate the direction message, which tends to obtain a sub-optimal representation. To tackle the above challenges, we propose using directed hypergraph to model the real world and design a directed hypergraph neural network framework for data representation learning. Specifically, our work can be concluded into two sophisticated aspects: (1) We define the approximate Laplacian of the directed hypergraph, and further formulate the convolution operation on the directed hypergraph structure, solving the issue of the directed hypergraph structure representation learning. (2) By efficiently learning complex information from directed hypergraphs to obtain high-quality representations, we develop a framework DHGNN for link prediction on directed hypergraph structures. We empirically show that the merit of DHGNN lies in its ability to model complex correlations and encode information effectively of directed hypergraphs. Extensive experiments conducted on multi-field datasets demonstrate the superiority of the proposed DHGNN over various state-of-the-art approaches.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ma24b.html
https://proceedings.mlr.press/v238/ma24b.htmlAbsence of spurious solutions far from ground truth: A low-rank analysis with high-order lossesMatrix sensing problems exhibit pervasive non-convexity, plaguing optimization with a proliferation of suboptimal spurious solutions. Avoiding convergence to these critical points poses a major challenge. This work provides new theoretical insights that help demystify the intricacies of the non-convex landscape. In this work, we prove that under certain conditions, critical points sufficiently distant from the ground truth matrix exhibit favorable geometry by being strict saddle points rather than troublesome local minima. Moreover, we introduce the notion of higher-order losses for the matrix sensing problem and show that the incorporation of such losses into the objective function amplifies the negative curvature around those distant critical points. This implies that increasing the complexity of the objective function via high-order losses accelerates the escape from such critical points and acts as a desirable alternative to increasing the complexity of the optimization problem via over-parametrization. By elucidating key characteristics of the non-convex optimization landscape, this work makes progress towards a comprehensive framework for tackling broader machine learning objectives plagued by non-convexity.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ma24a.html
https://proceedings.mlr.press/v238/ma24a.htmlInconsistency of Cross-Validation for Structure Learning in Gaussian Graphical ModelsDespite numerous years of research into the merits and trade-offs of various model selection criteria, obtaining robust results that elucidate the behavior of cross-validation remains a challenging endeavor. In this paper, we highlight the inherent limitations of cross-validation when employed to discern the structure of a Gaussian graphical model. We provide finite-sample bounds on the probability that the Lasso estimator for the neighborhood of a node within a Gaussian graphical model, optimized using a prediction oracle, misidentifies the neighborhood. Our results pertain to both undirected and directed acyclic graphs, encompassing general, sparse covariance structures. To support our theoretical findings, we conduct an empirical investigation of this inconsistency by contrasting our outcomes with other commonly used information criteria through an extensive simulation study. Given that many algorithms designed to learn the structure of graphical models require hyperparameter selection, the precise calibration of this hyperparameter is paramount for accurately estimating the inherent structure. Consequently, our observations shed light on this widely recognized practical challenge.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lyu24a.html
https://proceedings.mlr.press/v238/lyu24a.htmlGraph Pruning for Enumeration of Minimal Unsatisfiable SubsetsFinding Minimal Unsatisfiable Subsets (MUSes) of boolean constraints is a common problem in infeasibility analysis of over-constrained systems. However, because of the exponential search space of the problem, enumerating MUSes is extremely time-consuming in real applications. In this work, we propose to prune formulas using a learned model to speed up MUS enumeration. We represent formulas as graphs and then develop a graph-based learning model to predict which part of the formula should be pruned. Importantly, the training of our model does not require labeled data. It does not even require training data from the target application because it extrapolates to data with different distributions. In our experiments we combine our model with existing MUS enumerators and validate its effectiveness in multiple benchmarks including a set of real-world problems outside our training distribution. The experiment results show that our method significantly accelerates MUS enumeration on average on these benchmark problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lymperopoulos24a.html
https://proceedings.mlr.press/v238/lymperopoulos24a.htmlDE-HNN: An effective neural model for Circuit Netlist representationThe run-time for optimization tools used in chip design has grown with the complexity of designs to the point where it can take several days to go through one design cycle which has become a bottleneck. Designers want fast tools that can quickly give feedback on a design. Using the input and output data of the tools from past designs, one can attempt to build a machine learning model that predicts the outcome of a design in significantly shorter time than running the tool. The accuracy of such models is affected by the representation of the design data, which is usually a netlist that describes the elements of the digital circuit and how they are connected. Graph representations for the netlist together with graph neural networks have been investigated for such models. However, the characteristics of netlists pose several challenges for existing graph learning frameworks, due to the large number of nodes and the importance of long-range interactions between nodes. To address these challenges, we represent the netlist as a directed hypergraph and propose a Directional Equivariant Hypergraph Neural Network (DE-HNN) for the effective learning of (directed) hypergraphs. Theoretically, we show that our DE-HNN can universally approximate any node or hyperedge based function that satisfies certain permutation equivariant and invariant properties natural for directed hypergraphs. We compare the proposed DE-HNN with several State-of-the-art (SOTA) machine learning models for (hyper)graphs and netlists, and show that the DE-HNN significantly outperforms them in predicting the outcome of optimized place-and-route tools directly from the input netlists.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/luo24a.html
https://proceedings.mlr.press/v238/luo24a.htmlTowards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes ApproachEstimating the potential behavior of the surrounding human-driven vehicles is crucial for the safety of autonomous vehicles in a mixed traffic flow. Recent state-of-the-art achieved accurate prediction using deep neural networks. However, these end-to-end models are usually black boxes with weak interpretability and generalizability. This paper proposes the Goal-based Neural Variational Agent (GNeVA), an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases. For interpretability, the model achieves target-driven motion prediction by estimating the spatial distribution of long-term destinations with a variational mixture of Gaussians. We identify a causal structure among maps and agents’ histories and derive a variational posterior to enhance generalizability. Experiments on motion prediction datasets validate that the fitted model can be interpretable and generalizable and can achieve comparable performance to state-of-the-art results.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lu24a.html
https://proceedings.mlr.press/v238/lu24a.htmlNo-Regret Algorithms for Safe Bayesian Optimization with Monotonicity ConstraintsWe consider the problem of sequentially maximizing an unknown function $f$ over a set of actions of the form $(s, x)$, where the selected actions must satisfy a safety constraint with respect to an unknown safety function $g$. We model $f$ and $g$ as lying in a reproducing kernel Hilbert space (RKHS), which facilitates the use of Gaussian process methods. While existing works for this setting have provided algorithms that are guaranteed to identify a near-optimal safe action, the problem of attaining low cumulative regret has remained largely unexplored, with a key challenge being that expanding the safe region can incur high regret. To address this challenge, we show that if $g$ is monotone with respect to just the single variable $s$ (with no such constraint on $f$), sublinear regret becomes achievable with our proposed algorithm. In addition, we show that a modified version of our algorithm is able to attain sublinear regret (for suitably defined notions of regret) for the task of finding a near-optimal $s$ corresponding to every $x$, as opposed to only finding the global safe optimum. Our findings are supported with empirical evaluations on various objective and safety functions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/losalka24a.html
https://proceedings.mlr.press/v238/losalka24a.htmlCausal Modeling with Stationary DiffusionsWe develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system’s behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion’s generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lorch24a.html
https://proceedings.mlr.press/v238/lorch24a.htmlEquation Discovery with Bayesian Spike-and-Slab Priors and Efficient KernelsDiscovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity and noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS). We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noises. We combine it with a Bayesian spike-and-slab prior — an ideal Bayesian sparse distribution — for effective operator selection and uncertainty quantification. We develop an expectation-propagation expectation-maximization (EP-EM) algorithm for efficient posterior inference and function estimation. To overcome the computational challenge of kernel regression, we place the function values on a mesh and induce a Kronecker product construction, and we use tensor algebra to enable efficient computation and optimization. We show the advantages of KBASS on a list of benchmark ODE and PDE discovery tasks. The code is available at \url{https://github.com/long-da/KBASS}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/long24a.html
https://proceedings.mlr.press/v238/long24a.htmlMitigating Underfitting in Learning to Defer with Consistent LossesLearning to defer (L2D) allows the classifier to defer its prediction to an expert for safer predictions, by balancing the system’s accuracy and extra costs incurred by consulting the expert. Various loss functions have been proposed for L2D, but they were shown to cause the underfitting of trained classifiers when extra consulting costs exist, resulting in degraded performance. In this paper, we propose a novel loss formulation that can mitigate the underfitting issue while remaining the statistical consistency. We first show that our formulation can avoid a common characteristic shared by most existing losses, which has been shown to be a cause of underfitting, and show that it can be combined with the representative losses for L2D to enhance their performance and yield consistent losses. We further study the regret transfer bounds of the proposed losses and experimentally validate its improvements over existing methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24h.html
https://proceedings.mlr.press/v238/liu24h.htmlUser-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal RatesWe study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime (Ghazi et al., 2023) or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions (Bassily et al., 2023). We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24g.html
https://proceedings.mlr.press/v238/liu24g.htmlSupervised Feature Selection via Ensemble Gradient Information from Sparse Neural NetworksFeature selection algorithms aim to select a subset of informative features from a dataset to reduce the data dimensionality, consequently saving resource consumption and improving the model’s performance and interpretability. In recent years, feature selection based on neural networks has become a new trend, demonstrating superiority over traditional feature selection methods. However, most existing methods use dense neural networks to detect informative features, which requires significant computational and memory overhead. In this paper, taking inspiration from the successful application of local sensitivity analysis on neural networks, we propose a novel resource-efficient supervised feature selection algorithm based on sparse multi-layer perceptron called “GradEnFS". By utilizing the gradient information of various sparse models from different training iterations, our method successfully detects the informative feature subset. We performed extensive experiments on nine classification datasets spanning various domains to evaluate the effectiveness of our method. The results demonstrate that our proposed approach outperforms the state-of-the-art methods in terms of selecting informative features while saving resource consumption substantially. Moreover, we show that using a sparse neural network for feature selection not only alleviates resource consumption but also has a significant advantage over other methods when performing feature selection on noisy datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24f.html
https://proceedings.mlr.press/v238/liu24f.htmlFitting ARMA Time Series Models without Identification: A Proximal ApproachFitting autoregressive moving average (ARMA) time series models requires model identification before parameter estimation. Model identification involves determining the order of the autoregressive and moving average components which is generally performed by inspection of the autocorrelation and partial autocorrelation functions or other offline methods. In this work, we regularize the parameter estimation optimization problem with a non-smooth hierarchical sparsity-inducing penalty based on two path graphs that allow performing model identification and parameter estimation simultaneously. A proximal block coordinate descent algorithm is then proposed to solve the underlying optimization problem efficiently. The resulting model satisfies the required stationarity and invertibility conditions for ARMA models. Numerical results supporting the proposed method are also presented.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24e.html
https://proceedings.mlr.press/v238/liu24e.htmlDistributionally Robust Off-Dynamics Reinforcement Learning: Provable Efficiency with Linear Function ApproximationWe study off-dynamics Reinforcement Learning (RL), where the policy is trained on a source domain and deployed to a distinct target domain. We aim to solve this problem via online distributionally robust Markov decision processes (DRMDPs), where the learning algorithm actively interacts with the source domain while seeking the optimal performance under the worst possible dynamics that is within an uncertainty set of the source domain’s transition kernel. We provide the first study on online DRMDPs with function approximation for off-dynamics RL. We find that DRMDPs’ dual formulation can induce nonlinearity, even when the nominal transition kernel is linear, leading to error propagation. By designing a $d$-rectangular uncertainty set using the total variation distance, we remove this additional nonlinearity and bypass the error propagation. We then introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the state and action space sizes. Our work makes the first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we substantiate the performance and robustness of DR-LSVI-UCB through different numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24d.html
https://proceedings.mlr.press/v238/liu24d.htmlEnhancing Distributional Stability among Sub-populationsEnhancing the stability of machine learning algorithms under distributional shifts is at the heart of the Out-of-Distribution (OOD) Generalization problem. Derived from causal learning, recent works of invariant learning pursue strict invariance with multiple training environments. Although intuitively reasonable, strong assumptions on the availability and quality of environments are made to learn the strict invariance property. In this work, we come up with the “distributional stability" notion to mitigate such limitations. It quantifies the stability of prediction mechanisms among sub-populations down to a prescribed scale. Based on this, we propose the learnability assumption and derive the generalization error bound under distribution shifts. Inspired by theoretical analyses, we propose our novel stable risk minimization (SRM) algorithm to enhance the model’s stability w.r.t. shifts in prediction mechanisms (Y|X-shifts). Experimental results are consistent with our intuition and validate the effectiveness of our algorithm. The code can be found at https://github.com/LJSthu/SRM.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24c.html
https://proceedings.mlr.press/v238/liu24c.htmlEfficient Neural Architecture Design via Capturing Architecture-Performance Joint DistributionThe relationship between architecture and performance is critical for improving the efficiency of neural architecture design, yet few efforts have been devoted to understanding this relationship between architecture and performance, especially architecture-performance joint distribution. In this paper, we propose Semi-Supervised Generative Adversarial Networks Neural Architecture Design Method or SemiGAN-NAD to capture the architecture-performance joint distribution with few performance labels. It is composed of Bidirectional Transformer of Architecture and Performance (Bi-Arch2Perf) and Neural Architecture Conditional Generation (NACG). Bi-Arch2Perf is developed to learn the joint distribution of architecture and performance from bidirectional conditional distribution through the adversarial training of the discriminator, the architecture generator, and the performance predictor. Then, the incorporation of semi-supervised learning optimizes the construction of Bi-Arch2Perf by utilizing a large amount of architecture information without performance annotation in search space. Based on the learned bidirectional relationship, the performance of architecture is predicted by NACG in high-performance architecture space to efficiently discover well-promising neural architectures. The experimental results on NAS benchmarks demonstrate that SemiGAN-NAD achieves competitive performance with reduced evaluation time compared with the latest NAS methods. Moreover, the high-performance architecture signatures learned by Bi-Arch2Perf are also illustrated in our experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24b.html
https://proceedings.mlr.press/v238/liu24b.htmlProximal Causal Inference for Synthetic Control with SurrogatesThe synthetic control method (SCM) has become a popular tool for estimating causal effects in policy evaluation, where a single treated unit is observed. However, SCM faces challenges in accurately predicting post-intervention potential outcomes had, contrary to fact, the treatment been withheld, when the pre-intervention period is short or the post-intervention period is long. To address these issues, we propose a novel method that leverages post-intervention information, specifically time-varying correlates of the causal effect called "surrogates", within the synthetic control framework. We establish conditions for identifying model parameters using the proximal inference framework and apply the generalized method of moments (GMM) approach for estimation and inference about the average treatment effect on the treated (ATT). Interestingly, we uncover specific conditions under which exclusively using post-intervention data suffices for estimation within our framework. Through a synthetic experiment and a real-world application, we demonstrate that our method can outperform other synthetic control methods in estimating both short-term and long-term effects, yielding more accurate inferences.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liu24a.html
https://proceedings.mlr.press/v238/liu24a.htmlHow Good is a Single Basin?The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in the same basin. Through our experiments, we demonstrate that increased connectivity indeed negatively impacts performance. However, when incorporating the knowledge from other basins implicitly through distillation, we show that the gap in performance can be mitigated by re-discovering (multi-basin) deep ensembles within a single basin. Thus, we conjecture that while the extra-basin knowledge is at least partially present in any given basin, it cannot be easily harnessed without learning it from other basins.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lion24a.html
https://proceedings.mlr.press/v238/lion24a.htmlLearning Safety Constraints from Demonstrations with Unknown RewardsWe propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in gridworld environments and a driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior. Importantly, we can safely transfer the learned constraints to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lindner24a.html
https://proceedings.mlr.press/v238/lindner24a.htmlDNNLasso: Scalable Graph Learning for Matrix-Variate DataWe consider the problem of jointly learning row-wise and column-wise dependencies of matrix-variate observations, which are modelled separately by two precision matrices. Due to the complicated structure of Kronecker-product precision matrices in the commonly used matrix-variate Gaussian graphical models, a sparser Kronecker-sum structure was proposed recently based on the Cartesian product of graphs. However, existing methods for estimating Kronecker-sum structured precision matrices do not scale well to large scale datasets. In this paper, we introduce DNNLasso, a diagonally non-negative graphical lasso model for estimating the Kronecker-sum structured precision matrix, which outperforms the state-of-the-art methods by a large margin in both accuracy and computational time.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lin24b.html
https://proceedings.mlr.press/v238/lin24b.htmlA Specialized Semismooth Newton Method for Kernel-Based Optimal TransportKernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples. Recent works suggest that these estimators are more statistically efficient than plug-in (linear programming-based) OT estimators when comparing probability measures in high-dimensions (Vacher et al., 2021). Unfortunately,that statistical benefit comes at a very steep computational price: because their computation relies on the short-step interior-point method (SSIPM), which comes with a large iteration count in practice, these estimators quickly become intractable w.r.t. sample size $n$. To scale these estimators to larger $n$, we propose a nonsmooth fixedpoint model for the kernel-based OT problem, and show that it can be efficiently solved via a specialized semismooth Newton (SSN) method: We show, exploring the problem’s structure, that the per-iteration cost of performing one SSN step can be significantly reduced in practice. We prove that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions. We show substantial speedups over SSIPM on both synthetic and real datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lin24a.html
https://proceedings.mlr.press/v238/lin24a.htmlOn Ranking-based Tests of IndependenceIn this paper we develop a novel nonparametric framework to test the independence of two random variables $X$ and $Y$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dxdy)$, based on Receiver Operating Characteristic (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that, the independence hypothesis $\mathcal{H}_0$ is necessarily false as soon as the optimal scoring function related to the pair of distributions $(H\otimes G,;{F})$, obtained from a bipartite ranking algorithm, has a ROC curve that deviates from the main diagonal of the unit square. We consider a wide class of rank statistics encompassing many ways of deviating from the diagonal in the ROC space to build tests of independence. Beyond its great flexibility, this new method has theoretical properties that far surpass those of its competitors. Nonasymptotic bounds for the two types of testing errors are established. From an empirical perspective, the novel procedure we promote in this paper exhibits a remarkable ability to detect small departures, of various types, from the null assumption $\mathcal{H}_0$, even in high dimension, as supported by the numerical experiments presented here.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/limnios24a.html
https://proceedings.mlr.press/v238/limnios24a.htmlPathwise Explanation of ReLU Neural NetworksNeural networks have demonstrated a wide range of successes, but their “black box" nature raises concerns about transparency and reliability. Previous research on ReLU networks has sought to unwrap these networks into linear models based on activation states of all hidden units. In this paper, we introduce a novel approach that considers subsets of the hidden units involved in the decision making path. This pathwise explanation provides a clearer and more consistent understanding of the relationship between the input and the decision-making process. Our method also offers flexibility in adjusting the range of explanations within the input, i.e., from an overall attribution input to particular components within the input. Furthermore, it allows for the decomposition of explanations for a given input for more detailed explanations. Our experiments demonstrate that the proposed method outperforms existing methods both quantitatively and qualitatively.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lim24a.html
https://proceedings.mlr.press/v238/lim24a.htmlRegret Bounds for Risk-sensitive Reinforcement Learning with Lipschitz Dynamic Risk MeasuresWe study finite episodic Markov decision processes incorporating dynamic risk measures to capture risk sensitivity. To this end, we present two model-based algorithms applied to \emph{Lipschitz} dynamic risk measures, a wide range of risk measures that subsumes spectral risk measure, optimized certainty equivalent, and distortion risk measures, among others. We establish both regret upper bounds and lower bounds. Notably, our upper bounds demonstrate optimal dependencies on the number of actions and episodes while reflecting the inherent trade-off between risk sensitivity and sample complexity. Our approach offers a unified framework that not only encompasses multiple existing formulations in the literature but also broadens the application spectrum.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/liang24a.html
https://proceedings.mlr.press/v238/liang24a.htmlComputing epidemic metrics with edge differential privacyMetrics such as the outbreak size in an epidemic process on a network are fundamental quantities used in public health analyses. The datasets used in such models used in practice, e.g., the contact network and disease states, are sensitive in many settings. We study the complexity of computing epidemic outbreak size within a given time horizon, under edge differential privacy. These quantities have high sensitivity, and we show that giving algorithms with good utility guarantees is impossible for general graphs. To address these hardness results, we consider a smaller class of graphs with similar properties as social networks (called expander graphs) and give a polynomial-time algorithm with strong utility guarantees. Our results are the first to give any non-trivial guarantees for differentially private infection size estimation.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24q.html
https://proceedings.mlr.press/v238/li24q.htmlEfficient Active Learning Halfspaces with Tsybakov Noise: A Non-convex Optimization ApproachWe study the problem of computationally and label efficient PAC active learning $d$-dimensional halfspaces with Tsybakov Noise (Tsybakov, 2004) under structured unlabeled data distributions. Inspired by Diakonikolas et al., (2020c), we prove that any approximate first-order stationary point of a smooth nonconvex loss function yields a halfspace with a low excess error guarantee. In light of the above structural result, we design a nonconvex optimization-based algorithm with a label complexity of $\tilde{O}(d (\frac{1}{\epsilon})^{\frac{8-6\alpha}{3\alpha-1}})$, under the assumption that the Tsybakov noise parameter $\alpha \in (\frac13, 1]$, which narrows down the gap between the label complexities of the previously known efficient passive or active algorithms (Diakonikolas et al., 2020b; Zhang and Li, 2021) and the information-theoretic lower bound in this setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24p.html
https://proceedings.mlr.press/v238/li24p.htmlTrigonometric Quadrature Fourier Features for Scalable Gaussian Process RegressionFourier feature approximations have been successfully applied in the literature for scalable Gaussian Process (GP) regression. In particular, Quadrature Fourier Features (QFF) derived from Gaussian quadrature rules have gained popularity in recent years due to their improved approximation accuracy and better calibrated uncertainty estimates compared to Random Fourier Feature (RFF) methods. However, a key limitation of QFF is that its performance can suffer from well-known pathologies related to highly oscillatory quadrature, resulting in mediocre approximation with limited features. We address this critical issue via a new Trigonometric Quadrature Fourier Feature (TQFF) method, which uses a novel non-Gaussian quadrature rule specifically tailored for the desired Fourier transform. We derive an exact quadrature rule for TQFF, along with kernel approximation error bounds for the resulting feature map. We then demonstrate the improved performance of our method over RFF and Gaussian QFF in a suite of numerical experiments and applications, and show the TQFF enjoys accurate GP approximations over a broad range of length-scales using fewer features.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24o.html
https://proceedings.mlr.press/v238/li24o.htmlImproved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown TransitionWe study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\tilde{\mathcal{O}}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature mappings, $S$ is the size of state space, $A$ is the size of action space, $H$ is the episode length and $K$ is the number of episodes. Our result strictly improves the previous best-known $\tilde{\mathcal{O}}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (\romannumeral1) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (\romannumeral2) a new self-normalized concentration tailored specifically to handle non-independent noises, originally proposed in the dynamic assortment area and firstly applied in reinforcement learning to handle correlations between different states.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24n.html
https://proceedings.mlr.press/v238/li24n.htmlOn the Model-Misspecification in Reinforcement LearningThe success of reinforcement learning (RL) crucially depends on effective function approximation when dealing with complex ground-truth models. Existing sample-efficient RL algorithms primarily employ three approaches to function approximation: policy-based, value-based, and model-based methods. However, in the face of model misspecification—a disparity between the ground-truth and optimal function approximators—it is shown that policy-based approaches can be robust even when the policy function approximation is under a large \emph{locally-bounded} misspecification error, with which the function class may exhibit a $\Omega(1)$ approximation error in specific states and actions, but remains small on average within a policy-induced state distribution. Yet it remains an open question whether similar robustness can be achieved with value-based and model-based approaches, especially with general function approximation. To bridge this gap, in this paper we present a unified theoretical framework for addressing model misspecification in RL. We demonstrate that, through meticulous algorithm design and sophisticated analysis, value-based and model-based methods employing general function approximation can achieve robustness under local misspecification error bounds. In particular, they can attain a regret bound of $\widetilde{O}\left(\mathrm{poly}(dH)\cdot(\sqrt{K} + K\cdot\zeta) \right)$, where $d$ represents the complexity of the function class, $H$ is the episode length, $K$ is the total number of episodes, and $\zeta$ denotes the local bound for misspecification error. Furthermore, we propose an algorithmic framework that can achieve the same order of regret bound without prior knowledge of $\zeta$, thereby enhancing its practical applicability.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24m.html
https://proceedings.mlr.press/v238/li24m.htmlPolicy Evaluation for Reinforcement Learning from Human Feedback: A Sample Complexity AnalysisA recently popular approach to solving reinforcement learning is with data from human preferences. In fact, human preference data are now used with classic reinforcement learning algorithms such as actor-critic methods, which involve evaluating an intermediate policy over a reward learned from human preference data with distribution shift, known as off-policy evaluation (OPE). Such algorithm includes (i) learning reward function from human preference dataset, and (ii) learning expected cumulative reward of a target policy. Despite the huge empirical success, existing OPE methods with preference data often lack theoretical understandings and rely heavily on heuristics. In this paper, we study the sample efficiency of OPE with human preference and establish a statistical guarantee for it. Specifically, we approach OPE with learning the value function by fitted-Q-evaluation with a deep neural network. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Under the assumption of high reward smoothness, our results almost align with the classical OPE results with observable reward data. To the best of our knowledge, this is the first result that establishes a provably efficient guarantee for off-policy evaluation with RLHF.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24l.html
https://proceedings.mlr.press/v238/li24l.htmlMulti-Resolution Active Learning of Fourier Neural OperatorsFourier Neural Operator (FNO) is a popular operator learning framework. It not only achieves the state-of-the-art performance in many tasks, but also is efficient in training and prediction. However, collecting training data for the FNO can be a costly bottleneck in practice, because it often demands expensive physical simulations. To overcome this problem, we propose Multi-Resolution Active Learning of FNO (MRA-FNO), which can dynamically select the input functions and resolutions to lower the data cost as much as possible while optimizing the learning efficiency. Specifically, we propose a probabilistic multi-resolution FNO and use ensemble Monte-Carlo to develop an effective posterior inference algorithm. To conduct active learning, we maximize a utility-cost ratio as the acquisition function to acquire new examples and resolutions at each step. We use moment matching and the matrix determinant lemma to enable tractable, efficient utility computation. Furthermore, we develop a cost annealing framework to avoid over-penalizing high-resolution queries at the early stage. The over-penalization is severe when the cost difference is significant between the resolutions, which renders active learning often stuck at low-resolution queries and inferior performance. Our method overcomes this problem and applies to general multi-fidelity active learning and optimization problems. We have shown the advantage of our method in several benchmark operator learning tasks. The code is available at https://github.com/shib0li/MRA-FNO.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24k.html
https://proceedings.mlr.press/v238/li24k.htmlBackward Filtering Forward Deciding in Linear Non-Gaussian State Space ModelsThe paper considers linear state space models with non-Gaussian inputs and/or constraints. As shown previously, NUP representations (normal with unknown parameters) allow to compute MAP estimates in such models by iterating Kalman smoothing recursions. In this paper, we propose to compute such MAP estimates by iterating backward-forward recursions where the forward recursion amounts to coordinatewise input estimation. The advantages of the proposed approach include faster convergence, no “zero-variance stucking”, and easier control of constraint satisfaction. The approach is demonstrated with simulation results of exemplary applications including (i) regression with non-Gaussian priors or constraints on k-th order differences and (ii) control with linearly constrained inputs.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24j.html
https://proceedings.mlr.press/v238/li24j.htmlEthics in Action: Training Reinforcement Learning Agents for Moral Decision-making In Text-based Adventure GamesReinforcement Learning (RL) has demonstrated its potential in solving goal-oriented sequential tasks. However, with the increasing capabilities of RL agents, ensuring morally responsible agent behavior is becoming a pressing concern. Previous approaches have included moral considerations by statically assigning a moral score to each action at runtime. However, these methods do not account for the potential moral value of future states when evaluating immoral actions. This limits the ability to find trade-offs between different aspects of moral behavior and the utility of the action. In this paper, we aim to factor in moral scores by adding a constraint to the RL objective that is incorporated during training, thereby dynamically adapting the policy function. By combining Lagrangian optimization and meta-gradient learning, we develop an RL method that is able to find a trade-off between immoral behavior and performance in the decision-making process.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24i.html
https://proceedings.mlr.press/v238/li24i.htmlOptimal Exploration is no harder than Thompson SamplingGiven a set of arms $\mathcal{Z}\subset \mathbb{R}^d$ and an unknown parameter vector $\theta_\ast\in\mathbb{R}^d$, the pure exploration linear bandits problem aims to return $\arg\max_{z\in \mathcal{Z}} z^{\top}\theta_{\ast}$, with high probability through noisy measurements of $x^{\top}\theta_{\ast}$ with $x\in \mathcal{X}\subset \mathbb{R}^d$. Existing (asymptotically) optimal methods require either a) potentially costly projections for each arm $z\in \mathcal{Z}$ or b) explicitly maintaining a subset of $\mathcal{Z}$ under consideration at each time. This complexity is at odds with the popular and simple Thompson Sampling algorithm for regret minimization, which just requires access to a posterior sampling and argmax oracle, and does not need to enumerate $\mathcal{Z}$ at any point. Unfortunately, Thompson sampling is known to be sub-optimal for pure exploration. In this work, we pose a natural question: is there an algorithm that can explore optimally and only needs the same computational primitives as Thompson Sampling? We answer the question in the affirmative. We provide an algorithm that leverages only sampling and argmax oracles and achieves an exponential convergence rate, with the exponent equal to the exponent of the optimal fixed allocation asymptotically. In addition, we show that our algorithm can be easily implemented and performs as well empirically as existing asymptotically optimal methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24h.html
https://proceedings.mlr.press/v238/li24h.htmlWhen No-Rejection Learning is Consistent for Regression with RejectionLearning with rejection has been a prototypical model for studying the human-AI interaction on prediction tasks. Upon the arrival of a sample instance, the model first uses a rejector to decide whether to accept and use the AI predictor to make a prediction or reject and defer the sample to humans. Learning such a model changes the structure of the original loss function and often results in undesirable non-convexity and inconsistency issues. For the classification with rejection problem, several works develop consistent surrogate losses for the joint learning of the predictor and the rejector, while there have been fewer works for the regression counterpart. This paper studies the regression with rejection (RwR) problem and investigates a no-rejection learning strategy that uses all the data to learn the predictor. We first establish the consistency for such a strategy under the weak realizability condition. Then for the case without the weak realizability, we show that the excessive risk can also be upper bounded with the sum of two parts: prediction error and calibration error. Lastly, we demonstrate the advantage of such a proposed learning strategy with empirical evidence.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24g.html
https://proceedings.mlr.press/v238/li24g.htmlMechanics of Next Token Prediction with Self-AttentionTransformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24f.html
https://proceedings.mlr.press/v238/li24f.htmlPrior-dependent analysis of posterior sampling reinforcement learning with function approximationThis work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation; and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of $\tilde{\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This signifies a methodological enhancement by optimizing the $\mathcal{O}(\sqrt{\log T})$ factor over the previous benchmark (Osband and Van Roy, 2014) specified to linear mixture MDPs. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses reliant on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24e.html
https://proceedings.mlr.press/v238/li24e.htmlTenGAN: Pure Transformer Encoders Make an Efficient Discrete GAN for De Novo Molecular GenerationDeep generative models for de novo molecular generation using discrete data, such as the simplified molecular-input line-entry system (SMILES) strings, have attracted widespread attention in drug design. However, training instability often plagues generative adversarial networks (GANs), leading to problems such as mode collapse and low diversity. This study proposes a pure transformer encoder-based GAN (TenGAN) to solve these issues. The generator and discriminator of TenGAN are variants of the transformer encoders and are combined with reinforcement learning (RL) to generate molecules with the desired chemical properties. Besides, data augmentation of the variant SMILES is leveraged for the TenGAN training to learn the semantics and syntax of SMILES strings. Additionally, we introduce an enhanced variant of TenGAN, named Ten(W)GAN, which incorporates mini-batch discrimination and Wasserstein GAN to improve the ability to generate molecules. The experimental results and ablation studies on the QM9 and ZINC datasets showed that the proposed models generated highly valid and novel molecules with the desired chemical properties in a computationally efficient manner.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24d.html
https://proceedings.mlr.press/v238/li24d.htmlBest Arm Identification with Resource ConstraintsMotivated by the cost heterogeneity in experimentation across different alternatives, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem. The agent aims to identify the best arm under resource constraints, where resources are consumed for each arm pull. We make two novel contributions. We design and analyze the Successive Halving with Resource Rationing algorithm (SH-RR). The SH-RR achieves a near-optimal non-asymptotic rate of convergence in terms of the probability of successively identifying an optimal arm. Interestingly, we identify a difference in convergence rates between the cases of deterministic and stochastic resource consumption.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24c.html
https://proceedings.mlr.press/v238/li24c.htmlNon-Neighbors Also Matter to Kriging: A New Contrastive-Prototypical LearningKriging aims to estimate the attributes of unseen geo-locations from observations in the spatial vicinity or physical connections. Existing works assume that neighbors’ information offers the basis for estimating the unobserved target while ignoring non-neighbors. However, neighbors could also be quite different or even misleading, and the non-neighbors could still offer constructive information. To this end, we propose "Contrastive-Prototypical" self-supervised learning for Kriging (KCP): (1) The neighboring contrastive module coarsely pushes neighbors together and non-neighbors apart. (2) In parallel, the prototypical module identifies similar representations via exchanged prediction, such that it refines the misleading neighbors and recycles the useful non-neighbors from the neighboring contrast component. As a result, not all the neighbors and some of the non-neighbors will be used to infer the target. (3) To learn general and robust representations, we design an adaptive augmentation module that encourages data diversity. Theoretical bound is derived for the proposed augmentation. Extensive experiments on real-world datasets demonstrate the superior performance of KCP compared to its peers with 6% improvements and exceptional transferability and robustness.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24b.html
https://proceedings.mlr.press/v238/li24b.htmlPersonalized Federated X-armed BanditIn this work, we study the personalized federated $\mathcal{X}$-armed bandit problem, where the heterogeneous local objectives of the clients are optimized simultaneously in the federated learning paradigm. We propose the \texttt{PF-PNE} algorithm with a unique double elimination strategy, which safely eliminates the non-optimal regions while encouraging federated collaboration through biased but effective evaluations of the local objectives. The proposed \texttt{PF-PNE} algorithm is able to optimize local objectives with arbitrary levels of heterogeneity, and its limited communications protects the confidentiality of the client-wise reward data. Our theoretical analysis shows the benefit of the proposed algorithm over single-client algorithms. Experimentally, \texttt{PF-PNE} outperforms multiple baselines on both synthetic and real life datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/li24a.html
https://proceedings.mlr.press/v238/li24a.htmlAny-dimensional equivariant neural networksTraditional supervised learning aims to learn an unknown mapping by fitting a function to a set of input-output pairs with a fixed dimension. The fitted function is then defined on inputs of the same dimension. However, in many settings, the unknown mapping takes inputs in any dimension; examples include graph parameters defined on graphs of any size and physics quantities defined on an arbitrary number of particles. We leverage a newly-discovered phenomenon in algebraic topology, called representation stability, to define equivariant neural networks that can be trained with data in a fixed dimension and then extended to accept inputs in any dimension. Our approach is black-box and user-friendly, requiring only the network architecture and the groups for equivariance, and can be combined with any training procedure. We provide a simple open-source implementation of our methods and offer preliminary numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/levin24a.html
https://proceedings.mlr.press/v238/levin24a.htmlGraph fission and cross-validationWe introduce a technique called graph fission which takes in a graph which potentially contains only one observation per node (whose distribution lies in a known class) and produces two (or more) independent graphs with the same node/edge set in a way that splits the original graph’s information amongst them in any desired proportion. Our proposal builds on data fission/thinning, a method that uses external randomization to create independent copies of an unstructured dataset. We extend this idea to the graph setting where there may be latent structure between observations. We demonstrate the utility of this framework via two applications: inference after structural trend estimation on graphs and a model selection procedure we term "graph cross-validation"’.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/leiner24a.html
https://proceedings.mlr.press/v238/leiner24a.htmlImproved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set ConversionLogistic bandit is a ubiquitous framework of modeling users’ choices, e.g., click vs. no click for advertisement recommender system. We observe that the prior works overlook or neglect dependencies in $S \geq \Vert \theta_\star \Vert_2$, where $\theta_\star \in \mathbb{R}^d$ is the unknown parameter vector, which is particularly problematic when $S$ is large, e.g., $S \geq d$. In this work, we improve the dependency on $S$ via a novel approach called {\it regret-to-confidence set conversion (R2CS)}, which allows us to construct a convex confidence set based on only the \textit{existence} of an online learning algorithm with a regret guarantee. Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility and the dependence on other factors such as $d$ and $T$. We apply our new confidence set to the regret analyses of logistic bandits with a new martingale concentration step that circumvents an additional factor of $S$. We then extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS. While we applied R2CS to the (multinomial) logistic model, R2CS is a generic approach for developing confidence sets that can be used for various models, which can be of independent interest.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lee24d.html
https://proceedings.mlr.press/v238/lee24d.htmlApproximate Bayesian Class-Conditional Models under Continuous Representation ShiftFor models consisting of a classifier in some representation space, learning online from a non-stationary data stream often necessitates changes in the representation. So, the question arises of what is the best way to adapt the classifier to shifts in representation. Current methods only slowly change the classifier to representation shift, introducing noise into learning as the classifier is misaligned to the representation. We propose DeepCCG, an empirical Bayesian approach to solve this problem. DeepCCG works by updating the posterior of a class conditional Gaussian classifier such that the classifier adapts in one step to representation shift. The use of a class conditional Gaussian classifier also enables DeepCCG to use a log conditional marginal likelihood loss to update the representation. To perform the update to the classifier and representation, DeepCCG maintains a fixed number of examples in memory and so a key part of DeepCCG is selecting what examples to store, choosing the subset that minimises the KL divergence between the true posterior and the posterior induced by the subset. We explore the behaviour of DeepCCG in online continual learning (CL), demonstrating that it performs well against a spectrum of online CL methods and that it reduces the change in performance due to representation shift.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lee24c.html
https://proceedings.mlr.press/v238/lee24c.htmlXB-MAML: Learning Expandable Basis Parameters for Effective Meta-Learning with Wide Task CoverageMeta-learning, which pursues an effective initialization model, has emerged as a promising approach to handling unseen tasks. However, a limitation remains to be evident when a meta-learner tries to encompass a wide range of task distribution, e.g., learning across distinctive datasets or domains. Recently, a group of works has attempted to employ multiple model initializations to cover widely-ranging tasks, but they are limited in adaptively expanding initializations. We introduce XB-MAML, which learns expandable basis parameters, where they are linearly combined to form an effective initialization to a given task. XB-MAML observes the discrepancy between the vector space spanned by the basis and fine-tuned parameters to decide whether to expand the basis. Our method surpasses the existing works in the multi-domain meta-learning benchmarks and opens up new chances of meta-learning for obtaining the diverse inductive bias that can be combined to stretch toward the effective initialization for diverse unseen tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lee24b.html
https://proceedings.mlr.press/v238/lee24b.htmlAnalysis of Using Sigmoid Loss for Contrastive LearningContrastive learning has emerged as a prominent branch of self-supervised learning for several years. Especially, CLIP, which applies contrastive learning to large sets of captioned images, has garnered significant attention. Recently, SigLIP, a variant of CLIP, has been proposed, which uses the sigmoid loss instead of the standard InfoNCE loss. SigLIP achieves the performance comparable to CLIP in a more efficient manner by eliminating the need for a global view. However, theoretical understanding of using the sigmoid loss in contrastive learning is underexplored. In this paper, we provide a theoretical analysis of using the sigmoid loss in contrastive learning, in the perspective of the geometric structure of learned embeddings. First, we propose the double-Constant Embedding Model (CCEM), a framework for parameterizing various well-known embedding structures by a single variable. Interestingly, the proposed CCEM is proven to contain the optimal embedding with respect to the sigmoid loss. Second, we mathematically analyze the optimal embedding minimizing the sigmoid loss for contrastive learning. The optimal embedding ranges from simplex equiangular-tight-frame to antipodal structure, depending on the temperature parameter used in the sigmoid loss. Third, our experimental results on synthetic datasets coincide with the theoretical results on the optimal embedding structures.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lee24a.html
https://proceedings.mlr.press/v238/lee24a.htmlQueuing dynamics of asynchronous Federated LearningWe study asynchronous federated learning mechanisms with nodes having potentially different computational speeds. In such an environment, each node is allowed to work on models with potential delays and contribute to updates to the central server at its own pace. Existing analyses of such algorithms typically depend on intractable quantities such as the maximum node delay and do not consider the underlying queuing dynamics of the system. In this paper, we propose a non-uniform sampling scheme for the central server that allows for lower delays with better complexity, taking into account the closed Jackson network structure of the associated computational graph. Our experiments clearly show a significant improvement of our method over current state-of-the-art asynchronous algorithms on image classification problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/leconte24a.html
https://proceedings.mlr.press/v238/leconte24a.htmlOn the Privacy of Selection Mechanisms with Gaussian NoiseReport Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lebensold24a.html
https://proceedings.mlr.press/v238/lebensold24a.htmlOptimal Transport for Measures with Noisy Tree MetricWe study optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that such OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, the given tree structure may be, however, perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach which considers the maximal possible distances between two input measures over an uncertainty set of tree metrics. In general, this approach is hard to compute, even for measures supported in one-dimensional space, due to its non-convexity and non-smoothness which hinders its practical applications, especially for large-scale settings. In this work, we propose novel uncertainty sets of tree metrics from the lens of edge deletion/addition which covers a diversity of tree structures in an elegant framework. Consequently, by building upon the proposed uncertainty sets, and leveraging the tree structure over supports, we show that the robust OT also admits a closed-form expression for a fast computation as its counterpart standard OT (i.e., TW). Furthermore, we demonstrate that the robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite kernels and test them in several simulations on various real-world datasets on document classification and topological data analysis.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/le24a.html
https://proceedings.mlr.press/v238/le24a.htmlConditional Adjustment in a Markov Equivalence ClassWe consider the problem of identifying a conditional causal effect through covariate adjustment. We focus on the setting where the causal graph is known up to one of two types of graphs: a maximally oriented partially directed acyclic graph (MPDAG) or a partial ancestral graph (PAG). Both MPDAGs and PAGs represent equivalence classes of possible underlying causal models. After defining adjustment sets in this setting, we provide a necessary and sufficient graphical criterion – the conditional adjustment criterion – for finding these sets under conditioning on variables unaffected by treatment. We further provide explicit sets from the graph that satisfy the conditional adjustment criterion, and therefore, can be used as adjustment sets for conditional causal effect identification.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/laplante24a.html
https://proceedings.mlr.press/v238/laplante24a.htmlCausal Q-Aggregation for CATE Model SelectionAccurate estimation of conditional average treatment effects (CATE) is at the core of personalized decision making. While there is a plethora of models for CATE estimation, model selection is a non-trivial task, due to the fundamental problem of causal inference. Recent empirical work provides evidence in favor of proxy loss metrics with double robust properties and in favor of model ensembling. However, theoretical understanding is lacking. Direct application of prior theoretical works leads to suboptimal oracle model selection rates due to the non-convexity of the model selection problem. We provide regret rates for the major existing CATE ensembling approaches and propose a new CATE model ensembling approach based on $Q$-aggregation using the doubly robust loss. Our main result shows that causal $Q$-aggregation achieves statistically optimal oracle model selection regret rates of $\log(M)/n$ (with $M$ models and $n$ samples), with the addition of higher-order estimation error terms related to products of errors in the nuisance functions. Crucially, our regret rate does not require that any of the candidate CATE models be close to the truth. We validate our new method on many semi-synthetic datasets and also provide extensions of our work to CATE model selection with instrumental variables and unobserved confounding.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/lan24a.html
https://proceedings.mlr.press/v238/lan24a.htmlTackling the XAI Disagreement Problem with Regional ExplanationsThe XAI Disagreement Problem concerns the fact that various explainability methods yield different local/global insights on model behavior. Thus, given the lack of ground truth in explainability, practitioners are left wondering “Which explanation should I believe?”. In this work, we approach the Disagreement Problem from the point of view of Functional Decomposition (FD). First, we demonstrate that many XAI techniques disagree because they handle feature interactions differently. Secondly, we reduce interactions locally by fitting a so-called FD-Tree, which partitions the input space into regions where the model is approximately additive. Thus instead of providing global explanations aggregated over the whole dataset, we advocate reporting the FD-Tree structure as well as the regional explanations extracted from its leaves. The beneficial effects of FD-Trees on the Disagreement Problem are demonstrated on toy and real datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/laberge24a.html
https://proceedings.mlr.press/v238/laberge24a.htmlMinimax optimal density estimation using a shallow generative model with a one-dimensional latent variableA deep generative model yields an implicit estimator for the unknown distribution or density function of the observation. This paper investigates some statistical properties of the implicit density estimator pursued by VAE-type methods from a nonparametric density estimation framework. More specifically, we obtain convergence rates of the VAE-type density estimator under the assumption that the underlying true density function belongs to a locally Holder class. Remarkably, a near minimax optimal rate with respect to the Hellinger metric can be achieved by the simplest network architecture, a shallow generative model with a one-dimensional latent variable.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kyu-kwon24a.html
https://proceedings.mlr.press/v238/kyu-kwon24a.htmlVariational ResamplingWe cast the resampling step in particle filters (PFs) as a variational inference problem, resulting in a new class of resampling schemes: variational resampling. Variational resampling is flexible as it allows for choices of 1) divergence to minimize, 2) target distribution to input to the divergence, and 3) divergence minimization algorithm. With this novel application of VI to particle filters, variational resampling further unifies these two powerful and popular methodologies. We construct two variational resamplers that replicate particles in order to maximize lower bounds with respect to two different target measures. We benchmark our variational resamplers on challenging smoothing tasks, outperforming PFs that implement the state-of-the-art resampling schemes.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kviman24a.html
https://proceedings.mlr.press/v238/kviman24a.htmlBest-of-Both-Worlds Algorithms for Linear Contextual BanditsWe study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\!\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{\mathcal{O}}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{\mathcal{O}}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{\mathcal{O}}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kuroki24a.html
https://proceedings.mlr.press/v238/kuroki24a.htmlTowards Costless Model Selection in Contextual Bandits: A Bias-Variance PerspectiveModel selection in supervised learning provides costless guarantees as if the model that best balances bias and variance was known a priori. We study the feasibility of similar guarantees for cumulative regret minimization in the stochastic contextual bandit setting. Recent work [Marinov and Zimmert, 2021] identifies instances where no algorithm can guarantee costless regret bounds. Nevertheless, we identify benign conditions where costless model selection is feasible: gradually increasing class complexity, and diminishing marginal returns for best-in-class policy value with increasing class complexity. Our algorithm is based on a novel misspecification test, and our analysis demonstrates the benefits of using model selection for reward estimation. Unlike prior work on model selection in contextual bandits, our algorithm carefully adapts to the evolving bias-variance trade-off as more data is collected. In particular, our algorithm and analysis go beyond adapting to the complexity of the simplest realizable class and instead adapt to the complexity of the simplest class whose estimation variance dominates the bias. For short horizons, this provides improved regret guarantees that depend on the complexity of simpler classes.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kumar-krishnamurthy24a.html
https://proceedings.mlr.press/v238/kumar-krishnamurthy24a.htmlUnveiling Latent Causal Rules: A Temporal Point Process Approach for Abnormal Event ExplanationIn high-stakes systems such as healthcare, it is critical to understand the causal reasons behind unusual events, such as sudden changes in patient’s health. Unveiling the causal reasons helps with quick diagnoses and precise treatment planning. In this paper, we propose an automated method for uncovering “if-then” logic rules to explain observational events. We introduce {\it temporal point processes} to model the events of interest, and discover the set of latent rules to explain the occurrence of events. To achieve this goal, we employ an Expectation-Maximization (EM) algorithm. In the E-step, we calculate the posterior probability of each event being explained by each discovered rule. In the M-step, we update both the rule set and model parameters to enhance the likelihood function’s lower bound. Notably, we will optimize the rule set in a {\it differential} manner. Our approach demonstrates accurate performance in both discovering rules and identifying root causes. We showcase its promising results using synthetic and real healthcare datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kuang24a.html
https://proceedings.mlr.press/v238/kuang24a.htmlPrivate Learning with Public FeaturesWe study a class of private learning problems in which the data is a join of private and public features. This is often the case in private personalization tasks such as recommendation or ad prediction, in which features related to individuals are sensitive, while features related to items (the movies or songs to be recommended, or the ads to be shown to users) are publicly available and do not require protection. A natural question is whether private algorithms can achieve higher utility in the presence of public features. We give a positive answer for multi-encoder models where one of the encoders operates on public features. We develop new algorithms that take advantage of this separation by only protecting certain sufficient statistics (instead of adding noise to the gradient). This method has a guaranteed utility improvement for linear regression, and importantly, achieves the state of the art on two standard private recommendation benchmarks, demonstrating the importance of methods that adapt to the private-public feature separation.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/krichene24a.html
https://proceedings.mlr.press/v238/krichene24a.htmlStructural perspective on constraint-based learning of Markov networksMarkov networks are probabilistic graphical models that employ undirected graphs to depict conditional independence relationships among variables. Our focus lies in constraint-based structure learning, which entails learning the undirected graph from data through the execution of conditional independence tests. We establish theoretical limits concerning two critical aspects of constraint-based learning of Markov networks: the number of tests and the sizes of the conditioning sets. These bounds uncover an exciting interplay between the structural properties of the graph and the amount of tests required to learn a Markov network. The starting point of our work is that the graph parameter maximum pairwise connectivity, $\kappa$, that is, the maximum number of vertex-disjoint paths connecting a pair of vertices in the graph, is responsible for the sizes of independence tests required to learn the graph. On one hand, we show that at least one test with the size of the conditioning set at least $\kappa$ is always necessary. On the other hand, we prove that any graph can be learned by performing tests of size at most $\kappa$. This completely resolves the question of the minimum size of conditioning sets required to learn the graph. When it comes to the number of tests, our upper bound on the sizes of conditioning sets implies that every $n$-vertex graph can be learned by at most $n^{\kappa}$ tests with conditioning sets of sizes at most $\kappa$. We show that for any upper bound q on the sizes of the conditioning sets, there exist graphs with $O(nq)$ vertices that require at least $n^{\Omega(\kappa)}$ tests to learn. This lower bound holds even when the treewidth and the maximum degree of the graph are at most $\kappa+2$. On the positive side, we prove that every graph of bounded treewidth can be learned by a polynomial number of tests with conditioning sets of sizes at most $2*\kappa$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/korhonen24a.html
https://proceedings.mlr.press/v238/korhonen24a.htmlTwo Birds with One Stone: Enhancing Uncertainty Quantification and Interpretability with Graph Functional Neural ProcessGraph neural networks (GNNs) are powerful tools on graph data. However, their predictions are mis-calibrated and lack interpretability, limiting their adoption in critical applications. To address this issue, we propose a new uncertainty-aware and interpretable graph classification model that combines graph functional neural process and graph generative model. The core of our method is to assume a set of latent rationales which can be mapped to a probabilistic embedding space; the predictive distribution of the classifier is conditioned on such rationale embeddings by learning a stochastic correlation matrix. The graph generator serves to decode the graph structure of the rationales from the embedding space for model interpretability. For efficient model training, we adopt an alternating optimization procedure which mimics the well known Expectation-Maximization (EM) algorithm. The proposed method is general and can be applied to any existing GNN architecture. Extensive experiments on five graph classification datasets demonstrate that our framework outperforms state-of-the-art methods in both uncertainty quantification and GNN interpretability. We also conduct case studies to show that the decoded rationale structure can provide meaningful explanations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kong24a.html
https://proceedings.mlr.press/v238/kong24a.htmlBandit Pareto Set Identification: the Fixed Budget SettingWe study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated to an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the “hardness to classify” each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcase the good performance of our algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kone24a.html
https://proceedings.mlr.press/v238/kone24a.htmlSVARM-IQ: Efficient Approximation of Any-order Shapley Interactions through StratificationAddressing the limitations of individual attribution scores via the Shapley value (SV), the field of explainable AI (XAI) has recently explored intricate interactions of features or data points. In particular, extensions of the SV, such as the Shapley Interaction Index (SII), have been proposed as a measure to still benefit from the axiomatic basis of the SV. However, similar to the SV, their exact computation remains computationally prohibitive. Hence, we propose with SVARM-IQ a sampling-based approach to efficiently approximate Shapley-based interaction indices of any order. SVARM-IQ can be applied to a broad class of interaction indices, including the SII, by leveraging a novel stratified representation. We provide non-asymptotic theoretical guarantees on its approximation quality and empirically demonstrate that SVARM-IQ achieves state-of-the-art estimation results in practical XAI scenarios on different model classes and application domains.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kolpaczki24a.html
https://proceedings.mlr.press/v238/kolpaczki24a.htmlConsistency of Dictionary-Based Manifold LearningWe analyze a paradigm for interpretable Manifold Learning for scientific data analysis, whereby one parametrizes a manifold with d smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse regression in the manifold tangent bundle, bypassing more standard, agnostic manifold learning algorithms. We prove conditions for the existence of such parameterizations in function space and the first end to end recovery results from finite samples. The method is demonstrated on both synthetic problems and with data from a real scientific domain.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/koelle24a.html
https://proceedings.mlr.press/v238/koelle24a.htmlFair Soft ClusteringScholars in the machine learning community have recently focused on analyzing the fairness of learning models, including clustering algorithms. In this work we study fair clustering in a probabilistic (soft) setting, where observations may belong to several clusters determined by probabilities. We introduce new probabilistic fairness metrics, which generalize and extend existing non-probabilistic fairness frameworks and propose an algorithm for obtaining a fair probabilistic cluster solution from a data representation known as a fairlet decomposition. Finally, we demonstrate our proposed fairness metrics and algorithm by constructing a fair Gaussian mixture model on three real-world datasets. We achieve this by identifying balanced micro-clusters which minimize the distances induced by the model, and on which traditional clustering can be performed while ensuring the fairness of the solution.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kjaersgaard24a.html
https://proceedings.mlr.press/v238/kjaersgaard24a.htmlA Doubly Robust Approach to Sparse Reinforcement LearningWe propose a new regret minimization algorithm for episodic sparse linear Markov decision process (SMDP) where the state-transition distribution is a linear function of observed features. The only previously known algorithm for SMDP requires the knowledge of the sparsity parameter and oracle access to an unknown policy. We overcome these limitations by combining the doubly robust method that allows one to use feature vectors of \emph{all} actions with a novel analysis technique that enables the algorithm to use data from all periods in all episodes. The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min}s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive the minimum eigenvalue of the average Gram matrix of feature vectors, $s_\star$ is the sparsity parameter, $H$ is the length of an episode, and $N$ is the number of rounds. We provide a lower regret bound that matches the upper bound to logarithmic factors on a newly identified subclass of SMDPs. Our numerical experiments support our theoretical results and demonstrate the superior performance of our algorithm.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kim24c.html
https://proceedings.mlr.press/v238/kim24c.htmlDeepFDR: A Deep Learning-based False Discovery Rate Control Method for Neuroimaging DataVoxel-based multiple testing is widely used in neuroimaging data analysis. Traditional false discovery rate (FDR) control methods often ignore the spatial dependence among the voxel-based tests and thus suffer from substantial loss of testing power. While recent spatial FDR control methods have emerged, their validity and optimality remain questionable when handling the complex spatial dependencies of the brain. Concurrently, deep learning methods have revolutionized image segmentation, a task closely related to voxel-based multiple testing. In this paper, we propose DeepFDR, a novel spatial FDR control method that leverages unsupervised deep learning-based image segmentation to address the voxel-based multiple testing problem. Numerical studies, including comprehensive simulations and Alzheimer’s disease FDG-PET image analysis, demonstrate DeepFDR’s superiority over existing methods. DeepFDR not only excels in FDR control and effectively diminishes the false nondiscovery rate, but also boasts exceptional computational efficiency highly suited for tackling large-scale neuroimaging data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kim24b.html
https://proceedings.mlr.press/v238/kim24b.htmlLinear Convergence of Black-Box Variational Inference: Should We Stick the Landing?We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called “linear”) rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompasses misspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projected stochastic gradient descent. For the projection operator, we consider a domain with triangular scale matrices, which the projection onto is computable in $\theta(d)$ time, where $d$ is the dimensionality of the target posterior. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STL estimator, providing explicit non-asymptotic complexity guarantees for both.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kim24a.html
https://proceedings.mlr.press/v238/kim24a.htmlAnalyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction FunctionsMachine learning methods have significantly improved in their predictive capabilities, but at the same time they are becoming more complex and less transparent. As a result, explainers are often relied on to provide interpretability to these black-box prediction models. As crucial diagnostics tools, it is important that these explainers themselves are robust. In this paper we focus on one particular aspect of robustness, namely that an explainer should give similar explanations for similar data inputs. We formalize this notion by introducing and defining explainer astuteness, analogous to astuteness of prediction functions. Our formalism allows us to connect explainer robustness to the predictor’s probabilistic Lipschitzness, which captures the probability of local smoothness of a function. We provide lower bound guarantees on the astuteness of a variety of explainers (e.g., SHAP, RISE, CXPlain) given the Lipschitzness of the prediction function. These theoretical results imply that locally smooth prediction functions lend themselves to locally robust explanations. We evaluate these results empirically on simulated as well as real datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/khan24a.html
https://proceedings.mlr.press/v238/khan24a.htmlFunctional Flow MatchingWe propose Functional Flow Matching (FFM), a function-space generative model that generalizes the recently-introduced Flow Matching model to operate directly in infinite-dimensional spaces. Our approach works by first defining a path of probability measures that interpolates between a fixed Gaussian measure and the data distribution, followed by learning a vector field on the underlying space of functions that generates this path of measures. Our method does not rely on likelihoods or simulations, making it well-suited to the function space setting. We provide both a theoretical framework for building such models and an empirical evaluation of our techniques. We demonstrate through experiments on synthetic and real-world benchmarks that our proposed FFM method outperforms several recently proposed function-space generative models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kerrigan24a.html
https://proceedings.mlr.press/v238/kerrigan24a.htmlOffline Policy Evaluation and Optimization Under ConfoundingEvaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, we demonstrate the effectiveness of our method and investigate the importance of a clustering sub-routine.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kausik24a.html
https://proceedings.mlr.press/v238/kausik24a.htmlInterpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional DataMany modern causal questions ask how treatments affect complex outcomes that are measured using wearable devices and sensors. Current analysis approaches require summarizing these data into scalar statistics (e.g., the mean), but these summaries can be misleading. For example, disparate distributions can have the same means, variances, and other statistics. Researchers can overcome the loss in information by instead representing the data as distributions. We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision making: Analyzing Distributional Data via Matching After Learning to Stretch (ADD MALTS). We (i) provide analytical guarantees of the correctness of our estimation strategy, (ii) demonstrate via simulation that ADD MALTS outperforms other distributional data analysis methods at estimating treatment effects, and (iii) illustrate ADD MALTS’ ability to verify whether there is enough cohesion between treatment and control units within subpopulations to trustworthily estimate treatment effects. We demonstrate ADD MALTS’ utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/katta24a.html
https://proceedings.mlr.press/v238/katta24a.htmlFirst Passage Percolation with Queried HintsSolving optimization problems leads to elegant and practical solutions in a wide variety of real-world applications. In many of those real-world applications, some of the information required to specify the relevant optimization problem is noisy, uncertain, and expensive to obtain. In this work, we study how much of that information needs to be queried in order to obtain an approximately optimal solution to the relevant problem. In particular, we focus on the shortest path problem in graphs with dynamic edge costs. We adopt the {\em first passage percolation} model from probability theory wherein a graph $G’$ is derived from a weighted base graph $G$ by multiplying each edge weight by an independently chosen, random number in $[1, \rho]$. Mathematicians have studied this model extensively when $G$ is a $d$-dimensional grid graph, but the behavior of shortest paths in this model is still poorly understood in general graphs. We make progress in this direction for a class of graphs that resemble real-world road networks. Specifically, we prove that if $G$ has a constant continuous doubling dimension, then for a given $s-t$ pair, we only need to probe the weights on $((\rho \log n )/ \epsilon)^{O(1)}$ edges in $G’$ in order to obtain a $(1 + \epsilon)$-approximation to the $s-t$ distance in $G’$. We also generalize the result to a correlated setting and demonstrate experimentally that probing improves accuracy in estimating $s-t$ distances.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/karntikoon24a.html
https://proceedings.mlr.press/v238/karntikoon24a.htmlA Unified Framework for Discovering Discrete SymmetriesWe consider the problem of learning a function respecting a symmetry from among a class of symmetries. We develop a unified framework that enables symmetry discovery across a broad range of subgroups including locally symmetric, dihedral and cyclic subgroups. At the core of the framework is a novel architecture composed of linear, matrix-valued and non-linear functions that expresses functions invariant to these subgroups in a principled manner. The structure of the architecture enables us to leverage multi-armed bandit algorithms and gradient descent to efficiently optimize over the linear and the non-linear functions, respectively, and to infer the symmetry that is ultimately learnt. We also discuss the necessity of the matrix-valued functions in the architecture. Experiments on image-digit sum and polynomial regression tasks demonstrate the effectiveness of our approach.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/karjol24a.html
https://proceedings.mlr.press/v238/karjol24a.htmlLearning the Pareto Set Under Incomplete Preferences: Pure Exploration in Vector BanditsWe study pure exploration in bandit problems with vector-valued rewards, where the goal is to (approximately) identify the Pareto set of arms given incomplete preferences induced by a polyhedral convex cone. We address the open problem of designing sample-efficient learning algorithms for such problems. We propose Pareto Vector Bandits (PaVeBa), an adaptive elimination algorithm that nearly matches the gap-dependent and worst-case lower bounds on the sample complexity of $(\epsilon, \delta)$-PAC Pareto set identification. Finally, we provide an in-depth numerical investigation of PaVeBa and its heuristic variants by comparing them with the state-of-the-art multi-objective and vector optimization algorithms on several real-world datasets with conflicting objectives.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/karagozlu24a.html
https://proceedings.mlr.press/v238/karagozlu24a.htmlIdentifiability of Product of Experts ModelsProduct of experts (PoE) are layered networks in which the value at each node is an AND (or product) of the values (possibly negated) at its inputs. These were introduced as a neural network architecture that can efficiently learn to generate high-dimensional data which satisfy many low-dimensional constraints-thereby allowing each individual expert to perform a simple task. PoEs have found a variety of applications in learning. We study the problem of identifiability of a product of experts model having a layer of binary latent variables, and a layer of binary observables that are iid conditional on the latents. The previous best upper bound on the number of observables needed to identify the model was exponential in the number of parameters. We show: (a) When the latents are uniformly distributed, the model is identifiable with a number of observables equal to the number of parameters (and hence best possible). (b) In the more general case of arbitrarily distributed latents, the model is identifiable for a number of observables that is still linear in the number of parameters (and within a factor of two of best-possible). The proofs rely on root interlacing phenomena for some special three-term recurrences.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kant24a.html
https://proceedings.mlr.press/v238/kant24a.htmlLearning to Rank for Optimal Treatment Allocation Under Resource ConstraintsCurrent causal inference approaches for estimating conditional average treatment effects (CATEs) often prioritize accuracy. However, in resource constrained settings, decision makers may only need a ranking of individuals based on their estimated CATE. In these scenarios, exact CATE estimation may be an unnecessarily challenging task, particularly when the underlying function is difficult to learn. In this work, we study the relationship between CATE estimation and optimizing for CATE ranking, demonstrating that optimizing for ranking may be more appropriate than optimizing for accuracy in certain settings. Guided by our analysis, we propose an approach to directly optimize for rankings of individuals to inform treatment assignment that aims to maximize benefit. Our tree-based approach maximizes the expected benefit of the treatment assignment using a novel splitting criteria. In an empirical case-study across synthetic datasets, our approach leads to better treatment assignments compared to CATE estimation methods as measured by expected total benefit. By providing a practical and efficient approach to learning a CATE ranking, this work offers an important step towards bridging the gap between CATE estimation techniques and their downstream applications.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kamran24a.html
https://proceedings.mlr.press/v238/kamran24a.htmlDifferentially Private Conditional Independence TestingConditional independence (CI) tests are widely used in statistical data analysis, e.g., they are the building block of many algorithms for causal graph discovery. The goal of a CI test is to accept or reject the null hypothesis that $X \perp \!\!\! \perp Y \mid Z$, where $X \in \mathbb{R}, Y \in \mathbb{R}, Z \in \mathbb{R}^d$. In this work, we investigate conditional independence testing under the constraint of differential privacy. We design two private CI testing procedures: one based on the generalized covariance measure of Shah and Peters (2020) and another based on the conditional randomization test of Cand{è}s et al. (2016) (under the model-X assumption). We provide theoretical guarantees on the performance of our tests and validate them empirically. These are the first private CI tests with rigorous theoretical guarantees that work for the general case when $Z$ is continuous.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kalemaj24a.html
https://proceedings.mlr.press/v238/kalemaj24a.htmlAsynchronous Randomized Trace EstimationRandomized trace estimation is a popular technique to approximate the trace of an implicitly-defined matrix $A$ by averaging the quadratic form $x’Ax$ across several samples of a random vector $x$. This paper focuses on the application of randomized trace estimators on asynchronous computing environments where the quadratic form $x’Ax$ is computed partially by observing only a random row subset of $A$ for each sample of the random vector $x$. Our asynchronous framework treats the number of rows, as well as the row subset observed for each sample, as random variables, and our theoretical analysis establishes the variance of the randomized estimator for Rademacher and Gaussian samples. We also present error analysis and sampling complexity bounds for the proposed asynchronous randomized trace estimator. Our numerical experiments illustrate that the asynchronous variant can be competitive even when a small number of rows is updated per each sample.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kalantzis24a.html
https://proceedings.mlr.press/v238/kalantzis24a.htmlShape Arithmetic Expressions: Advancing Scientific Discovery Beyond Closed-Form EquationsSymbolic regression has excelled in uncovering equations from physics, chemistry, biology, and related disciplines. However, its effectiveness becomes less certain when applied to experimental data lacking inherent closed-form expressions. Empirically derived relationships, such as entire stress-strain curves, may defy concise closed-form representation, compelling us to explore more adaptive modeling approaches that balance flexibility with interpretability. In our pursuit, we turn to Generalized Additive Models (GAMs), a widely used class of models known for their versatility across various domains. Although GAMs can capture non-linear relationships between variables and targets, they cannot capture intricate feature interactions. In this work, we investigate both of these challenges and propose a novel class of models, Shape Arithmetic Expressions (SHAREs), that fuses GAM’s flexible shape functions with the complex feature interactions found in mathematical expressions. SHAREs also provide a unifying framework for both of these approaches. We also design a set of rules for constructing SHAREs that guarantee transparency of the found expressions beyond the standard constraints based on the model’s size.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/kacprzyk24a.html
https://proceedings.mlr.press/v238/kacprzyk24a.htmlData-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over QuantityContrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP’s performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance.Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by \textsc{ClipCov} achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy across 11 downstream datasets, of the next best baseline. The code is available at: \url{https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/joshi24a.html
https://proceedings.mlr.press/v238/joshi24a.htmlIndependent Learning in Constrained Markov Potential GamesConstrained Markov games offer a formal mathematical framework for modeling multi-agent reinforcement learning problems where the behavior of the agents is subject to constraints. In this work, we focus on the recently introduced class of constrained Markov Potential Games. While centralized algorithms have been proposed for solving such constrained games, the design of converging independent learning algorithms tailored for the constrained setting remains an open question. We propose an independent policy gradient algorithm for learning approximate constrained Nash equilibria: Each agent observes their own actions and rewards, along with a shared state. Inspired by the optimization literature, our algorithm performs proximal-point-like updates augmented with a regularized constraint set. Each proximal step is solved inexactly using a stochastic switching gradient algorithm. Notably, our algorithm can be implemented independently without a centralized coordination mechanism requiring turn-based agent updates. Under some technical constraint qualification conditions, we establish convergence guarantees towards constrained approximate Nash equilibria. We perform simulations to illustrate our results.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jordan24a.html
https://proceedings.mlr.press/v238/jordan24a.htmlGenerating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted TreesTabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html
https://proceedings.mlr.press/v238/jolicoeur-martineau24a.htmlFairRR: Pre-Processing for Group Fairness through Randomized ResponseThe increasing usage of machine learning models in consequential decision-making processes has spurred research into the fairness of these systems. While significant work has been done to study group fairness in the in-processing and post-processing setting, there has been little that theoretically connects these results to the pre-processing domain. This paper extends recent fair statistical learning results and proposes that achieving group fairness in downstream models can be formulated as finding the optimal design matrix in which to modify a response variable in a Randomized Response framework. We show that measures of group fairness can be directly controlled for with optimal model utility, proposing a pre-processing algorithm called FairRR that yields excellent downstream model utility and fairness.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/john-ward24a.html
https://proceedings.mlr.press/v238/john-ward24a.htmlFeasible $Q$-Learning for Average Reward Reinforcement LearningAverage reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e. long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $Q$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $Q$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $\epsilon$-close to optimal, (ii) estimate optimal average reward with $\epsilon$-accuracy, and (iii) estimate the bias function (similar to $Q$-function in discounted case) with $\epsilon$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$ samples, (ii) with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$ samples, and (iii) with $\tilde{O}(\frac{SA B}{\epsilon^9})$ samples, where $t_\mathrm{mix}$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $S, A, t_{\mathrm{mix}}, \epsilon$ for a feasible variant of $Q$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community’s best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jin24b.html
https://proceedings.mlr.press/v238/jin24b.htmlSubsampling Error in Stochastic Gradient Langevin DiffusionsThe Stochastic Gradient Langevin Dynamics (SGLD) are popularly used to approximate Bayesian posterior distributions in statistical learning procedures with large-scale data. As opposed to many usual Markov chain Monte Carlo (MCMC) algorithms, SGLD is not stationary with respect to the posterior distribution; two sources of error appear: The first error is introduced by an Euler–Maruyama discretisation of a Langevin diffusion process, the second error comes from the data subsampling that enables its use in large-scale data settings. In this work, we consider an idealised version of SGLD to analyse the method’s pure subsampling error that we then see as a best-case error for diffusion-based subsampling MCMC methods. Indeed, we introduce and study the Stochastic Gradient Langevin Diffusion (SGLDiff), a continuous-time Markov process that follows the Langevin diffusion corresponding to a data subset and switches this data subset after exponential waiting times. There, we show the exponential ergodicity of SLGDiff and that the Wasserstein distance between the posterior and the limiting distribution of SGLDiff is bounded above by a fractional power of the mean waiting time. We bring our results into context with other analyses of SGLD.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jin24a.html
https://proceedings.mlr.press/v238/jin24a.htmlKrylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence RateSecond-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem’s dimension $d$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $\mathcal{O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems. Here, $m$ represents the subspace dimension, which can be significantly smaller than $d$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the \emph{Krylov subspace} associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jiang24a.html
https://proceedings.mlr.press/v238/jiang24a.htmlFedFisher: Leveraging Fisher Information for One-Shot Federated LearningStandard federated learning (FL) algorithms typically require multiple rounds of communication between the server and the clients, which has several drawbacks, including requiring constant network connectivity, repeated investment of computational resources, and susceptibility to privacy attacks. One-Shot FL is a new paradigm that aims to address this challenge by enabling the server to train a global model in a single round of communication. In this work, we present FedFisher, a novel algorithm for one-shot FL that makes use of Fisher information matrices computed on local client models, motivated by a Bayesian perspective of FL. First, we theoretically analyze FedFisher for two-layer over-parameterized ReLU neural networks and show that the error of our one-shot FedFisher global model becomes vanishingly small as the width of the neural networks and amount of local training at clients increases. Next, we propose practical variants of FedFisher using the diagonal Fisher and K-FAC approximation for the full Fisher and highlight their communication and compute efficiency for FL. Finally, we conduct extensive experiments on various datasets, which show that these variants of FedFisher consistently improve over competing baselines.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jhunjhunwala24a.html
https://proceedings.mlr.press/v238/jhunjhunwala24a.htmlOn-Demand Federated Learning for Arbitrary Target Class DistributionsWe introduce On-Demand Federated Learning (On-Demand FL), which enables on-demand federated learning of a deep model for an arbitrary target data distribution of interest by making the best use of the heterogeneity (non-IID-ness) of local client data, unlike existing approaches trying to circumvent the non-IID nature of federated learning. On-Demand FL composes a dataset of the target distribution, which we call the composite dataset, from a selected subset of local clients whose aggregate distribution is expected to emulate the target distribution as a whole. As the composite dataset consists of a precise yet diverse subset of clients reflecting the target distribution, the on-demand model trained with exactly enough selected clients becomes able to improve the model performance on the target distribution compared when trained with off-target and/or unknown distributions while reducing the number of participating clients and federating rounds. We model the target data distribution in terms of class and estimate the class distribution of each local client from the weight gradient of its local model. Our experiment results show that On-Demand FL achieves up to 5% higher classification accuracy on various target distributions just involving 9${\times}$ fewer clients with FashionMNIST, CIFAR-10, and CIFAR-100.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jeong24a.html
https://proceedings.mlr.press/v238/jeong24a.htmlQuantifying intrinsic causal contributions via structure preserving interventionsWe propose a notion of causal influence that describes the ‘intrinsic’ part of the contribution of a node on a target node in a DAG. By recursively writing each node as a function of the upstream noise terms, we separate the intrinsic information added by each node from the one obtained from its ancestors. To interpret the intrinsic information as a causal contribution, we consider ‘structure-preserving interventions’ that randomize each node in a way that mimics the usual dependence on the parents and does not perturb the observed joint distribution. To get a measure that is invariant across arbitrary orderings of nodes we use Shapley based symmetrization and show that it reduces in the linear case to simple ANOVA after resolving the target node into noise variables. We describe our contribution analysis for variance and entropy, but contributions for other target metrics can be defined analogously.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/janzing24a.html
https://proceedings.mlr.press/v238/janzing24a.htmlExploration via linearly perturbed loss minimisationWe introduce \emph{exploration via linear loss perturbations} (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/janz24a.html
https://proceedings.mlr.press/v238/janz24a.htmlReparameterized Variational Rejection SamplingTraditional approaches to variational inference rely on parametric families of variational distributions, with the choice of family playing a critical role in determining the accuracy of the resulting posterior approximation. Simple mean-field families often lead to poor approximations, while rich families of distributions like normalizing flows can be difficult to optimize and usually do not incorporate the known structure of the target distribution due to their black-box nature. To expand the space of flexible variational families, we revisit Variational Rejection Sampling (VRS) [Grover et al., 2018], which combines a parametric proposal distribution with rejection sampling to define a rich non-parametric family of distributions that explicitly utilizes the known target distribution. By introducing a low-variance reparameterized gradient estimator for the parameters of the proposal distribution, we make VRS an attractive inference strategy for models with continuous latent variables. We argue theoretically and demonstrate empirically that the resulting method–Reparameterized Variational Rejection Sampling (RVRS)–offers an attractive trade-off between computational cost and inference fidelity. In experiments we show that our method performs well in practice and that it is well suited for black-box inference, especially for models with local latent variables.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jankowiak24a.html
https://proceedings.mlr.press/v238/jankowiak24a.htmlAchieving Fairness through Separability: A Unified Framework for Fair Representation LearningFairness is a growing concern in machine learning as state-of-the-art models may amplify social prejudice by making biased predictions against specific demographics such as race and gender. Such discrimination raises issues in various fields such as employment, criminal justice, and trust score evaluation. To address the concerns, we propose learning fair representation through a straightforward yet effective approach to project intrinsic information while filtering sensitive information for downstream tasks. Our model consists of two goals: one is to ensure that the latent data from different demographic groups is non-separable (i.e., make the latent data distribution independent of the sensitive feature to improve fairness); the other is to maximize the separability of latent data from different classes (i.e., maintain the discriminative power of data for the sake of the downstream tasks like classification). Our method adopts a non-zero-sum adversarial game to minimize the distance between data from different demographic groups while maximizing the margin between data from different classes. Moreover, the proposed objective function can be easily generalized to multiple sensitive attributes and multi-class scenarios as it upper bounds popular fairness metrics in these cases. We provide theoretical analysis of the fairness of our model and validate w.r.t. both fairness and predictive performance on benchmark datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jang24a.html
https://proceedings.mlr.press/v238/jang24a.htmlEfficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing SystemsWe consider the problem of efficiently routing jobs that arrive into a central queue to a system of heterogeneous servers. Unlike homogeneous systems, a threshold policy, that routes jobs to the slow server(s) when the queue length exceeds a certain threshold, is known to be optimal for the one-fast-one-slow two-server system. But an optimal policy for the multi-server system is unknown and non-trivial to find. While Reinforcement Learning (RL) has been recognized to have great potential for learning policies in such cases, our problem has an exponentially large state space size, rendering standard RL inefficient. In this work, we propose ACHQ, an efficient policy gradient based algorithm with a low dimensional soft threshold policy parameterization that leverages the underlying queueing structure. We provide stationary-point convergence guarantees for the general case and despite the low-dimensional parameterization prove that ACHQ converges to an approximate global optimum for the special case of two servers. Simulations demonstrate an improvement in expected response time of up to ${\sim}30%$ over the greedy policy that routes to the fastest available server.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jali24a.html
https://proceedings.mlr.press/v238/jali24a.htmlFrom Data Imputation to Data Cleaning — Automated Cleaning of Tabular Data Improves Downstream Predictive PerformanceThe translation of Machine Learning (ML) research innovations to real-world applications and the maintenance of ML components are hindered by reoccurring challenges, such as reaching high predictive performance, robustness, complying with regulatory constraints, or meeting ethical standards. Many of these challenges are related to data quality and, in particular, to the lack of automation in data pipelines upstream of ML components. Automated data cleaning remains challenging since many approaches neglect the dependency structure of the data errors and require task-specific heuristics or human input for calibration. In this study, we develop and evaluate an application-agnostic ML-based data cleaning approach using well-established imputation techniques for automated detection and cleaning of erroneous values. To improve the degree of automation, we combine imputation techniques with conformal prediction (CP), a model-agnostic and distribution-free method to quantify and calibrate the uncertainty of ML models. Extensive empirical evaluations demonstrate that Conformal Data Cleaning (CDC) improves predictive performance in downstream ML tasks in the majority of cases. Our code is available on GitHub: \url{https://github.com/se-jaeger/conformal-data-cleaning}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jager24a.html
https://proceedings.mlr.press/v238/jager24a.htmlProvable local learning rule by expert aggregation for a Hawkes networkWe propose a simple network of Hawkes processes as a cognitive model capable of learning to classify objects. Our learning algorithm, named HAN for Hawkes Aggregation of Neurons, is based on a local synaptic learning rule based on spiking probabilities at each output node. We were able to use local regret bounds to prove mathematically that the network is able to learn on average and even asymptotically under more restrictive assumptions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jaffard24a.html
https://proceedings.mlr.press/v238/jaffard24a.htmlA Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary OpponentIn this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves Bayesian regret bound of $\tilde\mathcal{O}(HS\sqrt{AT})$ in the infinite-horizon zero-sum stochastic games with average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent can not be controlled and can take any arbitrary time-adaptive history-dependent strategy. Our regret bound improves on the best existing regret bound of $\tilde\mathcal{O}(\sqrt[3]{DS^2AT^2})$ by Wei et al., (2017) under the same assumption and matches the theoretical lower bound in $T$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/jafarnia-jahromi24a.html
https://proceedings.mlr.press/v238/jafarnia-jahromi24a.htmlRisk Seeking Bayesian Optimization under Uncertainty for Obtaining ExtremumReal-world black-box optimization tasks often focus on obtaining the best reward, which includes an intrinsic random quantity from uncontrollable environmental factors. For this problem, we formulate a novel risk-seeking optimization problem whose aim is to obtain the best possible reward within a fixed budget under uncontrollable factors. We consider two settings: (1) environmental model setting for the case that we can observe uncontrollable environmental variables that affect the observation as the input of a target function, and (2) heteroscedastic model setting for the case that any uncontrollable variables cannot be observed. We propose a novel Bayesian optimization method called kernel explore-then-commit (kernel-ETC) and provide the regret upper bound for both settings. We demonstrate the effectiveness of kernel-ETC through several numerical experiments, including the hyperparameter tuning task and the simulation function derived from polymer synthesis real data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/iwazaki24a.html
https://proceedings.mlr.press/v238/iwazaki24a.htmlAsGrad: A Sharp Unified Analysis of Asynchronous-SGD AlgorithmsWe analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds, as well as data distribution. In these algorithms, workers compute possibly stale and stochastic gradients associated with their local data at some iteration back in history and then return those gradients to the server without synchronizing with other workers. We present a unified convergence theory for non-convex smooth functions in the heterogeneous regime. The proposed analysis provides convergence for pure asynchronous SGD and its various modifications. Moreover, our theory explains what affects the convergence rate and what can be done to improve the performance of asynchronous algorithms. In particular, we introduce a novel asynchronous method based on worker shuffling. As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD. The derived rates match the best-known results for those algorithms, highlighting the tightness of our approach. Finally, our numerical evaluations support theoretical findings and show the good practical performance of our method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/islamov24a.html
https://proceedings.mlr.press/v238/islamov24a.htmlAdaptive Compression in Federated Learning via Side InformationThe high communication cost of sending model updates from the clients to the server is a significant bottleneck for scalable federated learning (FL). Among existing approaches, state-of-the-art bitrate-accuracy tradeoffs have been achieved using stochastic compression methods – in which the client n sends a sample from a client-only probability distribution $q_{\phi^{(n)}}$, and the server estimates the mean of the clients’ distributions using these samples. However, such methods do not take full advantage of the FL setup where the server, throughout the training process, has side information in the form of a global distribution $p_{\theta}$ that is close to the client-only distribution $q_{\phi^{(n)}}$ in Kullback-Leibler (KL) divergence. In this work, we exploit this \emph{closeness} between the clients’ distributions $q_{\phi^{(n)}}$’s and the side information $p_{\theta}$ at the server, and propose a framework that requires approximately $D_{KL}(q_{\phi^{(n)}}|| p_{\theta})$ bits of communication. We show that our method can be integrated into many existing stochastic compression frameworks to attain the same (and often higher) test accuracy with up to 82 times smaller bitrate than the prior work – corresponding to 2,650 times overall compression.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/isik24a.html
https://proceedings.mlr.press/v238/isik24a.htmlBounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input UncertaintyIn this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions under the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrary-accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm that the proposed method performs as well or better than existing methods not only in the setting with IU but also in the setting of ordinary MOBO through numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/inatsu24a.html
https://proceedings.mlr.press/v238/inatsu24a.htmlUnderstanding Inverse Scaling and Emergence in Multitask Representation LearningLarge language models exhibit strong multitasking capabilities, however, their learning dynamics as a function of task characteristics, sample size, and model complexity remain mysterious. For instance, it is known that, as the model size grows, large language models exhibit emerging abilities where certain tasks can abruptly jump from poor to respectable performance. Such phenomena motivate a deeper understanding of how individual tasks evolve during multitasking. To this aim, we study a multitask representation learning setup where tasks can have distinct distributions, quantified by their covariance priors. Through random matrix theory, we precisely characterize the optimal linear representation for few-shot learning that minimizes the average test risk in terms of task covariances. When tasks have equal sample sizes, we prove a reduction to an equivalent problem with a single effective covariance from which the individual task risks of the original problem can be deduced. Importantly, we introduce “task competition” to explain how tasks with dominant covariance eigenspectrum emerge faster than others. We show that task competition can potentially explain the inverse scaling of certain tasks i.e. reduced test accuracy as the model grows. Overall, this work sheds light on the risk and emergence of individual tasks and uncovers new high-dimensional phenomena (including multiple-descent risk curves) that arise in multitask representation learning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ildiz24a.html
https://proceedings.mlr.press/v238/ildiz24a.htmlLearning Dynamics in Linear VAE: Posterior Collapse Threshold, Superfluous Latent Space Pitfalls, and Speedup with KL AnnealingVariational autoencoders (VAEs) face a notorious problem wherein the variational posterior often aligns closely with the prior, a phenomenon known as posterior collapse, which hinders the quality of representation learning. To mitigate this problem, an adjustable hyperparameter $\beta$ and a strategy for annealing this parameter, called KL annealing, are proposed. This study presents a theoretical analysis of the learning dynamics in a minimal VAE. It is rigorously proved that the dynamics converge to a deterministic process within the limit of large input dimensions, thereby enabling a detailed dynamical analysis of the generalization error. Furthermore, the analysis shows that the VAE initially learns entangled representations and gradually acquires disentangled representations. A fixed-point analysis of the deterministic process reveals that when $\beta$ exceeds a certain threshold, posterior collapse becomes inevitable regardless of the learning period. Additionally, the superfluous latent variables for the data-generative factors lead to overfitting of the background noise; this adversely affects both generalization and learning convergence. The analysis further unveiled that appropriately tuned KL annealing can accelerate convergence.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ichikawa24a.html
https://proceedings.mlr.press/v238/ichikawa24a.htmlEnd-to-end Feature Selection Approach for Learning Skinny TreesWe propose a new optimization-based approach for feature selection in tree ensembles, an important problem in statistics and machine learning. Popular tree ensemble toolkits e.g., Gradient Boosted Trees and Random Forests support feature selection post-training based on feature importance scores, while very popular, they are known to have drawbacks. We propose Skinny Trees: an end-to-end toolkit for feature selection in tree ensembles where we train a tree ensemble while controlling the number of selected features. Our optimization-based approach learns an ensemble of differentiable trees, and simultaneously performs feature selection using a grouped $\ell_0$-regularizer. We use first-order methods for optimization and present convergence guarantees for our approach. We use a dense-to-sparse regularization scheduling scheme that can lead to more expressive and sparser tree ensembles. On 15 synthetic and real-world datasets, Skinny Trees can achieve $1.5{\times}$–$620{\times}$ feature compression rates, leading up to $10{\times}$ faster inference over dense trees, without any loss in performance. Skinny Trees lead to superior feature selection than many existing toolkits e.g., in terms of AUC performance for 25% feature budget, Skinny Trees outperforms LightGBM by 10.2% (up to 37.7%), and Random Forests by 3% (up to 12.5%).Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ibrahim24a.html
https://proceedings.mlr.press/v238/ibrahim24a.htmlAdaptive Experiment Design with Synthetic ControlsClinical trials are typically run in order to understand the effects of a new treatment on a given population of patients. However, patients in large populations rarely respond the same way to the same treatment. This heterogeneity in patient responses necessitates trials that investigate effects on multiple subpopulations—especially when a treatment has marginal or no benefit for the overall population but might have significant benefit for a particular subpopulation. Motivated by this need, we propose Syntax, an exploratory trial design that identifies subpopulations with positive treatment effect among many subpopulations. Syntax is sample efficient as it (i) recruits and allocates patients adaptively and (ii) estimates treatment effects by forming synthetic controls for each subpopulation that combines control samples from other subpopulations. We validate the performance of Syntax and provide insights into when it might have an advantage over conventional trial designs through experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/huyuk24a.html
https://proceedings.mlr.press/v238/huyuk24a.htmlDAGnosis: Localized Identification of Data Inconsistencies using StructuresIdentification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the training set’s features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, as well as (2) provides more detailed insights into why some samples are flagged.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/huynh24a.html
https://proceedings.mlr.press/v238/huynh24a.htmlDirectional Optimism for Safe Linear BanditsThe safe linear bandit problem is a version of the classical stochastic linear bandit problem where the learner’s actions must satisfy an uncertain constraint at all rounds. Due its applicability to many real-world settings, this problem has received considerable attention in recent years. By leveraging a novel approach that we call directional optimism, we find that it is possible to achieve improved regret guarantees for both well-separated problem instances and action sets that are finite star convex sets. Furthermore, we propose a novel algorithm for this setting that improves on existing algorithms in terms of empirical performance, while enjoying matching regret guarantees. Lastly, we introduce a generalization of the safe linear bandit setting where the constraints are convex and adapt our algorithms and analyses to this setting by leveraging a novel convex-analysis based approach.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hutchinson24a.html
https://proceedings.mlr.press/v238/hutchinson24a.htmlParameter-Agnostic Optimization under Relaxed SmoothnessTuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hubler24a.html
https://proceedings.mlr.press/v238/hubler24a.htmlAdaptive Federated Minimax Optimization with Lower ComplexitiesFederated learning is a popular distributed and privacy-preserving learning paradigm in machine learning. Recently, some federated learning algorithms have been proposed to solve the distributed minimax problems. However, these federated minimax algorithms still suffer from high gradient or communication complexity. Meanwhile, few algorithm focuses on using adaptive learning rate to accelerate these algorithms. To fill this gap, in the paper, we study a class of nonconvex minimax optimization, and propose an efficient adaptive federated minimax optimization algorithm (i.e., AdaFGDA) to solve these distributed minimax problems. Specifically, our AdaFGDA builds on the momentum-based variance reduced and local-SGD techniques, and it can flexibly incorporate various adaptive learning rates by using the unified adaptive matrices. Theoretically, we provide a solid convergence analysis framework for our AdaFGDA algorithm under non-i.i.d. setting. Moreover, we prove our AdaFGDA algorithm obtains a lower gradient (i.e., stochastic first-order oracle, SFO) complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding $\epsilon$-stationary point of the nonconvex minimax problems. Experimentally, we conduct some experiments on the deep AUC maximization and robust neural network training tasks to verify efficiency of our algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/huang24c.html
https://proceedings.mlr.press/v238/huang24c.htmlHorizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function ApproximationTo tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed as UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent}, since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analysis: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/huang24b.html
https://proceedings.mlr.press/v238/huang24b.htmlOn the Statistical Efficiency of Mean-Field Reinforcement Learning with General Function ApproximationIn this paper, we study the fundamental statistical efficiency of Reinforcement Learning in Mean-Field Control (MFC) and Mean-Field Game (MFG) with general model-based function approximation. We introduce a new concept called Mean-Field Model-Based Eluder Dimension (MF-MBED), which characterizes the inherent complexity of mean-field model classes. We show that a rich family of Mean-Field RL problems exhibits low MF-MBED. Additionally, we propose algorithms based on maximal likelihood estimation, which can return an $\epsilon$-optimal policy for MFC or an $\epsilon$-Nash Equilibrium policy for MFG. The overall sample complexity depends only polynomially on MF-MBED, which is potentially much lower than the size of state-action space. Compared with previous works, our results only require the minimal assumptions including realizability and Lipschitz continuity.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/huang24a.html
https://proceedings.mlr.press/v238/huang24a.htmlExtragradient Type Methods for Riemannian Variational Inequality ProblemsIn this work, we consider monotone Riemannian Variational Inequality Problems (RVIPs), which encompass both Riemannian convex optimization and minimax optimization as particular cases. In Euclidean space, the last-iterates of both the extragradient (EG) and past extragradient (PEG) methods converge to the solution of monotone variational inequality problems at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$ (Cai et al., 2022). However, analogous behavior on Riemannian manifolds remains open. To bridge this gap, we introduce the Riemannian extragradient (REG) and Riemannian past extragradient (RPEG) methods. We demonstrate that both exhibit $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence and $O\left(\frac{1}{{T}}\right)$ average-iterate convergence, aligning with observations in the Euclidean case. These results are enabled by judiciously addressing the holonomy effect so that additional complications in Riemannian cases can be reduced and the Euclidean proof inspired by the performance estimation problem (PEP) technique or the sum-of-squares (SOS) technique can be applied again.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hu24c.html
https://proceedings.mlr.press/v238/hu24c.htmlCentral Limit Theorem for Two-Timescale Stochastic Approximation with Markovian Noise: Theory and ApplicationsTwo-timescale stochastic approximation (TTSA) is among the most general frameworks for iterative stochastic algorithms. This includes well-known stochastic optimization methods such as SGD variants and those designed for bilevel or minimax problems, as well as reinforcement learning like the family of gradient-based temporal difference (GTD) algorithms. In this paper, we conduct an in-depth asymptotic analysis of TTSA under controlled Markovian noise via central limit theorem (CLT), uncovering the coupled dynamics of TTSA influenced by the underlying Markov chain, which has not been addressed by previous CLT results of TTSA only with Martingale difference noise. Building upon our CLT, we expand its application horizon of efficient sampling strategies from vanilla SGD to a wider TTSA context in distributed learning, thus broadening the scope of Hu et al. 2020. In addition, we leverage our CLT result to deduce the statistical properties of GTD algorithms with nonlinear function approximation using Markovian samples and show their identical asymptotic performance, a perspective not evident from current finite-time bounds.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hu24b.html
https://proceedings.mlr.press/v238/hu24b.htmlPixel-wise Smoothing for Certified Robustness against Camera Motion PerturbationsDeep learning-based visual perception models lack robustness when faced with camera motion perturbations in practice. The current certification process for assessing robustness is costly and time-consuming due to the extensive number of image projections required for Monte Carlo sampling in the 3D camera motion space. To address these challenges, we present a novel, efficient, and practical framework for certifying the robustness of 3D-2D projective transformations against camera motion perturbations. Our approach leverages a smoothing distribution over the 2D-pixel space instead of in the 3D physical space, eliminating the need for costly camera motion sampling and significantly enhancing the efficiency of robustness certifications. With the pixel-wise smoothed classifier, we are able to fully upper bound the projection errors using a technique of uniform partitioning in camera motion space. Additionally, we extend our certification framework to a more general scenario where only a single-frame point cloud is required in the projection oracle. Through extensive experimentation, we validate the trade-off between effectiveness and efficiency enabled by our proposed method. Remarkably, our approach achieves approximately 80% certified accuracy while utilizing only 30% of the projected image frames.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hu24a.html
https://proceedings.mlr.press/v238/hu24a.htmlSurrogate Bayesian Networks for Approximating Evolutionary GamesSpatial evolutionary games are used to model large systems of interacting agents. In earlier work, a method was developed using Bayesian Networks to approximate the population dynamics in these games. One of the advantages of the Bayesian Network modeling approach is that it is possible to smoothly adjust the size of the network to get more accurate approximations. However, scaling the method up can be intractable if the number of strategies in the evolutionary game increases. In this paper, we propose a new method for computing more accurate approximations by using surrogate Bayesian Networks. Instead of computing inference on larger networks directly, we perform inference on a much smaller surrogate network extended with parameters that exploit the symmetry inherent to the domain. We learn the parameters on the surrogate network using KL-divergence as the loss function. We illustrate the value of this method empirically through a comparison on several evolutionary games.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hsiao24a.html
https://proceedings.mlr.press/v238/hsiao24a.htmlFast 1-Wasserstein distance approximations using greedy strategiesAmong numerous linear approximation methods proposed for optimal transport (OT), tree-based methods appear to be fairly reliable, notably for language processing applications. Inspired by these tree methods, we introduce several greedy heuristics aiming to compute even faster approximations of OT. We first explicitly establish the equivalence between greedy matching and optimal transport for tree metrics, and then we show that tree greedy matching can be reduced to greedy matching on a one-dimensional line. Next, we propose two new greedy-based algorithms in one dimension: the $k$-Greedy and 1D-ICT algorithms. This novel approach provides Wasserstein approximations with accuracy similar to the original tree methods on text datasets while being faster in practice. Finally, these algorithms are applicable beyond tree approximations: using sliced projections of the original data still provides fairly good accuracy while eliminating the need for embedding the data in a fixed and rigid tree structure. This property makes these approaches even more versatile than the original tree OT methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/houry24a.html
https://proceedings.mlr.press/v238/houry24a.htmlBenefits of Non-Linear Scale Parameterizations in Black Box Variational Inference through Smoothness Results and Gradient Variance BoundsBlack box variational inference has consistently produced impressive empirical results. Convergence guarantees require that the variational objective exhibits specific structural properties and that the noise of the gradient estimator can be controlled. In this work we study the smoothness and the variance of the gradient estimator for location-scale variational families with non-linear covariance parameterizations. Specifically, we derive novel theoretical results for the popular exponential covariance parameterization and tighter gradient variance bounds for the softplus parameterization. These results reveal the benefits of using non-linear scale parameterizations on large scale datasets. With a non-linear scale parameterization, the smoothness constant of the variational objective and the upper bound on the gradient variance decrease as the scale parameter becomes smaller. Learning posterior approximations with small scales is essential in Bayesian statistics with sufficient amount of data, since under appropriate assumptions, the posterior distribution is known to contract around the parameter of interest as the sample size increases. We validate our theoretical findings through empirical analysis on several large-scale datasets, underscoring the importance of non-linear parameterizations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hotti24a.html
https://proceedings.mlr.press/v238/hotti24a.htmlThe AL$\ell_0$CORE Tensor Decomposition for Sparse Count DataThis paper introduces AL$\ell_0$CORE, a new form of probabilistic non-negative tensor decomposition. AL$\ell_0$CORE is a Tucker decomposition that constrains the number of non-zero elements (i.e., the $\ell_0$-norm) of the core tensor to be at most $Q$. While the user dictates the total budget $Q$, the locations and values of the non-zero elements are latent variables allocated across the core tensor during inference. AL$\ell_0$CORE—i.e., allocated $\ell_0$-constrained core—thus enjoys both the computational tractability of canonical polyadic (CP) decomposition and the qualitatively appealing latent structure of Tucker. In a suite of real-data experiments, we demonstrate that AL$\ell_0$CORE typically requires only tiny fractions (e.g., 1%) of the core to achieve the same results as Tucker at a correspondingly small fraction of the cost.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hood24a.html
https://proceedings.mlr.press/v238/hood24a.htmlA Primal-Dual-Critic Algorithm for Offline Constrained Reinforcement LearningOffline constrained reinforcement learning (RL) aims to learn a policy that maximizes the expected cumulative reward subject to constraints on expected cumulative cost using an existing dataset. In this paper, we propose Primal-Dual-Critic Algorithm (PDCA), a novel algorithm for offline constrained RL with general function approximation. PDCA runs a primal-dual algorithm on the Lagrangian function estimated by critics. The primal player employs a no-regret policy optimization oracle to maximize the Lagrangian estimate and the dual player acts greedily to minimize the Lagrangian estimate. We show that PDCA finds a near saddle point of the Lagrangian, which is nearly optimal for the constrained RL problem. Unlike previous work that requires concentrability and a strong Bellman completeness assumption, PDCA only requires concentrability and realizability assumptions for sample-efficient learning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hong24a.html
https://proceedings.mlr.press/v238/hong24a.htmlRobust variance-regularized risk minimization with concomitant scalingUnder losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/holland24a.html
https://proceedings.mlr.press/v238/holland24a.htmlMIM-Reasoner: Learning with Theoretical Guarantees for Multiplex Influence MaximizationMultiplex influence maximization (MIM) asks us to identify a set of seed users such as to maximize the expected number of influenced users in a multiplex network. MIM has been one of central research topics, especially in nowadays social networking landscape where users participate in multiple online social networks (OSNs) and their influences can propagate among several OSNs simultaneously. Although there exist a couple combinatorial algorithms to MIM, learning-based solutions have been desired due to its generalization ability to heterogeneous networks and their diversified propagation characteristics. In this paper, we introduce MIM-Reasoner, coupling reinforcement learning with probabilistic graphical model, which effectively captures the complex propagation process within and between layers of a given multiplex network, thereby tackling the most challenging problem in MIM. We establish a theoretical guarantee for MIM-Reasoner as well as conduct extensive analyses on both synthetic and real-world datasets to validate our MIM-Reasoner’s performance.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hoang-khoi-do24a.html
https://proceedings.mlr.press/v238/hoang-khoi-do24a.htmlBoundary-Aware Uncertainty for Feature Attribution ExplainersPost-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications. However, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work we propose the Gaussian Process Explanation unCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with explanation function approximation uncertainty. We introduce a novel geodesic-based kernel, which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with decision boundary complexity. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hill24a.html
https://proceedings.mlr.press/v238/hill24a.htmlAdaptive Discretization for Event PredicTion (ADEPT)Recently developed survival analysis methods improve upon existing approaches by predicting the probability of event occurrence in each of a number pre-specified (discrete) time intervals. By avoiding placing strong parametric assumptions on the event density, this approach tends to improve prediction performance, particularly when data are plentiful. However, in clinical settings with limited available data, it is often preferable to judiciously partition the event time space into a limited number of intervals well suited to the prediction task at hand. In this work, we develop Adaptive Discretization for Event PredicTion (ADEPT) to learn from data a set of cut points defining such a partition. We show that in two simulated datasets, we are able to recover intervals that match the underlying generative model. We then demonstrate improved prediction performance on three real-world observational datasets, including a large, newly harmonized stroke risk prediction dataset. Finally, we argue that our approach facilitates clinical decision-making by suggesting time intervals that are most appropriate for each task, in the sense that they facilitate more accurate risk prediction.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hickey24a.html
https://proceedings.mlr.press/v238/hickey24a.htmlSample Efficient Learning of Factored Embeddings of Tensor FieldsData tensors of orders 2 and greater are now routinely being generated. These data collections are increasingly huge and growing. Many scientific and medical data tensors are tensor fields (e.g., images, videos, geographic data) in which the spatial neighborhood contains important information. Directly accessing such large data tensor collections for information has become increasingly prohibitive. We learn approximate full-rank and compact tensor sketches with decompositive representations providing compact space, time and spectral embeddings of tensor fields. All information querying and post-processing on the original tensor field can now be achieved more efficiently and with customizable accuracy as they are performed on these compact factored sketches in latent generative space. We produce optimal rank-r sketchy Tucker decomposition of arbitrary order data tensors by building compact factor matrices from a sample-efficient sub-sampling of tensor slices. Our sample efficient policy is learned via an adaptable stochastic Thompson sampling using Dirichlet distributions with conjugate priors.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/heo24a.html
https://proceedings.mlr.press/v238/heo24a.htmlThe Relative Gaussian Mechanism and its Application to Private Gradient DescentThe Gaussian Mechanism (GM), which consists in adding Gaussian noise to a vector-valued query before releasing it, is a standard privacy protection mechanism. In particular, given that the query respects some L2 sensitivity property (the L2 distance between outputs on any two neighboring inputs is bounded), GM guarantees Rényi Differential Privacy (RDP). Unfortunately, precisely bounding the L2 sensitivity can be hard, thus leading to loose privacy bounds. In this work, we consider a Relative L2 sensitivity assumption, in which the bound on the distance between two query outputs may also depend on their norm. Leveraging this assumption, we introduce the Relative Gaussian Mechanism (RGM), in which the variance of the noise depends on the norm of the output. We prove tight bounds on the RDP parameters under relative L2 sensitivity, and characterize the privacy loss incurred by using output-dependent noise. In particular, we show that RGM naturally adapts to a latent variable that would control the norm of the output. Finally, we instantiate our framework to show tight guarantees for Private Gradient Descent, a problem that naturally fits our relative L2 sensitivity assumption.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hendrikx24a.html
https://proceedings.mlr.press/v238/hendrikx24a.htmlComparing Comparators in Generalization BoundsWe derive generic information-theoretic and PAC-Bayesian generalization bounds involving an arbitrary convex comparator function, which measures the discrepancy between the training loss and the population loss. The bounds hold under the assumption that the cumulant-generating function (CGF) of the comparator is upper-bounded by the corresponding CGF within a family of bounding distributions. We show that the tightest possible bound is obtained with the comparator being the convex conjugate of the CGF of the bounding distribution, also known as the Cramér function. This conclusion applies more broadly to generalization bounds with a similar structure. This confirms the near-optimality of known bounds for bounded and sub-Gaussian losses and leads to novel bounds under other bounding distributions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hellstrom24a.html
https://proceedings.mlr.press/v238/hellstrom24a.htmlA 4-Approximation Algorithm for Min Max Correlation ClusteringWe introduce a lower bounding technique for the min max correlation clustering problem and, based on this technique, a combinatorial 4-approximation algorithm for complete graphs. This improves upon the previous best known approximation guarantees of 5, using a linear program formulation (Kalhan et al., 2019), and 40, for a combinatorial algorithm (Davies et al., 2023). We extend this algorithm by a greedy joining heuristic and show empirically that it improves the state of the art in solution quality and runtime on several benchmark datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/heidrich24a.html
https://proceedings.mlr.press/v238/heidrich24a.htmlAdaptive Parametric Prototype Learning for Cross-Domain Few-Shot ClassificationCross-domain few-shot classification induces a much more challenging problem than its in-domain counterpart due to the existence of domain shifts between the training and test tasks. In this paper, we develop a novel Adaptive Parametric Prototype Learning (APPL) method under the meta-learning convention for cross-domain few-shot classification. Different from existing prototypical few-shot methods that use the averages of support instances to calculate the class prototypes, we propose to learn class prototypes from the concatenated features of the support set in a parametric fashion and meta-learn the model by enforcing prototype-based regularization on the query set. In addition, we fine-tune the model in the target domain in a transductive manner using a weighted-moving-average self-training approach on the query instances. We conduct experiments on multiple cross-domain few-shot benchmark datasets. The empirical results demonstrate that APPL yields superior performance to many state-of-the-art cross-domain few-shot learning methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/heidari24a.html
https://proceedings.mlr.press/v238/heidari24a.htmlCompression with Exact Error Distribution for Federated LearningCompression schemes have been extensively used in Federated Learning (FL) to reduce the communication cost of distributed learning. While most approaches rely on a bounded variance assumption of the noise produced by the compressor, this paper investigates the use of compression and aggregation schemes that produce a specific error distribution, e.g., Gaussian or Laplace, on the aggregated data. We present and analyze different aggregation schemes based on layered quantizers achieving exact error distribution. We provide different methods to leverage the proposed compression schemes to obtain compression-for-free in differential privacy applications. Our general compression methods can recover and improve standard FL schemes with Gaussian perturbations such as Langevin dynamics and randomized smoothing.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hegazy24a.html
https://proceedings.mlr.press/v238/hegazy24a.htmlTransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional RegressionThe main challenge that sets transfer learning apart from traditional supervised learning is the distribution shift, reflected as the shift between the source and target models and that between the marginal covariate distributions. In this work, we tackle model shifts in the presence of covariate shifts in the high-dimensional regression setting. Specifically, we propose a two-step method with a novel fused regularizer that effectively leverages samples from source tasks to improve the learning performance on a target task with limited samples. Nonasymptotic bound is provided for the estimation error of the target model, showing the robustness of the proposed method to covariate shifts. We further establish conditions under which the estimator is minimax-optimal. Additionally, we extend the method to a distributed setting, allowing for a pretraining-finetuning strategy, requiring just one round of communication while retaining the estimation rate of the centralized version. Numerical tests validate our theory, highlighting the method’s robustness to covariate shifts.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/he24a.html
https://proceedings.mlr.press/v238/he24a.htmlEstimating treatment effects from single-arm trials via latent-variable modelingRandomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for {\em (i)} patient matching if treatment outcomes are not available for the treatment group, or for {\em (ii)} direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/haussmann24a.html
https://proceedings.mlr.press/v238/haussmann24a.htmlQuantifying Uncertainty in Natural Language Explanations of Large Language ModelsLarge Language Models (LLMs) are increasingly used as powerful tools for several high-stakes natural language processing (NLP) applications. Recent prompting works claim to elicit intermediate reasoning steps and key tokens that serve as proxy explanations for LLM predictions. However, there is no certainty whether these explanations are reliable and reflect the LLM’s behavior. In this work, we make one of the first attempts at quantifying the uncertainty in explanations of LLMs. To this end, we propose two novel metrics — Verbalized Uncertainty and Probing Uncertainty — to quantify the uncertainty of generated explanations. While verbalized uncertainty involves prompting the LLM to express its confidence in its explanations, probing uncertainty leverages sample and model perturbations as a means to quantify the uncertainty. Our empirical analysis of benchmark datasets reveals that verbalized uncertainty is not a reliable estimate of explanation confidence. Further, we show that the probing uncertainty estimates are correlated with the faithfulness of an explanation, with lower uncertainty corresponding to explanations with higher faithfulness. Our study provides insights into the challenges and opportunities of quantifying uncertainty in LLM explanations, contributing to the broader discussion of the trustworthiness of foundation models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/harsha-tanneru24a.html
https://proceedings.mlr.press/v238/harsha-tanneru24a.htmlImplicit Regularization in Deep Tucker Factorization: Low-Rankness via Structured SparsityWe theoretically analyze the implicit regularization of deep learning for tensor completion. We show that deep Tucker factorization trained by gradient descent induces a structured sparse regularization. This leads to a characterization of the effect of the depth of the neural network on the implicit regularization and provides a potential explanation for the bias of gradient descent towards solutions with low multilinear rank. Numerical experiments confirm our theoretical findings and give insights into the behavior of gradient descent in deep tensor factorization.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hariz24a.html
https://proceedings.mlr.press/v238/hariz24a.htmlMulti-Agent Bandit Learning through Heterogeneous Action Erasure ChannelsMulti-Armed Bandit (MAB) systems are witnessing an upswing in applications within multi-agent distributed environments, leading to the advancement of collaborative MAB algorithms. In such settings, communication between agents executing actions and the primary learner making decisions can hinder the learning process. A prevalent challenge in distributed learning is action erasure, often induced by communication delays and/or channel noise. This results in agents possibly not receiving the intended action from the learner, subsequently leading to misguided feedback. In this paper, we introduce novel algorithms that enable learners to interact concurrently with distributed agents across heterogeneous action erasure channels with different action erasure probabilities. We illustrate that, in contrast to existing bandit algorithms, which experience linear regret, our algorithms assure sub-linear regret guarantees. Our proposed solutions are founded on a meticulously crafted repetition protocol and scheduling of learning across heterogeneous channels. To our knowledge, these are the first algorithms capable of effectively learning through heterogeneous action erasure channels. We substantiate the superior performance of our algorithm through numerical experiments, emphasizing their practical significance in addressing issues related to communication constraints and delays in multi-agent environments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hanna24a.html
https://proceedings.mlr.press/v238/hanna24a.htmlP-tensors: a General Framework for Higher Order Message Passing in Subgraph Neural NetworksSeveral recent papers have proposed increasing the expressiveness of graph neural networks by exploiting subgraphs or other topological structures. In parallel, researchers have investigated higher order permutation equivariant networks. In this paper we tie these two threads together by providing a general framework for higher order permutation equivariant message passing in subgraph neural networks. Our exposition hinges on so-called $P$-tensors, which provide a simple way to define the most general form of permutation equivariant message passing in this category of networks. We show that this paradigm can achieve state-of-the-art performance on benchmark molecular datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/hands24a.html
https://proceedings.mlr.press/v238/hands24a.htmlConformalized Semi-supervised Random Forest for Classification and Abnormality DetectionThe Random Forests classifier, a widely utilized off-the-shelf classification tool, assumes training and test samples come from the same distribution as other standard classifiers. However, in safety-critical scenarios like medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples not appearing during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and flag unseen outliers by generating an empty set. Theoretically, we establish CSForest to cover true labels for previously observed inlier classes under arbitrarily label-shift in the test data. We compare CSForest with state-of-the-art methods using synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest’s effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Codes of CSForest are available at https://github.com/yujinhan98/CSForest.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/han24b.html
https://proceedings.mlr.press/v238/han24b.htmlRobust SVD Made Easy: A fast and reliable algorithm for large-scale data analysisThe singular value decomposition (SVD) is a crucial tool in machine learning and statistical data analysis. However, it is highly susceptible to outliers in the data matrix. Existing robust SVD algorithms often sacrifice speed for robustness or fail in the presence of only a few outliers. This study introduces an efficient algorithm, called Spherically Normalized SVD, for robust SVD approximation that is highly insensitive to outliers, computationally scalable, and provides accurate approximations of singular vectors. The proposed algorithm achieves remarkable speed by utilizing only two applications of a standard reduced-rank SVD algorithm to appropriately scaled data, significantly outperforming competing algorithms in computation times. To assess the robustness of the approximated singular vectors and their subspaces against data contamination, we introduce new notions of breakdown points for matrix-valued input, including row-wise, column-wise, and block-wise breakdown points. Theoretical and empirical analyses demonstrate that our algorithm exhibits higher breakdown points compared to standard SVD and its modifications. We empirically validate the effectiveness of our approach in applications such as robust low-rank approximation and robust principal component analysis of high-dimensional microarray datasets. Overall, our study presents a highly efficient and robust solution for SVD approximation that overcomes the limitations of existing algorithms in the presence of outliers.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/han24a.html
https://proceedings.mlr.press/v238/han24a.htmlIdentifiable Feature Learning for Spatial Data with Nonlinear ICARecently, nonlinear ICA has surfaced as a popular alternative to the many heuristic models used in deep representation learning and disentanglement. An advantage of nonlinear ICA is that a sophisticated identifiability theory has been developed; in particular, it has been proven that the original components can be recovered under sufficiently strong latent dependencies. Despite this general theory, practical nonlinear ICA algorithms have so far been mainly limited to data with one-dimensional latent dependencies, especially time-series data. In this paper, we introduce a new nonlinear ICA framework that employs $t$-process (TP) latent components which apply naturally to data with higher-dimensional dependency structures, such as spatial and spatio-temporal data. In particular, we develop a new learning and inference algorithm that extends variational inference methods to handle the combination of a deep neural network mixing function with the TP prior, and employs the method of inducing points for computational efficacy. On the theoretical side, we show that such TP independent components are identifiable under very general conditions. Further, Gaussian Process (GP) nonlinear ICA is established as a limit of the TP Nonlinear ICA model, and we prove that the identifiability of the latent components at this GP limit is more restricted. Namely, those components are identifiable if and only if they have distinctly different covariance kernels. Our algorithm and identifiability theorems are explored on simulated spatial data and real world spatio-temporal data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/halva24a.html
https://proceedings.mlr.press/v238/halva24a.htmlEffect of Ambient-Intrinsic Dimension Gap on Adversarial VulnerabilityThe existence of adversarial attacks on machine learning models imperceptible to a human is still quite a mystery from a theoretical perspective. In this work, we introduce two notions of adversarial attacks: natural or on-manifold attacks, which are perceptible by a human/oracle, and unnatural or off-manifold attacks, which are not. We argue that the existence of the off-manifold attacks is a natural consequence of the dimension gap between the intrinsic and ambient dimensions of the data. For 2-layer ReLU networks, we prove that even though the dimension gap does not affect generalization performance on samples drawn from the observed data space, it makes the clean-trained model more vulnerable to adversarial perturbations in the off-manifold direction of the data space. Our main results provide an explicit relationship between the $\ell_2,\ell_{\infty}$ attack strength of the on/off-manifold attack and the dimension gap.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/haldar24a.html
https://proceedings.mlr.press/v238/haldar24a.htmlBures-Wasserstein Means of GraphsFinding the mean of sampled data is a fundamental task in machine learning and statistics. However, in cases where the data samples are graph objects, defining a mean is an inherently difficult task. We propose a novel framework for defining a graph mean via embeddings in the space of smooth graph signal distributions, where graph similarity can be measured using the Wasserstein metric. By finding a mean in this embedding space, we can recover a mean graph that preserves structural information. We establish the existence and uniqueness of the novel graph mean, and provide an iterative algorithm for computing it. To highlight the potential of our framework as a valuable tool for practical applications in machine learning, it is evaluated on various tasks, including k-means clustering of structured aligned graphs, classification of functional brain networks, and semi-supervised node classification in multi-layer graphs. Our experimental results demonstrate that our approach achieves consistent performance, outperforms existing baseline approaches, and improves the performance of state-of-the-art methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/haasler24a.html
https://proceedings.mlr.press/v238/haasler24a.htmlEuclidean, Projective, Conformal: Choosing a Geometric Algebra for Equivariant TransformersThe Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/haan24a.html
https://proceedings.mlr.press/v238/haan24a.htmlClustering Items From Adaptively Collected Inconsistent FeedbackWe study clustering in a query-based model where the learner can repeatedly query an oracle to determine if two items belong to the same cluster. However, these queries are costly and the oracle’s responses are marred by inconsistency and noise. The learner’s goal is to adaptively make a small number of queries and return the correct clusters for \emph{all} $n$ items with high confidence. We develop efficient algorithms for this problem using the sequential hypothesis testing framework. We derive high probability upper bounds on their sample complexity (the number of queries they make) and complement this analysis with an information-theoretic lower bound. In particular, we show that our algorithm for two clusters is nearly optimal when the oracle’s error probability is a constant. Our experiments verify these findings and highlight a few shortcomings of our algorithms. Namely, we show that their sample complexity deviates from the lower bound when the error probability of the oracle depends on $n$. We suggest an improvement based on a more efficient sequential hypothesis test and demonstrate it empirically.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gupta24a.html
https://proceedings.mlr.press/v238/gupta24a.htmlThe effect of Leaky ReLUs on the training and generalization of overparameterized networksWe investigate the training and generalization errors of overparameterized neural networks (NNs) with a wide class of leaky rectified linear unit (ReLU) functions. More specifically, we carefully upper bound both the convergence rate of the training error and the generalization error of such NNs and investigate the dependence of these bounds on the Leaky ReLU parameter, $\alpha$. We show that $\alpha =-1$, which corresponds to the absolute value activation function, is optimal for the training error bound. Furthermore, in special settings, it is also optimal for the generalization error bound. Numerical experiments empirically support the practical choices guided by the theory.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/guo24c.html
https://proceedings.mlr.press/v238/guo24c.htmlOnline learning in bandits with predicted contextWe consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/guo24b.html
https://proceedings.mlr.press/v238/guo24b.htmlA White-Box False Positive Adversarial Attack Method on Contrastive Loss Based Offline Handwritten Signature Verification ModelsIn this paper, we tackle the challenge of white-box false positive adversarial attacks on contrastive loss based offline handwritten signature verification models. We propose a novel attack method that treats the attack as a style transfer between closely related but distinct writing styles. To guide the generation of deceptive images, we introduce two new loss functions that enhance the attack success rate by perturbing the Euclidean distance between the embedding vectors of the original and synthesized samples, while ensuring minimal perturbations by reducing the difference between the generated image and the original image. Our method demonstrates state-of-the-art performance in white-box attacks on contrastive loss based offline handwritten signature verification models, as evidenced by our experiments. The key contributions of this paper include a novel false positive attack method, two new loss functions, effective style transfer in handwriting styles, and superior performance in white-box false positive attacks compared to other white-box attack methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/guo24a.html
https://proceedings.mlr.press/v238/guo24a.htmlRobust Non-linear Normalization of Heterogeneous Feature Distributions with Adaptive Tanh-EstimatorsFeature normalization is a crucial step in machine learning that scales numerical values to improve model effectiveness. Noisy or impure datasets can pose a challenge for traditional normalization methods as they may contain outliers that violate statistical assumptions, leading to reduced model performance and increased unpredictability. Non-linear Tanh-Estimators (TE) have been found to provide robust feature normalization, but their fixed scaling factor may not be appropriate for all distributions of feature values. This work presents a refinement to the TE that employs the Wasserstein distance to adaptively estimate the optimal scaling factor for each feature individually against a specified target distribution. The results demonstrate that this adaptive approach can outperform the current TE method in the literature in terms of convergence speed by enabling better initial training starts, thus reducing or eliminating the need to re-adjust model weights during early training phases due to inadequately scaled features. Empirical evaluation was done on synthetic data, standard toy computer vision datasets, and a real-world numeric tabular dataset.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/guimera-cuevas24a.html
https://proceedings.mlr.press/v238/guimera-cuevas24a.htmlAdaptive importance sampling for heavy-tailed distributions via $α$-divergence minimizationAdaptive importance sampling (AIS) algorithms are widely used to approximate expectations with respect to complicated target probability distributions. When the target has heavy tails, existing AIS algorithms can provide inconsistent estimators or exhibit slow convergence, as they often neglect the target’s tail behaviour. To avoid this pitfall, we propose an AIS algorithm that approximates the target by Student-t proposal distributions. We adapt location and scale parameters by matching the escort moments - which are defined even for heavy-tailed distributions - of the target and proposal. These updates minimize the $\alpha$-divergence between the target and the proposal, thereby connecting with variational inference. We then show that the $\alpha$-divergence can be approximated by a generalized notion of effective sample size and leverage this new perspective to adapt the tail parameter with Bayesian optimization. We demonstrate the efficacy of our approach through applications to synthetic targets and a Bayesian Student-t regression task on a real example with clinical trial data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/guilmeau24a.html
https://proceedings.mlr.press/v238/guilmeau24a.htmlStochastic Approximation with Biased MCMC for Expectation MaximizationThe expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these results are restricted to the case where asymptotically unbiased MCMC algorithms are used. In practice, MCMC-SAEM is often run with asymptotically biased MCMC, for which the consequences are theoretically less understood. In this work, we fill this gap by analyzing the asymptotics and non-asymptotics of SAEM with biased MCMC steps, particularly the effect of bias. We also provide numerical experiments comparing the Metropolis-adjusted Langevin algorithm (MALA), which is asymptotically unbiased, and the unadjusted Langevin algorithm (ULA), which is asymptotically biased, on synthetic and real datasets. Experimental results show that ULA is more stable with respect to the choice of Langevin stepsize and can sometimes result in faster convergence.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gruffaz24a.html
https://proceedings.mlr.press/v238/gruffaz24a.htmlFunctional Graphical Models: Structure Enables Offline Data-Driven OptimizationWhile machine learning models are typically trained to solve prediction problems, we might often want to use them for optimization problems. For example, given a dataset of proteins and their corresponding fluorescence levels, we might want to optimize for a new protein with the highest possible fluorescence. This kind of data-driven optimization (DDO) presents a range of challenges beyond those in standard prediction problems, since we need models that successfully predict the performance of new designs that are better than the best designs seen in the training set. It is not clear theoretically when existing approaches can even perform better than the na ïve approach that simply selects the best design in the dataset. In this paper, we study how structure can enable sample-efficient data-driven optimization. To formalize the notion of structure, we introduce functional graphical models (FGMs) and show theoretically how they can provide for principled data-driven optimization by decomposing the original high-dimensional optimization problem into smaller sub-problems. This allows us to derive much more practical regret bounds for DDO, and the result implies that DDO with FGMs can achieve nearly optimal designs in situations where naïve approaches fail due to insufficient coverage of the offline data. We further present a data-driven optimization algorithm that inferes the FGM structure itself, either over the original input variables or a latent variable representation of the inputs.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/grudzien24a.html
https://proceedings.mlr.press/v238/grudzien24a.htmlA Greedy Approximation for k-Determinantal Point ProcessesDeterminantal point processes (DPPs) are an important concept in random matrix theory and combinatorics, and increasingly in machine learning. Samples from these processes exhibit a form of self-avoidance, so they are also helpful in guiding algorithms that explore to reduce uncertainty, such as in active learning, Bayesian optimization, reinforcement learning, and marginalization in graphical models. The best-known algorithms for sampling from DPPs exactly require significant computational expense, which can be unwelcome in machine learning applications when the cost of sampling is relatively low and capturing the precise repulsive nature of the DPP may not be critical. We suggest an inexpensive approximate strategy for sampling a fixed number of points (as would typically be desired in a machine learning setting) from a so-called $k$-DPP based on iterative inverse transform sampling. We prove that our algorithm satisfies a $(1 - 1/\epsilon)$ approximation guarantee relative to exact sampling from the $k$-DPP, and provide an efficient implementation for many common kernels used in machine learning, including the Gaussian and Matérn class. Finally, we compare the empirical runtime of our method to exact and Markov-Chain-Monte-Carlo (MCMC) samplers and investigate the approximation quality in a Bayesian Quadrature (BQ) setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/grosse24a.html
https://proceedings.mlr.press/v238/grosse24a.htmlOptimal Zero-Shot Detector for Multi-Armed AttacksThis research delves into a scenario where a malicious actor can manipulate data samples using a multi-armed attack strategy, providing them with multiple ways to introduce noise into the data sample. Our central objective is to protect the data by detecting any alterations to the input. We approach this defensive strategy with utmost caution, operating in an environment where the defender possesses significantly less information compared to the attacker. Specifically, the defender is unable to utilize any data samples for training a defense model or verifying the integrity of the channel. Instead, the defender relies exclusively on a set of pre-existing detectors readily available "off the shelf." To tackle this challenge, we derive an innovative information-theoretic defense approach that optimally aggregates the decisions made by these detectors, eliminating the need for any training data. We further explore a practical use-case scenario for empirical evaluation, where the attacker possesses a pre-trained classifier and launches well-known adversarial attacks against it. Our experiments highlight the effectiveness of our proposed solution, even in scenarios that deviate from the optimal setup.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/granese24a.html
https://proceedings.mlr.press/v238/granese24a.htmlTowards Practical Non-Adversarial Distribution MatchingDistribution matching can be used to learn invariant representations with applications in fairness and robustness. Most prior works resort to adversarial matching methods but the resulting minimax problems are unstable and challenging to optimize. Non-adversarial likelihood-based approaches either require model invertibility, impose constraints on the latent prior, or lack a generic framework for distribution matching. To overcome these limitations, we propose a non-adversarial VAE-based matching method that can be applied to any model pipeline. We develop a set of alignment upper bounds for distribution matching (including a noisy bound) that have VAE-like objectives but with a different perspective. We carefully compare our method to prior VAE-based matching approaches both theoretically and empirically. Finally, we demonstrate that our novel matching losses can replace adversarial losses in standard invariant representation learning pipelines without modifying the original architectures—thereby significantly broadening the applicability of non-adversarial matching methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gong24b.html
https://proceedings.mlr.press/v238/gong24b.htmlTesting Generated Distributions in GANs to Penalize Mode CollapseMode collapse remains the primary unresolved challenge within generative adversarial networks (GANs). In this work, we introduce an innovative approach that supplements the discriminator by additionally enforcing the similarity between the generated and real distributions. We implement a one-sample test on the generated samples and employ the resulting test statistic to penalize deviations from the real distribution. Our method encompasses a practical strategy to estimate distributions, compute the test statistic via a differentiable function, and seamlessly incorporate test outcomes into the training objective. Crucially, our approach preserves the convergence and theoretical integrity of GANs, as the introduced constraint represents a requisite condition for optimizing the generator training objective. Notably, our method circumvents reliance on regularization or network modules, enhancing compatibility and facilitating its practical application. Empirical evaluations on diverse public datasets validate the efficacy of our proposed approach.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gong24a.html
https://proceedings.mlr.press/v238/gong24a.htmlThe Effective Number of Shared Dimensions Between Paired DatasetsA number of recent studies have sought to understand the behavior of both artificial and biological neural networks by comparing representations across layers, networks and brain areas. Increasingly prevalent, too, are comparisons across modalities of data, such as neural network activations and training data or behavioral data and neurophysiological recordings. One approach to such comparisons involves measuring the dimensionality of the space shared between the paired data matrices, where dimensionality serves as a proxy for computational or representational complexity. Established approaches, including CCA, can be used to measure the number of shared embedding dimensions, however they do not account for potentially unequal variance along shared dimensions and so cannot measure effective shared dimensionality. We present a candidate measure for shared dimensionality that we call the effective number of shared dimensions (ENSD). The ENSD is an interpretable and computationally efficient model-free measure of shared dimensionality that can be used to probe shared structure in a wide variety of data types. We demonstrate the relative robustness of the ENSD in cases where data is sparse or low rank and illustrate how the ENSD can be applied in a variety of analyses of representational similarities across layers in convolutional neural networks and between brain regions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/giaffar24a.html
https://proceedings.mlr.press/v238/giaffar24a.htmlSample-efficient neural likelihood-free Bayesian inference of implicit HMMsLikelihood-free inference methods based on neural conditional density estimation were shown to drastically reduce the simulation burden in comparison to classical methods such as ABC. When applied in the context of any latent variable model, such as a Hidden Markov model (HMM), these methods are designed to only estimate the parameters, rather than the joint distribution of the parameters and the hidden states. Naive application of these methods to a HMM, ignoring the inference of this joint posterior distribution, will thus produce an inaccurate estimate of the posterior predictive distribution, in turn hampering the assessment of goodness-of-fit. To rectify this problem, we propose a novel, sample-efficient likelihood-free method for estimating the high-dimensional hidden states of an implicit HMM. Our approach relies on learning directly the intractable posterior distribution of the hidden states, using an autoregressive-flow, by exploiting the Markov property. Upon evaluating our approach on some implicit HMMs, we found that the quality of the estimates retrieved using our method is comparable to what can be achieved using a much more computationally expensive SMC algorithm.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ghosh24b.html
https://proceedings.mlr.press/v238/ghosh24b.htmlTowards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RLWe study the constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on \emph{soft} constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger \emph{hard} constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the \emph{first model-free}, \emph{simulator-free} algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in \emph{large-scale} systems. In particular, we show that $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ hard constraint violation bounds can be achieved, where $K$ is the number of episodes, $d$ is the dimension of the feature mapping, $H$ is the length of the episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm, i.e., it searches for the dual variable that balances between regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ghosh24a.html
https://proceedings.mlr.press/v238/ghosh24a.htmlA General Theoretical Paradigm to Understand Learning from Human PreferencesThe prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation DPO has been proposed as an approach that bypasses the second approximation and learn directly a policy from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called ${\Psi}$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of ${\Psi}$PO) and to identify their potential pitfalls. We then consider another special case for ${\Psi}$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gheshlaghi-azar24a.html
https://proceedings.mlr.press/v238/gheshlaghi-azar24a.htmlTransductive conformal inference with adaptive scoresConformal inference is a fundamental and versatile tool that provides distribution-free guarantees for many machine learning tasks. We consider the transductive setting, where decisions are made on a test sample of $m$ new points, giving rise to $m$ conformal $p$-values. While classical results only concern their marginal distribution, we show that their joint distribution follows a Pólya urn model, and establish a concentration inequality for their empirical distribution function. The results hold for arbitrary exchangeable scores, including adaptive ones that can use the covariates of the test${+}$calibration samples at training stage for increased accuracy. We demonstrate the usefulness of these theoretical results through uniform, in-probability guarantees for two machine learning tasks of current interest: interval prediction for transductive transfer learning and novelty detection based on two-class classification.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gazin24a.html
https://proceedings.mlr.press/v238/gazin24a.htmlSoft-constrained Schrödinger Bridge: a Stochastic Control ApproachSchrödinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process whose terminal distribution coincides with a pre-specified target distribution. We propose to generalize this problem by allowing the terminal distribution to differ from the target but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schrödinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of the target and some other distribution. This result is further extended to a time series setting. One application is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/garg24a.html
https://proceedings.mlr.press/v238/garg24a.htmlHow does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic InterpretabilityTransformer-based language models are treated as black-boxes because of their large number of parameters and complex internal interactions, which is a serious safety concern. Mechanistic Interpretability (MI) intends to reverse-engineer neural network behaviors in terms of human-understandable components. In this work, we focus on understanding how GPT-2 Small performs the task of predicting three-letter acronyms. Previous works in the MI field have focused so far on tasks that predict a single token. To the best of our knowledge, this is the first work that tries to mechanistically understand a behavior involving the prediction of multiple consecutive tokens. We discover that the prediction is performed by a circuit composed of 8 attention heads (${\sim}5%$ of the total heads) which we classified in three groups according to their role. We also demonstrate that these heads concentrate the acronym prediction functionality. In addition, we mechanistically interpret the most relevant heads of the circuit and find out that they use positional information which is propagated via the causal mask mechanism. We expect this work to lay the foundation for understanding more complex behaviors involving multiple-token predictions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/garcia-carrasco24a.html
https://proceedings.mlr.press/v238/garcia-carrasco24a.htmlDecentralized Multi-Level Compositional Optimization Algorithms with Level-Independent Convergence RateStochastic multi-level compositional optimization problems cover many new machine learning paradigms, e.g., multi-step model-agnostic meta-learning, which require efficient optimization algorithms for large-scale data. This paper studies the decentralized stochastic multi-level optimization algorithm, which is challenging because the multi-level structure and decentralized communication scheme may make the number of levels significantly affect the order of the convergence rate. To this end, we develop two novel decentralized optimization algorithms to optimize the multi-level compositional optimization problem. Our theoretical results show that both algorithms can achieve the level-independent convergence rate for nonconvex problems under much milder conditions compared with existing single-machine algorithms. To the best of our knowledge, this is the first work that achieves the level-independent convergence rate under the decentralized setting. Moreover, extensive experiments confirm the efficacy of our proposed algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gao24b.html
https://proceedings.mlr.press/v238/gao24b.htmlFusing Individualized Treatment Rules Using Secondary OutcomesAn individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gao24a.html
https://proceedings.mlr.press/v238/gao24a.htmlContextual Bandits with Budgeted Information RevealContextual bandit algorithms are commonly used in digital health to recommend personalized treatments. However, to ensure the effectiveness of the treatments, patients are often requested to take actions that have no immediate benefit to them, which we refer to as pro-treatment actions. In practice, clinicians have a limited budget to encourage patients to take these actions and collect additional information. We introduce a novel optimization and learning algorithm to address this problem. This algorithm effectively combines the strengths of two algorithmic approaches in a seamless manner, including 1) an online primal-dual algorithm for deciding the optimal timing to reach out to patients, and 2) a contextual bandit learning algorithm to deliver personalized treatment to the patient. We prove that this algorithm admits a sub-linear regret bound. We illustrate the usefulness of this algorithm on both synthetic and real-world data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gan24a.html
https://proceedings.mlr.press/v238/gan24a.htmlProbabilistic Integral CircuitsContinuous latent variables (LVs) are a key ingredient of many generative models, as they allow modelling expressive mixtures with an uncountable number of components. In contrast, probabilistic circuits (PCs) are hierarchical discrete mixtures represented as computational graphs composed of input, sum and product units. Unlike continuous LV models, PCs provide tractable inference but are limited to discrete LVs with categorical (i.e. unordered) states. We bridge these model classes by introducing probabilistic integral circuits (PICs), a new language of computational graphs that extends PCs with integral units representing continuous LVs. In the first place, PICs are symbolic computational graphs and are fully tractable in simple cases where analytical integration is possible. In practice, we parameterise PICs with light-weight neural nets delivering an intractable hierarchical continuous mixture that can be approximated arbitrarily well with large PCs using numerical quadrature. On several distribution estimation benchmarks, we show that such PIC-approximating PCs systematically outperform PCs commonly learned via expectation-maximization or SGD.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gala24a.html
https://proceedings.mlr.press/v238/gala24a.htmlOffline Primal-Dual Reinforcement Learning for Linear MDPsOffline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy. This problem has attracted a lot of attention recently, but most existing methods with strong theoretical guarantees are restricted to finite-horizon or tabular settings. In contrast, few algorithms for infinite-horizon settings with function approximation and minimal assumptions on the dataset are both sample and computationally efficient. Another gap in the current literature is the lack of theoretical analysis for the average-reward setting, which is more challenging than the discounted setting. In this paper, we address both of these issues by proposing a primal-dual optimization method based on the linear programming formulation of RL. Our key contribution is a new reparametrization that allows us to derive low-variance gradient estimators that can be used in a stochastic optimization scheme using only samples from the behavior policy. Our method finds an $\varepsilon$-optimal policy with $O(\varepsilon^{-4})$ samples, while being computationally efficient for infinite-horizon discounted and average-reward MDPs with realizable linear function approximation and partial coverage. Moreover, to the best of our knowledge, this is the first theoretical result for average-reward offline RL.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/gabbianelli24a.html
https://proceedings.mlr.press/v238/gabbianelli24a.htmlInformation-theoretic Analysis of Bayesian Test Data SensitivityBayesian inference is often used to quantify uncertainty. Several recent analyses have rigorously decomposed uncertainty in prediction by Bayesian inference into two types: the inherent randomness in the data generation process and the variability due to lack of data respectively. Existing studies have analyzed these uncertainties from an information-theoretic perspective, assuming the model is well-specified and treating the model parameters as latent variables. However, such information-theoretic uncertainty analysis fails to account for a widely believed property of uncertainty known as sensitivity between test and training data. This means that if the test data is similar to the training data in some sense, the uncertainty will be smaller. In this study, we study such sensitivity using a new decomposition of uncertainty. Our analysis successfully defines such sensitivity using information-theoretic quantities. Furthermore, we extend the existing analysis of Bayesian meta-learning and show the novel sensitivities among tasks for the first time.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/futami24a.html
https://proceedings.mlr.press/v238/futami24a.htmlJoint Selection: Adaptively Incorporating Public Information for Private Synthetic DataMechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism JAM-PGM, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that JAM-PGM is able to outperform both publicly assisted and non publicly assisted synthetic data generation mechanisms even when the public data distribution is biased.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/fuentes24a.html
https://proceedings.mlr.press/v238/fuentes24a.htmlTraining Implicit Generative Models via an Invariant Statistical LossImplicit generative models have the capability to learn arbitrary complex data distributions. On the downside, training requires telling apart real data from artificially-generated ones using adversarial discriminators, leading to unstable training and mode-dropping issues. As reported by Zahee et al. (2017), even in the one-dimensional (1D) case, training a generative adversarial network (GAN) is challenging and often suboptimal. In this work, we develop a discriminator-free method for training one-dimensional (1D) generative implicit models and subsequently expand this method to accommodate multivariate cases. Our loss function is a discrepancy measure between a suitably chosen transformation of the model samples and a uniform distribution; hence, it is invariant with respect to the true distribution of the data. We first formulate our method for 1D random variables, providing an effective solution for approximate reparameterization of arbitrary complex distributions. Then, we consider the temporal setting (both univariate and multivariate), in which we model the conditional distribution of each sample given the history of the process. We demonstrate through numerical simulations that this new method yields promising results, successfully learning true distributions in a variety of scenarios and mitigating some of the well-known problems that state-of-the-art implicit methods present.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/frutos24a.html
https://proceedings.mlr.press/v238/frutos24a.htmlScalable Learning of Item Response Theory ModelsItem Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for their use in alternating IRT training algorithms, facilitating scalable learning from large data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/frick24a.html
https://proceedings.mlr.press/v238/frick24a.htmlNeural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the MeanDeep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models. The code is available at the following link: \url{https://github.com/AnFreTh/NAMpy}Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/frederik-thielmann24a.html
https://proceedings.mlr.press/v238/frederik-thielmann24a.htmlSIFU: Sequential Informed Federated Unlearning for Efficient and Provable Client Unlearning in Federated OptimizationMachine Unlearning (MU) is an increasingly important topic in machine learning safety, aiming at removing the contribution of a given data point from a training procedure. Federated Unlearning (FU) consists in extending MU to unlearn a given client’s contribution from a federated training routine. While several FU methods have been proposed, we currently lack a general approach providing formal unlearning guarantees to the FedAvg routine, while ensuring scalability and generalization beyond the convex assumption on the clients’ loss functions. We aim at filling this gap by proposing SIFU (Sequential Informed Federated Unlearning), a new FU method applying to both convex and non-convex optimization regimes. SIFU naturally applies to FedAvg without additional computational cost for the clients and provides formal guarantees on the quality of the unlearning task. We provide a theoretical analysis of the unlearning properties of SIFU, and practically demonstrate its effectiveness as compared to a panel of unlearning methods from the state-of-the-art.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/fraboni24a.html
https://proceedings.mlr.press/v238/fraboni24a.htmlThe Risks of Recourse in Binary ClassificationAlgorithmic recourse provides explanations that help users overturn an unfavorable decision by a machine learning system. But so far very little attention has been paid to whether providing recourse is beneficial or not. We introduce an abstract learning-theoretic framework that compares the risks (i.e., expected losses) for classification with and without algorithmic recourse. This allows us to answer the question of when providing recourse is beneficial or harmful at the population level. Surprisingly, we find that there are many plausible scenarios in which providing recourse turns out to be harmful, because it pushes users to regions of higher class uncertainty and therefore leads to more mistakes. We further study whether the party deploying the classifier has an incentive to strategize in anticipation of having to provide recourse, and we find that sometimes they do, to the detriment of their users. Providing algorithmic recourse may therefore also be harmful at the systemic level. We confirm our theoretical findings in experiments on simulated and real-world data. All in all, we conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/fokkema24a.html
https://proceedings.mlr.press/v238/fokkema24a.htmlSymmetric Equilibrium Learning of VAEsWe view variational autoencoders (VAE) as decoder-encoder pairs, which map distributions in the data space to distributions in the latent space and vice versa. The standard learning approach for VAEs is the maximisation of the evidence lower bound (ELBO). It is asymmetric in that it aims at learning a latent variable model while using the encoder as an auxiliary means only. Moreover, it requires a closed form a-priori latent distribution. This limits its applicability in more complex scenarios, such as general semi-supervised learning and employing complex generative models as priors. We propose a Nash equilibrium learning approach, which is symmetric with respect to the encoder and decoder and allows learning VAEs in situations where both the data and the latent distributions are accessible only by sampling. The flexibility and simplicity of this approach allows its application to a wide range of learning scenarios and downstream tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/flach24a.html
https://proceedings.mlr.press/v238/flach24a.htmlProving Linear Mode Connectivity of Neural Networks via Optimal TransportThe energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates is correlated with linear mode connectivity.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ferbach24a.html
https://proceedings.mlr.press/v238/ferbach24a.htmlMonitoring machine learning-based risk prediction algorithms in the presence of performativityPerformance monitoring of machine learning (ML)-based risk prediction models in healthcare is complicated by the issue of performativity: when an algorithm predicts a patient to be at high risk for an adverse event, clinicians are more likely to administer prophylactic treatment and alter the very target that the algorithm aims to predict. A simple approach is to ignore performativity and monitor only the untreated patients, whose outcomes remain unaltered. In general, ignoring performativity may inflate Type I error because (i) untreated patients disproportionally represent those with low predicted risk, and (ii) changes in the clinician’s trust in the ML algorithm and the algorithm itself can induce complex dependencies that violate standard assumptions. Nevertheless, we show that valid inference is still possible when monitoring \textit{conditional} rather than marginal performance measures under either the assumption of conditional exchangeability or time-constant selection bias. Finally, performativity can vary over time and induce nonstationarity in the data, which presents challenges for monitoring. To this end, we introduce a new score-based cumulative sum (CUSUM) monitoring procedure with dynamic control limits. Through extensive simulation studies, we study applications of the score-based CUSUM and how it is affected by various factors, including the efficiency of model updating procedures and the level of clinician trust. Finally, we apply the procedure to detect calibration decay of a risk model during the COVID-19 pandemic.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/feng24b.html
https://proceedings.mlr.press/v238/feng24b.htmlIs this model reliable for everyone? Testing for strong calibrationIn a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well-known to be difficult—particularly for machine learning (ML) algorithms—due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, there should be a change in the association between the predicted and observed residuals along this sequence if a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing into one of changepoint detection, for which powerful methods already exist. We begin with introducing a sample-splitting procedure where a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in empirical analyses.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/feng24a.html
https://proceedings.mlr.press/v238/feng24a.htmlTaming Nonconvex Stochastic Mirror Descent with General Bregman DivergenceThis paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGF, that overcomes the above limitations and relies solely on the standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of gradient mapping. We further extend our results to guarantee high probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-{Ł}ojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/fatkhullin24a.html
https://proceedings.mlr.press/v238/fatkhullin24a.htmlRL in Markov Games with Independent Function Approximation: Improved Sample Complexity Bound under the Local Access ModelEfficiently learning equilibria with large state and action spaces in general-sum Markov games while overcoming the curse of multi-agency is a challenging problem. Recent works have attempted to solve this problem by employing independent linear function classes to approximate the marginal $Q$-value for each agent. However, existing sample complexity bounds under such a framework have a suboptimal dependency on the desired accuracy $\varepsilon$ or the action space. In this work, we introduce a new algorithm, Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local access to the simulator, i.e., one can interact with the underlying environment on the visited states. Up to a logarithmic dependence on the size of the state space, Lin-Confident-FTRL learns $\epsilon$-CCE with a provable optimal accuracy bound $O(\epsilon^{-2})$ and gets rids of the linear dependency on the action space, while scaling polynomially with relevant problem parameters (such as the number of agents and time horizon). Moreover, our analysis of Linear-Confident-FTRL generalizes the virtual policy iteration technique in the single-agent local planning literature, which yields a new computationally efficient algorithm with a tighter sample complexity bound when assuming random access to the simulator.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/fan24b.html
https://proceedings.mlr.press/v238/fan24b.htmlFast and Adversarial Robust Kernelized SDU LearningSDU learning, a weakly supervised learning problem with only pairwise similarities, dissimilarities data points and unlabeled data available, has many practical applications. However, it is still lacking in defense against adversarial samples, and its learning process can be expensive. To address this gap, we propose a novel adversarial training framework for SDU learning. Our approach reformulates the conventional minimax problem as an equivalent minimization problem based on the kernel perspective, departing from traditional confrontational training methods. Additionally, we employ the random gradient method and random features to accelerate the training process. Theoretical analysis shows that our method can converge to a stationary point at a rate of $\mathcal{O}(1/T^{1/4})$. Our experimental results show that our algorithm is superior to other adversarial training methods in terms of generalization, efficiency and scalability against various adversarial attacks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/fan24a.html
https://proceedings.mlr.press/v238/fan24a.htmlSelf-Compatibility: Evaluating Causal Discovery without Ground TruthAs causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/faller24a.html
https://proceedings.mlr.press/v238/faller24a.htmlAsynchronous SGD on Graphs: a Unified Framework for Asynchronous Decentralized and Federated OptimizationDecentralized and asynchronous communications are two popular techniques to speedup communication complexity of distributed machine learning, by respectively removing the dependency over a central orchestrator and the need for synchronization. Yet, combining these two techniques together still remains a challenge. In this paper, we take a step in this direction and introduce Asynchronous SGD on Graphs (AGRAF SGD) — a general algorithmic framework that covers asynchronous versions of many popular algorithms including SGD, Decentralized SGD, Local SGD, FedBuff, thanks to its relaxed communication and computation assumptions. We provide rates of convergence under much milder assumptions than previous decentralized asynchronous works, while still recovering or even improving over the best know results for all the algorithms covered.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/even24a.html
https://proceedings.mlr.press/v238/even24a.htmlGeneralization Bounds for Label Noise Stochastic Gradient DescentWe develop generalization error bounds for stochastic gradient descent (SGD) with label noise in non-convex settings under uniform dissipativity and smoothness conditions. Under a suitable choice of semimetric, we establish a contraction in Wasserstein distance of the label noise stochastic gradient flow that depends polynomially on the parameter dimension $d$. Using the framework of algorithmic stability, we derive time-independent generalisation error bounds for the discretized algorithm with a constant learning rate. The error bound we achieve scales polynomially with $d$ and with the rate of $n^{-2/3}$, where $n$ is the sample size. This rate is better than the best-known rate of $n^{-1/2}$ established for stochastic gradient Langevin dynamics (SGLD)—which employs parameter-independent Gaussian noise—under similar conditions. Our analysis offers quantitative insights into the effect of label noise.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/eun-huh24a.html
https://proceedings.mlr.press/v238/eun-huh24a.htmlAccuracy-Preserving Calibration via Statistical Modeling on Probability SimplexClassification models based on deep neural networks (DNNs) must be calibrated to measure the reliability of predictions. Some recent calibration methods have employed a probabilistic model on the probability simplex. However, these calibration methods cannot preserve the accuracy of pre-trained models, even those with a high classification accuracy. We propose an accuracy-preserving calibration method using the Concrete distribution as the probabilistic model on the probability simplex. We theoretically prove that a DNN model trained on cross-entropy loss has optimality as the parameter of the Concrete distribution. We also propose an efficient method that synthetically generates samples for training probabilistic models on the probability simplex. We demonstrate that the proposed method can outperform previous methods in accuracy-preserving calibration tasks using benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/esaki24a.html
https://proceedings.mlr.press/v238/esaki24a.htmlNoisyMix: Boosting Model Robustness to Common CorruptionsThe robustness of neural networks has become increasingly important in real-world applications where stable and reliable performance is valued over simply achieving high predictive accuracy. To address this, data augmentation techniques have been shown to improve robustness against input perturbations and domain shifts. In this paper, we propose a new training scheme called NoisyMix that leverages noisy augmentations in both input and feature space to improve model robustness and in-domain accuracy. We demonstrate the effectiveness of NoisyMix on several benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. Additionally, we provide theoretical analysis to better understand the implicit regularization and robustness properties of NoisyMix.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/erichson24a.html
https://proceedings.mlr.press/v238/erichson24a.htmlMixed Models with Multiple Instance LearningPredicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/engelmann24a.html
https://proceedings.mlr.press/v238/engelmann24a.htmlCross-model Mutual Learning for Exemplar-based Medical Image SegmentationMedical image segmentation typically demands extensive dense annotations for model training, which is both time-consuming and skill-intensive. To mitigate this burden, exemplar-based medical image segmentation methods have been introduced to achieve effective training with only one annotated image. In this paper, we introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS), which leverages two models to mutually excavate implicit information from unlabeled data at multiple granularities. CMEMS can eliminate confirmation bias and enable collaborative training to learn complementary information by enforcing consistency at different granularities across models. Concretely, cross-model image perturbation based mutual learning is devised by using weakly perturbed images to generate high-confidence pseudo-labels, supervising predictions of strongly perturbed images across models. This approach enables joint pursuit of prediction consistency at the image granularity. Moreover, cross-model multi-level feature perturbation based mutual learning is designed by letting pseudo-labels supervise predictions from perturbed multi-level features with different resolutions, which can broaden the perturbation space and enhance the robustness of our framework. CMEMS is jointly trained using exemplar data, synthetic data, and unlabeled data in an end-to-end manner. Experimental results on two medical image datasets indicate that the proposed CMEMS outperforms the state-of-the-art segmentation methods with extremely limited supervision.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/en24a.html
https://proceedings.mlr.press/v238/en24a.htmlStochastic Extragradient with Random Reshuffling: Improved Convergence for Variational InequalitiesThe Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving finite-sum min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, existing convergence analyses of SEG focus on its with-replacement variants, while practical implementations of the method randomly reshuffle components and sequentially use them. Unlike the well-studied with-replacement variants, SEG with Random Reshuffling (SEG-RR) lacks established theoretical guarantees. In this work, we provide a convergence analysis of SEG-RR for three classes of VIPs: (i) strongly monotone, (ii) affine, and (iii) monotone. We derive conditions under which SEG-RR achieves a faster convergence rate than the uniform with-replacement sampling SEG. In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes, a strong requirement needed in the classical with-replacement SEG. As a byproduct of our results, we provide convergence guarantees for Shuffle Once SEG (shuffles the data only at the beginning of the algorithm) and the Incremental Extragradient (does not shuffle the data). We supplement our analysis with experiments validating empirically the superior performance of SEG-RR over the classical with-replacement sampling SEG.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/emmanouilidis24a.html
https://proceedings.mlr.press/v238/emmanouilidis24a.htmlOn The Temporal Domain of Differential Equation Inspired Graph Neural NetworksGraph Neural Networks (GNNs) have demonstrated remarkable success in modeling complex relationships in graph-structured data. A recent innovation in this field is the family of Differential Equation-Inspired Graph Neural Networks (DE-GNNs), which leverage principles from continuous dynamical systems to model information flow on graphs with built-in properties such as feature smoothing or preservation. However, existing DE-GNNs rely on first or second-order temporal dependencies. In this paper, we propose a neural extension to those pre-defined temporal dependencies. We show that our model, called TDE-GNN, can capture a wide range of temporal dynamics that go beyond typical first or second-order methods, and provide use cases where existing temporal models are challenged. We demonstrate the benefit of learning the temporal dependencies using our method rather than using pre-defined temporal dynamics on several graph benchmarks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/eliasof24a.html
https://proceedings.mlr.press/v238/eliasof24a.htmlGeneral Tail Bounds for Non-Smooth Stochastic Mirror DescentIn this paper, we provide novel tail bounds on the optimization error of Stochastic Mirror Descent for convex and Lipschitz objectives. Our analysis extends the existing tail bounds from the classical light-tailed Sub-Gaussian noise case to heavier-tailed noise regimes. We study the optimization error of the last iterate as well as the average of the iterates. We instantiate our results in two important cases: a class of noise with exponential tails and one with polynomial tails. A remarkable feature of our results is that they do not require an upper bound on the diameter of the domain. Finally, we support our theory with illustrative experiments that compare the behavior of the average of the iterates with that of the last iterate in heavy-tailed noise regimes.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/eldowa24a.html
https://proceedings.mlr.press/v238/eldowa24a.htmlFairness in Submodular Maximization over a Matroid ConstraintSubmodular maximization over a matroid constraint is a fundamental problem with various applications in machine learning. Some of these applications involve decision-making over datapoints with sensitive attributes such as gender or race. In such settings, it is crucial to guarantee that the selected solution is fairly distributed with respect to this attribute. Recently, fairness has been investigated in submodular maximization under a cardinality constraint in both the streaming and offline settings, however the more general problem with matroid constraint has only been considered in the streaming setting and only for monotone objectives. This work fills this gap. We propose various algorithms and impossibility results offering different trade-offs between quality, fairness, and generality.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/el-halabi24a.html
https://proceedings.mlr.press/v238/el-halabi24a.htmlSketch In, Sketch Out: Accelerating both Learning and Inference for Structured Prediction with KernelsLeveraging the kernel trick in both the input and output spaces, surrogate kernel methods are a flexible and theoretically grounded solution to structured output prediction. If they provide state-of-the-art performance on complex data sets of moderate size (e.g., in chemoinformatics), these approaches however fail to scale. We propose to equip surrogate kernel methods with sketching-based approximations, applied to both the input and output feature maps. We prove excess risk bounds on the original structured prediction problem, showing how to attain close-to-optimal rates with a reduced sketch size that depends on the eigendecay of the input/output covariance operators. From a computational perspective, we show that the two approximations have distinct but complementary impacts: sketching the input kernel mostly reduces training time, while sketching the output kernel decreases the inference time. Empirically, our approach is shown to scale, achieving state-of-the-art performance on benchmark data sets where non-sketched methods are intractable.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/el-ahmad24a.html
https://proceedings.mlr.press/v238/el-ahmad24a.htmlDiscriminator Guidance for Autoregressive Diffusion ModelsWe introduce discriminator guidance in the setting of Autoregressive Diffusion Models. The use of a discriminator to guide a diffusion process has previously been used for continuous diffusion models, and in this work we derive ways of using a discriminator together with a pretrained generative model in the discrete case. First, we show that using an optimal discriminator will correct the pretrained model and enable exact sampling from the underlying data distribution. Second, to account for the realistic scenario of using a sub-optimal discriminator, we derive a sequential Monte Carlo algorithm which iteratively takes the predictions from the discriminator into account during the generation process. We test these approaches on the task of generating molecular graphs and show how the discriminator improves the generative performance over using only the pretrained model.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ekstrom-kelvinius24a.html
https://proceedings.mlr.press/v238/ekstrom-kelvinius24a.htmlApproximate Control for Continuous-Time POMDPsThis work proposes a decision-making framework for partially observable systems in continuous time with discrete state and action spaces. As optimal decision-making becomes intractable for large state spaces we employ approximation methods for the filtering and the control problem that scale well with an increasing number of states. Specifically, we approximate the high-dimensional filtering distribution by projecting it onto a parametric family of distributions, and integrate it into a control heuristic based on the fully observable system to obtain a scalable policy. We demonstrate the effectiveness of our approach on several partially observed systems, including queueing systems and chemical reaction networks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/eich24a.html
https://proceedings.mlr.press/v238/eich24a.htmlSpectrum Extraction and Clipping for Implicitly Linear LayersWe show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers. We provide the first clipping method which is correct for general convolution layers, and illuminate the representational limitation that caused correctness issues in prior work. We study the effect of the batch normalization layers when concatenated with convolutional layers and show how our clipping method can be applied to their composition. By comparing the accuracy and performance of our algorithms to the state-of-the-art methods, using various experiments, we show they are more precise and efficient and lead to better generalization and adversarial robustness. We provide the code for using our methods at https://github.com/Ali-E/FastClip.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ebrahimpour-boroojeny24a.html
https://proceedings.mlr.press/v238/ebrahimpour-boroojeny24a.htmlAnalysis of Kernel Mirror Prox for Measure OptimizationBy choosing a suitable function space as the dual to the non-negative measure cone, we study in a unified framework a class of functional saddle-point optimization problems, which we term the Mixed Functional Nash Equilibrium (MFNE), that underlies several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. We model the saddle-point optimization dynamics as an interacting Fisher-Rao-RKHS gradient flow when the function space is chosen as a reproducing kernel Hilbert space (RKHS). As a discrete time counterpart, we propose a primal-dual kernel mirror prox (KMP) algorithm, which uses a dual step in the RKHS, and a primal entropic mirror prox step. We then provide a unified convergence analysis of KMP in an infinite-dimensional setting for this class of MFNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case, where $N$ is the iteration counter. As a case study, we apply our analysis to DRO, providing algorithmic guarantees for DRO robustness and convergence.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dvurechensky24a.html
https://proceedings.mlr.press/v238/dvurechensky24a.htmlThe Solution Path of SLOPEThe SLOPE estimator has the particularity of having null components (sparsity) and components that are equal in absolute value (clustering). The number of clusters depends on the regularization parameter of the estimator. This parameter can be chosen as a trade-off between interpretability (with a small number of clusters) and accuracy (with a small mean squared error or a small prediction error). Finding such a compromise requires to compute the solution path, that is the function mapping the regularization parameter to the estimator. We provide in this article an algorithm to compute the solution path of SLOPE and show how it can be used to adjust the regularization parameter.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dupuis24a.html
https://proceedings.mlr.press/v238/dupuis24a.htmlFree-form Flows: Make Any Architecture a Normalizing FlowNormalizing Flows are generative models that directly maximize the likelihood. Previously, the design of normalizing flows was largely constrained by the need for analytical invertibility. We overcome this constraint by a training procedure that uses an efficient estimator for the gradient of the change of variables formula. This enables any dimension-preserving neural network to serve as a generative model through maximum likelihood training. Our approach allows placing the emphasis on tailoring inductive biases precisely to the task at hand. Specifically, we achieve excellent results in molecule generation benchmarks utilizing E(n)-equivariant networks at greatly improved sampling speed. Moreover, our method is competitive in an inverse problem benchmark, while employing off-the-shelf ResNet architectures. We publish our code at https://github.com/vislearn/FFF.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/draxler24a.html
https://proceedings.mlr.press/v238/draxler24a.htmlCertified private data release for sparse Lipschitz functionsAs machine learning has become more relevant for everyday applications, a natural requirement is the protection of the privacy of the training data. When the relevant learning questions are unknown in advance, or hyper-parameter tuning plays a central role, one solution is to release a differentially private synthetic data set that leads to similar conclusions as the original training data. In this work, we introduce an algorithm that enjoys fast rates for the utility loss for sparse Lipschitz queries. Furthermore, we show how to obtain a certificate for the utility loss for a large class of algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/donhauser24a.html
https://proceedings.mlr.press/v238/donhauser24a.htmlConvergence to Nash Equilibrium and No-regret Guarantee in (Markov) Potential GamesIn this work, we study potential games and Markov potential games under stochastic cost and bandit feedback. We propose a variant of the Frank-Wolfe algorithm with sufficient exploration and recursive gradient estimation, which provably converges to the Nash equilibrium while attaining sublinear regret for each individual player. Our algorithm simultaneously achieves a Nash regret and a regret bound of $O(T^{4/5})$ for potential games, which matches the best available result, without using additional projection steps. Through carefully balancing the reuse of past samples and exploration of new samples, we then extend the results to Markov potential games and improve the best available Nash regret from $O(T^{5/6})$ to $O(T^{4/5})$. Moreover, our algorithm requires no knowledge of the game, such as the distribution mismatch coefficient, which provides more flexibility in its practical implementation. Experimental results corroborate our theoretical findings and underscore the practical effectiveness of our method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dong24a.html
https://proceedings.mlr.press/v238/dong24a.htmlBayesian Semi-structured Subspace InferenceSemi-structured regression models enable the joint modeling of interpretable structured and complex unstructured feature effects. The structured model part is inspired by statistical models and can be used to infer the input-output relationship for features of particular importance. The complex unstructured part defines an arbitrary deep neural network and thereby provides enough flexibility to achieve competitive prediction performance. While these models can also account for aleatoric uncertainty, there is still a lack of work on accounting for epistemic uncertainty. In this paper, we address this problem by presenting a Bayesian approximation for semi-structured regression models using subspace inference. To this end, we extend subspace inference for joint posterior sampling from a full parameter space for structured effects and a subspace for unstructured effects. Apart from this hybrid sampling scheme, our method allows for tunable complexity of the subspace and can capture multiple minima in the loss landscape. Numerical experiments validate our approach’s efficacy in recovering structured effect parameter posteriors in semi-structured models and approaching the full-space posterior distribution of MCMC for increasing subspace dimension. Further, our approach exhibits competitive predictive performance across simulated and real-world datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dold24a.html
https://proceedings.mlr.press/v238/dold24a.htmlResilient Constrained Reinforcement LearningWe study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before training. It is challenging to identify appropriate constraint specifications due to the undefined trade-off between the reward maximization objective and the constraint satisfaction, which is ubiquitous in constrained decision-making. To tackle this issue, we propose a new constrained RL approach that searches for policy and constraint specifications together. This method features the adaptation of relaxing the constraint according to a relaxation cost introduced in the learning objective. Since this feature mimics how ecological systems adapt to disruptions by altering operation, our approach is termed as resilient constrained RL. Specifically, we provide a set of sufficient conditions that balance the constraint satisfaction and the reward maximization in notion of resilient equilibrium, propose a tractable formulation of resilient constrained policy optimization that takes this equilibrium as an optimal solution, and advocate two resilient constrained policy search algorithms with non-asymptotic convergence guarantees on the optimality gap and constraint satisfaction. Furthermore, we demonstrate the merits and the effectiveness of our approach in computational experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ding24a.html
https://proceedings.mlr.press/v238/ding24a.htmlGRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning ModelsWe study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards the center variable computed as a weighted average of workers, where the weights are inversely proportional to the gradient norms of the workers such that recovering the flat regions in the optimization landscape is prioritized. We develop two asynchronous variants of the proposed algorithm that we call Model-level and Layer-level Gradient-based Weighted Averaging (resp. MGRAWA and LGRAWA), which differ in terms of the weighting scheme that is either done with respect to the entire model or is applied layer-wise. On the theoretical front, we prove the convergence guarantee for the proposed approach in both convex and non-convex settings. We then experimentally demonstrate that our algorithms outperform the competitor methods by achieving faster convergence and recovering better quality and flatter local optima. We also carry out an ablation study to analyze the scalability of the proposed algorithms in more crowded distributed training environments. Finally, we report that our approach requires less frequent communication and fewer distributed updates compared to the state-of-the-art baselines.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dimlioglu24a.html
https://proceedings.mlr.press/v238/dimlioglu24a.htmlMixed variational flows for discrete variablesVariational flows allow practitioners to learn complex continuous distributions, but approximating discrete distributions remains a challenge. Current methodologies typically embed the discrete target in a continuous space—usually via continuous relaxation or dequantization—and then apply a continuous flow. These approaches involve a surrogate target that may not capture the original discrete target, might have biased or unstable gradients, and can create a difficult optimization problem. In this work, we develop a variational flow family for discrete distributions without any continuous embedding. First, we develop a measure-preserving and discrete (MAD) invertible map that leaves the discrete target invariant, and then create a mixed variational flow (MAD Mix) based on that map. Our family provides access to i.i.d. sampling and density evaluation with virtually no tuning effort. We also develop an extension to MAD Mix that handles joint discrete and continuous models. Our experiments suggest that MAD Mix produces more reliable approximations than continuous-embedding flows while requiring orders of magnitude less compute.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/diluvi24a.html
https://proceedings.mlr.press/v238/diluvi24a.htmlConformalized Deep Splines for Optimal and Efficient Prediction SetsUncertainty estimation is critical in high-stakes machine learning applications. One effective way to estimate uncertainty is conformal prediction, which can provide predictive inference with statistical coverage guarantees. We present a new conformal regression method, Spline Prediction Intervals via Conformal Estimation (SPICE), that estimates the conditional density using neural- network-parameterized splines. We prove universal approximation and optimality results for SPICE, which are empirically reflected by our experiments. SPICE is compatible with two different efficient-to- compute conformal scores, one designed for size-efficient marginal coverage (SPICE-ND) and the other for size-efficient conditional coverage (SPICE-HPD). Results on benchmark datasets demonstrate SPICE-ND models achieve the smallest average prediction set sizes, including average size reductions of nearly 50% for some datasets compared to the next best baseline. SPICE-HPD models achieve the best conditional coverage compared to baselines. The SPICE implementation is made available.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/diamant24a.html
https://proceedings.mlr.press/v238/diamant24a.htmlOn the Expected Size of Conformal Prediction SetsWhile conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dhillon24a.html
https://proceedings.mlr.press/v238/dhillon24a.htmlProbabilistic Calibration by Design for Neural Network RegressionGenerating calibrated and sharp neural network predictive distributions for regression problems is essential for optimal decision-making in many real-world applications. To address the miscalibration issue of neural networks, various methods have been proposed to improve calibration, including post-hoc methods that adjust predictions after training and regularization methods that act during training. While post-hoc methods have shown better improvement in calibration compared to regularization methods, the post-hoc step is completely independent of model training. We introduce a novel end-to-end model training procedure called Quantile Recalibration Training, integrating post-hoc calibration directly into the training process without additional parameters. We also present a unified algorithm that includes our method and other post-hoc and regularization methods, as particular cases. We demonstrate the performance of our method in a large-scale experiment involving 57 tabular regression datasets, showcasing improved predictive accuracy while maintaining calibration. We also conduct an ablation study to evaluate the significance of different components within our proposed method, as well as an in-depth analysis of the impact of the base model and different hyperparameters on predictive accuracy.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dheur24a.html
https://proceedings.mlr.press/v238/dheur24a.htmlOnline Calibrated and Conformal Prediction Improves Bayesian OptimizationAccurate uncertainty estimates are important in sequential model-based decision-making tasks such as Bayesian optimization. However, these estimates can be imperfect if the data violates assumptions made by the model (e.g., Gaussianity). This paper studies which uncertainties are needed in model-based decision-making and in Bayesian optimization, and argues that uncertainties can benefit from calibration—i.e., an 80% predictive interval should contain the true outcome 80% of the time. Maintaining calibration, however, can be challenging when the data is non-stationary and depends on our actions. We propose using simple algorithms based on online learning to provably maintain calibration on non-i.i.d. data, and we show how to integrate these algorithms in Bayesian optimization with minimal overhead. Empirically, we find that calibrated Bayesian optimization converges to better optima in fewer steps, and we demonstrate improved performance on standard benchmark functions and hyperparameter optimization tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/deshpande24a.html
https://proceedings.mlr.press/v238/deshpande24a.htmlLearning-Based Algorithms for Graph Searching ProblemsWe consider the problem of graph searching with prediction recently introduced by Banerjee et al. (2023). In this problem, an agent starting at some vertex r has to traverse a (potentially unknown) graph G to find a hidden goal node g while minimizing the total distance traveled. We study a setting in which at any node v, the agent receives a noisy estimate of the distance from v to g. We design algorithms for this search task on unknown graphs. We establish the first formal guarantees on unknown weighted graphs and provide lower bounds showing that the algorithms we propose have optimal or nearly-optimal dependence on the prediction error. Further, we perform numerical experiments demonstrating that in addition to being robust to adversarial error, our algorithms perform well in typical instances in which the error is stochastic. Finally, we provide simpler performance bounds on the algorithms of Banerjee et al. (2023) for the case of searching on a known graph and establish new lower bounds for this setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/depavia24a.html
https://proceedings.mlr.press/v238/depavia24a.htmlOn the Generalization Ability of Unsupervised PretrainingRecent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distribution and tasks in pre-training and fine-tuning stage. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze generalization bound of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhances the generalization of fine-tuned model. Overall, our results contribute to a better understanding of unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/deng24b.html
https://proceedings.mlr.press/v238/deng24b.htmlSample Complexity Characterization for Linear Contextual MDPsContextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable. While CMDPs serve as an important framework to model many real-world applications with time-varying environments, they are largely unexplored from theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they enjoy guaranteed $\epsilon$-suboptimality gap with desired polynomial sample complexity. In particular, instantiating our result for the first model to the tabular CMDP improves the existing result by removing the reachability assumption. Our result for the second model is the first-known result for such a type of function approximation models. Comparison between our results for the two models further indicates that having context-varying features leads to much better sample efficiency than having common representations for all contexts under linear CMDPs.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/deng24a.html
https://proceedings.mlr.press/v238/deng24a.htmlBenchmarking Observational Studies with Experimental Data under Right-CensoringDrawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We consider two cases where censoring time (1) is independent of time-to-event and (2) depends on time-to-event the same way in OS and RCT. For the former, we adopt a censoring-doubly-robust signal for the conditional average treatment effect (CATE) to facilitate an equivalence test of CATEs in OS and RCT, which serves as a proxy for testing if the validity assumptions hold. For the latter, we show that the same test can still be used even though unbiased CATE estimation may not be possible. We verify the effectiveness of our censoring-aware tests via semi-synthetic experiments and analyze RCT and OS data from the Women’s Health Initiative study.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/demirel24a.html
https://proceedings.mlr.press/v238/demirel24a.htmlBreaking isometric ties and introducing priors in Gromov-Wasserstein distancesGromov-Wasserstein distance has many applications in machine learning due to its ability to compare measures across metric spaces and its invariance to isometric transformations. However, in certain applications, this invariant property can be too flexible, thus undesirable. Moreover, the Gromov-Wasserstein distance solely considers pairwise sample similarities in input datasets, disregarding the raw feature representations. We propose a new optimal transport formulation, called Augmented Gromov-Wasserstein (AGW), that allows for some control over the level of rigidity to transformations. It also incorporates feature alignments, enabling us to better leverage prior knowledge on the input data for improved performance. We first present theoretical insights into the proposed method. We then demonstrate its usefulness for single-cell multi-omic alignment tasks and heterogeneous domain adaptation in machine learning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/demetci24a.html
https://proceedings.mlr.press/v238/demetci24a.htmlThink Before You Duel: Understanding Complexities of Preference Learning under Constrained ResourcesWe consider the problem of reward maximization in the dueling bandit setup along with constraints on resource consumption. As in the classic dueling bandits, at each round the learner has to choose a pair of items from a set of $K$ items and observe a relative feedback for the current pair. Additionally, for both items, the learner also observes a vector of resource consumptions. The objective of the learner is to maximize the cumulative reward, while ensuring that the total consumption of any resource is within the allocated budget. We show that due to the relative nature of the feedback, the problem is more difficult than its bandit counterpart and that without further assumptions the problem is not learnable from a regret minimization perspective. Thereafter, by exploiting assumptions on the available budget, we provide an EXP3 based dueling algorithm that also considers the associated consumptions and show that it achieves an $\tilde{\mathcal{O}}\left(\big({\frac{OPT^{(b)}}{B}}+1\big)K^{1/3}T^{2/3}\right)$ regret, where $OPT^{(b)}$ is the optimal value and $B$ is the available budget. Finally, we provide numerical simulations to demonstrate the efficacy of our proposed method.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/deb24a.html
https://proceedings.mlr.press/v238/deb24a.htmlEmergent specialization from participation dynamics and multi-learner retrainingNumerous online services are data-driven: the behavior of users affects the system’s parameters, and the system’s parameters affect the users’ experience of the service, which in turn affects the way users may interact with the system. For example, people may choose to use a service only for tasks that already works well, or they may choose to switch to a different service. These adaptations influence the ability of a system to learn about a population of users and tasks in order to improve its performance broadly. In this work, we analyze a class of such dynamics—where users allocate their participation amongst services to reduce the individual risk they experience, and services update their model parameters to reduce the service’s risk on their current user population. We refer to these dynamics as \emph{risk-reducing}, which cover a broad class of common model updates including gradient descent and multiplicative weights. For this general class of dynamics, we show that asymptotically stable equilibria are always segmented, with sub-populations allocated to a single learner. Under mild assumptions, the utilitarian social optimum is a stable equilibrium. In contrast to previous work, which shows that repeated risk minimization can result in representation disparity and high overall loss with a single learner (Hashimoto et al., 2018; Miller et al., 2021), we find that repeated myopic updates with multiple learners lead to better outcomes. We illustrate the phenomena via a simulated example initialized from real data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dean24a.html
https://proceedings.mlr.press/v238/dean24a.htmlHidden yet quantifiable: A lower bound for confounding strength using randomized trialsIn the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding above a certain strength. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world example.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/de-bartolomeis24a.html
https://proceedings.mlr.press/v238/de-bartolomeis24a.htmlData-Driven Online Model Selection With Regret GuaranteesWe consider model selection for sequential decision making in stochastic environments with bandit feedback, where a meta-learner has at its disposal a pool of base learners, and decides on the fly which action to take based on the policies recommended by each base learner. Model selection is performed by regret balancing but, unlike the recent literature on this subject, we do not assume any prior knowledge about the base learners like candidate regret guarantees; instead, we uncover these quantities in a data-driven manner. The meta-learner is therefore able to leverage the *realized* regret incurred by each base learner for the learning environment at hand (as opposed to the *expected* regret), and single out the best such regret. We design two model selection algorithms operating with this more ambitious notion of regret and, besides proving model selection guarantees via regret balancing, we experimentally demonstrate the compelling practical benefits of dealing with actual regrets instead of candidate regret bounds.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dann24a.html
https://proceedings.mlr.press/v238/dann24a.htmlGraph Partitioning with a Move BudgetIn many real world networks, there already exists a (not necessarily optimal) $k$-partitioning of the network. Oftentimes, for such networks, one aims to find a $k$-partitioning with a smaller cut value by moving only a few nodes across partitions. The number of nodes that can be moved across partitions is often a constraint forced by budgetary limitations. Motivated by such real-world applications, we introduce and study the $r$-move $k$-partitioning problem, a natural variant of the Multiway cut problem. Given a graph, a set of $k$ terminals and an initial partitioning of the graph, the $r$-move $k$-partitioning problem aims to find a $k$-partitioning with the minimum-weighted cut among all the $k$-partitionings that can be obtained by moving at most $r$ non-terminal nodes to partitions different from their initial ones. Our main result is a polynomial time $3(r+1)$ approximation algorithm for this problem. We further show that this problem is $W[1]$-hard, and give an FPTAS for when $r$ is a small constant.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dalirrooyfard24a.html
https://proceedings.mlr.press/v238/dalirrooyfard24a.htmlSADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR DataMissing values are prevalent in temporal electronic health records (EHR) data and are known to complicate data analysis and lead to biased results. The current state-of-the-art (SOTA) models for imputing missing values in EHR primarily leverage correlations across time points and across features, which perform well when data have strong correlation across time points, such as in ICU data where high-frequency time series data are collected. However, this is often insufficient for temporal EHR data from non-ICU settings (e.g., outpatient visits for primary care or specialty care), where data are collected at substantially sparser time points, resulting in much weaker correlation across time points. To address this methodological gap, we propose the Similarity-Aware Diffusion Model-Based Imputation (SADI), a novel imputation method that leverages the diffusion model and utilizes information across dependent variables. We apply SADI to impute incomplete temporal EHR data and propose a similarity-aware denoising function, which includes a self-attention mechanism to model the correlations between time points, features, and similar patients. To the best of our knowledge, this is the first time that the information of similar patients is directly used to construct imputation for incomplete temporal EHR data. Our extensive experiments on two datasets, the Critical Path For Alzheimer’s Disease (CPAD) data and the PhysioNet Challenge 2012 data, show that SADI outperforms the current SOTA under various missing data mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dai24c.html
https://proceedings.mlr.press/v238/dai24c.htmlCan Probabilistic Feedback Drive User Impacts in Online Platforms?A common explanation for negative user impacts of content recommender systems is misalignment between the platform’s objective and user welfare. In this work, we show that misalignment in the platform’s objective is not the only potential cause of unintended impacts on users: even when the platform’s objective is fully aligned with user welfare, the platform’s learning algorithm can induce negative downstream impacts on users. The source of these user impacts is that different pieces of content may generate observable user reactions (feedback information) at different rates; these feedback rates may correlate with content properties, such as controversiality or demographic similarity of the creator, that affect the user experience. Since differences in feedback rates can impact how often the learning algorithm engages with different content, the learning algorithm may inadvertently promote content with certain such properties. Using the multi-armed bandit framework with probabilistic feedback, we examine the relationship between feedback rates and a learning algorithm’s engagement with individual arms for different no-regret algorithms. We prove that no-regret algorithms can exhibit a wide range of dependencies: if the feedback rate of an arm increases, some no-regret algorithms engage with the arm more, some no-regret algorithms engage with the arm less, and other no-regret algorithms engage with the arm approximately the same number of times. From a platform design perspective, our results highlight the importance of looking beyond regret when measuring an algorithm’s performance, and assessing the nature of a learning algorithm’s engagement with different types of content as well as their resulting downstream impacts.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dai24b.html
https://proceedings.mlr.press/v238/dai24b.htmlLocal Causal Discovery with Linear non-Gaussian Cyclic ModelsLocal causal discovery is of great practical significance, as there are often situations where the discovery of the global causal structure is unnecessary, and the interest lies solely on a single target variable. Most existing local methods utilize conditional independence relations, providing only a partially directed graph, and assume acyclicity for the ground-truth structure, even though real-world scenarios often involve cycles like feedback mechanisms. In this work, we present a general, unified local causal discovery method with linear non-Gaussian models, whether they are cyclic or acyclic. We extend the application of independent component analysis from the global context to independent subspace analysis, enabling the exact identification of the equivalent local directed structures and causal strengths from the Markov blanket of the target variable. We also propose an alternative regression-based method in the particular acyclic scenarios. Our identifiability results are empirically validated using both synthetic and real-world datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dai24a.html
https://proceedings.mlr.press/v238/dai24a.htmlA Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk MinimizationBilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $O((n+m)^{1/2}\epsilon^{-1})$ oracle calls to achieve $\epsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, making it optimal in terms of sample complexity.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/dagreou24a.html
https://proceedings.mlr.press/v238/dagreou24a.htmlPrivacy-Constrained Policies via Mutual Information Regularized Policy GradientsAs reinforcement learning techniques are increasingly applied to real-world decision problems, attention has turned to how these algorithms use potentially sensitive information. We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions. We give examples of how this setting covers real-world problems in privacy for sequential decision-making. We solve this problem in the policy gradients framework by introducing a regularizer based on the mutual information (MI) between the sensitive state and the actions. We develop a model-based stochastic gradient estimator for optimization of privacy-constrained policies. We also discuss an alternative MI regularizer that serves as an upper bound to our main MI regularizer and can be optimized in a model-free setting, and a powerful direct estimator that can be used in an environment with differentiable dynamics. We contrast previous work in differentially-private RL to our mutual-information formulation of information disclosure. Experimental results show that our training method results in policies that hide the sensitive state, even in challenging high-dimensional tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cundy24a.html
https://proceedings.mlr.press/v238/cundy24a.htmlTheory-guided Message Passing Neural Network for Probabilistic InferenceProbabilistic inference can be tackled by minimizing a variational free energy through message passing. To improve performance, neural networks are adopted for message computation. Neural message learning is heuristic and requires strong guidance to perform well. In this work, we propose a {\em theory-guided message passing neural network} (TMPNN) for probabilistic inference. Inspired by existing work, we consider a generalized Bethe free energy which allows for a learnable variational assumption. Instead of using a black-box neural network for message computation, we utilize a general message equation and introduce a symbolic message function with semantically meaningful parameters. The analytically derived symbolic message function is seamlessly integrated into the MPNN framework, giving rise to the proposed TMPNN. TMPNN is trained using algorithmic supervision without requiring exact inference results. Leveraging the theory-guided symbolic function, TMPNN offers strengthened theoretical guarantees compared to conventional heuristic neural models. It presents a novel contribution by demonstrating its applicability to both MAP and marginal inference tasks, outperforming SOTAs in both cases. Furthermore, TMPNN provides improved generalizability across various graph structures and exhibits enhanced data efficiency.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cui24a.html
https://proceedings.mlr.press/v238/cui24a.htmlSequential learning of the Pareto front for multi-objective banditsWe study the problem of sequential learning of the Pareto front in multi-objective multi-armed bandits. An agent is faced with $K$ possible arms to pull. At each turn she picks one, and receives a vector-valued reward. When she thinks she has enough information to identify the Pareto front of the different arm means, she stops the game and gives an answer. We are interested in designing algorithms such that the answer given is correct with probability at least $1-\delta$. Our main contribution is an efficient implementation of an algorithm achieving the optimal sample complexity when the risk $\delta$ is small. With $K$ arms in $d$ dimensions, $p$ of which are in the Pareto set, the algorithm runs in time $O(K p^d)$ per round.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/crepon24a.html
https://proceedings.mlr.press/v238/crepon24a.htmlTo Pool or Not To Pool: Analyzing the Regularizing Effects of Group-Fair Training on Shared ModelsIn fair machine learning, one source of performance disparities between groups is overfitting to groups with relatively few training samples. We derive group-specific bounds on the generalization error of welfare-centric fair machine learning that benefit from the larger sample size of the majority group. We do this by considering group-specific Rademacher averages over a restricted hypothesis class, which contains the family of models likely to perform well with respect to a fair learning objective (e.g., a power-mean). Our simulations demonstrate these bounds improve over a naïve method, as expected by theory, with particularly significant improvement for smaller group sizes.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cousins24a.html
https://proceedings.mlr.press/v238/cousins24a.htmlA Unifying Variational Framework for Gaussian Process Motion PlanningTo control how a robot moves, motion planning algorithms must compute paths in high-dimensional state spaces while accounting for physical constraints related to motors and joints, generating smooth and stable motions, avoiding obstacles, and preventing collisions. A motion planning algorithm must therefore balance competing demands, and should ideally incorporate uncertainty to handle noise, model errors, and facilitate deployment in complex environments. To address these issues, we introduce a framework for robot motion planning based on variational Gaussian processes, which unifies and generalizes various probabilistic-inference-based motion planning algorithms, and connects them with optimization-based planners. Our framework provides a principled and flexible way to incorporate equality-based, inequality-based, and soft motion-planning constraints during end-to-end training, is straightforward to implement, and provides both interval-based and Monte-Carlo-based uncertainty estimates. We conduct experiments using different environments and robots, comparing against baseline approaches based on the feasibility of the planned paths, and obstacle avoidance quality. Results show that our proposed approach yields a good balance between success rates and path quality.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cosier24a.html
https://proceedings.mlr.press/v238/cosier24a.htmlOnline non-parametric likelihood-ratio estimation by Pearson-divergence functional minimizationQuantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -to our best knowledge- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x’_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated at every iteration. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/concha-duarte24a.html
https://proceedings.mlr.press/v238/concha-duarte24a.htmlConditions on Preference Relations that Guarantee the Existence of Optimal PoliciesLearning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies – that are characterized by recursive optimality equations – even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/colaco-carr24a.html
https://proceedings.mlr.press/v238/colaco-carr24a.htmlLearning Latent Partial Matchings with Gumbel-IPF NetworksLearning to match discrete objects has been a central task in machine learning, often facilitated by a continuous relaxation of the matching structure. However, practical problems entail partial matchings due to missing correspondences, which pose difficulties to the one-to-one matching learning techniques that dominate the state-of-the-art. This paper introduces Gumbel-IPF networks for learning latent partial matchings. At the core of our method is the differentiable Iterative Proportional Fitting (IPF) procedure that biproportionally projects onto the transportation polytope of target marginals. Our theoretical framework also allows drawing samples from the temperature-dependent partial matching distribution. We investigate the properties of common-practice relaxations through the lens of biproportional fitting and introduce a new metric, the empirical prediction shift. Our method’s advantages are demonstrated in experimental results on the semantic keypoints partial matching task on the Pascal VOC, IMC-PT-SparseGM, and CUB2001 datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cohen-indelman24a.html
https://proceedings.mlr.press/v238/cohen-indelman24a.htmlLearning a Fourier Transform for Linear Relative Positional Encodings in TransformersWe propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces. FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE mask. Besides, FLTs allow for applying certain structural inductive bias techniques to specify masking strategies, e.g. they provide a way to learn the so-called local RPEs introduced in this paper and give accuracy gains as compared with several other linear Transformers for language modeling. We also thoroughly test FLTs on other data modalities and tasks, such as image classification, 3D molecular modeling, and learnable optimizers. To the best of our knowledge, for 3D molecular data, FLTs are the first Transformer architectures providing linear attention and incorporating RPE masking.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/choromanski24a.html
https://proceedings.mlr.press/v238/choromanski24a.htmlCausal Discovery under Off-Target InterventionsCausal graph discovery is a significant problem with applications across various disciplines. However, with observational data alone, the underlying causal graph can only be recovered up to its Markov equivalence class, and further assumptions or interventions are necessary to narrow down the true graph. This work addresses the causal discovery problem under the setting of stochastic interventions with the natural goal of minimizing the number of interventions performed. We propose the following stochastic intervention model which subsumes existing adaptive noiseless interventions in the literature while capturing scenarios such as fat-hand interventions and CRISPR gene knockouts: any intervention attempt results in an actual intervention on a random subset of vertices, drawn from a \emph{distribution dependent on attempted action}. Under this model, we study the two fundamental problems in causal discovery of verification and search and provide approximation algorithms with polylogarithmic competitive ratios and provide some preliminary experimental results.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/choo24a.html
https://proceedings.mlr.press/v238/choo24a.htmlProvable Policy Gradient Methods for Average-Reward Markov Potential GamesWe study Markov potential games under the infinite horizon average reward criterion. Most previous studies have been for discounted rewards. We prove that both algorithms based on independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average reward criterion. To set the stage for gradient-based methods, we first establish that the average reward is a smooth function of policies and provide sensitivity bounds for the differential value functions, under certain conditions on ergodicity and the second largest eigenvalue of the underlying Markov decision process (MDP). We prove that three algorithms, policy gradient, proximal-Q, and natural policy gradient (NPG), converge to an $\epsilon$-Nash equilibrium with time complexity $O(\frac{1}{\epsilon^2})$, given a gradient/differential Q function oracle. When policy gradients have to be estimated, we propose an algorithm with $\tilde{O}(\frac{1}{\min_{s,a}\pi(a|s)\delta})$ sample complexity to achieve $\delta$ approximation error w.r.t the $\ell_2$ norm. Equipped with the estimator, we derive the first sample complexity analysis for a policy gradient ascent algorithm, featuring a sample complexity of $\tilde{O}(1/\epsilon^5)$. Simulation studies are presented.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cheng24a.html
https://proceedings.mlr.press/v238/cheng24a.htmlGibbs-Based Information Criteria and the Over-Parameterized RegimeDouble-descent refers to the unexpected drop in test loss of a learning algorithm beyond an interpolating threshold with over-parameterization, which is not predicted by information criteria in their classical forms due to the limitations in the standard asymptotic approach. We update these analyses using the information risk minimization framework and provide Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for models learned by the Gibbs algorithm. Notably, the penalty terms for the Gibbs-based AIC and BIC correspond to specific information measures, i.e., symmetrized KL information and KL divergence. We extend this information-theoretic analysis to over-parameterized models by providing two different Gibbs-based BICs to compute the marginal likelihood of random feature models in the regime where the number of parameters $p$ and the number of samples $n$ tend to infinity, with $p/n$ fixed. Our experiments demonstrate that the Gibbs-based BIC can select the high-dimensional model and reveal the mismatch between marginal likelihood and population risk in the over-parameterized regime, providing new insights to understand double-descent.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24g.html
https://proceedings.mlr.press/v238/chen24g.htmlCoreset Markov chain Monte CarloA Bayesian coreset is a small, weighted subset of data that replaces the full dataset during inference in order to reduce computational cost. However, state of the art methods for tuning coreset weights are expensive, require nontrivial user input, and impose constraints on the model. In this work, we propose a new method—coreset MCMC—that simulates a Markov chain targeting the coreset posterior, while simultaneously updating the coreset weights using those same draws. Coreset MCMC is simple to implement and tune, and can be used with any existing MCMC kernel. We analyze coreset MCMC in a representative setting to obtain key insights about the convergence behaviour of the method. Empirical results demonstrate that coreset MCMC provides higher quality posterior approximations and reduced computational cost compared with other coreset construction methods. Further, compared with other general subsampling MCMC methods, we find that coreset MCMC has a higher sampling efficiency with competitively accurate posterior approximations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24f.html
https://proceedings.mlr.press/v238/chen24f.htmlNon-Convex Joint Community Detection and Group Synchronization via Generalized Power MethodThis paper proposes a Generalized Power Method (GPM) to simultaneously solve the joint problem of community detection and group synchronization in a direct non-convex manner, in contrast to the existing method of semidefinite programming (SDP). Under a natural extension of stochastic block model (SBM), our theoretical analysis proves that the proposed algorithm is able to exactly recover the ground truth in $O(n\log^2 n)$ time for problems of size $n$, sharply outperforming the $O(n^{3.5})$ runtime of SDP. Moreover, we give a lower bound of model parameters as a sufficient condition for the exact recovery of GPM. The new bound breaches the information-theoretic limit for pure community detection under SBM, thus demonstrating the superiority of our simultaneous optimization algorithm over any two-stage method that performs the two tasks in succession. We also conduct numerical experiments on GPM and SDP to corroborate our theoretical analysis.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24e.html
https://proceedings.mlr.press/v238/chen24e.htmlEscaping Saddle Points in Heterogeneous Federated Learning via Distributed SGD with Communication CompressionWe consider the problem of finding second-order stationary points in the optimization of heterogeneous federated learning (FL). Previous works in FL mostly focus on first-order convergence guarantees, which do not rule out the scenario of unstable saddle points. Meanwhile, it is a key bottleneck of FL to achieve communication efficiency without compensating the learning accuracy, especially when local data are highly heterogeneous across different clients. Given this, we propose a novel algorithm PowerEF-SGD that only communicates compressed information via a novel error-feedback scheme. To our knowledge, PowerEF-SGD is the first distributed and compressed SGD algorithm that provably escapes saddle points in heterogeneous FL without any data homogeneity assumptions. In particular, PowerEF-SGD improves to second-order stationary points after visiting first-order (possibly saddle) points, using additional gradient queries and communication rounds only of almost the same order required by first-order convergence, and the convergence rate shows a linear-speedup pattern in terms of the number of workers. Our theory improves/recovers previous results, while extending to much more tolerant settings on the local data. Numerical experiments are provided to complement the theory.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24d.html
https://proceedings.mlr.press/v238/chen24d.htmlFederated Experiment Design under Distributed Differential PrivacyExperiment design has a rich history dating back over a century and has found many critical applications across various fields since then. The use and collection of users’ data in experiments often involve sensitive personal information, so additional measures to protect individual privacy are required during data collection, storage, and usage. In this work, we focus on the rigorous protection of users’ privacy (under the notion of differential privacy (DP)) while minimizing the trust toward service providers. Specifically, we consider the estimation of the average treatment effect (ATE) under DP, while only allowing the analyst to collect population-level statistics via secure aggregation, a distributed protocol enabling a service provider to aggregate information without accessing individual data. Although a vital component in modern A/B testing workflows, private distributed experimentation has not previously been studied. To achieve DP, we design local privatization mechanisms that are compatible with secure aggregation and analyze the utility, in terms of the width of confidence intervals, both asymptotically and non-asymptotically. We show how these mechanisms can be scaled up to handle the very large number of participants commonly found in practice. In addition, when introducing DP noise, it is imperative to cleverly split privacy budgets to estimate both the mean and variance of the outcomes and carefully calibrate the confidence intervals according to the DP noise. Last, we present comprehensive experimental evaluations of our proposed schemes and show the privacy-utility trade-offs in experiment design.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24c.html
https://proceedings.mlr.press/v238/chen24c.htmlAn Efficient Stochastic Algorithm for Decentralized Nonconvex-Strongly-Concave Minimax OptimizationThis paper studies the stochastic nonconvex-strongly-concave minimax optimization over a multi-agent network. We propose an efficient algorithm, called Decentralized Recursive gradient descEnt Ascent Method (DREAM), which achieves the best-known theoretical guarantee for finding the $\epsilon$-stationary points. Concretely, it requires $\mathcal{O}(\min (\kappa^3\epsilon^{-3},\kappa^2 \sqrt{N} \epsilon^{-2} ))$ stochastic first-order oracle (SFO) calls and $\tilde \mathcal O(\kappa^2 \epsilon^{-2})$ communication rounds, where $\kappa$ is the condition number and $N$ is the total number of individual functions. Our numerical experiments also validate the superiority of DREAM over previous methods.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24b.html
https://proceedings.mlr.press/v238/chen24b.htmlLower-level Duality Based Reformulation and Majorization Minimization Algorithm for Hyperparameter OptimizationHyperparameter tuning is an important task of machine learning, which can be formulated as a bilevel program (BLP). However, most existing algorithms are not applicable for BLP with non-smooth lower-level problems. To address this, we propose a single-level reformulation of the BLP based on lower-level duality without involving any implicit value function. To solve the reformulation, we propose a majorization minimization algorithm that marjorizes the constraint in each iteration. Furthermore, we show that the subproblems of the proposed algorithm for several widely-used hyperparameter turning models can be reformulated into conic programs that can be efficiently solved by the off-the-shelf solvers. We theoretically prove the convergence of the proposed algorithm and demonstrate its superiority through numerical experiments.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chen24a.html
https://proceedings.mlr.press/v238/chen24a.htmlDynamic Inter-treatment Information Sharing for Individualized Treatment Effects EstimationEstimation of individualized treatment effects (ITE) from observational studies is a fundamental problem in causal inference and holds significant importance across domains, including healthcare. However, limited observational datasets pose challenges in reliable ITE estimation as data have to be split among treatment groups to train an ITE learner. While information sharing among treatment groups can partially alleviate the problem, there is currently no general framework for end-to-end information sharing in ITE estimation. To tackle this problem, we propose a deep learning framework based on ‘\textit{soft weight sharing}’ to train ITE learners, enabling \textit{dynamic end-to-end} information sharing among treatment groups. The proposed framework complements existing ITE learners, and introduces a new class of ITE learners, referred to as \textit{HyperITE}. We extend state-of-the-art ITE learners with \textit{HyperITE} versions and evaluate them on IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework improves ITE estimation error, with increasing effectiveness for smaller datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chauhan24a.html
https://proceedings.mlr.press/v238/chauhan24a.htmlEfficient Quantum Agnostic Improper Learning of Decision TreesThe agnostic setting is the hardest generalization of the PAC model since it is akin to learning with adversarial noise. In this paper, we give a poly $(n, t, 1/\epsilon)$ quantum algorithm for learning size $t$ decision trees over $n$-bit inputs with uniform marginal over instances, in the agnostic setting, without membership queries (MQ). This is the first algorithm (classical or quantum) for efficiently learning decision trees without MQ. First, we construct a quantum agnostic weak learner by designing a quantum variant of the classical Goldreich-Levin algorithm that works with strongly biased function oracles. Next, we show how to quantize the agnostic boosting algorithm by Kalai and Kanade (2009) to obtain the first efficient quantum agnostic boosting algorithm (that has a polynomial speedup over existing adaptive quantum boosting algorithms). We then use the quantum agnostic boosting algorithm to boost the weak quantum agnostic learner constructed previously to obtain a quantum agnostic learner for decision trees. Using the above framework, we also give quantum decision tree learning algorithms without MQ in weaker noise models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chatterjee24a.html
https://proceedings.mlr.press/v238/chatterjee24a.htmlOnline Learning of Decision Trees with Thompson SamplingDecision Trees are prominent prediction models for interpretable Machine Learning. They have been thoroughly researched, mostly in the batch setting with a fixed labelled dataset, leading to popular algorithms such as C4.5, ID3 and CART. Unfortunately, these methods are of heuristic nature, they rely on greedy splits offering no guarantees of global optimality and often leading to unnecessarily complex and hard-to-interpret Decision Trees. Recent breakthroughs addressed this suboptimality issue in the batch setting, but no such work has considered the online setting with data arriving in a stream. To this end, we devise a new Monte Carlo Tree Search algorithm, Thompson Sampling Decision Trees (TSDT), able to produce optimal Decision Trees in an online setting. We analyse our algorithm and prove its almost sure convergence to the optimal tree. Furthermore, we conduct extensive experiments to validate our findings empirically. The proposed TSDT outperforms existing algorithms on several benchmarks, all while presenting the practical advantage of being tailored to the online setting.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chaouki24a.html
https://proceedings.mlr.press/v238/chaouki24a.htmlProbabilistic Modeling for Sequences of Sets in Continuous-TimeNeural marked temporal point processes have been a valuable addition to the existing toolbox of statistical parametric models for continuous-time event data. These models are useful for sequences where each event is associated with a single item (a single type of event or a “mark”)—but such models are not suited for the practical situation where each event is associated with a set of items. In this work, we develop a general framework for modeling set-valued data in continuous-time, compatible with any intensity-based recurrent neural point process model. In addition, we develop inference methods that can use such models to answer probabilistic queries such as “the probability of item A being observed before item B,” conditioned on sequence history. Computing exact answers for such queries is generally intractable for neural models due to both the continuous-time nature of the problem setting and the combinatorially-large space of potential outcomes for each event. To address this, we develop a class of importance sampling methods for querying with set-based sequences and demonstrate orders-of-magnitude improvements in efficiency over direct sampling via systematic experiments with four real-world datasets. We also illustrate how to use this framework to perform model selection using likelihoods that do not involve one-step-ahead prediction.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chang24a.html
https://proceedings.mlr.press/v238/chang24a.htmlScore Operator Newton transportWe propose a new approach for sampling and Bayesian computation that uses the score of the target distribution to construct a transport from a given reference distribution to the target. Our approach is an infinite-dimensional Newton method, involving an elliptic PDE, for finding a zero of a “score-residual” operator. We prove sufficient conditions for convergence to a valid transport map. Our Newton iterates can be computed by exploiting fast solvers for elliptic PDEs, resulting in new algorithms for Bayesian inference and other sampling tasks. We identify elementary settings where score-operator Newton transport achieves fast convergence while avoiding mode collapse.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chandramoorthy24a.html
https://proceedings.mlr.press/v238/chandramoorthy24a.htmlEquivalence Testing: The Power of Bounded AdaptivityEquivalence testing, a fundamental problem in the field of distribution testing, seeks to infer if two unknown distributions on $[n]$ are the same or far apart in the total variation distance. Conditional sampling has emerged as a powerful query model and has been investigated by theoreticians and practitioners alike, leading to the design of optimal algorithms albeit in a sequential setting (also referred to as adaptive tester). Given the profound impact of parallel computing over the past decades, there has been a strong desire to design algorithms that enable high parallelization. Despite significant algorithmic advancements over the last decade, parallelizable techniques (also termed non-adaptive testers) have $\tilde{O}(\log^{12}n)$ query complexity, a prohibitively large complexity to be of practical usage. Therefore, the primary challenge is whether it is possible to design algorithms that enable high parallelization while achieving efficient query complexity. Our work provides an affirmative answer to the aforementioned challenge: we present a highly parallelizable tester with a query complexity of $\tilde{O}(\log n)$, achieved through a single round of adaptivity, marking a significant stride towards harmonizing parallelizability and efficiency in equivalence testing.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chakraborty24b.html
https://proceedings.mlr.press/v238/chakraborty24b.htmlPrIsing: Privacy-Preserving Peer Effect Estimation via Ising ModelThe Ising model, originally developed as a spin-glass model for ferromagnetic elements, has gained popularity as a network-based model for capturing dependencies in agents’ outputs. Its increasing adoption in healthcare and the social sciences has raised privacy concerns regarding the confidentiality of agents’ responses. In this paper, we present a novel $(\varepsilon,\delta)$-differentially private algorithm specifically designed to protect the privacy of individual agents’ outcomes. Our algorithm allows for precise estimation of the natural parameter using a single network through an objective perturbation technique. Furthermore, we establish regret bounds for this algorithm and assess its performance on synthetic datasets and two real-world networks: one involving HIV status in a social network and the other concerning the political leaning of online blogs.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chakraborty24a.html
https://proceedings.mlr.press/v238/chakraborty24a.htmlTowards a Complete Benchmark on Video Moment LocalizationIn this paper, we propose and conduct a comprehensive benchmark on moment localization task, which aims to retrieve a segment that corresponds to a text query from a single untrimmed video. Our study starts from an observation that most moment localization papers report experimental results only on a few datasets in spite of availability of far more benchmarks. Thus, we conduct an extensive benchmark study to measure the performance of representative methods on widely used 7 datasets. Looking further into the details, we pose additional research questions and empirically verify them, including if they rely on unintended biases introduced by specific training data, if advanced visual features trained on classification task transfer well to this task, and if computational cost of each model pays off. With a series of these experiments, we provide multi-faceted evaluation of state-of-the-art moment localization models. Codes are available at \url{https://github.com/snuviplab/MoLEF}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/chae24a.html
https://proceedings.mlr.press/v238/chae24a.htmlRandom Oscillators Network for Time Series ProcessingWe introduce the Random Oscillators Network (RON), a physically-inspired recurrent model derived from a network of heterogeneous oscillators. Unlike traditional recurrent neural networks, RON keeps the connections between oscillators untrained by leveraging on smart random initialisations, leading to exceptional computational efficiency. A rigorous theoretical analysis finds the necessary and sufficient conditions for the stability of RON, highlighting the natural tendency of RON to lie at the edge of stability, a regime of configurations offering particularly powerful and expressive models. Through an extensive empirical evaluation on several benchmarks, we show four main advantages of RON. 1) RON shows excellent long-term memory and sequence classification ability, outperforming other randomised approaches. 2) RON outperforms fully-trained recurrent models and state-of-the-art randomised models in chaotic time series forecasting. 3) RON provides expressive internal representations even in a small parametrisation regime making it amenable to be deployed on low-powered devices and at the edge. 4) RON is up to two orders of magnitude faster than fully-trained models.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ceni24a.html
https://proceedings.mlr.press/v238/ceni24a.htmlDouble InfoGAN for Contrastive AnalysisContrastive Analysis (CA) deals with the discovery of what is common and what is distinctive of a target domain compared to a background one. This is of great interest in many applications, such as medical imaging. Current state-of-the-art (SOTA) methods are latent variable models based on VAE (CA-VAEs). However, they all either ignore important constraints or they don’t enforce fundamental assumptions. This may lead to sub-optimal solutions where distinctive factors are mistaken for common ones (or viceversa). Furthermore, the generated images have a rather poor quality, typical of VAEs, decreasing their interpretability and usefulness. Here, we propose Double InfoGAN, the first GAN based method for CA that leverages the high-quality synthesis of GAN and the separation power of InfoGAN. Experimental results on four visual datasets, from simple synthetic examples to complex medical images, show that the proposed method outperforms SOTA CA-VAEs in terms of latent separation and image quality. Datasets and code are available online.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/carton24a.html
https://proceedings.mlr.press/v238/carton24a.htmlGaussian process regression with Sliced Wasserstein Weisfeiler-Lehman graph kernelsSupervised learning has recently garnered significant attention in the field of computational physics due to its ability to effectively extract complex patterns for tasks like solving partial differential equations, or predicting material properties. Traditionally, such datasets consist of inputs given as meshes with a large number of nodes representing the problem geometry (seen as graphs), and corresponding outputs obtained with a numerical solver. This means the supervised learning model must be able to handle large and sparse graphs with continuous node attributes. In this work, we focus on Gaussian process regression, for which we introduce the Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernel. In contrast to existing graph kernels, the proposed SWWL kernel enjoys positive definiteness and a drastic complexity reduction, which makes it possible to process datasets that were previously impossible to handle. The new kernel is first validated on graph classification for molecular datasets, where the input graphs have a few tens of nodes. The efficiency of the SWWL kernel is then illustrated on graph regression in computational fluid dynamics and solid mechanics, where the input graphs are made up of tens of thousands of nodes.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/carpintero-perez24a.html
https://proceedings.mlr.press/v238/carpintero-perez24a.htmlThe sample complexity of ERMs in stochastic convex optimizationStochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: how many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population? This question was proposed by Feldman who proved that $\Omega(\frac{d}{\epsilon} + \frac{1}{\epsilon^2} )$ data points are necessary (where $d$ is the dimension and $\epsilon > 0$ the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon} + \frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon} + \frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of linear functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/carmon24a.html
https://proceedings.mlr.press/v238/carmon24a.htmlPure Exploration in Bandits with Linear ConstraintsWe address the problem of identifying the optimal policy with a fixed confidence level in a multi-armed bandit setup, when \emph{the arms are subject to linear constraints}. Unlike the standard best-arm identification problem which is well studied, the optimal policy in this case may not be deterministic and could mix between several arms. This changes the geometry of the problem which we characterize via an information-theoretic lower bound. We introduce two asymptotically optimal algorithms for this setting, one based on the Track-and-Stop method and the other based on a game-theoretic approach. Both these algorithms try to track an optimal allocation based on the lower bound and computed by a weighted projection onto the boundary of a normal cone. Finally, we provide empirical results that validate our bounds and visualize how constraints change the hardness of the problem.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/carlsson24a.html
https://proceedings.mlr.press/v238/carlsson24a.htmlConsistent Hierarchical Classification with A Generalized MetricIn multi-class hierarchical classification, a natural evaluation metric is the tree distance loss that takes the value of two labels’ distance on the pre-defined tree hierarchy. This metric is motivated by that its Bayes optimal solution is the deepest label on the tree whose induced superclass (subtree rooted at it) includes the true label with probability at least $\frac{1}{2}$. However, it can hardly handle the risk sensitivity of different tasks since its accuracy requirement for induced superclasses is fixed at $\frac{1}{2}$. In this paper, we first introduce a new evaluation metric that generalizes the tree distance loss, whose solution’s accuracy constraint $\frac{1+c}{2}$ can be controlled by a penalty value $c$ tailored for different tasks: a higher c indicates the emphasis on prediction’s accuracy and a lower one indicates that on specificity. Then, we propose a novel class of consistent surrogate losses based on an intuitive presentation of our generalized metric and its regret, which can be compatible with various binary losses. Finally, we theoretically derive the regret transfer bounds for our proposed surrogates and empirically validate their usefulness on benchmark datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cao24a.html
https://proceedings.mlr.press/v238/cao24a.htmlNear-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov GamesWe study policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov Games. Previous results achieve $\tilde{O}(T^{-1/2})$ convergence rate to a correlated equilibrium and an accelerated $\tilde{O}(T^{-3/4})$ convergence rate to the weaker notion of coarse correlated equilibrium. In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium. Our algorithm is constructed by combining two main elements (i) smooth value updates and (ii) the \emph{optimistic-follow-the-regularized-leader} algorithm with the log barrier regularizer.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cai24a.html
https://proceedings.mlr.press/v238/cai24a.htmlThe Galerkin method beats Graph-Based Approaches for Spectral AlgorithmsHistorically, the machine learning community has derived spectral decompositions from graph-based approaches. We break with this approach and prove the statistical and computational superiority of the Galerkin method, which consists in restricting the study to a small set of test functions. In particular, we introduce implementation tricks to deal with differential operators in large dimensions with structured kernels. Finally, we extend on the core principles beyond our approach to apply them to non-linear spaces of functions, such as the ones parameterized by deep neural networks, through loss-based optimization procedures.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/cabannes24a.html
https://proceedings.mlr.press/v238/cabannes24a.htmlAuditing Fairness under Unobserved ConfoundingA fundamental problem in decision-making systems is the presence of inequity along demographic lines. However, inequity can be difficult to quantify, particularly if our notion of equity relies on hard-to-measure notions like risk (e.g., equal access to treatment for those who would die without it). Auditing such inequity requires accurate measurements of individual risk, which is difficult to estimate in the realistic setting of unobserved confounding. In the case that these unobservables “explain” an apparent disparity, we may understate or overstate inequity. In this paper, we show that one can still give informative bounds on allocation rates among high-risk individuals, even while relaxing or (surprisingly) even when eliminating the assumption that all relevant risk factors are observed. We utilize the fact that in many real-world settings (e.g., the introduction of a novel treatment) we have data from a period prior to any allocation, to derive unbiased estimates of risk. We apply our framework to a real-world setting of Paxlovid allocation to COVID-19 patients, finding that observed racial inequity cannot be explained by unobserved confounders of the same strength as important observed covariates.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/byun24a.html
https://proceedings.mlr.press/v238/byun24a.htmlSimple and scalable algorithms for cluster-aware precision medicineAI-enabled precision medicine promises a transformational improvement in healthcare outcomes. However, training on biomedical data presents significant challenges as they are often high dimensional, clustered, and of limited sample size. To overcome these challenges, we propose a simple and scalable approach for cluster-aware embedding that combines latent factor methods with a convex clustering penalty in a modular way. Our novel approach overcomes the complexity and limitations of current joint embedding and clustering methods and enables hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through numerical experiments and real-world examples, we demonstrate that our approach outperforms fourteen clustering methods on highly underdetermined problems (e.g., with limited sample size) as well as on large sample datasets. Importantly, our approach does not require the user to choose the desired number of clusters, yields improved model selection if they do, and yields interpretable hierarchically clustered embedding dendrograms. Thus, our approach improves significantly on existing methods for identifying patient subgroups in multiomics and neuroimaging data and enables scalable and interpretable biomarkers for precision medicine.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/buch24a.html
https://proceedings.mlr.press/v238/buch24a.htmlDeep Classifier Mimicry without Data AccessAccess to pre-trained models has recently emerged as a standard across numerous machine learning domains. Unfortunately, access to the original data the models were trained on may not equally be granted. This makes it tremendously challenging to fine-tune, compress models, adapt continually, or to do any other type of data-driven update. We posit that original data access may however not be required. Specifically, we propose Contrastive Abductive Knowledge Extraction (CAKE), a model-agnostic knowledge distillation procedure that mimics deep classifiers without access to the original data. To this end, CAKE generates pairs of noisy synthetic samples and diffuses them contrastively toward a model’s decision boundary. We empirically corroborate CAKE’s effectiveness using several benchmark datasets and various architectural choices, paving the way for broad application.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/braun24b.html
https://proceedings.mlr.press/v238/braun24b.htmlVEC-SBM: Optimal Community Detection with Vectorial Edges CovariatesSocial networks are often associated with rich side information, such as texts and images. While numerous methods have been developed to identify communities from pairwise interactions, they usually ignore such side information. In this work, we study an extension of the Stochastic Block Model (SBM), a widely used statistical framework for community detection, that integrates vectorial edges covariates: the Vectorial Edges Covariates Stochastic Block Model (VEC-SBM). We propose a novel algorithm based on iterative refinement techniques and show that it optimally recovers the latent communities under the VEC-SBM. Furthermore, we rigorously assess the added value of leveraging edge’s side information in the community detection process. We complement our theoretical results with numerical experiments on synthetic and semi-synthetic data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/braun24a.html
https://proceedings.mlr.press/v238/braun24a.htmlData Driven Threshold and Potential Initialization for Spiking Neural NetworksSpiking neural networks (SNNs) present an increasingly popular alternative to artificial neural networks (ANNs), due to their energy and time efficiency when deployed on neuromorphic hardware. However, due to their discrete and highly non-differentiable nature, training SNNs is a challenging task and remains an active area of research. Some of the most prominent ways to train SNNs are based on ANN-to-SNN conversion where an SNN model is initialized with parameters from the corresponding, pre-trained ANN model. SNN models trained through ANN-to-SNN conversion or hybrid training show state of the art performance among SNNs on many machine learning tasks, comparable to those of ANNs. However, the top performing models need high latency or tailored ANNs to perform well, and in general are not using the full information available from ANNs. In this work, we propose novel method to initialize SNN’s thresholds and initial membrane potential after ANN-to-SNN conversion, using distributions of ANN’s activation values. We provide a theoretical framework for feature distribution-based conversion error, providing theoretical results on optimal membrane initialization and thresholds which minimize this error, as well as a practical algorithm for finding these optimal values. We test our method, both as a stand-alone ANN-to-SNN conversion and in combination with other methods, and show state of the art results on high-dimensional datasets such as CIFAR10, CIFAR100 and ImageNet and various architectures. Our code is available at \url{https://github.com/srinuvaasu/data_driven_init}Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bojkovic24a.html
https://proceedings.mlr.press/v238/bojkovic24a.htmlOn the Vulnerability of Fairness Constrained Learning to Malicious NoiseWe consider the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data. [Konstantinov and Lampert, 2021] initiated the study of this question and presented negative results showing there exist data distributions where for several fairness constraints, any proper learner will exhibit high vulnerability when group sizes are imbalanced. Here, we present a more optimistic view, showing that if we allow randomized classifiers, then the landscape is much more nuanced. For example, for Demographic Parity we show we can incur only a $\Theta(\alpha)$ loss in accuracy, where $\alpha$ is the malicious noise rate, matching the best possible even without fairness constraints. For Equal Opportunity, we show we can incur an $O(\sqrt{\alpha})$ loss, and give a matching $\Omega(\sqrt{\alpha})$ lower bound. In contrast, [Konstantinov and Lampert, 2021] showed for proper learners the loss in accuracy for both notions is $\Omega(1)$. The key technical novelty of our work is how randomization can bypass simple "tricks" an adversary can use to amplify his power. We also consider additional fairness notions including Equalized Odds and Calibration. For these fairness notions, the excess accuracy clusters into three natural regimes $O(\alpha)$, $O(\sqrt{\alpha})$, and $O(1)$. These results provide a more fine-grained view of the sensitivity of fairness-constrained learning to adversarial noise in training data.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/blum24a.html
https://proceedings.mlr.press/v238/blum24a.htmlFederated Linear Contextual Bandits with Heterogeneous ClientsThe demand for collaborative and private bandit learning across multiple agents is surging due to the growing quantity of data generated from distributed systems. Federated bandit learning has emerged as a promising framework for private, efficient, and decentralized online learning. However, almost all previous works rely on strong assumptions of client homogeneity, i.e., all participating clients shall share the same bandit model; otherwise, they all would suffer linear regret. This greatly restricts the application of federated bandit learning in practice. In this work, we introduce a new approach for federated bandits for heterogeneous clients, which clusters clients for collaborative bandit learning under the federated learning setting. Our proposed algorithm achieves non-trivial sub-linear regret and communication cost for all clients, subject to the communication protocol under federated learning that at anytime only one model can be shared by the server.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/blaser24a.html
https://proceedings.mlr.press/v238/blaser24a.htmlautoMALA: Locally adaptive Metropolis-adjusted Langevin algorithmSelecting the step size for the Metropolis-adjusted Langevin algorithm (MALA) is necessary in order to obtain satisfactory performance. However, finding an adequate step size for an arbitrary target distribution can be a difficult task and even the best step size can perform poorly in specific regions of the space when the target distribution is sufficiently complex. To resolve this issue we introduce autoMALA, a new Markov chain Monte Carlo algorithm based on MALA that automatically sets its step size at each iteration based on the local geometry of the target distribution. We prove that autoMALA has the correct invariant distribution, despite continual automatic adjustments of the step size. Our experiments demonstrate that autoMALA is competitive with related state-of-the-art MCMC methods, in terms of the number of log density evaluations per effective sample, and it outperforms state-of-the-art samplers on targets with varying geometries. Furthermore, we find that autoMALA tends to find step sizes comparable to optimally-tuned MALA when a fixed step size suffices for the whole domain.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/biron-lattes24a.html
https://proceedings.mlr.press/v238/biron-lattes24a.htmlMeta Learning in Bandits within shared affine SubspacesWe study the problem of meta-learning several contextual stochastic bandits tasks by leveraging their concentration around a low dimensional affine subspace, which we learn via online principal component analysis to reduce the expected regret over the encountered bandits. We propose and theoretically analyze two strategies that solve the problem: One based on the principle of optimism in the face of uncertainty and the other via Thompson sampling. Our framework is generic and includes previously proposed approaches as special cases. Besides, the empirical results show that our methods significantly reduce the regret on several bandit tasks.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bilaj24a.html
https://proceedings.mlr.press/v238/bilaj24a.htmlMaking Better Use of Unlabelled Data in Bayesian Active LearningFully supervised models are predominant in Bayesian active learning. We argue that their neglect of the information present in unlabelled data harms not just predictive performance but also decisions about what data to acquire. Our proposed solution is a simple framework for semi-supervised Bayesian active learning. We find it produces better-performing models than either conventional Bayesian active learning or semi-supervised learning with randomly acquired data. It is also easier to scale up than the conventional approach. As well as supporting a shift towards semi-supervised models, our findings highlight the importance of studying models and acquisition methods in conjunction.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bickford-smith24a.html
https://proceedings.mlr.press/v238/bickford-smith24a.htmlClassifier Calibration with ROC-Regularized Isotonic RegressionCalibration of machine learning classifiers is necessary to obtain reliable and interpretable predictions, bridging the gap between model outputs and actual probabilities. One prominent technique, isotonic regression (IR), aims at calibrating binary classifiers by minimizing the cross entropy with respect to monotone transformations. IR acts as an adaptive binning procedure that is able to achieve a calibration error of zero but leaves open the issue of the effect on performance. We first prove that IR preserves the convex hull of the ROC curve—an essential performance metric for binary classifiers. This ensures that a classifier is calibrated while controlling for over-fitting of the calibration set. We then present a novel generalization of isotonic regression to accommodate classifiers with $K$-classes. Our method constructs a multidimensional adaptive binning scheme on the probability simplex, again achieving a multi-class calibration error equal to zero. We regularize this algorithm by imposing a form of monotony that preserves the $K$-dimensional ROC surface of the classifier. We show empirically that this general monotony criterion is effective in striking a balance between reducing cross entropy loss and avoiding over-fitting of the calibration set.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/berta24a.html
https://proceedings.mlr.press/v238/berta24a.htmlLearning Extensive-Form Perfect Equilibria in Two-Player Zero-Sum Sequential GamesDesigning efficient algorithms for computing refinements of the Nash equilibrium (NE) in two-player zero-sum sequential games is of paramount importance, since the NE may prescribe sub-optimal actions off the equilibrium path. The extensive-form perfect equilibrium (EFPE) amends such a weakness by accounting for the possibility that players may make mistakes. This is crucial in the real world, which involves humans with bounded rationality, and it is also key in boosting superhuman agents for games like Poker. Nevertheless, there are only few algorithms for computing NE refinements, which either lack convergence guarantees to exact equilibria or do not scale to large games. We provide the first efficient iterative algorithm that provably converges to an EFPE in two-player zero-sum sequential games. Our algorithm works by tracking a sequence of equilibria of regularized-perturbed games, by using a procedure that is specifically tailored to converge last iterate to such equilibria. The procedure can be implemented efficiently by visiting the game tree, making our method computationally appealing. We also empirically evaluate our algorithm, showing that its strategies are much more robust to players’ mistakes than those of state-of-the-art algorithms.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bernasconi24a.html
https://proceedings.mlr.press/v238/bernasconi24a.htmlIdentifying Copeland Winners in Dueling Bandits with IndifferencesWe consider the task of identifying the Copeland winner(s) in a dueling bandits problem with ternary feedback. This is an underexplored but practically relevant variant of the conventional dueling bandits problem, in which, in addition to strict preference between two arms, one may observe feedback in the form of an indifference. We provide a lower bound on the sample complexity for any learning algorithm finding the Copeland winner(s) with a fixed error probability. Moreover, we propose POCOWISTA, an algorithm with a sample complexity that almost matches this lower bound, and which shows excellent empirical performance, even for the conventional dueling bandits problem. For the case where the preference probabilities satisfy a specific type of stochastic transitivity, we provide a refined version with an improved worst case sample complexity.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bengs24a.html
https://proceedings.mlr.press/v238/bengs24a.htmlMMD-based Variable Importance for Distributional Random ForestDistributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/benard24a.html
https://proceedings.mlr.press/v238/benard24a.htmlMulti-armed bandits with guaranteed revenue per armWe consider a Multi-Armed Bandit problem with covering constraints, where the primary goal is to ensure that each arm receives a minimum expected reward while maximizing the total cumulative reward. In this scenario, the optimal policy then belongs to some unknown feasible set. Unlike much of the existing literature, we do not assume the presence of a safe policy or a feasibility margin, which hinders the exclusive use of conservative approaches. Consequently, we propose and analyze an algorithm that switches between pessimism and optimism in the face of uncertainty. We prove both precise problem-dependent and problem-independent bounds, demonstrating that our algorithm achieves the best of the two approaches—depending on the presence or absence of a feasibility margin—in terms of constraint violation guarantees. Furthermore, our results indicate that playing greedily on the constraints actually outperforms pessimism when considering long-term violations rather than violations on a per-round basis.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/baudry24a.html
https://proceedings.mlr.press/v238/baudry24a.htmlTight Verification of Probabilistic Robustness in Bayesian Neural NetworksWe introduce two algorithms for computing tight guarantees on the probabilistic robustness of Bayesian Neural Networks (BNNs). Computing robustness guarantees for BNNs is a significantly more challenging task than verifying the robustness of standard Neural Networks (NNs) because it requires searching the parameters’ space for safe weights. Moreover, tight and complete approaches for the verification of standard NNs, such as those based on Mixed-Integer Linear Programming (MILP), cannot be directly used for the verification of BNNs because of the polynomial terms resulting from the consecutive multiplication of variables encoding the weights. Our algorithms efficiently and effectively search the parameters’ space for safe weights by using iterative expansion and the network’s gradient and can be used with any verification algorithm of choice for BNNs. In addition to proving that our algorithms compute tighter bounds than the SoA, we also evaluate our algorithms against the SoA on standard benchmarks, such as MNIST and CIFAR10, showing that our algorithms compute bounds up to 40% tighter than the SoA.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/batten24a.html
https://proceedings.mlr.press/v238/batten24a.htmlDissimilarity BanditsWe study a novel sequential decision-making setting, namely the dissimilarity bandits. At each round, the learner pulls an arm that provides a stochastic d-dimensional observation vector. The learner aims to identify the pair of arms with the maximum dissimilarity, where such an index is computed over pairs of expected observation vectors. We propose Successive Elimination for Dissimilarity (SED), a fixed-confidence best-pair identification algorithm based on sequential elimination. SED discards individual arms when there is statistical evidence that they cannot belong to a pair of most dissimilar arms and, thus, effectively exploits the structure of the setting by reusing the estimates of the expected observation vectors. We provide results on the sample complexity of SED, depending on {HP}, a novel index characterizing the complexity of identifying the pair of the most dissimilar arms. Then, we provide a sample complexity lower bound, highlighting the challenges of the identification problem for dissimilarity bandits, which is almost matched by our SED. Finally, we compare our approach over synthetically generated data and a realistic environmental monitoring domain against classical and combinatorial best-arm identification algorithms for the cases $d=1$ and $d>1$.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/battellani24a.html
https://proceedings.mlr.press/v238/battellani24a.htmlRevisiting the Noise Model of Stochastic Gradient DescentThe effectiveness of stochastic gradient descent (SGD) in neural network optimization is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but recently Simsekli et al (2019) demonstrated that the $S\alpha S$ Lévy distribution provides a better fit for the SGN. This assertion was purportedly debunked and rebounded to the Gaussian noise model that had been previously proposed. This study provides robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the $S\alpha S$ distribution. Our experiments include several datasets and multiple models, both discriminative and generative. Furthermore, we argue that different network parameters preserve distinct SGN properties. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes describe each parameter. This leads to a more accurate characterization of the dynamics of SGD around local minima. We use our framework to study SGD properties near local minima; these include the mean escape time and preferable exit directions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/battash24a.html
https://proceedings.mlr.press/v238/battash24a.htmlA Scalable Algorithm for Individually Fair k-Means ClusteringWe present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $O(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bateni24a.html
https://proceedings.mlr.press/v238/bateni24a.htmlUncertainty Matters: Stable Conclusions under Unstable Assessment of Fairness ResultsRecent studies highlight the effectiveness of Bayesian methods in assessing algorithm performance, particularly in fairness and bias evaluation. We present Uncertainty Matters, a multi-objective uncertainty-aware algorithmic comparison framework. In fairness focused scenarios, it models sensitive group confusion matrices using Bayesian updates and facilitates joint comparison of performance (e.g., accuracy) and fairness metrics (e.g., true positive rate parity). Our approach works seamlessly with common evaluation methods like K-fold cross-validation, effectively addressing dependencies among the K posterior metric distributions. The integration of correlated information is carried out through a procedure tailored to the classifier’s complexity. Experiments demonstrate that the insights derived from algorithmic comparisons employing the Uncertainty Matters approach are more informative, reliable, and less influenced by particular data partitions. Code for the paper is publicly available at \url{https://github.com/abarrainkua/UncertaintyMatters}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/barrainkua24a.html
https://proceedings.mlr.press/v238/barrainkua24a.htmlBOBA: Byzantine-Robust Federated Learning with Label SkewnessIn federated learning, most existing robust aggregation rules (AGRs) combat Byzantine attacks in the IID setting, where client data is assumed to be independent and identically distributed. In this paper, we address label skewness, a more realistic and challenging non-IID setting, where each client only has access to a few classes of data. In this setting, state-of-the-art AGRs suffer from selection bias, leading to significant performance drop for particular classes; they are also more vulnerable to Byzantine attacks due to the increased variation among gradients of honest clients. To address these limitations, we propose an efficient two-stage method named BOBA. Theoretically, we prove the convergence of BOBA with an error of the optimal order. Our empirical evaluations demonstrate BOBA’s superior unbiasedness and robustness across diverse models and datasets when compared to various baselines.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bao24a.html
https://proceedings.mlr.press/v238/bao24a.htmlLearning Under Random Distributional ShiftsMany existing approaches for generating predictions in settings with distribution shift model distribution shifts as adversarial or low-rank in suitable representations. In various real-world settings, however, we might expect shifts to arise through the superposition of many small and random changes in the population and environment. Thus, we consider a class of random distribution shift models that capture arbitrary changes in the underlying covariate space, and dense, random shocks to the relationship between the covariates and the outcomes. In this setting, we characterize the benefits and drawbacks of several alternative prediction strategies: the standard approach that directly predicts the long-term outcomes of interest, the proxy approach that directly predicts shorter-term proxy outcomes, and a hybrid approach that utilizes both the long-term policy outcome and (shorter-term) proxy outcome(s). We show that the hybrid approach is robust to the strength of the distribution shift and the proxy relationship. We apply this method to datasets in two high-impact domains: asylum-seeker placement and early childhood education. In both settings, we find that the proposed approach results in substantially lower mean-squared error than current approaches.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bansak24a.html
https://proceedings.mlr.press/v238/bansak24a.htmlMonotone Operator Theory-Inspired Message Passing for Learning Long-Range Interaction on GraphsLearning long-range interactions (LRI) between distant nodes is crucial for many graph learning tasks. Predominant graph neural networks (GNNs) rely on local message passing and struggle to learn LRI. In this paper, we propose DRGNN to learn LRI leveraging monotone operator theory. DRGNN contains two key components: (1) we use a full node similarity matrix beyond adjacency matrix – drawing inspiration from the personalized PageRank matrix – as the aggregation matrix for message passing, and (2) we implement message-passing on graphs using Douglas-Rachford splitting to circumvent prohibitive matrix inversion. We demonstrate that DRGNN surpasses various advanced GNNs, including Transformer-based models, on several benchmark LRI learning tasks arising from different application domains, highlighting its efficacy in learning LRI. Code is available at \url{https://github.com/Utah-Math-Data-Science/PR-inspired-aggregation}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/baker24a.html
https://proceedings.mlr.press/v238/baker24a.htmlAutoregressive BanditsAutoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for guaranteeing convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\tilde{O} ( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2} )$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically validate our algorithm, illustrating its advantages w.r.t. bandit baselines and its robustness to misspecification of key parameters.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/bacchiocchi24a.html
https://proceedings.mlr.press/v238/bacchiocchi24a.htmlLearning multivariate temporal point processes via the time-change theoremMarked temporal point processes (TPPs) are a class of stochastic processes that describe the occurrence of a countable number of marked events over continuous time. In machine learning, the most common representation of marked TPPs is the univariate TPP coupled with a conditional mark distribution. Alternatively, we can represent marked TPPs as a multivariate temporal point process in which we model each sequence of marks interdependently. We introduce a learning framework for multivariate TPPs leveraging recent progress on learning univariate TPPs via time-change theorems to propose a deep-learning, invertible model for the conditional intensity. We rely neither on Monte Carlo approximation for the compensator nor on thinning for sampling. Therefore, we have a generative model that can efficiently sample the next event given a history of past events. Our models show strong alignment between the percentiles of the distribution expected from theory and the empirical ones.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/augusto-zagatti24a.html
https://proceedings.mlr.press/v238/augusto-zagatti24a.htmlApproximate Leave-one-out Cross Validation for Regression with $\ell_1$ RegularizersThe out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding method to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with twice differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO’s error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with the non-differentiable $\ell_1$ regularizer. We bound the error \(|{\rm ALO}-{\rm LO}|\){in} terms of intuitive metrics such as the size of leave-\(i\)-out perturbations in active sets, sample size $n$, number of features $p$ and signal-to-noise ratio (SNR). As a consequence, for the $\ell_1$ regularized problems, we show that $|{\rm ALO}-{\rm LO}| \stackrel{p\rightarrow \infty}{\longrightarrow} 0$ while $n/p$ and SNR remain bounded.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/auddy24a.html
https://proceedings.mlr.press/v238/auddy24a.htmlOnline Bilevel Optimization: Regret Analysis of Online Alternating Gradient MethodsThis paper introduces \textit{online bilevel optimization} in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we provide new notions of \textit{bilevel regret}, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and give regret bounds in terms of the path-length of the inner and outer minimizer sequences.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ataee-tarzanagh24a.html
https://proceedings.mlr.press/v238/ataee-tarzanagh24a.htmlLearning to Solve the Constrained Most Probable Explanation Task in Probabilistic Graphical ModelsWe propose a self-supervised learning approach for solving the following constrained optimization task in log-linear models or Markov networks. Let $f$ and $g$ be two log-linear models defined over the sets $X$ and $Y$ of random variables. Given an assignment $x$ to all variables in $X$ (evidence or observations) and a real number $q$, the constrained most-probable explanation (CMPE) task seeks to find an assignment $y$ to all variables in $Y$ such that $f(x, y)$ is maximized and $g(x, y) \leq q$. In our proposed self-supervised approach, given assignments $x$ to $X$ (data), we train a deep neural network that learns to output near-optimal solutions to the CMPE problem without requiring access to any pre-computed solutions. The key idea in our approach is to use first principles and approximate inference methods for CMPE to derive novel loss functions that seek to push infeasible solutions towards feasible ones and feasible solutions towards optimal ones. We analyze the properties of our proposed method and experimentally demonstrate its efficacy on several benchmark problems.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/arya24b.html
https://proceedings.mlr.press/v238/arya24b.htmlDeep Dependency Networks and Advanced Inference Schemes for Multi-Label ClassificationWe present a unified framework called deep dependency networks (DDNs) that combines dependency networks and deep learning architectures for multi-label classification, with a particular emphasis on image and video data. The primary advantage of dependency networks is their ease of training, in contrast to other probabilistic graphical models like Markov networks. In particular, when combined with deep learning architectures, they provide an intuitive, easy-to-use loss function for multi-label classification. A drawback of DDNs compared to Markov networks is their lack of advanced inference schemes, necessitating the use of Gibbs sampling. To address this challenge, we propose novel inference schemes based on local search and integer linear programming for computing the most likely assignment to the labels given observations. We evaluate our novel methods on three video datasets (Charades, TACoS, Wetlab) and three image datasets (MS-COCO, PASCAL VOC, NUS-WIDE), comparing their performance with (a) basic neural architectures and (b) neural architectures combined with Markov networks equipped with advanced inference and learning techniques. Our results demonstrate the superiority of our new DDN methods over the two competing approaches.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/arya24a.html
https://proceedings.mlr.press/v238/arya24a.htmlE(3)-Equivariant Mesh Neural NetworksTriangular meshes are widely used to represent three-dimensional objects. As a result, many recent works have addressed the need for geometric deep learning on 3D meshes. However, we observe that the complexities in many of these architectures do not translate to practical performance, and simple deep models for geometric graphs are competitive in practice. Motivated by this observation, we minimally extend the update equations of E(n)-Equivariant Graph Neural Networks (EGNNs) (Satorras et al., 2021) to incorporate mesh face information and further improve it to account for long-range interactions through a hierarchy. The resulting architecture, Equivariant Mesh Neural Network (EMNN), outperforms other, more complicated equivariant methods on mesh tasks, with a fast run-time and no expensive preprocessing. Our implementation is available at \url{https://github.com/HySonLab/EquiMesh}.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/anh-trang24a.html
https://proceedings.mlr.press/v238/anh-trang24a.htmlGmGM: a fast multi-axis Gaussian graphical modelThis paper introduces the Gaussian multi-Graphical Model, a model to construct sparse graph representations of matrix- and tensor-variate data. We generalize prior work in this area by simultaneously learning this representation across several tensors that share axes, which is necessary to allow the analysis of multimodal datasets such as those encountered in multi-omics. Our algorithm uses only a single eigendecomposition per axis, achieving an order of magnitude speedup over prior work in the ungeneralized case. This allows the use of our methodology on large multi-modal datasets such as single-cell multi-omics data, which was challenging with previous approaches. We validate our model on synthetic data and five real-world datasets.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/andrew24a.html
https://proceedings.mlr.press/v238/andrew24a.htmlDelegating Data Collection in Decentralized Machine LearningMotivated by the emergence of decentralized machine learning (ML) ecosystems, we study the delegation of data collection. Taking the field of contract theory as our starting point, we design optimal and near-optimal contracts that deal with two fundamental information asymmetries that arise in decentralized ML: uncertainty in the assessment of model quality and uncertainty regarding the optimal performance of any model. We show that a principal can cope with such asymmetry via simple linear contracts that achieve $1-1/\epsilon$ fraction of the optimal utility. To address the lack of a priori knowledge regarding the optimal performance, we give a convex program that can adaptively and efficiently compute the optimal contract. We also analyze the optimal utility and linear contracts for the more complex setting of multiple interactions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ananthakrishnan24a.html
https://proceedings.mlr.press/v238/ananthakrishnan24a.htmlRecovery Guarantees for Distributed-OMPWe study distributed schemes for high-dimensional sparse linear regression, based on orthogonal matching pursuit (OMP). Such schemes are particularly suited for settings where a central fusion center is connected to end machines, that have both computation and communication limitations. We prove that under suitable assumptions, distributed-OMP schemes recover the support of the regression vector with communication per machine linear in its sparsity and logarithmic in the dimension. Remarkably, this holds even at low signal-to-noise-ratios, where individual machines are unable to detect the support. Our simulations show that distributed-OMP schemes are competitive with more computationally intensive methods, and in some cases even outperform them.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/amiraz24a.html
https://proceedings.mlr.press/v238/amiraz24a.htmlFair k-center Clustering with OutliersThe importance of dealing with big data is further increasing, as machine learning (ML) systems obtain useful knowledge from big datasets. However, using all data is practically prohibitive because of the massive sizes of the datasets, so summarizing them by centers obtained from k-center clustering is a promising approach. We have two concerns here. One is fairness, because if the summary does not have some specific groups, subsequent applications may provide unfair results for the groups. The other is the presence of outliers, and if outliers dominate the summary, it cannot be useful. To overcome these concerns, we address the problem of fair k-center clustering with outliers. Although prior works studied the fair k-center clustering problem, they do not consider outliers. This paper yields a linear time algorithm that satisfies the fairness constraint of our problem and probabilistically guarantees the almost 3-approximation bound. Its empirical efficiency and effectiveness are also reported.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/amagata24a.html
https://proceedings.mlr.press/v238/amagata24a.htmlRobust Sparse VotingMany applications, such as content moderation and recommendation, require reviewing and scoring a large number of alternatives. Doing so robustly is however very challenging. Indeed, voters’ inputs are inevitably sparse: most alternatives are only scored by a small fraction of voters. This sparsity amplifies the effects of biased voters introducing unfairness, and of malicious voters seeking to hack the voting process by reporting dishonest scores. We give a precise definition of the problem of robust sparse voting, highlight its underlying technical challenges, and present a novel voting mechanism addressing the problem. We prove that, using this mechanism, no voter can have more than a small parameterizable effect on each alternative’s score; a property we call Lipschitz resilience. We also identify conditions of voters comparability under which any unanimous preferences can be recovered, even when each voter provides sparse scores, on a scale that is potentially very different from any other voter’s score scale. Proving these properties required us to introduce, analyze and carefully compose novel aggregation primitives which could be of independent interest.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/allouah24a.html
https://proceedings.mlr.press/v238/allouah24a.htmlPessimistic Off-Policy Multi-Objective OptimizationMulti-objective optimization is a class of optimization problems with multiple conflicting objectives. We study offline optimization of multi-objective policies from data collected by a previously deployed policy. We propose a pessimistic estimator for policy values that can be easily plugged into existing formulas for hypervolume computation and optimized. The estimator is based on inverse propensity scores (IPS), and improves upon a naive IPS estimator in both theory and experiments. Our analysis is general, and applies beyond our IPS estimators and methods for optimizing them.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/alizadeh24a.html
https://proceedings.mlr.press/v238/alizadeh24a.htmlComplexity of Single Loop Algorithms for Nonlinear Programming with Stochastic Objective and ConstraintsWe analyze the sample complexity of single-loop quadratic penalty and augmented Lagrangian algorithms for solving nonconvex optimization problems with functional equality constraints. We consider three cases, in all of which the objective is stochastic, that is, an expectation over an unknown distribution that is accessed by sampling. The nature of the equality constraints differs among the three cases: deterministic and linear in the first case, deterministic and nonlinear in the second case, and stochastic and nonlinear in the third case. Variance reduction techniques are used to improve the complexity. To find a point that satisfies $\varepsilon$-approximate first-order conditions, we require $\widetilde{O}(\varepsilon^{-3})$ complexity in the first case, $\widetilde{O}(\varepsilon^{-4})$ in the second case, and $\widetilde{O}(\varepsilon^{-5})$ in the third case. For the first and third cases, they are the first algorithms of “single loop” type that also use $O(1)$ samples at each iteration and still achieve the best-known complexity guarantees.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/alacaoglu24a.html
https://proceedings.mlr.press/v238/alacaoglu24a.htmlMulti-Domain Causal Representation Learning via Weak Distributional InvariancesCausal representation learning has emerged as the center of action in causal machine learning research. In particular, multi-domain datasets present a natural opportunity for showcasing the advantages of causal representation learning over standard unsupervised representation learning. While recent works have taken crucial steps towards learning causal representations, they often lack applicability to multi-domain datasets due to over-simplifying assumptions about the data; e.g. each domain comes from a different single-node perfect intervention. In this work, we relax these assumptions and capitalize on the following observation: there often exists a subset of latents whose certain distributional properties (e.g., support, variance) remain stable across domains; this property holds when, for example, each domain comes from a multi-node imperfect intervention. Leveraging this observation, we show that autoencoders that incorporate such invariances can provably identify the stable set of latents from the rest across different settings.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ahuja24a.html
https://proceedings.mlr.press/v238/ahuja24a.htmlPrivacy-Preserving Decentralized Actor-Critic for Cooperative Multi-Agent Reinforcement LearningMulti-agent reinforcement learning has a wide range of applications in cooperative settings, but ensuring data privacy among agents is a significant challenge. To address this challenge, we propose Privacy-Preserving Decentralized Actor-Critic (PPDAC), an algorithm that motivates agents to cooperate while maintaining their data privacy. Leveraging trajectory ranking, PPDAC enables the agents to learn a cooperation reward that encourages agents to account for other agents’ preferences. Subsequently, each agent trains a policy that maximizes not only its local reward as in independent actor-critic (IAC) but also the cooperation reward, hence, increasing cooperation. Importantly, communication among agents is restricted to their ranking of trajectories that only include public identifiers without any private local data. Moreover, as an additional layer of privacy, the agents can perturb their rankings with the randomized response method. We evaluate PPDAC on the level-based foraging (LBF) environment and a coin-gathering environment. We compare with IAC and Shared Experience Actor-Critic (SEAC) which achieves SOTA results for the LBF environment. The results show that PPDAC consistently outperforms IAC. In addition, PPDAC outperforms SEAC in the coin-gathering environment and achieves similar performance in the LBF environment, all while providing better privacy.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ahmed24a.html
https://proceedings.mlr.press/v238/ahmed24a.htmlUnsupervised Novelty Detection in Pretrained Representation Space with Locally Adapted Likelihood RatioDetecting novelties given unlabeled examples of normal data is a challenging task in machine learning, particularly when the novel and normal categories are semantically close. Large deep models pretrained on massive datasets can provide a rich representation space in which the simple k-nearest neighbor distance works as a novelty measure. However, as we show in this paper, the basic k-NN method might be insufficient in this context due to ignoring the ’local geometry’ of the distribution over representations as well as the impact of irrelevant ’background features’. To address this, we propose a fully unsupervised novelty detection approach that integrates the flexibility of k-NN with a locally adapted scaling of dimensions based on the ’neighbors of nearest neighbor’ and computing a ’likelihood ratio’ in pretrained (self-supervised) representation spaces. Our experiments with image data show the advantage of this method when off-the-shelf vision transformers (e.g., pretrained by DINO) are used as the feature extractor without any fine-tuning.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ahmadian24a.html
https://proceedings.mlr.press/v238/ahmadian24a.htmlAgnostic Multi-Robust Learning using ERMA fundamental problem in robust learning is asymmetry: a learner needs to correctly classify every one of exponentially-many perturbations that an adversary might make to a test-time natural example. In contrast, the attacker only needs to find one successful perturbation. Xiang et al.[2022] proposed an algorithm that in the context of patch attacks for image classification, reduces the effective number of perturbations from an exponential to a polynomial number of perturbations and learns using an ERM oracle. However, to achieve its guarantee, their algorithm requires the natural examples to be robustly realizable. This prompts the natural question; can we extend their approach to the non-robustly-realizable case where there is no classifier with zero robust error? Our first contribution is to answer this question affirmatively by reducing this problem to a setting in which an algorithm proposed by Feige et al. [2015] can be applied, and in the process extend their guarantees. Next, we extend our results to a multi-group setting and introduce a novel agnostic multi-robust learning problem where the goal is to learn a predictor that achieves low robust loss on a (potentially) rich collection of subgroups.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/ahmadi24a.html
https://proceedings.mlr.press/v238/ahmadi24a.htmlSharp error bounds for imbalanced classification: how many examples in the minority class?When dealing with imbalanced classification data, reweighting the loss function is a standard procedure allowing to equilibrate between the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge within the imbalanced classification framework, which is the negligible size of one class in relation to the full sample size and the need to rescale the risk function by a probability tending to zero. To address this gap, we present two novel contributions in the setting where the rare class probability approaches zero: (1) a non asymptotic fast rate probability bound for constrained balanced empirical risk minimization, and (2) a consistent upper bound for balanced nearest neighbors estimates. Our findings provide a clearer understanding of the benefits of class-weighting in realistic settings, opening new avenues for further research in this field.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/aghbalou24a.html
https://proceedings.mlr.press/v238/aghbalou24a.htmlStochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian SamplingMotivated by applications in large-scale and multi-agent reinforcement learning, we study the non-asymptotic performance of stochastic approximation (SA) schemes with delayed updates under Markovian sampling. While the effect of delays has been extensively studied for optimization, the manner in which they interact with the underlying Markov process to shape the finite-time performance of SA remains poorly understood. In this context, our first main contribution is to show that under time-varying bounded delays, the delayed SA update rule guarantees exponentially fast convergence of the \emph{last iterate} to a ball around the SA operator’s fixed point. Notably, our bound is \emph{tight} in its dependence on both the maximum delay $\tau_{max}$, and the mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel inductive proof technique that, unlike various existing delayed-optimization analyses, relies on establishing uniform boundedness of the iterates. As such, our proof may be of independent interest. Next, to mitigate the impact of the maximum delay on the convergence rate, we provide the first finite-time analysis of a delay-adaptive SA scheme under Markovian sampling. In particular, we show that the exponent of convergence of this scheme gets scaled down by $\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here, $\tau_{avg}$ denotes the average delay across all iterations. Moreover, the adaptive scheme requires no prior knowledge of the delay sequence for step-size tuning. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms, including TD learning, Q-learning, and stochastic gradient descent under Markovian sampling.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/adibi24a.html
https://proceedings.mlr.press/v238/adibi24a.htmlAdaptive Batch Sizes for Active Learning: A Probabilistic Numerics ApproachActive learning parallelization is widely used, but typically relies on fixing the batch size throughout experimentation. This fixed approach is inefficient because of a dynamic trade-off between cost and speed—larger batches are more costly, smaller batches lead to slower wall-clock run-times—and the trade-off may change over the run (larger batches are often preferable earlier). To address this trade-off, we propose a novel Probabilistic Numerics framework that adaptively changes batch sizes. By framing batch selection as a quadrature task, our integration-error-aware algorithm facilitates the automatic tuning of batch sizes to meet predefined quadrature precision objectives, akin to how typical optimizers terminate based on convergence thresholds. This approach obviates the necessity for exhaustive searches across all potential batch sizes. We also extend this to scenarios with constrained active learning and constrained optimization, interpreting constraint violations as reductions in the precision requirement, to subsequently adapt batch construction. Through extensive experiments, we demonstrate that our approach significantly enhances learning efficiency and flexibility in diverse Bayesian batch active learning and Bayesian optimization applications.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/adachi24b.html
https://proceedings.mlr.press/v238/adachi24b.htmlLooping in the Human: Collaborative and Explainable Bayesian OptimizationLike many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO’s efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods. Code is available https://github.com/ma921/CoExBO.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/adachi24a.html
https://proceedings.mlr.press/v238/adachi24a.htmlMultitask Online Learning: Listen to the Neighborhood BuzzWe study multitask online learning in a setting where agents can only exchange information with their neighbors on an arbitrary communication network. We introduce MT-CO\textsubscript{2}OL, a decentralized algorithm for this setting whose regret depends on the interplay between the task similarities and the network structure. Our analysis shows that the regret of MT-CO\textsubscript{2}OL is never worse (up to constants) than the bound obtained when agents do not share information. On the other hand, our bounds significantly improve when neighboring agents operate on similar tasks. In addition, we prove that our algorithm can be made differentially private with a negligible impact on the regret. Finally, we provide experimental support for our theory.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/achddou24a.html
https://proceedings.mlr.press/v238/achddou24a.htmlImposing Fairness Constraints in Synthetic Data GenerationIn several real-world applications (e.g., online advertising, item recommendations, etc.) it may not be possible to release and share the real dataset due to privacy concerns. As a result, synthetic data generation (SDG) has emerged as a promising solution for data sharing. While the main goal of private SDG is to create a dataset that preserves the privacy of individuals contributing to the dataset, the use of synthetic data also creates an opportunity to improve fairness. Since there often exist historical biases in the datasets, using the original real data for training can lead to an unfair model. Using synthetic data, we can attempt to remove such biases from the dataset before releasing the data. In this work, we formalize the definition of fairness in synthetic data generation and provide a general framework to achieve fairness. Then we consider two notions of counterfactual fairness and information filtering fairness and show how our framework can be used for these definitions.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/abroshan24a.html
https://proceedings.mlr.press/v238/abroshan24a.htmlLexicographic Optimization: Algorithms and StabilityA lexicographic maximum of a set $X \subseteq R^n$ is a vector in $X$ whose smallest component is as large as possible, and subject to that requirement, whose second smallest component is as large as possible, and so on for the third smallest component, etc. Lexicographic maximization has numerous practical and theoretical applications, including fair resource allocation, analyzing the implicit regularization of learning algorithms, and characterizing refinements of game-theoretic equilibria. We prove that a minimizer in $X$ of the exponential loss function $L_c(x) = \sum_i \exp(-c x_i)$ converges to a lexicographic maximum of $X$ as $c \to \infty$, provided that $X$ is \emph{stable} in the sense that a well-known iterative method for finding a lexicographic maximum of $X$ cannot be made to fail simply by reducing the required quality of each iterate by an arbitrarily tiny degree. Our result holds for both near and exact minimizers of the exponential loss, while earlier convergence results made much stronger assumptions about the set $X$ and only held for the exact minimizer. We are aware of no previous results showing a connection between the iterative method for computing a lexicographic maximum and exponential loss minimization. We show that every convex polytope is stable, but that there exist compact, convex sets that are not stable. We also provide the first analysis of the convergence rate of an exponential loss minimizer (near or exact) and discover a curious dichotomy: While the two smallest components of the vector converge to the lexicographically maximum values very quickly (at roughly the rate $\frac{\log n}{c}$), all other components can converge arbitrarily slowly.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/abernethy24a.html
https://proceedings.mlr.press/v238/abernethy24a.htmlOn the Nyström Approximation for Preconditioning in Kernel MachinesKernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed-up the convergence of such iterative algorithms for training kernel models. However computing and storing a spectral preconditioner can be expensive which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nyström-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/abedsoltan24a.html
https://proceedings.mlr.press/v238/abedsoltan24a.htmlEnhancing In-context Learning via Linear Probe CalibrationIn-context learning (ICL) is a new paradigm for natural language processing that utilizes Generative Pre-trained Transformer (GPT)-like models. This approach uses prompts that include in-context demonstrations to generate the corresponding output for a new query input. However, applying ICL in real cases does not scale with the number of samples, and lacks robustness to different prompt templates and demonstration permutations. In this paper, we first show that GPT-like models using ICL result in unreliable predictions based on a new metric based on Shannon entropy. Then, to solve this problem, we propose a new technique called the Linear Probe Calibration (LinC), a method that calibrates the model’s output probabilities, resulting in reliable predictions and improved performance, while requiring only minimal additional samples (as few as five labeled data samples). LinC significantly enhances the ICL test performance of GPT models on various benchmark datasets, with an average improvement of up to 21%, and up to a 50% improvement in some cases, and significantly boosts the performance of PEFT methods, especially in the low resource regime. Moreover, LinC achieves lower expected calibration error, and is highly robust to varying label proportions, prompt templates, and demonstration permutations.Thu, 18 Apr 2024 00:00:00 +0000
https://proceedings.mlr.press/v238/abbas24a.html
https://proceedings.mlr.press/v238/abbas24a.html