Proceedings of Machine Learning Research
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
Held in Palau de Congressos, Valencia, Spain on 25-27 April 2023
Published as Volume 206 by the Proceedings of Machine Learning Research on 11 April 2023.
Volume Edited by:
Francisco Ruiz
Jennifer Dy
Jan-Willem van de Meent
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/v206/

Nonstochastic Contextual Combinatorial Bandits
We study a contextual version of online combinatorial optimisation with full and semi-bandit feedback. In this sequential decision-making problem, an online learner has to select an action from a combinatorial decision space after seeing a vector-valued context in each round. As a result of its action, the learner incurs a loss that is a bilinear function of the context vector and the vector representation of the chosen action. We consider two natural versions of the problem: semi-bandit, where the losses are revealed for each component appearing in the learner's combinatorial action, and full-bandit, where only the total loss is observed. We design computationally efficient algorithms based on a new loss estimator that takes advantage of the special structure of the problem, and show regret bounds of order $\sqrt{T}$ with respect to the time horizon. The bounds demonstrate polynomial scaling with the relevant problem parameters, which is shown to be nearly optimal. The theoretical results are complemented by a set of experiments on simulated data.
https://proceedings.mlr.press/v206/zierahn23a.html

Likelihood-Based Generative Radiance Field with Latent Space Energy-Based Model for 3D-Aware Disentangled Image Representation
We propose the NeRF-LEBM, a likelihood-based top-down 3D-aware 2D image generative model that incorporates 3D representation via Neural Radiance Fields (NeRF) and the 2D imaging process via differentiable volume rendering. The model represents an image as a rendering process from a 3D object to a 2D image and is conditioned on latent variables that account for object characteristics and are assumed to follow informative trainable energy-based prior models. We propose two likelihood-based learning frameworks to train the NeRF-LEBM: (i) maximum likelihood estimation with Markov chain Monte Carlo-based inference and (ii) variational inference with the reparameterization trick. We study our models in scenarios with both known and unknown camera poses. Experiments on several benchmark datasets demonstrate that the NeRF-LEBM can infer 3D object structures from 2D images, generate 2D images with novel views and objects, learn from incomplete 2D images, and learn from 2D images with known or unknown camera poses.
https://proceedings.mlr.press/v206/zhu23d.html

Provably Efficient Reinforcement Learning via Surprise Bound
Value function approximation is important in modern reinforcement learning (RL) problems, especially when the state space is (infinitely) large. Despite the importance and wide applicability of value function approximation, its theoretical understanding is still not as sophisticated as its empirical success, especially in the context of general function approximation. In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximation. We show that if the value functions can be approximated by a function class $\mathcal{F}$ that satisfies the Bellman-completeness assumption, our algorithm achieves an $\widetilde{O}(\mathrm{poly}(\iota H)\sqrt{T})$ regret bound, where $\iota$ is the product of the surprise bound and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $T = HK$ is the total number of steps the agent interacts with the environment. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting. Moreover, our algorithm only needs to solve $O(H\log K)$ empirical risk minimization (ERM) problems, which is far more efficient than previous algorithms that need to solve ERM problems $\Omega(HK)$ times.
https://proceedings.mlr.press/v206/zhu23c.html

Byzantine-Robust Federated Learning with Optimal Statistical Rates
We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates based on recent progress in high-dimensional robust statistics. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a near-optimal statistical rate for strongly convex losses. We also provide a statistical lower bound for the problem. In experiments, we benchmark against competing protocols and show the empirical superiority of the proposed protocols.
https://proceedings.mlr.press/v206/zhu23b.html

Weather2K: A Multivariate Spatio-Temporal Benchmark Dataset for Meteorological Forecasting Based on Real-Time Observation Data from Ground Weather Stations
Weather forecasting is one of the cornerstones of meteorological work. In this paper, we present a new benchmark dataset named Weather2K, which aims to make up for the deficiencies of existing weather forecasting datasets in terms of real-time availability, reliability, and diversity, as well as the key bottleneck of data quality. Specifically, Weather2K is featured in the following aspects: 1) Reliable and real-time data. The data is hourly collected from 2,130 ground weather stations covering an area of 6 million square kilometers. 2) Multivariate meteorological variables. 20 meteorological factors and 3 constants for position information are provided with a length of 40,896 time steps. 3) Applicability to diverse tasks. We conduct a set of baseline tests on time series forecasting and spatio-temporal forecasting. To the best of our knowledge, Weather2K is the first attempt to tackle the weather forecasting task by taking full advantage of the strengths of observation data from ground weather stations. Based on Weather2K, we further propose the Meteorological Factors based Multi-Graph Convolution Network (MFMGCN), which can effectively construct the intrinsic correlation among geographic locations based on meteorological factors. Extensive experiments show that MFMGCN improves both the forecasting performance and temporal robustness. We hope Weather2K can motivate researchers to develop efficient and accurate algorithms to advance the task of weather forecasting. The dataset is available at https://github.com/bycnfz/weather2k/.
https://proceedings.mlr.press/v206/zhu23a.html

Domain Adaptation under Missingness Shift
Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that if missing data indicators are available, DAMS reduces to covariate shift. Addressing cases where such indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal linear source predictor can perform arbitrarily worse on the target domain than always predicting the mean; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.
https://proceedings.mlr.press/v206/zhou23b.html

Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach
In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, existing solutions produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of these methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning to optimize the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, so no additional tuning of the degree of pessimism is required. We develop a general Bayesian learning method that works with a range of models, from the Bayesian linear basis model to the Bayesian neural network model. We develop a computational algorithm based on variational inference, which is highly efficient and scalable. We establish theoretical guarantees for the proposed method, and show empirically that it outperforms existing state-of-the-art solutions through both simulations and a real data example.
https://proceedings.mlr.press/v206/zhou23a.html

On the Consistency Rate of Decision Tree Learning Algorithms
Decision tree learning algorithms such as CART are generally based on heuristics that greedily maximize the purity gain. Though these algorithms are practically successful, theoretical properties such as consistency are far from clear. In this paper, we discover that the most serious obstacle encumbering consistency analysis for decision tree learning algorithms lies in the fact that the worst-case purity gain, i.e., the core heuristic for tree splitting, can be zero. Based on this recognition, we present a new algorithm, named Grid Classification And Regression Tree (GridCART), with a provable consistency rate $\mathcal{O}(n^{-1/(d+2)})$, which is the first consistency rate proved for heuristic tree learning algorithms.
https://proceedings.mlr.press/v206/zheng23b.html

Knowledge Acquisition for Human-In-The-Loop Image Captioning
Image captioning offers a computational process to understand the semantics of images and convey them using descriptive language. However, automated captioning models may not always generate satisfactory captions due to the complex nature of the images and the quality/size of the training data. We propose an interactive captioning framework to improve machine-generated captions by keeping humans in the loop and performing an online-offline knowledge acquisition (KA) process. In particular, online KA accepts a list of keywords specified by human users and fuses them with the image features to generate a readable sentence that captures the semantics of the image. It leverages a multimodal conditioned caption completion mechanism to ensure the appearance of all user-input keywords in the generated caption. Offline KA further learns from the user inputs to update the model and benefits caption generation for unseen images in the future. It is built upon a Bayesian transformer architecture that dynamically allocates neural resources and supports uncertainty-aware model updates to mitigate overfitting. Our theoretical analysis also proves that Offline KA automatically selects the best model capacity to accommodate the newly acquired knowledge. Experiments on real-world data demonstrate the effectiveness of the proposed framework.
https://proceedings.mlr.press/v206/zheng23a.html

Automatic Attention Pruning: Improving and Automating Model Pruning using Attentions
Pruning is a promising approach to compress deep learning models in order to deploy them on resource-constrained edge devices. However, many existing pruning solutions are based on unstructured pruning, which yields models that cannot efficiently run on commodity hardware; and they often require users to manually explore and tune the pruning process, which is time-consuming and often leads to sub-optimal results. To address these limitations, this paper presents Automatic Attention Pruning (AAP), an adaptive, attention-based, structured pruning approach to automatically generate small, accurate, and hardware-efficient models that meet user objectives. First, it proposes iterative structured pruning using activation-based attention maps to effectively identify and prune unimportant filters. Then, it proposes adaptive pruning policies for automatically meeting the pruning objectives of accuracy-critical, memory-constrained, and latency-sensitive tasks. A comprehensive evaluation shows that AAP substantially outperforms the state-of-the-art structured pruning works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/Automatic-Attention-Pruning.git.
https://proceedings.mlr.press/v206/zhao23b.html

Blessing of Class Diversity in Pre-training
This paper presents a new statistical analysis aiming to explain the recent superior achievements of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.
https://proceedings.mlr.press/v206/zhao23a.html

Spread Flows for Manifold Modelling
Flow-based models typically define a latent space with dimensionality identical to the observational space. In many problems, however, the data do not populate the full ambient space in which they natively reside, instead inhabiting a lower-dimensional manifold. In such scenarios, flow-based models are unable to represent data structures exactly, as their densities will always have support off the data manifold, potentially resulting in degradation of model performance. To address this issue, we propose to learn a manifold prior for flow models that leverages the recently proposed spread divergence to address the crucial problem that the KL divergence and maximum likelihood estimation are ill-defined for manifold learning. In addition to improving both sample quality and representation quality, an auxiliary benefit enabled by our approach is the ability to identify the intrinsic dimension of the manifold distribution.
https://proceedings.mlr.press/v206/zhang23k.html

Improved Bound on Generalization Error of Compressed KNN Estimator
This paper studies the generalization capability of the compressed $k$-nearest neighbor (KNN) estimator, where randomly-projected low-dimensional data are put into the KNN estimator rather than the high-dimensional raw data. Considering both regression and classification, we give improved bounds on its generalization errors; more specifically, the $\ell_2$ error for regression and the misclassification rate for classification. As a byproduct of our analysis, we prove that ordered distances are almost preserved under random projections, which we believe is the first such result. In addition, we provide numerical experiments on various public datasets to verify our theorems.
https://proceedings.mlr.press/v206/zhang23j.html
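The compression step in the abstract above is standard random projection followed by KNN. The following minimal numpy sketch (not the authors' estimator; the clustered toy data and all dimensions are hypothetical choices) illustrates the byproduct claim that ordered distances, and hence neighbour sets, are largely preserved after projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 well-separated clusters of 10 points each in d = 500 dimensions.
n_clusters, per_cluster, d, k = 20, 10, 500, 20
centers = rng.normal(size=(n_clusters, d))
X = np.repeat(centers, per_cluster, axis=0) + 0.1 * rng.normal(size=(n_clusters * per_cluster, d))

# Random Gaussian projection to k dimensions; the 1/sqrt(k) scaling preserves
# squared distances in expectation (Johnson-Lindenstrauss style).
R = rng.normal(size=(d, k)) / np.sqrt(k)
Z = X @ R

# Nearest neighbours of a query point before and after compression.
q = 0
nn_high = set(np.argsort(np.linalg.norm(X - X[q], axis=1))[1:per_cluster])
nn_low = set(np.argsort(np.linalg.norm(Z - Z[q], axis=1))[1:per_cluster])

overlap = len(nn_high & nn_low) / (per_cluster - 1)
print(f"neighbour-set overlap after projection: {overlap:.0%}")
```

With well-separated clusters the neighbour sets typically coincide; the paper's contribution is a guarantee on how much ordering survives in general.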

Continuous-Time Decision Transformer for Healthcare Applications
Offline reinforcement learning (RL) is a promising approach for training intelligent medical agents to learn treatment policies and assist decision making in many healthcare applications, such as scheduling clinical visits and assigning dosages for patients with chronic conditions. In this paper, we investigate the potential usefulness of Decision Transformer (Chen et al., 2021), a new offline RL paradigm, in medical domains where decision making in continuous time is desired. As Decision Transformer only handles discrete-time (or turn-based) sequential decision making scenarios, we generalize it to the Continuous-Time Decision Transformer, which not only considers past clinical measurements and treatments but also the timings of previous visits, and learns to suggest the timings of future visits as well as the treatment plan at each visit. Extensive experiments on synthetic datasets and simulators motivated by real-world medical applications demonstrate that the Continuous-Time Decision Transformer is able to outperform competitors and has clinical utility in terms of improving patients' health and prolonging their survival by learning high-performance policies from logged data generated using policies of different levels of quality.
https://proceedings.mlr.press/v206/zhang23i.html

Risk Bounds on Aleatoric Uncertainty Recovery
Quantifying aleatoric uncertainty is a challenging task in machine learning. It is important for decision making associated with data-dependent uncertainty in model outcomes. Recently, many empirical studies modeling aleatoric uncertainty in regression settings have relied primarily on either a Gaussian likelihood or moment matching. However, the performance of these methods varies across datasets, and discussions of their theoretical guarantees are lacking. In this work, we investigate theoretical aspects of these approaches and establish risk bounds for their estimates. We provide conditions that are sufficient to guarantee the PAC-learnability of the aleatoric uncertainty. The study suggests that the likelihood- and moment matching-based methods enjoy different types of guarantees in their risk bounds, i.e., they calibrate different aspects of the uncertainty and thus exhibit distinct properties in different regimes of the parameter space. Finally, we conduct an empirical study which shows promising results and supports our theorems.
https://proceedings.mlr.press/v206/zhang23h.html

Fair Representation Learning with Unreliable Labels
In learning with fairness, every instance's label can be randomly flipped to another class due to the practitioner's prejudice, namely, label bias. The existing well-studied fair representation learning methods focus on removing the dependency between the sensitive factors and the input data, but do not address how the representations retain useful information when the labels are unreliable. In fact, we find that the learned representations become random or degenerate when instances are contaminated by label bias. To alleviate this issue, we investigate the problem of learning fair representations that are independent of the sensitive factors while retaining the task-relevant information, given only access to unreliable labels. Our model disentangles the dependency between fair representations and sensitive factors in the latent space. To remove the dependence between the labels and sensitive factors, we incorporate an additional penalty based on mutual information. The learned purged fair representations can then be used in any downstream processing. We demonstrate the superiority of our method over previous works through multiple experiments on both synthetic and real-world datasets.
https://proceedings.mlr.press/v206/zhang23g.html

Online Learning for Non-monotone DR-Submodular Maximization: From Full Information to Bandit Feedback
In this paper, we revisit the online non-monotone continuous DR-submodular maximization problem over a down-closed convex set, which finds wide real-world applications in the domains of machine learning, economics, and operations research. First, we present the Meta-MFW algorithm, achieving a $1/e$-regret of $O(\sqrt{T})$ at the cost of $T^{3/2}$ stochastic gradient evaluations per round. As far as we know, Meta-MFW is the first algorithm to obtain a $1/e$-regret of $O(\sqrt{T})$ for the online non-monotone continuous DR-submodular maximization problem over a down-closed convex set. Furthermore, in sharp contrast with the ODC algorithm (Thang & Srivastav, 2021), Meta-MFW relies on a simple online linear oracle without discretization, lifting, or rounding operations. Considering practical restrictions, we then propose the Mono-MFW algorithm, which reduces the per-function stochastic gradient evaluations from $T^{3/2}$ to 1 and achieves a $1/e$-regret bound of $O(T^{4/5})$. Next, we extend Mono-MFW to the bandit setting and propose the Bandit-MFW algorithm, which attains a $1/e$-regret bound of $O(T^{8/9})$. To the best of our knowledge, Mono-MFW and Bandit-MFW are the first sublinear-regret algorithms to explore the one-shot and bandit settings, respectively, for the online non-monotone continuous DR-submodular maximization problem over a down-closed convex set. Finally, we conduct numerical experiments on both synthetic and real-world datasets to verify the effectiveness of our methods.
https://proceedings.mlr.press/v206/zhang23f.html

No-Regret Learning in Two-Echelon Supply Chain with Unknown Demand Distribution
Supply chain management (SCM) has been recognized as an important discipline with applications to many industries, where the two-echelon stochastic inventory model, involving one downstream retailer and one upstream supplier, plays a fundamental role for developing firms' SCM strategies. In this work, we aim at designing online learning algorithms for this problem with an unknown demand distribution, which brings distinct features as compared to classic online convex optimization problems. Specifically, we consider the two-echelon supply chain model introduced in [Cachon and Zipkin, 1999] under two different settings: the centralized setting, where a planner decides both agents' strategy simultaneously, and the decentralized setting, where two agents decide their strategy independently and selfishly. We design algorithms that achieve favorable guarantees for both regret and convergence to the optimal inventory decision in both settings, and additionally for individual regret in the decentralized setting. Our algorithms are based on Online Gradient Descent and Online Newton Step, together with several new ingredients specifically designed for our problem. We also implement our algorithms and show their empirical effectiveness.
https://proceedings.mlr.press/v206/zhang23e.html

Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks
Adversarial examples, which are usually generated by adding imperceptible adversarial noise to clean samples, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together: adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and for the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show that the label information can be efficiently backpropagated to the input while keeping linear separability. Our theory and experimental evidence further show that a linear classifier trained on the adversarial noises of the training data can classify the adversarial noises of the test data well, indicating that adversarial noises actually inject a distributional perturbation into the original data distribution. Furthermore, we empirically demonstrate that adversarial noises may become less linearly separable when the above conditions are compromised, while they are still much easier to classify than the original features.
https://proceedings.mlr.press/v206/zhang23d.html
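The one-step noise construction in the abstract above can be reproduced in a few lines. A minimal sketch, assuming a randomly initialized two-layer ReLU network with logistic loss (dimensions, sample counts, and step sizes are illustrative choices, not the paper's): because the sigmoid factor in the input gradient is a positive scalar, each noise vector reduces to a signed pattern proportional to its label, which is what a linear classifier picks up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, eps = 200, 50, 100, 0.1

# Randomly initialized two-layer ReLU network f(x) = a . relu(W x).
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)

X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# One-step sign-gradient ("FGSM-style") noise for the logistic loss
# log(1 + exp(-y f(x))): grad_x = -y * sigmoid(-y f(x)) * W^T (a * 1[Wx > 0]).
# The sigmoid factor is positive, so it does not change the sign and is dropped.
pre = X @ W.T                                   # (n, m) pre-activations
grads = -(y[:, None]) * ((pre > 0) * a) @ W     # (n, d) gradient directions
noise = eps * np.sign(grads)                    # one adversarial noise per sample

# Fit a linear classifier on (noise, y) by gradient descent on the logistic loss.
w = np.zeros(d)
for _ in range(500):
    margins = y * (noise @ w)
    w += 0.5 * (y / (1 + np.exp(margins))) @ noise / n

acc = np.mean(np.sign(noise @ w) == y)
print(f"linear train accuracy on adversarial noises: {acc:.2f}")
```

The linear fit reaches near-perfect accuracy on the noises alone, matching the abstract's claim for (nearly) random networks.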

Conformal Off-Policy Prediction
Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.
https://proceedings.mlr.press/v206/zhang23c.html
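The pseudo-policy construction is the paper's contribution; the interval-building step it enables is standard split conformal prediction. A minimal sketch on hypothetical scalar returns (the generic building block only, not the full proposal):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar returns; in the paper's setting these would be subsamples generated
# by the pseudo policy so that they look like draws from the target policy.
returns = rng.normal(loc=1.0, scale=0.5, size=1000)
fit, calib, test = returns[:300], returns[300:600], returns[600:]

alpha = 0.1                              # target miscoverage level
point = fit.mean()                       # point prediction fitted on a separate split
scores = np.abs(calib - point)           # nonconformity scores on the calibration split
k = int(np.ceil((len(calib) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]               # finite-sample-corrected conformal quantile

lo, hi = point - q, point + q
coverage = np.mean((test >= lo) & (test <= hi))
print(f"interval [{lo:.2f}, {hi:.2f}], empirical coverage {coverage:.2f}")
```

Under exchangeability of the calibration and test draws, the interval covers with probability at least $1 - \alpha$, which is the validity property the paper transfers to off-policy returns.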

Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis
One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential updating method (SE) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.
https://proceedings.mlr.press/v206/zhang23b.html
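The penalized-cost framework the abstract refers to can be made concrete with the basic optimal-partitioning recursion (PELT adds pruning on top of this, and the paper's contribution is the sequential cost updates; neither is shown here). A minimal sketch on hypothetical Gaussian data with one mean shift:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
n = len(x)

# Prefix sums so each segment cost is O(1).
csum = np.concatenate([[0.0], np.cumsum(x)])
csum2 = np.concatenate([[0.0], np.cumsum(x**2)])

def seg_cost(s, t):
    """Sum of squared deviations of x[s:t] from its mean (Gaussian mean-change cost)."""
    m = csum[t] - csum[s]
    return (csum2[t] - csum2[s]) - m * m / (t - s)

beta = 4 * np.log(n)                 # penalty paid per change-point
F = np.full(n + 1, np.inf)           # F[t]: best penalized cost of x[:t]
F[0] = -beta
last = np.zeros(n + 1, dtype=int)
for t in range(1, n + 1):
    cands = [F[s] + seg_cost(s, t) + beta for s in range(t)]
    s_best = int(np.argmin(cands))
    F[t], last[t] = cands[s_best], s_best

# Backtrack the estimated change-point locations.
cps, t = [], n
while t > 0:
    t = last[t]
    if t > 0:
        cps.append(t)
cps = sorted(cps)
print("estimated change-points:", cps)
```

This recursion is $O(n^2)$; the point of PELT, and of the sequential updates proposed in the paper, is to avoid recomputing `seg_cost` from scratch at scale.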

Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently. The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources abstracted as labeling functions (LFs). Existing statistical label models typically rely only on the outputs of the LFs, ignoring the instance features when modeling the underlying generative process. In this paper, we attempt to incorporate the instance features into a statistical label model via the proposed FABLE. In particular, it is built on a mixture of Bayesian label models, each corresponding to a global pattern of correlation, and the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features. We adopt an auxiliary variable-based variational inference algorithm to tackle the non-conjugacy between the Gaussian Process and the Bayesian label models. An extensive empirical comparison on eleven benchmark datasets shows FABLE achieving the highest average performance against nine baselines. Our implementation of FABLE can be found at https://github.com/JieyuZ2/wrench/blob/main/wrench/labelmodel/fable.py.
https://proceedings.mlr.press/v206/zhang23a.html

Semi-Verified PAC Learning from the Crowd
We study the problem of crowdsourced PAC learning of threshold functions. This is a challenging problem, and only recently have query-efficient algorithms been established under the assumption that a noticeable fraction of the workers are perfect. In this work, we investigate a more challenging case where the majority may behave adversarially and the rest behave according to Massart noise, a significant generalization of the perfectness assumption. We show that under the semi-verified model of Charikar et al. (2017), where we have (limited) access to a trusted oracle who always returns correct annotations, it is possible to PAC learn the underlying hypothesis class with a manageable amount of label queries. Moreover, we show that the labeling cost can be drastically mitigated via the more easily obtained comparison queries. Orthogonal to recent developments in semi-verified or list-decodable learning that crucially rely on data distributional assumptions, our PAC guarantee holds by exploiting the wisdom of the crowd.
https://proceedings.mlr.press/v206/zeng23a.html

Bayesian Strategy-Proof Facility Location via Robust Estimation
A seminal work by Moulin (1980) shows that the median voting scheme fully characterizes (deterministic) strategy-proof facility location mechanisms for single-peaked preferences. In this simple setting, the median also achieves the optimal social cost. In $d$ dimensions, strategy-proof mechanisms are characterized by the coordinate-wise median, which is known to have a large $\sqrt{d}$ approximation ratio of the social cost in the Euclidean space, whereas the socially optimal mechanism fails to be strategy-proof. In light of the negative results in the classic, worst-case setting, we initiate the study of Bayesian mechanism design for strategy-proof facility location with multi-dimensional Euclidean preferences, where the agents' preferences are drawn from a distribution. We approach the problem via connections to algorithmic high-dimensional robust statistics. Specifically, our contributions are the following:
* We provide a general reduction from any robust estimation scheme to a Bayesian approximately strategy-proof mechanism. This leads to new strategy-proof mechanisms for Gaussian and bounded-moment distributions, by leveraging recent advances in robust statistics.
* We show that the Lugosi-Mendelson median arising from heavy-tailed statistics can be used to obtain a Bayesian approximately strategy-proof single-facility mechanism with asymptotically optimal social cost, under mild distributional assumptions.
* We provide Bayesian approximately strategy-proof multi-facility mechanisms for Gaussian mixture distributions with nearly optimal social cost.
https://proceedings.mlr.press/v206/zampetakis23a.html
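For concreteness, the classic coordinate-wise median baseline that the facility-location abstract above contrasts against can be sketched in a few lines (a minimal illustration of the Moulin-style mechanism only, not of the paper's robust-estimation constructions):

```python
import numpy as np

def coordinate_wise_median(reports):
    """Place a single facility at the per-coordinate median of the
    agents' reported locations. Misreporting can never pull the
    facility toward an agent, so truthful reporting is a dominant
    strategy; the price is the sqrt(d) social-cost approximation
    mentioned in the abstract."""
    return np.median(np.asarray(reports, dtype=float), axis=0)

# Three agents in d = 2 dimensions.
reports = [[0.0, 0.0], [1.0, 4.0], [2.0, 1.0]]
print(coordinate_wise_median(reports))  # [1. 1.]
```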
Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path
https://proceedings.mlr.press/v206/zaman23a.html
We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using a single sample path of the generic agent. We call this Sandbox Learning, as it can be used as a warm start for any agent learning in a multi-agent non-cooperative setting. We adopt a two-timescale approach in which an online fixed-point recursion for the mean field operates on a slower timescale, in tandem with a control policy update on a faster timescale for the generic agent. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite-sample guarantees on the convergence of the mean field and the control policy to the mean-field equilibrium. The sample complexity of the Sandbox Learning algorithm is $O(\epsilon^{-4})$, where $\epsilon$ is the MFE approximation error. This is comparable to works that assume access to an oracle. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
Optimal Sample Complexity Bounds for Non-convex Optimization under Kurdyka-Lojasiewicz Condition
https://proceedings.mlr.press/v206/yu23a.html
Optimization of smooth reward functions under bandit feedback is a long-standing problem in online learning. This paper approaches the problem by studying convergence under smoothness and Kurdyka-Lojasiewicz conditions. We design a search-based algorithm that achieves an improved rate compared to the standard gradient-based method. In conjunction with a matching lower bound, this algorithm is optimal in its dependence on the precision in the low-dimensional regime.
Sample Complexity of Kernel-Based Q-Learning
https://proceedings.mlr.press/v206/yeh23a.html
Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel-based Q-learning when a generative model exists. We propose a non-parametric Q-learning algorithm which finds an $\varepsilon$-optimal policy in an arbitrarily large-scale discounted MDP. The sample complexity of the proposed algorithm is order-optimal with respect to $\varepsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
Freeze then Train: Towards Provable Representation Learning under Spurious Correlations and Feature Noise
https://proceedings.mlr.press/v206/ye23a.html
The existence of spurious correlations such as image backgrounds in the training environment can make empirical risk minimization (ERM) perform badly in the test environment. To address this problem, Kirichenko et al. (2022) empirically found that the core features related to the outcome can still be learned well even in the presence of spurious correlations. This opens a promising strategy of first training a feature learner rather than a classifier, and then performing linear probing (last-layer retraining) in the test environment. However, a theoretical understanding of when and why this approach works is lacking. In this paper, we find that core features are only learned well when their associated non-realizable noise is smaller than that of spurious features, which is not necessarily true in practice. We provide both theory and experiments to support this finding and to illustrate the importance of non-realizable noise. Moreover, we propose an algorithm called Freeze then Train (FTT), which first freezes certain salient features and then trains the rest of the features using ERM. We theoretically show that FTT preserves features that are more beneficial to test-time probing. Across two commonly used spurious correlation datasets, FTT outperforms ERM, IRM, JTT and CVaR-DRO, with substantial improvement in accuracy (by 4.5%) when the feature noise is large. FTT also performs better on general distribution shift benchmarks.
Randomized Primal-Dual Methods with Adaptive Step Sizes
https://proceedings.mlr.press/v206/yazdandoost-hamedani23a.html
In this paper we propose a class of randomized primal-dual methods incorporating line search to contend with large-scale saddle point (SP) problems defined by a convex-concave function $\mathcal L(\mathbf{x},y) = \sum_{i=1}^M f_i(x_i)+\Phi(\mathbf{x},y)-h(y)$. We analyze the convergence rate of the proposed method under mere convexity and strong convexity assumptions on $\mathcal L$ in the $\mathbf{x}$-variable. In particular, assuming $\nabla_y\Phi(\cdot,\cdot)$ is Lipschitz and $\nabla_{\mathbf{x}}\Phi(\cdot,y)$ is coordinate-wise Lipschitz for any fixed $y$, the ergodic sequence generated by the algorithm achieves the $\mathcal O(M/k)$ convergence rate in the expected primal-dual gap. Furthermore, assuming that $\mathcal L(\cdot,y)$ is strongly convex for any $y$, and that $\Phi(\mathbf{x},\cdot)$ is affine for any $\mathbf{x}$, the scheme enjoys a faster rate of $\mathcal O(M/k^2)$ in terms of primal solution suboptimality. We implemented the proposed algorithmic framework to solve the kernel matrix learning problem, and tested it against other state-of-the-art first-order methods.
Stochastic Methods for AUC Optimization subject to AUC-based Fairness Constraints
https://proceedings.mlr.press/v206/yao23b.html
As machine learning is increasingly used to make high-stakes decisions, an arising challenge is to avoid unfair AI systems that lead to discriminatory decisions for protected populations. A direct approach to obtaining a fair predictive model is to train the model by optimizing its prediction performance subject to fairness constraints. Among various fairness constraints, those based on the area under the ROC curve (AUC) have emerged recently because they are threshold-agnostic and effective for unbalanced data. In this work, we formulate the problem of training a fairness-aware predictive model as an AUC optimization problem subject to a class of AUC-based fairness constraints. This problem can be reformulated as a min-max optimization problem with min-max constraints, which we solve by stochastic first-order methods based on a new Bregman divergence designed for the special structure of the problem. We numerically demonstrate the effectiveness of our approach on real-world data under different fairness metrics.
Error Estimation for Random Fourier Features
https://proceedings.mlr.press/v206/yao23a.html
Random Fourier Features (RFF) is among the most popular and broadly applicable approaches for scaling up kernel methods. In essence, RFF allows the user to avoid costly computations with a large kernel matrix via a fast randomized approximation. However, a pervasive difficulty in applying RFF is that the user does not know the actual error of the approximation, or how this error will propagate into downstream learning tasks. Up to now, the RFF literature has primarily dealt with these uncertainties using theoretical error bounds, but from a user's standpoint, such results are typically impractical: either they are highly conservative or they involve unknown quantities. To tackle these general issues in a data-driven way, this paper develops a bootstrap approach to numerically estimate the errors of RFF approximations. Three key advantages of this approach are: (1) the error estimates are specific to the problem at hand, avoiding the pessimism of worst-case bounds; (2) the approach is flexible with respect to different uses of RFF, and can even estimate errors in downstream learning tasks; (3) the approach enables adaptive computation, in the sense that the user can quickly inspect the error of a rough initial kernel approximation and then predict how much extra work is needed. In exchange for all of these benefits, the error estimates can be obtained at a modest computational cost.
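The bootstrap idea behind this abstract can be illustrated on a toy example (the sizes, kernel, and resampling scheme below are our own minimal setup; the paper's estimator, its error metrics, and its guarantees are more refined than this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: approximate a Gaussian-type kernel with D random features.
n, d, D = 60, 3, 200
X = rng.normal(size=(n, d))
W = rng.normal(size=(D, d))                  # random frequencies ~ N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(Xs, W, b):
    """z(x) = sqrt(2/D) * cos(Wx + b), so z(x)^T z(y) approximates k(x, y)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(Xs @ W.T + b)

Z = rff(X, W, b)
K_hat = Z @ Z.T                              # RFF kernel approximation

# Bootstrap over the D random features: resample feature indices with
# replacement, rebuild the approximation, and record how far it moves.
B = 50
errs = []
for _ in range(B):
    idx = rng.integers(0, D, size=D)
    Zb = rff(X, W[idx], b[idx])
    errs.append(np.max(np.abs(Zb @ Zb.T - K_hat)))
print(f"bootstrap 90th-percentile max-entry error: {np.quantile(errs, 0.9):.3f}")
```

The spread of `errs` serves as a data-driven proxy for the approximation error of `K_hat`, which is the kind of quantity the worst-case bounds mentioned above cannot give for a specific dataset.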
Learning to Generalize Provably in Learning to Optimize
https://proceedings.mlr.press/v206/yang23h.html
Learning to optimize (L2O), which automates the design of optimizers via data-driven approaches, has gained increasing popularity. However, current L2O methods often suffer from poor generalization in at least two respects: (i) applying an L2O-learned optimizer to unseen optimizees, in terms of lowering their loss function values (optimizer generalization, or "generalizable learning of optimizers"); and (ii) the test performance of an optimizee (itself a machine learning model) trained by the optimizer, in terms of accuracy on unseen data (optimizee generalization, or "learning to generalize"). While optimizer generalization has been studied recently, optimizee generalization (learning to generalize) has not been rigorously studied in the L2O context, which is the aim of this paper. We first theoretically establish an implicit connection between the local entropy and the Hessian, and hence unify their roles in the handcrafted design of generalizable optimizers as equivalent metrics of the flatness of the loss landscape. We then propose to incorporate these two metrics as flatness-aware regularizers into the L2O framework in order to meta-train optimizers to learn to generalize, and theoretically show that such generalization ability can be learned during the L2O meta-training process and then transferred to the optimizee loss function. Extensive experiments consistently validate the effectiveness of our proposals, with substantially improved generalization on multiple sophisticated L2O models and diverse optimizees.
Online Linearized LASSO
https://proceedings.mlr.press/v206/yang23g.html
Sparse regression is a popular approach for performing variable selection and enhancing the prediction accuracy and interpretability of the resulting statistical model. Existing approaches focus on offline regularized regression, while the online scenario has rarely been studied. In this paper, we propose a novel online sparse linear regression framework for analyzing streaming data whose points arrive sequentially. Our proposed method is memory efficient and requires less stringent restricted strong convexity assumptions. Theoretically, we show that with a properly chosen regularization parameter, the $\ell_2$-error of our estimator decays to zero at the optimal order of $\tilde{\mathcal{O}}(s/\sqrt{t})$, where $s$ is the sparsity level, $t$ is the streaming sample size, and $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic terms. Numerical experiments demonstrate the practical efficiency of our algorithm.
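As a rough illustration of the streaming sparse-regression setting, here is a generic online proximal-gradient (soft-thresholding) sketch; the step sizes and update rule are our own illustrative choices and do not reproduce the paper's linearized estimator or its theory:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def online_lasso(stream, d, lam=0.05):
    """One proximal-gradient step per arriving sample (x, y): a gradient
    step on the per-sample squared loss followed by soft-thresholding."""
    beta = np.zeros(d)
    for t, (x, y) in enumerate(stream, start=1):
        eta = 0.1 / np.sqrt(t)            # decaying step size (our choice)
        grad = (x @ beta - y) * x         # per-sample squared-loss gradient
        beta = soft_threshold(beta - eta * grad, eta * lam)
    return beta

rng = np.random.default_rng(1)
beta_star = np.array([2.0, 0.0, 0.0, -1.5, 0.0])   # sparse ground truth
stream = []
for _ in range(2000):
    x = rng.normal(size=5)
    stream.append((x, x @ beta_star + 0.1 * rng.normal()))
beta = online_lasso(stream, d=5)
print(np.round(beta, 2))
```

Only one sample is held in memory at a time, which is the memory-efficiency property the abstract emphasizes.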
Distributionally Robust Policy Gradient for Offline Contextual Bandits
https://proceedings.mlr.press/v206/yang23f.html
Learning an optimal policy from offline data is notoriously challenging, as it requires evaluating the learning policy using data pre-collected from a static logging policy. We study the policy optimization problem in offline contextual bandits using policy gradient methods. We employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly. We show that our algorithm converges to a stationary point at rate $O(1/T)$, where $T$ is the number of time steps. We conduct experiments on real-world datasets under various logging-policy scenarios to compare our proposed algorithm with baseline methods in offline contextual bandits. We also propose a variant of our algorithm, DROPO-exp, to further improve performance when a limited amount of online interaction is allowed. Our results demonstrate the effectiveness and robustness of the proposed algorithms, especially under heavily biased offline data.
Reinforcement Learning for Adaptive Mesh Refinement
https://proceedings.mlr.press/v206/yang23e.html
Finite element simulations of physical systems governed by partial differential equations (PDEs) crucially depend on adaptive mesh refinement (AMR) to allocate computational budget to regions where higher resolution is required. Existing scalable AMR methods make heuristic refinement decisions based on instantaneous error estimation and thus do not aim for long-term optimality over an entire simulation. We propose a novel formulation of AMR as a Markov decision process and apply deep reinforcement learning (RL) to train refinement policies directly from simulation. AMR poses a challenge for RL because both the state dimension and the available action set change at every step, which we address by proposing new policy architectures with differing generality and inductive bias. The model sizes of these policy architectures are independent of the mesh size, so they can be deployed on larger simulations than those used at training time. We demonstrate in comprehensive experiments on static function estimation and time-dependent equations that RL policies can be trained on problems without using ground truth solutions, are competitive with a widely used error estimator, and generalize to larger and unseen test problems.
Learning While Scheduling in Multi-Server Systems With Unknown Statistics: MaxWeight with Discounted UCB
https://proceedings.mlr.press/v206/yang23d.html
Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, and crowdsourcing. This paper considers a system with multiple servers and multiple types of jobs, where different job types require different amounts of processing time at different servers. The goal is to schedule jobs on servers without knowing the statistics of the processing times. To fully utilize the processing power of the servers, one must at least learn the service rates of different job types on different servers. Prior works on this topic decouple the learning and scheduling phases, which leads to either excessive exploration or extremely large job delays. We propose a new algorithm, which combines the MaxWeight scheduling policy with discounted upper confidence bound (UCB), to simultaneously learn the statistics and schedule jobs to servers. We obtain performance bounds for our algorithm that hold for both stationary and nonstationary service rates. Simulations confirm that the delay performance of our algorithm is several orders of magnitude better than that of previously proposed algorithms. Our algorithm also has the added benefit that it can handle non-stationarity in the service processes.
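The discounted-UCB ingredient can be sketched for a single (job type, server) pair; discounting old observations lets the estimate track nonstationary service rates. This is an illustrative index only (the class name and constants are ours), and the coupling with MaxWeight scheduling described in the abstract is omitted:

```python
import math

class DiscountedUCB:
    """Discounted UCB index for one (job type, server) pair: old
    observations are exponentially down-weighted so the estimate can
    track a drifting service rate."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.weighted_sum = 0.0   # discounted sum of observed outcomes
        self.weighted_n = 0.0     # discounted observation count

    def update(self, reward):
        self.weighted_sum = self.gamma * self.weighted_sum + reward
        self.weighted_n = self.gamma * self.weighted_n + 1.0

    def index(self, t):
        if self.weighted_n == 0.0:
            return float("inf")   # force initial exploration
        mean = self.weighted_sum / self.weighted_n
        bonus = math.sqrt(2.0 * math.log(t + 1) / self.weighted_n)
        return mean + bonus

ucb = DiscountedUCB(gamma=0.99)
for r in [1, 0, 1, 1]:            # observed service outcomes for this pair
    ucb.update(r)
print(round(ucb.index(t=4), 3))
```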
Sample Efficiency of Data Augmentation Consistency Regularization
https://proceedings.mlr.press/v206/yang23c.html
Data augmentation is popular in the training of large neural networks; however, theoretical understanding of the discrepancies among different algorithmic choices for leveraging augmented data remains limited. In this paper, we take a step in this direction: we first present a simple and novel analysis for linear regression with label-invariant augmentations, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). The analysis is then generalized to misspecified augmentations (i.e., augmentations that change the labels), which again demonstrates the merit of DAC over DA-ERM. Further, we extend our analysis to non-linear models (e.g., neural networks) and present generalization bounds. Finally, we perform experiments that make a clean, apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between DAC and DA-ERM using CIFAR-100 and WideResNet; together these demonstrate the superior efficacy of DAC.
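The two training objectives compared in this abstract can be written down concretely for the linear-regression case it analyzes (a minimal sketch; the weighting `mu` and the exact penalty form are our illustrative choices):

```python
import numpy as np

def da_erm_loss(w, X, y, X_aug):
    """DA-ERM: least-squares risk on the original data pooled with its
    augmented copies, reusing the labels (label-invariant augmentation)."""
    preds = np.concatenate([X @ w, X_aug @ w])
    targets = np.concatenate([y, y])
    return np.mean((preds - targets) ** 2)

def dac_loss(w, X, y, X_aug, mu=1.0):
    """DAC: fit only the original data, plus a consistency penalty tying
    the prediction on each augmented point to the prediction on its
    source point."""
    erm = np.mean((X @ w - y) ** 2)
    consistency = np.mean((X_aug @ w - X @ w) ** 2)
    return erm + mu * consistency

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
X_aug = X + 0.01 * rng.normal(size=X.shape)  # small label-invariant perturbation
print(dac_loss(w_true, X, y, X_aug), da_erm_loss(w_true, X, y, X_aug))
```

The difference the analysis exploits is visible here: DAC penalizes prediction disagreement directly, while DA-ERM only fits the augmented points as extra labeled samples.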
On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks
https://proceedings.mlr.press/v206/yang23b.html
Motivated by both theory and practice, we study how randomly pruning the weights affects a neural network's neural tangent kernel (NTK). In particular, this work establishes an equivalence between the NTK of a fully-connected neural network and that of its randomly pruned version. The equivalence is established in two cases. The first main result studies the infinite-width asymptotics. It is shown that, given a pruning probability, for fully-connected neural networks with weights randomly pruned at initialization, as the width of each layer grows to infinity sequentially, the NTK of the pruned neural network converges to the limiting NTK of the original network with some extra scaling. If the network weights are rescaled appropriately after pruning, this extra scaling can be removed. The second main result considers the finite-width case. It is shown that, to ensure the NTK's closeness to the limit, the dependence of width on the sparsity parameter is asymptotically linear as the NTK's gap to its limit goes to zero. Moreover, if the pruning probability is set to zero (i.e., no pruning), the bound on the required width matches the bound for fully-connected neural networks in previous works up to logarithmic factors. The proof of this result requires developing a novel analysis of a network structure which we call mask-induced pseudo-networks. Experiments are provided to evaluate our results.
Bayesian Structure Scores for Probabilistic Circuits
https://proceedings.mlr.press/v206/yang23a.html
Probabilistic circuits (PCs) are a prominent representation of probability distributions with tractable inference. While parameter learning in PCs is rigorously studied, structure learning is often based more on heuristics than on principled objectives. In this paper, we develop Bayesian structure scores for deterministic PCs, i.e., the structure likelihood with parameters marginalized out, which are well known as rigorous objectives for structure learning in probabilistic graphical models. When used within a greedy cutset algorithm, our scores effectively protect against overfitting and yield a fast and almost hyper-parameter-free structure learner, distinguishing it from previous approaches. In experiments, we achieve good trade-offs between training time and model fit in terms of log-likelihood. Moreover, the principled nature of Bayesian scores unlocks PCs for accommodating frameworks such as structural expectation-maximization.
Mediated Uncoupled Learning and Validation with Bregman Divergences: Loss Family with Maximal Generality
https://proceedings.mlr.press/v206/yamane23a.html
In mediated uncoupled learning (MU-learning), the goal is to predict an output variable $Y$ given an input variable $X$, as in ordinary supervised learning, while the training dataset has no joint samples of $(X, Y)$ but only independent samples of $(X, U)$ and $(U, Y)$, each observed with a mediating variable $U$. Existing MU-learning methods can only handle the squared loss, which prohibits the use of other popular loss functions such as the cross-entropy loss. We propose a general MU-learning framework that handles problems with Bregman divergences, which cover a wide range of loss functions useful for various types of tasks, in a unified manner. This loss family has maximal generality among those whose minimizers characterize the conditional expectation. We prove that the proposed objective function is a tighter approximation to the oracle loss that one would minimize if ordinary supervised samples of $(X, Y)$ were available. We also propose an estimator of an interval containing the expected test loss of a trained model's predictions, using only $(X, U)$- and $(U, Y)$-data. We provide a theoretical analysis of the excess risk for the proposed method and confirm its practical usefulness with regression experiments on synthetic data and low-quality image classification experiments on benchmark datasets.
Benign overfitting of non-smooth neural networks beyond lazy training
https://proceedings.mlr.press/v206/xu23k.html
Benign overfitting refers to a recently discovered, intriguing phenomenon: over-parameterized neural networks can, in many cases, fit the training data perfectly yet still generalize well, contrary to the traditional belief that overfitting is harmful for generalization. Despite its surging popularity in recent years, little is known about the theoretical aspects of benign overfitting in neural networks. In this work, we provide a theoretical analysis of benign overfitting for two-layer neural networks with possibly non-smooth activation functions. Without resorting to the popular Neural Tangent Kernel (NTK) approximation, we prove that neural networks can be trained with gradient descent to classify binary-labeled training data perfectly (achieving zero training loss) even in the presence of polluted labels, and still generalize well. Our result removes the smoothness assumption of previous literature and goes beyond the NTK regime; this enables a better theoretical understanding of benign overfitting in a practically more meaningful setting, e.g., with a (leaky-)ReLU activation function, small random initialization, and finite network width.
Uniformly Conservative Exploration in Reinforcement Learning
https://proceedings.mlr.press/v206/xu23j.html
A key challenge to deploying reinforcement learning in practice is avoiding excessive (harmful) exploration in individual episodes. We propose a natural constraint on exploration: uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We design a novel algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to satisfy our exploration constraint with high probability. Importantly, to ensure unbiased exploration across the state space, our algorithm adaptively determines when to explore. We prove that our approach remains conservative while minimizing regret in the tabular setting. We experimentally validate our results on a sepsis treatment task and an HIV treatment task, demonstrating that our algorithm can learn while ensuring good performance compared to the baseline policy for every patient; the latter task also demonstrates that our approach extends to continuous state spaces via deep reinforcement learning.
Doubly Fair Dynamic Pricing
https://proceedings.mlr.press/v206/xu23i.html
We study the problem of online dynamic pricing with two types of fairness constraints: "procedural fairness", which requires the proposed prices to be equal in expectation among different groups, and "substantive fairness", which requires the accepted prices to be equal in expectation among different groups. A policy that is simultaneously procedurally and substantively fair is referred to as "doubly fair". We show that a doubly fair policy must be random in order to achieve higher revenue than the best trivial policy that assigns the same price to different groups. In the two-group setting, we propose an online learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, zero procedural unfairness, and $\tilde{O}(\sqrt{T})$ substantive unfairness over $T$ rounds of learning. We also prove two lower bounds showing that these results on regret and unfairness are both information-theoretically optimal up to iterated logarithmic factors. To the best of our knowledge, this is the first dynamic pricing algorithm that learns to price while satisfying two fairness constraints at the same time.
Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning
https://proceedings.mlr.press/v206/xu23h.html
We consider the problem of learning a control policy that is robust against parameter mismatches between the training environment and the testing environment. We formulate this as a distributionally robust reinforcement learning (DR-RL) problem, where the objective is to learn the policy that maximizes the value function against the worst possible stochastic model of the environment in an uncertainty set. We focus on the tabular episodic learning setting where the algorithm has access to a generative model of the nominal (training) environment around which the uncertainty set is defined. We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for uncertainty sets specified by four different divergences: total variation, chi-square, Kullback-Leibler, and Wasserstein. We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ sample complexity, which is uniformly better than the existing results by a factor of $|\mathcal{S}|$, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $H$ is the horizon length. We also provide the first-ever sample complexity result for the Wasserstein uncertainty set. Finally, we demonstrate the performance of our algorithm in simulation experiments.
On the Accelerated Noise-Tolerant Power Method
https://proceedings.mlr.press/v206/xu23g.html
We revisit the acceleration of the noise-tolerant power method, for which, despite previous studies, the existing results remain unsatisfactory: they are either wrong or suboptimal, and they lack generality. In this work, we present a simple yet general and optimal analysis via noise-corrupted Chebyshev polynomials, which allows a larger iteration rank $p$ than the target rank $k$, requires milder noise conditions in a new form, and achieves the optimal iteration complexity $\Theta\left(\sqrt{\frac{\lambda_{k}-\lambda_{q+1}}{\lambda_{k}}}\log\frac{1}{\epsilon}\right)$ for some $q$ satisfying $k\leq q\leq p$, in a certain regime of the momentum parameter. Interestingly, the analysis shows a dynamic dependence of the noise tolerance on the spectral gap, from linear at the beginning to square-root near convergence, while remaining commensurate with previous results in terms of overall tolerance. We relate our new form of noise-norm conditions to the existing trigonometric one, which enables an improved analysis of generalized eigenspace computation and canonical correlation analysis. We conduct an extensive experimental study to showcase the strong performance of the considered algorithm with a larger iteration rank $p>k$ across different applications.
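The momentum (heavy-ball / Chebyshev-style) power iteration underlying this line of work can be sketched on a noiseless diagonal example. The spectrum, momentum choice, and iteration count below are our own toy assumptions, and the noise-tolerant analysis itself is not reproduced:

```python
import numpy as np

def momentum_power_method(A, p, beta, iters, rng):
    """Momentum power iteration X_{t+1} = A X_t - beta * X_{t-1},
    orthonormalized at the end. The iteration rank p may exceed the
    target rank k, as the abstract allows."""
    n = A.shape[0]
    X_prev = np.zeros((n, p))
    X = rng.normal(size=(n, p))
    for _ in range(iters):
        X, X_prev = A @ X - beta * X_prev, X
    Q, _ = np.linalg.qr(X)
    return Q

rng = np.random.default_rng(0)
eigs = np.array([1.0, 0.9, 0.5, 0.3, 0.1, 0.05])
A = np.diag(eigs)
beta = (eigs[2] / 2.0) ** 2   # a common momentum choice, (lambda_{k+1}/2)^2
Q = momentum_power_method(A, p=2, beta=beta, iters=80, rng=rng)
# Energy of the basis captured by the top-2 eigenspace
# (coordinates 0 and 1 for this diagonal A); should be close to 2.
print(round(float(np.sum(Q[:2, :] ** 2)), 4))
```

With this momentum, modes at or below $\lambda_{k+1}$ are damped much faster than plain power iteration would damp them, which is the source of the square-root gap dependence in the iteration complexity above.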
Oblivious near-optimal sampling for multidimensional signals with Fourier constraints
https://proceedings.mlr.press/v206/xu23f.html
We study the problem of reconstructing a continuous multidimensional signal from a small number of samples under Fourier constraints, assuming that the Fourier power spectrum of the signal has some desirable properties, e.g., being compactly supported or sparse. We further assume that the Fourier constraint can be expressed as a prior distribution on the Fourier power spectrum, which subsumes the aforementioned examples. The study of sampling and reconstruction in this vein has attracted much attention and has a long history. In this paper, we are interested in finding oblivious sampling strategies, that is, sampling without knowing what specific constraint is put on the Fourier power spectrum. We show that it is possible to obliviously sample a Fourier-constrained multidimensional signal with a near-optimal (up to a logarithmic factor) number of samples that guarantees successful reconstruction, partially answering an open question in Avron et al. (2019), which considered the $1$-dimensional case. Our approach highlights a phenomenon unique to dimension $d\ge 2$: the sampling strategy should depend on the geometry of the region on which the signal is to be reconstructed, unlike the case $d=1$, where all regions are of the form $[a,b]$ and hence geometrically equivalent. Our proof, using tools from convex geometry, also illuminates an idea obscured in $d=1$: to reconstruct a signal in a given region, it can be helpful to take some samples outside that region.
FAIR: Fair Collaborative Active Learning with Individual Rationality for Scientific Discovery
https://proceedings.mlr.press/v206/xu23e.html
Scientific discovery aims to find new patterns and test specific hypotheses by analysing large-scale experimental data. However, various practical limitations (e.g., high experimental costs or the inability to perform some experiments) make it challenging for researchers to collect sufficient experimental data for successful scientific discovery. To this end, we propose a collaborative active learning (CAL) framework that enables researchers to share their experimental data for mutual benefit. Specifically, our proposed coordinated acquisition function sets out to achieve individual rationality and fairness so that everyone can equitably benefit from collaboration. We empirically demonstrate that our method outperforms existing batch active learning methods (adapted to the CAL setting) in terms of both learning performance and fairness on various real-world scientific discovery datasets (biochemistry, materials science, and physics).
https://proceedings.mlr.press/v206/xu23e.htmlGroup Distributionally Robust Reinforcement Learning with Hierarchical Latent VariablesOne key challenge for multi-task Reinforcement learning (RL) in practice is the absence of task specifications. Robust RL has been applied to deal with task ambiguity but may result in over-conservative policies. To balance the worst-case (robustness) and average performance, we propose Group Distributionally Robust Markov Decision Process (GDR-MDP), a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model. GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups within an ambiguity set. We rigorously show that GDR-MDP’s hierarchical structure improves distributional robustness by adding regularization to the worst possible outcomes. We then develop deep RL algorithms for GDR-MDP for both value-based and policy-based RL methods. Extensive experiments on Box2D control tasks, MuJoCo benchmarks, and Google football platforms show that our algorithms outperform classic robust training algorithms across diverse environments in terms of robustness under belief uncertainties. Demos are available on our project page (https://sites.google.com/view/gdr-rl/home).Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/xu23d.html
Linear Convergence of Gradient Descent For Finite Width Over-parametrized Linear Networks With General Initialization
Recent theoretical analyses of the convergence of gradient descent (GD) to a global minimum for over-parametrized neural networks make strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (spectral, balanced). In this work, we relax these assumptions and derive a linear convergence rate for two-layer linear networks trained using GD on the squared loss in the case of finite step size, finite width and general initialization. Despite the generality of our analysis, our rate estimates are significantly tighter than those of prior work. Moreover, we provide a time-varying step size rule that monotonically improves the convergence rate as the loss function decreases to zero. Numerical experiments validate our findings.
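The setting above can be reproduced numerically. The sketch below is a toy illustration under assumed sizes and step size, not the paper's construction: gradient descent with a finite step size on a finite-width two-layer linear network from a generic (unbalanced, non-spectral) random initialization.

```python
import numpy as np

# Toy illustration: GD on a two-layer linear network f(x) = W2 @ W1 @ x,
# squared loss, finite width and step size, general random initialization.
# All sizes and constants are arbitrary choices for the sketch.
rng = np.random.default_rng(0)
d, h, n = 5, 20, 100                         # input dim, hidden width, samples
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)               # linear teacher targets

W1 = 0.5 * rng.standard_normal((h, d))       # unbalanced initialization
W2 = 0.5 * rng.standard_normal((1, h))
eta = 5e-3                                   # finite step size
losses = []
for _ in range(4000):
    r = X @ (W2 @ W1).ravel() - y            # residuals of the end-to-end map
    G = (X.T @ r / n)[None, :]               # dL/d(W2 W1), shape (1, d)
    W1, W2 = W1 - eta * (W2.T @ G), W2 - eta * (G @ W1.T)
    losses.append(0.5 * float(np.mean(r ** 2)))
```

In this toy run the loss decays geometrically to (numerical) zero, consistent with a linear convergence rate.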
https://proceedings.mlr.press/v206/xu23c.html
A Tale of Two Efficient Value Iteration Algorithms for Solving Linear MDPs with Large Action Space
Markov Decision Processes (MDPs) with large action spaces naturally occur in many applications such as language processing, information retrieval, and recommendation systems. There have been various approaches to solve these MDPs through value iteration (VI). Unfortunately, all VI algorithms require expensive linear scans over the entire action space for value function estimation during each iteration. To this end, we present two provable Least-Squares Value Iteration (LSVI) algorithms with runtime complexity sublinear in the number of actions for linear MDPs. We formulate the value function estimation procedure in VI as an approximate maximum inner product search problem and propose a Locality Sensitive Hashing (LSH) type data structure to solve this problem with sublinear time complexity. Our major contribution is combining the guarantees of approximate maximum inner product search with the regret analysis of reinforcement learning. We prove that, with the appropriate choice of approximation factor, there exists a sweet spot. Our proposed Sublinear LSVI algorithms maintain the same regret as the original LSVI algorithms while reducing the runtime complexity to sublinear in the number of actions. To the best of our knowledge, this is the first work that combines LSH with reinforcement learning that results in provable improvements. We hope that our novel way of combining data structures and the iterative algorithm will open the door for further study into the cost reduction in reinforcement learning.
https://proceedings.mlr.press/v206/xu23b.html
Finding Regularized Competitive Equilibria of Heterogeneous Agent Macroeconomic Models via Reinforcement Learning
We study a heterogeneous agent macroeconomic model with an infinite number of households and firms competing in a labor market. Each household earns income and engages in consumption at each time step while aiming to maximize a concave utility subject to the underlying market conditions. The households aim to find the optimal saving strategy that maximizes their discounted cumulative utility given the market condition, while the firms determine the market conditions through maximizing corporate profit based on the household population behavior. The model captures a wide range of applications in macroeconomic studies, and we propose a data-driven reinforcement learning framework that finds the regularized competitive equilibrium of the model. The proposed algorithm enjoys theoretical guarantees in converging to the equilibrium of the market at a sub-linear rate.
https://proceedings.mlr.press/v206/xu23a.html
Strong Lottery Ticket Hypothesis with $\varepsilon$–perturbation
The strong Lottery Ticket Hypothesis (LTH) (Ramanujan et al., 2019; Zhou et al., 2019) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization. In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change by SGD coincide with a good set of such perturbations? We answer the first question by first extending the theoretical result on the subset sum problem (Lueker, 1998) to allow perturbation on the candidates. Applying this result to the neural network setting, we show that by allowing $\varepsilon$-scale perturbation, we can reduce the over-parameterization requirement of the strong LTH by a factor of $O(1/(1+\varepsilon))$. To answer the second question, we show via experiments that the perturbed weight achieved by the projected SGD shows better performance under the strong LTH pruning.
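The subset-sum result underpinning strong-LTH arguments can be checked numerically. The sketch below is illustrative only (sizes are arbitrary): subset sums of $n$ i.i.d. uniform candidates approximate any target in a fixed range, with error that shrinks rapidly as $n$ grows.

```python
import numpy as np

# Brute-force check of the subset-sum approximation phenomenon (Lueker, 1998)
# behind strong-LTH proofs: enumerate all 2^n subset sums of n random
# candidates and measure how well they cover targets in [-0.5, 0.5].
rng = np.random.default_rng(0)
n = 16
vals = rng.uniform(-1.0, 1.0, n)
bits = (np.arange(1 << n)[:, None] >> np.arange(n)) & 1   # all subsets as 0/1 rows
sums = bits @ vals                                        # all 2^n subset sums
targets = rng.uniform(-0.5, 0.5, 100)
worst = max(float(np.min(np.abs(sums - t))) for t in targets)
```

With only 16 candidates, every target is already matched to within a small fraction of a percent, illustrating why modest over-parameterization suffices.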
https://proceedings.mlr.press/v206/xiong23a.html
Alternating Projected SGD for Equality-constrained Bilevel Optimization
Bilevel optimization, which captures the inherent nested structure of machine learning problems, is gaining popularity in many recent applications. Existing works on bilevel optimization mostly consider either the unconstrained problems or the constrained upper-level problems. In this context, this paper considers the stochastic bilevel optimization problems with equality constraints in both upper and lower levels. By leveraging the special structure of the equality constraints problem, the paper first presents an alternating projected SGD approach to tackle this problem and establishes the $\tilde{\cal O}(\epsilon^{-2})$ sample and iteration complexity that matches the state-of-the-art complexity of ALSET (Chen et al., 2021) for stochastic unconstrained bilevel problems. To further save the cost of projection, the paper presents an alternating projected SGD approach with lazy projection and establishes the $\tilde{\cal O}(\epsilon^{-2}/T)$ upper-level and $\tilde{\cal O}(\epsilon^{-1.5}/T^{\frac{3}{4}})$ lower-level projection complexity of this new algorithm, where $T$ is the upper-level projection interval. Application to federated bilevel optimization has been presented to showcase the performance of our algorithms. Our results demonstrate that equality-constrained bilevel optimization with strongly-convex lower-level problems can be solved as efficiently as stochastic single-level optimization problems.
https://proceedings.mlr.press/v206/xiao23a.html
Efficient Informed Proposals for Discrete Distributions via Newton’s Series Approximation
Gradients have been exploited in proposal distributions to accelerate the convergence of Markov chain Monte Carlo algorithms on discrete distributions. However, these methods require a natural differentiable extension of the target discrete distribution, which often does not exist or does not provide effective guidance. In this paper, we develop a gradient-like proposal for any discrete distribution without this strong requirement. Built upon a locally-balanced proposal, our method efficiently approximates the discrete likelihood ratio via Newton’s series expansion to enable a large and efficient exploration in discrete spaces. We show that our method can also be viewed as a multilinear extension, thus inheriting the desired properties. We prove that our method has a guaranteed convergence rate with or without the Metropolis-Hastings step. Furthermore, our method outperforms a number of popular alternatives in several different experiments, including the facility location problem, extractive text summarization, and image retrieval.
https://proceedings.mlr.press/v206/xiang23a.html
Implicit Graphon Neural Representation
Graphons are general and powerful models for generating graphs of varying size. In this paper, we propose to directly model graphons using neural networks, obtaining Implicit Graphon Neural Representation (IGNR). Existing work in modeling and reconstructing graphons often approximates a target graphon by a fixed resolution piece-wise constant representation. Our IGNR has the benefit that it can represent graphons up to arbitrary resolutions, and enables natural and efficient generation of arbitrary sized graphs with desired structure once the model is learned. Furthermore, we allow the input graph data to be unaligned and have different sizes by leveraging the Gromov-Wasserstein distance. We first demonstrate the effectiveness of our model by showing its superior performance on a graphon learning task. We then propose an extension of IGNR that can be incorporated into an auto-encoder framework, and demonstrate its good performance under a more general setting of graphon learning. We also show that our model is suitable for graph representation learning and graph generation.
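The generative mechanism a graphon defines is easy to sketch. Below, a closed-form toy graphon stands in for the learned IGNR network (this is for illustration only): sample latent node positions on $[0,1]$, then draw edges independently with probability $W(u,v)$, at any desired graph size.

```python
import numpy as np

# A graphon W : [0,1]^2 -> [0,1] generates graphs of any size: sample latent
# positions, then edges independently with probability W(u, v). The smooth
# closed-form W below is a toy stand-in, not a learned IGNR model.
rng = np.random.default_rng(0)

def W(u, v):
    return 0.25 * (u + v) + 0.25          # values in [0.25, 0.75]

def sample_graph(n, rng):
    lat = rng.uniform(0.0, 1.0, n)        # latent node positions
    P = W(lat[:, None], lat[None, :])     # edge probabilities
    A = (rng.uniform(size=(n, n)) < P).astype(int)
    A = np.triu(A, 1)                     # undirected, no self-loops
    return A + A.T

A_small, A_large = sample_graph(50, rng), sample_graph(400, rng)
density = A_large.sum() / (400 * 399)     # edge density of the larger graph
```

The same $W$ produces graphs of both sizes with matching edge density, which is the property IGNR preserves at arbitrary resolution.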
https://proceedings.mlr.press/v206/xia23b.html
Krylov–Bellman boosting: Super-linear policy evaluation in general state spaces
We present and analyze the Krylov–Bellman Boosting algorithm for policy evaluation in general state spaces. It alternates between fitting the Bellman residual using non-parametric regression (as in boosting), and estimating the value function via the least-squares temporal difference (LSTD) procedure applied with a feature set that grows adaptively over time. By exploiting the connection to Krylov methods, we equip this method with two attractive guarantees. First, we provide a general convergence bound that allows for separate estimation errors in residual fitting and LSTD computation. Consistent with our numerical experiments, this bound shows that convergence rates depend on the restricted spectral structure, and are typically super-linear. Second, by combining this meta-result with sample-size dependent guarantees for residual fitting and LSTD computation, we obtain concrete statistical guarantees that depend on the sample size along with the complexity of the function class used to fit the residuals. We illustrate the behavior of the KBB algorithm for various types of policy evaluation problems, and typically find large reductions in sample complexity relative to the standard approach of fitted value iteration.
https://proceedings.mlr.press/v206/xia23a.html
Indeterminacy in Generative Models: Characterization and Strong Identifiability
Most modern probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Different tasks tolerate different indeterminacies; however, recent applications have indicated the need for strongly identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, and recent work excludes many—but not all—indeterminacies. In this work, we motivate model-identifiability in terms of task-identifiability, then construct a theoretical framework for analyzing the indeterminacies of latent variable models, which enables their precise characterization in terms of the generator function and prior distribution spaces. We reveal that strong identifiability is possible even with highly flexible nonlinear generators, and give two such examples. One is a straightforward modification of iVAE (Khemakhem et al., 2020); the other uses triangular monotonic maps, leading to novel connections between optimal transport and identifiability.
https://proceedings.mlr.press/v206/xi23a.html
Score-based Quickest Change Detection for Unnormalized Models
Classical change detection algorithms typically require modeling pre-change and post-change distributions. The calculations may not be feasible for various machine learning models because of the complexity of computing the partition functions and normalized distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. In this paper, we develop a new variant of the classical Cumulative Sum (CUSUM) change detection, namely Score-based CUSUM (SCUSUM), based on Fisher divergence and the Hyvärinen score. Our method allows the applications of the quickest change detection for unnormalized distributions. We provide a theoretical analysis of the detection delay given the constraints on false alarms. We prove the asymptotic optimality of the proposed method in some particular cases. We also provide numerical experiments to demonstrate our method’s computation, performance, and robustness advantages.
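The CUSUM recursion the paper builds on can be sketched in a few lines. The toy below uses the exact log-likelihood ratio for a known Gaussian mean shift as the per-sample increment; SCUSUM's contribution is to replace this increment with one built from the Hyvärinen score, so that unnormalized pre/post-change models can be used.

```python
import numpy as np

# Classical CUSUM: W_t = max(0, W_{t-1} + z_t), alarm when W_t crosses a
# threshold. Here z_t is the log-likelihood ratio for a known N(0,1) ->
# N(1,1) mean shift; all constants are illustrative choices.
def cusum_stat(z):
    w, acc = np.empty_like(z), 0.0
    for t, zt in enumerate(z):
        acc = max(0.0, acc + zt)
        w[t] = acc
    return w

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300),   # pre-change samples
                    rng.normal(1.0, 1.0, 100)])  # change point at t = 300
z = x - 0.5                                      # LLR of N(1,1) vs N(0,1)
w = cusum_stat(z)
alarm = int(np.argmax(w >= 10.0))                # first threshold crossing
```

Before the change the statistic drifts back toward zero, so the alarm fires shortly after $t = 300$, illustrating the detection-delay vs. false-alarm trade-off controlled by the threshold.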
https://proceedings.mlr.press/v206/wu23b.html
Does Label Differential Privacy Prevent Label Inference Attacks?
Label differential privacy (label-DP) is a popular framework for training private ML models on datasets with public features and sensitive private labels. Despite its rigorous privacy guarantee, it has been observed that in practice label-DP does not preclude label inference attacks (LIAs): Models trained with label-DP can be evaluated on the public training features to recover, with high accuracy, the very private labels that they were designed to protect. In this work, we argue that this phenomenon is not paradoxical and that label-DP is designed to limit the advantage of an LIA adversary compared to predicting training labels using the Bayes classifier. At label-DP $\epsilon=0$ this advantage is zero, hence the optimal attack is to predict according to the Bayes classifier and is independent of the training labels. Our bound shows the semantic protection conferred by label-DP and gives guidelines on how to choose $\epsilon$ to limit the threat of LIAs below a certain level. Finally, we empirically demonstrate that our result closely captures the behavior of simulated attacks on both synthetic and real world datasets.
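The $\epsilon=0$ claim can be illustrated with randomized response, a standard label-DP mechanism (used here for illustration, not necessarily the paper's training pipeline): at $\epsilon=0$ the released labels are independent of the true ones, so no adversary can beat a coin flip.

```python
import numpy as np

# Randomized response for binary labels: keep the true label with
# probability e^eps / (1 + e^eps), else flip it. This satisfies
# eps-label-DP; at eps = 0 the output carries no label information.
def randomized_response(labels, eps, rng):
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(len(labels)) < p_keep
    return np.where(keep, labels, 1 - labels)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100_000)
# Advantage = how much the privatized labels agree with the truth beyond 1/2.
adv = {eps: abs(float(np.mean(randomized_response(y, eps, rng) == y)) - 0.5)
       for eps in (0.0, 2.0)}
```

At $\epsilon=0$ the advantage is statistically indistinguishable from zero, while at $\epsilon=2$ it is large, matching the guideline that $\epsilon$ controls the LIA threat level.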
https://proceedings.mlr.press/v206/wu23a.html
Using Sliced Mutual Information to Study Memorization and Generalization in Deep Neural Networks
In this paper, we study the memorization and generalization behaviour of deep neural networks (DNNs) using sliced mutual information (SMI), which is the average of the mutual information (MI) between one-dimensional random projections. We argue that the SMI between features in a DNN ($T$) and ground truth labels ($Y$), $SI(T;Y)$, can be seen as a form of usable information that the features contain about the labels. We show theoretically that $SI(T;Y)$ can encode geometric properties of the feature distribution, such as its spherical soft-margin and intrinsic dimensionality, in a way that MI cannot. Additionally, we present empirical evidence showing how $SI(T;Y)$ can capture memorization and generalization in DNNs. In particular, we find that, in the presence of label noise, all layers start to memorize but the earlier layers stabilize more quickly than the deeper layers. Finally, we point out that, in the context of Bayesian Neural Networks, the SMI between the penultimate layer and the output represents the worst case uncertainty of the network’s output.
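The definition of SMI suggests a simple plug-in estimator: average a one-dimensional MI estimate over random projections. The sketch below uses a binned MI estimator and arbitrary sizes; these are illustrative choices, not the estimators used in the paper's experiments.

```python
import numpy as np

# Plug-in SMI: average a binned MI estimate over random 1-D projections.
def binned_mi(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def sliced_mi(X, Y, n_proj=200, bins=16, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_proj):
        a = rng.standard_normal(X.shape[1]); a /= np.linalg.norm(a)
        b = rng.standard_normal(Y.shape[1]); b /= np.linalg.norm(b)
        vals.append(binned_mi(X @ a, Y @ b, bins))
    return float(np.mean(vals))

rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 3))
Y_dep = X + 0.1 * rng.standard_normal((5000, 3))   # strongly dependent pair
Y_ind = rng.standard_normal((5000, 3))             # independent pair
smi_dep, smi_ind = sliced_mi(X, Y_dep), sliced_mi(X, Y_ind)
```

The dependent pair yields a clearly larger SMI than the independent one, and each term only ever requires estimating MI in one dimension, which is what makes SMI usable where high-dimensional MI estimation is not.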
https://proceedings.mlr.press/v206/wongso23a.html
Acceleration of Frank-Wolfe Algorithms with Open-Loop Step-Sizes
Frank-Wolfe algorithms (FW) are popular first-order methods for solving constrained convex optimization problems that rely on a linear minimization oracle instead of potentially expensive projection-like oracles. Many works have identified accelerated convergence rates under various structural assumptions on the optimization problem and for specific FW variants when using line-search or short-step, requiring feedback from the objective function. Little is known about accelerated convergence regimes when utilizing open-loop step-size rules, a.k.a. FW with pre-determined step-sizes, which are algorithmically extremely simple and stable. Not only is FW with open-loop step-size rules not always subject to the same convergence rate lower bounds as FW with line-search or short-step, but in some specific cases, such as kernel herding in infinite dimensions, it has been empirically observed that FW with open-loop step-size rules leads to faster convergence than FW with line-search or short-step. We propose a partial answer to this unexplained phenomenon in kernel herding, characterize a general setting for which FW with open-loop step-size rules converges non-asymptotically faster than with line-search or short-step, and derive several accelerated convergence results for FW with open-loop step-size rules.
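Frank-Wolfe with the classic open-loop rule $\gamma_t = 2/(t+2)$ is short enough to sketch in full. The example below (a minimal sketch on the probability simplex, with an arbitrary quadratic objective) shows why these rules are called "algorithmically extremely simple": no line search, no objective feedback, just a linear minimization oracle.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, steps=2000):
    """Frank-Wolfe with the open-loop step size 2/(t+2) on the simplex.

    The linear minimization oracle over the probability simplex is the
    vertex e_i minimizing <grad(x), e_i>, i.e. the coordinate with the
    smallest partial derivative.
    """
    x = x0.copy()
    for t in range(steps):
        i = int(np.argmin(grad(x)))    # LMO: best simplex vertex
        v = np.zeros_like(x)
        v[i] = 1.0
        gamma = 2.0 / (t + 2.0)        # open-loop (pre-determined) step size
        x = (1 - gamma) * x + gamma * v
    return x

# Example: minimize f(x) = 0.5 * ||x - b||^2 over the simplex, with b
# itself feasible, so the optimum is b and the optimal value is 0.
b = np.array([0.2, 0.5, 0.3])
x_star = frank_wolfe_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]))
```

Every iterate is a convex combination of vertices, so feasibility is maintained for free, and the standard $O(1/t)$ primal gap bound applies without ever evaluating the objective.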
https://proceedings.mlr.press/v206/wirth23a.html
On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation
A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
https://proceedings.mlr.press/v206/winnicki23a.html
https://proceedings.mlr.press/v206/winnicki23a.htmlTensor-based Kernel Machines with Structured Inducing Points for Large and High-Dimensional DataKernel machines are one of the most studied family of methods in machine learning. In the exact setting, training requires to instantiate the kernel matrix, thereby prohibiting their application to large-sampled data. One popular kernel approximation strategy which allows to tackle large-sampled data consists in interpolating product kernels on a set of grid-structured inducing points. However, since the number of model parameters increases exponentially with the dimensionality of the data, these methods are limited to small-dimensional datasets. In this work we lift this limitation entirely by placing inducing points on a grid and constraining the primal weights to be a low-rank Canonical Polyadic Decomposition. We derive a block coordinate descent algorithm that efficiently exploits grid-structured inducing points. The computational complexity of the algorithm scales linearly both in the number of samples and in the dimensionality of the data for any product kernel. We demonstrate the performance of our algorithm on large-scale and high-dimensional data, achieving state-of-the art results on a laptop computer. Our results show that grid-structured approaches can work in higher-dimensional problems.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wesel23a.html
https://proceedings.mlr.press/v206/wesel23a.htmlConvex Bounds on the Softmax Function with Applications to Robustness VerificationThe softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well. This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models. We derive bounds using both a natural exponential-reciprocal decomposition of the softmax as well as an alternative decomposition in terms of the log-sum-exp function. The new bounds are provably and/or numerically tighter than linear bounds obtained in previous work on robustness verification of transformers. As illustrations of the utility of the bounds, we apply them to verification of transformers as well as of the robustness of predictive uncertainty estimates of deep ensembles.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wei23c.html
https://proceedings.mlr.press/v206/wei23c.htmlProvably Efficient Model-Free Algorithms for Non-stationary CMDPsWe study model-free reinforcement learning (RL) algorithms in episodic non-stationary constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a cumulative constraint on the expected utility (cost). In the non-stationary environment, the reward, utility functions, and the transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs in both tabular and linear function approximation settings with provable performance guarantees. Our results on regret bound and constraint violation for the tabular case match the corresponding best results for stationary CMDPs when the total budget is known. Additionally, we present a general framework for addressing with the well-known challenges associated with analyzing non-stationary CMDPs, without requiring prior knowledge of the variation budget. We apply the approach for both tabular and linear approximation settings.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wei23b.html
https://proceedings.mlr.press/v206/wei23b.htmlMean Parity Fair Regression in RKHSWe study the fair regression problem under the notion of Mean Parity (MP) fairness, which requires the conditional mean of the learned function output to be constant with respect to the sensitive attributes. We address this problem by leveraging reproducing kernel Hilbert space (RKHS) to construct the functional space whose members are guaranteed to satisfy the fairness constraints. The proposed functional space suggests a closed-form solution for the fair regression problem that is naturally compatible with multiple sensitive attributes. Furthermore, by formulating the fairness-accuracy tradeoff as a relaxed fair regression problem, we derive a corresponding regression function that can be implemented efficiently and provides interpretable tradeoffs. More importantly, under some mild assumptions, the proposed method can be applied to regression problems with a covariance-based notion of fairness. Experimental results on benchmark datasets show the proposed methods achieve competitive and even superior performance compared with several state-of-the-art methods.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wei23a.html
https://proceedings.mlr.press/v206/wei23a.htmlAdversarial Random Forests for Density Estimation and Generative ModelingWe propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying $R$ package, $arf$, is available on $CRAN$.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/watson23a.html
https://proceedings.mlr.press/v206/watson23a.htmlHuber-robust confidence sequencesConfidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p $>$ 1), but allowing for (at most) $\varepsilon$ fraction of arbitrary distribution corruption, as in Huber’s contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23p.html
https://proceedings.mlr.press/v206/wang23p.htmlCompositional Probabilistic and Causal Inference using Tractable Circuit ModelsProbabilistic circuits (PCs) are a class of tractable probabilistic models, which admit efficient inference routines depending on their structural properties. In this paper, we introduce md-vtrees, a novel structural formulation of (marginal) determinism in structured decomposable PCs, which generalizes previously proposed classes such as probabilistic sentential decision diagrams. Crucially, we show how md-vtrees can be used to derive tractability conditions and efficient algorithms for advanced inference queries expressed as arbitrary compositions of basic probabilistic operations, such as marginalization, multiplication and reciprocals, in a sound and generalizable manner. In particular, we derive the first polytime algorithms for causal inference queries such as backdoor adjustment on PCs. As a practical instantiation of the framework, we propose MDNets, a novel PC architecture using md-vtrees, and empirically demonstrate their application to causal inference.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23o.html
https://proceedings.mlr.press/v206/wang23o.htmlProbabilistic Conformal Prediction Using Conditional Random SamplesThis paper proposes probabilistic conformal prediction (PCP), a predictive inference algorithm that estimates a target variable by a discontinuous predictive set. Given inputs, PCP constructs the predictive set based on random samples from an estimated generative model. It is efficient and compatible with conditional generative models with either explicit or implicit density functions. We show that PCP guarantees correct marginal coverage with finite samples and give empirical evidence of conditional coverage. We study PCP on a variety of simulated and real datasets. Compared to existing conformal prediction methods, PCP provides sharper predictive sets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23n.html
https://proceedings.mlr.press/v206/wang23n.htmlOverparameterized Random Feature Regression with Nearly Orthogonal DataWe investigate the properties of random feature ridge regression (RFRR) given by a two-layer neural network with random Gaussian initialization. We study the non-asymptotic behaviors of the RFRR with nearly orthogonal deterministic unit-length input data vectors in the overparameterized regime, where the width of the first layer is much larger than the sample size. Our analysis shows high-probability non-asymptotic concentration results for the training errors, cross-validations, and generalization errors of RFRR centered around their respective values for a kernel ridge regression (KRR). This KRR is derived from an expected kernel generated by a nonlinear random feature map. We then approximate the performance of the KRR by a polynomial kernel matrix obtained from the Hermite polynomial expansion of the activation function, whose degree only depends on the orthogonality among different data points. This polynomial kernel determines the asymptotic behavior of the RFRR and the KRR. Our results hold for a wide variety of activation functions and input data sets that exhibit nearly orthogonal properties. Based on these approximations, we obtain a lower bound for the generalization error of the RFRR for a nonlinear student-teacher model.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23m.html
https://proceedings.mlr.press/v206/wang23m.htmlA Targeted Accuracy Diagnostic for Variational ApproximationsVariational Inference (VI) is an attractive alternative to Markov Chain Monte Carlo (MCMC) due to its computational efficiency in the case of large datasets and/or complex models with high-dimensional parameters. However, evaluating the accuracy of variational approximations remains a challenge. Existing methods characterize the quality of the whole variational distribution, which is almost always poor in realistic applications, even if specific posterior functionals such as the component-wise means or variances are accurate. Hence, these diagnostics are of practical value only in limited circumstances. To address this issue, we propose the“TArgeted Diagnostic for Distribution Approximation Accuracy” (TADDAA), which uses many short parallel MCMC chains to obtain lower bounds on the error of each posterior functional of interest. We also develop a reliability check for TADDAA to determine when the lower bounds should not be trusted. Numerical experiments validate the practical utility and computational efficiency of our approach on a range of synthetic distributions and real-data examples, including sparse logistic regression and Bayesian neural network models.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23l.html
https://proceedings.mlr.press/v206/wang23l.htmlRevisiting Weighted Strategy for Non-stationary Parametric BanditsNon-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity, including sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a refined analysis framework, which simplifies the derivation and importantly produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\tilde O(k_\mu^{\frac{5}{4}} c_\mu^{-\frac{3}{4}} d^{\frac{3}{4}} P_T^{\frac{1}{4}}T^{\frac{3}{4}})$ regret, improving the $\tilde O(k_\mu^{2} c_\mu^{-1}d^{\frac{9}{10}} P_T^{\frac{1}{5}}T^{\frac{4}{5}})$ bound in prior work, where $k_\mu$ and $c_\mu$ characterize the reward model’s nonlinearity, $P_T$ measures the non-stationarity, $d$ and $T$ denote the dimension and time horizon.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23k.html
https://proceedings.mlr.press/v206/wang23k.htmlIncremental Aggregated Riemannian Gradient Method for Distributed PCAWe consider the problem of distributed principal component analysis (PCA) where the data samples are dispersed across different agents. Despite the rich literature on this problem under various specific settings, there is still a lack of efficient algorithms that are amenable to decentralized and asynchronous implementations. In this paper, we extend the incremental aggregated gradient (IAG) method in convex optimization to the nonconvex PCA problems based on a Riemannian gradient-type method named IARG-PCA. The IARG-PCA method admits low per-iteration computational and communication cost and can be readily implemented in a decentralized and asynchronous manner. Moreover, we show that the IARG-PCA method converges linearly to the leading eigenvector of the sample covariance of the whole dataset with a constant step size. The iteration complexity coincides with the best-known result of the IAG method in terms of the linear dependence on the number of agents. Meanwhile, the communication complexity is much lower than that of state-of-the-art decentralized PCA algorithms if the eigengap of the sample covariance is moderate. Numerical experiments on synthetic and real datasets show that our IARG-PCA method exhibits substantially lower communication cost and comparable computational cost compared with other existing algorithms.Tue, 11 Apr 2023 00:00:00 +0000
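The flavor of an incremental aggregated Riemannian gradient iteration for the leading eigenvector can be sketched as follows: each "agent" holds a local covariance, one stored local gradient is refreshed per iteration, the step uses the (slightly stale) aggregate, and a retraction keeps the iterate on the sphere. The setup (diagonal covariances, step size, cyclic agent schedule) is an assumption for illustration, not the IARG-PCA specification:

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents, d = 5, 6
base = np.array([3.0, 1.0, 0.8, 0.6, 0.4, 0.2])   # spectrum of the global covariance
perts = rng.uniform(-0.1, 0.1, (n_agents, d))
perts -= perts.mean(axis=0)                       # perturbations cancel in the average
covs = [np.diag(base + perts[i]) for i in range(n_agents)]
# Average covariance is diag(base), so the leading eigenvector is e_0.

def riem_grad(Ci, v):
    # Riemannian gradient of the Rayleigh quotient: project C_i v onto the tangent space
    return (np.eye(d) - np.outer(v, v)) @ (Ci @ v)

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
stored = [riem_grad(Ci, v) for Ci in covs]        # per-agent stored gradients
agg = sum(stored)
eta = 0.1
for t in range(600):
    i = t % n_agents                              # refresh one agent per iteration
    g = riem_grad(covs[i], v)
    agg += g - stored[i]
    stored[i] = g
    v = v + eta * agg / n_agents                  # step with the aggregated gradient
    v /= np.linalg.norm(v)                        # retraction back to the sphere
```

Only one local gradient is recomputed per iteration, mirroring the low per-iteration cost the abstract describes; the iterate still converges to the leading eigenvector of the averaged covariance.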
https://proceedings.mlr.press/v206/wang23j.html
https://proceedings.mlr.press/v206/wang23j.htmlUncertainty-aware Unsupervised Video HashingLearning to hash has become popular for video retrieval due to its fast speed and low storage consumption. Previous efforts formulate video hashing as training a binary auto-encoder, for which noncontinuous latent representations are optimized by the biased straight-through (ST) back-propagation heuristic. We propose to formulate video hashing as learning a discrete variational auto-encoder with the factorized Bernoulli latent distribution, termed as Bernoulli variational auto-encoder (BerVAE). The corresponding evidence lower bound (ELBO) in our BerVAE implementation leads to closed-form gradient expression, which can be applied to achieve principled training along with some other unbiased gradient estimators. BerVAE enables uncertainty-aware video hashing by predicting the probability distribution of video hash code-words, thus providing reliable uncertainty quantification. Experiments on both simulated and real-world large-scale video data demonstrate that our BerVAE trained with unbiased gradient estimators can achieve the state-of-the-art retrieval performance. Furthermore, we show that quantified uncertainty is highly correlated to video retrieval performance, which can be leveraged to further improve the retrieval accuracy. Our code is available at https://github.com/wangyucheng1234/BerVAETue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23i.html
https://proceedings.mlr.press/v206/wang23i.htmlScalable Spectral Clustering with Group Fairness ConstraintsThere are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity, where each protected group is represented in each cluster with the same proportion as in the entirety. While the FairSC algorithm (Kleindessner et al., 2019) is able to find fairer clusterings, it is compromised by high computational costs. We develop a variant, called s-FairSC, that only involves sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that while it is comparable with FairSC in recovering fair clustering, s-FairSC is 12$\times$ faster than FairSC for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that the computational costs of s-FairSC only increase marginally compared to the SC without fairness constraints.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23h.html
https://proceedings.mlr.press/v206/wang23h.htmlReconstructing Training Data from Model Gradient, ProvablyUnderstanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this paper, we present a surprising result: Even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild assumptions: for shallow or deep neural networks and a wide range of activation functions. We also present a statistically and computationally efficient algorithm based on tensor decomposition to reconstruct the training data. As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy, especially in federated learning.Tue, 11 Apr 2023 00:00:00 +0000
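The simplest case of this phenomenon can be checked in a few lines: for a linear model with squared loss and one sample, a single gradient query at a random parameter is a scalar multiple of the sample itself, so the sample is revealed up to scale. This toy covers only the degenerate linear case; the paper's tensor-decomposition algorithm for neural networks is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(8)
d = 10
x = rng.standard_normal(d)            # the "private" training sample
y = 1.0                               # its label
w = rng.standard_normal(d)            # random query point -- no training involved

# Gradient of the squared loss 0.5 * (w.x - y)^2 with respect to w:
grad = (w @ x - y) * x                # a scalar multiple of the sample itself

x_hat = grad / np.linalg.norm(grad)   # recovered sample direction
cos = abs(x_hat @ x) / np.linalg.norm(x)
```

The recovered direction aligns with the private sample up to sign, with no training and a single gradient query, which is the intuition the identifiability result generalizes.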
https://proceedings.mlr.press/v206/wang23g.html
https://proceedings.mlr.press/v206/wang23g.htmlLOFT: Finding Lottery Tickets through Filter-wise TrainingRecent work on the Lottery Ticket Hypothesis (LTH) shows that there exist “winning tickets” in large neural networks. These tickets represent “sparse” versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to pretrain the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well-represents the model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the LOttery ticket through Filter-wise Training algorithm, dubbed LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that LoFT i) preserves and finds good lottery tickets, while ii) achieving non-trivial computation and communication savings, and maintaining comparable or even better accuracy than other pretraining methods.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23f.html
https://proceedings.mlr.press/v206/wang23f.htmlData Banzhaf: A Robust Data Valuation Framework for Machine LearningData valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.Tue, 11 Apr 2023 00:00:00 +0000
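The Maximum Sample Reuse principle is easy to state: draw random subsets by including each datapoint independently with probability 1/2, evaluate the utility once per subset, and reuse every evaluation in every datapoint's estimate. A toy sketch with an additive utility (a stand-in for model performance, chosen so the exact Banzhaf values are known):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
point_vals = np.arange(1.0, n + 1.0)

def utility(mask):
    # Stand-in for "model performance on a subset"; additive, so the exact
    # Banzhaf value of point i is point_vals[i].
    return float(point_vals[mask].sum())

def banzhaf_msr(utility, n, n_samples=4000):
    """Maximum Sample Reuse: one batch of random subsets serves all points."""
    masks = rng.random((n_samples, n)) < 0.5        # include each point w.p. 1/2
    utils = np.array([utility(m) for m in masks])
    est = np.empty(n)
    for i in range(n):
        # Mean utility of subsets containing i minus those excluding i.
        est[i] = utils[masks[:, i]].mean() - utils[~masks[:, i]].mean()
    return est

est = banzhaf_msr(utility, n)
```

Every sampled subset contributes to all n estimates, which is the computational advantage the abstract refers to; with a real learning algorithm, `utility` would retrain or score a model on the subset.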
https://proceedings.mlr.press/v206/wang23e.html
https://proceedings.mlr.press/v206/wang23e.htmlDifferentially Private Matrix Completion through Low-rank Matrix FactorizationWe study the matrix completion problem under joint differential privacy and develop a non-convex low-rank matrix factorization-based method for solving it. Our method comes with strong privacy and utility guarantees, has a linear convergence rate, and is more scalable than the best-known alternative (Chien et al., 2021). Our method achieves the (near) optimal sample complexity for matrix completion required by the non-private baseline and is much better than the best known result under joint differential privacy. Furthermore, we prove a tight utility guarantee that improves existing approaches and removes the impractical resampling assumption used in the literature. Numerical experiments further demonstrate the superiority of our method.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23d.html
https://proceedings.mlr.press/v206/wang23d.htmlToward Fairness in Text Generation via Mutual Information Minimization based on Importance SamplingPretrained language models (PLMs), such as GPT-2, have achieved remarkable empirical performance in text generation tasks. However, as PLMs are pretrained on large-scale natural language corpora, their generated text may exhibit social bias against disadvantaged demographic groups. To improve the fairness of PLMs in text generation, we propose to minimize the mutual information between the semantics in the generated text sentences and their demographic polarity, i.e., the demographic group to which the sentence is referring. In this way, the mentioning of a demographic group (e.g., male or female) is encouraged to be independent of how it is described in the generated text, thus effectively alleviating the social bias. Moreover, we propose to efficiently estimate the upper bound of the above mutual information via importance sampling, leveraging a natural language corpus. We also propose a distillation mechanism that preserves the language modeling ability of the PLMs after debiasing. Empirical results on real-world benchmarks demonstrate that the proposed method yields superior performance in terms of both fairness and language modeling ability.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23c.html
https://proceedings.mlr.press/v206/wang23c.htmlA Finite Sample Complexity Bound for Distributionally Robust Q-learningWe consider a reinforcement learning setting in which the deployment environment is different from the training environment. Applying a robust Markov decision processes formulation, we extend the distributionally robust Q-learning framework studied in Liu et al. (2022). Further, we improve the design and analysis of their multi-level Monte Carlo estimator. Assuming access to a simulator, we prove that the worst-case expected sample complexity of our algorithm to learn the optimal robust Q-function within an $\epsilon$ error in the sup norm is upper bounded by $\tilde O(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4})$, where $\gamma$ is the discount rate, $p_{\wedge}$ is the non-zero minimal support probability of the transition kernels and $\delta$ is the uncertainty size. This is the first sample complexity result for the model-free robust RL problem. Simulation studies further validate our theoretical results.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wang23b.html
https://proceedings.mlr.press/v206/wang23b.htmlRegularization for Shuffled Data Problems via Exponential Family Priors on the Permutation GroupIn the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observational unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooted, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this setting under the term “Shuffled Data” in which the underlying correct pairing of (X, Y)-pairs is represented via an unknown permutation. Explicit modeling of the permutation tends to be associated with overfitting, prompting the need for suitable methods of regularization. In this paper, we propose an exponential family prior on the permutation group for this purpose that can be used to integrate various structures such as sparse and local shuffling. This prior turns out to be conjugate for canonical shuffled data problems in which the likelihood conditional on a fixed permutation can be expressed as a product over the corresponding (X,Y)-pairs. Inference can be based on the EM algorithm in which the E-step is approximated by sampling, e.g., via the Fisher-Yates algorithm. The M-step is shown to admit a reduction from $n^2$ to $n$ terms if the likelihood of (X,Y)-pairs has exponential family form. Comparisons on synthetic and real data show that the proposed approach compares favorably to competing methods.Tue, 11 Apr 2023 00:00:00 +0000
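The uniform Fisher-Yates shuffle underlying the sampling-based E-step is a few lines; the paper's sampler targets the exponential-family prior rather than the uniform distribution, which this sketch does not reproduce:

```python
import random

def fisher_yates(n, rng=random):
    """Uniform random permutation of range(n) in O(n)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)              # uniform in {0, ..., i}
        perm[i], perm[j] = perm[j], perm[i]   # swap into the settled suffix
    return perm

random.seed(0)
```

Each of the n! permutations is equally likely because step i chooses uniformly among the i + 1 remaining candidates for position i.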
https://proceedings.mlr.press/v206/wang23a.html
https://proceedings.mlr.press/v206/wang23a.htmlTowards Scalable and Robust Structured Bandits: A Meta-Learning FrameworkOnline learning in large-scale structured bandits is known to be challenging due to the curse of dimensionality. In this paper, we propose a unified meta-learning framework for a wide class of structured bandit problems where the parameter space can be factorized to item-level, which covers many popular tasks. Compared with existing approaches, the proposed solution is both scalable to large systems and robust by utilizing a more flexible model. At the core of this framework is a Bayesian hierarchical model that allows information sharing among items via their features, upon which we design a meta Thompson sampling algorithm. Three representative examples are discussed thoroughly. Theoretical analysis and extensive numerical results both support the usefulness of the proposed method.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/wan23a.html
https://proceedings.mlr.press/v206/wan23a.htmlUnsupervised representation learning with recognition-parametrised probabilistic modelsWe introduce a new approach to probabilistic unsupervised learning based on the recognition-parametrised model (RPM): a normalised semi-parametric hypothesis class for joint distributions over observed and latent variables. Under the key assumption that observations are conditionally independent given latents, the RPM combines parametric prior and observation-conditioned latent distributions with non-parametric observation marginals. This approach leads to a flexible learnt recognition model capturing latent dependence between observations, without the need for an explicit, parametric generative model. The RPM admits exact maximum-likelihood learning for discrete latents, even for powerful neural network-based recognition. We develop effective approximations applicable in the continuous latent case. Experiments demonstrate the effectiveness of the RPM on high-dimensional data, learning image classification from weak indirect supervision; direct image-level latent Dirichlet allocation; and Recognition-Parametrised Gaussian Process Factor Analysis (RP-GPFA) applied to multi-factorial spatiotemporal datasets. The RPM provides a powerful framework to discover meaningful latent structure underlying observational data, a function critical to both animal and artificial intelligence.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/walker23a.html
https://proceedings.mlr.press/v206/walker23a.htmlComplex-to-Real Sketches for Tensor Products with Applications to the Polynomial KernelRandomized sketches of a tensor product of $p$ vectors follow a tradeoff between statistical efficiency and computational acceleration. Commonly used approaches avoid computing the high-dimensional tensor product explicitly, resulting in a suboptimal dependence of $O(3^p)$ in the embedding dimension. We propose a simple Complex-to-Real (CtR) modification of well-known sketches that replaces real random projections by complex ones, incurring a lower $O(2^p)$ factor in the embedding dimension. The output of our sketches is real-valued, which renders their downstream use straightforward. In particular, we apply our sketches to $p$-fold self-tensored inputs corresponding to the feature maps of the polynomial kernel. We show that our method achieves state-of-the-art performance in terms of accuracy and speed compared to other randomized approximations from the literature.Tue, 11 Apr 2023 00:00:00 +0000
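The Complex-to-Real idea can be illustrated on the degree-$p$ polynomial kernel: sketch each tensor factor with an independent complex Gaussian projection, multiply the factors, and take the real part of the inner product of two sketches, which is an unbiased estimate of $\langle x, y\rangle^p$. This is only a Monte Carlo unbiasedness check with arbitrary sizes, not the paper's structured, fast construction:

```python
import numpy as np

rng = np.random.default_rng(9)
d, p, m = 5, 2, 20000
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

def complex_gaussian(shape):
    # Standard complex Gaussian: E[w conj(w)^T] = I
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# One independent complex projection per tensor factor; their entrywise
# product sketches the p-fold tensor power of the input.
W = [complex_gaussian((m, d)) for _ in range(p)]

def sketch(v):
    out = np.ones(m, dtype=complex)
    for Wk in W:
        out *= Wk @ v
    return out

est = np.real(np.mean(sketch(x) * np.conj(sketch(y))))
exact = (x @ y) ** p          # the polynomial kernel value being approximated
```

Taking the real part keeps the estimator unbiased for real inputs while making the sketch output real-valued, the property the abstract highlights for downstream use.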
https://proceedings.mlr.press/v206/wacker23a.html
https://proceedings.mlr.press/v206/wacker23a.htmlLearning to Defer to Multiple Experts: Consistent Surrogate Losses, Confidence Calibration, and Conformal EnsemblesWe study the statistical properties of learning to defer (L2D) to multiple experts. In particular, we address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts. Firstly, we derive two consistent surrogates—one based on a softmax parameterization, the other on a one-vs-all (OvA) parameterization—that are analogous to the single expert losses proposed by Mozannar and Sontag (2020) and Verma and Nalisnick (2022), respectively. We then study the frameworks’ ability to estimate $P( m_j = y | x )$, the probability that the $j$th expert will correctly predict the label for $x$. Theory shows the softmax-based loss causes mis-calibration to propagate between the estimates while the OvA-based loss does not (though in practice, we find there are trade-offs). Lastly, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. We perform empirical validation on tasks for galaxy, skin lesion, and hate speech classification.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/verma23a.html
https://proceedings.mlr.press/v206/verma23a.htmlPointwise sampling uncertainties on the Precision-Recall curveQuoting robust uncertainties on machine learning (ML) model metrics, such as f1-score, precision, recall, etc., from sources of uncertainty such as data sampling, parameter initialization, and target labelling, is typically not done in the field of data science, even though these are essential for the proper interpretation and comparison of ML models. This text shows how to calculate and visualize the impact of one dominant source of uncertainty on each point of the Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves. This is particularly relevant for PR curves, where the joint uncertainty on recall and precision can be large and non-linear, especially at low recall. Four statistical methods to evaluate this uncertainty, both frequentist and Bayesian in origin, are compared in terms of coverage and speed. Of these, Wilks’ method offers the best balance of coverage and speed; all four are made available in an open-source toolbox.Tue, 11 Apr 2023 00:00:00 +0000
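As a point of reference for such pointwise uncertainties, a nonparametric bootstrap (resampling label-score pairs with replacement and recomputing the PR point) captures the joint variation of recall and precision. The bootstrap here is an illustrative baseline with synthetic scores, both assumptions; it is not necessarily one of the four methods compared:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
y = rng.random(n) < 0.3                      # true labels, ~30% positives
score = np.where(y, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))
thr = 0.5                                    # operating threshold

def pr_point(y, s, thr):
    pred = s >= thr
    tp = np.sum(pred & y)
    fp = np.sum(pred & ~y)
    fn = np.sum(~pred & y)
    return tp / (tp + fn), tp / (tp + fp)    # (recall, precision)

r0, p0 = pr_point(y, score, thr)
recalls, precisions = [], []
for _ in range(500):
    idx = rng.integers(0, n, n)              # resample pairs with replacement
    r, p = pr_point(y[idx], score[idx], thr)
    recalls.append(r); precisions.append(p)
r_lo, r_hi = np.percentile(recalls, [2.5, 97.5])
p_lo, p_hi = np.percentile(precisions, [2.5, 97.5])
```

Because each resample recomputes recall and precision from the same draw, the resulting cloud of (recall, precision) points reflects their joint, possibly non-linear sampling variation.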
https://proceedings.mlr.press/v206/urlus23a.html
https://proceedings.mlr.press/v206/urlus23a.htmlSafe Sequential Testing and Effect Estimation in Stratified Count DataSequential decision making significantly speeds up research and is more cost-effective compared to fixed-n methods. We present a method for sequential decision making for stratified count data that retains Type-I error guarantees or false discovery rate control under optional stopping, using e-variables. We invert the method to construct stratified anytime-valid confidence sequences, where cross-talk between subpopulations in the data can be allowed during data collection to improve power. Finally, we combine information collected in separate subpopulations through pseudo-Bayesian averaging and switching to create effective estimates for the minimal, mean and maximal treatment effects in the subpopulations.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/turner23a.html
https://proceedings.mlr.press/v206/turner23a.htmlFurther Adaptive Best-of-Both-Worlds Algorithm for Combinatorial Semi-BanditsWe consider the combinatorial semi-bandit problem and present a new algorithm with a best-of-both-worlds regret guarantee; the regrets are bounded near-optimally in the stochastic and adversarial regimes. In the stochastic regime, we prove a variance-dependent regret bound depending on the tight suboptimality gap introduced by Kveton et al. (2015) with a good leading constant. In the adversarial regime, we show that the same algorithm simultaneously obtains various data-dependent regret bounds. Our algorithm is based on the follow-the-regularized-leader framework with a refined regularizer and adaptive learning rate. Finally, we numerically test the proposed algorithm and confirm its superior or competitive performance over existing algorithms, including Thompson sampling under most settings.Tue, 11 Apr 2023 00:00:00 +0000
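For intuition about the follow-the-regularized-leader template: with the negative Shannon entropy regularizer, importance-weighted loss estimates, and a fixed learning rate, FTRL for the plain multi-armed bandit reduces to exponential weights (EXP3). The sketch below is that reduction only; the paper's refined regularizer, adaptive learning rate, and combinatorial action set are not reproduced:

```python
import numpy as np

rng = np.random.default_rng(7)
K, T = 5, 5000
means = np.array([0.2, 0.25, 0.3, 0.35, 0.8])   # arm 4 has the highest reward
eta = np.sqrt(np.log(K) / (K * T))              # fixed rate for simplicity

L = np.zeros(K)                                 # cumulative importance-weighted losses
pulls = np.zeros(K, dtype=int)
for t in range(T):
    p = np.exp(-eta * (L - L.min()))            # FTRL with entropy regularizer
    p /= p.sum()
    a = rng.choice(K, p=p)
    loss = float(rng.random() > means[a])       # Bernoulli loss with mean 1 - means[a]
    L[a] += loss / p[a]                         # unbiased loss estimate
    pulls[a] += 1
```

In this stochastic instance the play concentrates on the best arm; the best-of-both-worlds analysis shows how a refined regularizer and adaptive learning rate make one algorithm near-optimal in both the stochastic and adversarial regimes.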
https://proceedings.mlr.press/v206/tsuchiya23a.html
https://proceedings.mlr.press/v206/tsuchiya23a.htmlDeep equilibrium models as estimators for continuous latent variablesPrincipal Component Analysis (PCA) and its exponential family extensions have three components: observations, latents and parameters of a linear transformation. We consider a generalised setting where the canonical parameters of the exponential family are a nonlinear transformation of the latents. We show explicit relationships between particular neural network architectures and the corresponding statistical models. We find that deep equilibrium models — a recently introduced class of implicit neural networks — solve maximum a-posteriori (MAP) estimates for the latents and parameters of the transformation. Our analysis provides a systematic way to relate activation functions, dropout, and layer structure, to statistical assumptions about the observations, thus providing foundational principles for unsupervised DEQs. For hierarchical latents, individual neurons can be interpreted as nodes in a deep graphical model. Our DEQ feature maps are end-to-end differentiable, enabling fine-tuning for downstream tasks.Tue, 11 Apr 2023 00:00:00 +0000
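A deep equilibrium model computes its output as a fixed point $z^* = f(z^*, x)$ of one transformation instead of stacking layers; when $f$ is a contraction, naive fixed-point iteration already converges. A minimal sketch with arbitrary weights (the paper's MAP-estimation interpretation is not shown here):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)   # small spectral norm => contraction
U = rng.standard_normal((d, d))
x = rng.standard_normal(d)

def f(z):
    """One implicit 'layer': the DEQ output is a fixed point z* = f(z*)."""
    return np.tanh(W @ z + U @ x)

z = np.zeros(d)
for _ in range(100):                                 # naive fixed-point iteration
    z = f(z)
residual = np.linalg.norm(f(z) - z)
```

Because tanh is 1-Lipschitz and the weight matrix has small norm, the iteration contracts and the residual shrinks geometrically; in the paper's reading, such fixed points coincide with MAP estimates of the latents under corresponding statistical assumptions.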
https://proceedings.mlr.press/v206/tsuchida23a.html
https://proceedings.mlr.press/v206/tsuchida23a.htmlMulti-Agent congestion cost minimization with linear function approximationsThis work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent’s objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of cost function parameters, the MAEVI algorithm, and analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two-node network with multiple links to validate it. We first identify the optimal policy, i.e., the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work is a generalization of learning the stochastic shortest path problem.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/trivedi23a.html
https://proceedings.mlr.press/v206/trivedi23a.htmlLearning Treatment Effects from Observational and Experimental DataDecision making often depends on causal effect estimation. For example, clinical decisions are often based on estimates of the probability of post-treatment outcomes. Experimental data from randomized controlled trials allow for unbiased estimation of these probabilities. However, such data are usually limited in the number of samples and the set of measured covariates. Observational data, such as electronic medical records, contain many more samples and a richer set of measured covariates, which can be used to estimate more personalized treatment effects; however, these estimates may be biased due to latent confounding. In this work, we propose a Bayesian method for combining observational and experimental data for unbiased conditional treatment effect estimation. Our method addresses the following question: Given observational data $D_o$ measuring a set of covariates $\mathbf V$, and experimental data $D_e$ measuring a possibly smaller set of covariates $\mathbf{V_b}\subseteq \mathbf{V}$, which set of covariates $\mathbf{Z}$ leads to the optimal, unbiased prediction of the post-intervention outcome $P(Y |do(X), \mathbf{Z})$, and when can we use observational data for this estimation? In simulated data, we show that our method improves the prediction of post-intervention outcomes.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/triantafillou23a.html
https://proceedings.mlr.press/v206/triantafillou23a.htmlEfficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement LearningA practical challenge in reinforcement learning is combinatorial action spaces, which make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/tkachuk23a.html
https://proceedings.mlr.press/v206/tkachuk23a.htmlGradient-Informed Neural Network Statistical Robustness EstimationDeep neural networks are robust against random corruptions of the inputs to some extent. This global sense of safety is not sufficient in critical applications where probabilities of failure must be assessed with accuracy. Some previous works applied known statistical methods from the field of rare event analysis to classification. Yet, they use classifiers as black-box models without taking into account gradient information, readily available for deep learning models via auto-differentiation. We propose a new and highly efficient estimator of probabilities of failure dedicated to neural networks as it leverages the fast computation of gradients of the model through back-propagation.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/tit23a.html
https://proceedings.mlr.press/v206/tit23a.htmlOn the Complexity of Representation Learning in Contextual Linear BanditsIn contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, thus leading to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g., model-selection techniques or algorithms for learning with arbitrary function classes) or specialized to particular structures (e.g., nested features or representations with certain spectral properties). As a result, the understanding of the cost of representation learning in contextual linear bandits is still limited. In this paper, we take a systematic approach to the problem and provide a comprehensive study through an instance-dependent perspective. We show that representation learning is fundamentally more complex than linear bandits (i.e., learning with a given representation). In particular, learning with a given set of representations is never simpler than learning with the worst realizable representation in the set, while we show cases where it can be arbitrarily harder. We complement this result with an extensive discussion of how it relates to existing literature and we illustrate positive instances where representation learning is as complex as learning with a fixed representation and where sub-logarithmic regret is achievable.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/tirinzoni23a.html
https://proceedings.mlr.press/v206/tirinzoni23a.htmlMode-Seeking Divergences: Theory and Applications to GANsGenerative adversarial networks (GANs) represent a game between two neural network machines designed to learn the distribution of data. It is commonly observed that different GAN formulations and divergence/distance measures used could lead to considerably different performance results, especially when the data distribution is multi-modal. In this work, we give a theoretical characterization of the mode-seeking behavior of general f-divergences and Wasserstein distances, and prove a performance guarantee for the setting where the underlying model is a mixture of multiple symmetric quasiconcave distributions. This can help us understand the trade-off between the quality and diversity of the trained GANs’ output samples. Our theoretical results show the mode-seeking nature of the Jensen-Shannon (JS) divergence over standard KL-divergence and Wasserstein distance measures. We subsequently demonstrate that a hybrid of JS-divergence and Wasserstein distance measures minimized by Lipschitz GANs mimics the mode-seeking behavior of the JS-divergence. We present numerical results showing the mode-seeking nature of the JS-divergence and its hybrid with the Wasserstein distance while highlighting the mode-covering properties of KL-divergence and Wasserstein distance measures. Our numerical experiments indicate the different behavior of several standard GAN formulations in application to benchmark Gaussian mixture and image datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ting-li23a.html
https://proceedings.mlr.press/v206/ting-li23a.htmlRethinking Initialization of the Sinkhorn AlgorithmWhile the optimal transport (OT) problem was originally formulated as a linear program, the addition of entropic regularization has proven beneficial both computationally and statistically, for many applications. The Sinkhorn fixed-point algorithm is the most popular approach to solve this regularized problem, and, as a result, multiple attempts have been made to reduce its runtime using, e.g., annealing in the regularization parameter, momentum or acceleration. The premise of this work is that initialization of the Sinkhorn algorithm has received comparatively little attention, possibly due to two preconceptions: since the regularized OT problem is convex, it may not be worth crafting a good initialization, since any is guaranteed to work; secondly, because the outputs of the Sinkhorn algorithm are often unrolled in end-to-end pipelines, a data-dependent initialization would bias Jacobian computations. We challenge this conventional wisdom, and show that data-dependent initializers result in dramatic speed-ups, with no effect on differentiability as long as implicit differentiation is used. Our initializations rely on closed-forms for exact or approximate OT solutions that are known in the 1D, Gaussian or GMM settings. They can be used with minimal tuning, and result in consistent speed-ups for a wide variety of OT problems.Tue, 11 Apr 2023 00:00:00 +0000
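The payoff of initialization is easy to demonstrate: warm-starting the Sinkhorn scaling vector from the potentials of a nearby, already-solved problem cuts the iteration count. The paper derives closed-form data-dependent initializers (1D, Gaussian, GMM settings); the perturbed-problem warm start below is a stand-in assumption to show the effect:

```python
import numpy as np

rng = np.random.default_rng(6)
n, eps = 60, 0.1
x = np.sort(rng.random(n)); y = np.sort(rng.random(n))
K = np.exp(-(x[:, None] - y[None, :]) ** 2 / eps)   # Gibbs kernel exp(-C/eps)
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)

def sinkhorn(K, a, b, v0, tol=1e-9, max_iter=50000):
    """Classic Sinkhorn scaling iterations; returns the iteration count."""
    v = v0.copy()
    for it in range(1, max_iter + 1):
        u = a / (K @ v)
        v = b / (K.T @ u)
        if np.max(np.abs(u * (K @ v) - a)) < tol:   # row-marginal error
            return u, v, it
    return u, v, max_iter

u1, v1, _ = sinkhorn(K, a, b, np.ones(n))           # solve a nearby problem once
b2 = b * (1 + 0.01 * rng.standard_normal(n))
b2 /= b2.sum()                                      # slightly perturbed target marginal
_, _, it_cold = sinkhorn(K, a, b2, np.ones(n))      # default initialization
_, _, it_warm = sinkhorn(K, a, b2, v1)              # data-dependent warm start
```

On this instance the warm start reaches the same marginal tolerance in fewer iterations than the default all-ones initialization; as the abstract notes, with implicit differentiation such data-dependent initializations do not bias Jacobian computations.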
https://proceedings.mlr.press/v206/thornton23a.html
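To make the role of initialization concrete, here is a minimal numpy sketch of the Sinkhorn fixed-point iteration with an optional warm-start scaling vector. The function name, the uniform default, and the stopping rule are illustrative assumptions, not the paper's implementation; the closed-form 1D/Gaussian warm starts studied in the paper would be plugged in via `u0`.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, u0=None, n_iter=5000, tol=1e-10):
    """Sinkhorn fixed-point iterations for entropic OT.

    a, b: marginal weights; C: cost matrix (rescaled to [0, 1]);
    u0: optional data-dependent warm start for the row scaling.
    """
    K = np.exp(-C / eps)                      # Gibbs kernel
    u = np.ones_like(a) if u0 is None else u0.copy()
    for _ in range(n_iter):
        v = b / (K.T @ u)                     # column scaling update
        u_new = a / (K @ v)                   # row scaling update
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return u[:, None] * K * v[None, :]        # transport plan

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=50))
y = np.sort(rng.normal(loc=1.0, size=50))
C = (x[:, None] - y[None, :]) ** 2
C = C / C.max()
a = b = np.full(50, 1.0 / 50)
P = sinkhorn(a, b, C)
# at convergence, the plan's marginals recover a and b
assert np.allclose(P.sum(axis=1), a)
assert np.allclose(P.sum(axis=0), b, atol=1e-6)
```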

EGG-GAE: scalable graph neural networks for tabular data imputation
Missing data imputation (MDI) is crucial when dealing with tabular datasets across various domains. Autoencoders can be trained to reconstruct missing values, and graph autoencoders (GAE) can additionally consider similar patterns in the dataset when imputing new values for a given instance. However, previously proposed GAEs suffer from scalability issues, requiring the user to define a similarity metric among patterns to build the graph connectivity beforehand. In this paper, we leverage recent progress in latent graph learning to propose a novel EdGe Generation Graph AutoEncoder (EGG-GAE) for missing data imputation that overcomes these two drawbacks. EGG-GAE works on randomly sampled mini-batches of the input data (hence scaling to larger datasets), and it automatically infers the best connectivity across the mini-batch for each architecture layer. We also experiment with several extensions, including an ensemble strategy for inference and the inclusion of what we call prototype nodes, obtaining significant improvements, both in terms of imputation error and final downstream accuracy, across multiple benchmarks and baselines.
https://proceedings.mlr.press/v206/telyatnikov23a.html

No-regret Sample-efficient Bayesian Optimization for Finding Nash Equilibria with Unknown Utilities
The Nash equilibrium (NE) is a classic solution concept for normal-form games that is stable under potential unilateral deviations by self-interested agents. Bayesian optimization (BO) has been used to find NE in continuous general-sum games with unknown costly-to-sample utility functions in a sample-efficient manner. This paper presents the first no-regret BO algorithm that is sample-efficient in finding pure NE by leveraging theory on high probability confidence bounds with Gaussian processes and the maximum information gain of kernel functions. Unlike previous works, our algorithm is theoretically guaranteed to converge to the optimal solution (i.e., NE). We also introduce the novel setting of applying BO to finding mixed NE in unknown discrete general-sum games and show that our theoretical framework is general enough to be extended naturally to this setting by developing a no-regret BO algorithm that is sample-efficient in finding mixed NE. We empirically show that our algorithms are competitive w.r.t. suitable baselines in finding NE.
https://proceedings.mlr.press/v206/tay23a.html

Manifold Restricted Interventional Shapley Values
Shapley values provide a model-agnostic approach to explaining model predictions. Many commonly used methods of computing Shapley values, known as off-manifold methods, rely on model evaluations on out-of-distribution input samples. Consequently, explanations obtained are sensitive to model behaviour outside the data distribution, which may be irrelevant for all practical purposes. While on-manifold methods have been proposed which do not suffer from this problem, we show that such methods are overly dependent on the input data distribution, and therefore result in unintuitive and misleading explanations. To circumvent these problems, we propose ManifoldShap, which respects the model’s domain of validity by restricting model evaluations to the data manifold. We show, theoretically and empirically, that ManifoldShap is robust to off-manifold perturbations of the model and leads to more accurate and intuitive explanations than existing state-of-the-art Shapley methods.
https://proceedings.mlr.press/v206/taufiq23a.html
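For reference, the interventional Shapley values that ManifoldShap builds on can be computed exactly by subset enumeration for small feature counts. A minimal sketch, assuming a linear model so the result can be checked against the known closed form phi_i = w_i (x_i − E[X_i]); this illustrates the baseline off-manifold estimator discussed above, not ManifoldShap itself.

```python
import numpy as np
from itertools import combinations
from math import factorial

def interventional_shapley(f, x, background):
    """Exact interventional Shapley values by subset enumeration (O(2^d))."""
    d = len(x)
    def value(S):
        # v(S): average f over background rows with features in S pinned to x
        Z = background.copy()
        Z[:, list(S)] = x[list(S)]
        return f(Z).mean()
    phi = np.zeros(d)
    for i in range(d):
        rest = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(d - k - 1) / factorial(d)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

rng = np.random.default_rng(1)
w = np.array([1.0, -2.0, 0.5])
f = lambda Z: Z @ w                      # toy linear model
background = rng.normal(size=(200, 3))
x = np.array([1.0, 1.0, 1.0])
phi = interventional_shapley(f, x, background)
# for a linear model, phi_i = w_i * (x_i - mean(background_i))
assert np.allclose(phi, w * (x - background.mean(axis=0)))
```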

Wasserstein Distributional Learning via Majorization-Minimization
Learning function-on-scalar predictive models for conditional densities and identifying factors that influence the entire probability distribution are vital tasks in many data-driven applications. We present an efficient Majorization-Minimization optimization algorithm, Wasserstein Distributional Learning (WDL), that trains Semi-parametric Conditional Gaussian Mixture Models (SCGMM) for conditional density functions and uses the Wasserstein distance $W_2$ as a proper metric for the space of density outcomes. We further provide theoretical convergence guarantees and illustrate the algorithm using boosted machines. Experiments on synthetic data and real-world applications demonstrate the effectiveness of the proposed WDL algorithm.
https://proceedings.mlr.press/v206/tang23b.html

Minimax Nonparametric Two-Sample Test under Adversarial Losses
In this paper, we consider the problem of two-sample hypothesis testing that aims at detecting the difference between two probability densities based on finite samples. The proposed test statistic is constructed by first truncating a sample version of a negative Besov norm and then normalizing it. Here, the negative Besov norm is the norm associated with a Besov space with negative exponent, and is shown to be closely related to a class of commonly used adversarial losses (or integral probability metrics) with smooth discriminators. Theoretically, we characterize the optimal detection boundary of two-sample testing in terms of the dimensionalities and smoothness levels of the underlying densities and the discriminator class defining the adversarial loss. We also show that the proposed approach can simultaneously attain the optimal detection boundary under many common adversarial losses, including those induced by the $\ell_1$, $\ell_2$ distances and Wasserstein distances. Our numerical experiments show that the proposed test procedure tends to exhibit higher power and robustness in difference detection than existing state-of-the-art competitors.
https://proceedings.mlr.press/v206/tang23a.html

A Blessing of Dimensionality in Membership Inference through Regularization
Is overparameterization a privacy liability? In this work, we study the effect that the number of parameters has on a classifier’s vulnerability to membership inference attacks. We first demonstrate how the number of parameters of a model can induce a privacy-utility trade-off: increasing the number of parameters generally improves generalization performance at the expense of lower privacy. However, remarkably, we then show that if coupled with proper regularization, increasing the number of parameters of a model can actually simultaneously increase both its privacy and performance, thereby eliminating the privacy-utility trade-off. Theoretically, we demonstrate this curious phenomenon for logistic regression with ridge regularization in a bi-level feature ensemble setting. Pursuant to our theoretical exploration, we develop a novel leave-one-out analysis tool to precisely characterize the vulnerability of a linear classifier to the optimal membership inference attack. We empirically exhibit this “blessing of dimensionality” for neural networks on a variety of tasks using early stopping as the regularizer.
https://proceedings.mlr.press/v206/tan23b.html

Mixed Linear Regression via Approximate Message Passing
In mixed linear regression, each observation comes from one of $L$ regression vectors (signals), but we do not know which one. The goal is to estimate the signals from the unlabeled observations. We propose a novel approximate message passing (AMP) algorithm for estimation and rigorously characterize its performance in the high-dimensional limit. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic mean-squared error. This can be used to tailor the AMP algorithm to take advantage of any known structural information about the signals. Using state evolution, we derive an optimal choice of AMP ‘denoising’ functions that minimizes the estimation error in each iteration. Numerical simulations are provided to validate the theoretical results, and show that AMP significantly outperforms other estimators including spectral methods, expectation maximization, and alternating minimization. Though our numerical results focus on mixed linear regression, the proposed AMP algorithm can be applied to a broader class of models including mixtures of generalized linear models and max-affine regression.
https://proceedings.mlr.press/v206/tan23a.html

Deep Grey-Box Modeling With Adaptive Data-Driven Models Toward Trustworthy Estimation of Theory-Driven Models
The combination of deep neural nets and theory-driven models (deep grey-box models) can be advantageous due to the inherent robustness and interpretability of the theory-driven part. Deep grey-box models are usually learned via regularized risk minimization to prevent the theory-driven part from being overwritten and ignored by a deep neural net. However, an estimate of the theory-driven part obtained by uncritically optimizing a regularizer can hardly be trustworthy if we are not sure which regularizer is suitable for the given data, which may affect interpretability. Toward a trustworthy estimation of the theory-driven part, we should analyze the behavior of regularizers to compare different candidates and to justify a specific choice. In this paper, we present a framework that allows us to empirically analyze the behavior of a regularizer with a slight change in the architecture of the neural net and the training objective.
https://proceedings.mlr.press/v206/takeishi23a.html

The Power of Recursion in Graph Neural Networks for Counting Substructures
To achieve a graph representation, most Graph Neural Networks (GNNs) follow two steps: first, each graph is decomposed into a number of subgraphs (which we call the recursion step), and then the collection of subgraphs is encoded by several iterative pooling steps. While recently proposed higher-order networks show a remarkable increase in the expressive power through a single recursion on larger neighborhoods followed by iterative pooling, the power of deeper recursion in GNNs without any iterative pooling is still not fully understood. To make it concrete, we consider a pure recursion-based GNN which we call Recursive Neighborhood Pooling GNN (RNP-GNN). The expressive power of an RNP-GNN and its computational cost quantifies the power of (pure) recursion for a graph representation network. We quantify the power by means of counting substructures, which is one main limitation of the Message Passing graph Neural Networks (MPNNs), and show how RNP-GNN can exploit the sparsity of the underlying graph to achieve low-cost powerful representations. We also compare the recent lower bounds on the time complexity and show how recursion-based networks are near optimal.
https://proceedings.mlr.press/v206/tahmasebi23a.html

On Generalization of Decentralized Learning with Separable Data
Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds approximately match in order their centralized counterparts. Key to this, and of independent interest, are novel bounds on the training loss and the rate of consensus of DGD for a class of self-bounded losses. Finally, on the algorithmic front, we design improved gradient-based routines for decentralized learning with separable data and empirically demonstrate orders-of-magnitude speed-ups in terms of both training and generalization performance.
https://proceedings.mlr.press/v206/taheri23a.html

Retrospective Uncertainties for Deep Models using Vine Copulas
Despite the major progress of deep models as learning machines, uncertainty estimation remains a major challenge. Existing solutions rely on modified loss functions or architectural changes. We propose to compensate for the lack of built-in uncertainty estimates by supplementing any network, retrospectively, with a subsequent vine copula model, in a compound model we call Vine-Copula Neural Network (VCNN). Through synthetic and real-data experiments, we show that VCNNs could be task (regression/classification) and architecture (recurrent, fully connected) agnostic while providing reliable and better-calibrated uncertainty estimates, comparable to state-of-the-art built-in uncertainty solutions.
https://proceedings.mlr.press/v206/tagasovska23a.html

Posterior Tracking Algorithm for Classification Bandits
The classification bandit problem aims to determine whether a given set of $K$ arms contains at least $L$ good arms or not. Here, an arm is said to be good if its expected reward is no less than a specified threshold. To solve this problem, we introduce an asymptotically optimal algorithm, named P-tracking, based on posterior sampling. Unlike previous asymptotically optimal algorithms that require solving a linear programming problem with an exponentially large number of constraints, P-tracking solves an equivalent optimization problem that can be computed in time linear in $K$. Additionally, unlike existing algorithms, P-tracking does not require forced exploration steps. Empirical results show that P-tracking outperforms existing algorithms in sample efficiency.
https://proceedings.mlr.press/v206/tabata23a.html

Smoothly Giving up: Robustness for Simple Models
There is a growing need for models that are interpretable and have reduced energy/computational cost (e.g., in health care analytics and federated learning). Examples of algorithms to train such models include logistic regression and boosting. However, one challenge facing these algorithms is that they provably suffer from label noise; this has been attributed to the joint interaction between oft-used convex loss functions and simpler hypothesis classes, resulting in too much emphasis being placed on outliers. In this work, we use the margin-based $\alpha$-loss, which continuously tunes between canonical convex and quasi-convex losses, to robustly train simple models. We show that the $\alpha$ hyperparameter smoothly introduces non-convexity and offers the benefit of “giving up” on noisy training examples. We also provide results on the Long-Servedio dataset for boosting and a COVID-19 survey dataset for logistic regression, highlighting the efficacy of our approach across multiple relevant domains.
https://proceedings.mlr.press/v206/sypherd23a.html

Discrete Langevin Samplers via Wasserstein Gradient Flow
It is known that gradient based MCMC samplers for continuous spaces, such as Langevin Monte Carlo (LMC), can be derived as particle versions of a gradient flow that minimizes KL divergence on a Wasserstein manifold. The superior efficiency of such samplers has motivated several recent attempts to generalize LMC to discrete spaces. However, a fully principled extension of Langevin dynamics to discrete spaces has yet to be achieved, due to the lack of well-defined gradients in the sample space. In this work, we show how the Wasserstein gradient flow can be generalized naturally to discrete spaces. Given the proposed formulation, we demonstrate how a discrete analogue of Langevin dynamics can subsequently be developed. With this new understanding, we reveal how recent gradient-based samplers in discrete space can be obtained as special cases by choosing particular discretizations. More importantly, the framework also allows for the derivation of novel algorithms, one of which, discrete Langevin Monte Carlo (DLMC), is obtained by a factorized estimate of the transition matrix. The DLMC method admits a convenient parallel implementation and time-uniform sampling that achieves larger jump distances. We demonstrate the advantages of DLMC for sampling and learning in various binary and categorical distributions.
https://proceedings.mlr.press/v206/sun23f.html
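For context, the continuous-space baseline that the abstract generalizes is the unadjusted Langevin algorithm (ULA), which discretizes Langevin dynamics as x ← x + ε ∇log π(x) + sqrt(2ε) ξ. A minimal 1D sketch targeting the standard normal; the step size and chain length are illustrative choices, and this is the classical LMC scheme, not the paper's discrete DLMC.

```python
import numpy as np

def ula(grad_logp, x0, eps=0.01, n_steps=100000, seed=0):
    """Unadjusted Langevin algorithm: Euler discretization of Langevin dynamics."""
    rng = np.random.default_rng(seed)
    noise = np.sqrt(2.0 * eps) * rng.normal(size=n_steps)
    samples = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = x + eps * grad_logp(x) + noise[t]   # drift toward high density + diffusion
        samples[t] = x
    return samples

# target N(0, 1): grad log pi(x) = -x
samples = ula(lambda x: -x, x0=3.0)
tail = samples[10000:]                          # discard burn-in
assert abs(tail.mean()) < 0.2                   # stationary mean near 0
assert 0.8 < tail.var() < 1.2                   # stationary variance near 1
```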

A Unified Perspective on Regularization and Perturbation in Differentiable Subset Selection
Subset selection, i.e., selecting a set of items from a collection to achieve specific goals, has wide applications in information retrieval, statistics, and machine learning. To implement an end-to-end learning framework, different relaxed differentiable operators of subset selection have been proposed. Most existing work relies on either the regularization method or the perturbation method. In this work, we provide a probabilistic interpretation for the regularization relaxation and unify the two schemes. We also construct concrete examples to show the general connection between these two relaxations. Finally, we evaluate the perturbed selector as well as the regularized selector on two tasks: the maximum entropy sampling problem and the feature selection problem. The experimental results show that these two methods can achieve competitive performance against other benchmarks.
https://proceedings.mlr.press/v206/sun23e.html

Convergence of Stein Variational Gradient Descent under a Weaker Smoothness Condition
Stein Variational Gradient Descent (SVGD) is an important alternative to the Langevin-type algorithms for sampling from probability distributions of the form $\pi(x) \propto \exp(-V(x))$. In the existing theory of Langevin-type algorithms and SVGD, the potential function $V$ is often assumed to be $L$-smooth. However, this restrictive condition excludes a large class of potential functions such as polynomials of degree greater than $2$. Our paper studies the convergence of the SVGD algorithm for distributions with $(L_0,L_1)$-smooth potentials. This relaxed smoothness assumption was introduced by Zhang et al. [2019a] for the analysis of gradient clipping algorithms. With the help of trajectory-independent auxiliary conditions, we provide a descent lemma establishing that the algorithm decreases the KL divergence at each iteration and prove a complexity bound for SVGD in the population limit in terms of the Stein Fisher information.
https://proceedings.mlr.press/v206/sun23d.html
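The SVGD update the paper analyzes moves a set of particles along phi(x_i) = (1/n) Σ_j [k(x_j, x_i) ∇log π(x_j) + ∇_{x_j} k(x_j, x_i)]. A minimal numpy sketch with a fixed-bandwidth RBF kernel and a standard normal target; the fixed bandwidth and step size are simplifying assumptions (implementations commonly use the median heuristic), and no smoothness analysis from the paper is reproduced here.

```python
import numpy as np

def svgd_update(X, grad_logp, h=1.0):
    """One SVGD direction: kernel-weighted gradient (attraction) + repulsion."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]            # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff**2, axis=-1) / h)       # RBF kernel k(x_i, x_j)
    attract = K @ grad_logp(X) / n                  # pulls particles toward mass
    repulse = (2.0 / h) * np.einsum('ij,ijd->id', K, diff) / n  # keeps them spread
    return attract + repulse

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 1))              # particles start far from target
for _ in range(1000):
    X = X + 0.1 * svgd_update(X, lambda Z: -Z)      # target N(0, 1): grad log pi = -x
assert abs(X.mean()) < 0.3                          # particles recentered near 0
assert 0.4 < X.var() < 1.6                          # spread roughly matches target
```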

NTS-NOTEARS: Learning Nonparametric DBNs With Prior Knowledge
We describe NTS-NOTEARS, a score-based structure learning method for time-series data to learn dynamic Bayesian networks (DBNs) that capture nonlinear, lagged (inter-slice) and instantaneous (intra-slice) relations among variables. NTS-NOTEARS utilizes 1D convolutional neural networks (CNNs) to model the dependence of child variables on their parents; the 1D CNN is a neural function approximation model well-suited for sequential data. DBN-CNN structure learning is formulated as a continuous optimization problem with an acyclicity constraint, following the NOTEARS DAG learning approach (Zheng et al., 2018, 2020). We show how prior knowledge of dependencies (e.g., forbidden and required edges) can be included as additional optimization constraints. Empirical evaluation on simulated and benchmark data shows that NTS-NOTEARS achieves state-of-the-art DAG structure quality compared to both parametric and nonparametric baseline methods, with improvement in the range of 10-20% on the F1-score. We also evaluate NTS-NOTEARS on complex real-world data acquired from professional ice hockey games that contain a mixture of continuous and discrete variables. The code is available online.
https://proceedings.mlr.press/v206/sun23c.html

Uni6Dv2: Noise Elimination for 6D Pose Estimation
Uni6D is the first 6D pose estimation approach to employ a unified backbone network to extract features from both RGB and depth images. We discover that the principal causes of Uni6D's performance limitations are Instance-Outside and Instance-Inside noise. Uni6D’s simple pipeline design inherently introduces Instance-Outside noise from background pixels in the receptive field, while ignoring Instance-Inside noise in the input depth data. In this paper, we propose a two-step denoising approach for dealing with the aforementioned noise in Uni6D. To reduce noise from non-instance regions, an instance segmentation network is utilized in the first step to crop and mask the instance. A lightweight depth denoising module is proposed in the second step to calibrate the depth feature before feeding it into the pose regression network. Extensive experiments show that our Uni6Dv2 reliably and robustly eliminates noise, outperforming Uni6D without sacrificing too much inference efficiency. It also reduces the need for annotated real data that requires costly labeling.
https://proceedings.mlr.press/v206/sun23b.html

A New Causal Decomposition Paradigm towards Health Equity
Causal decomposition has provided a powerful tool to analyze health disparity problems by assessing the proportion of disparity caused by each mediator (the variable that mediates the effect of the exposure on the health outcome). However, most of these methods lack policy implications, as they fail to account for all sources of disparities caused by the mediator. Moreover, identifiability requires specifying an admissible set so that the strong ignorability condition holds, which can be problematic, as some variables in this set may induce new spurious features. To resolve these issues, under the framework of the structural causal model, we propose a new decomposition, dubbed adjusted and unadjusted effects, which is able to include all types of disparity by adjusting each mediator’s distribution from the disadvantaged group to the advantaged one. In addition, by learning the maximal ancestral graph and implementing causal discovery from heterogeneous data, we can identify the admissible set, followed by an efficient algorithm for estimation. The theoretical correctness and efficacy of our method are demonstrated using a synthetic dataset and a common spine disease dataset.
https://proceedings.mlr.press/v206/sun23a.html

DIET: Conditional independence testing with marginal dependence measures of residual information
Conditional randomization tests (CRTs) assess whether a variable $x$ is predictive of another variable $y$, having observed covariates $z$. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: $F_{x\vert z}(x \vert z)$ and $F_{y\vert z}(y \vert z)$ where $F_{\cdot \vert z}(\cdot \vert z)$ is a conditional cumulative distribution function (CDF) for the distribution $p(\cdot \vert z)$. These variables are termed “information residuals.” We give sufficient conditions for DIET to achieve finite sample type-1 error control and power greater than the type-1 error rate. We then prove that when using the mutual information between the information residuals as a test statistic, DIET yields the most powerful conditionally valid test. Finally, we show DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
https://proceedings.mlr.press/v206/sudarshan23a.html
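The information residuals above can be made concrete in a toy Gaussian setting where the conditional CDFs are known in closed form: with x = z + noise, the residual is F_{x|z}(x|z) = Φ(x − z). The sketch below checks that the residuals look independent under conditional independence and dependent otherwise; it uses correlation as a stand-in for the paper's mutual-information statistic, which is a simplifying assumption.

```python
import numpy as np
from math import erf

# standard normal CDF, vectorized over arrays
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0))))

rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)                      # x independent of y given z
rx = Phi(x - z)                                 # information residual F_{x|z}(x|z)
ry = Phi(y - z)                                 # information residual F_{y|z}(y|z)
assert abs(np.corrcoef(rx, ry)[0, 1]) < 0.05    # residuals look independent

shared = rng.normal(size=n)
x2 = z + shared
y2 = z + shared + 0.1 * rng.normal(size=n)      # x2 and y2 dependent given z
assert np.corrcoef(Phi(x2 - z), Phi(y2 - z))[0, 1] > 0.5  # dependence survives
```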

PAC-Bayesian Learning of Optimization Algorithms
We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility of finding a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.
https://proceedings.mlr.press/v206/sucker23a.html

Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees
We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective — the expected total discounted return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and therefore significant efforts are needed for correcting this mismatch either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
https://proceedings.mlr.press/v206/su23a.html

Bounding Evidence and Estimating Log-Likelihood in VAE
Many crucial problems in deep learning and statistical inference are caused by a variational gap, i.e., a difference between model evidence (log-likelihood) and evidence lower bound (ELBO). In particular, in a classical VAE setting that involves training via an ELBO cost function, it is difficult to provide a robust comparison of the effects of training between models, since we do not know a log-likelihood of data (but only its lower bound). In this paper, to deal with this problem, we introduce a general and effective upper bound, which allows us to efficiently approximate the evidence of data. We provide extensive theoretical and experimental studies of our approach, including its comparison to the other state-of-the-art upper bounds, as well as its application as a tool for the evaluation of models that were trained on various lower bounds.
https://proceedings.mlr.press/v206/struski23a.html

Sampling From a Schrödinger Bridge
The Schrödinger bridge is a stochastic process that finds the most likely coupling of two measures with respect to Brownian motion, and is equivalent to the popular entropically regularized optimal transport problem. Motivated by recent applications of the Schrödinger bridge to trajectory reconstruction problems, we study the problem of sampling from a Schrödinger bridge in high dimensions. We assume sample access to the marginals of the Schrödinger bridge process and prove that the natural plug-in sampler achieves a fast statistical rate of estimation for the population bridge in terms of relative entropy. This sampling procedure is given by computing the entropic OT plan between samples from each marginal, and joining a draw from this plan with a Brownian bridge. We apply this result to construct a new and computationally feasible estimator that yields improved rates for entropic optimal transport map estimation.
https://proceedings.mlr.press/v206/stromme23a.html
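The plug-in sampler described above has two steps: compute the entropic OT plan between samples from the two marginals, then draw an endpoint pair from the plan and connect it with a Brownian bridge. A self-contained 1D numpy sketch; the regularization level, iteration count, and bridge discretization are illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

def sinkhorn_plan(x, y, eps=0.1, n_iter=2000):
    """Entropic OT plan between two empirical measures on the real line."""
    C = (x[:, None] - y[None, :]) ** 2
    C = C / C.max()
    K = np.exp(-C / eps)
    u = np.ones(len(x))
    for _ in range(n_iter):
        v = (1.0 / len(y)) / (K.T @ u)
        u = (1.0 / len(x)) / (K @ v)
    return u[:, None] * K * v[None, :]

def bridge_sample(x, y, P, ts, rng):
    """Draw endpoints (x_i, y_j) from the plan, then a Brownian bridge through them."""
    idx = rng.choice(P.size, p=(P / P.sum()).ravel())
    i, j = divmod(idx, P.shape[1])
    path, s, b = [x[i]], 0.0, x[i]
    for t in ts:                                   # interior times, strictly in (0, 1)
        mean = b + (t - s) / (1.0 - s) * (y[j] - b)
        var = (t - s) * (1.0 - t) / (1.0 - s)
        b = mean + np.sqrt(var) * rng.normal()     # conditional bridge transition
        path.append(b)
        s = t
    path.append(y[j])
    return np.array(path)

rng = np.random.default_rng(0)
x = rng.normal(size=100)                           # samples from the first marginal
y = rng.normal(loc=2.0, size=100)                  # samples from the second marginal
P = sinkhorn_plan(x, y)
path = bridge_sample(x, y, P, ts=[0.25, 0.5, 0.75], rng=rng)
assert len(path) == 5
assert path[0] in x and path[-1] in y              # endpoints come from the marginals
```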

The Ordered Matrix Dirichlet for State-Space Models
Many dynamical systems in the real world are naturally described by latent states with intrinsic ordering, such as “ally”, “neutral”, and “enemy” relationships in international relations. These latent states manifest through countries’ cooperative versus conflictual interactions over time. State-space models (SSMs) explicitly relate the dynamics of observed measurements to transitions in latent states. For discrete data, SSMs commonly do so through a state-to-action emission matrix and a state-to-state transition matrix. This paper introduces the Ordered Matrix Dirichlet (OMD) as a prior distribution over ordered stochastic matrices wherein the discrete distribution in the kth row is stochastically dominated by the (k+1)th, such that probability mass is shifted to the right when moving down rows. We illustrate the OMD prior within two SSMs: a hidden Markov model, and a novel dynamic Poisson Tucker decomposition model tailored to international relations data. We find that models built on the OMD recover interpretable ordered latent structure without forfeiting predictive performance. We suggest future applications to other domains where models with stochastic matrices are popular (e.g., topic modeling), and publish user-friendly code.
https://proceedings.mlr.press/v206/stoehr23a.html

Data Augmentation for Imbalanced Regression
In this work, we consider the problem of imbalanced data in a regression framework when the imbalanced phenomenon concerns continuous or discrete covariates. Such a situation can lead to biases in the estimates. In this case, we propose a data augmentation algorithm that combines a weighted resampling (WR) and a data augmentation (DA) procedure. In a first step, the DA procedure permits exploring a wider support than the initial one. In a second step, the WR method drives the exogenous distribution to a target one. We discuss the choice of the DA procedure through a numerical study that illustrates the advantages of this approach. Finally, an actuarial application is studied.
https://proceedings.mlr.press/v206/stocksieker23a.html
https://proceedings.mlr.press/v206/stocksieker23a.htmlFaithful Heteroscedastic Regression with Neural NetworksHeteroscedastic regression models a Gaussian variable’s mean and variance as a function of covariates. Parametric methods that employ neural networks for these parameter maps can capture complex relationships in the data. Yet, optimizing network parameters via log likelihood gradients can yield suboptimal mean and uncalibrated variance estimates. Current solutions side-step this optimization problem with surrogate objectives or Bayesian treatments. Instead, we make two simple modifications to optimization. Notably, their combination produces a heteroscedastic model with mean estimates that are provably as accurate as those from its homoscedastic counterpart (i.e. fitting the mean under squared error loss). For a wide variety of network and task complexities, we find that mean estimates from existing heteroscedastic solutions can be significantly less accurate than those from an equivalently expressive mean-only model. Our approach provably retains the accuracy of an equally flexible mean-only model while also offering best-in-class variance calibration. Lastly, we show how to leverage our method to recover the underlying heteroscedastic noise variance.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/stirn23a.html
https://proceedings.mlr.press/v206/stirn23a.htmlRegression as Classification: Influence of Task Formulation on Neural Network FeaturesNeural networks can be trained to solve regression problems by using gradient-based methods to minimize the square loss. However, practitioners often prefer to reformulate regression as a classification problem, observing that training on the cross entropy loss results in better performance. By focusing on two-layer ReLU networks, which can be fully characterized by measures over their feature space, we explore how the implicit bias induced by gradient-based optimization could partly explain the above phenomenon. We provide theoretical evidence that the regression formulation yields a measure whose support can differ greatly from that for classification, in the case of one-dimensional data. Our proposed optimal supports correspond directly to the features learned by the input layer of the network. The different nature of these supports sheds light on possible optimization difficulties the square loss could encounter during training, and we present empirical results illustrating this phenomenon.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/stewart23a.html
https://proceedings.mlr.press/v206/stewart23a.htmlBayesian Optimization with Conformal Prediction SetsBayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/stanton23a.html
https://proceedings.mlr.press/v206/stanton23a.htmlA Constant-Factor Approximation Algorithm for Reconciliation $k$-MedianIn the reconciliation $k$-median problem we ask to cluster a set of data points by picking $k$ cluster centers so as to minimize the sum of distances of the data points to their cluster centers plus the sum of pairwise distances between the centers. The problem, which is a variant of classic $k$-median, aims to find a set of cluster centers that are not too far from each other, and it has applications, for example, when selecting a committee to deliberate on a controversial topic. This problem was introduced recently (Ordozgoiti et al., 2019), and it was shown that a local-search-based algorithm is always within a factor $O(k)$ of an optimum solution and performs well in practice. In this paper, we demonstrate a close connection of reconciliation $k$-median to a variant of the $k$-facility location problem, in which each potential cluster center has an individual opening cost and we aim at minimizing the sum of client-center distances and the opening costs. This connection enables us to provide a new algorithm for reconciliation $k$-median that yields a constant-factor approximation (independent of $k$). We also provide a sparsification scheme that reduces the number of potential cluster centers to $O(k)$ in order to substantially speed up approximation algorithms. We empirically compare our new algorithms with the previous local-search approach, showing improved performance and stability. In addition, we show how our sparsification approach helps to reduce computation time without significantly compromising the solution quality.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/spoerhase23a.html
https://proceedings.mlr.press/v206/spoerhase23a.htmlNonparametric Indirect Active LearningTypical models of active learning assume a learner can directly manipulate or query a covariate X to study its relationship with a response Y. However, if X is a feature of a complex system, it may be possible only to indirectly influence X by manipulating a control variable Z, a scenario we refer to as Indirect Active Learning. Under a nonparametric fixed-budget model of Indirect Active Learning, we study minimax convergence rates for estimating a local relationship between X and Y, with different rates depending on the complexities and noise levels of the relationships between Z and X and between X and Y. We also derive minimax rates for passive learning under comparable assumptions, finding in many cases that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/singh23a.html
https://proceedings.mlr.press/v206/singh23a.htmlMulti-armed Bandit Experimental Design: Online Decision-making and Adaptive InferenceMulti-armed bandit has been well-known for its efficiency in online decision-making in terms of minimizing the loss of the participants’ welfare during experiments (i.e., the regret). In clinical trials and many other scenarios, the statistical power of inferring the treatment effects (i.e., the gaps between the mean outcomes of different arms) is also crucial. Nevertheless, minimizing the regret entails harming the statistical power of estimating the treatment effect, since the observations from some arms can be limited. In this paper, we investigate the trade-off between efficiency and statistical power by casting the multi-armed bandit experimental design into a minimax multi-objective optimization problem. We introduce the concept of Pareto optimality to mathematically characterize the situation in which neither the statistical power nor the efficiency can be improved without degrading the other. We derive a useful sufficient and necessary condition for the Pareto optimal solutions. Additionally, we design an effective Pareto optimal multi-armed bandit experiment that can be tailored to different levels of the trade-off between the two objectives.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/simchi-levi23a.html
https://proceedings.mlr.press/v206/simchi-levi23a.htmlCLIP-Lite: Information Efficient Visual Representation Learning with Language SupervisionWe propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0$%$ mAP absolute gain in performance on Pascal VOC classification, and a +22.1$%$ top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-LiteTue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shrivastava23a.html
https://proceedings.mlr.press/v206/shrivastava23a.htmlThe Lauritzen-Chen Likelihood For Graphical ModelsGraphical models such as Markov random fields (MRFs) that are associated with undirected graphs, and Bayesian networks (BNs) that are associated with directed acyclic graphs, have proven to be a very popular approach for reasoning under uncertainty, prediction problems and causal inference. Parametric MRF likelihoods are well-studied for Gaussian and categorical data. However, in more complicated parametric and semi-parametric settings, likelihoods specified via clique potential functions are generally not known to be congenial or non-redundant. Congenial and non-redundant DAG likelihoods are far simpler to specify in both parametric and semi-parametric settings by modeling Markov factors in the DAG factorization. However, DAG likelihoods specified in this way are not guaranteed to coincide in distinct DAGs within the same Markov equivalence class. This complicates likelihoods based model selection procedures for DAGs by “sneaking in” potentially unwarranted assumptions about edge orientations. In this paper we link a density function decomposition due to Chen with the clique factorization of MRFs described by Lauritzen to provide a general likelihood for MRF models. The proposed likelihood is composed of variationally independent, and non-redundant closed form functionals of the observed data distribution, and is sufficiently general to apply to arbitrary parametric and semi-parametric models. We use an extension of our developments to give a general likelihood for DAG models that is guaranteed to coincide for all members of a Markov equivalence class. Our results have direct applications for model selection and semi-parametric inference.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shpitser23a.html
https://proceedings.mlr.press/v206/shpitser23a.htmlLoss-Curvature Matching for Dataset Selection and CondensationTraining neural networks on a large dataset requires substantial computational costs. Dataset reduction selects or synthesizes data instances based on the large dataset, while minimizing the degradation in generalization performance from the full dataset. Existing methods utilize the neural network during the dataset reduction procedure, so the model parameter becomes important factor in preserving the performance after reduction. By depending upon the importance of parameters, this paper introduces a new reduction objective, coined LCMat, which Matches the Loss Curvatures of the original dataset and reduced dataset over the model parameter space, more than the parameter point. This new objective induces a better adaptation of the reduced dataset on the perturbed parameter region than the exact point matching. Particularly, we identify the worst case of the loss curvature gap from the local parameter region, and we derive the implementable upper bound of such worst-case with theoretical analyses. Our experiments on both coreset selection and condensation benchmarks illustrate that LCMat shows better generalization performances than existing baselines.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shin23a.html
https://proceedings.mlr.press/v206/shin23a.htmlDistributed Offline Policy Optimization Over Batch DataFederated learning (FL) has received increasing interests during the past years, However, most of the existing works focus on supervised learning, and federated learning for sequential decision making has not been fully explored. Part of the reason is that learning a policy for sequential decision making typically requires repeated interaction with the environments, which is costly in many FL applications.To overcome this issue, this work proposes a federated offline policy optimization method abbreviated as FedOPO that allows clients to jointly learn the optimal policy without interacting with environments during training. Albeit the nonconcave-convex-strongly concave nature of the resultant max-min-max problem, we establish both the local and global convergence of our FedOPO algorithm. Experiments on the OpenAI gym demonstrate that our algorithm is able to find a near-optimal policy while enjoying various merits brought by FL, including training speedup and improved asymptotic performance.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shen23b.html
https://proceedings.mlr.press/v206/shen23b.htmlPAC Learning of Halfspaces with Malicious Noise in Nearly Linear TimeWe study the problem of efficient PAC learning of halfspaces in $\mathbb{R}^d$ in the presence of the malicious noise, where a fraction of the training samples are adversarially corrupted. A series of recent works have developed polynomial-time algorithms that enjoy near-optimal sample complexity and noise tolerance, yet leaving open whether a linear-time algorithm exists and matches these appealing statistical performance guarantees. In this work, we give an affirmative answer by developing an algorithm that runs in time $\tilde{O}(m d )$, where $m = \tilde{O}(\frac{d}{\epsilon})$ is the sample size and $\epsilon \in (0, 1)$ is the target error rate. Notably, the computational complexity of all prior algorithms suffer either a high order dependence on the problem size, or is implicitly proportional to $\frac{1}{\epsilon^2}$ through the sample size. Our key idea is to combine localization and an approximate version of matrix multiplicative weights update method to progressively downweight the contribution of the corrupted samples while refining the learned halfspace.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shen23a.html
https://proceedings.mlr.press/v206/shen23a.htmlOn the Capacity Limits of Privileged ERMWe study the supervised learning paradigm called Learning Using Privileged Information, first suggested by Vapnik and Vashist (2009). In this paradigm, in addition to the examples and labels, additional (privileged) information is provided only for training examples. The goal is to use this information to improve the classification accuracy of the resulting classifier, where this classifier can only use the non-privileged information of new example instances to predict their label. We study the theory of privileged learning with the zero-one loss under the natural Privileged ERM algorithm proposed in Peshyony and Vapnik (2010). We provide a counter example to a claim made in that work regarding the VC dimension of the loss class induced by this problem; We conclude that the claim is incorrect. We then provide a correct VC dimension analysis which gives both lower and upper bounds on the capacity of the Privileged ERM loss class. We further show, via a generalization analysis, that worst-case guarantees for Privileged ERM cannot improve over standard non-privileged ERM, unless the capacity of the privileged information is similar or smaller to that of the non-privileged information. This result points to an important limitation of the Privileged ERM approach. In our closing discussion, we suggest another way in which Privileged ERM might still be helpful, even when the capacity of the privileged information is large.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/sharoni23a.html
https://proceedings.mlr.press/v206/sharoni23a.htmlDo Bayesian Neural Networks Need To Be Fully Stochastic?We investigate the benefit of treating all the parameters in a Bayesian neural network stochastically and find compelling theoretical and empirical evidence that this standard construction may be unnecessary. To this end, we prove that expressive predictive distributions require only small amounts of stochasticity. In particular, partially stochastic networks with only n stochastic biases are universal probabilistic predictors for n-dimensional predictive problems. In empirical investigations, we find no systematic benefit of full stochasticity across four different inference modalities and eight datasets; partially stochastic networks can match and sometimes even outperform fully stochastic networks, despite their reduced memory costs.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/sharma23a.html
https://proceedings.mlr.press/v206/sharma23a.htmlDirect Inference of Effect of Treatment (DIET) for a Cookieless WorldBrands use cookies and device identifiers to link different web visits to the same consumer. However, with increasing demands for privacy, these identifiers are about to be phased out, making identity fragmentation a permanent feature of the online world. Assessing treatment effects via randomized experiments (A/B testing) in such a scenario is challenging because identity fragmentation causes a) users to receive hybrid/mixed treatments, and b) hides the causal link between the historical treatments and the outcome. In this work, we address the problem of estimating treatment effects when a lack of identification leads to incomplete knowledge of historical treatments. This is a challenging problem which has not been addressed in literature yet. We develop a new method called DIET, which can adjust for users being exposed to mixed treatments without the entire history of treatments being available. Our method takes inspiration from the Cox model, and uses a proportional outcome approach under which we prove that one can obtain consistent estimates of treatment effects even under identity fragmentation. Our experiments, on one simulated and two real datasets, show that our method leads to up to 20% reduction in error and 25% reduction in bias over the naive estimate.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shankar23a.html
https://proceedings.mlr.press/v206/shankar23a.htmlPrecision/Recall on Imbalanced Test DataIn this paper we study the problem of estimating accurately the precision and recall for binary classification when the classes are imbalanced and only a limited number of human labels are available. One common strategy is to over-sample the small positive class predicted by the classifier. Rather than random sampling where the values in a confusion matrix are observations coming from a multinomial distribution, we over-sample the minority positive class predicted by the classifier, resulting in two independent binomial distributions. But how much should we over-sample? And what confidence/credible intervals can we deduce based on our over-sampling? We provide formulas for (1) the confidence intervals of the adjusted precision/recall after over-sampling; (2) Bayesian credible intervals of adjusted precision/recall. For precision, the higher the over-sampling rate, the narrower the confidence/credible interval. For recall, there exists an optimal over-sampling ratio, which minimizes the width of the confidence/credible interval. Also, we present experiments on synthetic data and real data to demonstrate the capability of our method to construct accurate intervals. Finally, we demonstrate how we can apply our techniques to Yahoo mail’s quality monitoring system.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shang23a.html
https://proceedings.mlr.press/v206/shang23a.htmlBeyond Performative Prediction: Open-environment Learning with Presence of CorruptionsPerformative prediction is a framework to capture the endogenous distribution changes resulting from the reactions of deployed environments to the learner’s decision. Existing results require that the collected data are sampled from the clean observed distribution. However, this is often not the case in real-world applications, and even worse, data collected in open environments may include corruption due to various undesirable factors. In this paper, we study the entanglement of endogenous distribution change and corruption in open environments, where data are obtained from a corrupted decision-dependent distribution. The central challenge in this problem is the entangling effects between changing distributions and corruptions, which impede the use of effective gradient-based updates. To overcome this difficulty, we propose a novel recursive formula that decouples the two sources of effects, which allows us to further exploit suitable techniques for handling two decoupled effects and obtaining favorable guarantees. Theoretically, we prove that our proposed algorithm converges to the desired solution under corrupted observations, and simultaneously it can retain a competitive rate in the uncorrupted case. Experimental results also support our theoretical findings.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shan23a.html
https://proceedings.mlr.press/v206/shan23a.htmlModel-X Sequential Testing for Conditional Independence via Testing by BettingThis paper develops a model-free sequential test for conditional independence. The proposed test allows researchers to analyze an incoming i.i.d. data stream with any arbitrary dependency structure, and safely conclude whether a feature is conditionally associated with the response under study. We allow the processing of data points online, as soon as they arrive, and stop data acquisition once significant results are detected, rigorously controlling the type-I error rate. Our test can work with any sophisticated machine learning algorithm to enhance data efficiency to the extent possible. The developed method is inspired by two statistical frameworks. The first is the model-X conditional randomization test, a test for conditional independence that is valid in offline settings where the sample size is fixed in advance. The second is testing by betting, a “game-theoretic” approach for sequential hypothesis testing. We conduct synthetic experiments to demonstrate the advantage of our test over out-of-the-box sequential tests that account for the multiplicity of tests in the time horizon, and demonstrate the practicality of our proposal by applying it to real-world tasks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/shaer23a.html
https://proceedings.mlr.press/v206/shaer23a.htmlNODAGS-Flow: Nonlinear Cyclic Causal Structure LearningLearning causal relationships between variables is a well-studied problem in statistics, with many important applications in science. However, modeling real-world systems remain challenging, as most existing algorithms assume that the underlying causal graph is acyclic. While this is a convenient framework for developing theoretical developments about causal reasoning and inference, the underlying modeling assumption is likely to be violated in real systems, because feedback loops are common (e.g., in biological systems). Although a few methods search for cyclic causal models, they usually rely on some form of linearity, which is also limiting, or lack a clear underlying probabilistic model. In this work, we propose a novel framework for learning nonlinear cyclic causal graphical models from interventional data, called NODAGS-Flow. We perform inference via direct likelihood optimization, employing techniques from residual normalizing flows for likelihood estimation. Through synthetic experiments and an application to single-cell high-content perturbation screening data, we show significant performance improvements with our approach compared to state-of-the-art methods with respect to structure recovery and predictive performance.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/sethuraman23a.html
https://proceedings.mlr.press/v206/sethuraman23a.htmlMixtures of All TreesTree-shaped graphical models are widely used for their tractability. However, they unfortunately lack expressive power as they require committing to a particular sparse dependency structure. We propose a novel class of generative models called mixtures of all trees: that is, a mixture over all possible ($n^{n-2}$) tree-shaped graphical models over n variables. We show that it is possible to parameterize this Mixture of All Trees (MoAT) model compactly (using a polynomial-size representation) in a way that allows for tractable likelihood computation and optimization via stochastic gradient descent. Furthermore, by leveraging the tractability of tree-shaped models, we devise fast-converging conditional sampling algorithms for approximate inference, even though our theoretical analysis suggests that exact computation of marginals in the MoAT model is NP-hard. Empirically, MoAT achieves state-of-the-art performance on density estimation benchmarks when compared against powerful probabilistic models including hidden Chow-Liu Trees.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/selvam23a.html
https://proceedings.mlr.press/v206/selvam23a.htmlSparse Spectral Bayesian Permanental Process with Generalized KernelWe introduce a novel scheme for Bayesian inference on permanental processes which models the Poisson intensity as the square of a Gaussian process. Combining generalized kernels and a Fourier features-based representation of the Gaussian process with a Laplace approximation to the posterior, we achieve a fast and efficient inference that does not require numerical integration over the input space, allows kernel design and scales linearly with the number of events. Our method builds and improves upon the state-of-theart Laplace Bayesian point process benchmark of Walder and Bishop (2017), demonstrated on both synthetic, real-world temporal and large spatial data sets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/sellier23a.html
https://proceedings.mlr.press/v206/sellier23a.htmlImproving Adaptive Conformal Prediction Using Self-Supervised LearningConformal prediction is a powerful distribution-free tool for uncertainty quantification, establishing valid prediction intervals with finite-sample guarantees. To produce valid intervals which are also adaptive to the difficulty of each instance, a common approach is to compute normalized nonconformity scores on a separate calibration set. Self-supervised learning has been effectively utilized in many domains to learn general representations for downstream predictors. However, the use of self-supervision beyond model pretraining and representation learning has been largely unexplored. In this work, we investigate how self-supervised pretext tasks can improve the quality of the conformal regressors, specifically by improving the adaptability of conformal intervals. We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores. We empirically demonstrate the benefit of the additional information using both synthetic and real data on the efficiency (width), deficit, and excess of conformal prediction intervals.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/seedat23a.html
https://proceedings.mlr.press/v206/seedat23a.htmlMeta-Uncertainty in Bayesian Model ComparisonBayesian model comparison (BMC) offers a principled probabilistic approach to study and rank competing models. In standard BMC, we construct a discrete probability distribution over the set of possible models, conditional on the observed data of interest. These posterior model probabilities (PMPs) are measures of uncertainty, but—when derived from a finite number of observations—are also uncertain themselves. In this paper, we conceptualize distinct levels of uncertainty which arise in BMC. We explore a fully probabilistic framework for quantifying meta-uncertainty, resulting in an applied method to enhance any BMC workflow. Drawing on both Bayesian and frequentist techniques, we represent the uncertainty over the uncertain PMPs via meta-models which combine simulated and observed data into a predictive distribution for PMPs on new data. We demonstrate the utility of the proposed method in the context of conjugate Bayesian regression, likelihood-based inference with Markov chain Monte Carlo, and simulation-based inference with neural networks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/schmitt23a.html
https://proceedings.mlr.press/v206/schmitt23a.htmlRobust Linear Regression: Gradient-descent, Early-stopping, and BeyondIn this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification Vardi et al. (2022) that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to apply feature-depending learning rates and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to within a multiplicative constant of 1.1124 in the population regime, and works for any norm.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/scetbon23a.html
Mode-constrained Model-based Reinforcement Learning via Gaussian Processes
https://proceedings.mlr.press/v206/scannell23a.html

Model-based reinforcement learning (RL) algorithms do not typically consider environments with multiple dynamic modes, where it is beneficial to avoid inoperable or undesirable modes. We present a model-based RL algorithm that constrains training to a single dynamic mode with high probability. This is a difficult problem because the mode constraint is a hidden variable associated with the environment's dynamics: it is (1) unknown a priori and (2) not observed in the environment's output, so it cannot be learned with supervised learning. We present a nonparametric dynamics model which learns the mode constraint alongside the dynamic modes. Importantly, it learns latent structure that our planning scheme leverages to (1) enforce the mode constraint with high probability and (2) escape local optima induced by the mode constraint. We validate our method by showing that it can solve a simulated quadcopter navigation task whilst providing a level of constraint satisfaction both during and after training.
Risk-aware linear bandits with convex loss
https://proceedings.mlr.press/v206/saux23a.html

In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximate solutions obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms in numerical experiments.
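The expectile mentioned in this abstract can be illustrated with a small sketch: the tau-expectile of a sample is the minimizer of an asymmetric least-squares loss, which plain gradient descent recovers (this is just the textbook definition, not the paper's confidence-sequence machinery).

```python
import numpy as np

def expectile(x, tau, lr=0.1, iters=2000):
    """tau-expectile of a sample: the minimizer m of the asymmetric
    least-squares loss E[|tau - 1{x <= m}| * (x - m)^2]."""
    m = np.mean(x)
    for _ in range(iters):
        w = np.where(x > m, tau, 1.0 - tau)   # asymmetric weights
        grad = -2.0 * np.mean(w * (x - m))    # gradient of the loss in m
        m -= lr * grad
    return m
```

For tau = 0.5 the loss is symmetric and the expectile reduces to the sample mean; larger tau shifts it upward.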
Implications of sparsity and high triangle density for graph representation learning
https://proceedings.mlr.press/v206/sansford23a.html

Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.
Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck
https://proceedings.mlr.press/v206/samaddar23a.html

The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism for learning a joint distribution of the latent variable and the sparsity, and hence it can account for the complete uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity induction mechanisms for latent variable models proposed in the literature.
Improved Generalization Bound and Learning of Sparsity Patterns for Data-Driven Low-Rank Approximation
https://proceedings.mlr.press/v206/sakaue23a.html

Learning sketching matrices for fast and accurate low-rank approximation (LRA) has gained increasing attention. Recently, Bartlett, Indyk, and Wagner (COLT 2022) presented a generalization bound for the learning-based LRA. Specifically, for rank-$k$ approximation using an $m \times n$ learned sketching matrix with $s$ non-zeros in each column, they proved an $\tilde O(nsm)$ bound on the fat shattering dimension ($\tilde O$ hides logarithmic factors). We build on their work and make two contributions. (1) We present a better $\tilde O(nsk)$ bound ($k \le m$). En route to obtaining this result, we give a low-complexity Goldberg–Jerrum algorithm for computing pseudo-inverse matrices, which would be of independent interest. (2) We alleviate an assumption of the previous study that sketching matrices have a fixed sparsity pattern. We prove that learning positions of non-zeros increases the fat shattering dimension only by $O(ns\log n)$. In addition, experiments confirm the practical benefit of learning sparsity patterns.
Dueling RL: Reinforcement Learning with Trajectory Preferences
https://proceedings.mlr.press/v206/saha23a.html

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning (RL), an agent receives feedback only in terms of 1-bit (0/1) preferences over a trajectory pair rather than absolute rewards. The success of the traditional reward-based RL framework crucially depends on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with a regret guarantee of $\tilde {\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend the above algorithm to the case of unknown transition dynamics and provide an algorithm with regret $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} +\sqrt{|\mathcal{S}||\mathcal{A}|TH} )$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for the preference-based RL problem with trajectory preferences.
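A generalized linear preference model of the kind this abstract describes can be sketched in a few lines (a Bradley-Terry/logistic-style illustration of the general idea, with hypothetical feature maps; the paper's exact model and estimator are not reproduced here):

```python
import numpy as np

def preference_prob(w, phi1, phi2):
    """Probability that trajectory 1 is preferred over trajectory 2 under
    a logistic generalized linear model on trajectory features phi."""
    return 1.0 / (1.0 + np.exp(-(w @ (phi1 - phi2))))
```

By construction the model is antisymmetric: the two orderings of a pair have probabilities summing to one, and identical trajectories are preferred with probability 0.5.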
Improved Robust Algorithms for Learning with Discriminative Feature Feedback
https://proceedings.mlr.press/v206/sabato23a.html

Discriminative Feature Feedback is a setting first introduced by Dasgupta et al. (2018), which provides a protocol for interactive learning based on feature explanations that are provided by a human teacher. The features distinguish between the labels of pairs of possibly similar instances. That work has shown that learning in this model can have considerable statistical and computational advantages over learning in standard label-based interactive learning models. In this work, we provide new robust interactive learning algorithms for the Discriminative Feature Feedback model, with mistake bounds that are significantly lower than those of previous robust algorithms for this setting. In the adversarial setting, we reduce the dependence on the number of protocol exceptions from quadratic to linear. In addition, we provide an algorithm for a slightly more restricted model, which obtains an even smaller mistake bound for large models with many exceptions. In the stochastic setting, we provide the first algorithm that converges to the exception rate with a polynomial sample complexity. Our algorithm and analysis for the stochastic setting involve a new construction that we call Feature Influence, which may be of wider applicability.
No time to waste: practical statistical contact tracing with few low-bit messages
https://proceedings.mlr.press/v206/romijnders23a.html

Pandemics have a major impact on society and the economy. In the case of a new virus, such as COVID-19, high-grade tests and vaccines might be slow to develop and scarce in the crucial initial phase. With no time to waste and lock-downs being expensive, contact tracing is thus an essential tool for policymakers. In theory, statistical inference on a virus transmission model can provide an effective method for tracing infections. However, in practice, such algorithms need to run decentralized, rendering existing methods that require hundreds or even thousands of daily messages per person infeasible. In this paper, we develop an algorithm that (i) requires only a few (2-5) daily messages, (ii) works with extremely low bandwidths (3-5 bits) and (iii) enables quarantining and targeted testing that drastically reduces the peak and length of the pandemic. We compare the effectiveness of our algorithm using two agent-based simulators of realistic contact patterns and pandemic parameters and show that it performs well even with low bandwidth, imprecise tests, and incomplete population coverage.
Cooperative Inverse Decision Theory for Uncertain Preferences
https://proceedings.mlr.press/v206/robertson23a.html

Inverse decision theory (IDT) aims to learn a performance metric for classification by eliciting expert classifications on examples. However, elicitation in practical settings may require many classifications of potentially ambiguous examples. To improve the efficiency of elicitation, we propose the cooperative inverse decision theory (CIDT) framework as a formalization of the performance metric elicitation problem. In cooperative inverse decision theory, the expert and a machine play a game where both are rewarded according to the expert's performance metric, but the machine does not initially know what this function is. We show that optimal policies in this framework produce active learning that leads to an exponential improvement in sample complexity over previous work. One of our key findings is that a broad class of sub-optimal experts can be represented as having uncertain preferences. We use this finding to show that such experts naturally fit into our proposed framework, extending inverse decision theory to efficiently deal with decision data that is sub-optimal due to noise, conflicting experts, or systematic error.
Adaptive Tuning for Metropolis Adjusted Langevin Trajectories
https://proceedings.mlr.press/v206/riou-durand23a.html

Hamiltonian Monte Carlo (HMC) is a widely used sampler for continuous probability distributions. In many cases, the underlying Hamiltonian dynamics exhibit a phenomenon of resonance which decreases the efficiency of the algorithm and makes it very sensitive to hyperparameter values. This issue can be tackled efficiently, either via the use of trajectory length randomization (RHMC) or via partial momentum refreshment. The second approach is connected to the kinetic Langevin diffusion, and has been mostly investigated through the use of Generalized HMC (GHMC). However, GHMC induces momentum flips upon rejections, causing the sampler to backtrack and waste computational resources. In this work we focus on a recent algorithm bypassing this issue, named Metropolis Adjusted Langevin Trajectories (MALT). We build upon recent strategies for tuning the hyperparameters of RHMC that target a bound on the Effective Sample Size (ESS) and adapt them to MALT, thereby enabling the first user-friendly deployment of this algorithm. We construct a method to optimize a sharper bound on the ESS and reduce the estimator variance. Easily compatible with parallel implementation, the resultant Adaptive MALT algorithm is competitive in terms of ESS rate and hits useful tradeoffs in memory usage when compared to GHMC, RHMC and NUTS.
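The partial momentum refreshment that this abstract contrasts with full refreshment can be written down in one line (a generic GHMC-style sketch, not the MALT tuning scheme itself): the momentum is mixed with fresh Gaussian noise in a way that exactly preserves its standard normal stationary distribution.

```python
import numpy as np

def partial_refresh(p, alpha, rng):
    """Partial momentum refreshment: mix the current momentum with fresh
    Gaussian noise. Since alpha^2 + (1 - alpha^2) = 1, a N(0, 1) momentum
    stays N(0, 1), while consecutive momenta keep correlation alpha."""
    noise = rng.standard_normal(p.shape)
    return alpha * p + np.sqrt(1.0 - alpha**2) * noise
```

alpha = 0 recovers HMC's full refreshment; alpha close to 1 gives strongly persistent momentum.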
Differentially Private Synthetic Control
https://proceedings.mlr.press/v206/rho23a.html

Synthetic control is a causal inference tool used to estimate the treatment effects of an intervention by creating synthetic counterfactual data. This approach combines measurements from other similar observations (i.e., donor pool) to predict a counterfactual time series of interest (i.e., target unit) by analyzing the relationship between the target and the donor pool before the intervention. As synthetic control tools are increasingly applied to sensitive or proprietary data, formal privacy protections are often required. In this work, we propose the first algorithms for differentially private synthetic control with explicit error bounds based on the analysis of the sensitivity of the synthetic control query. Our approach builds upon tools from non-private synthetic control and differentially private empirical risk minimization. We empirically evaluate the performance of our algorithms and show favorable results in a variety of parameter regimes.
Generalized PTR: User-Friendly Recipes for Data-Adaptive Algorithms with Differential Privacy
https://proceedings.mlr.press/v206/redberg23a.html

The "Propose-Test-Release" (PTR) framework [Dwork and Lei, 2009] is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is "nice". We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from "Private Aggregation of Teacher Ensembles (PATE)" [Papernot et al., 2017, 2018]: privately releasing the entire model with a delicate data-dependent analysis.
Deep Neural Networks with Efficient Guaranteed Invariances
https://proceedings.mlr.press/v206/rath23a.html

We address the problem of improving the performance and in particular the sample complexity of deep neural networks by enforcing and guaranteeing invariances to symmetry transformations rather than learning them from data. Group-equivariant convolutions are a popular approach to obtain equivariant representations. The desired corresponding invariance is then imposed using pooling operations. For rotations, it has been shown that using invariant integration instead of pooling further improves the sample complexity. In this contribution, we first expand invariant integration beyond rotations to flips and scale transformations. We then address the problem of incorporating multiple desired invariances into a single network. For this purpose, we propose a multi-stream architecture, where each stream is invariant to a different transformation such that the network can simultaneously benefit from multiple invariances. We demonstrate our approach with successful experiments on Scaled-MNIST, SVHN, CIFAR-10 and STL-10.
Coherent Probabilistic Forecasting of Temporal Hierarchies
https://proceedings.mlr.press/v206/rangapuram23a.html

Forecasts at different time granularities are required in practice for addressing various business problems, from short-term operational to medium-term tactical and long-term strategic planning. These forecasting problems are usually treated independently by learning different ML models, which results in forecasts that are not consistent with the temporal aggregation structure, leading to inefficient decision making. Some recent work has addressed this problem; however, it uses a post-hoc reconciliation strategy, which yields sub-optimal results and cannot produce probabilistic forecasts. In this paper, we present a global model that produces coherent, probabilistic forecasts for different time granularities by learning joint embeddings for the different aggregation levels with graph neural networks and temporal reconciliation. Temporal reconciliation not only enables consistent decisions for business problems across different planning horizons but also improves the quality of forecasts at finer time granularities. A thorough empirical evaluation illustrates the benefits of the proposed method.
Bayesian Hierarchical Models for Counterfactual Estimation
https://proceedings.mlr.press/v206/raman23a.html

Counterfactual explanations utilize feature perturbations to analyze the outcome of an original decision and recommend an actionable recourse. We argue that it is beneficial to provide several alternative explanations rather than a single point solution and propose a probabilistic paradigm to estimate a diverse set of counterfactuals. Specifically, we treat the perturbations as random variables endowed with prior distribution functions. This allows sampling multiple counterfactuals from the posterior density, with the added benefit of incorporating inductive biases, preserving domain specific constraints and quantifying uncertainty in estimates. More importantly, we leverage Bayesian hierarchical modeling to share information across different subgroups of a population, which can both improve robustness and measure fairness. A gradient based sampler with superior convergence characteristics efficiently computes the posterior samples. Experiments across several datasets demonstrate that the counterfactuals estimated using our approach are valid, sparse, diverse and feasible.
Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach
https://proceedings.mlr.press/v206/raj23a.html

Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights, which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding Summary Evidence Lower BOund. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.
Noise-Aware Statistical Inference with Differentially Private Synthetic Data
https://proceedings.mlr.press/v206/raisa23a.html

While generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, analysis of synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation (MI), and synthetic data generation using noise-aware (NA) Bayesian modeling, into a pipeline NA+MI that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation using the values of marginal queries, we develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.
Fast Feature Selection with Fairness Constraints
https://proceedings.mlr.press/v206/quinzan23a.html

We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with the use of greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for the greedy forward selection for submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can be used to encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm based on standard assumptions. These guarantees are applicable to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art techniques for feature selection, on real-world and synthetic datasets.
Encoding Domain Knowledge in Multi-view Latent Variable Models: A Bayesian Approach with Structured Sparsity
https://proceedings.mlr.press/v206/qoku23a.html

Many real-world systems are described not only by data from a single source but via multiple data views. In genomic medicine, for instance, patients can be characterized by data from different molecular layers. Latent variable models with structured sparsity are a commonly used tool for disentangling variation within and across data views. However, their interpretability is cumbersome since it requires a direct inspection and interpretation of each factor from domain experts. Here, we propose MuVI, a novel multi-view latent variable model based on a modified horseshoe prior for modeling structured sparsity. This facilitates the incorporation of limited and noisy domain knowledge, thereby allowing for an analysis of multi-view data in an inherently explainable manner. We demonstrate that our model (i) outperforms state-of-the-art approaches for modeling structured sparsity in terms of the reconstruction error and the precision/recall, (ii) robustly integrates noisy domain expertise in the form of feature sets, (iii) promotes the identifiability of factors and (iv) infers interpretable and biologically meaningful axes of variation in a real-world multi-view dataset of cancer patients.
Iterative Teaching by Data Hallucination
https://proceedings.mlr.press/v206/qiu23a.html

We consider the problem of iterative machine teaching, where a teacher sequentially provides examples based on the status of a learner under a discrete input space (i.e., a pool of finite samples), which greatly limits the teacher's capability. To address this issue, we study iterative teaching under a continuous input space where the input example (i.e., image) can be either generated by solving an optimization problem or drawn directly from a continuous distribution. Specifically, we propose data hallucination teaching (DHT) where the teacher can generate input data intelligently based on labels, the learner's status and the target concept. We study a number of challenging teaching setups (e.g., linear/neural learners in omniscient and black-box settings). Extensive empirical results verify the effectiveness of DHT.
PF$^2$ES: Parallel Feasible Pareto Frontier Entropy Search for Multi-Objective Bayesian Optimization
https://proceedings.mlr.press/v206/qing23a.html

We present Parallel Feasible Pareto Frontier Entropy Search ($\mathrm{PF}^2$ES), a novel information-theoretic acquisition function for multi-objective Bayesian optimization supporting unknown constraints and batch queries. Due to the complexity of characterizing the mutual information between candidate evaluations and (feasible) Pareto frontiers, existing approaches must either employ crude approximations that significantly hamper their performance or rely on expensive inference schemes that substantially increase the optimization's computational overhead. By instead using a variational lower bound, $\mathrm{PF}^2$ES provides a low-cost and accurate estimate of the mutual information. We benchmark $\mathrm{PF}^2$ES against other information-theoretic acquisition functions, demonstrating its competitive performance for optimization across synthetic and real-world design problems.
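The Pareto frontier at the core of this abstract is a generic concept that is easy to sketch (this is just the dominance definition, not the acquisition function): for minimization, a point is on the front if no other point is at least as good in every objective and strictly better in one.

```python
import numpy as np

def pareto_front(Y):
    """Boolean mask of the non-dominated rows of an (n, d) objective
    matrix Y (minimization). Row j dominates row i if Y[j] <= Y[i]
    componentwise with at least one strict inequality."""
    n = Y.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        dominated = np.all(Y <= Y[i], axis=1) & np.any(Y < Y[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask
```

The quadratic-time loop is fine for the small candidate sets typical in Bayesian optimization.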
T-Phenotype: Discovering Phenotypes of Predictive Temporal Patterns in Disease Progression
https://proceedings.mlr.press/v206/qin23b.html

Clustering time-series data in healthcare is crucial for clinical phenotyping to understand patients' disease progression patterns and to design treatment guidelines tailored to homogeneous patient subgroups. While rich temporal dynamics enable the discovery of potential clusters beyond static correlations, two major challenges remain outstanding: (i) discovery of predictive patterns from many potential temporal correlations in the multi-variate time-series data and (ii) association of individual temporal patterns to the target label distribution that best characterizes the underlying clinical progression. To address such challenges, we develop a novel temporal clustering method, T-Phenotype, to discover phenotypes of predictive temporal patterns from labeled time-series data. We introduce an efficient representation learning approach in the frequency domain that can encode variable-length, irregularly-sampled time-series into a unified representation space, which is then applied to identify various temporal patterns that potentially contribute to the target label using a new notion of path-based similarity. Throughout the experiments on synthetic and real-world datasets, we show that T-Phenotype achieves the best phenotype discovery performance over all the evaluated baselines. We further demonstrate the utility of T-Phenotype by uncovering clinically meaningful patient subgroups characterized by unique temporal patterns.
An Online and Unified Algorithm for Projection Matrix Vector Multiplication with Application to Empirical Risk Minimization
https://proceedings.mlr.press/v206/qin23a.html

Online matrix-vector multiplication is a fundamental step and bottleneck in many machine learning algorithms. It is defined as follows: given a matrix at the pre-processing phase, at each iteration one receives a query vector and needs to form the matrix-vector product (approximately) before observing the next vector. In this work, we study a particular instance of this problem, called online projection matrix-vector multiplication. Via a reduction, we show it suffices to solve the inverse maintenance problem. Additionally, our framework supports dimensionality reduction to speed up the computation that approximates the matrix-vector product with an optimization-friendly error guarantee. Moreover, our unified approach can handle both data-oblivious sketching and data-dependent sampling. Finally, we demonstrate the effectiveness of our framework by speeding up the empirical risk minimization solver.
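The projection matrix-vector product this abstract refers to, $A(A^\top A)^{-1}A^\top x$, can be applied without ever forming the projection matrix (a plain-numpy illustration of the operation itself, not the paper's online data structure):

```python
import numpy as np

def project(A, x):
    """Apply the orthogonal projection onto the column space of A,
    i.e. compute A (A^T A)^{-1} A^T x, via a least-squares solve
    instead of materializing the n x n projection matrix."""
    coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
    return A @ coeffs
```

Projections are idempotent, which gives a quick sanity check: projecting twice equals projecting once.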
Near-Optimal Differentially Private Reinforcement Learning
https://proceedings.mlr.press/v206/qiao23a.html

Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as $T\rightarrow \infty$. Our techniques, which could be of independent interest, include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
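As a toy illustration of private release of count statistics like the visitation counts above (the paper's release mechanism and sensitivity analysis are more involved; this is only the textbook Laplace mechanism with an assumed per-count sensitivity of 1):

```python
import numpy as np

def private_counts(counts, epsilon, rng):
    """Release counts with the Laplace mechanism: adding independent
    Laplace(1/epsilon) noise to a statistic of sensitivity 1 gives
    epsilon-differential privacy for that statistic."""
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
```

The noise has mean zero and standard deviation $\sqrt{2}/\epsilon$, so released counts stay unbiased while individual contributions are masked.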
Catalyst Acceleration of Error Compensated Methods Leads to Better Communication Complexity
https://proceedings.mlr.press/v206/qian23a.html

Communication overhead is well known to be a key bottleneck in large scale distributed learning, and a particularly successful class of methods which help to overcome this bottleneck is based on the idea of communication compression. Some of the most practically effective gradient compressors, such as TopK, are biased, which causes convergence issues unless one employs a well designed error compensation/feedback mechanism. Error compensation is therefore a fundamental technique in the distributed learning literature. In a recent development, Qian et al. (NeurIPS 2021) showed that the error-compensation mechanism can be combined with acceleration/momentum, which is another key and highly successful optimization technique. In particular, they developed the error-compensated loop-less Katyusha (ECLK) method, and proved an accelerated linear rate in the strongly convex case. However, the dependence of their rate on the compressor parameter does not match the best dependence obtainable in the non-accelerated error-compensated methods. Our work addresses this problem. We propose several new accelerated error-compensated methods using the catalyst acceleration technique, and obtain results that match the best dependence on the compressor parameter in non-accelerated error-compensated methods up to logarithmic terms.
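The TopK compressor and the error-feedback mechanism mentioned in this abstract are standard and can be sketched directly (a generic single-worker step, not the accelerated ECLK method): the entries that compression discards are kept in a local memory and re-injected into later rounds.

```python
import numpy as np

def top_k(v, k):
    """Keep only the k largest-magnitude entries of v (a biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def error_compensated_step(grad, memory, k):
    """One error-feedback round: compress gradient plus accumulated error,
    transmit the compressed vector, store what was left behind."""
    acc = memory + grad
    compressed = top_k(acc, k)
    return compressed, acc - compressed
```

Nothing is lost across rounds: the transmitted vector and the new memory always sum to the old memory plus the fresh gradient.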
https://proceedings.mlr.press/v206/qian23a.htmlSurveillance Evasion Through Bayesian Reinforcement LearningWe consider a task of surveillance-evading path-planning in a continuous setting. An Evader strives to escape from a 2D domain while minimizing the risk of detection (and immediate capture). The probability of detection is path-dependent and determined by the spatially inhomogeneous surveillance intensity, which is fixed but a priori unknown and gradually learned in the multi-episodic setting. We introduce a Bayesian reinforcement learning algorithm that relies on a Gaussian Process regression (to model the surveillance intensity function based on the information from prior episodes), numerical methods for Hamilton-Jacobi PDEs (to plan the best continuous trajectories based on the current model), and Confidence Bounds (to balance the exploration vs exploitation). We use numerical experiments and regret metrics to highlight the significant advantages of our approach compared to traditional graph-based algorithms of reinforcement learning.Tue, 11 Apr 2023 00:00:00 +0000
Preferential Subsampling for Stochastic Gradient Langevin Dynamics
https://proceedings.mlr.press/v206/qi23a.html
Stochastic gradient MCMC (SGMCMC) offers a scalable alternative to traditional MCMC, by constructing an unbiased estimate of the gradient of the log-posterior with a small, uniformly weighted subsample of the data. While efficient to compute, the resulting gradient estimator can exhibit high variance, harming sampler performance. The problem of variance control has been traditionally addressed by constructing a better stochastic gradient estimator, often using control variates. We propose to use a discrete, non-uniform probability distribution to preferentially subsample data points that have a greater impact on the stochastic gradient. In addition, we present a method of adaptively adjusting the subsample size at each iteration of the algorithm, so that we increase the subsample size in areas of the sample space where the gradient is harder to estimate. We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.

AUC-based Selective Classification
https://proceedings.mlr.press/v206/putcha23a.html
Selective classification (or classification with a reject option) pairs a classifier with a selection function to determine whether or not a prediction should be accepted. This framework trades off coverage (probability of accepting a prediction) with predictive performance, typically measured by distributive loss functions. In many application scenarios, such as credit scoring, performance is instead measured by ranking metrics, such as the Area Under the ROC Curve (AUC). We propose a model-agnostic approach to associate a selection function to a given probabilistic binary classifier. The approach is specifically targeted at optimizing the AUC. We provide both theoretical justifications and a novel algorithm, called AUCROSS, to achieve such a goal. Experiments show that our method succeeds in trading off coverage for AUC, improving over existing selective classification methods targeted at optimizing accuracy.

A Contrastive Approach to Online Change Point Detection
https://proceedings.mlr.press/v206/pugnana23a.html
We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.

Distance-to-Set Priors and Constrained Bayesian Inference
https://proceedings.mlr.press/v206/puchkin23a.html
Constrained learning is prevalent in many statistical tasks. Recent work proposes distance-to-set penalties to derive estimators under general constraints that can be specified as sets, but focuses on obtaining point estimates that do not come with corresponding measures of uncertainty. To remedy this, we approach distance-to-set regularization from a Bayesian lens. We consider a class of smooth distance-to-set priors, showing that they yield well-defined posteriors toward quantifying uncertainty for constrained learning problems. We discuss relationships and advantages over prior work on Bayesian constraint relaxation. Moreover, we prove that our approach is optimal in an information-geometric sense for finite penalty parameters $\rho$, and enjoys favorable statistical properties when $\rho \rightarrow \infty$. The method is designed to perform effectively within gradient-based MCMC samplers, as illustrated on a suite of simulated and real data applications.

Ideal Abstractions for Decision-Focused Learning
https://proceedings.mlr.press/v206/presman23a.html
We present a methodology for formulating simplifying abstractions in machine learning systems by identifying and harnessing the utility structure of decisions. Machine learning tasks commonly involve high-dimensional output spaces (e.g., predictions for every pixel in an image or node in a graph), even though a coarser output would often suffice for downstream decision-making (e.g., regions of an image instead of pixels). Developers often hand-engineer abstractions of the output space, but numerous abstractions are possible and it is unclear how the choice of output space for a model impacts its usefulness in downstream decision-making. We propose a method that configures the output space automatically in order to minimize the loss of decision-relevant information. Taking a geometric perspective, we formulate a step of the algorithm as a projection of the probability simplex, termed fold, that minimizes the total loss of decision-related information in the H-entropy sense. Crucially, learning in the abstracted outcome space requires significantly less data, leading to a net improvement in decision quality. We demonstrate the method in two domains: data acquisition for deep neural network training and a closed-loop wildfire management task.

Classification of Adolescents’ Risky Behavior in Instant Messaging Conversations
https://proceedings.mlr.press/v206/poli23a.html
Previous research on detecting risky online behavior has been rather scattered, typically identifying single risks in online samples. To our knowledge, the presented research is the first that presents a process of building models that can efficiently detect the following four types of online risky behavior: (1) aggression, harassment, hate; (2) mental health; (3) use of alcohol and drugs; and (4) sexting. Furthermore, the corpora in this research are unique because of the usage of private instant messaging conversations in the Czech language provided by adolescents. The combination of publicly unavailable and unique data with high-quality annotations of specific psychological phenomena allowed for precise detection using transformer-based machine learning models that can handle sequential data and involve the context of utterances. The impact of the context length and text augmentation on model efficiency is discussed in detail. The final model provides promising results with an acceptable F1 score. Therefore, we believe that the model could be used in various applications, e.g., parental applications, chatbots, or services provided by Internet providers. Future research could investigate the usage of the model in other languages.

Federated Averaging Langevin Dynamics: Toward a unified theory and new algorithms
https://proceedings.mlr.press/v206/plhak23a.html
This paper focuses on Bayesian inference in a federated learning (FL) context. While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated Averaging algorithm to Bayesian inference. We obtain a novel tight non-asymptotic upper bound on the Wasserstein distance to the global posterior for FALD. This bound highlights the effects of statistical heterogeneity, which causes a drift in the local updates that negatively impacts convergence. We propose a new algorithm VR-FALD* that uses control variates to correct the client drift. We establish non-asymptotic bounds showing that VR-FALD* is not affected by statistical heterogeneity. Finally, we illustrate our results on several FL benchmarks for Bayesian inference.

Global-Local Regularization Via Distributional Robustness
https://proceedings.mlr.press/v206/plassier23a.html
Despite superior performance in many situations, deep neural networks are often vulnerable to adversarial examples and distribution shifts, limiting model generalization ability in real-world applications. To alleviate these problems, recent approaches leverage distributionally robust optimization (DRO) to find the most challenging distribution, and then minimize the loss function over this most challenging distribution. Despite achieving some improvements, these DRO approaches have some obvious limitations. First, they purely focus on local regularization to strengthen model robustness, missing a global regularization effect that is useful in many real-world applications (e.g., domain adaptation, domain generalization, and adversarial machine learning). Second, the loss functions in the existing DRO approaches operate on only the most challenging distribution, hence decoupling from the original distribution and leading to a restrictive modeling capability. In this paper, we propose a novel regularization technique in the vein of the Wasserstein-based DRO framework. Specifically, we define a particular joint distribution and Wasserstein-based uncertainty, allowing us to couple the original and most challenging distributions for enhancing modeling capability and applying both local and global regularizations. Empirical studies on different learning problems demonstrate that our proposed approach significantly outperforms the existing regularization approaches in various domains.

Distill n’ Explain: explaining graph neural networks using simple surrogates
https://proceedings.mlr.press/v206/phan23a.html
Explaining node predictions in graph neural networks (GNNs) often boils down to finding graph substructures that preserve predictions. Finding these structures usually implies back-propagating through the GNN, tying the complexity (e.g., number of layers) of the GNN to the cost of explaining it. This naturally raises the question: Can we break this bond by explaining a simpler surrogate GNN? To answer the question, we propose Distill n’ Explain (DnX). First, DnX learns a surrogate GNN via knowledge distillation. Then, DnX extracts node or edge-level explanations by solving a simple convex program. We also propose FastDnX, a faster version of DnX that leverages the linear decomposition of our surrogate model. Experiments show that DnX and FastDnX often outperform state-of-the-art GNN explainers while being orders of magnitude faster. Additionally, we support our empirical findings with theoretical results linking the quality of the surrogate model (i.e., distillation error) to the faithfulness of explanations.

On the Privacy Risks of Algorithmic Recourse
https://proceedings.mlr.press/v206/pereira23a.html
As predictive models are increasingly being employed to make consequential decisions, there is a growing emphasis on developing techniques that can provide algorithmic recourse to affected individuals. While such recourses can be immensely beneficial to affected individuals, potential adversaries could also exploit these recourses to compromise privacy. In this work, we make the first attempt at investigating if and how an adversary can leverage recourses to infer private information about the underlying model’s training data. To this end, we propose a series of novel membership inference attacks which leverage algorithmic recourse. More specifically, we extend the prior literature on membership inference attacks to the recourse setting by leveraging the distances between data instances and their corresponding counterfactuals output by state-of-the-art recourse methods. Extensive experimentation with real world and synthetic datasets demonstrates significant privacy leakage through recourses. Our work establishes unintended privacy leakage as an important risk in the widespread adoption of recourse methods.

Symmetric (Optimistic) Natural Policy Gradient for Multi-Agent Learning with Parameter Convergence
https://proceedings.mlr.press/v206/pawelczyk23a.html
Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the payoffs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation
https://proceedings.mlr.press/v206/pattathil23a.html
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm, when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O(1/t)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation, and show that this variant fares favourably in problems with ill-conditioned features.

Scalable marked point processes for exchangeable and non-exchangeable event sequences
https://proceedings.mlr.press/v206/patil23a.html
We adopt the interpretability offered by a parametric, Hawkes-process-inspired conditional probability mass function for the marks and apply variational inference techniques to derive a general and scalable inferential framework for marked point processes. The framework can handle both exchangeable and non-exchangeable event sequences with minimal tuning and without any pre-training. This contrasts with many parametric and non-parametric state-of-the-art methods that typically require pre-training and/or careful tuning, and can only handle exchangeable event sequences. The framework’s competitive computational and predictive performance against other state-of-the-art methods is illustrated through real data experiments. Its attractiveness for large-scale applications is demonstrated through a case study involving all events occurring in an English Premier League season.

Optimal Algorithms for Latent Bandits with Cluster Structure
https://proceedings.mlr.press/v206/panos23a.html
We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into latent clusters such that the mean reward vectors of users within the same cluster are identical. At each round, a user, selected uniformly at random, pulls an arm and observes a corresponding noisy reward. The goal of the users is to maximize their cumulative rewards. This problem is central to practical recommendation systems and has received wide attention of late (Gentile et al., 2014; Maillard and Mannor, 2014). Now, if each user acts independently, then they would have to explore each arm independently and a regret of $\Omega(\sqrt{MNT})$ is unavoidable, where $M$, $N$ are the number of arms and users, respectively. Instead, we propose LATTICE (Latent bAndiTs via maTrIx ComplEtion) which allows exploration of the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(M+N)T})$ when the number of clusters is $\tilde{O}(1)$. This is the first algorithm to guarantee such a strong regret bound. LATTICE is based on a careful exploitation of arm information within a cluster while simultaneously clustering users. Furthermore, it is computationally efficient and requires only $O(\log T)$ calls to an offline matrix completion oracle across all $T$ rounds.

Learning with Partial Forgetting in Modern Hopfield Networks
https://proceedings.mlr.press/v206/pal23a.html
Neuroscience studies have shown that partial and transient forgetting of memory often plays an important role in the brain, improving performance in certain intellectual activities. In machine learning, associative memory models such as classical and modern Hopfield networks have been proposed to express memories as attractors in the feature space of a closed recurrent network. In this work, we propose learning with partial forgetting (LwPF), where a partial forgetting functionality is designed by element-wise non-bijective projections, for memory neurons in modern Hopfield networks to improve model performance. We also incorporate LwPF into the attention mechanism, whose update process has been shown to be identical to the update rule of a certain modern Hopfield network, by modifying the corresponding Lagrangian. We evaluated the effectiveness of LwPF on three diverse tasks: bit-pattern classification, immune repertoire classification for computational biology, and image classification for computer vision, and confirmed that LwPF consistently improves the performance of existing neural networks including DeepRC and vision transformers.

Temporal Graph Neural Networks for Irregular Data
https://proceedings.mlr.press/v206/ota23a.html
This paper proposes a temporal graph neural network model for forecasting graph-structured, irregularly observed time series. Our TGNN4I model is designed to handle both irregular time steps and partial observations of the graph. This is achieved by introducing a time-continuous latent state in each node, following a linear Ordinary Differential Equation (ODE) defined by the output of a Gated Recurrent Unit (GRU). The ODE has an explicit solution as a combination of exponential decay and periodic dynamics. Observations in the graph neighborhood are taken into account by integrating graph neural network layers in both the GRU state update and predictive model. The time-continuous dynamics additionally enable the model to make predictions at arbitrary time steps. We propose a loss function that leverages this and allows for training the model for forecasting over different time horizons. Experiments on simulated data and real-world data from traffic and climate modeling validate the usefulness of both the graph structure and time-continuous dynamics in settings with irregular observations.

Explicit Regularization in Overparametrized Models via Noise Injection
https://proceedings.mlr.press/v206/oskarsson23a.html
Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.

Average case analysis of Lasso under ultra sparse conditions
https://proceedings.mlr.press/v206/orvieto23a.html
We analyze the performance of the least absolute shrinkage and selection operator (Lasso) for the linear model when the number of regressors $N$ grows large while the true support size $d$ stays finite, i.e., the ultra-sparse case. The result is based on a novel treatment of the non-rigorous replica method in statistical physics, which has been applied only to problem settings where $N$, $d$ and the number of observations $M$ tend to infinity at the same rate. Our analysis makes it possible to assess the average performance of Lasso with Gaussian sensing matrices without assumptions on the scaling of $N$ and $M$, the noise distribution, and the profile of the true signal. Under mild conditions on the noise distribution, the analysis also offers a lower bound on the sample complexity necessary for partial and perfect support recovery when $M$ diverges as $M = O(\log N)$. The obtained bound for perfect support recovery is a generalization of that given in previous literature, which only considers the case of Gaussian noise and diverging $d$. Extensive numerical experiments strongly support our analysis.

Robust Linear Regression for General Feature Distribution
https://proceedings.mlr.press/v206/okajima23a.html
We investigate robust linear regression where data may be contaminated by an oblivious adversary, i.e., an adversary that knows the data distribution but is otherwise oblivious to the realization of the data samples. This model has been previously analyzed under strong assumptions. Concretely, (i) all previous works assume that the covariance matrix of the features is positive definite; (ii) most of them assume that the features are centered. Additionally, all previous works make additional restrictive assumptions, e.g., assuming Gaussianity of the features or symmetric distribution of the corruptions. In this work, we investigate robust regression under a more general set of assumptions: (i) the covariance matrix may be either positive definite or positive semidefinite, (ii) features may not be centered, (iii) no assumptions beyond boundedness (or sub-Gaussianity) of the features and the measurement noise. Under these assumptions we analyze a sequential algorithm, namely, a natural SGD variant for this problem, and show that it enjoys a fast convergence rate when the covariance matrix is positive definite. In the positive semidefinite case we show that there are two regimes: if the features are centered, we can obtain a standard convergence rate; otherwise, the adversary can cause any learner to fail arbitrarily.

SurvivalGAN: Generating Time-to-Event Data for Survival Analysis
https://proceedings.mlr.press/v206/norman23a.html
Synthetic data is becoming an increasingly promising technology, and successful applications can improve privacy, fairness, and data democratization. While there are many methods for generating synthetic tabular data, the task remains non-trivial and unexplored for specific scenarios. One such scenario is survival data. Here, the key difficulty is censoring: for some instances, we are not aware of the time of event, or if one even occurred. Imbalances in censoring and time horizons cause generative models to experience three new failure modes specific to survival analysis: (1) generating too few at-risk members; (2) generating too many at-risk members; and (3) censoring too early. We formalize these failure modes and provide three new generative metrics to quantify them. Following this, we propose SurvivalGAN, a generative model that handles survival data firstly by addressing the imbalance in the censoring and event horizons, and secondly by using a dedicated mechanism for approximating time-to-event/censoring. We evaluate this method via extensive experiments on medical datasets. SurvivalGAN outperforms multiple baselines at generating survival data, and in particular addresses the failure modes as measured by the new metrics, in addition to improving downstream performance of survival models trained on the synthetic data.

Geometric Random Walk Graph Neural Networks via Implicit Layers
https://proceedings.mlr.press/v206/norcliffe23a.html
Graph neural networks have recently attracted a lot of attention and have been applied with great success to several important graph problems. The Random Walk Graph Neural Network model was recently proposed as a more intuitive alternative to the well-studied family of message passing neural networks. This model compares each input graph against a set of latent “hidden graphs” using a kernel that counts common random walks up to some length. In this paper, we propose a new architecture, called Geometric Random Walk Graph Neural Network (GRWNN), that generalizes the above model such that it can count common walks of infinite length in two graphs. The proposed model retains the transparency of Random Walk Graph Neural Networks since its first layer also consists of a number of trainable “hidden graphs” which are compared against the input graphs using the geometric random walk kernel. To compute the kernel, we employ a fixed-point iteration approach involving implicitly defined operations. Then, we capitalize on implicit differentiation to derive an efficient training scheme which requires only constant memory, regardless of the number of fixed-point iterations. Experiments on graph classification datasets demonstrate the effectiveness of the proposed approach in comparison with state-of-the-art methods.

Graph Alignment Kernels using Weisfeiler and Leman Hierarchies
https://proceedings.mlr.press/v206/nikolentzos23c.html
Graph kernels have become a standard approach for tackling the graph similarity and learning tasks at the same time. Most graph kernels proposed so far are instances of the R-convolution framework. These kernels decompose graphs into their substructures and sum over all pairs of these substructures. However, considerably less attention has been paid to other types of kernels. In this paper, we propose a new kernel between graphs which reorders the adjacency matrix of each graph based on soft permutation matrices, and then compares those aligned adjacency matrices to each other using a linear kernel. To compute the permutation matrices, the kernel finds corresponding vertices in different graphs. Two vertices match with each other if the Weisfeiler-Leman test of isomorphism assigns the same label to both of them. The proposed kernel is evaluated on several graph classification and graph regression datasets. Our results indicate that the kernel is competitive with traditional and state-of-the-art methods.

Weisfeiler and Leman go Hyperbolic: Learning Distance Preserving Node Representations
https://proceedings.mlr.press/v206/nikolentzos23b.html
In recent years, graph neural networks (GNNs) have emerged as a promising tool for solving machine learning problems on graphs. Most GNNs are members of the family of message passing neural networks (MPNNs). There is a close connection between these models and the Weisfeiler-Leman (WL) test of isomorphism, an algorithm that can successfully test isomorphism for a broad class of graphs. Recently, much research has focused on measuring the expressive power of GNNs. For instance, it has been shown that standard MPNNs are at most as powerful as WL in terms of distinguishing non-isomorphic graphs. However, these studies have largely ignored the distances between the representations of the nodes/graphs which are of paramount importance for learning tasks. In this paper, we define a distance function between nodes which is based on the hierarchy produced by the WL algorithm, and propose a model that learns representations which preserve those distances between nodes. Since the emerging hierarchy corresponds to a tree, to learn these representations, we capitalize on recent advances in the field of hyperbolic neural networks. We empirically evaluate the proposed model on standard graph and node classification datasets where it achieves competitive performance with state-of-the-art models.

Online Defense Strategies for Reinforcement Learning Against Adaptive Reward Poisoning
https://proceedings.mlr.press/v206/nikolentzos23a.html
We consider the problem of defense against reward-poisoning attacks in reinforcement learning and formulate it as a game in $T$ rounds between a defender and an adaptive attacker in an adversarial environment. To address this problem, we design two novel defense algorithms. First, we propose Exp3-DARP, a defense algorithm that uses Exp3 as a hyperparameter learning subroutine, and show that it achieves order-optimal $\tilde{\Theta}(T^{1/2})$ bounds on our notion of regret with respect to a defense that always picks the optimal parameter in hindsight. We show that the order of $T$ in the bounds cannot be improved when the reward arrival process is adversarial, even if the feedback model of the defense is stronger. However, assuming that the environment is stochastic, we propose OMDUCB-DARP, which uses estimates of costs as proxies to update the randomized strategy of the learner, and show that it substantially improves the bounds, proportionally to how smoothly the attacker’s strategy changes. Furthermore, we show that weaker types of defense, that do not take into account the attack structure and the poisoned rewards, suffer linear regret with respect to a defender that always selects the optimal parameter in hindsight when faced with an adaptive attacker that uses a no-regret algorithm to learn the behavior of the defense. Finally, we support our theoretical results with experimental evaluations on three different environments, showcasing the efficiency of our methods.

Active Membership Inference Attack under Local Differential Privacy in Federated Learning
https://proceedings.mlr.press/v206/nika23a.html
Federated learning (FL) was originally regarded as a framework for collaborative learning among clients with data privacy protection through a coordinating server. In this paper, we propose a new active membership inference (AMI) attack carried out by a dishonest server in FL. In AMI attacks, the server crafts and embeds malicious parameters into global models to effectively infer whether a target data sample is included in a client’s private training data or not. By exploiting the correlation among data features through a non-linear decision boundary, AMI attacks with a certified guarantee of success can achieve severely high success rates under rigorous local differential privacy (LDP) protection; thereby exposing clients’ training data to significant privacy risk. Theoretical and experimental results on several benchmark datasets show that adding sufficient privacy-preserving noise to prevent our attack would significantly damage FL’s model utility.

https://proceedings.mlr.press/v206/nguyen23e.html
https://proceedings.mlr.press/v206/nguyen23e.html

Nonmyopic Multiclass Active Search with Diminishing Returns for Diverse Discovery
Active search is a setting in adaptive experimental design where we aim to uncover members of rare, valuable class(es) subject to a budget constraint. An important consideration in this problem is diversity among the discovered targets – in many applications, diverse discoveries offer more insight and may be preferable in downstream tasks. However, most existing active search policies either assume that all targets belong to a common positive class or encourage diversity via simple heuristics. We present a novel formulation of active search with multiple target classes, characterized by a utility function chosen from a flexible family whose members encourage diversity among discoveries via a diminishing returns mechanism. We then study this problem under the Bayesian lens and prove a hardness result for approximating the optimal policy for arbitrary positive, increasing, and concave utility functions. Finally, we design an efficient, nonmyopic approximation to the optimal policy for this class of utilities and demonstrate its superior empirical performance in a variety of experimental settings, including drug discovery.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/nguyen23d.html
https://proceedings.mlr.press/v206/nguyen23d.html

Asymptotic Bayes risk of semi-supervised multitask learning on Gaussian mixture
The article considers semi-supervised multitask learning on a Gaussian mixture model (GMM). Using methods from statistical physics, we compute the asymptotic Bayes risk of each task in the regime of large datasets in high dimension, from which we analyze the role of task similarity in learning and evaluate the performance gain when tasks are learned together rather than separately. In the supervised case, we derive a simple algorithm that attains the Bayes optimal performance.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/nguyen23c.html
https://proceedings.mlr.press/v206/nguyen23c.html

Feasible Recourse Plan via Diverse Interpolation
Explaining algorithmic decisions and recommending actionable feedback is increasingly important for machine learning applications. Recently, significant efforts have been invested in finding a diverse set of recourses to cover the wide spectrum of users’ preferences. However, existing works often neglect the requirement that the recourses should be close to the data manifold; hence, the constructed recourses might be implausible and unsatisfying to users. To address these issues, we propose a novel approach that explicitly directs the diverse set of actionable recourses towards the data manifold. We first find a diverse set of prototypes in the favorable class that balances the trade-off between diversity and proximity. We demonstrate two specific methods to find these prototypes: either by finding the maximum a posteriori estimate of a determinantal point process or by solving a quadratic binary program. To ensure the actionability constraints, we construct an actionability graph in which the nodes represent the training samples and the edges indicate the feasible action between two instances. We then find a feasible path to each prototype, and this path demonstrates the feasible actions for each recourse in the plan. The experimental results show that our method produces a set of recourses that are close to the data manifold while delivering a better cost-diversity trade-off than existing approaches.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/nguyen23b.html
https://proceedings.mlr.press/v206/nguyen23b.html

Optimal and Private Learning from Human Response Data
Item response theory (IRT) is the study of how people make probabilistic decisions, with diverse applications in educational testing, recommendation systems, and beyond. The Rasch model of binary response data, one of the most fundamental models in IRT, remains an active area of research with important practical significance. Recently, Nguyen and Zhang (2022) proposed a new spectral estimation algorithm that is efficient and accurate. In this work, we extend their results in two important ways. First, we obtain a refined entrywise error bound for the spectral algorithm, complementing the ‘average error’ $\ell_2$ bound in their work. Notably, under mild sampling conditions, the spectral algorithm achieves the minimax optimal entrywise error bound (modulo a log factor). Building on the refined analysis, we also show that the spectral algorithm enjoys optimal sample complexity for top-$K$ recovery (e.g., identifying the best $K$ items from approval/disapproval response data), explaining interesting empirical findings in the previous work. Our second contribution addresses an important but understudied topic in IRT: privacy. Despite the human-centric applications of IRT, there has not been any proposed privacy-preserving mechanism in the literature. We develop a private extension of the spectral algorithm, leveraging its unique Markov chain formulation and the discrete Gaussian mechanism (Canonne et al., 2020). Experiments show that our approach is significantly more accurate than the baselines in the low-to-moderate privacy regime.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/nguyen23a.html
https://proceedings.mlr.press/v206/nguyen23a.html

Dimensionality Collapse: Optimal Measurement Selection for Low-Error Infinite-Horizon Forecasting
This work introduces a method to select linear functional measurements of a vector-valued time series optimized for forecasting distant time-horizons. By formulating and solving the problem of sequential linear measurement design as an infinite-horizon problem with the time-averaged trace of the Cramér-Rao lower bound (CRLB) for forecasting as the cost, the most informative data can be collected irrespective of the eventual forecasting algorithm. By introducing theoretical results regarding measurements under additive noise from natural exponential families, we construct an equivalent problem from which a local dimensionality reduction can be derived. This alternative formulation is based on the future collapse of dimensionality inherent in the limiting behavior of many differential equations and can be directly observed in the low-rank structure of the CRLB for forecasting. Implementations of both an approximate dynamic programming formulation and the proposed alternative are illustrated using an extended Kalman filter for state estimation, with results on simulated systems with limit cycles and chaotic behavior demonstrating a linear improvement in the CRLB as a function of the number of collapsing dimensions of the system.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/naumer23a.html
https://proceedings.mlr.press/v206/naumer23a.html

Collision Probability Matching Loss for Disentangling Epistemic Uncertainty from Aleatoric Uncertainty
Two important aspects of machine learning, uncertainty and calibration, have previously been studied separately. The first aspect involves knowing whether inaccuracy is due to the epistemic uncertainty of the model, which is theoretically reducible, or to the aleatoric uncertainty in the data per se, which thus becomes the upper bound of model performance. As for the second aspect, numerous calibration methods have been proposed to correct predictive probabilities to better reflect the true probabilities of being correct. In this paper, we aim to obtain the squared error of the predictive distribution from the true distribution as epistemic uncertainty. Our formulation, based on second-order Rényi entropy, integrates the two problems into a unified framework and obtains the epistemic (un)certainty as the difference between the aleatoric and predictive (un)certainties. As an auxiliary loss to ordinary losses, such as cross-entropy loss, the proposed collision probability matching (CPM) loss matches the cross-collision probability between the true and predictive distributions to the collision probability of the predictive distribution, where these probabilities correspond to accuracy and confidence, respectively. Unlike previous Shannon-entropy-based uncertainty methods, the proposed method makes the aleatoric uncertainty directly measurable as test-retest reliability, which is a summary statistic of the true distribution frequently used in scientific research on humans. We provide mathematical proof and strong experimental evidence for our formulation using both a real dataset of human ratings of emotional faces and simulations.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/narimatsu23a.html
https://proceedings.mlr.press/v206/narimatsu23a.html

Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model
Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of “us-versus-them” that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the Signed Latent Distance Model (SLDM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks. We further extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes, forming the Signed Latent Relational Distance Model (SLIM). On four real social signed networks of polarization, we demonstrate that the models extract low-dimensional characterizations that well predict friendships and animosity while SLIM provides interpretable visualizations defined by extreme positions when restricting the embedding space to polytopes.
Tue, 11 Apr 2023 00:00:00 +0000
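The Skellam likelihood at the core of the model above (the difference of two Poisson counts, capturing net positive-minus-negative interactions) is available in SciPy. The sketch below uses a hypothetical distance-based rate parameterization with illustrative bias terms `gamma_plus`/`gamma_minus`; it is not the paper's exact specification.

```python
import numpy as np
from scipy.stats import skellam

def signed_edge_loglik(z, gamma_plus, gamma_minus, edges):
    """Log-likelihood of integer signed edge weights under a Skellam model.

    z: (n, d) latent positions; edges: iterable of (i, j, weight).
    Rates decay with latent distance, so nearby nodes interact more.
    """
    ll = 0.0
    for i, j, y in edges:
        dist = np.linalg.norm(z[i] - z[j])
        lam_pos = np.exp(gamma_plus - dist)   # rate of positive interactions
        lam_neg = np.exp(gamma_minus - dist)  # rate of negative interactions
        ll += skellam.logpmf(y, lam_pos, lam_neg)
    return ll
```

A signed weight of +2 between nearby nodes is thus far more likely than the same weight between distant ones, which is what drives the embedding.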
https://proceedings.mlr.press/v206/nakis23a.html
https://proceedings.mlr.press/v206/nakis23a.html

Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data
Language-supervised vision models have recently attracted great attention in computer vision. A common approach to build such models is to use contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, (i) we initiate the investigation of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL) including CLIP loss and show its connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL under linear representation settings. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality even under the presence of wrongly matched pairs. This characterizes the robustness of MMCL to noisy data. Furthermore, when we have access to additional unpaired data, (iii) we propose a new MMCL loss that incorporates additional unpaired datasets. We show that the algorithm can detect the ground-truth pairs and improve performance by fully exploiting unpaired datasets. The performance of the proposed algorithm was verified by numerical experiments.
Tue, 11 Apr 2023 00:00:00 +0000
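The cross-covariance matrix at the heart of the SVD connection above is easy to form on synthetic paired features; a minimal sketch (variable names and the toy data-generating process are our own, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_cross_covariance(F, G):
    """Cross-covariance between row-paired feature matrices.

    F: (n, d1) features from modality one; G: (n, d2) features from
    modality two, where row i of F and row i of G form a pair.
    """
    F = F - F.mean(axis=0)
    G = G - G.mean(axis=0)
    return F.T @ G / len(F)

# Toy paired modalities: G is a noisy linear view of F.
F = rng.normal(size=(500, 8))
G = F @ rng.normal(size=(8, 6)) + 0.1 * rng.normal(size=(500, 6))

C = contrastive_cross_covariance(F, G)
U, s, Vt = np.linalg.svd(C)  # top singular directions = dominant shared structure
```

The singular vectors of `C` pick out the directions shared across the two modalities, which is the structure gradient steps on the MMCL loss progressively recover.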
https://proceedings.mlr.press/v206/nakada23a.html
https://proceedings.mlr.press/v206/nakada23a.html

Conjugate Gradient Method for Generative Adversarial Networks
One of the training strategies of generative models is to minimize the Jensen–Shannon divergence between the model distribution and the data distribution. Since the data distribution is unknown, generative adversarial networks (GANs) formulate this problem as a game between two models, a generator and a discriminator. The training can be formulated in the context of game theory and the local Nash equilibrium (LNE). It does not seem feasible to derive guarantees of stability or optimality for the existing methods. This optimization problem is far more challenging than the single objective setting. Here, we use the conjugate gradient method to reliably and efficiently solve the LNE problem in GANs. We give a proof and convergence analysis under mild assumptions showing that the proposed method converges to an LNE with three different learning rate update rules, including a constant learning rate. Finally, we demonstrate that the proposed method outperforms stochastic gradient descent (SGD) and momentum SGD in terms of best Fréchet inception distance (FID) score and outperforms Adam on average.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/naganuma23a.html
https://proceedings.mlr.press/v206/naganuma23a.html

Active Exploration via Experiment Design in Markov Chains
A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget into measurements to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a Markov chain, and we can only choose them by selecting a policy controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm – markov-design – that efficiently selects policies whose measurement allocation provably converges to the optimal one. The algorithm is sequential in nature, adapting its choice of policies (experiments) using past measurements. In addition to our theoretical analysis, we demonstrate our framework on applications in ecological surveillance and pharmacology.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mutny23a.html
https://proceedings.mlr.press/v206/mutny23a.html

Resolving the Approximability of Offline and Online Non-monotone DR-Submodular Maximization over General Convex Sets
In recent years, maximization of DR-submodular continuous functions has become an important research field, with many real-world applications in the domains of machine learning, communication systems, operations research, and economics. Most of the works in this field study maximization subject to down-closed convex set constraints due to an inapproximability result by Vondrak (2013). However, Durr et al. (2021) showed that one can bypass this inapproximability by proving approximation ratios that are functions of $m$, the minimum $\ell_\infty$ norm of any feasible vector. Given this observation, it is possible to get results for maximizing a DR-submodular function subject to general convex set constraints, which has led to multiple works on this problem. The most recent of these is a polynomial-time $\frac{1}{4}(1-m)$-approximation offline algorithm due to Du (2022). However, only a sub-exponential time $(1-m)/3^{1.5}$-approximation algorithm is known for the corresponding online problem. In this work, we present a polynomial-time online algorithm matching the $\frac{1}{4}(1-m)$-approximation of the state-of-the-art offline algorithm. We also present an inapproximability result showing that our online algorithm and Du’s (2022) offline algorithm are both optimal in a strong sense. Finally, we study the empirical performance of our algorithm and the algorithm of Du (2022) (which was previously only studied theoretically), and show that they consistently outperform previously suggested algorithms on revenue maximization, location summarization, and quadratic programming applications.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mualem23a.html
https://proceedings.mlr.press/v206/mualem23a.html

Who Should Predict? Exact Algorithms For Learning to Defer to Humans
Automated AI classifiers should be able to defer the prediction to a human decision maker to ensure more accurate predictions. In this work, we jointly train a classifier with a rejector, which decides on each data point whether the classifier or the human should predict. We show that prior approaches can fail to find a human-AI system with low misclassification error even when there exists a linear classifier and rejector that have zero error (the realizable setting). We prove that obtaining a linear pair with low error is NP-hard even when the problem is realizable. To complement this negative result, we give a mixed-integer-linear-programming (MILP) formulation that can optimally solve the problem in the linear setting. However, the MILP only scales to moderately-sized problems. Therefore, we provide a novel surrogate loss function that is realizable-consistent and performs well empirically. We test our approaches on a comprehensive set of datasets and compare to a wide range of baselines.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mozannar23a.html
https://proceedings.mlr.press/v206/mozannar23a.html

Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation
Sparse Gaussian processes are a key component of high-throughput Bayesian optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of determinantal point processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable for a broad range of downstream sequential decision making tasks.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/moss23a.html
https://proceedings.mlr.press/v206/moss23a.html

On the Calibration of Probabilistic Classifier Sets
Multi-class classification methods that produce sets of probabilistic classifiers, such as ensemble learning methods, are able to model aleatoric and epistemic uncertainty. Aleatoric uncertainty is then typically quantified via the Bayes error, and epistemic uncertainty via the size of the set. In this paper, we extend the notion of calibration, which is commonly used to evaluate the validity of the aleatoric uncertainty representation of a single probabilistic classifier, to assess the validity of an epistemic uncertainty representation obtained by sets of probabilistic classifiers. Broadly speaking, we call a set of probabilistic classifiers calibrated if one can find a calibrated convex combination of these classifiers. To evaluate this notion of calibration, we propose a novel nonparametric calibration test that generalizes an existing test for single probabilistic classifiers to the case of sets of probabilistic classifiers. Making use of this test, we empirically show that ensembles of deep neural networks are often not well calibrated.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mortier23a.html
https://proceedings.mlr.press/v206/mortier23a.html

Connectivity-contrastive learning: Combining causal discovery and representation learning for multimodal data
Causal discovery methods typically extract causal relations between multiple nodes (variables) based on univariate observations of each node. However, one frequently encounters situations where each node is multivariate, i.e. has multiple observational modalities. Furthermore, the observed modalities may be generated through an unknown mixing process, so that some original latent variables are entangled inside the nodes. In such a multimodal case, the existing frameworks cannot be applied. To analyze such data, we propose a new causal representation learning framework called connectivity-contrastive learning (CCL). CCL disentangles the observational mixing and extracts a set of mutually independent latent components, each having a separate causal structure between the nodes. The actual learning proceeds by a novel self-supervised learning method in which the pretext task is to predict the label of a pair of nodes from the observations of the node pairs. We present theorems which show that CCL can indeed identify both the latent components and the multimodal causal structure under weak technical assumptions, up to some indeterminacy. Finally, we experimentally show its superior causal discovery performance compared to state-of-the-art baselines, in particular demonstrating robustness against latent confounders.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/morioka23a.html
https://proceedings.mlr.press/v206/morioka23a.html

Minority Oversampling for Imbalanced Data via Class-Preserving Regularized Auto-Encoders
Class imbalance is a common phenomenon in multiple application domains such as healthcare, where the sample occurrence of one or few class categories is more prevalent in the dataset than the rest. This work addresses the class-imbalance issue by proposing an over-sampling method for the minority classes in the latent space of a Regularized Auto-Encoder (RAE). Specifically, we construct a latent space by maximizing the conditional data likelihood using an Encoder-Decoder structure, such that oversampling through convex combinations of latent samples preserves the class identity. A jointly-trained linear classifier that separates convexly coupled latent vectors from different classes is used to impose this property on the AE’s latent space. Further, the aforesaid linear classifier is used for final classification without retraining. We theoretically show that our method can achieve a low variance risk estimate compared to naive oversampling methods and is robust to overfitting. We conduct several experiments on benchmark datasets and show that our method outperforms the existing oversampling techniques for handling class imbalance. The code of the proposed method is available at: https://github.com/arnabkmondal/oversamplingrae.
Tue, 11 Apr 2023 00:00:00 +0000
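The convex-combination oversampling step described above can be sketched in a few lines of NumPy. Function and argument names are our own, and this omits the jointly trained encoder, decoder, and linear classifier that make the combinations class-preserving.

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_latent(z_minority, n_new, alpha_low=0.2, alpha_high=0.8):
    """Generate synthetic minority-class latent codes as convex
    combinations of random pairs of existing minority codes."""
    n = len(z_minority)
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    lam = rng.uniform(alpha_low, alpha_high, size=(n_new, 1))
    return lam * z_minority[i] + (1 - lam) * z_minority[j]
```

Because each synthetic code is a convex combination of two real codes, every coordinate stays within the range spanned by the minority samples; the paper's regularization is what additionally guarantees these interpolants decode to the right class.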
https://proceedings.mlr.press/v206/mondal23a.html
https://proceedings.mlr.press/v206/mondal23a.htmlImproving Dual-Encoder Training through Dynamic Indexes for Negative MiningDual encoder models are ubiquitous in modern classification and retrieval. Crucial for training such dual encoders is an accurate estimation of gradients from the partition function of the softmax over the large output space; this requires finding negative targets that contribute most significantly (‘hard negatives). Since dual encoder model parameters change during training, the use of traditional static nearest neighbor indexes can be sub-optimal. These static indexes (1) periodically require expensive re-building of the index, which in turn requires (2) expensive re-encoding of all targets using updated model parameters. This paper addresses both of these challenges. First, we introduce an algorithm that uses a tree structure to approximate the softmax with provable bounds and that dynamically maintains the tree. Second, we approximate the effect of a gradient update on target encodings with an efficient Nystrom low-rank approximation. In our empirical study on datasets with over twenty million targets, our approach cuts error by half in relation to oracle brute-force negative mining. Furthermore, our method surpasses prior state-of-the-art while using 150x less accelerator memory.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/monath23a.html
https://proceedings.mlr.press/v206/monath23a.htmlPerformative Prediction with Neural NetworksPerformative prediction is a framework for learning models that influence the data they intend to predict. We focus on finding classifiers that are performatively stable, i.e. optimal for the data distribution they induce. Standard convergence results for finding a performatively stable classifier with the method of repeated risk minimization assume that the data distribution is Lipschitz continuous to the model’s parameters. Under this assumption, the loss must be strongly convex and smooth in these parameters; otherwise, the method will diverge for some problems. In this work, we instead assume that the data distribution is Lipschitz continuous with respect to the model’s predictions, a more natural assumption for performative systems. As a result, we are able to significantly relax the assumptions on the loss function. In particular, we do not need to assume convexity with respect to the model’s parameters. As an illustration, we introduce a resampling procedure that models realistic distribution shifts and show that it satisfies our assumptions. We support our theory by showing that one can learn performatively stable classifiers with neural networks making predictions about real data that shift according to our proposed procedure.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mofakhami23a.html
https://proceedings.mlr.press/v206/mofakhami23a.htmlMatching Map Recovery with an Unknown Number of OutliersWe consider the problem of finding the matching map between two sets of $d$-dimensional noisy feature-vectors. The distinctive feature of our setting is that we do not assume that all the vectors of the first set have their corresponding vector in the second set. If $n$ and $m$ are the sizes of these two sets, we assume that the matching map that should be recovered is defined on a subset of unknown cardinality $k^*\le \min(n,m)$. We show that, in the high-dimensional setting, if the signal-to-noise ratio is larger than $5(d\log(4nm/\alpha))^{1/4}$, then the true matching map can be recovered with probability $1-\alpha$. Interestingly, this threshold does not depend on $k^*$ and is the same as the one obtained in prior work in the case of $k = \min(n,m)$. The procedure for which the aforementioned property is proved is obtained by a data-driven selection among candidate mappings $\{\hat\pi_k:k\in[\min(n,m)]\}$. Each $\hat\pi_k$ minimizes the sum of squares of distances between two sets of size $k$. The resulting optimization problem can be formulated as a minimum-cost flow problem, and thus solved efficiently. Finally, we report the results of numerical experiments on both synthetic and real-world data that illustrate our theoretical results and provide further insight into the properties of the algorithms studied in this work.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/minasyan23a.html
https://proceedings.mlr.press/v206/minasyan23a.htmlMulti-Fidelity Bayesian Optimization with Unreliable Information SourcesBayesian optimization (BO) is a powerful framework for optimizing black-box, expensive-to-evaluate functions. Over the past decade, many algorithms have been proposed to integrate cheaper, lower-fidelity approximations of the objective function into the optimization process, with the goal of converging towards the global optimum at a reduced cost. This task is generally referred to as multi-fidelity Bayesian optimization (MFBO). However, MFBO algorithms can lead to higher optimization costs than their vanilla BO counterparts, especially when the low-fidelity sources are poor approximations of the objective function, therefore defeating their purpose. To address this issue, we propose rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable information sources. rMFBO comes with a theoretical guarantee that its performance can be bound to its vanilla BO analog, with high controllable probability. We demonstrate the effectiveness of the proposed methodology on a number of numerical benchmarks, outperforming earlier MFBO methods on unreliable sources. We expect rMFBO to be particularly useful to reliably include human experts with varying knowledge within BO processes.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mikkola23a.html
https://proceedings.mlr.press/v206/mikkola23a.htmlNothing but Regrets — Privacy-Preserving Federated Causal DiscoveryIn critical applications, causal models are the prime choice for their trustworthiness and explainability. If data is inherently distributed and privacy-sensitive, federated learning allows for collaboratively training a joint model. Existing approaches for federated causal discovery share locally discovered causal model in every iteration, therewith not only revealing local structure but also leading to very high communication costs. Instead, we propose an approach for privacy-preserving federated causal discovery by distributed min-max regret optimization. We prove that max-regret is a consistent scoring criterion that can be used within the well-known Greedy Equivalence Search to discover causal networks in a federated setting and is provably privacy-preserving at the same time. Through extensive experiments, we show that our approach reliably discovers causal networks without ever looking at local data and beats the state of the art both in terms of the quality of discovered causal networks as well as communication efficiency.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/mian23a.html
https://proceedings.mlr.press/v206/mian23a.htmlA Tale of Sampling and Estimation in Discounted Reinforcement LearningThe most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through a finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study on the pitfalls of episodic sampling, and how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis on a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties w.r.t. the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/metelli23a.html
https://proceedings.mlr.press/v206/metelli23a.html

On Model Selection Consistency of Lasso for High-Dimensional Ising Models
We theoretically analyze the model selection consistency of least absolute shrinkage and selection operator (Lasso), both with and without post-thresholding, for high-dimensional Ising models. For random regular (RR) graphs of size $p$ with regular node degree $d$ and uniform couplings $\theta_0$, it is rigorously proved that Lasso without post-thresholding is model selection consistent in the whole paramagnetic phase with the same order of sample complexity $n=\Omega(d^3\log p)$ as that of $\ell_1$-regularized logistic regression ($\ell_1$-LogR). This result is consistent with the conjecture of Meng, Obuchi, and Kabashima (2021) based on the non-rigorous replica method from statistical physics, and thus complements it with a rigorous proof. For general tree-like graphs, it is demonstrated that the same result as for RR graphs can be obtained under mild assumptions of the dependency condition and incoherence condition. Moreover, we provide a rigorous proof of the model selection consistency of Lasso with post-thresholding for general tree-like graphs in the paramagnetic phase without further assumptions on the dependency and incoherence conditions. Experimental results agree well with our theoretical analysis.
Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/meng23a.html

Singular Value Representation: A New Graph Perspective On Neural Networks

We introduce the Singular Value Representation (SVR), a new method to represent the internal state of neural networks using SVD factorization of the weights. This construction yields a new weighted graph connecting what we call spectral neurons, which correspond to specific activation patterns of classical neurons. We derive a precise statistical framework to discriminate meaningful connections between spectral neurons for fully connected and convolutional layers. To demonstrate the usefulness of our approach for machine learning research, we highlight two discoveries we made using the SVR. First, we highlight the emergence of a dominant connection in VGG networks that spans multiple deep layers. Second, we witness, without relying on any input data, that batch normalization can induce significant connections between near-kernels of deep layers, leading to a remarkable spontaneous sparsification phenomenon. Code: a Python implementation of the SVR can be found at https://github.com/danmlr/svr.
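The core construction can be sketched in a few lines. The snippet below is an illustrative reading, not the paper's exact statistical framework: it factors two consecutive fully connected layers with the SVD and measures how the output modes of the first align with the input modes of the second; the alignment measure is our assumption for illustration.

```python
import numpy as np

# Illustrative sketch only (not the paper's exact test): SVD-factor two
# consecutive fully connected weight matrices and measure mode alignment.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 32))   # layer 1 weights: 32 -> 64
W2 = rng.standard_normal((16, 64))   # layer 2 weights: 64 -> 16

U1, s1, V1t = np.linalg.svd(W1, full_matrices=False)
U2, s2, V2t = np.linalg.svd(W2, full_matrices=False)

# Rows index layer-2 input modes, columns index layer-1 output modes; a large
# entry suggests a strong edge between the corresponding "spectral neurons".
alignment = np.abs(V2t @ U1)
```

Since the singular vectors are unit norm, every entry of `alignment` lies in [0, 1]; a spectral-graph edge would connect modes where the value is statistically large.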
https://proceedings.mlr.press/v206/meller23a.html

Stochastic Optimization for Spectral Risk Measures

Spectral risk objectives – also called L-risks – allow for learning systems to interpolate between optimizing average-case performance (as in empirical risk minimization) and worst-case performance on a task. We develop LSVRG, a stochastic algorithm to optimize these quantities by characterizing their subdifferential and addressing challenges such as the bias of subgradient estimates and the non-smoothness of the objective. We show theoretically and experimentally that out-of-the-box approaches such as stochastic subgradient and dual averaging can be hindered by bias, whereas our approach exhibits linear convergence.
https://proceedings.mlr.press/v206/mehta23b.html

Thresholded Linear Bandits

We introduce the thresholded linear bandit problem, a novel sequential decision-making problem at the interface of structured stochastic multi-armed bandits and learning halfspaces. The set of arms is $[0, 1]^d$, the expected Bernoulli reward is piecewise constant with a jump at a separating hyperplane, and each arm is associated with a cost that is a positive linear combination of the arm’s components. This problem is motivated by several practical applications. For instance, imagine tuning the continuous features of an offer to a consumer; higher values incur higher cost to the vendor but result in a more attractive offer. At some threshold, the offer becomes attractive enough that a random consumer accepts it at the higher probability level. For the one-dimensional case, we present Leftist, which enjoys $\log^2 T$ problem-dependent regret in favorable cases and has $\log(T) \sqrt{T}$ worst-case regret; we also give a lower bound suggesting this is unimprovable. We then present MD-Leftist, our extension of Leftist to the multi-dimensional case, which obtains similar regret bounds but with $d^{2.5} \log d$ and $d^{1.5} \log d$ dependence on dimension for the two types of bounds respectively. Finally, we experimentally evaluate Leftist.
https://proceedings.mlr.press/v206/mehta23a.html

Nonparametric Gaussian Process Covariances via Multidimensional Convolutions

A key challenge in the practical application of Gaussian processes (GPs) is selecting a proper covariance function. The process convolutions construction of GPs allows some additional flexibility, but still requires choosing a proper smoothing kernel, which is non-trivial. Previous approaches have built covariance functions by placing GP priors over the smoothing kernel, and by extension the covariance, as a way to bypass the need to specify it in advance. However, these models have been limited in several ways: they are restricted to single-dimensional inputs, e.g., time; they only allow modelling of single outputs; and they do not scale to large datasets, since inference is not straightforward. In this paper, we introduce a nonparametric process convolution formulation for GPs that alleviates these weaknesses. We achieve this using a functional sampling approach based on Matheron’s rule to perform fast sampling using interdomain inducing variables. We test the performance of our model on benchmarks for single-output, multi-output and large-scale GP regression, and find that our approach can provide improvements over standard GP models, particularly for larger datasets.
https://proceedings.mlr.press/v206/mcdonald23a.html

Discovering Many Diverse Solutions with Bayesian Optimization

Bayesian optimization (BO) is a popular approach for sample-efficient optimization of black-box objective functions. While BO has been successfully applied to a wide range of scientific applications, traditional approaches to single-objective BO only seek to find a single best solution. This can be a significant limitation in situations where solutions may later turn out to be intractable; for example, a designed molecule may turn out to violate constraints that can only be evaluated after the optimization process has concluded. To address this issue, we propose Rank-Ordered Bayesian Optimization with Trust-regions (ROBOT), which aims to find a portfolio of high-performing solutions that are diverse according to a user-specified diversity measure. We evaluate ROBOT on several real-world applications and show that it can discover large sets of high-performing diverse solutions while requiring few additional function evaluations compared to finding a single best solution.
https://proceedings.mlr.press/v206/maus23a.html

Bures-Wasserstein Barycenters and Low-Rank Matrix Recovery

We revisit the problem of recovering a low-rank positive semidefinite matrix from rank-one projections using tools from optimal transport. More specifically, we show that a variational formulation of this problem is equivalent to computing a Wasserstein barycenter. In turn, this new perspective enables the development of new geometric first-order methods with strong convergence guarantees in Bures-Wasserstein distance. Experiments on simulated data demonstrate the advantages of our new methodology over existing methods.
https://proceedings.mlr.press/v206/maunu23a.html

Simulator-Based Inference with WALDO: Confidence Regions by Leveraging Prediction Algorithms and Posterior Estimators for Inverse Problems

Prediction algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method to construct confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
https://proceedings.mlr.press/v206/masserano23a.html

But Are You Sure? An Uncertainty-Aware Perspective on Explainable AI

Although black-box models can accurately predict outcomes such as weather patterns, they often lack transparency, making it challenging to extract meaningful insights (such as which atmospheric conditions signal future rainfall). Model explanations attempt to identify the essential features of a model, but these explanations can be inconsistent: two near-optimal models may admit vastly different explanations. In this paper, we propose a solution to this problem by constructing uncertainty sets for explanations of the optimal model(s) in both frequentist and Bayesian settings. Our uncertainty sets are guaranteed to include the explanation of the optimal model with high probability, even though this model is unknown. We demonstrate the effectiveness of our approach in both synthetic and real-world experiments, illustrating how our uncertainty sets can be used to calibrate trust in model explanations.
https://proceedings.mlr.press/v206/marx23a.html

Federated Learning for Data Streams

Federated learning (FL) is an effective solution to train machine learning models on the increasing amount of data generated by IoT devices and smartphones while keeping such data localized. Most previous work on federated learning assumes that clients operate on static datasets collected before training starts. This approach may be inefficient because 1) it ignores new samples clients collect during training, and 2) it may require a potentially long preparatory phase for clients to collect enough data. Moreover, learning on static datasets may be simply impossible in scenarios with small aggregate storage across devices. It is, therefore, necessary to design federated algorithms able to learn from data streams. In this work, we formulate and study the problem of federated learning for data streams. We propose a general FL algorithm to learn from data streams through a suitably weighted empirical risk minimization. Our theoretical analysis provides insights to configure such an algorithm, and we evaluate its performance on a wide range of machine learning tasks.
https://proceedings.mlr.press/v206/marfoq23a.html

Equivariant Representation Learning via Class-Pose Decomposition

We introduce a general method for learning representations that are equivariant to symmetries of data. Our central idea is to decompose the latent space into an invariant factor and the symmetry group itself. The components semantically correspond to intrinsic data classes and poses respectively. The learner is trained on a loss encouraging equivariance based on supervision from relative symmetry information. The approach is motivated by theoretical results from group theory and guarantees representations that are lossless, interpretable and disentangled. We provide an empirical investigation via experiments involving datasets with a variety of symmetries. Results show that our representations capture the geometry of data and outperform other equivariant representation learning frameworks.
https://proceedings.mlr.press/v206/marchetti23b.html

An Efficient and Continuous Voronoi Density Estimator

We introduce a non-parametric density estimator dubbed the Radial Voronoi Density Estimator (RVDE). RVDE is grounded in the geometry of Voronoi tessellations and as such benefits from local geometric adaptiveness and broad convergence properties. Due to its radial definition, RVDE is continuous and computable in linear time with respect to the dataset size. This remedies the main shortcomings of previously studied VDEs, which are highly discontinuous and computationally expensive. We provide a theoretical study of the modes of RVDE as well as an empirical investigation of its performance on high-dimensional data. Results show that RVDE outperforms other non-parametric density estimators, including recently introduced VDEs.
https://proceedings.mlr.press/v206/marchetti23a.html

High-Dimensional Private Empirical Risk Minimization by Greedy Coordinate Descent

In this paper, we study differentially private empirical risk minimization (DP-ERM). It has been shown that the worst-case utility of DP-ERM decreases polynomially as the dimension increases. This is a major obstacle to privately learning large machine learning models. In high dimension, it is common for some of a model’s parameters to carry more information than others. To exploit this, we propose a differentially private greedy coordinate descent (DP-GCD) algorithm. At each iteration, DP-GCD privately performs a coordinate-wise gradient step along the gradient’s (approximately) greatest entry. We show theoretically that DP-GCD can achieve a logarithmic dependence on the dimension for a wide range of problems by naturally exploiting their structural properties (such as quasi-sparse solutions). We illustrate this behavior numerically, both on synthetic and real datasets.
https://proceedings.mlr.press/v206/mangold23a.html

Heavy Sets with Applications to Interpretable Machine Learning Diagnostics

ML models take on a new life after deployment and raise a host of new challenges: data drift, model recalibration and monitoring. If performance erodes over time, engineers in charge may ask what changed: did the data distribution change, or did the model get worse after retraining? We propose a flexible paradigm for answering a variety of model diagnosis questions by finding heaviest-weight interpretable regions, which we call heavy sets. We associate a local weight describing model mismatch at each datapoint, and find a simple region maximizing the sum (or average) of these weights. Specific choices of weights can find regions where two models differ the most, where a single model makes unusually many errors, or where two datasets have large differences in densities. The premise is that a region with overall elevated errors (weights) may reveal statistically significant effects even when individual errors do not stand out from the noise. We focus on interpretable regions defined by sparse AND-rules (conjunctive rules using a small subset of the available features). We first describe an exact integer programming (IP) formulation applicable to smaller datasets. As the exact IP is NP-hard, we develop a greedy, coordinate-wise, dynamic-programming-based formulation. For smaller datasets the heuristic often comes close to the IP in objective value, and it scales to datasets with millions of examples and thousands of features. We also address the statistical significance of the detected regions, taking care of the multiple hypothesis testing and spatial dependence challenges that arise in model diagnostics. We evaluate our proposed approach both on synthetic data (with known ground truth) and on well-known public ML datasets.
https://proceedings.mlr.press/v206/malioutov23a.html

Instance-dependent Sample Complexity Bounds for Zero-sum Matrix Games

We study the sample complexity of identifying an approximate equilibrium for two-player zero-sum $n\times 2$ matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instance-dependent bounds that define an ordering over game matrices that captures the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions $i$ and $j$, respectively, they both observe each other’s played actions and a stochastic observation $X_{ij}$ such that $\mathbb{E}[X_{ij}] = A_{ij}$. To our knowledge, our work is the first case of instance-dependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium, in the sense that the number of rounds depends on the specific properties of the game matrix $A$ as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.
https://proceedings.mlr.press/v206/maiti23a.html

Optimal Sketching Bounds for Sparse Linear Regression

We study oblivious sketching for $k$-sparse linear regression under various loss functions. In particular, we are interested in a distribution over sketching matrices $S\in\mathbb{R}^{m\times n}$ that does not depend on the inputs $A\in\mathbb{R}^{n\times d}$ and $b\in\mathbb{R}^n$, such that, given access to $SA$ and $Sb$, we can recover a $k$-sparse $\tilde x\in\mathbb{R}^d$ with $\|A\tilde x-b\|_f\leq (1+\varepsilon) \min_{k\text{-sparse}\ x\in\mathbb{R}^d} \|Ax-b\|_f$. Here $\|\cdot\|_f\colon \mathbb{R}^n \rightarrow \mathbb{R}$ is some loss function – such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $m=\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression, where $A$ is the identity matrix. For this problem, under the $\ell_2$ norm, we observe an upper bound of $m=O(k \log (d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions, including sparse logistic and sparse ReLU regression, we give the first known sketching bounds that achieve $m = o(d)$, showing that $m=O(\mu^2 k\log(\mu n d/\varepsilon)/\varepsilon^2)$ rows suffice, where $\mu$ is a natural complexity parameter needed to obtain relative error bounds for these loss functions. We again show that this dimension is tight, up to lower order terms and the dependence on $\mu$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, where one aims to minimize $\|Ax-b\|_2^2+\lambda\|x\|_1$ over $x\in\mathbb{R}^d$. We show that sketching dimension $m =O(\log(d)/(\lambda \varepsilon)^2)$ suffices and that the dependence on $d$ and $\lambda$ is tight.
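The sketch-and-solve workflow for LASSO can be illustrated on a toy instance: draw an oblivious Gaussian sketch $S$, solve the sketched problem on $(SA, Sb)$, and compare with the ground truth. The sketch size and the generic ISTA solver below are illustrative choices, not the paper's tight construction.

```python
import numpy as np

def ista_lasso(A, b, lam, n_iter=500):
    """Plain ISTA for min_x 0.5 * ||A x - b||_2^2 + lam * ||x||_1
    (a simple generic solver, used here only on the sketched problem)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - b))
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(3)
n, d, m = 2000, 50, 200
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Oblivious Gaussian sketch: S is drawn independently of (A, b).
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch = ista_lasso(S @ A, S @ b, lam=0.5)
```

The point of obliviousness is that `S` can be fixed before the data are seen, so `S @ A` and `S @ b` can be maintained in one streaming pass.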
https://proceedings.mlr.press/v206/mai23a.html

Noisy Low-rank Matrix Optimization: Geometry of Local Minima and Convergence Rate

This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sensing has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results can handle the problem with a general objective function and noisy data only when the RIP constant is close to 0. In this paper, we develop a new mathematical framework to solve the above-mentioned problem with a far less restrictive RIP constant. We prove that as long as the RIP constant of the noiseless objective is less than 1/3, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. Via the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than 1/3. Compared to the existing results in the literature, this paper offers the strongest RIP bound, and provides a complete theoretical analysis of the global and local optimization landscapes of general low-rank optimization problems under random corruptions from any finite-variance family.
https://proceedings.mlr.press/v206/ma23a.html

Efficient SAGE Estimation via Causal Structure Learning

The Shapley Additive Global Importance (SAGE) value is a theoretically appealing interpretability method that fairly attributes global importance to a model’s surplus performance contributions over an exponential number of feature sets. This is computationally expensive, particularly because estimating the surplus contributions requires sampling from conditional distributions. Thus, SAGE approximation algorithms only take a fraction of the feature sets into account. We propose d-SAGE, a method that accelerates SAGE approximation. d-SAGE is motivated by the observation that conditional independencies (CIs) between a feature and the model target imply zero surplus contributions, such that their computation can be skipped. To identify CIs, we leverage causal structure learning (CSL) to infer a graph that encodes (conditional) independencies in the data as d-separations. This is computationally more efficient because the expense of the one-time graph inference and the d-separation queries is negligible compared to the expense of surplus contribution evaluations. Empirically, we demonstrate that d-SAGE enables the efficient and accurate estimation of SAGE values.
https://proceedings.mlr.press/v206/luther23a.html

Improved Rate of First Order Algorithms for Entropic Optimal Transport

This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{\mathcal{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{\mathcal{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. This special design helps us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. For comparison, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{\mathcal{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{\mathcal{O}}({n^2}/{\epsilon})$ for solving OT.
https://proceedings.mlr.press/v206/luo23a.html

Model-Based Uncertainty in Value Functions

We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation, but the over-approximation may result in inefficient exploration. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work. Moreover, our uncertainty quantification technique is easily integrated into common exploration strategies and scales naturally beyond the tabular setting by using standard deep reinforcement learning architectures. Experiments in difficult exploration tasks, both in tabular and continuous control settings, show that our sharper uncertainty estimates improve sample-efficiency.
https://proceedings.mlr.press/v206/luis23a.html

Dropout-Resilient Secure Multi-Party Collaborative Learning with Linear Communication Complexity

Collaborative machine learning enables privacy-preserving training of machine learning models without collecting sensitive client data. Despite recent breakthroughs, the communication bottleneck is still a major obstacle to its scalability to larger networks. To address this challenge, we propose PICO, the first collaborative learning framework with linear communication complexity, significantly improving over the quadratic state-of-the-art, under formal information-theoretic privacy guarantees. Theoretical analysis demonstrates that PICO slashes the communication cost while matching the state-of-the-art in computational complexity, adversary resilience, robustness to client dropouts, and model accuracy. Extensive experiments demonstrate up to 91x reduction in the communication overhead, and up to 7x speed-up in the wall-clock training time compared to the state-of-the-art. As such, PICO addresses a key technical challenge in multi-party collaborative learning, paving the way for future large-scale privacy-preserving learning frameworks.
https://proceedings.mlr.press/v206/lu23a.html

Private Non-Convex Federated Learning Without a Trusted Server

We study federated learning (FL) with non-convex loss functions and data from people who do not trust the server or other silos. In this setting, each silo (e.g., a hospital) must protect the privacy of each person’s medical record, even if the server or other silos act as adversarial eavesdroppers. To that end, we consider inter-silo record-level (ISRL) differential privacy (DP), which requires silo $i$’s communications to satisfy record/item-level DP. We propose novel ISRL-DP algorithms for FL with heterogeneous (non-i.i.d.) silo data and two classes of Lipschitz continuous loss functions. First, we consider losses satisfying the proximal Polyak-Łojasiewicz (PL) inequality, which is an extension of the classical PL condition to the constrained setting. In contrast to our result, prior works only considered unconstrained private optimization with Lipschitz PL loss, which rules out most interesting PL losses such as strongly convex problems and linear/logistic regression. Our algorithms nearly attain the optimal strongly convex, homogeneous (i.i.d.) rate for ISRL-DP FL without assuming convexity or i.i.d. data. Second, we give the first private algorithms for non-convex non-smooth loss functions. Our utility bounds even improve on the state-of-the-art bounds for smooth losses. We complement our upper bounds with lower bounds. Additionally, we provide shuffle DP (SDP) algorithms that improve over the state-of-the-art central DP algorithms under more practical trust assumptions. Numerical experiments show that our algorithm has better accuracy than baselines for most privacy levels.
https://proceedings.mlr.press/v206/lowy23a.html

Wasserstein Distributionally Robust Linear-Quadratic Estimation under Martingale Constraints

We focus on robust estimation of the unobserved state of a discrete-time stochastic system with linear dynamics. A standard analysis of this estimation problem assumes a baseline innovation model; with Gaussian innovations we recover the Kalman filter. However, in many settings, there is insufficient or corrupted data to validate the baseline model. To cope with this problem, we minimize the worst-case mean-squared estimation error over adversarial models chosen within a Wasserstein neighborhood around the baseline. We also constrain the adversarial innovations to form a martingale difference sequence. The martingale constraint relaxes the i.i.d. assumptions which are often imposed on the baseline model. Moreover, we show that the martingale constraints guarantee that the adversarial dynamics remain adapted to the natural time-generated information. Therefore, adding the martingale constraint allows us to improve upon over-conservative policies that also protect against unrealistic omniscient adversaries. We establish a strong duality result which we use to develop an efficient subgradient method to compute the distributionally robust estimation policy. If the baseline innovations are Gaussian, we show that the worst-case adversary remains Gaussian. Our numerical experiments indicate that the martingale constraint may also aid in adding a layer of robustness in the choice of the adversarial power.
https://proceedings.mlr.press/v206/lotidis23a.html

A Sea of Words: An In-Depth Analysis of Anchors for Text Data

Anchors (Ribeiro et al., 2018) is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to explain has similar outputs when they are present in a document. In this paper, we present the first theoretical analysis of Anchors, assuming that the search for the best anchor is exhaustive. After formalizing the algorithm for text classification, we present explicit results on different classes of models when the vectorization step is TF-IDF and words are replaced by a fixed out-of-dictionary token when removed. Our inquiry covers models such as elementary if-then rules and linear classifiers. We then leverage this analysis to gain insights into the behavior of Anchors for any differentiable classifier. For neural networks, we empirically show that the words corresponding to the highest partial derivatives of the model with respect to the input, reweighted by the inverse document frequencies, are selected by Anchors.
https://proceedings.mlr.press/v206/lopardo23a.html

Boosted Off-Policy Learning

We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy’s expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a “weak” learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.
https://proceedings.mlr.press/v206/london23a.html
https://proceedings.mlr.press/v206/london23a.htmlInducing Neural Collapse in Deep Long-tailed LearningAlthough deep neural networks achieve tremendous success on various classification tasks, the generalization ability drops sharply when training datasets exhibit long-tailed distributions. One of the reasons is that the learned representations (i.e. features) from the imbalanced datasets are less effective than those from balanced datasets. Specifically, the learned representation under class-balanced distribution will present the Neural Collapse (NC) phenomenon. NC indicates the features from the same category are close to each other and from different categories are maximally distant, showing an optimal, linearly separable state of classification. However, the pattern differs on imbalanced datasets and is partially responsible for the reduced performance of the model. In this work, we propose two explicit feature regularization terms to learn high-quality representation for class-imbalanced data. With the proposed regularization, the NC phenomenon will appear under the class-imbalanced distribution, and the generalization ability can be significantly improved. Our method is easily implemented, highly effective, and can be plugged into most existing methods. The extensive experimental results on widely-used benchmarks show the effectiveness of our method.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23i.html
https://proceedings.mlr.press/v206/liu23i.htmlForestPrune: Compact Depth-Pruned Tree EnsemblesTree ensembles are powerful models that achieve excellent predictive performances, but can grow to unwieldy sizes. These ensembles are often post-processed (pruned) to reduce memory footprint and improve interpretability. We present ForestPrune, a novel optimization framework to post-process tree ensembles by pruning depth layers from individual trees. Since the number of nodes in a decision tree increases exponentially with tree depth, pruning deep trees drastically compactifies ensembles. We develop a specialized optimization algorithm to efficiently obtain high-quality solutions to problems under ForestPrune. Our algorithm typically reaches good solutions in seconds for medium-size datasets and ensembles, with 10000s of rows and 100s of trees, resulting in significant speedups over existing approaches. Our experiments demonstrate that ForestPrune produces parsimonious models that outperform models extracted by existing post-processing algorithms.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23h.html
https://proceedings.mlr.press/v206/liu23h.htmlConsistent Complementary-Label Learning via Order-Preserving LossesIn contrast to ordinary supervised classification tasks that require massive data with high-quality labels, complementary-label learning (CLL) deals with the weakly-supervised learning scenario where each instance is equipped with a complementary label, which specifies a class the instance does not belong to. However, most of the existing statistically consistent CLL methods suffer from overfitting intrinsically, due to the negative empirical risk issue. In this paper, we aim to propose overfitting-resistant and theoretically grounded methods for CLL. Considering the unique property of the distribution of complementarily labeled samples, we provide a risk estimator via order-preserving losses, which are naturally non-negative and thus can avoid overfitting caused by negative terms in risk estimators. Moreover, we provide classifier-consistency analysis and statistical guarantee for this estimator. Furthermore, we provide a weighted version of the proposed risk estimator to further enhance its generalization ability and prove its statistical consistency. Experiments on benchmark datasets demonstrate the effectiveness of our proposed methods.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23g.html
https://proceedings.mlr.press/v206/liu23g.htmlINO: Invariant Neural Operators for Learning Complex Physical Systems with Momentum ConservationNeural operators, which emerge as implicit solution operators of hidden governing equations, have recently become popular tools for learning responses of complex real-world physical systems. Nevertheless, the majority of neural operator applications has thus far been data-driven, which neglects the intrinsic preservation of fundamental physical laws in data. In this paper, we introduce a novel integral neural operator architecture, to learn physical models with fundamental conservation laws automatically guaranteed. In particular, by replacing the frame-dependent position information with its invariant counterpart in the kernel space, the proposed neural operator is designed to be translation- and rotation-invariant, and consequently abides by the conservation laws of linear and angular momentums. As applications, we demonstrate the expressivity and efficacy of our model in learning complex material behaviors from both synthetic and experimental datasets, and show that, by automatically satisfying these essential physical laws, our learned neural operator is not only generalizable in handling translated and rotated datasets, but also achieves improved accuracy and efficiency from the baseline neural operator models.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23f.html
https://proceedings.mlr.press/v206/liu23f.htmlNonstationary Bandit Learning via Predictive SamplingThompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to nonstationary environments. We show that such failures are attributed to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to nonstationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. Theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulation, we demonstrate that predictive sampling outperforms Thompson sampling in all nonstationary environments examined.Tue, 11 Apr 2023 00:00:00 +0000
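The baseline that the abstract above builds on is standard Thompson sampling for a stationary bandit, where the learner samples arm means from Beta posteriors and plays the argmax. The sketch below illustrates only that stationary baseline (the paper's predictive sampling additionally deprioritizes fast-staling information, which is not implemented here); the arm means and horizon are hypothetical.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Standard Thompson sampling for a stationary Bernoulli bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [1] * k  # Beta(1, 1) prior for each arm
    failures = [1] * k
    total_reward = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior,
        # then play the arm with the largest sample.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

reward = thompson_sampling([0.3, 0.5, 0.7], horizon=2000)
```

In a nonstationary environment the `true_means` would drift over time, which is exactly the regime where the abstract argues this baseline can fail.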
https://proceedings.mlr.press/v206/liu23e.html
https://proceedings.mlr.press/v206/liu23e.htmlAsymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary EnvironmentsIn this work, we consider the off-policy policy evaluation problem for contextual bandits and finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical for policy evaluation, but existing estimators that reuse old data introduce large bias such that we cannot obtain a valid confidence interval. Inspired by a related field called survey sampling, we introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias. The estimator unifies several existing off-policy policy evaluation methods and improves on them with the use of auxiliary information and a regression approach. We prove that the new estimator is asymptotically unbiased, and provide a consistent variance estimator to construct a large-sample confidence interval. Finally, we empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23d.html
https://proceedings.mlr.press/v206/liu23d.htmlAdaptation to Misspecified Kernel Regularity in Kernelised BanditsIn continuum-armed bandit problems where the underlying function resides in a reproducing kernel Hilbert space (RKHS), namely, the kernelised bandit problems, an important open problem remains of how well learning algorithms can adapt if the regularity of the associated kernel function is unknown. In this work, we study adaptivity to the regularity of translation-invariant kernels, which is characterized by the decay rate of the Fourier transformation of the kernel, in the bandit setting. We derive an adaptivity lower bound, proving that it is impossible to simultaneously achieve optimal cumulative regret in a pair of RKHSs with different regularities. To verify the tightness of this lower bound, we show that an existing bandit model selection algorithm applied with minimax non-adaptive kernelised bandit algorithms matches the lower bound in dependence of T, the total number of steps, except for log factors. By filling in the regret bounds for adaptivity between RKHSs, we connect the statistical difficulty for adaptivity in continuum-armed bandits in three fundamental types of function spaces: RKHS, Sobolev space, and Holder space.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23c.html
https://proceedings.mlr.press/v206/liu23c.htmlSparse Bayesian optimizationBayesian optimization (BO) is a powerful approach to sample-efficient optimization of black-box objective functions. However, the application of BO to areas such as recommendation systems often requires taking the interpretability and simplicity of the configurations into consideration, a setting that has not been previously studied in the BO literature. To make BO applicable in this setting, we present several regularization-based approaches that allow us to discover sparse and more interpretable configurations. We propose a novel differentiable relaxation based on homotopy continuation that makes it possible to target sparsity by working directly with $L_0$ regularization. We identify failure modes for regularized BO and develop a hyperparameter-free method, sparsity exploring Bayesian optimization (SEBO) that seeks to simultaneously maximize a target objective and sparsity. SEBO and methods based on fixed regularization are evaluated on synthetic and real-world problems, and we show that we are able to efficiently optimize for sparsity.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23b.html
https://proceedings.mlr.press/v206/liu23b.htmlEEGNN: Edge Enhanced Graph Neural Network with a Bayesian Nonparametric Graph ModelTraining deep graph neural networks (GNNs) poses a challenging task, as the performance of GNNs may suffer from the number of hidden message-passing layers. The literature has focused on the proposals of over-smoothing and under-reaching to explain the performance deterioration of deep GNNs. In this paper, we propose a new explanation for such deteriorated performance phenomenon, mis-simplification, that is, mistakenly simplifying graphs by preventing self-loops and forcing edges to be unweighted. We show that such simplifying can reduce the potential of message-passing layers to capture the structural information of graphs. In view of this, we propose a new framework, edge enhanced graph neural network (EEGNN). EEGNN uses the structural information extracted from the proposed Dirichlet mixture Poisson graph model (DMPGM), a Bayesian nonparametric model for graphs, to improve the performance of various deep message-passing GNNs. We propose a Markov chain Monte Carlo inference framework for DMPGM. Experiments over different datasets show that our method achieves considerable performance increase compared to baselines.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/liu23a.html
https://proceedings.mlr.press/v206/liu23a.htmlGlobal Convergence of Over-parameterized Deep Equilibrium ModelsA deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection. Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation. In this paper, the training dynamics of over-parameterized DEQs are investigated, and we propose a novel probabilistic framework to overcome the challenge arising from the weight-sharing and the infinite depth. By supposing a condition on the initial equilibrium point, we prove that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. We further perform a fine-grained non-asymptotic analysis about random DEQs and the corresponding weight-untied models, and show that the required initial condition is satisfied via mild over-parameterization. Moreover, we show that the unique equilibrium point always exists during the training.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ling23a.html
https://proceedings.mlr.press/v206/ling23a.htmlEntropic Risk Optimization in Discounted MDPsRisk-averse Markov Decision Processes (MDPs) have optimal policies that achieve high returns with low variability, but these MDPs are often difficult to solve. Only a few practical risk-averse objectives admit a dynamic programming (DP) formulation, which is the mainstay of most MDP and RL algorithms. We derive a new DP formulation for discounted risk-averse MDPs with Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) objectives. Our DP formulation for ERM, which is possible because of our novel definition of value function with time-dependent risk levels, can approximate optimal policies in a time that is polynomial in the approximation error. We then use the ERM algorithm to optimize the EVaR objective in polynomial time using an optimized discretization scheme. Our numerical results show the viability of our formulations and algorithms in discounted MDPs.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lin-hau23a.html
https://proceedings.mlr.press/v206/lin-hau23a.htmlEnergy-Based Models for Functional Data using Path Measure TiltingEnergy-Based Models (EBMs) have proven to be a highly effective approach for modelling densities on finite-dimensional spaces. Their ability to incorporate domain-specific choices and constraints into the structure of the model through composition makes EBMs an appealing candidate for applications in physics, biology and computer vision and various other fields. Recently, Energy-Based Processes (EBPs) for modelling stochastic processes were proposed for unconditional exchangeable data (e.g., point clouds). In this work, we present a novel subclass of EBPs, called $\mathcal{F}$-EBM for conditional exchangeable data, which is able to learn distributions of functions (such as curves or surfaces) from functional samples evaluated at finitely many points. Two unique challenges arise in the functional context. Firstly, training data is often not evaluated along a fixed set of points. Secondly, steps must be taken to control the behaviour of the model between evaluation points, to mitigate overfitting. The proposed model is an energy based model on function space that is decomposed spectrally, where a Gaussian Process path measure is used to reweight the distribution to capture smoothness properties of the underlying process being modelled. The resulting model has the ability to utilize irregularly sampled training data and can output predictions at any resolution, providing an effective approach to up-scaling functional data. We demonstrate the efficacy of our proposed approach for modelling a range of datasets, including data collected from Standard and Poor’s 500 (S$&$P) and UK National grid.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lim23a.html
https://proceedings.mlr.press/v206/lim23a.htmlProbabilities of Causation: Role of Observational DataProbabilities of causation play a crucial role in modern decision-making. Pearl defined three binary probabilities of causation, the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). These probabilities were then bounded by Tian and Pearl using a combination of experimental and observational data. However, observational data are not always available in practice; in such a case, Tian and Pearl’s Theorem provided valid but less effective bounds using pure experimental data. In this paper, we discuss the conditions under which observational data are worth considering to improve the quality of the bounds. More specifically, we define the expected improvement of the bounds by assuming the observational distributions are uniformly distributed on their feasible interval. We further apply the proposed theorems to the unit selection problem defined by Li and Pearl.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/li23d.html
https://proceedings.mlr.press/v206/li23d.htmlMeta-Learning with Adjoint MethodsModel Agnostic Meta-Learning (MAML) is widely used to find a good initialization for a family of tasks. Despite its success, a critical challenge in MAML is to calculate the gradient w.r.t. the initialization of a long training trajectory for the sampled tasks, because the computation graph can rapidly explode and the computational cost is very expensive. To address this problem, we propose Adjoint MAML (A-MAML). We view gradient descent in the inner optimization as the evolution of an Ordinary Differential Equation (ODE). To efficiently compute the gradient of the validation loss w.r.t. the initialization, we use the adjoint method to construct a companion, backward ODE. To obtain the gradient w.r.t. the initialization, we only need to run the standard ODE solver twice — one is forward in time that evolves a long trajectory of gradient flow for the sampled task; the other is backward and solves the adjoint ODE. We need not create or expand any intermediate computational graphs, adopt aggressive approximations, or impose proximal regularizers in the training loss. Our approach is cheap, accurate, and adaptable to different trajectory lengths. We demonstrate the advantage of our approach in both synthetic and real-world meta-learning tasks. The code is available at https://github.com/shib0li/Adjoint-MAML.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/li23c.html
https://proceedings.mlr.press/v206/li23c.htmlA Statistical Analysis of Polyak-Ruppert Averaged Q-LearningWe study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem (FCLT) for the averaged iteration $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-Learning without the Lipschitz condition.Tue, 11 Apr 2023 00:00:00 +0000
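The averaging scheme the abstract studies can be sketched on a toy problem: run synchronous tabular Q-learning and maintain the running average of the iterates. Everything below is a hypothetical, noiseless illustration (a tiny deterministic MDP, so the inner loop reduces to damped value iteration; the paper's setting has stochastic transition samples), not the paper's estimator.

```python
# Deterministic 2-state, 2-action MDP: P[s][a] = (reward, next_state).
P = {0: {0: (1.0, 0), 1: (0.0, 1)},
     1: {0: (0.0, 0), 1: (0.0, 1)}}
GAMMA = 0.5  # exact values here: Q*(0,0)=2, Q*(0,1)=0.5, Q*(1,0)=1, Q*(1,1)=0.5

def averaged_q_learning(steps=10000, alpha=0.2):
    Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
    Q_bar = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
    for t in range(1, steps + 1):
        # Synchronous update: every (state, action) pair is refreshed each step.
        new_Q = {s: {} for s in (0, 1)}
        for s in (0, 1):
            for a in (0, 1):
                r, s2 = P[s][a]
                target = r + GAMMA * max(Q[s2].values())
                new_Q[s][a] = Q[s][a] + alpha * (target - Q[s][a])
        Q = new_Q
        # Polyak-Ruppert estimate: running average of the iterates.
        for s in (0, 1):
            for a in (0, 1):
                Q_bar[s][a] += (Q[s][a] - Q_bar[s][a]) / t
    return Q, Q_bar

Q, Q_bar = averaged_q_learning()
```

With noisy targets, the averaged iterate `Q_bar` is what smooths out the step-size noise; in this noiseless toy it simply tracks the converging iterates.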
https://proceedings.mlr.press/v206/li23b.html
https://proceedings.mlr.press/v206/li23b.htmlMultilevel Bayesian QuadratureMultilevel Monte Carlo is a key tool for approximating integrals involving expensive scientific models. The idea is to use approximations of the integrand to construct an estimator with improved accuracy over classical Monte Carlo. We propose to further enhance multilevel Monte Carlo through Bayesian surrogate models of the integrand, focusing on Gaussian process models and the associated Bayesian quadrature estimators. We show, using both theory and numerical experiments, that our approach can lead to significant improvements in accuracy when the integrand is expensive and smooth, and when the dimensionality is small or moderate. We conclude the paper with a case study illustrating the potential impact of our method in landslide-generated tsunami modelling, where the cost of each integrand evaluation is typically too large for operational settings.Tue, 11 Apr 2023 00:00:00 +0000
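The multilevel idea the abstract enhances rests on a telescoping sum: write the fine-level expectation as a cheap coarse-level estimate plus coupled corrections, and spend most samples on the coarse level. The sketch below is a generic plain-Monte-Carlo illustration with a hypothetical approximation hierarchy (Taylor truncations of e^x), not the paper's Bayesian quadrature estimator.

```python
import math
import random

def mlmc_estimate(num_levels=5, base_samples=20000, seed=0):
    """Multilevel Monte Carlo estimate of I = integral of e^x over [0, 1].

    Level-l approximation of the integrand: the Taylor series of e^x
    truncated after l + 2 terms. The telescoping identity
    E[f_L] = E[f_0] + sum_l E[f_l - f_{l-1}] lets us use many samples
    on the cheap coarse level and few on the small fine corrections.
    """
    rng = random.Random(seed)

    def f(x, level):
        return sum(x ** k / math.factorial(k) for k in range(level + 2))

    estimate = 0.0
    for level in range(num_levels):
        n = max(base_samples // 4 ** level, 100)  # fewer samples per finer level
        if level == 0:
            estimate += sum(f(rng.random(), 0) for _ in range(n)) / n
        else:
            total = 0.0
            for _ in range(n):
                x = rng.random()  # same sample drives both levels (coupling)
                total += f(x, level) - f(x, level - 1)
            estimate += total / n
    return estimate

est = mlmc_estimate()  # true value: e - 1
```

The coupling (evaluating both levels at the same `x`) is what makes the correction terms low-variance; the paper's contribution is to replace these plain Monte Carlo averages with Gaussian-process-based Bayesian quadrature.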
https://proceedings.mlr.press/v206/li23a.html
https://proceedings.mlr.press/v206/li23a.htmlSMCP3: Sequential Monte Carlo with Probabilistic Program ProposalsThis paper introduces SMCP3, a method for automatically implementing custom sequential Monte Carlo samplers for inference in probabilistic programs. Unlike particle filters and resample-move SMC (Gilks and Berzuini, 2001), SMCP3 algorithms can improve the quality of samples and weights using pairs of Markov proposal kernels that are also specified by probabilistic programs. Unlike Del Moral et al. (2006b), these proposals can themselves be complex probabilistic computations that generate auxiliary variables, apply deterministic transformations, and lack tractable marginal densities. This paper also contributes an efficient implementation in Gen that eliminates the need to manually derive incremental importance weights. SMCP3 thus simultaneously expands the design space that can be explored by SMC practitioners and reduces the implementation effort. SMCP3 is illustrated using applications to 3D object tracking, state-space modeling, and data clustering, showing that SMCP3 methods can simultaneously improve the quality and reduce the cost of marginal likelihood estimation and posterior inference.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lew23a.html
https://proceedings.mlr.press/v206/lew23a.htmlContext-Specific Causal Discovery for Categorical Data Using Staged TreesCausal discovery algorithms aim at untangling complex causal relationships from data. Here, we study causal discovery and inference methods based on staged tree models, which can represent complex and asymmetric causal relationships between categorical variables. We provide a first graphical representation of the equivalence class of a staged tree, by looking only at a specific subset of its underlying independences. We further define a new pre-metric, inspired by the widely used structural intervention distance, to quantify the closeness between two staged trees in terms of their corresponding causal inference statements. A simulation study highlights the efficacy of staged trees in uncovering complexes, asymmetric causal relationships from data, and real-world data applications illustrate their use in practical causal analysis.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/leonelli23a.html
https://proceedings.mlr.press/v206/leonelli23a.htmlVector Quantized Time Series Generation with a Bidirectional Prior ModelTime series generation (TSG) studies have mainly focused on the use of Generative Adversarial Networks (GANs) combined with recurrent neural network (RNN) variants. However, the fundamental limitations and challenges of training GANs still remain. In addition, the RNN-family typically has difficulties with temporal consistency between distant timesteps. Motivated by the successes in the image generation (IMG) domain, we propose TimeVQVAE, the first work, to our knowledge, that uses vector quantization (VQ) techniques to address the TSG problem. Moreover, the priors of the discrete latent spaces are learned with bidirectional transformer models that can better capture global temporal consistency. We also propose VQ modeling in a time-frequency domain, separated into low-frequency (LF) and high-frequency (HF). This allows us to retain important characteristics of the time series and, in turn, generate new synthetic signals that are of better quality, with sharper changes in modularity, than its competing TSG methods. Our experimental evaluation is conducted on all datasets from the UCR archive, using well-established metrics in the IMG literature, such as Frechet inception distance and inception scores. Our implementation on GitHub: https://github.com/ML4ITS/TimeVQVAE.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lee23d.html
https://proceedings.mlr.press/v206/lee23d.htmlDeep Joint Source-Channel Coding with Iterative Source Error CorrectionIn this paper, we propose an iterative source error correction (ISEC) decoding scheme for deep-learning-based joint source-channel coding (Deep JSCC). Given a noisy codeword received through the channel, we use a Deep JSCC encoder and decoder pair to update the codeword iteratively to find a (modified) maximum a-posteriori (MAP) solution. For efficient MAP decoding, we utilize a neural network-based denoiser to approximate the gradient of the log-prior density of the codeword space. Albeit the non-convexity of the optimization problem, our proposed scheme improves various distortion and perceptual quality metrics from the conventional one-shot (non-iterative) Deep JSCC decoding baseline. Furthermore, the proposed scheme produces more reliable source reconstruction results compared to the baseline when the channel noise characteristics do not match the ones used during training.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lee23c.html
https://proceedings.mlr.press/v206/lee23c.htmlExact Gradient Computation for Spiking Neural Networks via Forward PropagationSpiking neural networks (SNN) have recently emerged as alternatives to traditional neural networks, owing to its energy efficiency benefits and capacity to capture biological neuronal mechanisms. However, the classic backpropagation algorithm for training traditional networks has been notoriously difficult to apply to SNN due to the hard-thresholding and discontinuities at spike times. Therefore, a large majority of prior work believes exact gradients for SNN w.r.t. their weights do not exist and has focused on approximation methods to produce surrogate gradients. In this paper, (1) by applying the implicit function theorem to SNN at the discrete spike times, we prove that, albeit being non-differentiable in time, SNNs have well-defined gradients w.r.t. their weights, and (2) we propose a novel training algorithm, called forward propagation (FP), that computes exact gradients for SNN. FP exploits the causality structure between the spikes and allows us to parallelize computation forward in time. It can be used with other algorithms that simulate the forward pass, and it also provides insights on why other related algorithms such as Hebbian learning and also recently-proposed surrogate gradient methods may perform well.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lee23b.html
https://proceedings.mlr.press/v206/lee23b.htmlUSIM Gate: UpSampling Module for Segmenting Precise Boundaries concerning EntropyDeep learning (DL) techniques for precise semantic segmentation have remained a challenge because of the vague boundaries of target objects caused by the low resolution of images. Despite the improved segmentation performance using up/downsampling operations in early DL models, conventional operators cannot fully preserve spatial information and thus generate vague boundaries of target objects. Therefore, for the precise segmentation of target objects in many domains, this paper presents two novel operators: (1) upsampling interpolation method (USIM), an operator that upsamples input feature maps and combines feature maps into one while preserving the spatial information of both inputs, and (2) USIM gate (UG), an advanced USIM operator with boundary-attention mechanisms. We designed our experiments using aerial images where the boundaries critically influence the results. Furthermore, we verified the feasibility that our approach effectively segments target objects using the cityscapes dataset. The experimental results demonstrate that using the USIM and UG with state-of-the-art DL models can improve the segmentation performance with clear boundaries of target objects (Intersection over Union: +6.9$%$; BJ: +10.1$%$). Furthermore, mathematical proofs verify that the USIM and UG contribute to the handling of spatial information.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lee23a.html
https://proceedings.mlr.press/v206/lee23a.htmlASkewSGD : An Annealed interval-constrained Optimisation method to train Quantized Neural NetworksIn this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state of the art methods (ProxQuant, LUQ). We establish convergence guarantees for AskewSGD (under general assumptions for the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state of the art methods in classical benchmarks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/leconte23a.html
https://proceedings.mlr.press/v206/leconte23a.htmlScalable Unbalanced Sobolev Transport for Measures on a GraphOptimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers a few drawbacks: (i) input measures required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness which limits its applications on kernel-dependent algorithmic approaches. To tackle issues (ii)–(iii), Le et al. (2022) recently proposed Sobolev transport for measures on a graph having the same total mass by leveraging the graph structure over supports. In this work, we consider measures that may have different total mass and are supported on a graph metric space. To alleviate the disadvantages (i)–(iii) of OT, we propose a novel and scalable approach to extend Sobolev transport for this unbalanced setting where measures may have different total mass. We show that the proposed unbalanced Sobolev transport (UST) admits a closed-form formula for fast computation, and it is also negative definite. Additionally, we derive geometric structures for the UST and establish relations between our UST and other transport distances. We further exploit the negative definiteness to design positive definite kernels and evaluate them on various simulations to illustrate their fast computation and comparable performances against other transport baselines for unbalanced measures on a graph.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/le23a.html
https://proceedings.mlr.press/v206/le23a.htmlRefined Convergence and Topology Learning for Decentralized SGD with Heterogeneous DataOne of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/le-bars23a.html
https://proceedings.mlr.press/v206/le-bars23a.htmlCoordinate Descent for SLOPEThe lasso is the most famous sparse regression and feature selection method. One reason for its popularity is the speed at which the underlying optimization problem can be solved. Sorted L-One Penalized Estimation (SLOPE) is a generalization of the lasso with appealing statistical properties. In spite of this, the method has not yet reached widespread interest. A major reason for this is that current software packages that fit SLOPE rely on algorithms that perform poorly in high dimensions. To tackle this issue, we propose a new fast algorithm to solve the SLOPE optimization problem, which combines proximal gradient descent and proximal coordinate descent steps. We provide new results on the directional derivative of the SLOPE penalty and its related SLOPE thresholding operator, as well as provide convergence guarantees for our proposed solver. In extensive benchmarks on simulated and real data, we demonstrate our method’s performance against a long list of competing algorithms.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/larsson23a.html
https://proceedings.mlr.press/v206/larsson23a.htmlA Novel Stochastic Gradient Descent Algorithm for Learning Principal SubspacesMany machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data. Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja’s rule (Oja, 1982)), these assume access to full columns of the matrix or particular matrix structure such as symmetry and cannot be combined as-is with neural networks (Baldi et al., 1989). In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset (LeCun et al. 2010) and the reinforcement learning domain PuddleWorld (Sutton, 1995) demonstrating the usefulness of our approach.Tue, 11 Apr 2023 00:00:00 +0000
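Oja's rule, cited above as the classic sample-based baseline, updates a weight vector toward the leading eigenvector of the data covariance using one sample at a time. A minimal numpy sketch (learning rate and data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Data with a dominant principal direction along the first axis.
X = rng.normal(size=(4000, 3)) * np.sqrt(np.array([5.0, 1.0, 0.1]))

w = np.ones(3) / np.sqrt(3.0)   # initial guess, unit norm
eta = 0.005                      # learning rate (illustrative)
for _ in range(5):               # a few passes over the data
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)   # Oja's rule: Hebbian term + decay

w /= np.linalg.norm(w)
# w now aligns (up to sign) with the top principal direction, here e1.
```

The decay term `- y**2 * w` keeps the weight vector at unit norm without an explicit normalization step, which is what distinguishes Oja's rule from plain Hebbian learning.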
https://proceedings.mlr.press/v206/lan23a.html
https://proceedings.mlr.press/v206/lan23a.htmlRobust Variational Autoencoding with Wasserstein Penalty for Novelty DetectionWe propose a new method for novelty detection that can tolerate high corruption of the training points, whereas previous works assumed either no or very low corruption. Our method trains a robust variational autoencoder (VAE), which aims to generate a model for the uncorrupted training points. To gain robustness to high corruption, we incorporate the following four changes to the common VAE: 1. Extracting crucial features of the latent code by a carefully designed dimension reduction component for distributions; 2. Modeling the latent distribution as a mixture of Gaussian low-rank inliers and full-rank outliers, where the testing only uses the inlier model; 3. Applying the Wasserstein-1 metric for regularization, instead of the Kullback-Leibler (KL) divergence; and 4. Using a robust error for reconstruction. We establish both the robustness to outliers and the suitability for low-rank modeling of the Wasserstein metric, as opposed to the KL divergence. We illustrate state-of-the-art results on standard benchmarks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lai23a.html
https://proceedings.mlr.press/v206/lai23a.htmlAn Homogeneous Unbalanced Regularized Optimal Transport Model with Applications to Optimal Transport with BoundaryThis work studies how the introduction of the entropic regularization term in unbalanced Optimal Transport (OT) models may alter their homogeneity with respect to the input measures. We observe that in common settings (including balanced OT and unbalanced OT with Kullback-Leibler divergence to the marginals), although the optimal transport cost itself is not homogeneous, optimal transport plans and the so-called Sinkhorn divergences are indeed homogeneous. However, homogeneity does not hold in more general Unbalanced Regularized Optimal Transport (UROT) models, for instance those using the Total Variation as divergence to the marginals. We propose to modify the entropic regularization term to retrieve an UROT model that is homogeneous while preserving most properties of the standard UROT model. We showcase the importance of using our Homogeneous UROT (HUROT) model when it comes to regularize Optimal Transport with Boundary, a transportation model involving a spatially varying divergence to the marginals for which the standard (inhomogeneous) UROT model would yield inappropriate behavior.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/lacombe23a.html
https://proceedings.mlr.press/v206/lacombe23a.htmlEstimating Conditional Average Treatment Effects with Missing Treatment InformationEstimating conditional average treatment effects (CATE) is challenging, especially when treatment information is missing. Although this is a widespread problem in practice, CATE estimation with missing treatments has received little attention. In this paper, we analyze CATE estimation in the setting with missing treatments where unique challenges arise in the form of covariate shifts. We identify two covariate shifts in our setting: (i) a covariate shift between the treated and control population; and (ii) a covariate shift between the observed and missing treatment population. We first theoretically show the effect of these covariate shifts by deriving a generalization bound for estimating CATE in our setting with missing treatments. Then, motivated by our bound, we develop the missing treatment representation network (MTRNet), a novel CATE estimation algorithm that learns a balanced representation of covariates using domain adaptation. By using balanced representations, MTRNet provides more reliable CATE estimates in the covariate domains where the data are not fully observed. In various experiments with semi-synthetic and real-world data, we show that our algorithm improves over the state-of-the-art by a substantial margin.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kuzmanovic23a.html
https://proceedings.mlr.press/v206/kuzmanovic23a.htmlParticle algorithms for maximum likelihood training of latent variable modelsNeal and Hinton (1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional F, and the EM algorithm as coordinate descent applied to F. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with F and show that their limits coincide with F’s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kuntz23a.html
https://proceedings.mlr.press/v206/kuntz23a.htmlMeta-learning for Robust Anomaly DetectionWe propose a meta-learning method to improve the anomaly detection performance on unseen target tasks that have only unlabeled data. Existing meta-learning methods for anomaly detection have shown remarkable performance but require labeled data in target tasks. Although they can treat unlabeled data as normal assuming anomalies in the unlabeled data are negligible, this assumption is often violated in practice. As a result, the methods have low performance. Our method meta-learns with related tasks that have labeled and unlabeled data such that the expected test anomaly detection performance is directly improved when the anomaly detector is adapted to given unlabeled data. Our method is based on autoencoders (AEs), which are widely used neural network-based anomaly detectors. We model anomalous attributes for each unlabeled instance in the reconstruction loss of the AE, which are used to prevent the anomalies from being reconstructed; they can remove the effect of the anomalies. We formulate adaptation to the unlabeled data as a learning problem of the last layer of the AE and the anomalous attributes. This formulation enables the optimum solution to be obtained with a closed-form alternating update formula, which makes it possible to maximize the expected test anomaly detection performance efficiently. The effectiveness of our method is experimentally shown with four real-world datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kumagai23a.html
https://proceedings.mlr.press/v206/kumagai23a.htmlStochastic Tree Ensembles for Estimating Heterogeneous EffectsDetermining subgroups that respond especially well (or poorly) to specific interventions (medical or policy) requires new supervised learning methods tailored specifically for causal inference. Bayesian Causal Forest (BCF) is a recent method that has been documented to perform well on data generating processes with strong confounding of the sort that is plausible in many applications. This paper develops a novel algorithm for fitting the BCF model, which is more efficient than the previous Gibbs sampler. The new algorithm can be used to initialize independent chains of the existing Gibbs sampler leading to better posterior exploration and coverage of the associated interval estimates in simulation studies. The new algorithm is compared to related approaches via simulation studies as well as an empirical analysis.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/krantsevich23a.html
https://proceedings.mlr.press/v206/krantsevich23a.htmlDifferentiable Change-point Detection With Temporal Point ProcessesIn this paper, we consider the problem of global change-point detection in event sequence data, where both the event distributions and change-points are assumed to be unknown. For this problem, we propose a Log-likelihood Ratio based Global Change-point Detector, which observes the entire sequence and detects a prespecified number of change-points. Based on the Transformer Hawkes Process (THP), a well-known neural TPP framework, we develop DCPD, a differentiable change-point detector, along with maintaining distinct intensity and mark predictors for each partition. Further, we propose a sliding-window-based extension of DCPD to improve its scalability in terms of the number of events or change-points with minor sacrifices in performance. Experiments on synthetic datasets explore the effects of run-time, relative complexity, and other aspects of distributions on various properties of our change-point detectors, namely robustness, detection accuracy, and scalability, under controlled environments. Finally, we perform experiments on six real-world temporal event sequences collected from diverse domains like health, geographical regions, etc., and show that our methods either outperform or perform comparably with the baselines.Tue, 11 Apr 2023 00:00:00 +0000
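The log-likelihood-ratio principle behind such detectors can be illustrated on a much simpler model than a neural TPP: for a Poisson count sequence, compare the fit of a single rate against separate rates before and after each candidate split. A hedged toy sketch (illustrative only; not the Transformer-Hawkes-based DCPD):

```python
import numpy as np

def pois_ll(c):
    """Poisson log-likelihood at the MLE rate (constant log-factorials dropped)."""
    lam = max(float(c.mean()), 1e-12)
    return c.sum() * np.log(lam) - len(c) * lam

def llr_changepoint(counts):
    """Split point maximizing the log-likelihood ratio: two rates vs one rate."""
    n = len(counts)
    best_s, best_llr = None, -np.inf
    for s in range(1, n):
        llr = pois_ll(counts[:s]) + pois_ll(counts[s:]) - pois_ll(counts)
        if llr > best_llr:
            best_s, best_llr = s, llr
    return best_s, best_llr

rng = np.random.default_rng(0)
counts = np.concatenate([rng.poisson(2.0, 50), rng.poisson(10.0, 50)])
s, llr = llr_changepoint(counts)   # s lands at (or very near) the true change at 50
```

The paper's contribution is to make this kind of split selection differentiable and to use learned THP intensities instead of closed-form Poisson rates, but the scan-and-compare structure is the same.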
https://proceedings.mlr.press/v206/koley23a.html
https://proceedings.mlr.press/v206/koley23a.htmlMARS: Masked Automatic Ranks Selection in Tensor DecompositionsTensor decomposition methods have proven effective in various applications, including compression and acceleration of neural networks. At the same time, the problem of determining optimal decomposition ranks, which represent the crucial parameter controlling the compression-accuracy trade-off, is still acute. In this paper, we introduce MARS, a new efficient method for the automatic selection of ranks in general tensor decompositions. During training, the procedure learns binary masks over decomposition cores that “select” the optimal tensor structure. The learning is performed via relaxed maximum a posteriori (MAP) estimation in a specific Bayesian model and can be naturally embedded into the standard neural network training routine. Diverse experiments demonstrate that MARS achieves better results compared to previous works in various tasks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kodryan23a.html
https://proceedings.mlr.press/v206/kodryan23a.htmlPositional Encoder Graph Neural Networks for Geographic DataGraph neural networks (GNNs) provide a powerful and scalable solution for modeling continuous spatial data. However, they often rely on Euclidean distances to construct the input graphs. This assumption can be unrealistic in many real-world settings, where the spatial structure is more complex and explicitly non-Euclidean (e.g., road networks). Here, we propose PE-GNN, a new framework that incorporates spatial context and correlation explicitly into the models. Building on recent advances in geospatial auxiliary task learning and semantic spatial embeddings, our proposed method (1) learns a context-aware vector encoding of the geographic coordinates and (2) predicts spatial autocorrelation in the data in parallel with the main task. On spatial interpolation and regression tasks, we show the effectiveness of our approach, improving performance over different state-of-the-art GNN approaches. We observe that our approach not only vastly improves over the GNN baselines, but can match Gaussian processes, the most commonly utilized method for spatial interpolation problems.Tue, 11 Apr 2023 00:00:00 +0000
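A common ingredient in context-aware coordinate embeddings of this kind is a multi-scale sinusoidal encoding of the raw coordinates before they enter the network. A hedged sketch of that generic building block (the frequency ladder and dimensions are illustrative assumptions, not PE-GNN's exact encoder):

```python
import numpy as np

def sinusoidal_encode(coords, n_freqs=8):
    """Map (lon, lat)-style coordinates to multi-scale sin/cos features."""
    coords = np.atleast_2d(coords)                      # (n_points, 2)
    freqs = 2.0 ** np.arange(n_freqs)                   # geometric frequency ladder
    angles = coords[:, :, None] * freqs                 # (n, 2, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)           # (n, 2 * 2 * n_freqs)

enc = sinusoidal_encode(np.array([[0.30, 0.70], [0.31, 0.70]]))
# Nearby points share similar low-frequency features but are still
# distinguished by the high-frequency components.
```

Feeding such an encoding through a small learned network, rather than using raw coordinates, is what makes the resulting coordinate embedding "context-aware" at multiple spatial scales.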
https://proceedings.mlr.press/v206/klemmer23a.html
https://proceedings.mlr.press/v206/klemmer23a.htmlANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling BanditsWe study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem. The only two existing attempts in this line of work fall short across multiple dimensions, including pessimistic measures of non-stationary complexity and non-adaptive parameter tuning that requires knowledge of the number of preference changes. We develop an elimination-based rescheduling algorithm to overcome these shortcomings and show a near-optimal $\tilde O(\sqrt{S^{CW} T})$ dynamic regret bound, where $S^{CW}$ is the number of times the Condorcet winner changes in $T$ rounds. This yields the first near-optimal dynamic regret bound for unknown $S^{CW}$. We further study other related notions of non-stationarity for which we also prove near-optimal dynamic regret guarantees under additional assumptions on the preference model.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kleine-buening23a.html
https://proceedings.mlr.press/v206/kleine-buening23a.htmlEfficient fair PCA for fair representation learningWe revisit the problem of fair principal component analysis (PCA), where the goal is to learn the best low-rank linear approximation of the data that obfuscates demographic information. We propose a conceptually simple approach that allows for an analytic solution similar to standard PCA and can be kernelized. Our methods have the same complexity as standard PCA, or kernel PCA, and run much faster than existing methods for fair PCA based on semidefinite programming or manifold optimization, while achieving similar results.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kleindessner23a.html
https://proceedings.mlr.press/v206/kleindessner23a.htmlThe Lie-Group Bayesian Learning RuleThe Bayesian Learning Rule provides a framework for generic algorithm design but can be difficult to use for three reasons. First, it requires a specific parameterization of the exponential family. Second, it uses gradients which can be difficult to compute. Third, its update may not always stay on the manifold. We address these difficulties by proposing an extension based on Lie-groups where posteriors are parametrized through transformations of an arbitrary base distribution and updated via the group’s exponential map. This simplifies all three difficulties for many cases, providing flexible parametrizations through the group’s action, simple gradient computation through reparameterization, and updates that always stay on the manifold. We use the new learning rule to derive a new algorithm for deep learning with desirable biologically-plausible attributes to learn sparse features. Our work opens a new frontier for the design of new algorithms by exploiting Lie-group structures.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kiral23a.html
https://proceedings.mlr.press/v206/kiral23a.htmlSwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal RetrievalWe tackle the cross-modal retrieval problem, where learning is only supervised by the relevant multi-modal pairs in the data. Although contrastive learning is the most popular approach for this task, it makes the potentially wrong assumption that the instances in different pairs are automatically irrelevant. To address the issue, we propose a novel loss function that is based on self-labeling of the unknown semantic classes. Specifically, we aim to predict class labels of the data instances in each modality, and assign those labels to the corresponding instances in the other modality (i.e., swapping the pseudo labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned to each other by the class predictor. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval. For all these tasks our method achieves significant performance improvement over contrastive learning.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kim23e.html
https://proceedings.mlr.press/v206/kim23e.htmlSqueeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual BanditsWe propose an algorithm for linear contextual bandits with an $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kim23d.html
https://proceedings.mlr.press/v206/kim23d.htmlCovariate-informed Representation Learning to Prevent Posterior Collapse of iVAEThe recently proposed identifiable variational autoencoder (iVAE) framework provides a promising approach for learning latent independent components (ICs). iVAEs use auxiliary covariates to build an identifiable generation structure from covariates to ICs to observations, and the posterior network approximates ICs given observations and covariates. Though the identifiability is appealing, we show that iVAEs can have a local-minimum solution where the observations and the approximated ICs are independent given covariates, a phenomenon we refer to as the posterior collapse problem of iVAEs. To overcome this problem, we develop a new approach, covariate-informed iVAE (CI-iVAE), by considering a mixture of encoder and posterior distributions in the objective function. In doing so, the objective function prevents the posterior collapse, resulting in latent representations that contain more information of the observations. Furthermore, CI-iVAE extends the original iVAE objective function to a larger class and finds the optimal one among them, thus having tighter evidence lower bounds than the original iVAE. Experiments on simulation datasets, EMNIST, Fashion-MNIST, and a large-scale brain imaging dataset demonstrate the effectiveness of our new method.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kim23c.html
https://proceedings.mlr.press/v206/kim23c.htmlContextual Linear Bandits under Noisy Features: Towards Bayesian OraclesWe study contextual linear bandit problems under feature uncertainty, where features are noisy and may have missing entries. To address the challenges of the noise, we analyze Bayesian oracles given observed noisy features. Our Bayesian analysis finds that the optimal hypothesis can be far from the underlying realizability function, depending on the noise characteristics, which is highly non-intuitive and does not occur in classical noiseless setups. This implies that classical approaches cannot guarantee a non-trivial regret bound. Therefore, we propose an algorithm that aims at the Bayesian oracle from observed information under this model, achieving $\tilde{O}(d\sqrt{T})$ regret bound when there is a large number of arms. We demonstrate the proposed algorithm using synthetic and real-world datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kim23b.html
https://proceedings.mlr.press/v206/kim23b.htmlCharacterizing Internal Evasion Attacks in Federated LearningFederated learning allows for clients in a distributed system to jointly train a machine learning model. However, clients’ models are vulnerable to attacks during the training and testing phases. In this paper, we address the issue of adversarial clients performing “internal evasion attacks”: crafting evasion attacks at test time to deceive other clients. For example, adversaries may aim to deceive spam filters and recommendation systems trained with federated learning for monetary gain. The adversarial clients have extensive information about the victim model in a federated learning setting, as weight information is shared amongst clients. We are the first to characterize the transferability of such internal evasion attacks for different learning methods and analyze the trade-off between model accuracy and robustness depending on the degree of similarities in client data. We show that adversarial training defenses in the federated learning setting only display limited improvements against internal attacks. However, combining adversarial training with personalized federated learning frameworks increases relative internal attack robustness by 60% compared to federated adversarial training and performs well under limited system resources.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kim23a.html
https://proceedings.mlr.press/v206/kim23a.htmlConvolutional Persistence as a Remedy to Neural Model AnalysisWhile deep neural networks are proven to be effective learning systems, their analysis is complex due to the high-dimensionality of their weight space. Persistent topological properties can be used as an additional descriptor, providing insights on how the network weights evolve during training. In this paper, we focus on convolutional neural networks and define the topology of the space populated by convolutional filters (i.e., kernels). We perform an extensive analysis of topological properties of the convolutional filters. Specifically, we define a metric based on persistent homology, namely, Convolutional Topology Representation, to determine an important factor in neural networks training: the generalizability of the model to the test set. We further analyse how various training methods affect the topology of convolutional layers.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/khramtsova23a.html
https://proceedings.mlr.press/v206/khramtsova23a.htmlAdversarial robustness of VAEs through the lens of local geometryIn an unsupervised attack on variational autoencoders (VAEs), an adversary finds a small perturbation in an input sample that significantly changes its latent space encoding, thereby compromising the reconstruction for a fixed decoder. A known reason for such vulnerability is the distortions in the latent space resulting from a mismatch between approximated latent posterior and a prior distribution. Consequently, a slight change in an input sample can move its encoding to a low/zero density region in the latent space resulting in an unconstrained generation. This paper demonstrates that an optimal way for an adversary to attack VAEs is to exploit a directional bias of a stochastic pullback metric tensor induced by the encoder and decoder networks. The pullback metric tensor of an encoder measures the change in infinitesimal latent volume from an input to a latent space. Thus, it can be viewed as a lens to analyse the effect of input perturbations leading to latent space distortions. We propose robustness evaluation scores using the eigenspectrum of a pullback metric tensor. Moreover, we empirically show that the scores correlate with the robustness parameter $\beta$ of the $\beta$-VAE. Since increasing $\beta$ also degrades reconstruction quality, we demonstrate a simple alternative using mixup training to fill the empty regions in the latent space, thus improving robustness with improved reconstruction.Tue, 11 Apr 2023 00:00:00 +0000
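The encoder pullback metric discussed above is $J^\top J$, where $J$ is the encoder Jacobian at the input; its eigenspectrum summarizes how input perturbations stretch latent volume. A minimal sketch using finite differences on a toy encoder (a fixed linear map standing in for a trained network, so the answer can be checked against $A^\top A$ in closed form):

```python
import numpy as np

def pullback_spectrum(f, x, h=1e-6):
    """Eigenvalues of the pullback metric J^T J of encoder f at input x."""
    d = x.size
    fx = f(x)
    J = np.empty((fx.size, d))
    for i in range(d):                       # central-difference Jacobian
        e = np.zeros(d)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2 * h)
    return np.linalg.eigvalsh(J.T @ J)       # ascending eigenvalues

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])              # toy linear "encoder" R^3 -> R^2
spec = pullback_spectrum(lambda x: A @ x, np.zeros(3))
# For a linear map the metric is A^T A, with eigenvalues {0, 1, 9}: the zero
# mode is a direction the encoder ignores; the largest mode is the direction
# in which a small input perturbation moves the latent code the most, which
# is exactly the directional bias an adversary would exploit.
```

For a real VAE encoder one would use automatic differentiation rather than finite differences, but the eigenspectrum-as-robustness-score idea is the same.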
https://proceedings.mlr.press/v206/khan23b.html
https://proceedings.mlr.press/v206/khan23b.htmlBarlow Graph Auto-Encoder for Unsupervised Network EmbeddingNetwork embedding has emerged as a promising research field for network analysis. Recently, an approach, named Barlow Twins, has been proposed for self-supervised learning in computer vision by applying the redundancy-reduction principle to the embedding vectors corresponding to two distorted versions of the image samples. Motivated by this, we propose Barlow Graph Auto-Encoder, a simple yet effective architecture for learning network embedding. It aims to maximize the similarity between the embedding vectors of immediate and larger neighborhoods of a node while minimizing the redundancy between the components of these projections. In addition, we also present the variational counterpart named Barlow Variational Graph Auto-Encoder. We demonstrate the effectiveness of our approach in learning multiple graph-related tasks, i.e., link prediction, clustering, and downstream node classification, by providing extensive comparisons with several well-known techniques on eight benchmark datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/khan23a.html
https://proceedings.mlr.press/v206/khan23a.htmlDiffusion Generative Models in Infinite DimensionsDiffusion generative models have recently been applied to domains where the available data can be seen as a discretization of an underlying function, such as audio signals or time series. However, these models operate directly on the discretized data, and there are no semantics in the modeling process that relate the observed data to the underlying functional forms. We generalize diffusion models to operate directly in function space by developing the foundational theory for such models in terms of Gaussian measures on Hilbert spaces. A significant benefit of our function space point of view is that it allows us to explicitly specify the space of functions we are working in, leading us to develop methods for diffusion generative modeling in Sobolev spaces. Our approach allows us to perform both unconditional and conditional generation of function-valued data. We demonstrate our methods on several synthetic and real-world benchmarks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kerrigan23a.html
https://proceedings.mlr.press/v206/kerrigan23a.htmlRank-Based Causal Discovery for Post-Nonlinear ModelsLearning causal relationships from empirical observations is a central task in scientific research. A common method is to employ structural causal models that postulate noisy functional relations among a set of interacting variables. To ensure unique identifiability of causal directions, researchers consider restricted subclasses of structural causal models. Post-nonlinear (PNL) causal models constitute one of the most flexible options for such restricted subclasses, containing in particular the popular additive noise models as a further subclass. However, learning PNL models is not well studied beyond the bivariate case. The existing methods learn non-linear functional relations by minimizing residual dependencies and subsequently test independence from residuals to determine causal orientations. However, these methods can be prone to overfitting and, thus, difficult to tune appropriately in practice. As an alternative, we propose a new approach for PNL causal discovery that uses rank-based methods to estimate the functional parameters. This new approach exploits natural invariances of PNL models and disentangles the estimation of the non-linear functions from the independence tests used to find causal orientations. We prove consistency of our method and validate our results in numerical experiments.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/keropyan23a.html
https://proceedings.mlr.press/v206/keropyan23a.htmlUnified Perspective on Probability Divergence via the Density-Ratio Likelihood: Bridging KL-Divergence and Integral Probability MetricsThis paper provides a unified perspective for the Kullback-Leibler (KL)-divergence and the integral probability metrics (IPMs) from the perspective of maximum likelihood density-ratio estimation (DRE). Both the KL-divergence and the IPMs are widely used in various fields in applications such as generative modeling. However, a unified understanding of these concepts has remained unexplored. In this paper, we show that the KL-divergence and the IPMs can be represented as maximal likelihoods differing only by sampling schemes, and use this result to derive a unified form of the IPMs and a relaxed estimation method. To develop the estimation problem, we construct an unconstrained maximum likelihood estimator to perform DRE with a stratified sampling scheme. We further propose a novel class of probability divergences, called the Density Ratio Metrics (DRMs), that interpolates the KL-divergence and the IPMs. In addition to these findings, we also introduce some applications of the DRMs, such as DRE and generative adversarial networks. In experiments, we validate the effectiveness of our proposed methods.Tue, 11 Apr 2023 00:00:00 +0000
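The link between likelihoods and density ratios can be made concrete with the classic probabilistic-classification estimator of a density ratio: a logistic classifier trained to separate samples from p and q recovers log p/q through its logit. A hedged sketch of that standard estimator (logistic regression by plain gradient descent; this illustrates the DRE idea generically, not the paper's stratified-sampling estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
xp = rng.normal(0.5, 1.0, 4000)   # samples from p = N(0.5, 1)
xq = rng.normal(0.0, 1.0, 4000)   # samples from q = N(0, 1)

x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(4000), np.zeros(4000)])
X = np.column_stack([np.ones_like(x), x])     # intercept + linear feature

w = np.zeros(2)
for _ in range(3000):                          # plain gradient ascent on the
    p = 1.0 / (1.0 + np.exp(-X @ w))           # logistic log-likelihood
    w += 0.1 * X.T @ (y - p) / len(y)

# With equal sample sizes, log p(x)/q(x) equals the logit w[0] + w[1] * x.
# The true log-ratio here is 0.5 * x - 0.125, so w[1] should be near 0.5.
```

For these two Gaussians the true log-ratio happens to be linear in x, so the simple intercept-plus-slope model is well specified; richer feature maps play the role of the function classes that distinguish the KL-divergence from the various IPMs.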
https://proceedings.mlr.press/v206/kato23a.html
https://proceedings.mlr.press/v206/kato23a.htmlNeural Discovery of Permutation SubgroupsWe consider the problem of discovering a subgroup $H$ of the permutation group $S_n$. Unlike the traditional $H$-invariant networks wherein $H$ is assumed to be known, we present a method to discover the underlying subgroup, given that it satisfies certain conditions. Our results show that one could discover any subgroup of type $S_k (k \leq n)$ by learning an $S_n$-invariant function and a linear transformation. We also prove similar results for cyclic and dihedral subgroups. Finally, we provide a general theorem that can be extended to discover other subgroups of $S_n$. We also demonstrate the applicability of our results through numerical experiments on image-digit sum and symmetric polynomial regression tasks.Tue, 11 Apr 2023 00:00:00 +0000
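The $S_n$-invariant function in such constructions is typically a sum-pooled, Deep-Sets-style network: per-element features are summed, so the output cannot depend on the order of the elements. A minimal sketch of the invariance property (fixed random weights and an illustrative architecture, not the paper's discovery method):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(1, 16))    # per-element feature-map weights
W2 = rng.normal(size=(16, 1))    # readout weights

def invariant_f(x):
    """S_n-invariant function: sum-pool per-element features, then read out."""
    phi = np.tanh(x[:, None] * W1)       # (n, 16) per-element features
    pooled = phi.sum(axis=0)             # permutation-invariant pooling
    return float((np.tanh(pooled) @ W2)[0])

x = rng.normal(size=10)
perm = rng.permutation(10)
# Permuting the input set leaves the output unchanged.
```

Composing such an invariant function with a learned linear transformation of the inputs is, per the abstract, enough to discover $S_k$-type subgroups; the sketch only exhibits the invariance that construction relies on.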
https://proceedings.mlr.press/v206/karjol23a.html
https://proceedings.mlr.press/v206/karjol23a.htmlRobust and Agnostic Learning of Conditional Distributional Treatment EffectsThe conditional average treatment effect (CATE) is the best measure of individual causal effects given baseline covariates. However, the CATE only captures the (conditional) average, and can overlook risks and tail events, which are important to treatment choice. In aggregate analyses, this is usually addressed by measuring the distributional treatment effect (DTE), such as differences in quantiles or tail expectations between treatment groups. Hypothetically, one can similarly fit conditional quantile regressions in each treatment group and take their difference, but this would not be robust to misspecification or provide agnostic best-in-class predictions. We provide a new robust and model-agnostic methodology for learning the conditional DTE (CDTE) for a class of problems that includes conditional quantile treatment effects, conditional super-quantile treatment effects, and conditional treatment effects on coherent risk measures given by f-divergences. Our method is based on constructing a special pseudo-outcome and regressing it on covariates using any regression learner. Our method is model-agnostic in that it can provide the best projection of CDTE onto the regression model class. Our method is robust in that even if we learn these nuisances nonparametrically at very slow rates, we can still learn CDTEs at rates that depend on the class complexity and even conduct inferences on linear projections of CDTEs. We investigate the behavior of our proposal in simulations, as well as in a case study of 401(k) eligibility effects on wealth.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/kallus23a.html
https://proceedings.mlr.press/v206/kallus23a.htmlBayesian Convolutional Deep Sets with Task-Dependent Stationary PriorConvolutional deep sets is a neural network architecture that can model stationary stochastic processes. This architecture uses a kernel smoother and a deep convolutional neural network to construct translation-equivariant functional representations. However, the non-parametric nature of the kernel smoother can produce ambiguous representations when the number of data points is insufficient. To address this issue, we introduce Bayesian convolutional deep sets, which construct random translation-equivariant functional representations with a stationary prior. Furthermore, we show how to impose a task-dependent prior for each dataset, because a wrongly imposed prior can result in an even worse representation than that of the kernel smoother. Empirically, we demonstrate that the proposed architecture alleviates the targeted issue in various experiments with time-series and image datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jung23a.html
https://proceedings.mlr.press/v206/jung23a.htmlAverage Adjusted Association: Efficient Estimation with High Dimensional ConfoundersThe log odds ratio is a well-established metric for evaluating the association between binary outcome and exposure variables. Despite its widespread use, there has been limited discussion on how to summarize the log odds ratio as a function of confounders through averaging. To address this issue, we propose the Average Adjusted Association (AAA), which is a summary measure of association in a heterogeneous population, adjusted for observed confounders. To facilitate its use, we also develop efficient double/debiased machine learning (DML) estimators of the AAA. Our DML estimators use two equivalent forms of the efficient influence function, and are applicable in various sampling scenarios, including random sampling, outcome-based sampling, and exposure-based sampling. Through real data and simulations, we demonstrate the practicality and effectiveness of our proposed estimators in measuring the AAA.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jun23a.html
https://proceedings.mlr.press/v206/jun23a.htmlGeneralization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph DiffusionGraph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network’s feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works’ settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate accurately with the observed generalization gaps of graph neural networks; optimizing noise stability properties when fine-tuning pretrained graph neural networks also improves test performance on several graph-level classification tasks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ju23a.html
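As a small numerical illustration of the quantity the abstract above contrasts with the maximum degree, the sketch below computes the largest singular value of a symmetrically normalized diffusion matrix $D^{-1/2} A D^{-1/2}$ for a toy graph. The graph and the choice of normalization are illustrative, not taken from the paper.

```python
# Hedged illustration: spectral norm of a normalized graph diffusion matrix
# versus the maximum degree, on a hypothetical 4-node undirected graph.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # toy adjacency matrix

deg = A.sum(axis=1)                          # degrees: [2, 3, 2, 1]
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
P = D_inv_sqrt @ A @ D_inv_sqrt              # symmetrically normalized diffusion

sigma_max = np.linalg.svd(P, compute_uv=False)[0]
# For this normalization the spectral norm is at most 1 (and equals 1 for a
# connected graph), while the maximum degree here is 3.
```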
https://proceedings.mlr.press/v206/ju23a.htmlFederated Learning under Distributed Concept DriftFederated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jothimurugesan23a.html
https://proceedings.mlr.press/v206/jothimurugesan23a.htmlScalable Bayesian Optimization Using Vecchia Approximations of Gaussian ProcessesBayesian optimization (BO) is a technique for optimizing black-box target functions. At the core of Bayesian optimization is a surrogate model that predicts the output of the target function at previously unseen inputs to facilitate the selection of promising input values. Gaussian processes (GPs) are commonly used as surrogate models but are known to scale poorly with the number of observations. Inducing point GP approximations can mitigate scaling issues, but may provide overly smooth estimates of the target function. In this work we adapt the Vecchia approximation, a popular GP approximation from spatial statistics, to enable scalable high-dimensional Bayesian optimization. We develop several improvements and extensions to Vecchia, including training warped GPs using mini-batch gradient descent, approximate neighbor search, and variance recalibration. We demonstrate the superior performance of Vecchia in BO using both Thompson sampling and qUCB. On several test functions and on two reinforcement-learning problems, our methods compare favorably to the state of the art, often outperforming inducing point methods and even exact GPs.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jimenez23a.html
https://proceedings.mlr.press/v206/jimenez23a.htmlA Conditional Gradient-based Method for Simple Bilevel Optimization with Convex Lower-level ProblemIn this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${O}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${O}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the Hölderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jiang23a.html
https://proceedings.mlr.press/v206/jiang23a.htmlDon’t be fooled: label leakage in explanation methods and the importance of their quantitative evaluationFeature attribution methods identify which features of an input most influence a model’s output. Most widely-used feature attribution methods (such as SHAP, LIME, and Grad-CAM) are “class-dependent” methods in that they generate a feature attribution vector as a function of class. In this work, we demonstrate that class-dependent methods can “leak” information about the selected class, making that class appear more likely than it is. Thus, an end user runs the risk of drawing false conclusions when interpreting an explanation generated by a class-dependent method. In contrast, we introduce “distribution-aware” methods, which favor explanations that keep the label’s distribution close to its distribution given all features of the input. We introduce SHAP-KL and FastSHAP-KL, two baseline distribution-aware methods that compute Shapley values. Finally, we perform a comprehensive evaluation of seven class-dependent and three distribution-aware methods on three clinical datasets of different high-dimensional data types: images, biosignals, and text.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jethani23a.html
https://proceedings.mlr.press/v206/jethani23a.htmlFactorial SDE for Multi-Output Gaussian Process RegressionMulti-output Gaussian process (GP) regression has been widely used as a flexible nonparametric Bayesian model for predicting multiple correlated outputs given inputs. However, the cubic complexity in the sample size and the output dimensions for inverting the kernel matrix has limited their use in the large-data regime. In this paper, we introduce the factorial stochastic differential equation as a representation of multi-output GP regression, which is a factored state-space representation as in factorial hidden Markov models. We propose a structured mean-field variational inference approach that achieves a time complexity linear in the number of samples, along with its sparse variational inference counterpart with complexity linear in the number of inducing points. On simulated and real-world data, we show that our approach significantly improves upon the scalability of previous methods, while achieving competitive prediction accuracy.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jeong23a.html
https://proceedings.mlr.press/v206/jeong23a.htmlNearly Optimal Latent State Decoding in Block MDPsWe consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP. We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jedra23a.html
https://proceedings.mlr.press/v206/jedra23a.htmlUltra-marginal Feature Importance: Learning from Data with Causal GuaranteesScientists frequently prioritize learning from data rather than training the best possible model; however, research in machine learning often prioritizes the latter. Marginal contribution feature importance (MCI) was developed to break this trend by providing a useful framework for quantifying the relationships in data. In this work, we aim to improve upon the theoretical properties, performance, and runtime of MCI by introducing ultra-marginal feature importance (UMFI), which uses dependence removal techniques from the AI fairness literature as its foundation. We first propose axioms for feature importance methods that seek to explain the causal and associative relationships in data, and we prove that UMFI satisfies these axioms under basic assumptions. We then show on real and simulated data that UMFI performs better than MCI, especially in the presence of correlated interactions and unrelated features, while partially learning the structure of the causal graph and reducing the exponential runtime of MCI to super-linear.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/janssen23a.html
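The dependence-removal idea behind UMFI can be sketched with a deliberately simple stand-in: residualize the other features against the feature of interest (here by linear regression, a simplification of the fairness-based removal techniques the abstract cites), then compare predictive fits with and without that feature. All data and helper names below are hypothetical.

```python
# Hedged sketch: marginal importance understates a feature's role when a
# correlated proxy is present; importance after (linear) dependence removal
# does not. Linear residualization stands in for the removal techniques UMFI
# actually uses; this is not the paper's algorithm.
import numpy as np

def r2(X, y):
    """In-sample R^2 of ordinary least squares (with intercept)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return 1.0 - (y - Xb @ coef).var() / y.var()

def residualize(Z, f):
    """Remove the linear component of each column of Z explained by f."""
    fb = np.stack([f, np.ones_like(f)], axis=1)
    coef, *_ = np.linalg.lstsq(fb, Z, rcond=None)
    return Z - fb @ coef

rng = np.random.default_rng(0)
n = 2000
f = rng.normal(size=n)                    # feature of interest
proxy = f + 0.1 * rng.normal(size=n)      # feature strongly correlated with f
y = f + 0.1 * rng.normal(size=n)          # outcome driven by f alone
Z = proxy[:, None]

# Importance of f after stripping Z of its dependence on f (large) ...
umfi = r2(np.hstack([residualize(Z, f), f[:, None]]), y) - r2(residualize(Z, f), y)
# ... versus its marginal contribution on top of the proxy (small).
marginal = r2(np.hstack([Z, f[:, None]]), y) - r2(Z, y)
```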
https://proceedings.mlr.press/v206/janssen23a.htmlBayesian Variable Selection in a Million DimensionsBayesian variable selection is a powerful tool for data analysis, as it offers a principled method for variable selection that accounts for prior information and uncertainty. However, wider adoption of Bayesian variable selection has been hampered by computational challenges, especially in difficult regimes with a large number of covariates P or non-conjugate likelihoods. To scale to the large P regime we introduce an efficient Markov Chain Monte Carlo scheme whose cost per iteration is sublinear in P (though linear in the number of data points). In addition we show how this scheme can be extended to generalized linear models for count data, which are prevalent in biology, ecology, economics, and beyond. In particular we design efficient algorithms for variable selection in binomial and negative binomial regression, which includes logistic regression as a special case. In experiments we demonstrate the effectiveness of our methods, including on cancer and maize genomic data.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jankowiak23a.html
https://proceedings.mlr.press/v206/jankowiak23a.htmlOnline Learning for Traffic Routing under Unknown PreferencesIn transportation networks, road tolling schemes are a method to cope with the efficiency losses due to selfish user routing, wherein users choose routes to minimize individual travel costs. However, the efficacy of tolling schemes often relies on access to complete information on users’ trip attributes, such as their origin-destination (O-D) travel information and their values of time, which may not be available in practice. Motivated by this practical consideration, we propose an online learning approach to set tolls in a traffic network to drive heterogeneous users with different values of time toward a system-efficient traffic pattern. In particular, we develop a simple yet effective algorithm that adjusts tolls at each time period solely based on the observed aggregate flows on the roads of the network without relying on any additional trip attributes of users, thereby preserving user privacy. In the setting where the O-D pairs and values of time of users are drawn i.i.d. at each period, we show that our approach obtains an expected regret and road capacity violation of $O(\sqrt{T})$, where $T$ is the number of periods over which tolls are updated. Our regret guarantee is relative to an offline oracle with complete information on users’ trip attributes. We further establish an $\Omega(\sqrt{T})$ lower bound on the regret of any algorithm, showing that our algorithm is optimal up to constants. Finally, we demonstrate the superior performance of our approach relative to several benchmarks on a real-world traffic network, which highlights its practical applicability.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/jalota23a.html
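The flow-based toll update described above can be caricatured as a dual-ascent rule that uses only observed aggregate flows: raise the toll on an over-capacity road, lower it (never below zero) otherwise. The two-road network and the elastic routing response below are toy stand-ins, not the paper's user model.

```python
# Hedged sketch: a flow-only toll update on a hypothetical two-road network.
# Users are simulated by a toy response that shifts demand away from the more
# expensive road; the step size and response slope are illustrative.
import numpy as np

def update_tolls(tolls, flows, capacity, step=0.5):
    """One dual-ascent-style step from observed flows only:
    toll_e <- max(0, toll_e + step * (flow_e - capacity_e))."""
    return np.maximum(0.0, tolls + step * (flows - capacity))

capacity = np.array([10.0, 10.0])
tolls = np.zeros(2)
for _ in range(60):
    d = tolls[0] - tolls[1]
    flows = np.array([14.0 - d, 6.0 + d])   # toy elastic routing response to tolls
    tolls = update_tolls(tolls, flows, capacity)
# Flows settle at capacity, with the congested road carrying the higher toll.
```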
https://proceedings.mlr.press/v206/jalota23a.htmlHedging against Complexity: Distributionally Robust Optimization with Parametric ApproximationEmpirical risk minimization (ERM) and distributionally robust optimization (DRO) are popular approaches for solving stochastic optimization problems that appear in operations management and machine learning. Existing generalization error bounds for these methods depend on either the complexity of the cost function or dimension of the uncertain parameters; consequently, the performance of these methods is poor for high-dimensional problems with objective functions under high complexity. We propose a simple approach in which the distribution of uncertain parameters is approximated using a parametric family of distributions. This mitigates both sources of complexity; however, it introduces a model misspecification error. We show that this new source of error can be controlled by suitable DRO formulations. Our proposed parametric DRO approach has significantly improved generalization bounds over existing ERM / DRO methods and parametric ERM for a wide variety of settings. Our method is particularly effective under distribution shifts. We also illustrate the superior performance of our approach on both synthetic and real-data portfolio optimization and regression tasks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/iyengar23a.html
https://proceedings.mlr.press/v206/iyengar23a.htmlRepresentation Learning in Deep RL via Discrete Information BottleneckSeveral self-supervised representation learning methods have been proposed for reinforcement learning (RL) with rich observations. For real world applications of RL, recovering underlying latent states is crucial, particularly when sensory inputs can contain irrelevant and exogenous information. In this work, we study how information bottlenecks can be used to construct latent states efficiently in the presence of task-irrelevant information. We propose architectures that utilize variational and discrete information bottlenecks, coined RepDIB, to learn structured factorized representations. Exploiting the expressiveness brought by factorized representations, we introduce a simple, yet effective, bottleneck that can be integrated with any existing self-supervised objective for RL. We demonstrate this across several online and offline RL benchmarks, along with a real robot arm task, where we find that compressed representations with RepDIB can lead to strong performance improvements, as the learnt bottlenecks can help predict only the relevant state, while ignoring irrelevant information.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/islam23a.html
https://proceedings.mlr.press/v206/islam23a.htmlKernel Conditional Moment Constraints for Confounding Robust InferenceWe study policy evaluation of offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to some coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimation of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound of the policy value. It can be shown that our estimator contains the recently proposed sharp estimator by Dorn and Guo (2022) as a special case, and our method enables a novel extension of the classical marginal sensitivity model using f-divergence. To construct our estimator, we leverage the kernel method to obtain a tractable approximation to the conditional moment constraints, which traditional non-sharp estimators failed to take into account. In the theoretical analysis, we provide a condition for the choice of the kernel which guarantees no specification error that biases the lower bound estimation. Furthermore, we provide consistency guarantees of policy evaluation and learning. In the experiments with synthetic and real-world data, we demonstrate the effectiveness of the proposed method.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ishikawa23a.html
https://proceedings.mlr.press/v206/ishikawa23a.htmlA stopping criterion for Bayesian optimization by the gap of expected minimum simple regretsBayesian optimization (BO) improves the efficiency of black-box optimization; however, the associated computational cost and power consumption remain dominant in the application of machine learning methods. This paper proposes a method of determining the stopping time in BO. The proposed criterion is based on the difference between the expectation of the minimum of a variant of the simple regrets before and after evaluating the objective function with a new parameter setting. Unlike existing stopping criteria, the proposed criterion is guaranteed to converge to the theoretically optimal stopping criterion for any choice of acquisition function and threshold value. Moreover, the threshold for the stopping criterion can be determined automatically and adaptively. We experimentally demonstrate that the proposed stopping criterion identifies a reasonable time to stop BO using a small number of evaluations of the objective function.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ishibashi23a.html
https://proceedings.mlr.press/v206/ishibashi23a.htmlLearning Constrained Structured Spaces with Application to Multi-Graph MatchingMulti-graph matching is a prominent structured prediction task, in which the predicted label is constrained to the space of cycle-consistent matchings. While direct loss minimization is an effective method for learning predictors over structured label spaces, it cannot be applied efficiently to the problem at hand, since executing a specialized solver across sets of matching predictions is computationally prohibitive. Moreover, there’s no supervision on the ground-truth matchings over cycle-consistent prediction sets. Our key insight is to strictly enforce the matching constraints in pairwise matching predictions and softly enforce the cycle-consistency constraints by casting them as weighted loss terms, such that the severity of inconsistency with global predictions is tuned by a penalty parameter. Inspired by the classic penalty method, we prove that our method theoretically recovers the optimal multi-graph matching constrained solution. Our method’s advantages are brought to light in experimental results on the popular keypoint matching task on the Pascal VOC and the Willow ObjectClass datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/indelman23a.html
https://proceedings.mlr.press/v206/indelman23a.htmlStochastic Mirror Descent for Large-Scale Sparse RecoveryWe discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to resolving a penalized stochastic optimization problem on each stage of a multistage algorithm; each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that the problem objective is smooth and quadratically minorated and stochastic perturbations are sub-Gaussian, our analysis prescribes the method parameters which ensure fast convergence of the estimation error (the radius of a confidence ball of a given norm around the approximate solution). This convergence is linear during the first “preliminary” phase of the routine and is sublinear during the second “asymptotic” phase. We consider an application of the proposed approach to sparse Generalized Linear Regression problem. In this setting, we show that the proposed algorithm attains the optimal convergence of the estimation error under weak assumptions on the regressor distribution. We also present a numerical study illustrating the performance of the algorithm on high-dimensional simulation data.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ilandarideva23a.html
https://proceedings.mlr.press/v206/ilandarideva23a.htmlFast Block Coordinate Descent for Non-Convex Group RegularizationsNon-convex sparse regularizations with group structures are useful tools for selecting important feature groups. For optimization with these regularizations, block coordinate descent (BCD) is a standard solver that iteratively updates each parameter group. However, it suffers from high computation costs for a large number of parameter groups. The state-of-the-art method prunes unnecessary updates in BCD by utilizing bounds on the norms of the parameter groups. Unfortunately, since it computes the bound for each iteration, the computation cost still tends to be high when the updates are not sufficiently pruned. This paper proposes a fast BCD for non-convex group regularizations. Specifically, it selects a small subset of the parameter groups from all the parameter groups on the basis of the bounds and performs BCD on the subset. The subset grows step by step in accordance with the bounds during optimization. Since it computes the bounds only when selecting and growing the subsets, the total cost for computing the bounds is smaller than in the previous method. In addition, we theoretically guarantee the convergence of our method. Experiments show that our method is up to four times faster than the state-of-the-art method and 68 times faster than the original BCD without any loss of accuracy.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ida23a.html
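To make the bound-based skipping concrete, the sketch below runs proximal BCD on a convex group-lasso stand-in for the non-convex group regularizations in the paper: a block is left at zero whenever the norm of its gradient at zero is below the regularization level, which mirrors the pruning idea (though not the paper's subset-growing scheme). Data and parameters are illustrative.

```python
# Hedged sketch: block coordinate descent for group-lasso regression with a
# norm-bound skip rule. A simplified, convex stand-in for the paper's setting.
import numpy as np

def group_soft_threshold(z, t):
    n = np.linalg.norm(z)
    return np.zeros_like(z) if n <= t else (1.0 - t / n) * z

def bcd_group_lasso(X, y, groups, lam, iters=200):
    """Minimize 0.5*||y - Xw||^2 + lam * sum_g ||w_g|| by proximal BCD."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        for g in groups:
            r = y - X @ w + X[:, g] @ w[g]            # residual excluding group g
            grad0 = X[:, g].T @ r                     # block gradient magnitude at w_g = 0
            if np.linalg.norm(grad0) <= lam:          # bound check: block stays at zero
                w[g] = 0.0
                continue
            L = np.linalg.norm(X[:, g], ord=2) ** 2   # block Lipschitz constant
            z = w[g] + X[:, g].T @ (y - X @ w) / L    # proximal-gradient step
            w[g] = group_soft_threshold(z, lam / L)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
w_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])    # only the first group is active
y = X @ w_true + 0.01 * rng.normal(size=50)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
w = bcd_group_lasso(X, y, groups, lam=5.0)
```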
https://proceedings.mlr.press/v206/ida23a.htmlFalsification of Internal and External Validity in Observational Studies via Conditional Moment RestrictionsRandomized Controlled Trials (RCT)s are relied upon to assess new treatments, but suffer from limited power to guide personalized treatment decisions. On the other hand, observational (i.e., non-experimental) studies have large and diverse populations, but are prone to various biases (e.g. residual confounding). To safely leverage the strengths of observational studies, we focus on the problem of falsification, whereby RCTs are used to validate causal effect estimates learned from observational data. In particular, we show that, given data from both an RCT and an observational study, assumptions on internal and external validity have an observable, testable implication in the form of a set of Conditional Moment Restrictions (CMRs). Further, we show that expressing these CMRs with respect to the causal effect, or “causal contrast”, as opposed to individual counterfactual means, provides a more reliable falsification test. In addition to giving guarantees on the asymptotic properties of our test, we demonstrate superior power and type I error of our approach on semi-synthetic and real world datasets. Our approach is interpretable, allowing a practitioner to visualize which subgroups in the population lead to falsification of an observational study.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/hussain23a.html
https://proceedings.mlr.press/v206/hussain23a.htmlTight Regret and Complexity Bounds for Thompson Sampling via Langevin Monte CarloIn this paper, we consider high dimensional contextual bandit problems. Within this setting, Thompson Sampling and its variants have been proposed and have been successfully applied to multiple machine learning problems. Existing theory on Thompson Sampling shows that it has suboptimal dimension dependency in contrast to upper confidence bound (UCB) algorithms. To circumvent this issue and obtain optimal regret bounds, (Zhang, 2021) recently proposed to modify Thompson Sampling by enforcing more exploration and hence is able to attain optimal regret bounds. Nonetheless, this analysis does not permit tractable implementation in high dimensions. The main challenge therein is the simulation of the posterior samples at each step given the available observations. To overcome this, we propose and analyze the use of Markov Chain Monte Carlo methods. As a corollary, we show that for contextual linear bandits, using Langevin Monte Carlo (LMC) or Metropolis Adjusted Langevin Algorithm (MALA), our algorithm attains optimal regret bounds of $\tilde{O}(d\sqrt{T})$. Furthermore, we show that this is obtained with $\tilde{O}(dT^4)$, $\tilde{O}(dT^2)$ data evaluations respectively for LMC and MALA. Finally, we validate our findings through numerical simulations and show that we outperform vanilla Thompson sampling in high dimensions.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/huix23a.html
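The sampling subroutine the abstract analyzes can be sketched with unadjusted Langevin steps targeting the Gaussian posterior of a toy linear model; a late iterate then plays the role of the approximate posterior draw that Thompson Sampling would act greedily on. The model, step size, and iteration count below are illustrative, not the paper's tuned choices.

```python
# Hedged sketch: unadjusted Langevin Monte Carlo targeting the posterior of a
# hypothetical Bayesian linear model (unit Gaussian prior, noise variance 0.01).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Gaussian posterior: log p(theta | data) = -0.5 * theta' A theta + b' theta + const
A = X.T @ X / 0.01 + np.eye(d)
b = X.T @ y / 0.01

def grad_log_post(theta):
    return b - A @ theta

eta = 0.5 / np.linalg.eigvalsh(A).max()   # step size below 1/L for stability
theta = np.zeros(d)
for _ in range(5000):
    theta = theta + eta * grad_log_post(theta) + np.sqrt(2 * eta) * rng.normal(size=d)
# 'theta' is now an approximate posterior sample, concentrated near theta_true.
```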
https://proceedings.mlr.press/v206/huix23a.htmlFix-A-Step: Semi-supervised Learning From Uncurated Unlabeled DataSemi-supervised learning (SSL) promises improved accuracy compared to training classifiers on small labeled datasets by also training on many unlabeled images. In real applications like medical imaging, unlabeled data will be collected for expediency and thus uncurated: possibly different from the labeled set in classes or features. Unfortunately, modern deep SSL often makes accuracy worse when given uncurated unlabeled data. Recent complex remedies try to detect out-of-distribution unlabeled images and then discard or downweight them. Instead, we introduce Fix-A-Step, a simpler procedure that views all uncurated unlabeled images as potentially helpful. Our first insight is that even uncurated images can yield useful augmentations of labeled data. Second, we modify gradient descent updates to prevent optimizing a multi-task SSL loss from hurting labeled-set accuracy. Fix-A-Step can “repair” many common deep SSL methods, improving accuracy on CIFAR benchmarks across all tested methods and levels of artificial class mismatch. On a new medical SSL benchmark called Heart2Heart, Fix-A-Step can learn from 353,500 truly uncurated ultrasound images to deliver gains that generalize across hospitals.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/huang23c.html
https://proceedings.mlr.press/v206/huang23c.htmlTowards Balanced Representation Learning for Credit Policy EvaluationCredit policy evaluation presents profitable opportunities for E-commerce platforms through improved decision-making. The core of policy evaluation is estimating the causal effects of the policy on the target outcome. However, selection bias presents a key challenge in estimating causal effects from real-world data. Some recent causal inference methods attempt to mitigate selection bias by leveraging covariate balancing in the representation space to obtain the domain-invariant features. However, it is noticeable that balanced representation learning can be accompanied by a failure of domain discrimination, resulting in the loss of domain-related information. This is referred to as the over-balancing issue. In this paper, we introduce a novel objective for representation balancing methods to do policy evaluation. In particular, we construct a doubly robust loss based on the predictions of treatment and outcomes, serving as a prerequisite for covariate balancing to deal with the over-balancing issue. In addition, we investigate how to improve treatment effect estimations by exploiting the unconfoundedness assumption. The extensive experimental results on benchmark datasets and a newly introduced credit dataset show a general outperformance of our method compared with existing methods.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/huang23b.html
AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization
In this paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving nonconvex-strongly-concave minimax problems by using unified adaptive matrices, which include almost all existing coordinate-wise and global adaptive learning rates. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Specifically, we propose a fast Adaptive Gradient Descent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower gradient complexity of $\tilde{O}(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, improving the existing results for adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. Moreover, we propose an accelerated version of AdaGDA (VR-AdaGDA) based on the momentum-based variance-reduction technique, which achieves a lower gradient complexity of $\tilde{O}(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, improving the existing results for adaptive GDA methods by a factor of $O(\epsilon^{-1})$. We further prove that our VR-AdaGDA method can reach the best known gradient complexity of $\tilde{O}(\kappa^{3}\epsilon^{-3})$ with mini-batch size $O(\kappa^3)$. Experiments on policy evaluation and fair classifier learning tasks verify the efficiency of our new algorithms.
https://proceedings.mlr.press/v206/huang23a.html
Adaptive Dimension Reduction and Variational Inference for Transductive Few-Shot Classification
Transductive few-shot learning has gained increasing attention, given the cost of data annotation and the accuracy gains that unlabelled samples provide in the few-shot domain. In particular, recent works on Few-Shot Classification (FSC) explore feature distributions with the aim of maximizing likelihoods or posteriors with respect to the unknown parameters. Following this vein, and considering the parallel between FSC and clustering, we seek to better account for the uncertainty in estimation caused by the lack of data, as well as for better statistical properties of the clusters associated with each class. We therefore propose a new clustering method based on variational Bayesian inference, further improved by adaptive dimension reduction based on probabilistic linear discriminant analysis. When applied to features used in previous studies, our method significantly improves accuracy in the realistic unbalanced transductive setting on various few-shot benchmarks, with a gain of up to 6% in accuracy. In addition, in the balanced setting we obtain very competitive results without making use of the class-balance artefact, which is disputable for practical use cases.
https://proceedings.mlr.press/v206/hu23c.html
A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning
We study the regret for risk-sensitive reinforcement learning (RL) with the exponential utility in the episodic MDP. Recent works establish both a lower bound $\Omega((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|)$ and the best known (upper) bound $\tilde{O}((e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|)$, where $H$ is the length of the episode, $S$ the size of state space, $A$ the size of action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and the lower bound is exponential and hence is unsatisfactory. In this paper, we show that a variant of the UCB-Advantage algorithm reduces a factor of $\sqrt{H}$ from the best previously known bound in any arbitrary MDP. To further sharpen the regret bound, we introduce a new mechanism of regret analysis and derive a problem-dependent regret bound without prior knowledge of the MDP from the algorithm. This bound is much tighter in MDPs with special structures. Particularly, we show that a regret that matches the information-theoretic lower bound up to logarithmic factors can be attained within a rich class of MDPs, which improves an exponential factor over the best previously known bound. Further, we derive a novel information-theoretic lower bound of $\Omega(\max_{h\in[H]} c_{v,h+1}^*\sqrt{SAT}/|\beta|)$, where $\max_{h\in[H]} c_{v,h+1}^*$ is a problem-dependent statistic. This lower bound shows that the problem-dependent regret bound achieved by the algorithm is optimal in its dependence on $\max_{h\in[H]} c_{v,h+1}^*$.
https://proceedings.mlr.press/v206/hu23b.html
Privacy-preserving Sparse Generalized Eigenvalue Problem
In this paper we study the (sparse) Generalized Eigenvalue Problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher’s discriminant analysis (FDA) and sliced inverse regression (SIR). We provide the first study on GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low-dimensional case, we provide a $\rho$-Concentrated DP (CDP) method, namely DP-Rayleigh Flow, and show that if the initial vector is close enough to the optimal vector, its output has an $\ell_2$-norm estimation error of $\tilde{O}(\frac{d}{n}+\frac{d}{n^2\rho})$ (under some mild assumptions), where $d$ is the dimension and $n$ is the sample size. Next, we discuss how to find such an initial parameter privately. In the high-dimensional sparse case where $d\gg n$, we propose the DP-Truncated Rayleigh Flow method, whose output can achieve an error of $\tilde{O}(\frac{s\log d}{n}+\frac{s\log d}{n^2\rho})$ for various statistical models, where $s$ is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of $\mathrm{Poly}(\log n)$ by providing lower bounds for PCA and SIR under the statistical setting in the CDP model. Finally, to give a separation between $\epsilon$-DP and $\rho$-CDP for GEP, we also provide lower bounds of $\Omega(\frac{d}{n}+\frac{d^2}{n^2\epsilon^2})$ and $\Omega(\frac{s\log d}{n}+\frac{s^2\log^2 d}{n^2\epsilon^2})$ on the private minimax risk for PCA, under the statistical setting and the $\epsilon$-DP model, in the low-dimensional and high-dimensional sparse cases respectively.
https://proceedings.mlr.press/v206/hu23a.html
Delayed Feedback in Generalised Linear Bandits Revisited
The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve regret of $\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting, where the best known regret bound was $\widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$. We verify our theoretical results through experiments on simulated data.
https://proceedings.mlr.press/v206/howson23b.html
Optimism and Delays in Episodic Reinforcement Learning
There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, providing that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.
https://proceedings.mlr.press/v206/howson23a.html
An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge
We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on non-stationary kernel bandits. Moreover, when applied to the non-stationary linear bandits by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
https://proceedings.mlr.press/v206/hong23b.html
Variational Inference for Neyman-Scott Processes
Neyman-Scott processes (NSPs) have been applied across a range of fields to model points or temporal events with a hierarchy of clusters. Markov chain Monte Carlo (MCMC) is typically used for posterior sampling in the model. However, MCMC’s mixing time can cause the resulting inference to be slow, and thereby slow down model learning and prediction. We develop the first variational inference (VI) algorithm for NSPs, and give two examples of suitable variational posterior point process distributions. Our method minimizes the inclusive Kullback-Leibler (KL) divergence for VI to obtain the variational parameters. We generate samples from the approximate posterior point processes much faster than MCMC, as we can directly estimate the approximate posterior point processes without any MCMC steps or gradient descent. We include synthetic and real-world data experiments that demonstrate our VI algorithm achieves better prediction performance than MCMC when computational time is limited.
https://proceedings.mlr.press/v206/hong23a.html
Neural Laplace Control for Continuous-time Delayed Systems
Many real-world offline reinforcement learning (RL) problems involve continuous-time environments with delays. Such environments are characterized by two distinctive features: firstly, the state x(t) is observed at irregular time intervals, and secondly, the current action a(t) only affects the future state x(t + g) with an unknown delay g > 0. A prime example of such an environment is satellite control, where the communication link between earth and a satellite causes irregular observations and delays. Existing offline RL algorithms have achieved success in environments with irregularly observed states or with known delays. However, environments involving both irregular observations in time and unknown delays remain an open and challenging problem. To this end, we propose Neural Laplace Control, a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner, and is able to learn from an offline dataset sampled at irregular time intervals from an environment that has an inherent unknown constant delay. We show experimentally that on continuous-time delayed environments it is able to achieve near-expert policy performance.
https://proceedings.mlr.press/v206/holt23a.html
Flexible risk design using bi-directional dispersion
Many novel notions of “risk” (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution of gradient-based learners.
https://proceedings.mlr.press/v206/holland23a.html
ProbNeRF: Uncertainty-Aware Inference of 3D Shapes from 2D Images
The problem of inferring object shape from a single 2D image is underconstrained. Prior knowledge about what objects are plausible can help, but even given such prior knowledge there may still be uncertainty about the shapes of occluded parts of objects. Recently, conditional neural radiance field (NeRF) models have been developed that can learn to infer good point estimates of 3D models from single 2D images. The problem of inferring uncertainty estimates for these models has received less attention. In this work, we propose probabilistic NeRF (ProbNeRF), a model and inference strategy for learning probabilistic generative models of 3D objects’ shapes and appearances, and for doing posterior inference to recover those properties from 2D images. ProbNeRF is trained as a variational autoencoder, but at test time we use Hamiltonian Monte Carlo (HMC) for inference. Given one or a few 2D images of an object (which may be partially occluded), ProbNeRF is able not only to accurately model the parts it sees, but also to propose realistic and diverse hypotheses about the parts it does not see. We show that key to the success of ProbNeRF are (i) a deterministic rendering scheme, (ii) an annealed-HMC strategy, (iii) a hypernetwork-based decoder architecture, and (iv) doing inference over a full set of NeRF weights, rather than just a low-dimensional code. Videos and code are available at https://probnerf.github.io.
https://proceedings.mlr.press/v206/hoffman23a.html
Unifying local and global model explanations by functional decomposition of low dimensional structures
We consider a global representation of a regression or classification function by decomposing it into the sum of main and interaction components of arbitrary order. We propose a new identification constraint that allows for the extraction of interventional SHAP values and partial dependence plots, thereby unifying local and global explanations. With our proposed identification, a feature’s partial dependence plot corresponds to the main effect term plus the intercept. The interventional SHAP value of feature $k$ is a weighted sum of the main component and all interaction components that include $k$, with the weights given by the reciprocal of the component’s dimension. This brings a new perspective to local explanations such as SHAP values, which were previously motivated by game theory only. We show that the decomposition can be used to reduce direct and indirect bias by removing all components that include a protected feature. Lastly, we motivate a new measure of feature importance. In principle, our proposed functional decomposition can be applied to any machine learning model, but exact calculation is only feasible for low-dimensional structures or ensembles of those. We provide an algorithm and efficient implementation for gradient-boosted trees (xgboost) and random planted forest. Conducted experiments suggest that our method provides meaningful explanations and reveals interactions of higher orders. The proposed methods are implemented in an R package, available at https://github.com/PlantedML/glex.
https://proceedings.mlr.press/v206/hiabu23a.html
Fast Distributed k-Means with a Small Number of Rounds
We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means$||$ algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means$||$, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means$||$.
https://proceedings.mlr.press/v206/hess23a.html
A principled framework for the design and analysis of token algorithms
We consider a decentralized optimization problem, in which n nodes collaborate to optimize a global objective function using local communications only. While many decentralized algorithms focus on gossip communications (pairwise averaging), we consider a different scheme, in which a “token” that contains the current estimate of the model performs a random walk over the network, and updates its model using the local model of the node it is at. Indeed, token algorithms generally benefit from improved communication efficiency and privacy guarantees. We frame the token algorithm as a randomized gossip algorithm on a conceptual graph, which allows us to prove a series of convergence results for variance-reduced and accelerated token algorithms for the complete graph. We also extend these results to the case of multiple tokens by extending the conceptual graph, and to general graphs by tweaking the communication procedure. The reduction from token to well-studied gossip algorithms leads to tight rates for many token algorithms, and we illustrate their performance empirically.
https://proceedings.mlr.press/v206/hendrikx23a.html
Agnostic PAC Learning of $k$-juntas Using $L_2$-Polynomial Regression
Many conventional learning algorithms rely on loss functions other than the natural 0-1 loss for computational efficiency and theoretical tractability. Among them are approaches based on absolute loss (L1 regression) and square loss (L2 regression). The first is proved to be an agnostic PAC learner for various important concept classes, such as juntas and half-spaces. The second is preferable because its computational cost is linear in the sample size; however, its PAC learnability is still unknown, as guarantees have been proved only under distributional restrictions. The question of whether L2 regression is an agnostic PAC learner under the 0-1 loss has been open since 1993. This paper resolves the question for the junta class on the Boolean cube, proving agnostic PAC learning of k-juntas using L2 polynomial regression. Moreover, we present a new PAC learning algorithm based on the Boolean Fourier expansion with lower computational complexity. Fourier-based algorithms, such as that of Linial et al. (1993), have previously been used under distributional restrictions, such as the uniform distribution. We show that with an appropriate change one can apply those algorithms in agnostic settings without any distributional assumption. We prove our results by connecting PAC learning with 0-1 loss to the minimum mean square estimation (MMSE) problem. We derive an upper bound on the 0-1 loss in terms of the MMSE error; based on this, we show that the sign of the MMSE estimator is an agnostic PAC learner for any concept class containing it.
https://proceedings.mlr.press/v206/heidari23b.html
Learning k-qubit Quantum Operators via Pauli Decomposition
Motivated by the limited qubit capacity of current quantum systems, we study the quantum sample complexity of k-qubit quantum operators, i.e., operations applicable on only k out of d qubits. The problem is studied according to the quantum probably approximately correct (QPAC) model abiding by quantum mechanical laws such as no-cloning, state collapse, and measurement incompatibility. With the delicacy of quantum samples and the richness of quantum operations, one expects a significantly larger quantum sample complexity. This paper proves the contrary. We show that the quantum sample complexity of k-qubit quantum operations is comparable to the classical sample complexity of their counterparts (juntas), at least when $\frac{k}{d}\ll 1$. This is surprising, especially since sample duplication is prohibited, and measurement incompatibility would lead to an exponentially larger sample complexity with standard methods. Our approach is based on the Pauli decomposition of quantum operators and a technique called Quantum Shadow Sampling (QSS) to reduce the sample complexity exponentially. The results are proved by developing (i) a connection between the learning loss and the Pauli decomposition; (ii) a scalable QSS circuit for estimating the Pauli coefficients; and (iii) a quantum algorithm for learning $k$-qubit operators with sample complexity $O(\frac{k4^k}{\epsilon^2}\log d)$.
https://proceedings.mlr.press/v206/heidari23a.html
TabLLM: Few-shot Classification of Tabular Data with Large Language Models
We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model using some labeled examples. We evaluate several serialization methods including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method’s ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.
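The template-based serialization the abstract mentions can be illustrated with a toy sketch. This is a minimal, hypothetical example, not the paper's implementation: the `serialize_row` and `build_prompt` helpers, the "The {column} is {value}" template, and the column names are all illustrative assumptions.

```python
def serialize_row(row: dict) -> str:
    """Template-style serialization: one 'The <column> is <value>' clause per feature.
    (Illustrative template; the paper also evaluates table-to-text and LLM serializers.)"""
    return ". ".join(f"The {col} is {val}" for col, val in row.items()) + "."

def build_prompt(row: dict, task_description: str) -> str:
    """Combine the serialized row with a short natural-language description of the task."""
    return f"{serialize_row(row)}\n{task_description}\nAnswer:"

# Hypothetical feature columns for a census-style income task.
row = {"age": 42, "occupation": "teacher", "hours per week": 40}
prompt = build_prompt(row, "Does this person earn more than $50k per year? Yes or no?")
```

The resulting string would be fed to the language model as-is for zero-shot classification, or used as a training example during few-shot fine-tuning.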
https://proceedings.mlr.press/v206/hegselmann23a.html
SoundSynp: Sound Source Detection from Raw Waveforms with Multi-Scale Synperiodic Filterbanks
We propose synperiodic filter banks, a novel multi-scale learnable filter bank construction strategy in which all filters are synchronized by their rotating periodicity. By synchronizing at a certain periodicity, we naturally obtain filters whose temporal length is reduced if they carry a higher frequency response, and vice versa. Such filters internally maintain a better time-frequency resolution trade-off. By further alternating the periodicity, we can easily obtain a group of synperiodic filter banks, where filters with the same frequency response in different groups differ in temporal length. Convolving these filter banks with the raw sound waveform achieves multi-scale perception in the time domain. Moreover, applying the same filter banks to recursively process the 2x-downsampled waveform enables multi-scale perception in the frequency domain. Benefiting from the multi-scale perception in both time and frequency domains, our proposed synperiodic filter banks learn a multi-scale time-frequency representation in a data-driven way. Experiments on both sound-source direction-of-arrival (DoA) and physical location detection tasks show the superiority of synperiodic filter banks.
https://proceedings.mlr.press/v206/he23c.html
How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm?
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. Distribution-free upper and lower bounds on the gen-error can also be obtained. Our findings offer new insights that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and input training data but also by the information shared between the labeled and pseudo-labeled data samples. This serves as a guideline to choose an appropriate pseudo-labeling method from a given family of methods. To deepen our understanding, we further explore two examples: mean estimation and logistic regression. In particular, we analyze how the ratio $\lambda$ of the number of unlabeled to labeled data affects the gen-error under both scenarios. As $\lambda$ increases, the gen-error for mean estimation decreases and then saturates at a value larger than when all the samples are labeled; the gap can be quantified exactly with our analysis and depends on the cross-covariance between the labeled and pseudo-labeled data samples. For logistic regression, the gen-error and the variance component of the excess risk also decrease as $\lambda$ increases.
https://proceedings.mlr.press/v206/he23b.html
Learning Physics-Informed Neural Networks without Stacked Back-propagation
Physics-Informed Neural Networks (PINNs) have become a commonly used machine learning approach to solving partial differential equations (PDEs). However, on high-dimensional second-order PDE problems, PINN suffers from severe scalability issues, since its loss includes second-order derivatives whose computational cost grows with the dimension during stacked back-propagation. In this work, we develop a novel approach that can significantly accelerate the training of Physics-Informed Neural Networks. In particular, we parameterize the PDE solution by a Gaussian-smoothed model and show that, derived from Stein’s identity, the second-order derivatives can be efficiently calculated without back-propagation. We further discuss the model capacity and provide variance-reduction methods to address key limitations in the derivative estimation. Experimental results show that our proposed method achieves competitive error compared to standard PINN training while being significantly faster.
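The core trick the abstract describes, second derivatives of a Gaussian-smoothed function from forward evaluations only, can be sketched for a toy scalar function. This is a generic Monte Carlo illustration of Stein's identity, not the paper's estimator; the test function, smoothing scale, and sample count are illustrative assumptions.

```python
import numpy as np

def smoothed_hessian(f, x, sigma=0.5, n_samples=100000, rng=None):
    """Monte Carlo estimate of the Hessian of the Gaussian-smoothed function
    f_sigma(x) = E[f(x + sigma*delta)], delta ~ N(0, I), via Stein's identity:
        Hess f_sigma(x) = E[ f(x + sigma*delta) * (delta delta^T - I) ] / sigma^2.
    Only forward evaluations of f are needed, no back-propagation through f."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    delta = rng.standard_normal((n_samples, d))
    vals = np.array([f(x + sigma * dlt) for dlt in delta])  # forward passes only
    outer = np.einsum('ni,nj->nij', delta, delta) - np.eye(d)
    return np.einsum('n,nij->ij', vals, outer) / (n_samples * sigma**2)

# Sanity check on f(x) = x0^2 + 3*x1^2, whose Hessian is diag(2, 6);
# since f is quadratic, Gaussian smoothing leaves its Hessian unchanged.
H = smoothed_hessian(lambda z: z[0]**2 + 3 * z[1]**2, np.array([0.5, -0.2]))
```

For a quadratic test function the smoothed Hessian equals the true one, so the estimate should land near diag(2, 6) up to Monte Carlo noise; the paper's contribution includes variance-reduction methods that make such estimators practical inside a PINN loss.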
https://proceedings.mlr.press/v206/he23a.html
Mind the (optimality) Gap: A Gap-Aware Learning Rate Scheduler for Adversarial Nets
Adversarial nets have proved to be powerful in various domains including generative modeling (GANs), transfer learning, and fairness. However, successfully training adversarial nets using first-order methods remains a major challenge. Typically, careful choices of the learning rates are needed to maintain the delicate balance between the competing networks. In this paper, we design a novel learning rate scheduler that dynamically adapts the learning rate of the adversary to maintain the right balance. The scheduler is driven by the fact that the loss of an ideal adversarial net is a constant known a priori. The scheduler is thus designed to keep the loss of the optimized adversarial net close to that of an ideal network. We run large-scale experiments to study the effectiveness of the scheduler on two popular applications: GANs for image generation and adversarial nets for domain adaptation. Our experiments indicate that adversarial nets trained with the scheduler are less likely to diverge and require significantly less tuning. For example, on CelebA, a GAN with the scheduler requires only one-tenth of the tuning budget needed without a scheduler. Moreover, the scheduler leads to statistically significant improvements in model quality, reaching up to 27% in Fréchet Inception Distance for image generation and 3% in test accuracy for domain adaptation.
https://proceedings.mlr.press/v206/hazimeh23a.html
Nyström Method for Accurate and Scalable Implicit Differentiation
The essential difficulty of gradient-based bilevel optimization using implicit differentiation is to estimate the inverse Hessian vector product with respect to neural network parameters. This paper proposes to tackle this problem by the Nyström method and the Woodbury matrix identity, exploiting the low-rankness of the Hessian. Compared to existing methods using iterative approximation, such as conjugate gradient and the Neumann series approximation, the proposed method avoids numerical instability and can be efficiently computed in matrix operations without iterations. As a result, the proposed method works stably in various tasks and is faster than iterative approximations. Throughout experiments including large-scale hyperparameter optimization and meta learning, we demonstrate that the Nyström method consistently achieves comparable or even superior performance to other approaches. The source code is available from https://github.com/moskomule/hypergrad.
https://proceedings.mlr.press/v206/hataya23a.html
Learning in RKHM: a C*-Algebraic Twist for Kernel Machines
Supervised learning in reproducing kernel Hilbert space (RKHS) and vector-valued RKHS (vvRKHS) has been investigated for more than 30 years. In this paper, we provide a new twist to this rich literature by generalizing supervised learning in RKHS and vvRKHS to reproducing kernel Hilbert C*-module (RKHM), and show how to construct effective positive-definite kernels by considering the perspective of C*-algebra. Unlike the cases of RKHS and vvRKHS, we can use C*-algebras to enlarge representation spaces. This enables us to construct RKHMs whose representation power goes beyond RKHSs, vvRKHSs, and existing methods such as convolutional neural networks. Our framework is suitable, for example, for effectively analyzing image data by allowing the interaction of Fourier components.
https://proceedings.mlr.press/v206/hashimoto23a.html

Structure of Nonlinear Node Embeddings in Stochastic Block Models

Nonlinear node embedding techniques such as DeepWalk and Node2Vec are used extensively in practice to uncover structure in graphs. Despite theoretical guarantees in special regimes (such as the case of high embedding dimension), the structure of the optimal low dimensional embeddings has not been formally understood even for graphs obtained from simple generative models. We consider the stochastic block model and show that under appropriate separation conditions, the optimal embeddings can be analytically characterized. Akin to known results on eigenvector based (spectral) embeddings, we prove theoretically that solution vectors are well-clustered, up to a sublinear error.
https://proceedings.mlr.press/v206/harker23a.html

Optimal Contextual Bandits with Knapsacks under Realizability via Regression Oracles

We study the stochastic contextual bandit with knapsacks (CBwK) problem, where each action, taken upon a context, not only leads to a random reward but also incurs a random, vector-valued resource consumption. The challenge is to maximize the total reward without violating the budget for each resource. We study this problem under a general realizability setting where the expected reward and expected cost are functions of contexts and actions in some given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to the linear function class since they use UCB-type algorithms, which heavily rely on the linear form and thus are difficult to extend to general function classes. Motivated by online regression oracles that have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish the lower regret bound to show the optimality of our algorithm for a variety of function classes.
https://proceedings.mlr.press/v206/han23b.html

Riemannian Accelerated Gradient Methods via Extrapolation

In this paper, we propose a convergence acceleration scheme for general Riemannian optimization problems by extrapolating iterates on manifolds. We show that when the iterates are generated from the Riemannian gradient descent method, the scheme achieves the optimal convergence rate asymptotically and is computationally more favorable than the recently proposed Riemannian Nesterov accelerated gradient methods. A salient feature of our analysis is the convergence guarantees with respect to the use of general retraction and vector transport. Empirically, we verify the practical benefits of the proposed acceleration strategy, including robustness to the choice of different averaging schemes on manifolds.
https://proceedings.mlr.press/v206/han23a.html

BaCaDI: Bayesian Causal Discovery with Unknown Interventions

Inferring causal structures from experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, the targets of the interventions are often uncertain or unknown and the number of observations is limited. As a result, standard causal discovery methods can no longer be reliably used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering and reasoning about the causal structure that underlies data generated under various unknown experimental or interventional conditions. BaCaDI is fully differentiable, which allows us to infer the complex joint posterior over the intervention targets and the causal structure via efficient gradient-based variational inference. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets.
https://proceedings.mlr.press/v206/hagele23a.html

An Unpooling Layer for Graph Generation

We propose a novel and trainable graph unpooling layer for effective graph generation. The unpooling layer receives an input graph with features and outputs an enlarged graph with desired structure and features. We prove that the output graph of the unpooling layer remains connected and that for any connected graph there exists a series of unpooling layers that can produce it from a 3-node graph. We apply the unpooling layer within the generator of a generative adversarial network as well as the decoder of a variational autoencoder. We give extensive experimental evidence demonstrating the competitive performance of our proposed method on synthetic and real data.
https://proceedings.mlr.press/v206/guo23a.html

Can 5th Generation Local Training Methods Support Client Sampling? Yes!

The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022) identified four distinct generations of LT methods based on the quality of the provided theoretical communication complexity guarantees. Despite a lot of progress in this area, none of the existing works were able to show that it is theoretically better to employ multiple local gradient-type steps (i.e., to engage in LT) than to rely on a single local gradient-type step only in the important heterogeneous data regime. In a recent breakthrough embodied in their ProxSkip method and its theoretical analysis, Mishchenko et al. (2022) showed that LT indeed leads to provable communication acceleration for arbitrarily heterogeneous data, thus jump-starting the 5th generation of LT methods. However, while these latest generation LT methods are compatible with DS, none of them support CS. We resolve this open problem in the affirmative. In order to do so, we had to base our algorithmic development on new algorithmic and theoretical foundations.
https://proceedings.mlr.press/v206/grudzien23a.html

Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition

Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift and are restricted to classification. Alternatively, proper scores can be used for most predictive tasks, but a bias-variance decomposition for model uncertainty does not exist in the current literature. In this work, we introduce a general bias-variance decomposition for proper scores, giving rise to the Bregman Information as the variance term. We discover how exponential families and the classification log-likelihood are special cases and provide novel formulations. Surprisingly, we can express the classification case purely in the logit space. We showcase the practical relevance of this decomposition on several downstream tasks, including model ensembles and confidence regions. Further, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.
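The decomposition at the heart of this abstract can be checked numerically. The sketch below (my own illustration, with negative entropy as the convex generator; the sample values and the estimate s are arbitrary) verifies the classical Bregman identity E[D(X, s)] = E[D(X, mu)] + D(mu, s), whose middle term is exactly the Bregman Information of X.

```python
import numpy as np

def bregman(F, gradF, a, b):
    """Bregman divergence D_F(a, b) generated by a convex function F."""
    return F(a) - F(b) - gradF(b) * (a - b)

# generator F(x) = x log x, whose divergence is the generalized KL
F  = lambda x: x * np.log(x)
gF = lambda x: np.log(x) + 1.0

X  = np.array([0.5, 1.0, 2.0, 4.0])    # equally likely outcomes
mu = X.mean()                           # the Bregman-optimal point estimate
s  = 3.0                                # an arbitrary other estimate

lhs = bregman(F, gF, X, s).mean()       # expected divergence to estimate s
# bias-variance split: the "variance" E[D(X, mu)] is the Bregman Information
rhs = bregman(F, gF, X, mu).mean() + bregman(F, gF, mu, s)
```

The identity is exact for any Bregman divergence, which is what lets the paper recover squared loss and log-likelihood as special cases of one decomposition.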
https://proceedings.mlr.press/v206/gruber23a.html

Coarse-Grained Smoothness for Reinforcement Learning in Metric Spaces

Principled decision-making in continuous state–action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning. We provide a theoretical analysis of our new smoothness definition, and discuss its implications and impact on control and exploration in continuous domains.
https://proceedings.mlr.press/v206/gottesman23a.html

Incentive-aware Contextual Pricing with Non-parametric Market Noise

We consider a dynamic pricing problem for repeated contextual second-price auctions with multiple strategic buyers who aim to maximize their long-term time discounted utility. The seller has limited information on buyers’ overall demand curves, which depend on a non-parametric market-noise distribution, and buyers may potentially submit corrupted bids (relative to true valuations) to manipulate the seller’s pricing policy for more favorable reserve prices in the future. We focus on designing the seller’s learning policy to set contextual reserve prices where the seller’s goal is to minimize regret compared to the revenue of a benchmark clairvoyant policy that has full information of buyers’ demand. We propose a policy with a phased structure that incorporates randomized “isolation” periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. We then prove that our policy enjoys a T-period regret of $O(\sqrt{T})$ when facing strategic buyers. Finally, we conduct numerical simulations to compare our proposed algorithm to standard pricing policies. Our numerical results show that our algorithm outperforms these policies under various buyer bidding behaviors.
https://proceedings.mlr.press/v206/golrezaei23b.html

Pricing against a Budget and ROI Constrained Buyer

Internet advertisers (buyers) repeatedly procure ad impressions from ad platforms (sellers) with the aim to maximize total conversion (i.e. ad value) while respecting both budget and return-on-investment (ROI) constraints for efficient utilization of limited monetary resources. Facing such a constrained buyer who aims to learn her optimal strategy to acquire impressions, we study from a seller’s perspective how to learn and price ad impressions through repeated posted price mechanisms to maximize revenue. For this two-sided learning setup, we propose a learning algorithm for the seller that utilizes an episodic binary-search procedure to identify a revenue-optimal selling price. We show that such a simple learning algorithm enjoys low seller regret when within each episode, the budget and ROI constrained buyer approximately best responds to the posted price. We present simple yet natural buyer’s bidding algorithms under which the buyer approximately best responds while satisfying budget and ROI constraints, leading to a low regret for our proposed seller pricing algorithm. The design of our seller algorithm is motivated by the fact that the seller’s revenue function admits a bell-shaped structure when the buyer best responds to prices under budget and ROI constraints, enabling our seller algorithm to identify revenue-optimal selling prices efficiently.
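The bell-shaped (unimodal) revenue structure is what makes a search-based pricing scheme work. A minimal sketch of the idea (ternary search on a toy linear-demand revenue curve; this is an illustration of the unimodal-search principle, not the paper's episodic procedure or its buyer model):

```python
def ternary_search_max(f, lo, hi, tol=1e-6):
    """Locate the maximizer of a unimodal ("bell-shaped") function on
    [lo, hi] by repeatedly discarding a third of the interval."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1        # maximizer cannot lie in [lo, m1]
        else:
            hi = m2        # maximizer cannot lie in [m2, hi]
    return 0.5 * (lo + hi)

# toy revenue curve: price times acceptance probability (linear demand)
revenue = lambda p: p * max(0.0, 1.0 - p / 10.0)   # maximized at p = 5
p_star = ternary_search_max(revenue, 0.0, 10.0)
```

With unimodality, each comparison of two interior prices safely discards a third of the price range, so the optimal price is located in logarithmically many probes.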
https://proceedings.mlr.press/v206/golrezaei23a.html

Analysis of Catastrophic Forgetting for Random Orthogonal Transformation Tasks in the Overparameterized Regime

Overparameterization is known to permit strong generalization performance in neural networks. In this work, we provide an initial theoretical analysis of its effect on catastrophic forgetting in a continual learning setup. We show experimentally that in Permuted MNIST image classification tasks, the generalization performance of multilayer perceptrons trained by vanilla stochastic gradient descent can be improved by overparameterization, and the extent of the performance increase achieved by overparameterization is comparable to that of state-of-the-art continual learning algorithms. We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem, where each task is related by a random orthogonal transformation. We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small if the model is sufficiently overparameterized.
https://proceedings.mlr.press/v206/goldfarb23a.html

Breaking a Classical Barrier for Classifying Arbitrary Test Examples in the Quantum Model

A new model for adversarial robustness was introduced by Goldwasser et al. in [GKKM20]. In this model the authors present a selective and transductive learning algorithm which guarantees a low test error and low rejection rate with respect to the original distribution. Moreover, a lower bound in terms of the VC-dimension, the standard risk and the number of samples is derived. We show that this lower bound can be broken in the quantum world. We consider a new model, influenced by the quantum PAC-learning model introduced by [BJ95], and similar in spirit to the one in [GKKM20]. In this model we give an interactive protocol between the learner and the adversary (at test-time) that guarantees robustness. This protocol, when applied, breaks the lower bound from [GKKM20]. From the technical perspective, our protocol is inspired by recent advances in delegation of quantum computation, e.g., [Mah18]. To be applicable to our task, however, we extend the delegation protocol with a new feature: delegation of decision problems (i.e., BQP) is extended to sampling problems with adversarially chosen inputs.
https://proceedings.mlr.press/v206/gluch23a.html

Algorithm for Constrained Markov Decision Process with Linear Convergence

The problem of constrained Markov decision processes is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed with the integration of two ingredients: entropy-regularized policy optimizer and Vaidya’s dual optimizer, both of which are critical to achieve faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity expressed in terms of the optimality gap and the constraint violation significantly improves upon the existing primal-dual approaches.
https://proceedings.mlr.press/v206/gladin23a.html

Density Ratio Estimation and Neyman Pearson Classification with Missing Data

Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing not at random (MNAR) data. In this setting, we show that using standard DRE methods leads to biased results while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance on both synthetic data and real-world data with simulated missingness.
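For reference, the unmodified KLIEP baseline that M-KLIEP adapts can be sketched in a few lines. This is a simplified illustration only: the Gaussian kernel model, fixed step size, and normalization-by-rescaling are my own illustrative choices, not the authors' exact setup, and no missingness correction is included.

```python
import numpy as np

def kliep(x_nu, x_de, centers, width=1.0, iters=200, lr=0.05):
    """Vanilla KLIEP sketch: model the density ratio r(x) = sum_j a_j K(x, c_j)
    with a >= 0, maximize the mean log-ratio over numerator samples, and keep
    the denominator-side normalization mean r(x_de) = 1 by rescaling."""
    kern = lambda x: np.exp(-(x[:, None] - centers[None, :]) ** 2
                            / (2.0 * width ** 2))
    K_nu, K_de = kern(x_nu), kern(x_de)
    a = np.ones(len(centers))
    a /= (K_de @ a).mean()
    for _ in range(iters):
        r_nu = K_nu @ a
        a += lr * (K_nu / r_nu[:, None]).mean(axis=0)  # grad of mean log r
        a = np.clip(a, 0.0, None)                      # keep coefficients >= 0
        a /= (K_de @ a).mean()                         # restore normalization
    return lambda x: kern(x) @ a

rng = np.random.default_rng(0)
x_nu = rng.normal(2.0, 1.0, 200)        # numerator samples
x_de = rng.normal(0.0, 2.0, 200)        # denominator samples
r = kliep(x_nu, x_de, centers=x_nu[:20])
```

The fitted ratio is large where the numerator density dominates; the paper's contribution is showing how such estimators break, and how to repair them, when samples are missing not at random.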
https://proceedings.mlr.press/v206/givens23a.html

Spectral Augmentations for Graph Contrastive Learning

Contrastive learning has emerged as a premier method for learning representations with or without supervision. Recent studies have shown its utility in graph representation learning for pre-training. Despite successes, the understanding of how to design effective graph augmentations that can capture structural properties common to many different types of downstream graphs remains incomplete. We propose a set of well-motivated graph transformation operations derived via graph spectral analysis to provide a bank of candidates when constructing augmentations for a graph contrastive objective, enabling contrastive learning to capture useful structural representation from pre-training graph datasets. We first present a spectral graph cropping augmentation that involves filtering nodes by applying thresholds to the eigenvalues of the leading Laplacian eigenvectors. Our second novel augmentation reorders the graph frequency components in a structural Laplacian-derived position graph embedding. Further, we introduce a method that leads to improved views of local subgraphs by performing alignment via global random walk embeddings. Our experimental results indicate consistent improvements in out-of-domain graph data transfer compared to state-of-the-art graph contrastive learning methods, shedding light on how to design a graph learner that is able to learn structural properties common to diverse graph types.
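A minimal illustration of the spectral-cropping idea (my own toy sketch, not the paper's exact operator): threshold the Fiedler eigenvector of the graph Laplacian to keep one structurally coherent side of the graph as the cropped view.

```python
import numpy as np

def spectral_crop(adj, keep_sign=1):
    """Keep the nodes on one side of the Fiedler vector (eigenvector of the
    second-smallest Laplacian eigenvalue), a simple spectral graph crop."""
    lap = np.diag(adj.sum(axis=1)) - adj       # unnormalized Laplacian
    _, vecs = np.linalg.eigh(lap)              # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return np.where(keep_sign * fiedler >= 0)[0]

# two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
kept = spectral_crop(A)
```

On this toy graph the Fiedler vector changes sign exactly across the bridge, so the crop recovers one of the two triangles, i.e. a structurally meaningful subgraph rather than a random node subset.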
https://proceedings.mlr.press/v206/ghose23a.html

Langevin Diffusion Variational Inference

Many methods that build powerful variational distributions based on unadjusted Langevin transitions exist. Most of these were developed using a wide range of different approaches and techniques. Unfortunately, the lack of a unified analysis and derivation makes developing new methods and reasoning about existing ones a challenging task. We address this by giving a single analysis that unifies and generalizes these existing techniques. The main idea is to augment the target and variational distributions by numerically simulating the underdamped Langevin diffusion process and its time reversal. The benefits of this approach are twofold: it provides a unified formulation for many existing methods, and it simplifies the development of new ones. In fact, using our formulation we propose a new method that combines the strengths of previously existing algorithms; it uses underdamped Langevin transitions and powerful augmentations parameterized by a score network. Our empirical evaluation shows that our proposed method consistently outperforms relevant baselines in a wide range of tasks.
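The shared building block of these methods is the unadjusted Langevin transition. A tiny sketch of the overdamped version (the paper actually works with the underdamped diffusion; the step size, target, and chain length here are illustrative choices of mine):

```python
import numpy as np

def ula_chain(grad_log_p, x0, eps, n_steps, seed=0):
    """Simulate unadjusted (overdamped) Langevin transitions
    x' = x + eps * grad_log_p(x) + sqrt(2*eps) * xi,  xi ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = x + eps * grad_log_p(x) + np.sqrt(2.0 * eps) * rng.standard_normal()
        xs[t] = x
    return xs

# target p = N(0, 1), so grad log p(x) = -x; the chain should mix toward it
xs = ula_chain(lambda x: -x, x0=3.0, eps=0.01, n_steps=20000)
tail = xs[10000:]   # discard the transient, keep the (approximately) mixed part
```

Because the transition is unadjusted (no Metropolis correction), the chain only approximately targets p, which is exactly why these transitions must be accounted for inside the variational objective rather than treated as exact samplers.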
https://proceedings.mlr.press/v206/geffner23a.html

Knowledge Sheaves: A Sheaf-Theoretic Framework for Knowledge Graph Embedding

Knowledge graph embedding involves learning representations of entities—the vertices of the graph—and relations—the edges of the graph—such that the resulting representations encode the known factual information represented by the knowledge graph and can be used in the inference of new relations. We show that knowledge graph embedding is naturally expressed in the topological and categorical language of cellular sheaves: a knowledge graph embedding can be described as an approximate global section of an appropriate knowledge sheaf over the graph, with consistency constraints induced by the knowledge graph’s schema. This approach provides a generalized framework for reasoning about knowledge graph embedding models and allows for the expression of a wide range of prior constraints on embeddings. Further, the resulting embeddings can be easily adapted for reasoning over composite relations without special training. We implement these ideas to highlight the benefits of the extensions inspired by this new perspective.
https://proceedings.mlr.press/v206/gebhart23a.html

Fair learning with Wasserstein barycenters for non-decomposable performance measures

This work provides several fundamental characterizations of the optimal classification function under the demographic parity constraint. In the awareness framework, akin to the classical unconstrained classification case, we show that maximizing accuracy under this fairness constraint is equivalent to solving a fair regression problem followed by thresholding at level $1/2$. We extend this result to linear-fractional classification measures (e.g., $F$-score, AM measure, balanced accuracy, etc.), highlighting the fundamental role played by regression in this framework. Our results leverage the recently developed connection between the demographic parity constraint and the multi-marginal optimal transport formulation. Informally, our result shows that the transition between the unconstrained problem and the fair one is achieved by replacing the conditional expectation of the label by the solution of the fair regression problem. Finally, leveraging our analysis, we demonstrate an equivalence between the awareness and the unawareness setups for two sensitive groups.
https://proceedings.mlr.press/v206/gaucher23a.html

Faster Projection-Free Augmented Lagrangian Methods via Weak Proximal Oracle

This paper considers a convex composite optimization problem with affine constraints, which includes problems that take the form of minimizing a smooth convex objective function over the intersection of (simple) convex sets, or regularized with multiple (simple) functions. Motivated by high-dimensional applications in which exact projection/proximal computations are not tractable, we propose a projection-free augmented Lagrangian-based method, in which primal updates are carried out using a weak proximal oracle (WPO). In an earlier work, WPO was shown to be more powerful than the standard linear minimization oracle (LMO) that underlies conditional gradient-based methods (aka Frank-Wolfe methods). Moreover, WPO is computationally tractable for many high-dimensional problems of interest, including those motivated by recovery of low-rank matrices and tensors, and optimization over polytopes which admit efficient LMOs. The main result of this paper shows that under a certain curvature assumption (which is weaker than strong convexity), our WPO-based algorithm achieves an ergodic rate of convergence of $O(1/T)$ for both the objective residual and feasibility gap. This result, to the best of our knowledge, improves upon the $O(1/\sqrt{T})$ rate for existing LMO-based projection-free methods for this class of problems. Empirical experiments on a low-rank and sparse covariance matrix estimation task and the Max Cut semidefinite relaxation demonstrate that our method can outperform state-of-the-art LMO-based methods.
https://proceedings.mlr.press/v206/garber23a.html

On the Convergence of Distributed Stochastic Bilevel Optimization Algorithms over a Network

Bilevel optimization has been applied to a wide variety of machine learning models and numerous stochastic bilevel optimization algorithms have been developed in recent years. However, most existing algorithms restrict their focus to the single-machine setting so that they are incapable of handling the distributed data. To address this issue, under the setting where all participants compose a network and perform peer-to-peer communication in this network, we developed two novel decentralized stochastic bilevel optimization algorithms based on the gradient tracking communication mechanism and two different gradient estimators. Additionally, we established their convergence rates for nonconvex-strongly-convex problems with novel theoretical analysis strategies. To our knowledge, this is the first work achieving these theoretical results. Finally, we applied our algorithms to practical machine learning models, and the experimental results confirmed the efficacy of our algorithms.
https://proceedings.mlr.press/v206/gao23a.html

Active Learning for Single Neuron Models with Lipschitz Non-Linearities

We consider the problem of active learning for single neuron models, also sometimes called “ridge functions”, in the agnostic setting (under adversarial label noise). Such models have been shown to be broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for partial differential equations. Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the ReLU, sigmoid, absolute value, low-degree polynomial, among others), strong provable approximation guarantees can be obtained using a well-known active learning strategy for fitting linear functions in the agnostic setting. Namely, we can collect samples via statistical leverage score sampling, which has been shown to be near-optimal in other active learning scenarios. We support our theoretical results with empirical simulations showing that our proposed active learning strategy based on leverage score sampling outperforms (ordinary) uniform sampling when fitting single neuron models.
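The leverage-score sampling primitive the authors build on can be sketched directly. This is a generic illustration of the sampling strategy for the linear case (not the paper's full agnostic single-neuron procedure); the noiseless labels and sample size are illustrative assumptions.

```python
import numpy as np

def leverage_scores(A):
    """Statistical leverage score of each row of A: the squared row norms of
    an orthonormal basis for the column space (scores sum to rank(A))."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > 1e-12 * s.max()]          # keep directions in the column space
    return (U ** 2).sum(axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))
w_true = rng.standard_normal(4)
y = A @ w_true                              # noiseless labels for a sanity check

# sample rows proportionally to leverage, importance-reweight, and refit
ell = leverage_scores(A)
p = ell / ell.sum()
rows = rng.choice(100, size=40, p=p)
sw = 1.0 / np.sqrt(40 * p[rows])            # importance weights
w_hat, *_ = np.linalg.lstsq(A[rows] * sw[:, None], y[rows] * sw, rcond=None)
```

Rows with high leverage are the ones whose labels most constrain the fit, which is why sampling by leverage beats uniform sampling with the same label budget.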
https://proceedings.mlr.press/v206/gajjar23a.html

One Arrow, Two Kills: A Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits

We address the problem of Internal Regret in adversarial Sleeping Bandits and the relationship between different notions of sleeping regrets in multi-armed bandits. We propose a new concept called Internal Regret for sleeping multi-armed bandits (MAB) and present an algorithm that achieves sublinear Internal Regret, even when losses and availabilities are both adversarial. We demonstrate that a low internal regret leads to both low external regret and low policy regret for i.i.d. losses. Our contribution is unifying existing notions of regret in sleeping bandits and exploring their implications for each other. In addition, we extend our results to Dueling Bandits (DB), a preference feedback version of multi-armed bandits, and design a low-regret algorithm for sleeping dueling bandits with stochastic preferences and adversarial availabilities. We validate the effectiveness of our algorithms through empirical evaluations.
https://proceedings.mlr.press/v206/gaillard23a.html

Randomized Greedy Learning for Non-monotone Stochastic Submodular Maximization Under Full-bandit Feedback

We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$-regret upper bound of $\tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.Tue, 11 Apr 2023 00:00:00 +0000
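The classical template behind RGL is the randomized double greedy of Buchbinder et al. for the offline, exact-value setting: a 1/2-approximation in expectation for unconstrained non-monotone submodular maximization. A compact sketch (the graph-cut objective is an illustrative non-monotone submodular function, not the paper's bandit setting):

```python
import random

def randomized_double_greedy(f, ground, seed=0):
    """Randomized double greedy for unconstrained submodular maximization:
    grow X from the empty set and shrink Y from the full set, deciding each
    element randomly in proportion to its (clipped) marginal gains."""
    rng = random.Random(seed)
    X, Y = set(), set(ground)
    for u in ground:
        a = f(X | {u}) - f(X)          # gain of adding u to X
        b = f(Y - {u}) - f(Y)          # gain of removing u from Y
        ap, bp = max(a, 0.0), max(b, 0.0)
        p = 1.0 if ap + bp == 0 else ap / (ap + bp)
        if rng.random() < p:
            X.add(u)
        else:
            Y.discard(u)
    return X                           # X == Y at this point

# cut function of a 4-cycle: f(S) = number of edges crossing the cut
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cut = lambda S: sum((u in S) != (v in S) for u, v in edges)
S = randomized_double_greedy(cut, range(4))
```

RGL's contribution is making this style of greedy work when the function values are only observed as noisy full-bandit rewards; on this tiny cut instance the offline template already finds an optimal cut.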
https://proceedings.mlr.press/v206/fourati23a.html

Influence Diagnostics under Self-concordance

Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention-based models on synthetic and real data.
https://proceedings.mlr.press/v206/fisher23a.html

A Bregman Divergence View on the Difference-of-Convex Algorithm

The difference of convex (DC) algorithm is a conceptually simple method for the minimization of (non)convex functions that are expressed as the difference of two convex functions. An attractive feature of the algorithm is that it maintains a global overestimator on the function and does not require a choice of step size at each iteration. By adopting a Bregman divergence point of view, we simplify and strengthen many existing non-asymptotic convergence guarantees for the DC algorithm. We further present several sufficient conditions that ensure a linear convergence rate, namely a new DC Polyak-Lojasiewicz condition, as well as a relative strong convexity assumption. Importantly, our conditions do not require smoothness of the objective function. We illustrate our results on a family of minimization problems involving the quantum relative entropy, with applications in quantum information theory.
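The DC iteration itself is one line: linearize the concave part h at the current point and minimize the resulting convex overestimator, i.e. solve grad g(x) = grad h(x_k). A sketch on the toy objective f(x) = x^4/4 - x^2, so g(x) = x^4/4 and h(x) = x^2 (the example is mine, not from the paper):

```python
import numpy as np

def dca(grad_g_inv, grad_h, x0, iters=50):
    """Difference-of-convex algorithm for f = g - h: at each step,
    linearize h at x_k and minimize the convex overestimator
    g(x) - <grad h(x_k), x>, i.e. solve grad g(x) = grad h(x_k).
    Note there is no step size to choose."""
    x = x0
    for _ in range(iters):
        x = grad_g_inv(grad_h(x))
    return x

# g(x) = x**4/4 so grad g(x) = x**3 and its inverse is the cube root;
# h(x) = x**2 so grad h(x) = 2x.  f has minimizers at x = +/- sqrt(2).
x_star = dca(grad_g_inv=np.cbrt, grad_h=lambda x: 2.0 * x, x0=1.0)
```

Each subproblem is convex and the objective decreases monotonically; here the fixed-point map x -> (2x)^{1/3} contracts to the minimizer sqrt(2).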
https://proceedings.mlr.press/v206/faust23a.html

Learning Sparse Graphon Mean Field Games

Although the field of multi-agent reinforcement learning (MARL) has made considerable progress in recent years, solving systems with a large number of agents remains a hard challenge. Graphon mean field games (GMFGs) enable the scalable analysis of MARL problems that are otherwise intractable. By the mathematical structure of graphons, this approach is limited to dense graphs which are insufficient to describe many real-world networks such as power law graphs. Our paper introduces a novel formulation of GMFGs, called LPGMFGs, which leverages the graph theoretical concept of $L^p$ graphons and provides a machine learning tool to efficiently and accurately approximate solutions for sparse network problems. This especially includes power law networks which are empirically observed in various application areas and cannot be captured by standard graphons. We derive theoretical existence and convergence guarantees and give empirical examples that demonstrate the accuracy of our learning approach for systems with many agents. Furthermore, we extend the Online Mirror Descent (OMD) learning algorithm to our setup to accelerate learning speed, empirically show its capabilities, and conduct a theoretical analysis using the novel concept of smoothed step graphons. In general, we provide a scalable, mathematically well-founded machine learning approach to a large class of otherwise intractable problems of great relevance in numerous research fields.
https://proceedings.mlr.press/v206/fabian23a.html

The Role of Codeword-to-Class Assignments in Error-Correcting Codes: An Empirical Study

Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, corresponding to codewords in a codebook. Codebooks are commonly either predefined or problem dependent. Given predefined codebooks, codeword-to-class assignments are traditionally overlooked, and codewords are implicitly assigned to classes arbitrarily. Our paper shows that these assignments play a major role in the performance of ECC. Specifically, we examine similarity-preserving assignments, where similar codewords are assigned to similar classes. Addressing a controversy in existing literature, our extensive experiments confirm that similarity-preserving assignments induce easier subproblems and are superior to other assignment policies in terms of their generalization performance. We find that similarity-preserving assignments make predefined codebooks become problem-dependent, without altering other favorable codebook properties. Finally, we show that our findings can improve predefined codebooks dedicated to extreme classification.
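Decoding in the ECC reduction is nearest-codeword in Hamming distance over the binary classifiers' outputs. A minimal sketch with a hypothetical 4-class, 5-bit codebook (the codebook and the flipped-bit prediction are invented for illustration):

```python
import numpy as np

def ecc_decode(codebook, bits):
    """Assign the class whose codeword (row of the codebook) is nearest
    in Hamming distance to the vector of binary classifier outputs."""
    return int(np.argmin((codebook != bits).sum(axis=1)))

# 4 classes x 5 binary subproblems; each column defines one binary task.
# Minimum pairwise Hamming distance is 3, so one wrong bit is corrected.
codebook = np.array([[0, 0, 0, 0, 0],
                     [1, 1, 1, 0, 0],
                     [0, 0, 1, 1, 1],
                     [1, 1, 0, 1, 1]])

pred = np.array([1, 1, 1, 0, 1])   # classifier outputs with one bit flipped
cls = ecc_decode(codebook, pred)   # still decodes to the intended class
```

The paper's point sits one level above this mechanism: with the codebook fixed, which class gets which row materially changes how hard the induced column subproblems are.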
https://proceedings.mlr.press/v206/evron23a.html
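As a toy illustration of a similarity-preserving codeword-to-class assignment, one can brute-force the permutation whose pairwise Hamming distances best track a given matrix of class dissimilarities. This is only a sketch for intuition, not the paper's procedure; the codebook, distance matrix, and helper names are invented for this example:

```python
from itertools import permutations

def hamming(a, b):
    # Hamming distance between two binary codewords.
    return sum(x != y for x, y in zip(a, b))

def similarity_preserving_assignment(codebook, class_dist):
    # Pick the codeword-to-class permutation whose pairwise Hamming
    # distances best match the given class dissimilarities (least squares).
    # perm[i] is the index of the codeword assigned to class i.
    k = len(codebook)
    def cost(perm):
        return sum(
            (hamming(codebook[perm[i]], codebook[perm[j]]) - class_dist[i][j]) ** 2
            for i in range(k) for j in range(i + 1, k)
        )
    return min(permutations(range(k)), key=cost)

# Classes 0 and 1 are similar (distance 1); classes 0 and 2 are not (distance 2).
codebook = [(0, 0), (0, 1), (1, 1)]
class_dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

Brute force is exponential in the number of classes; it only serves to make the objective concrete.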

Improved Representation Learning Through Tensorized Autoencoders
https://proceedings.mlr.press/v206/esser23a.html
The central question in representation learning is what constitutes a good or meaningful representation. In this work we argue that if we consider data with inherent cluster structures, where clusters can be characterized through different means and covariances, those data structures should be represented in the embedding as well. While Autoencoders (AE) are widely used in practice for unsupervised representation learning, they do not fulfil the above condition on the embedding as they obtain a single representation of the data. To overcome this we propose a meta-algorithm that can be used to extend an arbitrary AE architecture to a tensorized version (TAE) that allows for learning cluster-specific embeddings while simultaneously learning the cluster assignment. For the linear setting we prove that TAE can recover the principal components of the different clusters, in contrast to the principal components of the entire data recovered by a standard AE. We validate this on planted models and for general, non-linear and convolutional AEs we empirically illustrate that tensorizing the AE is beneficial in clustering and de-noising tasks.

Interactive Learning with Pricing for Optimal and Stable Allocations in Markets
https://proceedings.mlr.press/v206/erginbas23a.html
Large-scale online recommendation systems must facilitate the allocation of a limited number of items among competing users while learning their preferences from user feedback. As a principled way of incorporating market constraints and user incentives in the design, we consider our objectives to be two-fold: maximal social welfare with minimal instability. To maximize social welfare, our proposed framework enhances the quality of recommendations by exploring allocations that optimistically maximize the rewards. To minimize instability, a measure of users’ incentives to deviate from recommended allocations, the algorithm prices the items based on a scheme derived from the Walrasian equilibria. Though it is known that these equilibria yield stable prices for markets with known user preferences, our approach accounts for the inherent uncertainty in the preferences and further ensures that the users accept most of their recommendations under offered prices. To the best of our knowledge, our approach is the first to integrate techniques from combinatorial bandits, optimal resource allocation, and collaborative filtering to obtain an algorithm that achieves sub-linear social welfare regret as well as sub-linear instability. Empirical studies on synthetic and real-world data also demonstrate the efficacy of our strategy compared to approaches that do not fully incorporate all these aspects.

Balanced Off-Policy Evaluation for Personalized Pricing
https://proceedings.mlr.press/v206/elmachtoub23a.html
We consider a personalized pricing problem in which we have data consisting of feature information, historical pricing decisions, and binary realized demand. The goal is to perform off-policy evaluation for a new personalized pricing policy that maps features to prices. Methods based on inverse propensity weighting (including doubly robust methods) for off-policy evaluation may perform poorly when the logging policy has little exploration or is deterministic, which is common in pricing applications. Building on the balanced policy evaluation framework of Kallus (2018), we propose a new approach tailored to pricing applications. The key idea is to compute an estimate that minimizes the worst-case mean squared error or maximizes a worst-case lower bound on policy performance, where in both cases the worst-case is taken with respect to a set of possible revenue functions. We establish theoretical convergence guarantees and empirically demonstrate the advantage of our approach using a real-world pricing dataset.

A Statistical Learning Take on the Concordance Index for Survival Analysis
https://proceedings.mlr.press/v206/elgui23a.html
The introduction of machine learning (ML) techniques to the field of survival analysis has increased the flexibility of modeling approaches, and ML based models have become state-of-the-art. These models optimize their own cost functions, and their performance is often evaluated using the concordance index (C-index). From a statistical learning perspective, it is therefore an important problem to analyze the relationship between the optimizers of the C-index and those of the ML cost functions. We address this issue by providing C-index Fisher-consistency results and excess risk bounds for several of the commonly used cost functions in survival analysis. We identify conditions under which they are consistent, in the form of three nested families of survival models. We also study the general case where no model assumption is made and present a new, off-the-shelf method that is shown to be consistent with the C-index, although computationally expensive at inference. Finally, we perform limited numerical experiments with simulated data to illustrate our theoretical findings.

On the Strategyproofness of the Geometric Median
https://proceedings.mlr.press/v206/el-mhamdi23a.html
The geometric median, an instrumental component of the secure machine learning toolbox, is known to be effective when robustly aggregating models (or gradients), gathered from potentially malicious (or strategic) users. What is less known is the extent to which the geometric median incentivizes dishonest behaviors. This paper addresses this fundamental question by quantifying its strategyproofness. While we observe that the geometric median is not even approximately strategyproof, we prove that it is asymptotically $\alpha$-strategyproof: when the number of users is large enough, a user that misbehaves can gain at most a multiplicative factor $\alpha$, which we compute as a function of the distribution followed by the users. We then generalize our results to the case where users actually care more about specific dimensions, determining how this impacts $\alpha$. We also show how the skewed geometric medians can be used to improve strategyproofness.
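For readers unfamiliar with the aggregator being analyzed: the geometric median minimizes the sum of Euclidean distances to the input points, and the standard way to compute it is Weiszfeld's iteratively re-weighted averaging scheme. A minimal sketch (function and parameter names are ours, not the paper's):

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    # Weiszfeld's algorithm: repeatedly take a distance-weighted average,
    # converging to the point minimizing the sum of Euclidean distances.
    dim = len(points[0])
    m = [sum(p[d] for p in points) / len(points) for d in range(dim)]  # start at the mean
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            dist = math.dist(p, m)
            if dist < eps:          # estimate coincides with a data point
                continue
            w = 1.0 / dist
            for d in range(dim):
                num[d] += w * p[d]
            den += w
        if den == 0.0:
            break
        m = [x / den for x in num]
    return m
```

Unlike the coordinate-wise median, this aggregate is rotation-equivariant, which is part of why it is popular for robust gradient aggregation.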

Learning from Multiple Sources for Data-to-Text and Text-to-Data
https://proceedings.mlr.press/v206/duong23a.html
Data-to-text (D2T) and text-to-data (T2D) are dual tasks that convert structured data, such as graphs or tables, into fluent text, and vice versa. These tasks are usually handled separately and use corpora extracted from a single source. Current systems leverage pre-trained language models fine-tuned on D2T or T2D tasks. This approach has two main limitations: first, a separate system has to be tuned for each task and source; second, learning is limited by the scarcity of available corpora. This paper considers a more general scenario where data are available from multiple heterogeneous sources. Each source, with its specific data format and semantic domain, provides a non-parallel corpus of text and structured data. We introduce a variational auto-encoder model with disentangled style and content variables that allows us to represent the diversity that stems from multiple sources of text and data. Our model is designed to handle the tasks of D2T and T2D jointly. We evaluate our model on several datasets, and show that by learning from multiple sources, our model closes the performance gap with its supervised single-source counterpart and outperforms it in some cases.

Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout
https://proceedings.mlr.press/v206/dun23a.html
Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup, where slower clients can severely impede the learning process. Herein, we propose AsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, AsyncDrop achieves better performance compared to state-of-the-art asynchronous methodologies, while incurring lower communication and training-time overheads. The key idea revolves around creating “submodels” out of the global model, and distributing their training to workers, based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, both by design and by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, AsyncDrop reduces the communication cost and training time, while matching or improving the final test accuracy in diverse non-i.i.d. FL scenarios.

Online Algorithms with Costly Predictions
https://proceedings.mlr.press/v206/drygala23a.html
In recent years there has been a significant research effort on incorporating predictions into online algorithms. However, work in this area often makes the underlying assumption that predictions come for free (e.g., without any computational or monetary costs). In this paper, we consider a cost associated with making predictions. We show that interesting algorithmic subtleties arise for even the most basic online problems, such as ski rental and its generalization, the Bahncard problem. In particular, we show that with costly predictions, care needs to be taken in (i) asking for the prediction at the right time, (ii) deciding if it is worth asking for the prediction, and (iii) how many predictions we ask for, in settings where it is natural to consider making multiple predictions. Specifically, (i) in the basic ski-rental setting, we compute the optimal delay before asking the predictor, (ii) in the same setting, given a priori information about the true number of ski-days through its mean and variance, we provide a simple algorithm that is near-optimal, under some natural parameter settings, in deciding if it is worth asking for the predictor and (iii) in the setting of the Bahncard problem, we provide a $(1+\varepsilon)$-approximation algorithm and quantify lower bounds on the number of queries required to do so. In addition, we show that solving the problem optimally would require almost complete information of the instance.Tue, 11 Apr 2023 00:00:00 +0000
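For context, the prediction-free baseline this line of work improves on is the classical break-even rule for ski rental: rent until the total rent paid would match the purchase price, then buy, which is 2-competitive. A minimal sketch under assumed unit rent (our own function names; this is the textbook baseline, not the paper's prediction-aware algorithm):

```python
def break_even_cost(true_days, buy_price, rent_price=1):
    # Classical 2-competitive rule without predictions: rent while the
    # cumulative rent stays below the purchase price, then buy.
    buy_day = buy_price // rent_price  # 1-indexed day on which we would buy
    if true_days < buy_day:
        return true_days * rent_price          # season ends before we buy
    return (buy_day - 1) * rent_price + buy_price

def offline_opt(true_days, buy_price, rent_price=1):
    # Clairvoyant optimum: rent every day or buy on day one.
    return min(true_days * rent_price, buy_price)
```

With a costly predictor, the question becomes when (and whether) paying for a forecast of `true_days` beats this rule, which is exactly the trade-off the paper formalizes.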

Frequentist Uncertainty Quantification in Semi-Structured Neural Networks
https://proceedings.mlr.press/v206/dorigatti23a.html
Semi-structured regression (SSR) models jointly learn the effect of structured (tabular) and unstructured (non-tabular) data through additive predictors and deep neural networks (DNNs), respectively. Inference in SSR models aims at deriving confidence intervals for the structured predictor, although current approaches ignore the variance of the DNN estimation of the unstructured effects. This results in an underestimation of the variance of the structured coefficients and, thus, an increase of Type-I error rates. To address this shortcoming, we present here a theoretical framework for structured inference in SSR models that incorporates the variance of the DNN estimate into confidence intervals for the structured predictor. By treating this estimate as a random offset with known variance, our formulation is agnostic to the specific deep uncertainty quantification method employed. Through numerical experiments and a practical application on a medical dataset, we show that our approach results in increased coverage of the true structured coefficients and thus a reduction in Type-I error rate compared to ignoring the variance of the neural network, naive ensembling of SSR models, and a variational inference baseline.

Approximate Regions of Attraction in Learning with Decision-Dependent Distributions
https://proceedings.mlr.press/v206/dong23b.html
As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g. approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to characterize the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by situations where the initial conditions may have significant impact on the long-term behavior of the system. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this setting. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization to performative risk minimizers.

A New Modeling Framework for Continuous, Sequential Domains
https://proceedings.mlr.press/v206/dong23a.html
Temporal models such as Dynamic Bayesian Networks (DBNs) and Hidden Markov Models (HMMs) have been widely used to model time-dependent sequential data. Typically, these approaches limit focus to discrete domains, employ first-order Markov and stationary assumptions, and limit representational power so that efficient (approximate) inference procedures can be applied. We propose a novel temporal model for continuous domains, where the transition distribution is conditionally tractable: it is modelled as a tractable continuous density over the variables at the current time slice only, while the parameters are controlled using a Recurrent Neural Network (RNN) that takes all previous observations as input. We show that, in this model, various inference tasks can be efficiently implemented using forward filtering with simple gradient ascent. Our experimental results on two different tasks over several real-world sequential datasets demonstrate the superior performance of our model against existing competitors.

Compress Then Test: Powerful Kernel Testing in Near-linear Time
https://proceedings.mlr.press/v206/domingo-enrich23a.html
Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$-point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test—recovering the same optimal detection boundary—while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20–200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
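The quadratic-time baseline that CTT accelerates is the unbiased MMD² estimate; a minimal scalar sketch for intuition (our own function names and a Gaussian kernel chosen for illustration; the actual test also needs a permutation-based threshold):

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    # RBF kernel on scalars.
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, k=gaussian_kernel):
    # Unbiased quadratic-time estimate of MMD^2 -- the O(n^2) computation
    # that CTT approximates by first compressing each sample into a coreset.
    n, m = len(xs), len(ys)
    xx = sum(k(a, b) for i, a in enumerate(xs)
             for j, b in enumerate(xs) if i != j) / (n * (n - 1))
    yy = sum(k(a, b) for i, a in enumerate(ys)
             for j, b in enumerate(ys) if i != j) / (m * (m - 1))
    xy = sum(k(a, b) for a in xs for b in ys) / (n * m)
    return xx + yy - 2 * xy

# Same distribution -> estimate near 0; shifted distribution -> clearly positive.
random.seed(0)
same = mmd2_unbiased([random.gauss(0, 1) for _ in range(40)],
                     [random.gauss(0, 1) for _ in range(40)])
shifted = mmd2_unbiased([random.gauss(0, 1) for _ in range(40)],
                        [random.gauss(3, 1) for _ in range(40)])
```

Every pair of points is compared, which is exactly the $n^2$ cost the abstract refers to.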

Origins of Low-Dimensional Adversarial Perturbations
https://proceedings.mlr.press/v206/dohmatob23a.html
Machine learning models are known to be susceptible to adversarial perturbations. Even more concerning is the fact that these adversarial perturbations can be found by black-box search using surprisingly few queries, which essentially restricts the perturbation to a subspace of dimension $k$, much smaller than the dimension $d$ of the image space. This intriguing phenomenon raises the question: Is the vulnerability to black-box attacks inherent or can we hope to prevent them? In this paper, we initiate a rigorous study of the phenomenon of low-dimensional adversarial perturbations (LDAPs). Our result characterizes precisely the sufficient conditions for the existence of LDAPs, and we show that these conditions hold for neural networks under practical settings, including the so-called lazy regime wherein the parameters of the trained network remain close to their values at initialization. Our theoretical results are confirmed by experiments on both synthetic and real data.

Graph Spectral Embedding using the Geodesic Betweenness Centrality
https://proceedings.mlr.press/v206/deutsch23a.html
We introduce the Graph Sylvester Embedding (GSE), an unsupervised graph representation of local similarity, connectivity, and global structure. GSE uses the solution of the Sylvester equation to capture both network structure and neighborhood proximity in a single representation. Unlike embeddings based on the eigenvectors of the Laplacian, GSE incorporates two or more basis functions, for instance using the Laplacian and the affinity matrix. Such basis functions are constructed not from the original graph, but from one whose weights measure the centrality of an edge (the fraction of the number of shortest paths that pass through that edge) in the original graph. This allows more flexibility and control to represent complex network structure and shows significant improvements over the state of the art when used for data analysis tasks such as predicting failed edges in material science and network alignment in the human-SARS CoV-2 protein-protein interactome.

Bayesian Optimization over High-Dimensional Combinatorial Spaces via Dictionary-based Embeddings
https://proceedings.mlr.press/v206/deshwal23a.html
We consider the problem of optimizing expensive black-box functions over high-dimensional combinatorial spaces which arises in many science, engineering, and ML applications. We use Bayesian Optimization (BO) and propose a novel surrogate modeling approach for efficiently handling a large number of binary and categorical parameters. The key idea is to select a number of discrete structures from the input space (the dictionary) and use them to define an ordinal embedding for high-dimensional combinatorial structures. This allows us to use existing Gaussian process models for continuous spaces. We develop a principled approach based on binary wavelets to construct dictionaries for binary spaces, and propose a randomized construction method that generalizes to categorical spaces. We provide theoretical justification to support the effectiveness of the dictionary-based embeddings. Our experiments on diverse real-world benchmarks demonstrate the effectiveness of our proposed surrogate modeling approach over state-of-the-art BO methods.

Reinforcement Learning with Stepwise Fairness Constraints
https://proceedings.mlr.press/v206/deng23a.html
AI methods are used in societally important settings, ranging from credit to employment to housing, and it is crucial to ensure fairness in automated decision making. Moreover, many settings are dynamic, with populations responding to sequential decision policies. We introduce the study of reinforcement learning (RL) with stepwise fairness constraints, which require group fairness at each time step. In the case of tabular episodic RL, we provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violations. Our framework provides tools to study the impact of fairness constraints in sequential settings and brings up new challenges in RL.

MMD-B-Fair: Learning Fair Representations with Statistical Testing
https://proceedings.mlr.press/v206/deka23a.html
We introduce a method, MMD-B-Fair, to learn fair representations of data via kernel two-sample testing. We find neural features of our data where a maximum mean discrepancy (MMD) test cannot distinguish between different values of sensitive attributes, while preserving information about the target. Minimizing the power of an MMD test is more difficult than maximizing it (as done in previous work), because the test threshold’s complex behavior cannot be simply ignored. Our method exploits the simple asymptotics of block testing schemes to efficiently find fair representations without requiring the complex adversarial optimization or generative modelling schemes widely used by existing work on fair representation learning. We evaluate our approach on various datasets, showing its ability to hide information about sensitive attributes, and its effectiveness in downstream transfer tasks.

Transport Reversible Jump Proposals
https://proceedings.mlr.press/v206/davies23a.html
Reversible jump Markov chain Monte Carlo (RJMCMC) proposals that achieve reasonable acceptance rates and mixing are notoriously difficult to design in most applications. Inspired by recent advances in deep neural network-based normalizing flows and density estimation, we demonstrate an approach to enhance the efficiency of RJMCMC sampling by performing transdimensional jumps involving reference distributions. In contrast to other RJMCMC proposals, the proposed method is the first to apply a non-linear transport-based approach to construct efficient proposals between models with complicated dependency structures. It is shown that, in the setting where exact transports are used, our RJMCMC proposals have the desirable property that the acceptance probability depends only on the model probabilities. Numerical experiments demonstrate the efficacy of the approach.

Multiple-policy High-confidence Policy Evaluation
https://proceedings.mlr.press/v206/dann23a.html
In reinforcement learning applications, we often want to accurately estimate the return of several policies of interest. We study this problem, multiple-policy high-confidence policy evaluation, where the goal is to estimate the return of all given target policies up to a desired accuracy with as few samples as possible. The natural approaches to this problem, i.e., evaluating each policy separately or estimating a model of the MDP, do not take into account the similarities between target policies and scale with the number of policies to evaluate or the size of the MDP, respectively. We present an alternative approach based on reusing samples from on-policy Monte-Carlo estimators and show that it is more sample-efficient in favorable cases. Specifically, we provide guarantees in terms of a notion of overlap of the set of target policies and shed light on when such an approach is indeed beneficial compared to existing methods.

The ELBO of Variational Autoencoders Converges to a Sum of Entropies
https://proceedings.mlr.press/v206/damm23a.html
The central objective function of a variational autoencoder (VAE) is its variational lower bound (the ELBO). Here we show that for standard (i.e., Gaussian) VAEs the ELBO converges to a value given by the sum of three entropies: the (negative) entropy of the prior distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions (the latter is already part of the ELBO). Our derived analytical results are exact and apply to small as well as intricate deep networks for encoder and decoder. Furthermore, they apply to finitely and infinitely many data points and at any stationary point (including local maxima and saddle points). The result implies that the ELBO can for standard VAEs often be computed in closed-form at stationary points while the original ELBO requires numerical approximations of integrals. As a main contribution, we provide the proof that the ELBO for VAEs is at stationary points equal to entropy sums. Numerical experiments then show that the obtained analytical results are sufficiently precise also in those vicinities of stationary points that are reached in practice. Furthermore, we discuss how the novel entropy form of the ELBO can be used to analyze and understand learning behavior. More generally, we believe that our contributions can be useful for future theoretical and practical studies on VAE learning as they provide novel information on those points in parameter space that optimization of VAEs converges to.
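The entropy-sum form is concrete for isotropic Gaussians, whose differential entropy has a closed form. The sketch below is a hedged numerical illustration with invented helper names; signs follow the abstract (negative prior entropy plus expected negative observable entropy plus average variational entropy):

```python
import math

def gaussian_entropy(dim, var):
    # Differential entropy of an isotropic d-dimensional Gaussian N(mu, var*I):
    # (d/2) * log(2*pi*e*var).
    return 0.5 * dim * math.log(2 * math.pi * math.e * var)

def elbo_entropy_sum(latent_dim, obs_dim, decoder_var, encoder_vars):
    # Illustrative entropy-sum value for a unit-variance Gaussian prior,
    # an isotropic Gaussian decoder, and per-datapoint isotropic encoders.
    prior = -gaussian_entropy(latent_dim, 1.0)
    obs = -gaussian_entropy(obs_dim, decoder_var)
    enc = sum(gaussian_entropy(latent_dim, v) for v in encoder_vars) / len(encoder_vars)
    return prior + obs + enc
```

Each term is a scalar computable from the model's variance parameters alone, which is why the ELBO admits a closed form at stationary points in this setting.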

Learning to Optimize with Stochastic Dominance Constraints
https://proceedings.mlr.press/v206/dai23b.html
In real-world decision-making, uncertainty is important yet difficult to handle. Stochastic dominance provides a theoretically sound approach to comparing uncertain quantities, but optimization with stochastic dominance constraints is often computationally expensive, which limits practical applicability. In this paper, we develop a simple yet efficient approach for the problem, Light Stochastic Dominance Solver (light-SD), by leveraging properties of the Lagrangian. We recast the inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses the intractability and leads to tractable updates or even closed-form solutions for gradient calculations. We prove convergence of the algorithm and test it empirically. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.

A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization
https://proceedings.mlr.press/v206/dai23a.html
This paper introduces a new proximal stochastic gradient method with variance reduction and stabilization for minimizing the sum of a convex stochastic function and a group sparsity-inducing regularization function. Since the method may be viewed as a stabilized version of the recently proposed algorithm PStorm, we call our algorithm S-PStorm. Our analysis shows that S-PStorm has strong convergence results. In particular, we prove an upper bound on the number of iterations required by S-PStorm before its iterates correctly identify (with high probability) an optimal support (i.e., the zero and nonzero structure of an optimal solution). Most algorithms in the literature with such a support identification property use variance reduction techniques that require either periodically evaluating an exact gradient or storing a history of stochastic gradients. Unlike these methods, S-PStorm achieves variance reduction without requiring either of these, which is advantageous. Moreover, our support-identification result for S-PStorm shows that, with high probability, an optimal support will be identified correctly in all iterations with index above a threshold. We believe that this type of result is new to the literature since the few existing other results prove that the optimal support is identified with high probability at each iteration with a sufficiently large index (meaning that the optimal support might be identified in some iterations, but not in others). Numerical experiments on regularized logistic loss problems show that S-PStorm outperforms existing methods in various metrics that measure how efficiently and robustly iterates of an algorithm identify an optimal support.
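The building block of any proximal method for group sparsity-inducing regularizers is the proximal operator of the group norm, which is block soft-thresholding: it zeroes out whole groups, producing the zero/nonzero support structure the abstract refers to. A minimal sketch for a single group (our own function names; not the S-PStorm update itself):

```python
import math

def group_soft_threshold(v, lam):
    # Proximal operator of lam * ||v||_2 for one coefficient group:
    # shrink the group's norm by lam, or zero the whole group if its
    # norm is at most lam.
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]
```

Applying this operator after each (variance-reduced) stochastic gradient step is what drives iterates onto a sparse support.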

Fast Variational Estimation of Mutual Information for Implicit and Explicit Likelihood Models
https://proceedings.mlr.press/v206/dahlke23a.html
Computing mutual information (MI) of random variables lacks a closed-form in nontrivial models. Variational MI approximations are widely used as flexible estimators for this purpose, but computing them typically requires solving a costly nonconvex optimization. We prove that a widely used class of variational MI estimators can be solved via moment matching operations in place of the numerical optimization methods that are typically required. We show that the same moment matching solution yields variational estimates for so-called “implicit” models that lack a closed form likelihood function. Furthermore, we demonstrate that this moment matching solution offers a computational speed-up of multiple orders of magnitude compared to the standard optimization based solutions. We show that our theoretical results are supported by numerical evaluation in fully parameterized Gaussian mixture models and a generalized linear model with implicit likelihood due to nuisance variables. We also demonstrate our approach on an SIR epidemiology model with implicit simulation-based likelihood, where we avoid costly likelihood-free inference and observe a speedup of many orders of magnitude.

Understanding the Impact of Competing Events on Heterogeneous Treatment Effect Estimation from Time-to-Event Data
https://proceedings.mlr.press/v206/curth23a.html
We study the problem of inferring heterogeneous treatment effects (HTEs) from time-to-event data in the presence of competing events. Albeit its great practical relevance, this problem has received little attention compared to its counterparts studying HTE estimation without time-to-event data or competing events. We take an outcome modeling approach to estimating HTEs, and consider how and when existing prediction models for time-to-event data can be used as plug-in estimators for potential outcomes. We then investigate whether competing events present new challenges for HTE estimation, in addition to the standard confounding problem, and find that, because there are multiple definitions of causal effects in this setting (namely total, direct, and separable effects), competing events can act as an additional source of covariate shift depending on the desired treatment effect interpretation and associated estimand. We theoretically analyze and empirically illustrate when and how these challenges play a role when using generic machine learning prediction models for the estimation of HTEs.

Actually Sparse Variational Gaussian Processes
https://proceedings.mlr.press/v206/cunningham23a.html
Gaussian processes (GPs) are typically criticised for their unfavourable scaling in both computational and memory requirements. For large datasets, sparse GPs reduce these demands by conditioning on a small set of inducing variables designed to summarise the data. In practice however, for large datasets requiring many inducing variables, such as low-lengthscale spatial data, even sparse GPs can become computationally expensive, limited by the number of inducing variables one can use. In this work, we propose a new class of inter-domain variational GP, constructed by projecting a GP onto a set of compactly supported B-spline basis functions. The key benefit of our approach is that the compact support of the B-spline basis functions admits the use of sparse linear algebra to significantly speed up matrix operations and drastically reduce the memory footprint. This allows us to very efficiently model fast-varying spatial phenomena with tens of thousands of inducing variables, where previous approaches failed.
https://proceedings.mlr.press/v206/cunningham23a.htmlScalable Bicriteria Algorithms for Non-Monotone Submodular CoverIn this paper, we consider the optimization problem Submodular Cover (SC), which is to find a minimum cost subset of a ground set $U$ such that the value of a submodular function $f$ is above a threshold $\tau$. In contrast to most existing work on SC, it is not assumed that $f$ is monotone. Two bicriteria approximation algorithms are presented for SC that, for input parameter $0 < \epsilon < 1$, give an $O(1/\epsilon^2)$ ratio to the optimal cost and ensure that the function $f$ is at least $\tau(1 - \epsilon)/2$. A lower bound shows that, under the value query model, no polynomial-time algorithm can ensure that $f$ is larger than $\tau/2$. Further, the algorithms presented are scalable to large data sets, processing the ground set in a stream. Similar algorithms developed for SC also work for the related optimization problem of Submodular Maximization (KCSM). Finally, the algorithms are demonstrated to be effective in experiments involving graph cut and data summarization functions.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/crawford23a.html
https://proceedings.mlr.press/v206/crawford23a.htmlRevisiting Fair-PAC Learning and the Axioms of Cardinal WelfareCardinal objectives serve as intuitive targets in fair machine learning by summarizing utility (welfare) or disutility (malfare) $u$ over $g$ groups. Under standard axioms, all welfare and malfare functions are $w$-weighted $p$-power-means, i.e. $M_p(u;w) = \sqrt[p]{\sum_{i=1}^g w_i u_i^p}$, with $p \leq 1$ for welfare, or $p \geq 1$ for malfare. We show the same under weaker axioms, and also identify stronger axioms that naturally restrict $p$. It is known that power-mean malfare functions are Lipschitz continuous, and thus statistically easy to estimate or learn. We show that all power means are locally Hölder continuous, i.e., $|M(u; w) - M(u'; w)| \leq \lambda \|u - u'\|^\alpha$ for some $\lambda$, $\alpha$, and norm $\|\cdot\|$. In particular, $\lambda$ and $1/\alpha$ are bounded except as $p \rightarrow 0$ or $\min_i w_i \rightarrow 0$, and via this analysis we bound the sample complexity of optimizing welfare. This yields a novel concept of fair-PAC learning, wherein welfare functions are only polynomially harder to optimize than malfare functions, except when $p \approx 0$ or $\min_i w_i \approx 0$, which is exponentially harder.Tue, 11 Apr 2023 00:00:00 +0000
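The $w$-weighted $p$-power-mean in the abstract above is concrete enough to sketch. A minimal Python illustration (the $p = 0$ geometric-mean limit and all names are our additions, not the paper's):

```python
import math

def power_mean(u, w, p):
    """w-weighted p-power mean: M_p(u; w) = (sum_i w_i * u_i**p)**(1/p).

    Per the abstract, p <= 1 gives a welfare function and p >= 1 a malfare
    function; p = 0 is taken here as the limiting weighted geometric mean.
    """
    assert abs(sum(w) - 1.0) < 1e-9, "weights should sum to 1"
    if p == 0:  # limit p -> 0: weighted geometric mean
        return math.exp(sum(wi * math.log(ui) for wi, ui in zip(w, u)))
    return sum(wi * ui ** p for wi, ui in zip(w, u)) ** (1.0 / p)

u, w = [1.0, 4.0], [0.5, 0.5]
print(power_mean(u, w, 1))    # 2.5 (plain weighted mean)
print(power_mean(u, w, -20))  # close to min(u): very negative p is egalitarian
```

Very negative $p$ approaches the minimum utility (egalitarian welfare), which is one way to see why the $p \to 0$ and $\min_i w_i \to 0$ regimes are the statistically delicate ones.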
https://proceedings.mlr.press/v206/cousins23a.html
https://proceedings.mlr.press/v206/cousins23a.htmlNeural Simulated AnnealingSimulated annealing (SA) is a stochastic global optimisation metaheuristic applicable to a wide range of discrete and continuous variable problems. Despite its simplicity, SA hinges on carefully handpicked components, viz. proposal distribution and annealing schedule, that often have to be fine-tuned to individual problem instances. In this work, we seek to make SA more effective and easier to use by framing its proposal distribution as a reinforcement learning policy that can be optimised for higher solution quality given a computational budget. The result is Neural SA, a competitive and general machine learning method for combinatorial optimisation that is efficient and easy to design and train. We show Neural SA with such a learnt proposal distribution, parametrised by small equivariant neural networks, outperforms SA baselines on several problems: Rosenbrock’s function and the Knapsack, Bin Packing and Travelling Salesperson problems. We also show Neural SA scales well to large problems (generalising to much larger instances than those seen during training) while getting comparable performance to popular off-the-shelf solvers and machine learning methods in terms of solution quality and wall-clock time.Tue, 11 Apr 2023 00:00:00 +0000
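For reference, the hand-picked components the abstract above mentions (a proposal distribution and an annealing schedule) appear explicitly in the classical SA loop. A minimal sketch on Rosenbrock's function, with a fixed Gaussian proposal and geometric cooling; this is only the vanilla baseline, not the paper's learned method, and all parameter values are illustrative:

```python
import math
import random

def rosenbrock(x):
    # 2-D Rosenbrock function; global minimum 0 at (1, 1)
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def simulated_annealing(f, x0, n_steps=20000, t0=1.0, alpha=0.9995, step=0.1):
    """Vanilla SA: Gaussian proposal, geometric cooling schedule.

    Neural SA (per the abstract) would replace the fixed Gaussian proposal
    with a small learned policy; this sketch keeps the hand-picked one.
    """
    random.seed(0)
    x, fx, t = list(x0), f(x0), t0
    for _ in range(n_steps):
        y = [xi + random.gauss(0, step) for xi in x]  # proposal distribution
        fy = f(y)
        # Metropolis rule: accept improvements, sometimes accept worse moves
        if fy < fx or random.random() < math.exp((fx - fy) / t):
            x, fx = y, fy
        t *= alpha  # annealing schedule
    return x, fx

x, fx = simulated_annealing(rosenbrock, [-1.0, 1.0])
print(fx)  # should improve substantially on the starting value of 4.0
```

The two hard-coded ingredients (`step` and `alpha`) are exactly the instance-specific knobs that motivate learning the proposal as a policy.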
https://proceedings.mlr.press/v206/correia23a.html
https://proceedings.mlr.press/v206/correia23a.htmlEfficiently Forgetting What You Have Learned in Graph Representation Learning via ProjectionAs privacy protection receives much attention, unlearning the effect of a specific node from a pre-trained graph learning model has become equally important. However, due to the node dependency in graph-structured data, representation unlearning in Graph Neural Networks (GNNs) is challenging and less well explored. In this paper, we fill in this gap by first studying the unlearning problem in linear GNNs, and then introducing its extension to non-linear structures. Given a set of nodes to unlearn, we propose Projector, which unlearns by projecting the weight parameters of the pre-trained model onto a subspace that is irrelevant to the features of the nodes to be forgotten. Projector overcomes the challenges caused by node dependency and enjoys perfect data removal, i.e., the unlearned model parameters do not contain any information about the unlearned node features, which is guaranteed by algorithmic construction. Empirical results on real-world datasets illustrate the effectiveness and efficiency of Projector.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/cong23a.html
https://proceedings.mlr.press/v206/cong23a.htmlMinimum-Entropy Coupling Approximation Guarantees Beyond the Majorization BarrierGiven a set of discrete probability distributions, the minimum entropy coupling is the minimum entropy joint distribution that has the input distributions as its marginals. This has immediate relevance to tasks such as entropic causal inference for causal graph discovery and bounding mutual information between variables that we observe separately. Since finding the minimum entropy coupling is NP-hard, various works have studied approximation algorithms. The work of [Compton, 2022] shows that the greedy coupling algorithm of [Kocaoglu et al., 2017a] is always within $\log_2(e) \approx 1.44$ bits of the optimal coupling. Moreover, they show that it is impossible to obtain a better approximation guarantee using the majorization lower bound that all prior works have used, thus establishing a majorization barrier. In this work, we break the majorization barrier by designing a stronger lower bound that we call the profile method. Using this profile method, we are able to show that the greedy algorithm is always within $\log_2(e)/e \approx 0.53$ bits of optimal for coupling two distributions (the previous best-known bound is within 1 bit), and within $(1 + \log_2(e))/2 \approx 1.22$ bits for coupling any number of distributions (the previous best-known bound is within 1.44 bits). We also examine a generalization of the minimum entropy coupling problem: Concave Minimum-Cost Couplings. We are able to obtain similar guarantees for this generalization in terms of the concave cost function. Additionally, we make progress on the open problem of [Kovačević et al., 2015] regarding NP membership of the minimum entropy coupling problem by showing that any hardness of minimum entropy coupling beyond NP comes from the difficulty of computing arithmetic in the complexity class NP.
Finally, we present exponential-time algorithms for computing the exactly optimal solution. We experimentally observe that our new profile method lower bound is not only helpful for analyzing the greedy approximation algorithm, but also for improving the speed of our new backtracking-based exact algorithm.Tue, 11 Apr 2023 00:00:00 +0000
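The greedy coupling algorithm analyzed above is commonly described as repeatedly matching the largest remaining masses of the marginals; a sketch under that assumption, for two distributions only (our reading of the greedy scheme, not the authors' code):

```python
import heapq
import math

def greedy_coupling(p, q, tol=1e-12):
    """Greedy coupling of two distributions, in the spirit of the greedy
    algorithm of [Kocaoglu et al., 2017a]: repeatedly take the largest
    remaining mass in each marginal and assign their minimum to that joint
    cell. Returns the joint distribution as {(i, j): mass}.
    """
    hp = [(-pi, i) for i, pi in enumerate(p)]  # max-heaps via negated mass
    hq = [(-qj, j) for j, qj in enumerate(q)]
    heapq.heapify(hp)
    heapq.heapify(hq)
    joint = {}
    while hp and hq:
        mp, i = heapq.heappop(hp)
        mq, j = heapq.heappop(hq)
        m = min(-mp, -mq)
        if m <= tol:
            break
        joint[(i, j)] = joint.get((i, j), 0.0) + m
        if -mp - m > tol:  # push back whichever marginal has mass left
            heapq.heappush(hp, (mp + m, i))
        if -mq - m > tol:
            heapq.heappush(hq, (mq + m, j))
    return joint

def entropy(joint):
    # Shannon entropy of the joint, in bits
    return -sum(m * math.log2(m) for m in joint.values())

p = [0.5, 0.5]
q = [0.5, 0.25, 0.25]
joint = greedy_coupling(p, q)
```

The paper's contribution is not this algorithm but a tighter (profile-method) lower bound showing how close its output entropy is to optimal.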
https://proceedings.mlr.press/v206/compton23a.html
https://proceedings.mlr.press/v206/compton23a.htmlOn double-descent in uncertainty quantification in overparametrized modelsUncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent behavior in the calibration curve of optimally regularised estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/clarte23a.html
https://proceedings.mlr.press/v206/clarte23a.htmlOne Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement LearningAlthough parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/cisneros-velarde23a.html
https://proceedings.mlr.press/v206/cisneros-velarde23a.htmlVariational Boosted Soft TreesGradient boosting machines (GBMs) based on decision trees consistently demonstrate state-of-the-art results on regression and classification tasks with tabular data, often outperforming deep neural networks. However, these models do not provide well-calibrated predictive uncertainties, which prevents their use for decision making in high-risk applications. The Bayesian treatment is known to improve predictive uncertainty calibration, but previously proposed Bayesian GBM methods are either computationally expensive, or resort to crude approximations. Variational inference is often used to implement Bayesian neural networks, but is difficult to apply to GBMs, because the decision trees used as weak learners are non-differentiable. In this paper, we propose to implement Bayesian GBMs using variational inference with soft decision trees, a fully differentiable alternative to standard decision trees introduced by Irsoy et al. Our experiments demonstrate that variational soft trees and variational soft GBMs provide useful uncertainty estimates, while retaining good predictive performance. The proposed models show higher test likelihoods when compared to the state-of-the-art Bayesian GBMs in 7/10 tabular regression datasets and improved out-of-distribution detection in 5/10 datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/cinquin23a.html
https://proceedings.mlr.press/v206/cinquin23a.htmlProvable Hierarchy-Based Meta-Reinforcement LearningHierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing works either assume access to expert-constructed hierarchies, or use hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We consider a tabular setting where natural hierarchical structure is embedded in the transition dynamics. Analogous to supervised meta-learning theory, we provide diversity conditions which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Furthermore, we provide regret bounds on a learner using the recovered hierarchy to solve a meta-test task. Our bounds incorporate common notions in HRL literature such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chua23a.html
https://proceedings.mlr.press/v206/chua23a.htmlOptimal robustness-consistency tradeoffs for learning-augmented metrical task systemsWe examine the problem of designing learning-augmented algorithms for metrical task systems (MTS) that exploit machine-learned advice while maintaining rigorous, worst-case guarantees on performance. We propose an algorithm, DART, that achieves this dual objective, providing cost within a multiplicative factor $(1+\epsilon)$ of the machine-learned advice (i.e., consistency) while ensuring cost within a multiplicative factor $2^{O(1/\epsilon)}$ of a baseline robust algorithm (i.e., robustness) for any $\epsilon > 0$. We show that this exponential tradeoff between consistency and robustness is unavoidable in general, but that in important subclasses of MTS, such as when the metric space has bounded diameter and in the $k$-server problem, our algorithm achieves improved, polynomial tradeoffs between consistency and robustness.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/christianson23a.html
https://proceedings.mlr.press/v206/christianson23a.htmlSubset verification and search algorithms for causal DAGsLearning causal relationships between variables is a fundamental task in causal inference and directed acyclic graphs (DAGs) are a popular choice to represent the causal relationships. As one can recover a causal graph only up to its Markov equivalence class from observations, interventions are often used for the recovery task. Interventions are costly in general and it is important to design algorithms that minimize the number of interventions performed. In this work, we study the problem of identifying the smallest set of interventions required to learn the causal relationships between a subset of edges (target edges). Under the assumptions of faithfulness, causal sufficiency, and ideal interventions, we study this problem in two settings: when the underlying ground truth causal graph is known (subset verification) and when it is unknown (subset search). For the subset verification problem, we provide an efficient algorithm to compute a minimum sized interventional set; we further extend these results to bounded size non-atomic interventions and node-dependent interventional costs. For the subset search problem, in the worst case, we show that no algorithm (even with adaptivity or randomization) can achieve an approximation ratio that is asymptotically better than the vertex cover of the target edges when compared with the subset verification number. This result is surprising as there exists a logarithmic approximation algorithm for the search problem when we wish to recover the whole causal graph. To obtain our results, we prove several interesting structural properties of interventional causal graphs that we believe have applications beyond the subset verification/search problems studied here.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/choo23a.html
https://proceedings.mlr.press/v206/choo23a.htmlApproximating a RUM from Distributions on $k$-SlatesIn this work we consider the problem of fitting Random Utility Models (RUMs) to user choices. Given the winner distributions of the subsets of size $k$ of a universe, we obtain a polynomial-time algorithm that finds the RUM that best approximates the given distribution on average. Our algorithm is based on a linear program that we solve using the ellipsoid method. Given that its separation oracle problem is NP-hard, we devise an approximate separation oracle that can be viewed as a generalization of the weighted Feedback Arc Set problem to hypergraphs. Our theoretical result can also be made practical: we obtain a heuristic that scales to real-world datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chierichetti23a.html
https://proceedings.mlr.press/v206/chierichetti23a.htmlFederated Asymptotics: a model to compare federated learning algorithmsWe develop an asymptotic framework to compare the test performance of (personalized) federated learning algorithms, moving beyond algorithmic convergence arguments. To that end, we study a high-dimensional linear regression model to elucidate the statistical properties (per-client test error) of loss minimizers. Our techniques and model allow precise predictions about the benefits of personalization and information sharing in federated scenarios, including that Federated Averaging with simple client fine-tuning achieves identical asymptotic risk to more intricate meta-learning approaches and outperforms naive Federated Averaging. We evaluate and corroborate these theoretical predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/cheng23b.html
https://proceedings.mlr.press/v206/cheng23b.htmlSelect and Optimize: Learning to solve large-scale TSP instancesLearning-based algorithms for solving TSP have become popular in recent years, but most existing works cannot solve very large-scale TSP instances within a limited time. To address this, this paper introduces a distinctive method to select and locally optimize sub-parts of a solution. Concretely, we design a novel framework that generalizes a small-scale selector-and-optimizer network to large-scale TSP instances by iteratively selecting while optimizing one sub-problem. At each iteration, the running time of sub-problem sampling and selection is significantly reduced by full use of parallel computing. Our neural model is well-designed to exploit the characteristics of the sub-problems. Furthermore, we introduce a trick called destroy-and-repair to help the iterative algorithm escape local minima from a global perspective. Extensive experiments show that our method accelerates state-of-the-art learning-based algorithms by more than 2x while achieving better solution quality on large-scale TSP instances ranging in size from 200 to 20,000.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/cheng23a.html
https://proceedings.mlr.press/v206/cheng23a.htmlHeteRSGD: Tackling Heterogeneous Sampling Costs via Optimal Reweighted Stochastic Gradient DescentOne implicit assumption in current stochastic gradient descent (SGD) algorithms is the identical cost for sampling each component function of the finite-sum objective. However, there are applications where the costs differ substantially, for which SGD schemes with uniform sampling invoke a high sampling load. We investigate the use of importance sampling (IS) as a cost saver in this setting, in contrast to its traditional use for variance reduction. The key ingredient is a novel efficiency metric for IS that advocates low sampling costs while penalizing high gradient variances. We then propose HeteRSGD, an SGD scheme that performs gradient sampling according to optimal probability weights stipulated by the metric, and establish theories on its optimal asymptotic and finite-time convergence rates among all possible IS-based SGD schemes. We show that the relative efficiency gain of HeteRSGD can be arbitrarily large regardless of the problem dimension and number of components. Our theoretical results are validated numerically for both convex and nonconvex problems.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23i.html
https://proceedings.mlr.press/v206/chen23i.htmlAlgorithm-Dependent Bounds for Representation Learning of Multi-Source Domain AdaptationWe use information-theoretic tools to derive a novel analysis of Multi-source Domain Adaptation (MDA) from the representation learning perspective. Concretely, we study joint distribution alignment for supervised MDA with few target labels and unsupervised MDA with pseudo labels, where the latter is relatively hard and less commonly studied. We further provide algorithm-dependent generalization bounds for these two settings, where the generalization is characterized by the mutual information between the parameters and the data. Then we propose a novel deep MDA algorithm, implicitly addressing the target shift through joint alignment. Finally, the mutual information bounds are extended to this algorithm providing a non-vacuous gradient-norm estimation. The proposed algorithm has comparable performance to the state-of-the-art on target-shifted MDA benchmark with improved memory efficiency.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23h.html
https://proceedings.mlr.press/v206/chen23h.htmlReducing Discretization Error in the Frank-Wolfe MethodThe Frank-Wolfe algorithm is a popular method in structurally constrained machine learning applications, due to its fast per-iteration complexity. However, one major limitation of the method is a slow rate of convergence that is difficult to accelerate due to erratic, zig-zagging step directions, even asymptotically close to the solution. We view this as an artifact of discretization; that is to say, the Frank-Wolfe flow, which is its trajectory at asymptotically small step sizes, does not zig-zag, and reducing discretization error goes hand in hand with producing a more stabilized method with better convergence properties. We propose two improvements: a multistep Frank-Wolfe method that directly applies optimized higher-order discretization schemes; and an LMO-averaging scheme with reduced discretization error, whose local convergence rate over general convex sets accelerates from $O(1/k)$ to up to $O(1/k^{3/2})$.Tue, 11 Apr 2023 00:00:00 +0000
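For context, the vanilla Frank-Wolfe baseline the abstract above improves upon can be sketched over the probability simplex, where the linear minimization oracle (LMO) is just vertex selection. The multistep and LMO-averaging variants are not reproduced here; the quadratic test function and the standard $2/(t+2)$ step schedule are our illustrative choices:

```python
def frank_wolfe(grad, x0, n_iters=500):
    """Vanilla Frank-Wolfe over the probability simplex.

    The LMO over the simplex returns the vertex e_k with
    k = argmin_k grad(x)_k; the iterate moves toward it with the classic
    2/(t+2) step size. The paper's multistep variant replaces this
    Euler-like update with higher-order discretization schemes.
    """
    x = list(x0)
    for t in range(n_iters):
        g = grad(x)
        k = min(range(len(x)), key=lambda i: g[i])  # LMO: best vertex
        gamma = 2.0 / (t + 2)                       # standard step size
        x = [(1 - gamma) * xi for xi in x]          # convex combination keeps
        x[k] += gamma                               # x inside the simplex
    return x

# minimize f(x) = sum_i c_i * x_i^2 over the simplex (minimizer x_i ∝ 1/c_i)
c = [1.0, 2.0, 4.0]
grad = lambda x: [2 * ci * xi for ci, xi in zip(c, x)]
x = frank_wolfe(grad, [1 / 3, 1 / 3, 1 / 3])
```

Plotting the iterates of this loop near the solution exhibits exactly the zig-zag between simplex vertices that the abstract attributes to discretization error.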
https://proceedings.mlr.press/v206/chen23g.html
https://proceedings.mlr.press/v206/chen23g.htmlProtecting Global Properties of Datasets with Distribution Privacy MechanismsWe consider the problem of ensuring confidentiality of dataset properties aggregated over many records of a dataset. Such properties can encode sensitive information, such as trade secrets or demographic data, while involving a notion of data protection different to the privacy of individual records typically discussed in the literature. In this work, we demonstrate how a distribution privacy framework can be applied to formalize such data confidentiality. We extend the Wasserstein Mechanism from Pufferfish privacy and the Gaussian Mechanism from attribute privacy to this framework, then analyze their underlying data assumptions and how they can be relaxed. We then empirically evaluate the privacy-utility tradeoffs of these mechanisms and apply them against a practical property inference attack which targets global properties of datasets. The results show that our mechanisms can indeed reduce the effectiveness of the attack while providing utility substantially greater than a crude group differential privacy baseline. Our work thus provides groundwork for theoretical mechanisms for protecting global properties of datasets along with their evaluation in practice.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23f.html
https://proceedings.mlr.press/v206/chen23f.htmlThe communication cost of security and privacy in federated frequency estimationWe consider the federated frequency estimation problem, where each user holds a private item $X_i$ from a size-$d$ domain and a server aims to estimate the empirical frequency (i.e., histogram) of $n$ items with $n \ll d$. Without any security and privacy considerations, each user can communicate its item to the server by using $\log d$ bits. A naive application of secure aggregation protocols would, however, require $d\log n$ bits per user. Can we reduce the communication needed for secure aggregation, and does security come with a fundamental cost in communication? In this paper, we develop an information-theoretic model for secure aggregation that allows us to characterize the fundamental cost of security and privacy in terms of communication. We show that with security (and without privacy) $\Omega\left( n \log d \right)$ bits per user are necessary and sufficient to allow the server to compute the frequency distribution. This is significantly smaller than the $d\log n$ bits per user needed by the naive scheme but significantly higher than the $\log d$ bits per user needed without security. To achieve differential privacy, we construct a linear scheme based on a noisy sketch that locally perturbs the data and does not require a trusted server (a.k.a. distributed differential privacy). We analyze this scheme under $\ell_2$ and $\ell_\infty$ loss. By using our information-theoretic framework, we show that the scheme achieves the optimal accuracy-privacy trade-off with optimal communication cost, while matching the performance in the centralized case where data is stored in the central server.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23e.html
https://proceedings.mlr.press/v206/chen23e.htmlA Multi-Task Gaussian Process Model for Inferring Time-Varying Treatment Effects in Panel DataWe introduce a Bayesian multi-task Gaussian process model for estimating treatment effects from panel data, where an intervention outside the observer’s control influences a subset of the observed units. Our model encodes structured temporal dynamics both within and across the treatment and control groups and incorporates a flexible prior for the evolution of treatment effects over time. These innovations aid in inferring posteriors for dynamic treatment effects that encode our uncertainty about the likely trajectories of units in the absence of treatment. We also discuss the asymptotic properties of the joint posterior over counterfactual outcomes and treatment effects, which exhibits intuitive behavior in the large-sample limit. In experiments on both synthetic and real data, our approach performs no worse than existing methods and significantly better when standard assumptions are violated.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23d.html
https://proceedings.mlr.press/v206/chen23d.htmlOn-Demand Communication for Asynchronous Multi-Agent BanditsThis paper studies a cooperative multi-agent multi-armed stochastic bandit problem where agents operate asynchronously – agent pull times and rates are unknown, irregular, and heterogeneous – and face the same instance of a K-armed bandit problem. Agents can share reward information to speed up the learning process at additional communication costs. We propose ODC, an on-demand communication protocol that tailors the communication of each pair of agents based on their empirical pull times. ODC is efficient when the pull times of agents are highly heterogeneous, and its communication complexity depends on the empirical pull times of agents. ODC is a generic protocol that can be integrated into most cooperative bandit algorithms without degrading their performance. We then incorporate ODC into the natural extensions of UCB and AAE algorithms and propose two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23c.html
https://proceedings.mlr.press/v206/chen23c.htmlByzantine-Robust Online and Offline Distributed Reinforcement LearningWe consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, an $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any size. We aim to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is COW, a novel algorithm for the robust mean estimation from batches problem that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23b.html
https://proceedings.mlr.press/v206/chen23b.htmlStatistical Analysis of Karcher Means for Random Restricted PSD MatricesNon-asymptotic statistical analysis is often missing for modern geometry-aware machine learning algorithms due to the possibly intricate non-linear manifold structure. This paper studies an intrinsic mean model on the manifold of restricted positive semi-definite matrices and provides a non-asymptotic statistical analysis of the Karcher mean. We also consider a general extrinsic signal-plus-noise model, under which a deterministic error bound of the Karcher mean is provided. As an application, we show that the distributed principal component analysis algorithm, LRC-dPCA, achieves the same performance as the full sample PCA algorithm. Numerical experiments lend strong support to our theories.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/chen23a.html
https://proceedings.mlr.press/v206/chen23a.htmlPrecision Recall Cover: A Method For Assessing Generative ModelsGenerative modelling has seen enormous practical advances over the past few years. Evaluating the quality of a generative system, however, is often still based on subjective human inspection. To overcome this, the research community has recently turned to exploring formal evaluation metrics and methods. In this work, we propose a novel evaluation paradigm based on a two-way nearest-neighbor neighborhood test. We define a novel measure of mutual coverage for two continuous probability distributions. From this, we derive an empirical analogue and show analytically that it exhibits favorable theoretical properties while also being straightforward to compute. We show that, while algorithmically simple, our derived method is also statistically sound. In contrast to previously employed distance measures, our measure naturally stems from a notion of local discrepancy, which can be accessed separately. This provides more detailed information to practitioners on diagnosing where their generative models perform well and, conversely, where they fail. We complement our analysis with a systematic experimental evaluation and comparison to other recently proposed measures. Using a wide array of experiments, we demonstrate our algorithm's strengths over other existing methods and confirm the results of our theoretical analysis.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/cheema23a.html
https://proceedings.mlr.press/v206/cheema23a.html“Plus/minus the learning rate”: Easy and Scalable Statistical Inference with SGDIn this paper, we develop a statistical inference procedure using stochastic gradient descent (SGD)-based confidence intervals. These intervals are of the simplest possible form: $\theta_{N,j} \pm 2\sqrt{\gamma/N}$, where $\theta_N$ is the SGD estimate of model parameters $\theta$ over $N$ data points, and $\gamma$ is the learning rate. This construction relies only on a proper selection of the learning rate to ensure the standard SGD conditions for $O(1/N)$ convergence. The procedure performs well in our empirical evaluations, achieving near-nominal coverage while scaling to 20$\times$ as many parameters as other SGD-based inference methods. We also demonstrate our method’s practical significance on modeling adverse events in emergency general surgery patients using a novel dataset from the Hospital of the University of Pennsylvania.Tue, 11 Apr 2023 00:00:00 +0000
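The interval in the abstract above is simple enough to sketch end-to-end. A toy 1-D least-squares illustration of the $\theta_{N,j} \pm 2\sqrt{\gamma/N}$ construction; the constant learning rate, synthetic data, and scalar model are our assumptions for illustration, not the paper's exact setup:

```python
import random

def sgd_with_interval(data, gamma=0.05):
    """Fit a scalar mean by SGD and report the abstract's interval
    theta_N +/- 2*sqrt(gamma/N), where gamma is the learning rate.

    One pass of SGD on the loss 0.5*(theta - x)^2 per data point; the
    interval half-width depends only on gamma and the sample size N.
    """
    theta, n = 0.0, len(data)
    for x in data:
        theta -= gamma * (theta - x)  # gradient step on 0.5*(theta - x)^2
    half_width = 2 * (gamma / n) ** 0.5
    return theta, (theta - half_width, theta + half_width)

random.seed(0)
data = [random.gauss(1.0, 1.0) for _ in range(10000)]
theta, (lo, hi) = sgd_with_interval(data)
```

The appeal is that the interval needs no Hessian or variance estimation: once the learning rate satisfies the paper's conditions, the half-width is read directly off `gamma` and `N`.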
https://proceedings.mlr.press/v206/chee23a.html
Adversarial De-confounding in Individualised Treatment Effects Estimation

Observational studies have recently received significant attention from the machine learning community due to the increasing availability of non-experimental observational data and the limitations of experimental studies, such as considerable cost, impracticality, and small, less representative sample sizes. In observational studies, de-confounding is a fundamental problem in individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for ITE estimation. The adversarial training of the treatment policy selectively encourages treatment-agnostic balanced representations of the confounders and helps to estimate the ITE in observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, demonstrate that our proposed approach improves on state-of-the-art methods by achieving lower error in ITE estimation.
https://proceedings.mlr.press/v206/chauhan23a.html
Two-Sample Tests for Inhomogeneous Random Graphs in $L_r$ Norm: Optimality and Asymptotics

In this paper we study the two-sample problem for inhomogeneous Erdős-Rényi (IER) random graph models in the $L_r$ norm, in the high-dimensional regime where the number of samples is smaller than or comparable to the size of the graphs. Given two symmetric matrices $P, Q \in [0, 1]^{n \times n}$ (with zeros on the diagonals), the two-sample problem for IER graphs (with respect to the $L_r$ norm $\|\cdot\|_r$) is to test the hypothesis $H_0: P=Q$ versus $H_1: \|P-Q\|_r \geq \varepsilon$, given a sample of $m$ graphs from each of the respective distributions. We obtain the optimal sample complexity for testing in the $L_r$ norm, for all integers $r \geq 1$. We also derive the asymptotic distribution of the optimal tests under $H_0$ and develop a method for consistently estimating their variances. This allows us to efficiently implement the optimal tests with precise asymptotic level and establish their asymptotic consistency. We validate our theoretical results with numerical experiments on various natural IER models.
https://proceedings.mlr.press/v206/chatterjee23a.html
Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery

The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means: linear factorization and null hypothesis statistical testing for item partitioning/selection, followed by post-hoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a post-hoc model: its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model for self-consistently performing the following tasks simultaneously: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the “VBE” step in a stochastic variational Bayesian expectation maximization (VBEM) procedure that we use for approximate Bayesian inference on the entire model. We apply the method to a sample of WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional post-hoc method.
https://proceedings.mlr.press/v206/chang23a.html
Clustering High-dimensional Data with Ordered Weighted $\ell_1$ Regularization

Clustering complex high-dimensional data is particularly challenging, as the signal-to-noise ratio in such data is significantly lower than in classical settings, mainly because most of the features describing a data point carry little to no information about the natural grouping of the data. Filtering such features is therefore critical for harnessing meaningful information from large-scale data. Many recent methods have attempted to find feature importance in a centroid-based clustering setting. Though empirically successful in classical low-dimensional settings, most perform poorly in high dimensions, especially on microarray and single-cell RNA-seq data. This paper extends the merits of weighted center-based clustering through the Ordered Weighted $\ell_1$ (OWL) norm for better feature selection. Appealing to the elegant properties of block coordinate descent and Frank-Wolfe algorithms, we not only maintain computational efficiency but also outperform the state of the art in high-dimensional settings. The proposal also comes with finite-sample theoretical guarantees, including a rate of $\mathcal{O}\left(\sqrt{k \log p/n}\right)$ under model sparsity, bridging the gap between the theory and practice of weighted clustering.
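The OWL norm itself is simple to state: sort the absolute entries in decreasing order and take a weighted sum against nonincreasing, nonnegative weights. A minimal sketch (function name ours):

```python
import numpy as np

def owl_norm(x, w):
    """Ordered Weighted l1 norm: sum_i w_i * |x|_(i),
    where |x|_(1) >= |x|_(2) >= ... are the sorted absolute entries.
    Weights are sorted nonincreasing, as the definition requires."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]
    w = np.sort(np.asarray(w, dtype=float))[::-1]
    return float(a @ w)
```

Two familiar special cases: constant weights recover the $\ell_1$ norm, and weights $(\lambda, 0, \dots, 0)$ recover $\lambda \|x\|_\infty$; intermediate weight sequences interpolate between the two, which is what drives feature selection here.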
https://proceedings.mlr.press/v206/chakraborty23a.html
Multi-task Representation Learning with Stochastic Linear Bandits

We study the problem of transfer learning in the setting of stochastic linear contextual bandit tasks. We assume that a low-dimensional linear representation is shared across the tasks, and study the benefit of learning the tasks jointly. Following recent results on the design of Lasso stochastic bandit policies, we propose an efficient greedy policy based on trace-norm regularization. It implicitly learns a low-dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix, nor does it require the covariance of the arms distribution to be invertible. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $T\sqrt{rN}+\sqrt{rNTd}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables and $N$ the number of rounds per task. We show the benefit of our strategy over an independent task learning baseline, which has a worse regret of order $T\sqrt{dN}$. We also argue that our policy is minimax optimal and, when $T\geq d$, has a multi-task regret comparable to that of an oracle policy which knows the true underlying representation.
https://proceedings.mlr.press/v206/cella23a.html
Flexible and Efficient Contextual Bandits with Heterogeneous Treatment Effect Oracles

Contextual bandit algorithms often estimate reward models to inform decision-making. However, true rewards can contain action-independent redundancies that are not relevant for decision-making. We show it is more data-efficient to estimate any function that explains the reward differences between actions, that is, the treatment effects. Motivated by this observation, and building on recent work on oracle-based bandit algorithms, we provide the first reduction of contextual bandits to general-purpose heterogeneous treatment effect estimation, and we design a simple and computationally efficient algorithm based on this reduction. Our theoretical and experimental results demonstrate that heterogeneous treatment effect estimation in contextual bandits offers practical advantages over reward estimation, including more efficient model estimation and greater flexibility to model misspecification.
https://proceedings.mlr.press/v206/carranza23a.html
Active Cost-aware Labeling of Streaming Data

We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting where the data's inputs belong to one of $K$ discrete distributions, and formalize this problem via a loss that captures the labeling cost and the prediction error. When the labeling cost is $B$, our algorithm, which chooses to label a point if the uncertainty is larger than a time- and cost-dependent threshold, achieves a worst-case upper bound of $\tilde{O}(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after $T$ rounds. We also provide a more nuanced upper bound which demonstrates that the algorithm can adapt to the arrival pattern and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm. After $T$ rounds in $d$ dimensions, we show that the loss is bounded by $\tilde{O}(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS with a squared exponential kernel and by $\tilde{O}(B^{\frac{1}{2d+3}} T^{\frac{2d+2}{2d+3}})$ in an RKHS with a Matérn kernel. Our empirical evaluation demonstrates that our method outperforms other baselines in several synthetic experiments and two real experiments in medicine and astronomy.
https://proceedings.mlr.press/v206/cai23a.html
Transport Elliptical Slice Sampling

We propose a new framework for efficiently sampling from complex probability distributions, using a combination of normalizing flows and elliptical slice sampling (Murray et al., 2010). The central idea is to learn a diffeomorphism, through normalizing flows, that maps the non-Gaussian structure of the target distribution to an approximately Gaussian distribution. We then use the elliptical slice sampler, an efficient and tuning-free Markov chain Monte Carlo (MCMC) algorithm, to sample from the transformed distribution. The samples are then pulled back using the inverse normalizing flow, yielding samples that approximate the stationary target distribution of interest. Our transport elliptical slice sampler (TESS) is optimized for modern computer architectures, where its adaptation mechanism utilizes parallel cores to rapidly run multiple Markov chains for a few iterations. Numerical demonstrations show that TESS produces Monte Carlo samples from the target distribution with lower autocorrelation than non-transformed samplers and, given a flexible enough diffeomorphism, achieves significant improvements in efficiency over gradient-based proposals designed for parallel computer architectures.
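The inner engine here, the elliptical slice sampler of Murray et al. (2010), is a published, tuning-free update. A minimal single-step sketch for a zero-mean Gaussian prior is below; the flow learning and parallel adaptation that make up TESS itself are not shown.

```python
import numpy as np

def ess_step(x, log_lik, chol_cov, rng):
    """One elliptical slice sampling update (Murray et al., 2010)
    targeting prior N(0, chol_cov @ chol_cov.T) times exp(log_lik)."""
    nu = chol_cov @ rng.standard_normal(x.shape)   # auxiliary prior draw
    log_u = log_lik(x) + np.log(rng.uniform())     # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        xp = x * np.cos(theta) + nu * np.sin(theta)
        if log_lik(xp) > log_u:
            return xp
        # shrink the angle bracket towards the current state and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

The loop always terminates: as the bracket shrinks towards angle 0 the proposal approaches the current state, which lies above the slice by construction.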
https://proceedings.mlr.press/v206/cabezas23a.html
A Case of Exponential Convergence Rates for SVM

Optimizing the misclassification risk is in general NP-hard. Tractable solvers can be obtained by considering a surrogate regression problem. While convergence to the regression function is typically sublinear, the corresponding classification error can decay much faster. Fast and super-fast rates (up to exponential) have been established for general smooth losses on problems where a hard margin is present between classes. This leaves out models based on non-smooth losses such as support vector machines, as well as problems where there is no hard margin, raising several questions. Are such models incapable of fast convergence? Are they therefore structurally inferior? Is the hard margin condition really necessary to obtain exponential convergence? Developing a new strategy, we provide an answer to these questions. In particular, we show not only that support vector machines can indeed converge exponentially fast, but also that they can do so even without a hard margin.
https://proceedings.mlr.press/v206/cabannnes23a.html
The Schrödinger Bridge between Gaussian Measures has a Closed Form

The static optimal transport $(\mathrm{OT})$ problem between Gaussians seeks to recover an optimal map, or more generally a coupling, to morph one Gaussian into another. It has been well studied and applied to a wide variety of tasks. Here we focus on the dynamic formulation of OT, also known as the Schrödinger bridge (SB) problem, which has recently seen a surge of interest in machine learning due to its connections with diffusion-based generative models. In contrast to the static setting, much less is known about the dynamic setting, even for Gaussian distributions. In this paper, we provide closed-form expressions for SBs between Gaussian measures. In contrast to the static Gaussian OT problem, which can be simply reduced to studying convex programs, our framework for solving SBs requires significantly more involved tools, such as Riemannian geometry and generator theory. Notably, we establish that the solutions of SBs between Gaussian measures are themselves Gaussian processes with explicit mean and covariance kernels, and thus are readily amenable to many downstream applications such as generative modeling or interpolation. To demonstrate the utility, we devise a new method for modeling the evolution of single-cell genomics data and report significantly improved numerical stability compared to existing SB-based approaches.
https://proceedings.mlr.press/v206/bunne23a.html
BlitzMask: Real-Time Instance Segmentation Approach for Mobile Devices

We propose BlitzMask, a fast and low-complexity anchor-free instance segmentation approach. For the first time, the approach achieves competitive results for real-time inference on mobile devices. The model architecture modifies CenterNet by adding a new lite head. The model contains only layers optimized for inference on mobile devices, e.g. batch normalization, standard convolution, and depthwise convolution, and can easily be embedded into a mobile device. The instance segmentation task requires finding an arbitrary (not a priori fixed) number of instance masks. The proposed method predicts the number of instance masks separately for each image using a predicted heatmap. Then, it decomposes each instance mask over a predicted spanning set, which is an output of the lite head. The approach uses training from scratch with a new optimization process and a new loss function. A model with an EfficientNet-Lite B4 backbone and 320x320 input resolution achieves 28.9 mask AP at 29.2 fps on a Samsung S21 GPU and 28.0 mask AP at 39.4 fps on a Samsung S21 DSP. This sets a new speed benchmark for instance segmentation inference on mobile devices.
https://proceedings.mlr.press/v206/bulygin23a.html
Minimax-Bayes Reinforcement Learning

While the Bayesian decision-theoretic framework offers an elegant solution to the problem of decision making under uncertainty, one question is how to appropriately select the prior distribution. One idea is to employ a worst-case prior. However, this is not as easy to specify in sequential decision making as in simple statistical estimation problems. This paper studies (sometimes approximate) minimax-Bayes solutions for various reinforcement learning problems, in order to gain insight into the properties of the corresponding priors and policies. We find that while the worst-case prior depends on the setting, the corresponding minimax policies are more robust than those that assume a standard (i.e. uniform) prior.
https://proceedings.mlr.press/v206/buening23a.html
Membership Inference Attacks against Synthetic Data through Overfitting Detection

Data is the foundation of most science. Unfortunately, sharing data can be obstructed by the risk of violating data privacy, impeding research in fields like healthcare. Synthetic data is a potential solution: it aims to generate data that has the same distribution as the original data but does not disclose information about individuals. Membership inference attacks (MIAs) are a common privacy attack, in which the attacker attempts to determine whether a particular real sample was used for training of the model. Previous works proposing MIAs against generative models either display low performance, giving the false impression that data is highly private, or need to assume access to internal generative model parameters, a relatively low-risk scenario, as the data publisher often only releases synthetic data, not the model. In this work we argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model. Experimentally, we show that DOMIAS is significantly more successful at MIA than previous work, especially at attacking uncommon samples. The latter is disconcerting, since these samples may correspond to underrepresented groups. We also demonstrate how DOMIAS's MIA performance score provides an interpretable metric for privacy, giving data publishers a new tool for achieving the desired privacy-utility trade-off in their synthetic data.
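DOMIAS itself is specified in the paper; to illustrate the density-ratio idea behind a density-based MIA, here is a toy sketch, with a fixed-bandwidth KDE and all names being our assumptions, that scores candidates by an estimate of $p_{\mathrm{synth}}(x)/p_{\mathrm{ref}}(x)$: points where the generator places much more mass than the reference population are flagged as likely training members.

```python
import numpy as np

def kde(data, bandwidth):
    # minimal fixed-bandwidth Gaussian KDE; data has shape (n, d)
    def pdf(x):
        d2 = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=-1)
        k = np.exp(-0.5 * d2 / bandwidth ** 2)
        norm = (2.0 * np.pi * bandwidth ** 2) ** (data.shape[1] / 2.0)
        return k.mean(axis=1) / norm
    return pdf

def membership_score(candidates, synthetic, reference, bandwidth=0.3):
    """Density-ratio membership score p_synth(x) / p_ref(x):
    large values flag points the generator has locally overfit."""
    p_s = kde(synthetic, bandwidth)(candidates)
    p_r = kde(reference, bandwidth)(candidates)
    return p_s / np.maximum(p_r, 1e-12)
```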
https://proceedings.mlr.press/v206/breugel23a.html
Causal Entropy Optimization

We study the problem of globally optimizing the causal effect on a target variable of an unknown causal graph in which interventions can be performed. This problem arises in many areas of science, including biology, operations research and healthcare. We propose Causal Entropy Optimization (CEO), a framework that generalizes Causal Bayesian Optimization (CBO) to account for all sources of uncertainty, including the one arising from the causal graph structure. CEO incorporates the causal structure uncertainty both in the surrogate models for the causal effects and in the mechanism used to select interventions, via an information-theoretic acquisition function. The resulting algorithm automatically trades off structure learning and causal effect optimization, while naturally accounting for observation noise. For various synthetic and real-world structural causal models, CEO achieves faster convergence to the global optimum than CBO while also learning the graph. Furthermore, our joint approach to structure learning and causal optimization improves upon sequential, structure-learning-first approaches.
https://proceedings.mlr.press/v206/branchini23a.html
Probabilistic Querying of Continuous-Time Event Sequences

Continuous-time event sequences, i.e., sequences consisting of continuous time stamps and associated event types (“marks”), are an important type of sequential data with many applications, e.g., in clinical medicine or user behavior modeling. Since these data are typically modeled in an autoregressive manner (e.g., using neural Hawkes processes or their classical counterparts), it is natural to ask questions about future scenarios, such as “what kind of event will occur next” or “will an event of type $A$ occur before one of type $B$”. Addressing such queries with direct methods such as naive simulation can be highly inefficient from a computational perspective. This paper introduces a new typology of query types and a framework for addressing them using importance sampling. Example queries include predicting the $n^\mathrm{th}$ event type in a sequence and the hitting time distribution of one or more event types. We further leverage these findings to estimate general “$A$ before $B$” queries. We prove theoretically that our estimation method is effectively always better than naive simulation, and demonstrate empirically on three real-world datasets that our approach can produce orders-of-magnitude improvements in sampling efficiency compared to naive methods.
https://proceedings.mlr.press/v206/boyd23a.html
Exploration in Reward Machines with Low Regret

We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms, each of which exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, in either the episodic or average-reward setting. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
https://proceedings.mlr.press/v206/bourel23a.html
Random Features Model with General Convex Regularization: A Fine Grained Analysis with Precise Asymptotic Learning Curves

We compute precise asymptotic expressions for the learning curves of least squares random feature (RF) models with either a separable strongly convex regularization or $\ell_1$ regularization. We propose a novel multi-level application of the convex Gaussian min-max theorem (CGMT) to overcome the traditional difficulty of finding computable expressions for random features models with correlated data. Our result takes the form of a computable 4-dimensional scalar optimization. In contrast to previous results, our approach does not require solving an often intractable proximal operator, which scales with the number of model parameters. Furthermore, we extend the universality results for the training and generalization errors of RF models to $\ell_1$ regularization. In particular, we demonstrate that, under mild conditions, random feature models with elastic net or $\ell_1$ regularization are asymptotically equivalent to a surrogate Gaussian model with the same first and second moments. We numerically demonstrate the predictive capacity of our results and show experimentally that the predicted test error is accurate even in the non-asymptotic regime.
https://proceedings.mlr.press/v206/bosch23a.html
Isotropic Gaussian Processes on Finite Spaces of Graphs

We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing notions of closeness and symmetry, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to Euclidean isotropic processes, such as the squared exponential or Matérn kernels. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define appropriate versions of priors thereon. We prove a hardness result showing that, in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small-data regime.
https://proceedings.mlr.press/v206/borovitskiy23a.html
From Shapley Values to Generalized Additive Models and back

In explainable machine learning, local post-hoc explanation algorithms and inherently interpretable models are often seen as competing approaches. This work offers a partial reconciliation between the two by establishing a correspondence between Shapley Values and Generalized Additive Models (GAMs). We introduce $n$-Shapley Values, a parametric family of local post-hoc explanation algorithms that explain individual predictions with interaction terms up to order $n$. By varying the parameter $n$, we obtain a sequence of explanations that covers the entire range from Shapley Values up to a uniquely determined decomposition of the function we want to explain. The relationship between $n$-Shapley Values and this decomposition offers a functionally-grounded characterization of Shapley Values, which highlights their limitations. We then show that $n$-Shapley Values, as well as the Shapley Taylor and Faith-Shap interaction indices, recover GAMs with interaction terms up to order $n$. This implies that the original Shapley Values recover GAMs without variable interactions. Taken together, our results provide a precise characterization of Shapley Values as they are used in explainable machine learning. They also offer a principled interpretation of partial dependence plots of Shapley Values in terms of the underlying functional decomposition. A package for the estimation of different interaction indices is available at https://github.com/tml-tuebingen/nshap.
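To make the order-1 end of this correspondence concrete: for a purely additive set function, exact Shapley values, computed below by brute-force coalition enumeration (feasible only for small $n$), return exactly the additive components. A sketch:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n):
    """Exact Shapley values of a set function f over players {0,...,n-1},
    by enumerating all coalitions (exponential in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # standard Shapley weight |S|! (n-|S|-1)! / n!
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                phi[i] += w * (f(set(S) | {i}) - f(set(S)))
    return phi
```

For $f(S) = \sum_{j \in S} v_j$ the result is exactly $(v_0, \dots, v_{n-1})$, the additive (GAM) components without interactions; interaction indices are what extend this to higher orders.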
https://proceedings.mlr.press/v206/bordt23a.html
Identification of Blackwell Optimal Policies for Deterministic MDPs

This paper investigates a new learning problem, the identification of Blackwell optimal policies on deterministic MDPs (DMDPs): a learner has to return a Blackwell optimal policy with fixed confidence using a minimal number of queries. First, we characterize the maximal set of DMDPs for which identification is possible. Then, we focus on the analysis of algorithms based on product-form confidence regions. We minimize the number of queries by efficiently visiting the state-action pairs with respect to the shape of the confidence sets. Furthermore, these confidence sets are themselves optimized to achieve better performance. The performance of our methods compares to the lower bounds up to a factor $n^2$ in the worst case, where $n$ is the number of states, and up to a constant factor in certain classes of DMDPs.
https://proceedings.mlr.press/v206/boone23a.html
Hierarchical-Hyperplane Kernels for Actively Learning Gaussian Process Models of Nonstationary Systems

Learning precise surrogate models of complex computer simulations and physical machines often requires long-lasting or expensive experiments. Furthermore, the modeled physical dependencies exhibit nonlinear and nonstationary behavior. Machine learning methods used to produce the surrogate model should therefore address these problems by providing a scheme to keep the number of queries small, e.g. by using active learning, and by being able to capture the nonlinear and nonstationary properties of the system. One way of modeling the nonstationarity is to induce input-partitioning, a principle that has proven advantageous in active learning for Gaussian processes. However, existing methods either assume a known partitioning, need to introduce complex sampling schemes, or rely on very simple geometries. In this work, we present a simple yet powerful kernel family that incorporates a partitioning that (i) is learnable via gradient-based methods and (ii) uses a geometry that is more flexible than previous ones, while still being applicable in the low-data regime. Thus, it provides a good prior for active learning procedures. We empirically demonstrate excellent performance on various active learning tasks.
https://proceedings.mlr.press/v206/bitzer23a.html
Recurrent Neural Networks and Universal Approximation of Bayesian Filters

We consider the Bayesian optimal filtering problem: estimating some conditional statistics of a latent time-series signal from an observation sequence. Classical approaches often rely on the use of assumed or estimated transition and observation models. Instead, we formulate a generic recurrent neural network framework and seek to learn directly a recursive mapping from observational inputs to the desired estimator statistics. The main focus of this article is the approximation capabilities of this framework. We provide approximation error bounds for filtering in general non-compact domains. We also consider strong time-uniform approximation error bounds that guarantee good long-time performance. We discuss and illustrate a number of practical concerns and implications of these results.
https://proceedings.mlr.press/v206/bishop23a.html
Tighter PAC-Bayes Generalisation Bounds by Leveraging Example Difficulty

We introduce a modified version of the excess risk, which can be used to obtain empirically tighter, faster-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for $[-1, 1]$-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around 0. The primary new technical tool is a novel result for sequences of interdependent random vectors, which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.
https://proceedings.mlr.press/v206/biggs23a.html
Prediction-Oriented Bayesian Active Learning

Information-theoretic approaches to active learning have traditionally focused on maximising the information gathered about the model parameters, most commonly by optimising the BALD score. We highlight that this can be suboptimal from the perspective of predictive performance. For example, BALD lacks a notion of an input distribution and so is prone to prioritising data of limited relevance. To address this we propose the expected predictive information gain (EPIG), an acquisition function that measures information gain in the space of predictions rather than parameters. We find that using EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models, and thus provides an appealing drop-in replacement.
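The BALD baseline referenced here has a standard Monte Carlo form: the entropy of the mean predictive distribution minus the mean entropy of the per-sample predictive distributions. A minimal sketch (EPIG itself requires the paper's definition and is not shown):

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy in nats, with clipping for numerical safety
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def bald_score(probs):
    """BALD from Monte Carlo posterior samples.
    probs: (n_samples, n_points, n_classes) predictive probabilities.
    Returns H[E p] - E H[p] per point: high when the sampled models
    disagree confidently, zero when they all agree."""
    mean_p = probs.mean(axis=0)
    return entropy(mean_p) - entropy(probs).mean(axis=0)
```

The pathology the abstract points at is visible in this formula: nothing in it references where test inputs actually live, so a point can score highly while being irrelevant to prediction.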
https://proceedings.mlr.press/v206/bickfordsmith23a.html
https://proceedings.mlr.press/v206/bickfordsmith23a.htmlPiecewise Stationary Bandits under Risk CriteriaPiecewise stationary stochastic multi-armed bandits have been extensively explored in the risk-neutral and sub-Gaussian setting. In this work, we consider a multi-armed bandit framework in which the reward distributions are heavy-tailed and non-stationary, and evaluate the performance of algorithms using general risk criteria. Specifically, we make the following contributions: (i) We first propose a non-parametric change detection algorithm that can detect general distributional changes in heavy-tailed distributions. (ii) We then propose a truncation-based UCB-type bandit algorithm that integrates the above change detection algorithm to minimize the regret of the non-stationary learning problem. (iii) Finally, we establish regret bounds for the proposed bandit algorithm by characterizing the statistical properties of the general change detection algorithm, along with a novel regret analysis.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bhatt23b.html
https://proceedings.mlr.press/v206/bhatt23b.htmlOn Universal Portfolios with Continuous Side InformationA new portfolio selection strategy that adapts to a continuous side-information sequence is presented, with a universal wealth guarantee against a class of state-constant rebalanced portfolios with respect to a state function that maps each side-information symbol to a finite set of states. In particular, given that a state function belongs to a collection of functions of finite Natarajan dimension, the proposed strategy is shown to achieve, asymptotically to first order in the exponent, the same wealth as the best state-constant rebalanced portfolio with respect to the best state function, chosen in hindsight from the observed market. This result can be viewed as an extension of the seminal work of Cover and Ordentlich (1996), which assumes a single state function.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bhatt23a.html
https://proceedings.mlr.press/v206/bhatt23a.htmlReward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling LawsSpecifying reward functions for complex tasks like object manipulation or driving is challenging to do by hand. Reward learning seeks to address this by learning a reward model using human feedback on selected query policies. This shifts the burden of reward specification to the optimal design of the queries. We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem. Our framework models rewards and policies as nonparametric functions belonging to subsets of Reproducing Kernel Hilbert Spaces (RKHSs). The learner receives (noisy) oracle access to a true reward and must output a policy that performs well under the true reward. For this setting, we first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression. We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate, which depends primarily on the eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the generality of these results, our bounds are stronger than previous bounds developed for more specialized problems. We specifically show that the well-studied problem of Gaussian process (GP) bandit optimization is a special case of our framework, and that our bounds either improve or are competitive with known regret guarantees for the Matérn kernel.Tue, 11 Apr 2023 00:00:00 +0000
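The plug-in estimator the analysis above starts from is ordinary kernel ridge regression; a minimal sketch (the RBF kernel, toy data, and hyperparameters are illustrative choices, not the paper's setup):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X, y, X_test, lam=1e-3, gamma=10.0):
    """Plug-in kernel ridge estimator: solve (K + n*lam*I) alpha = y."""
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return rbf_kernel(X_test, X, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=100)   # noisy oracle values
X_test = np.linspace(-1, 1, 50)[:, None]
pred = krr_fit_predict(X, y, X_test)
err = np.max(np.abs(pred - np.sin(3 * X_test[:, 0])))   # error vs. the noiseless target
```

In the paper's framing, the query design problem is then to choose where `X` is sampled so that the excess risk of this estimator is minimized.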
https://proceedings.mlr.press/v206/bhatia23a.html
https://proceedings.mlr.press/v206/bhatia23a.htmlCompeting against Adaptive Strategies in Online Learning via HintsFor many of the classic online learning settings, it is known that having a “hint” about the loss function before making a prediction yields significantly better regret guarantees. In this work we study the question, do hints allow us to go beyond the standard notion of regret (which competes against the best fixed strategy) and compete against adaptive or dynamic strategies? After all, if hints were perfect, we can clearly compete against a fully dynamic strategy. For some common online learning settings, we provide upper and lower bounds for the switching regret, i.e., the difference between the loss incurred by the algorithm and the optimal strategy in hindsight that switches state at most $L$ times, where $L$ is some parameter. We show positive results for online linear optimization and the classic experts problem. Interestingly, such results turn out to be impossible for the classic bandit setting.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bhaskara23a.html
https://proceedings.mlr.press/v206/bhaskara23a.htmlStochastic Gradient Descent-Ascent: Unified Theory and New Efficient MethodsStochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although variants of the new methods are known for solving minimization problems, they were never considered or analyzed for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments.Tue, 11 Apr 2023 00:00:00 +0000
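For readers unfamiliar with the base method, a minimal sketch of plain SGDA on a strongly-convex-strongly-concave toy saddle problem (the objective, noise model, and step size are illustrative, not one of the paper's variants):

```python
import numpy as np

# Min-max objective f(x, y) = 0.5*x^2 + x*y - 0.5*y^2:
# strongly convex in x, strongly concave in y, unique saddle point at (0, 0).
def grad_x(x, y): return x + y     # df/dx
def grad_y(x, y): return x - y     # df/dy (ascent direction for y)

rng = np.random.default_rng(1)
x, y, lr = 3.0, -2.0, 0.05
for _ in range(2000):
    # stochastic gradients: exact gradients plus zero-mean noise
    gx = grad_x(x, y) + 0.1 * rng.normal()
    gy = grad_y(x, y) + 0.1 * rng.normal()
    x -= lr * gx                   # descent step on x
    y += lr * gy                   # ascent step on y
print(abs(x), abs(y))              # both settle near the saddle point
```

The unified analysis in the paper covers this vanilla scheme together with its variance-reduced, compressed, and coordinate-randomized variants via a single parametric assumption on the stochastic estimates.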
https://proceedings.mlr.press/v206/beznosikov23a.html
https://proceedings.mlr.press/v206/beznosikov23a.htmlOn the Limitations of the Elo, Real-World Games are Transitive, not AdditiveThe Elo score has been extensively used to rank players by their skill or strength in competitive games such as chess, go, or StarCraft II. The Elo score implicitly assumes games have a strong additive—hence transitive—component. In this paper, we investigate the challenge of identifying transitive components in games. As a starting point, we show that the Elo score provably fails to extract the transitive component of some elementary transitive games. Based on this observation, we propose an alternative ranking system which properly extracts the transitive components in these games. Finally, we conduct an in-depth empirical validation on real-world game payoff matrices: it shows significant prediction performance improvements compared to the Elo score.Tue, 11 Apr 2023 00:00:00 +0000
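The Elo model being critiqued can be stated in a few lines; a minimal sketch of the standard expected-score and rating-update rules (the K-factor of 32 is a conventional choice):

```python
def elo_expected(ra, rb):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def elo_update(ra, rb, score_a, k=32):
    """Update both ratings after one game; score_a is 1 (A wins), 0.5, or 0."""
    ea = elo_expected(ra, rb)
    return ra + k * (score_a - ea), rb + k * ((1 - score_a) - (1 - ea))

ra, rb = 1500.0, 1500.0
for _ in range(20):                 # A beats B repeatedly
    ra, rb = elo_update(ra, rb, 1.0)
print(ra > rb)                      # A's rating now exceeds B's
```

Note that the update is zero-sum (the rating total is conserved), which reflects the additive, hence transitive, structure the paper shows can fail to capture even elementary transitive games.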
https://proceedings.mlr.press/v206/bertrand23a.html
https://proceedings.mlr.press/v206/bertrand23a.htmlTo Impute or not to Impute? Missing Data in Treatment Effect EstimationMissing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the input (e.g. an individual) and the label (e.g. an outcome). The treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we introduce mixed confounded missingness (MCM), a new missingness mechanism where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment introduces bias in covariates. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data. We highlight that our experiments encompass both average treatment effects and conditional average treatment effects.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/berrevoets23a.html
https://proceedings.mlr.press/v206/berrevoets23a.htmlProvable Safe Reinforcement Learning with Binary FeedbackSafety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs. We provide a novel meta-algorithm, SABRE, which can be applied to any MDP setting given access to a black-box PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical result shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both theoretical and empirical frameworks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bennett23a.html
https://proceedings.mlr.press/v206/bennett23a.htmlBayesian Optimization Over Iterative Learners with Structured Responses: A Budget-aware Planning ApproachThe growing size of deep neural networks (DNNs) and datasets motivates the need for efficient solutions for simultaneous model selection and training. Many methods for hyperparameter optimization (HPO) of iterative learners, including DNNs, attempt to solve this problem by querying and learning a response surface while searching for the optimum of that surface. However, many of these methods make myopic queries, do not consider prior knowledge about the response structure, and/or perform a biased cost-aware search, all of which exacerbate the difficulty of identifying the best-performing model when a total cost budget is specified. This paper proposes a novel approach referred to as Budget-Aware Planning for Iterative Learners (BAPI) to solve HPO problems under a constrained cost budget. BAPI is an efficient non-myopic Bayesian optimization solution that accounts for the budget and leverages prior knowledge about the objective function and cost function to select better configurations and to make more informed decisions during the evaluation (training). Experiments on diverse HPO benchmarks for iterative learners show that BAPI performs better than state-of-the-art baselines in most cases.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/belakaria23a.html
https://proceedings.mlr.press/v206/belakaria23a.htmlOn the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced DataVarious logit-adjusted parameterizations of the cross-entropy (CE) loss have been proposed as alternatives to weighted CE for training large models on label-imbalanced data far beyond the zero train error regime. The driving force behind those designs has been the theory of implicit bias, which for linear(ized) models, explains why they successfully induce bias on the optimization path towards solutions that favor minorities. Aiming to extend this theory to non-linear models, we investigate the implicit geometry of classifiers and embeddings that are learned by different CE parameterizations. Our main result characterizes the global minimizers of a non-convex cost-sensitive SVM classifier for the unconstrained features model, which serves as an abstraction of deep-nets. We derive closed-form formulas for the angles and norms of classifiers and embeddings as a function of the number of classes, the imbalance and the minority ratios, and the loss hyperparameters. Using these, we show that logit-adjusted parameterizations can be appropriately tuned to learn symmetric geometries irrespective of the imbalance ratio. We complement our analysis with experiments and an empirical study of convergence accuracy in deep-nets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/behnia23a.html
https://proceedings.mlr.press/v206/behnia23a.htmlHigh Probability Bounds for Stochastic Continuous Submodular MaximizationWe consider maximization of stochastic monotone continuous submodular functions (CSF) with a diminishing return property. Existing algorithms only guarantee the performance in expectation, and do not bound the probability of getting a bad solution. This implies that for a particular run of the algorithms, the solution may be much worse than the provided guarantee in expectation. In this paper, we first empirically verify that this is indeed the case. Then, we provide the first high-probability analysis of the existing methods for stochastic CSF maximization, namely PGA, boosted PGA, SCG, and SCG++. Finally, we provide an improved high-probability bound for SCG, under slightly stronger assumptions, with a better convergence rate than that of the expected solution. Through extensive experiments on non-concave quadratic programming (NQP) and optimal budget allocation, we confirm the validity of our bounds and show that even in the worst-case, PGA converges to $OPT/2$, and boosted PGA, SCG, SCG++ converge to $(1 - 1/e)OPT$, but at a slower rate than that of the expected solution.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/becker23a.html
https://proceedings.mlr.press/v206/becker23a.htmlPrincipled Approaches for Private Adaptation from a Public SourceA key problem in a variety of applications is that of domain adaptation from a public source domain, for which a relatively large amount of labeled data with no privacy constraints is at one’s disposal, to a private target domain, for which a private sample is available with very few or no labeled data. In regression problems, where there are no privacy constraints on the source or target data, a discrepancy minimization approach was shown to outperform a number of other adaptation algorithm baselines. Building on that approach, we initiate a principled study of differentially private adaptation from a source domain with public labeled data to a target domain with unlabeled private data. We design differentially private discrepancy-based adaptation algorithms for this problem. The design and analysis of our private algorithms critically hinge upon several key properties we prove for a smooth approximation of the weighted discrepancy, such as its smoothness with respect to the $\ell_1$-norm and the sensitivity of its gradient. We formally show that our adaptation algorithms benefit from strong generalization and privacy guarantees.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bassily23a.html
https://proceedings.mlr.press/v206/bassily23a.htmlA Faster Sampler for Discrete Determinantal Point ProcessesDiscrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets. They are however held back in some cases by the high cost of sampling. In the worst-case scenario, the sampling cost scales as $O(n^3)$, where $n$ is the number of elements of the ground set. A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels. In such cases, the cost of standard sampling algorithms scales as $O(np^2 + nm^2)$, where $m$ is the (average) number of samples of the DPP (usually $m \ll n$) and $p$ is the rank of the kernel used to define the DPP ($m \leq p \leq n$). The first term, $O(np^2)$, comes from an SVD-like step. We focus here on the second term of this cost, $O(nm^2)$, and show that it can be brought down to $O(nm + m^3 \log m)$ without any loss of sampling exactness. In practice, we observe very substantial speedups compared to the classical algorithm as soon as $n > 1{,}000$. The algorithm described here is a close variant of the standard algorithm for sampling continuous DPPs, and uses rejection sampling. In the specific case of projection DPPs, we also show that any additional sample can be drawn in time $O(m^3 \log m)$. Finally, an interesting by-product of the analysis is that a realisation from a DPP is typically contained in a subset of size $O(m \log m)$ formed using leverage score i.i.d. sampling.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/barthelme23a.html
https://proceedings.mlr.press/v206/barthelme23a.htmlAdaptive Cholesky Gaussian ProcessesWe present a method to approximate Gaussian process regression models to large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bartels23a.html
https://proceedings.mlr.press/v206/bartels23a.htmlExploration in Linear Bandits with Rich Action Sets and its Implications for InferenceWe present a non-asymptotic lower bound on the spectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action space has a constant Hessian around the optimal arm. This shows that such action spaces force a polynomial lower bound on the least eigenvalue, rather than a logarithmic lower bound as shown by Lattimore et al. (2017) for discrete (i.e., well-separated) action spaces. Furthermore, while the latter holds only in the asymptotic regime ($n \to \infty$), our result for these “locally rich” action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigenvalue holding with high probability. We apply our result to two practical scenarios: model selection and clustering in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi-agent framework in which we show, by leveraging the spectral result, that no forced exploration is necessary: the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur a low regret.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/banerjee23b.html
https://proceedings.mlr.press/v206/banerjee23b.htmlTesting of Horn SamplersSampling over combinatorial spaces is a fundamental problem in artificial intelligence with a wide variety of applications. Since state-of-the-art techniques heavily rely on heuristics whose rigorous analysis remains beyond the reach of current theoretical tools, the past few years have witnessed interest in the design of techniques to test the quality of samplers. The current state-of-the-art techniques, $\mathsf{Barbarik}$ and $\mathsf{Barbarik2}$, focus on the case where combinatorial spaces are encoded as Conjunctive Normal Form (CNF) formulas. While CNF is a general-purpose form, techniques often rely on exploiting specific representations to achieve speedup. Of particular interest are Horn clauses, which form the basis of the logic programming tools in AI. In this context, a natural question is whether it is possible to design a tester that can determine the correctness of a given Horn sampler. The primary contribution of this paper is an affirmative answer to the above question. We design the first tester, $\mathsf{Flash}$, which tests the correctness of a given Horn sampler: given a specific distribution $\mathcal{I}$ and parameters $\eta$, $\varepsilon$, and $\delta$, the tester $\mathsf{Flash}$ correctly (with probability at least $1-\delta$) distinguishes whether the underlying distribution of the Horn sampler is “$\varepsilon$-close” to $\mathcal{I}$ or “$\eta$-far” from $\mathcal{I}$ by drawing only $\widetilde{\mathcal{O}}(\mathsf{tilt}^3/(\eta - \varepsilon)^4)$ samples from the Horn sampler, where $\mathsf{tilt}$ is the ratio of the maximum and the minimum (non-zero) probability masses of $\mathcal{I}$. We also provide a prototype implementation of $\mathsf{Flash}$ and test three state-of-the-art samplers on a set of benchmarks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/banerjee23a.html
https://proceedings.mlr.press/v206/banerjee23a.htmlNash Equilibria and Pitfalls of Adversarial Training in Adversarial Robustness GamesAdversarial training is a standard technique for training adversarially robust models. In this paper, we study adversarial training as an alternating best-response strategy in a 2-player zero-sum game. We prove that even in a simple scenario of a linear classifier and a statistical model that abstracts robust vs. non-robust features, the alternating best response strategy of such game may not converge. On the other hand, a unique pure Nash equilibrium of the game exists and is provably robust. We support our theoretical results with experiments, showing the non-convergence of adversarial training and the robustness of Nash equilibrium.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/balcan23a.html
https://proceedings.mlr.press/v206/balcan23a.htmlLarge deviations rates for stochastic gradient descent with strongly convex functionsRecent works have shown that high probability metrics with stochastic gradient descent (SGD) exhibit informativeness and in some cases advantage over the commonly adopted mean-square error-based ones. In this work we provide a formal framework for the study of general high probability bounds with SGD, based on the theory of large deviations. The framework allows for a generic (not-necessarily bounded) gradient noise satisfying mild technical assumptions, allowing for the dependence of the noise distribution on the current iterate. Under the preceding assumptions, we find an upper large deviations bound for SGD with strongly convex functions. The corresponding rate function captures analytical dependence on the noise distribution and other problem parameters. This is in contrast with conventional mean-square error analysis that captures only the noise dependence through the variance and does not capture the effect of higher order moments nor interplay between the noise geometry and the shape of the cost function. We also derive exact large deviation rates for the case when the objective function is quadratic and show that the obtained function matches the one from the general upper bound hence showing the tightness of the general upper bound. Numerical examples illustrate and corroborate theoretical findings.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bajovic23a.html
https://proceedings.mlr.press/v206/bajovic23a.htmlEstimating Total Correlation with Mutual Information EstimatorsTotal correlation (TC) is a fundamental concept in information theory that measures statistical dependency among multiple random variables. Recently, TC has shown noticeable effectiveness as a regularizer in many learning tasks, where the correlation among multiple latent embeddings needs to be jointly minimized or maximized. However, calculating precise TC values is challenging, especially when the closed-form distributions of embedding variables are unknown. In this paper, we introduce a unified framework to estimate total correlation values with sample-based mutual information (MI) estimators. More specifically, we discover a relation between TC and MI and propose two types of calculation paths (tree-like and line-like) to decompose TC into MI terms. With each MI term being bounded, the TC values can be successfully estimated. Further, we provide theoretical analyses concerning the statistical consistency of the proposed TC estimators. Experiments are presented on both synthetic and real-world scenarios, where our estimators demonstrate effectiveness across TC estimation, minimization, and maximization tasks.Tue, 11 Apr 2023 00:00:00 +0000
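A line-like decomposition of TC into MI terms can be checked exactly on a tiny discrete example; a sketch (the distribution is illustrative, and exact entropies stand in for the sample-based MI estimators the paper uses):

```python
import numpy as np
from itertools import product

# Total correlation of three correlated binary variables, and one line-like
# decomposition into mutual-information terms:
#   TC(X1, X2, X3) = I(X1; X2) + I(X1, X2; X3)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# joint pmf: X1 ~ Bern(0.5), X2 copies X1 w.p. 0.9, X3 copies X2 w.p. 0.8
pmf = {}
for x1, x2, x3 in product([0, 1], repeat=3):
    p = 0.5
    p *= 0.9 if x2 == x1 else 0.1
    p *= 0.8 if x3 == x2 else 0.2
    pmf[(x1, x2, x3)] = p
joint = np.array([pmf[k] for k in sorted(pmf)]).reshape(2, 2, 2)

h1 = entropy(joint.sum((1, 2)))       # H(X1)
h2 = entropy(joint.sum((0, 2)))       # H(X2)
h3 = entropy(joint.sum((0, 1)))       # H(X3)
h12 = entropy(joint.sum(2).ravel())   # H(X1, X2)
h123 = entropy(joint.ravel())         # H(X1, X2, X3)

tc = h1 + h2 + h3 - h123              # definition of total correlation
i_12 = h1 + h2 - h12                  # I(X1; X2)
i_123 = h12 + h3 - h123               # I(X1, X2; X3)
print(np.isclose(tc, i_12 + i_123))   # True: the decomposition is exact
```

In the paper's setting the closed-form entropies are unavailable, so each MI term is replaced by a sample-based MI estimator, and the decomposition turns an intractable TC estimate into a sum of tractable pieces.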
https://proceedings.mlr.press/v206/bai23a.html
https://proceedings.mlr.press/v206/bai23a.htmlImproving Adversarial Robustness via Joint Classification and Multiple Explicit Detection ClassesThis work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to the “abstain” class. In this work, we show that such a provable framework can benefit by extension to networks with multiple explicit abstain classes, where the adversarial examples are adaptively assigned to those. We show that naïvely adding multiple abstain classes can lead to “model degeneracy”, then we propose a regularization approach and a training method to counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable standard vs. robust verified accuracy tradeoffs, outperforming state-of-the-art algorithms for various choices of number of abstain classes.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/baharlouei23a.html
https://proceedings.mlr.press/v206/baharlouei23a.htmlA Mini-Block Fisher Method for Deep Neural NetworksDeep Neural Networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, by preconditioning the stochastic gradient by layer-wise block-diagonal matrices. Here we propose a “mini-block Fisher (MBF)” preconditioned stochastic gradient method, that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF’s per-iteration computational cost is only slightly higher than it is for first-order methods. The performance of MBF is compared to that of several baseline methods, on Autoencoder, Convolutional Neural Network (CNN), and Graph Convolutional Network (GCN) problems, to validate its effectiveness both in terms of time efficiency and generalization power. Finally, it is proved that an idealized version of MBF converges linearly.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bahamou23a.html
https://proceedings.mlr.press/v206/bahamou23a.htmlTS-UCB: Improving on Thompson Sampling With Little to No Additional ComputationThompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from a deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.Tue, 11 Apr 2023 00:00:00 +0000
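For context, a minimal sketch of the Thompson sampling baseline for a Bernoulli K-armed bandit (the TS-UCB score combining posterior samples with upper confidence bounds is not reproduced here, since the abstract does not specify it; the arm means and horizon are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])        # unknown to the learner
alpha = np.ones(3)                            # Beta(1, 1) priors per arm
beta = np.ones(3)
pulls = np.zeros(3, dtype=int)

for _ in range(5000):
    theta = rng.beta(alpha, beta)             # one posterior sample per arm
    arm = int(np.argmax(theta))               # play the sampled-best arm
    reward = rng.random() < true_means[arm]   # Bernoulli reward
    alpha[arm] += reward                      # conjugate posterior update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)                                  # the best arm dominates
```

TS-UCB keeps the posterior sampling ingredient above but replaces the `argmax(theta)` selection with a score built from posterior samples and upper confidence bounds, at negligible extra cost.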
https://proceedings.mlr.press/v206/baek23a.html
https://proceedings.mlr.press/v206/baek23a.htmlDeep Value Function Networks for Large-Scale Multistage Stochastic ProgramsA neural-network-based stagewise decomposition algorithm, Deep Value Function Networks (DVFN), is proposed for large-scale multistage stochastic programming (MSP) problems. Traditional approaches such as nested Benders decomposition and its stochastic variant, stochastic dual dynamic programming (SDDP), approximate value functions as piecewise linear convex functions by gradually accumulating subgradient cuts from dual solutions of stagewise subproblems. Although they have been proven effective for linear problems, nonlinear problems may suffer from the increasing number of subgradient cuts as they proceed. A recently developed algorithm called Value Function Gradient Learning (VFGL) replaced the piecewise linear approximation with parametric function approximation, but its performance heavily depends upon the choice of parametric form, as is the case for most traditional parametric machine learning algorithms. On the other hand, DVFN approximates value functions using neural networks, which are known to have huge capacity in terms of their functional representations. The art of choosing an appropriate parametric form becomes a simple matter of hyperparameter search for neural networks. However, neural networks are non-convex in general, which can make the learning process unstable. We resolve this issue by using input convex neural networks that guarantee convexity with respect to inputs. We compare DVFN with SDDP and VFGL for solving large-scale linear and nonlinear MSP problems: production optimization and energy planning. Numerical examples clearly indicate that DVFN provides accurate and computationally efficient solutions.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/bae23a.html
https://proceedings.mlr.press/v206/bae23a.htmlGaussian Processes on Distributions based on Regularized Optimal TransportWe present a novel kernel over the space of probability measures based on the dual formulation of optimal regularized transport. We propose a Hilbertian embedding of the space of probabilities using their Sinkhorn potentials, which are solutions of the dual entropic relaxed optimal transport problem between the probabilities and a reference measure $\mathcal{U}$. We prove that this construction yields a valid kernel, using the Hilbert norms. We prove that the kernel enjoys theoretical properties such as universality and some invariances, while still being computationally feasible. Moreover, we provide theoretical guarantees on the behaviour of a Gaussian process based on this kernel. Empirical performance is compared with traditional choices of kernels for processes indexed by distributions.Tue, 11 Apr 2023 00:00:00 +0000
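The Sinkhorn potentials at the heart of this construction are easy to compute for discrete histograms; a minimal NumPy sketch of the standard Sinkhorn fixed-point iteration is shown below. The kernel construction itself (embedding via potentials against a fixed reference measure and taking Hilbert norms) is only described in the lead-in, not implemented here.

```python
import numpy as np

def sinkhorn_potentials(a, b, C, eps=0.1, iters=500):
    """Standard Sinkhorn iterations for entropy-regularized OT between
    histograms a and b with cost matrix C. Returns the dual potentials
    (f, g); the optimal plan is exp((f_i + g_j - C_ij) / eps)."""
    K = np.exp(-C / eps)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)             # enforce row marginals
        v = b / (K.T @ u)           # enforce column marginals
    return eps * np.log(u), eps * np.log(v)

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
f, g = sinkhorn_potentials(a, b, C)
plan = np.exp((f[:, None] + g[None, :] - C) / 0.1)
```

In the paper's setting one would fix `b` to be the reference measure and use `f` as the embedding of `a`; that step is an interpretation of the abstract, not verified code.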
https://proceedings.mlr.press/v206/bachoc23a.html
https://proceedings.mlr.press/v206/bachoc23a.htmlRandomized geometric tools for anomaly detection in stock marketsWe propose novel randomized geometric tools to detect low-volatility anomalies in stock markets, a principal problem in financial economics. Our modeling of the (detection) problem results in sampling and estimating the (relative) volume of geodesically non-convex and non-connected spherical patches that arise by intersecting a non-standard simplex with a sphere. To sample, we introduce two novel Markov Chain Monte Carlo (MCMC) algorithms that exploit the geometry of the problem and employ state-of-the-art continuous geometric random walks (such as Billiard walk and Hit-and-Run) adapted to spherical patches. To our knowledge, this is the first geometric formulation and MCMC-based analysis of the volatility puzzle in stock markets. We have implemented our algorithms in C++ (along with an R interface) and we illustrate the power of our approach by performing extensive experiments on real data. Our analyses provide accurate detection and new insights into the distribution of portfolios’ performance characteristics. Moreover, we use our tools to show that classical methods for low-volatility anomaly detection in finance are poor proxies that can lead to misleading or inaccurate results.Tue, 11 Apr 2023 00:00:00 +0000
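To make the "spherical patch" object concrete, the toy sketch below draws uniform points on the unit sphere and keeps those satisfying extra linear constraints (here a simple box, standing in for the simplex intersection). This naive rejection sampler is only a baseline for intuition; the paper's Billiard-walk and Hit-and-Run MCMC samplers are what make the problem tractable in higher dimensions.

```python
import numpy as np

def sample_spherical_patch(n, d, lo, hi, seed=0):
    """Naive rejection sampler: uniform points on the unit sphere in R^d
    whose coordinates also lie in [lo, hi]^d (a stand-in for the
    sphere/simplex patches in the paper). Hopeless in high dimension,
    which is why geometric MCMC walks are needed."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)                  # uniform on the unit sphere
        if np.all((x >= lo) & (x <= hi)):       # inside the patch constraints
            out.append(x)
    return np.array(out)

pts = sample_spherical_patch(100, 3, -0.9, 0.9)
```

The acceptance rate of this rejection scheme is exactly the (relative) patch volume being estimated, which illustrates why sampling and volume estimation are two faces of the same problem.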
https://proceedings.mlr.press/v206/bachelard23a.html
https://proceedings.mlr.press/v206/bachelard23a.htmlSecond Order Path Variationals in Non-Stationary Online LearningWe consider the problem of universal dynamic regret minimization under exp-concave and smooth losses. We show that appropriately designed Strongly Adaptive algorithms achieve a dynamic regret of $\tilde O(d^2 n^{1/5} [\mathcal{TV}_1(w_{1:n})]^{2/5} \vee d^2)$, where $n$ is the time horizon and $\mathcal{TV}_1(w_{1:n})$ a path variational based on second order differences of the comparator sequence. Such a path variational naturally encodes comparator sequences that are piece-wise linear – a powerful family that tracks a variety of non-stationarity patterns in practice (Kim et al., 2009). The aforementioned dynamic regret is shown to be optimal modulo dimension dependencies and poly-logarithmic factors of $n$. To the best of our knowledge, this path variational has not been studied in the non-stochastic online learning literature before. Our proof techniques rely on analysing the KKT conditions of the offline oracle and requires several non-trivial generalizations of the ideas in Baby and Wang (2021) where the latter work only implies an $\tilde{O}(n^{1/3})$ regret for the current problem.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/baby23a.html
https://proceedings.mlr.press/v206/baby23a.htmlOvercoming Prior Misspecification in Online Learning to RankThe recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/azizi23a.html
https://proceedings.mlr.press/v206/azizi23a.htmlTheoretically Grounded Loss Functions and Algorithms for Adversarial RobustnessAdversarial robustness is a critical property of classifiers in applications as they are increasingly deployed in complex real-world systems. Yet, achieving accurate adversarial robustness in machine learning remains a persistent challenge, and the choice of the surrogate loss function used for training is a key factor. We present a family of new loss functions for adversarial robustness, smooth adversarial losses, which we show can be derived in a general way from broad families of loss functions used in multi-class classification. We prove strong H-consistency theoretical guarantees for these loss functions, including multi-class H-consistency bounds for sum losses in the adversarial setting. We design new regularized algorithms based on the minimization of these principled smooth adversarial losses (PSAL). We further show through a series of extensive experiments with the CIFAR-10, CIFAR-100 and SVHN datasets that our PSAL algorithm consistently outperforms the current state-of-the-art technique, TRADES, for both robust accuracy against l-infinity-norm bounded perturbations and, even more significantly, for clean accuracy. Finally, we prove that, unlike PSAL, the TRADES loss in general does not admit an H-consistency property.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/awasthi23c.html
https://proceedings.mlr.press/v206/awasthi23c.htmlTheory and Algorithm for Batch Distribution Drift ProblemsWe study a problem of batch distribution drift motivated by several applications, which consists of determining an accurate predictor for a target time segment, for which a moderate amount of labeled samples are at one’s disposal, while leveraging past segments for which substantially more labeled samples are available. We give new algorithms for this problem guided by a new theoretical analysis and generalization bounds derived for this scenario. We further extend our results to the case where few or no labeled data is available for the period of interest. Finally, we report the results of extensive experiments demonstrating the benefits of our drifting algorithm, including comparisons with natural baselines. A by-product of our study is a principled solution to the problem of multiple-source adaptation with labeled source data and a moderate amount of target labeled data, which we briefly discuss and compare with.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/awasthi23b.html
https://proceedings.mlr.press/v206/awasthi23b.htmlFast Computation of Branching Process Transition Probabilities via ADMMBranching processes are a class of continuous-time Markov chains (CTMCs) prevalent for modeling stochastic population dynamics in ecology, biology, epidemiology, and many other fields. The transient or finite-time behavior of these systems is fully characterized by their transition probabilities. However, computing them requires marginalizing over all paths between endpoint-conditioned values, which often poses a computational bottleneck. Leveraging recent results that connect generating function methods to a compressed sensing framework, we recast this task from the lens of sparse optimization. We propose a new solution method using variable splitting; in particular, we derive closed form updates in a highly efficient ADMM algorithm. Notably, no matrix products—let alone inversions—are required at any step. This reduces computational cost by orders of magnitude over existing methods, and the algorithm is easily parallelizable and fairly insensitive to tuning parameters. A comparison to prior work is carried out in two applications to models of blood cell production and transposon evolution, showing that the proposed method is orders of magnitude more scalable than existing work.Tue, 11 Apr 2023 00:00:00 +0000
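The structural point (variable splitting giving closed-form, elementwise ADMM updates with no matrix products) can be seen on a textbook problem. The sketch below runs ADMM on l1-regularized denoising, where both proximal updates are elementwise; the paper's actual objective over generating-function coefficients is different, so this is only an illustration of the update pattern.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_l1_denoise(b, lam, rho=1.0, iters=200):
    """ADMM with variable splitting for min_x 0.5*||x - b||^2 + lam*||x||_1.
    Every update is closed-form and elementwise (no matrix products),
    mirroring the structure the paper exploits for branching processes."""
    x = np.zeros_like(b)
    z = np.zeros_like(b)
    u = np.zeros_like(b)
    for _ in range(iters):
        x = (b + rho * (z - u)) / (1.0 + rho)   # quadratic prox
        z = soft_threshold(x + u, lam / rho)    # l1 prox
        u = u + x - z                           # scaled dual update
    return z

x_hat = admm_l1_denoise(np.array([3.0, 0.1, -2.0]), lam=0.5)
```

For this separable objective the exact minimizer is the soft-thresholded input, so the ADMM iterates can be checked against `soft_threshold(b, lam)` directly.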
https://proceedings.mlr.press/v206/awasthi23a.html
https://proceedings.mlr.press/v206/awasthi23a.htmlComputing Abductive Explanations for Boosted TreesBoosted trees are a dominant class of ML models, exhibiting high accuracy. However, boosted trees are hardly intelligible, and this is a problem whenever they are used in safety-critical applications. Indeed, in such a context, provably sound explanations for the predictions made are expected. Recent work has shown how subset-minimal abductive explanations can be derived for boosted trees, using automated reasoning techniques. However, the generation of such well-founded explanations is intractable in the general case. To improve the scalability of their generation, we introduce the notion of tree-specific explanation for a boosted tree. We show that tree-specific explanations are provably sound abductive explanations that can be computed in polynomial time. We also explain how to derive a subset-minimal abductive explanation from a tree-specific explanation. Experiments on various datasets show the computational benefits of leveraging tree-specific explanations for deriving subset-minimal abductive explanations.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/audemard23a.html
https://proceedings.mlr.press/v206/audemard23a.htmlqEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian OptimizationPreferential Bayesian optimization (PBO) is a framework for optimizing a decision maker’s latent utility function using preference feedback. This work introduces the expected utility of the best option (qEUBO) as a novel acquisition function for PBO. When the decision maker’s responses are noise-free, we show that qEUBO is one-step Bayes optimal and thus equivalent to the popular knowledge gradient acquisition function. We also show that qEUBO enjoys an additive constant approximation guarantee to the one-step Bayes-optimal policy when the decision maker’s responses are corrupted by noise. We provide an extensive evaluation of qEUBO and demonstrate that it outperforms the state-of-the-art acquisition functions for PBO across many settings. Finally, we show that, under sufficient regularity conditions, qEUBO’s Bayesian simple regret converges to zero at a rate $o(1/n)$ as the number of queries, $n$, goes to infinity. In contrast, we show that simple regret under qEI, a popular acquisition function for standard BO often used for PBO, can fail to converge to zero. Enjoying superior performance, simple computation, and a grounded decision-theoretic justification, qEUBO is a promising acquisition function for PBO.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/astudillo23a.html
https://proceedings.mlr.press/v206/astudillo23a.htmlRoot Cause Identification for Collective Anomalies in Time Series given an Acyclic Summary Causal Graph with LoopsThis paper presents an approach for identifying the root causes of collective anomalies given observational time series and an acyclic summary causal graph which depicts an abstraction of causal relations present in a dynamic system at its normal regime. The paper first shows how the problem of root cause identification can be divided into many independent subproblems by grouping related anomalies using d-separation. Further, it shows how, under this setting, some root causes can be found directly from the graph and from the time of appearance of anomalies. Finally, it shows how the rest of the root causes can be found by comparing direct causal effects in the normal and in the anomalous regime. To this end, temporal adaptations of the back-door and the single-door criteria are introduced. Extensive experiments conducted on both simulated and real-world datasets demonstrate the effectiveness of the proposed method.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/assaad23a.html
https://proceedings.mlr.press/v206/assaad23a.htmlIs interpolation benign for random forest regression?Statistical wisdom suggests that very complex models, interpolating training data, will be poor at predicting unseen examples. Yet, this aphorism has been recently challenged by the identification of benign overfitting regimes, especially studied in the case of parametric models: generalization capabilities may be preserved despite high model complexity. While it is widely known that fully-grown decision trees interpolate and, in turn, have poor predictive performance, the same behavior has yet to be analyzed for Random Forests (RF). In this paper, we study the trade-off between interpolation and consistency for several types of RF algorithms. Theoretically, we prove that interpolation regimes and consistency cannot be achieved simultaneously for several non-adaptive RF. Since adaptivity seems to be the cornerstone to bring together interpolation and consistency, we study interpolating Median RF, which are proved to be consistent in the interpolating regime. This is the first result reconciling interpolation and consistency for RF, highlighting that the averaging effect introduced by feature randomization is a key mechanism, sufficient to ensure consistency in the interpolation regime and beyond. Numerical experiments show that Breiman’s RF are consistent while exactly interpolating, when no bootstrap step is involved. We theoretically control the size of the interpolation area, which converges fast enough to zero, giving a necessary condition for exact interpolation and consistency to occur in conjunction.Tue, 11 Apr 2023 00:00:00 +0000
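The "exact interpolation without bootstrap" phenomenon from the abstract is easy to observe directly: with `bootstrap=False` and fully-grown trees, every tree sees every training point and fits it exactly, so the forest's training error is zero. This sketch uses scikit-learn's standard `RandomForestRegressor`, not the paper's Median RF.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fully-grown trees without bootstrap: each tree contains every training
# point in a pure leaf, so the ensemble exactly interpolates the data --
# the regime whose consistency the paper analyzes.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)

rf = RandomForestRegressor(
    n_estimators=50,
    bootstrap=False,      # no resampling: every tree sees all points
    min_samples_leaf=1,   # grow trees to purity
    random_state=0,
).fit(X, y)

train_err = np.mean((rf.predict(X) - y) ** 2)  # exact interpolation
```

With `bootstrap=True` (the default) each tree misses roughly a third of the points, and the training error is no longer zero, matching the distinction the experiments in the paper draw.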
https://proceedings.mlr.press/v206/arnould23a.html
https://proceedings.mlr.press/v206/arnould23a.htmlVector Optimization with Stochastic Bandit FeedbackWe introduce vector optimization problems with stochastic bandit feedback, in which preferences among designs are encoded by a polyhedral ordering cone $C$. Our setup generalizes the best arm identification problem to vector-valued rewards by extending the concept of Pareto set beyond multi-objective optimization. We characterize the sample complexity of ($\epsilon,\delta$)-PAC Pareto set identification by defining a new cone-dependent notion of complexity, called the ordering complexity. In particular, we provide gap-dependent and worst-case lower bounds on the sample complexity and show that, in the worst-case, the sample complexity scales with the square of ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set, and the success of identification.Tue, 11 Apr 2023 00:00:00 +0000
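The cone-induced Pareto set mentioned above has a simple noiseless form: design $x$ dominates $y$ if $x - y$ is a nonzero element of the cone $C$. The sketch below computes this set for a polyhedral cone given by inequality rows (with the identity matrix it recovers the usual multi-objective Pareto front); it ignores the stochastic-feedback and PAC aspects that are the paper's actual subject.

```python
import numpy as np

def cone_pareto_set(rewards, A):
    """Indices of designs not dominated under the polyhedral ordering cone
    C = {v : A @ v >= 0}: x dominates y iff x - y is a nonzero element of C.
    With A = np.eye(d) this is the standard multi-objective Pareto set."""
    n = len(rewards)
    keep = []
    for i in range(n):
        dominated = any(
            j != i
            and np.all(A @ (rewards[j] - rewards[i]) >= 0)
            and np.any(rewards[j] != rewards[i])
            for j in range(n)
        )
        if not dominated:
            keep.append(i)
    return keep

R = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.4],
              [0.6, 0.5]])
front = cone_pareto_set(R, np.eye(2))  # design 2 is dominated by design 3
```

Narrowing or widening the cone (changing `A`) shrinks or grows the front, which is the dependence on $C$ the paper's experiments illustrate.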
https://proceedings.mlr.press/v206/ararat23a.html
https://proceedings.mlr.press/v206/ararat23a.htmlMixed-Effect Thompson SamplingA contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model where actions are related through multiple shared effect parameters. To explore efficiently using this structure, we propose Mixed-Effect Thompson Sampling (meTS) and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared effect parameters. The terms reflect the structure of our model and the quality of priors. Our theoretical findings are validated empirically using both synthetic and real-world problems. We also propose numerous extensions of practical interest. While they do not come with guarantees, they perform well empirically and show the generality of the proposed framework.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/aouali23a.html
https://proceedings.mlr.press/v206/aouali23a.htmlCombining Graphical and Algebraic Approaches for Parameter Identification in Latent Variable Structural Equation ModelsMeasurement error is ubiquitous in many variables “latent-to-observed” (L2O) transformation from the MIIV approach and develop an equivalent graphical L2O transformation that allows applying existing graphical criteria to latent parameters in SEMs. We combine L2O transformation with graphical instrumental variable criteria to obtain an efficient algorithm for non-iterative parameter identification in SEMs with latent variables. We prove that this graphical L2O transformation with the instrumental set criterion is equivalent to the state-of-the-art MIIV approach for SEMs, and show that it can lead to novel identification strategies when combined with other graphical criteria.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ankan23a.html
https://proceedings.mlr.press/v206/ankan23a.htmlFitting low-rank models on egocentrically sampled partial networksThe statistical modeling of random networks has been widely used to uncover interaction mechanisms in complex systems and to predict unobserved links in real-world networks. In many applications, network connections are collected via egocentric sampling: a subset of nodes is sampled first, after which all links involving this subset are recorded; all other information is missing. Compared with the assumption of “uniformly missing at random”, egocentrically sampled partial networks require specially designed modeling strategies. Current statistical methods are either computationally infeasible or based on intuitive designs without theoretical justification. Here, we propose an approach to fit general low-rank models for egocentrically sampled networks, which include several popular network models. This method is based on graph spectral properties and is computationally efficient for large-scale networks. It results in consistent recovery of missing subnetworks due to egocentric sampling for sparse networks. To our knowledge, this method offers the first theoretical guarantee for egocentric partial network estimation in the scope of low-rank models. We evaluate the technique on several synthetic and real-world networks and show that it delivers competitive performance in link prediction tasks.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/angus-chan23a.html
https://proceedings.mlr.press/v206/angus-chan23a.htmlClustering above Exponential Families with Tempered Exponential MeasuresThe link with exponential families has allowed k-means clustering to be generalized to a wide variety of data-generating distributions in exponential families and clustering distortions among Bregman divergences. Getting the framework to go beyond exponential families is important to lift roadblocks like the lack of robustness of some population minimizers, which is carved into their axiomatization. Current generalizations of exponential families like the q-exponential families or even the deformed exponential families fail at achieving the goal. In this paper, we provide a new attempt at getting a complete framework, grounded in a new generalization of exponential families that we introduce, called tempered exponential measures (TEMs). TEMs keep the maximum entropy axiomatization framework of q-exponential families, but instead of normalizing the measure, normalize a dual called a co-distribution. Numerous interesting properties arise for clustering, such as improved and controllable robustness for population minimizers, that keep a simple analytic form.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/amid23a.html
https://proceedings.mlr.press/v206/amid23a.htmlFixing by Mixing: A Recipe for Optimal Byzantine ML under HeterogeneityByzantine machine learning (ML) aims to ensure the resilience of distributed learning algorithms to misbehaving (or Byzantine) machines. Although this problem received significant attention, prior works often assume the data held by the machines to be homogeneous, which is seldom true in practical settings. Data heterogeneity makes Byzantine ML considerably more challenging, since a Byzantine machine can hardly be distinguished from a non-Byzantine outlier. A few solutions have been proposed to tackle this issue, but these provide suboptimal probabilistic guarantees and fare poorly in practice. This paper closes the theoretical gap, achieving optimality and inducing good empirical results. In fact, we show how to automatically adapt existing solutions for (homogeneous) Byzantine ML to the heterogeneous setting through a powerful mechanism we call nearest neighbor mixing (NNM), which boosts any standard robust distributed gradient descent variant to yield optimal Byzantine resilience under heterogeneity. We obtain similar guarantees (in expectation) by plugging NNM in the distributed stochastic heavy ball method, a practical substitute for distributed gradient descent. We obtain empirical results that significantly outperform state-of-the-art Byzantine ML solutions.Tue, 11 Apr 2023 00:00:00 +0000
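A minimal sketch of a nearest-neighbor-mixing step is shown below: each worker's gradient is replaced by the average of its closest gradients before any robust aggregation. The exact neighborhood size and distance used by the paper's NNM may differ; here each of `n` vectors is mixed with its `n - f` nearest neighbors (including itself), with `f` a presumed bound on the number of Byzantine workers.

```python
import numpy as np

def nearest_neighbor_mixing(grads, f):
    """Sketch of an NNM-style preprocessing step: replace each worker's
    gradient by the mean of its n - f nearest gradients (Euclidean
    distance, including itself), then hand the mixed vectors to any
    standard robust aggregator."""
    n = len(grads)
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    mixed = np.empty_like(grads)
    for i in range(n):
        nbrs = np.argsort(dists[i])[: n - f]   # the n - f closest workers
        mixed[i] = grads[nbrs].mean(axis=0)
    return mixed

# Three honest workers plus one Byzantine outlier.
g = np.array([[1.0, 1.0],
              [1.1, 0.9],
              [0.9, 1.1],
              [100.0, -100.0]])
mixed = nearest_neighbor_mixing(g, f=1)
```

After mixing, the honest workers' vectors concentrate around the honest mean, which is what makes a Byzantine machine distinguishable from a mere heterogeneous outlier.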
https://proceedings.mlr.press/v206/allouah23a.html
https://proceedings.mlr.press/v206/allouah23a.htmlUniversal Agent Mixtures and the Geometry of IntelligenceInspired by recent progress in multi-agent Reinforcement Learning (RL), in this work we examine the collective intelligent behaviour of theoretical universal agents by introducing a weighted mixture operation. Given a weighted set of agents, their weighted mixture is a new agent whose expected total reward in any environment is the corresponding weighted average of the original agents’ expected total rewards in that environment. Thus, if RL agent intelligence is quantified in terms of performance across environments, the weighted mixture’s intelligence is the weighted average of the original agents’ intelligence. This operation enables various interesting new theorems that shed light on the geometry of RL agent intelligence, namely: results about symmetries, convex agent-sets, and local extrema. We also show that any RL agent intelligence measure based on average performance across environments, subject to certain weak technical conditions, is identical (up to a constant factor) to performance within a single environment dependent on said intelligence measure.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/alexander23a.html
https://proceedings.mlr.press/v206/alexander23a.htmlLearning Robust Graph Neural Networks with Limited SupervisionGraph Neural Networks (GNNs) require a relatively large number of labeled nodes and a reliable/uncorrupted graph connectivity structure to obtain good performance on the semi-supervised node classification task. The performance of GNNs can degrade significantly as the number of labeled nodes decreases or the graph connectivity structure is corrupted by adversarial attacks or noise in data measurement/collection. Therefore, it is important to develop GNN models that are able to achieve good performance when there is limited supervision knowledge–a few labeled nodes and a noisy graph structure. In this paper, we propose a novel Dual GNN learning framework to address this challenging task. The proposed framework has two GNN based node prediction modules. The primary module uses the input graph structure to induce typical node embeddings and predictions with a regular GNN baseline, while the auxiliary module constructs a new graph structure through fine-grained spectral clustering and learns new node embeddings and predictions. By integrating the two modules in a dual GNN learning framework, we perform joint learning in an end-to-end fashion. This general framework can be applied on many GNN baseline models. The experimental results show that the proposed dual GNN framework can greatly outperform the GNN baseline methods and yield superior performance over many state-of-the-art methods when the labeled nodes are scarce and the graph connectivity structure is noisy.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/alchihabi23a.html
https://proceedings.mlr.press/v206/alchihabi23a.htmlAdapting to Latent Subgroup Shifts via Concepts and ProxiesWe address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/alabdulmohsin23a.html
https://proceedings.mlr.press/v206/alabdulmohsin23a.htmlConformalized Unconditional Quantile RegressionWe develop a predictive inference procedure that combines conformal prediction (CP) with unconditional quantile regression (QR), a commonly used tool in econometrics that involves regressing the recentered influence function (RIF) of the quantile functional over input covariates. Unlike the more widely known conditional QR, unconditional QR explicitly captures the impact of changes in covariate distribution on the quantiles of the marginal distribution of outcomes. Leveraging this property, our procedure issues adaptive predictive intervals with localized frequentist coverage guarantees. It operates by fitting a machine learning model for the RIFs using training data, and then applying the CP procedure for any test covariate with respect to a “hypothetical” covariate distribution localized around the new instance. Experiments show that our procedure is adaptive to heteroscedasticity, provides transparent coverage guarantees that are relevant to the test instance at hand, and performs competitively with existing methods in terms of efficiency.Tue, 11 Apr 2023 00:00:00 +0000
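For readers unfamiliar with the CP ingredient, the sketch below implements plain split conformal prediction with the usual finite-sample quantile correction: calibrate on held-out absolute residuals, then pad each test prediction by their empirical $(1-\alpha)$ quantile. The paper's contribution, localizing this via unconditional QR and RIFs, is not reproduced here.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_pred, alpha=0.1):
    """Plain split conformal prediction: given calibration absolute
    residuals, return marginal (1 - alpha)-coverage intervals around
    the point predictions y_pred."""
    n = len(resid_cal)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(resid_cal)[min(k, n) - 1]
    return y_pred - q, y_pred + q

rng = np.random.default_rng(0)
resid = np.abs(rng.normal(size=500))          # calibration residuals
lo, hi = split_conformal_interval(resid, y_pred=np.array([2.0]))
```

The resulting interval width is the same everywhere, which is exactly the non-adaptivity (e.g., under heteroscedasticity) that the localized procedure in the paper is designed to fix.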
https://proceedings.mlr.press/v206/alaa23a.html
https://proceedings.mlr.press/v206/alaa23a.htmlProbing Graph RepresentationsToday we have a good theoretical understanding of the representational power of Graph Neural Networks (GNNs). For example, their limitations have been characterized in relation to a hierarchy of Weisfeiler-Lehman (WL) isomorphism tests. However, we do not know what is encoded in the learned representations. This is our main question. We answer it using a probing framework to quantify the amount of meaningful information captured in graph representations. Our findings on molecular datasets show the potential of probing for understanding the inductive biases of graph-based models. We compare different families of models, and show that Graph Transformers capture more chemically relevant information compared to models based on message passing. We also study the effect of different design choices such as skip connections and virtual nodes. We advocate for probing as a useful diagnostic tool for evaluating and developing graph-based models.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/akhondzadeh23a.html
https://proceedings.mlr.press/v206/akhondzadeh23a.htmlGenerative Oversampling for Imbalanced Data via Majority-Guided VAELearning with imbalanced data is a challenging problem in deep learning. Over-sampling is a widely used technique to re-balance the sampling distribution of training data. However, most existing over-sampling methods only use intra-class information of minority classes to augment the data but ignore the inter-class relationships with the majority ones, which is prone to overfitting, especially when the imbalance ratio is large. To address this issue, we propose a novel over-sampling model, called Majority-Guided VAE (MGVAE), which generates new minority samples under the guidance of a majority-based prior. In this way, the newly generated minority samples can inherit the diversity and richness of the majority ones, thus mitigating overfitting in downstream tasks. Furthermore, to prevent model collapse under limited data, we first pre-train MGVAE on sufficient majority samples and then fine-tune based on minority samples with Elastic Weight Consolidation (EWC) regularization. Experimental results on benchmark image datasets and real-world tabular data show that MGVAE achieves competitive improvements over other over-sampling methods in downstream classification tasks, demonstrating the effectiveness of our method.Tue, 11 Apr 2023 00:00:00 +0000
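As a point of contrast, here is the baseline the abstract argues against: plain random over-sampling, which duplicates minority points using only intra-class information and therefore cannot add diversity. This is a standard baseline sketch, not any part of MGVAE itself.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Baseline random over-sampling: resample each minority class with
    replacement up to the majority-class count. Uses only intra-class
    information -- the limitation that majority-guided generation
    (as in MGVAE) is designed to address."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    for c, cnt in zip(classes, counts):
        if cnt < target:
            idx = rng.choice(np.flatnonzero(y == c), target - cnt, replace=True)
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.concatenate(Xs), np.concatenate(ys)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)   # 4:1 imbalance
Xb, yb = random_oversample(X, y)
```

Because every synthetic minority point is an exact copy, downstream models tend to overfit the few originals when the imbalance ratio is large, which is the failure mode motivating generative over-sampling.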
https://proceedings.mlr.press/v206/ai23a.html
https://proceedings.mlr.press/v206/ai23a.htmlSemantic Strengthening of Neuro-Symbolic LearningNumerous neuro-symbolic approaches have recently been proposed typically with the goal of adding symbolic knowledge to the output layer of a neural network. Ideally, such losses maximize the probability that the neural network’s predictions satisfy the underlying domain. Unfortunately, this type of probabilistic inference is often computationally infeasible. Neuro-symbolic approaches therefore commonly resort to fuzzy approximations of this probabilistic objective, sacrificing sound probabilistic semantics, or to sampling which is very seldom feasible. We approach the problem by first assuming the constraint decomposes conditioned on the features learned by the network. We iteratively strengthen our approximation, restoring the dependence between the constraints most responsible for degrading the quality of the approximation. This corresponds to computing the mutual information between pairs of constraints conditioned on the network’s learned features, and may be construed as a measure of how well aligned the gradients of two distributions are. We show how to compute this efficiently for tractable circuits. We test our approach on three tasks: predicting a minimum-cost path in Warcraft, predicting a minimum-cost perfect matching, and solving Sudoku puzzles, observing that it improves upon the baselines while sidestepping intractability.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ahmed23a.html
https://proceedings.mlr.press/v206/ahmed23a.htmlImproved Approximation for Fair Correlation ClusteringCorrelation clustering is a ubiquitous paradigm in unsupervised machine learning where addressing unfairness is a major challenge. Motivated by this, we study fair correlation clustering where the data points may belong to different protected groups and the goal is to ensure fair representation of all groups across clusters. Our paper significantly generalizes and improves on the quality guarantees of previous work of Ahmadian et al. as follows. * We allow the user to specify an arbitrary upper bound on the representation of each group in a cluster. * Our algorithm allows individuals to have multiple protected features and ensures fairness simultaneously across all of them. * We prove guarantees for clustering quality and fairness in this general setting. Furthermore, this improves on the results for the special cases studied in previous work. Our experiments on real-world data demonstrate that the quality of our clustering relative to the optimal solution is much better than our theoretical results suggest.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/ahmadian23a.html
https://proceedings.mlr.press/v206/ahmadian23a.htmlOn the bias of K-fold cross validation with stable learnersThis paper investigates the efficiency of the K-fold cross-validation (CV) procedure and a debiased version thereof as a means of estimating the generalization risk of a learning algorithm. We work under the general assumption of uniform algorithmic stability. We show that the K-fold risk estimate may not be consistent under such general stability assumptions, by constructing non vanishing lower bounds on the error in realistic contexts such as regularized empirical risk minimisation and stochastic gradient descent. We thus advocate the use of a debiased version of the K-fold and prove an error bound with exponential tail decay regarding this version. Our result is applicable to the large class of uniformly stable algorithms, contrarily to earlier works focusing on specific tasks such as density estimation. We illustrate the relevance of the debiased K-fold CV on a simple model selection problem and demonstrate empirically the usefulness of the promoted approach on real world classification and regression datasets.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/aghbalou23a.html
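To make the object of study concrete, here is a minimal sketch of the plain K-fold risk estimate whose bias the paper analyzes, instantiated with ridge regression as a canonical uniformly stable learner (its stability parameter scales roughly as 1/(lambda * n)). The paper's debiased estimator is not reproduced here; this is only the baseline being corrected.

```python
import numpy as np

def kfold_risk(X, y, fit, loss, K=5, seed=0):
    """Standard K-fold CV risk estimate: average held-out loss over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    risks = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        risks.append(loss(model, X[test], y[test]))
    return float(np.mean(risks))

def fit_ridge(X, y, lam=1.0):
    """Ridge regression: a uniformly stable learner."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sq_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))
```

Note that each fold trains on only n(1 - 1/K) points, which is one source of the bias the paper's debiased version addresses.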
https://proceedings.mlr.press/v206/aghbalou23a.htmlSample Complexity of Distinguishing Cause from EffectWe study the sample complexity of causal structure learning on a two-variable system with observational and experimental data. Specifically, for two variables $X$ and $Y$, we consider the classical scenario where either $X$ causes $Y$, $Y$ causes $X$, or there is an unmeasured confounder between $X$ and $Y$. Let $m_1$ be the number of observational samples of $(X,Y)$, and let $m_2$ be the number of interventional samples where either $X$ or $Y$ has been subject to an external intervention. We show that if $X$ and $Y$ are over a finite domain of size $k$ and are significantly correlated, the minimum $m_2$ needed is sublinear in $k$. Moreover, as $m_1$ grows, the minimum $m_2$ needed to identify the causal structure decreases. In fact, we can give a tight characterization of the tradeoff between $m_1$ and $m_2$ when $m_1 = O(k)$ or is sufficiently large. We build upon techniques for closeness testing when $m_1$ is small (e.g., sublinear in $k$), and for non-parametric density estimation when $m_2$ is large. Our hardness results are based on carefully constructing causal models whose marginal and interventional distributions form hard instances of canonical results on property testing.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/acharya23b.html
https://proceedings.mlr.press/v206/acharya23b.htmlDiscrete Distribution Estimation under User-level Local Differential PrivacyWe study discrete distribution estimation under user-level local differential privacy (LDP). In user-level $\varepsilon$-LDP, each user holds $m\ge1$ samples and the privacy of all $m$ samples must be preserved simultaneously. We resolve the following dilemma: on the one hand, having more samples per user should provide more information about the underlying distribution; on the other hand, guaranteeing the privacy of all $m$ samples should make the estimation task more difficult. We obtain tight bounds for this problem under almost all parameter regimes. Perhaps surprisingly, we show that in suitable parameter regimes, having $m$ samples per user is equivalent to having $m$ times more users, each with only one sample. Our results demonstrate interesting phase transitions for $m$ and the privacy parameter $\varepsilon$ in the estimation risk. Finally, connecting with recent results on shuffled DP, we show that combined with random shuffling, our algorithm leads to optimal error guarantees (up to logarithmic factors) under the central model of user-level DP in certain parameter regimes. We provide several simulations to verify our theoretical findings.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/acharya23a.html
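As a baseline for the single-sample case ($m=1$), a standard item-level LDP mechanism is $k$-ary randomized response. The sketch below privatizes each user's symbol and debiases the empirical frequencies; it is only an illustrative $m=1$ baseline, not the paper's user-level mechanism for general $m$.

```python
import numpy as np

def krr_privatize(x, k, eps, rng):
    """k-ary randomized response: report the true symbol with probability
    e^eps / (e^eps + k - 1), otherwise a uniformly random other symbol."""
    p_true = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p_true:
        return x
    other = rng.integers(k - 1)      # uniform over the k-1 other symbols
    return other if other < x else other + 1

def krr_estimate(reports, k, eps):
    """Unbiased distribution estimate from privatized reports:
    E[freq_j] = q + (p - q) * theta_j, so invert the affine map."""
    freq = np.bincount(reports, minlength=k) / len(reports)
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = 1.0 / (np.exp(eps) + k - 1)
    return (freq - q) / (p - q)
```

The abstract's equivalence result says that, in suitable regimes, carefully using all $m$ samples of one user buys as much as $m$ independent users each running a mechanism like this one.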
https://proceedings.mlr.press/v206/acharya23a.htmlLast-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum GamesThis paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings. In the former, players observe their exact gradient vectors of the utility functions. In the latter, they only observe noisy gradient vectors. Even the celebrated Multiplicative Weights Update (MWU) and Optimistic MWU (OMWU) algorithms may not converge to a Nash equilibrium with noisy feedback. In contrast, M2WU exhibits last-iterate convergence to a stationary point near a Nash equilibrium in both feedback settings. We then prove that it converges to an exact Nash equilibrium by iteratively adapting the mutation term. We empirically confirm that M2WU outperforms MWU and OMWU in exploitability and convergence rates.Tue, 11 Apr 2023 00:00:00 +0000
https://proceedings.mlr.press/v206/abe23a.html
https://proceedings.mlr.press/v206/abe23a.html
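The flavor of a mutation-driven multiplicative-weights update can be sketched as follows. This is an illustrative variant under full (exact-gradient) feedback, with a mutation term that pulls each player toward the uniform strategy and damps MWU's well-known cycling; the exact M2WU update and its mutation-adaptation scheme in the paper may differ.

```python
import numpy as np

def mwu_with_mutation(A, eta=0.1, mu=0.1, T=4000, x0=None, y0=None):
    """MWU for a zero-sum game with payoff matrix A (row player maximizes
    x^T A y), plus a replicator-mutation-style term toward uniform play.
    Illustrative sketch, not the paper's exact M2WU update."""
    n, m = A.shape
    x = np.ones(n) / n if x0 is None else np.asarray(x0, dtype=float)
    y = np.ones(m) / m if y0 is None else np.asarray(y0, dtype=float)
    for _ in range(T):
        # full feedback: exact payoff gradients, plus mutation toward uniform
        gx = A @ y + mu * (1.0 / n - x) / np.maximum(x, 1e-12)
        gy = -(A.T @ x) + mu * (1.0 / m - y) / np.maximum(y, 1e-12)
        x = x * np.exp(eta * gx); x /= x.sum()
        y = y * np.exp(eta * gy); y /= y.sum()
    return x, y
```

On matching pennies, plain MWU orbits the equilibrium, while the mutation term makes the last iterate spiral in toward it, which is the qualitative behavior the abstract describes.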