0$, we prove that every set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open source code are provided. }
}
@InProceedings{pmlr-v151-nikitin22a,
title = { Non-separable Spatio-temporal Graph Kernels via SPDEs },
author = {Nikitin, Alexander V. and John, St and Solin, Arno and Kaski, Samuel},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10640--10660},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/nikitin22a/nikitin22a.pdf},
url = {https://proceedings.mlr.press/v151/nikitin22a.html},
abstract = { Gaussian processes (GPs) provide a principled and direct approach for inference and learning on graphs. However, the lack of justified graph kernels for spatio-temporal modelling has held back their use in graph problems. We leverage an explicit link between stochastic partial differential equations (SPDEs) and GPs on graphs, introduce a framework for deriving graph kernels via SPDEs, and derive non-separable spatio-temporal graph kernels that capture interaction across space and time. We formulate the graph kernels for the stochastic heat equation and wave equation. We show that by providing novel tools for spatio-temporal GP modelling on graphs, we outperform pre-existing graph kernels in real-world applications that feature diffusion, oscillation, and other complicated interactions. }
}
@InProceedings{pmlr-v151-moskovitz22a,
title = { Towards an Understanding of Default Policies in Multitask Policy Optimization },
author = {Moskovitz, Ted and Arbel, Michael and Parker-Holder, Jack and Pacchiano, Aldo},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10661--10686},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/moskovitz22a/moskovitz22a.pdf},
url = {https://proceedings.mlr.press/v151/moskovitz22a.html},
abstract = { Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees. }
}
@InProceedings{pmlr-v151-kviman22a,
title = { Multiple Importance Sampling ELBO and Deep Ensembles of Variational Approximations },
author = {Kviman, Oskar and Melin, Harald and Koptagel, Hazal and Elvira, Victor and Lagergren, Jens},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10687--10702},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/kviman22a/kviman22a.pdf},
url = {https://proceedings.mlr.press/v151/kviman22a.html},
abstract = { In variational inference (VI), the marginal log-likelihood is estimated using the standard evidence lower bound (ELBO), or improved versions as the importance weighted ELBO (IWELBO). We propose the multiple importance sampling ELBO (MISELBO), a versatile yet simple framework. MISELBO is applicable in both amortized and classical VI, and it uses ensembles, e.g., deep ensembles, of independently inferred variational approximations. As far as we are aware, the concept of deep ensembles in amortized VI has not previously been established. We prove that MISELBO provides a tighter bound than the average of standard ELBOs, and demonstrate empirically that it gives tighter bounds than the average of IWELBOs. MISELBO is evaluated in density-estimation experiments that include MNIST and several real-data phylogenetic tree inference problems. First, on the MNIST dataset, MISELBO boosts the density-estimation performances of a state-of-the-art model, nouveau VAE. Second, in the phylogenetic tree inference setting, our framework enhances a state-of-the-art VI algorithm that uses normalizing flows. On top of the technical benefits of MISELBO, it allows to unveil connections between VI and recent advances in the importance sampling literature, paving the way for further methodological advances. We provide our code at https://github.com/Lagergren-Lab/MISELBO. }
}
@InProceedings{pmlr-v151-nix22a,
title = { Can Functional Transfer Methods Capture Simple Inductive Biases? },
author = {Nix, Arne and Shrinivasan, Suhas and Walker, Edgar Y. and Sinz, Fabian},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10703--10717},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/nix22a/nix22a.pdf},
url = {https://proceedings.mlr.press/v151/nix22a.html},
abstract = { Transferring knowledge embedded in trained neural networks is a core problem in areas like model compression and continual learning. Among knowledge transfer approaches, functional transfer methods such as knowledge distillation and representational distance learning are particularly promising, since they allow for transferring knowledge across different architectures and tasks. Considering various characteristics of networks that are desirable to transfer, equivariance is a notable property that enables a network to capture valuable relationships in the data. We assess existing functional transfer methods on their ability to transfer equivariance and empirically show that they fail to even transfer shift equivariance, one of the simplest equivariances. Further theoretical analysis demonstrates that representational similarity methods, in fact, cannot guarantee the transfer of the intended equivariance. Motivated by these findings, we develop a novel transfer method that learns an equivariance model from a given teacher network and encourages the student network to acquire the same equivariance, via regularization. Experiments show that our method successfully transfers equivariance even in cases where highly restrictive methods, such as directly matching student and teacher representations, fail. }
}
@InProceedings{pmlr-v151-emmenegger22a,
title = { On the Oracle Complexity of Higher-Order Smooth Non-Convex Finite-Sum Optimization },
author = {Emmenegger, Nicolas and Kyng, Rasmus and Zehmakan, Ahad N.},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10718--10752},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/emmenegger22a/emmenegger22a.pdf},
url = {https://proceedings.mlr.press/v151/emmenegger22a.html},
abstract = { We prove lower bounds for higher-order methods in smooth non-convex finite-sum optimization. Our contribution is threefold: We first show that a deterministic algorithm cannot profit from the finite-sum structure of the objective and that simulating a pth-order regularized method on the whole function by constructing exact gradient information is optimal up to constant factors. We further show lower bounds for randomized algorithms and compare them with the best-known upper bounds. To address some gaps between the bounds, we propose a new second-order smoothness assumption that can be seen as an analogue of the first-order mean-squared smoothness assumption. We prove that it is sufficient to ensure state-of-the-art convergence guarantees while allowing for a sharper lower bound. }
}
@InProceedings{pmlr-v151-bergamin22a,
title = { Model-agnostic out-of-distribution detection using combined statistical tests },
author = {Bergamin, Federico and Mattei, Pierre-Alexandre and Drachmann Havtorn, Jakob and S\'en\'etaire, Hugo and Schmutz, Hugo and Maal{\o}e, Lars and Hauberg, Soren and Frellsen, Jes},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10753--10776},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/bergamin22a/bergamin22a.pdf},
url = {https://proceedings.mlr.press/v151/bergamin22a.html},
abstract = { We present simple methods for out-of-distribution detection using a trained generative model. These techniques, based on classical statistical tests, are model-agnostic in the sense that they can be applied to any differentiable generative model. The idea is to combine a classical parametric test (Rao’s score test) with the recently introduced typicality test. These two test statistics are both theoretically well-founded and exploit different sources of information based on the likelihood for the typicality test and its gradient for the score test. We show that combining them using Fisher’s method overall leads to a more accurate out-of-distribution test. We also discuss the benefits of casting out-of-distribution detection as a statistical testing problem, noting in particular that false positive rate control can be valuable for practical out-of-distribution detection. Despite their simplicity and generality, these methods can be competitive with model-specific out-of-distribution detection algorithms without any assumptions on the out-distribution. }
}
@InProceedings{pmlr-v151-cheng22a,
title = { Generalized Group Testing },
author = {Cheng, Xiwei and Jaggi, Sidharth and Zhou, Qiaoqiao},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10777--10835},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/cheng22a/cheng22a.pdf},
url = {https://proceedings.mlr.press/v151/cheng22a.html},
abstract = { In the problem of classical group testing one aims to identify a small subset (of size $d$) diseased individuals/defective items in a large population (of size $n$) via a minimal number of suitably-designed group tests on subsets of items, where the test outcome is positive iff the given test contains at least one defective item. Motivated by physical considerations, we consider a generalized setting that includes as special cases multiple other group-testing-like models in the literature. In our setting, which subsumes as special cases a variety of noiseless and noisy group-testing models in the literature, the test outcome is positive with probability $f(x)$, where $x$ is the number of defectives tested in a pool, and $f(\cdot)$ is an arbitrary {\it monotonically increasing} (stochastic) test function. Our main contributions are as follows. 1. We present a non-adaptive scheme that with probability $1-\varepsilon$ identifies all defective items. Our scheme requires at most ${\cal O}( H(f) d\log(n/\varepsilon))$ tests, where $H(f)$ is a suitably defined “sensitivity parameter" of $f(\cdot)$, and is never larger than ${\cal O}(d^{1+o(1)})$, but may be substantially smaller for many $f(\cdot)$. 2. We argue that any non-adaptive group testing scheme needs at least $\Omega (h(f) d\log(n/d))$ tests to ensure high reliability recovery. Here $h(f)$ is a suitably defined “concentration parameter" of $f(\cdot)$, and $h(f) \in \Omega{(1)}$. 3. We prove that our sample-complexity bounds for generalized group testing are information-theoretically near-optimal for a variety of sparse-recovery group-testing models in the literature. That is, for {\it any} “noisy" test function $f(\cdot)$ (i.e. $0< f(0) < f(d) <1$), and for a variety of “(one-sided) noiseless" test functions $f(\cdot)$ (i.e., either $f(0)=0$, or $f(d)=1$, or both) studied in the literature we show that $H(f)/h(f) \in \Theta(1)$. As a by-product we tightly characterize the heretofore open information-theoretic sample-complexity for the well-studied model of threshold group-testing. For general (near)-noiseless test functions $f(\cdot)$ we show that $H(f)/h(f) \in {\cal O}(d^{1+o(1)})$. We also demonstrate a “natural" test-function $f(\cdot)$ whose sample complexity scales “extremally" as $\Theta ( d^2\log(n))$, rather than $\Theta ( d\log(n))$ as in the case of classical group-testing. Some of our techniques may be of independent interest – in particular our achievability requires a delicate saddle-point approximation, and our impossibility proof relies on a novel bound relating the mutual information of pair of random variables with the mean and variance of a specific function, and we derive novel structural results about monotone functions. }
}
@InProceedings{pmlr-v151-liu22g,
title = { Adaptive A/B Test on Networks with Cluster Structures },
author = {Liu, Yang and Zhou, Yifan and Li, Ping and Hu, Feifang},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10836--10851},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/liu22g/liu22g.pdf},
url = {https://proceedings.mlr.press/v151/liu22g.html},
abstract = { Units in online A/B tests are often involved in social networks. Thus, their outcomes may depend on the treatment of their neighbors. Many of such networks exhibit certain cluster structures allowing the use of these features in the design to reduce the bias from network interference. When the average treatment effect (ATE) is considered from the individual perspective, conditions for the valid estimation restrict the use of these features in the design. We show that such restrictions can be alleviated if the ATE from the cluster perspective is considered. Using an illustrative example, we further show that the weights employed by the Horvitz-Thompson estimator may not appropriately accommodate the network structure, and purely relying on graph-cluster randomization may generate very unbalanced cluster-treated structures across the treatment arms. The measures of such structures for one cluster may depend on the treatment of other clusters and pose a great challenge for the design of A/B tests. To address these issues, we propose a rerandomized-adaptive randomization to balance the clusters and a cluster-adjusted estimator to alleviate the problem of the weights. Numerical studies are conducted to demonstrate the usage of the proposed procedure. }
}
@InProceedings{pmlr-v151-chierichetti22a,
title = { Spectral Robustness for Correlation Clustering Reconstruction in Semi-Adversarial Models },
author = {Chierichetti, Flavio and Panconesi, Alessandro and Re, Giuseppe and Trevisan, Luca},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10852--10880},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/chierichetti22a/chierichetti22a.pdf},
url = {https://proceedings.mlr.press/v151/chierichetti22a.html},
abstract = { Correlation Clustering is an important clustering problem with many applications. We study the reconstruction version of this problem, in which one seeks to reconstruct a latent clustering that has been corrupted by random noise and adversarial modifications. Concerning the latter, there is a standard "post-adversarial" model in the literature, in which adversarial modifications come after the noise. Here, we introduce and analyse a "pre-adversarial" model, in which adversarial modifications come before the noise. Given an input coming from such a semi-adversarial generative model, the goal is to approximately reconstruct with high probability the latent clustering. We focus on the case where the hidden clusters have nearly equal size and show the following. In the pre-adversarial setting, spectral algorithms are optimal, in the sense that they reconstruct all the way to the information-theoretic threshold beyond which no reconstruction is possible. This is in contrast to the post-adversarial setting, in which their ability to restore the hidden clusters stops before the threshold, but the gap is optimally filled by SDP-based algorithms. These results highlight a heretofore unknown robustness of spectral algorithms, showing them less brittle than previously thought. }
}
@InProceedings{pmlr-v151-sekhon22a,
title = { Beyond Data Samples: Aligning Differential Networks Estimation with Scientific Knowledge },
author = {Sekhon, Arshdeep and Wang, Zhe and Qi, Yanjun},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10881--10923},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/sekhon22a/sekhon22a.pdf},
url = {https://proceedings.mlr.press/v151/sekhon22a.html},
abstract = { Learning the differential statistical dependency network between two contexts is essential for many real-life applications, mostly in the high dimensional low sample regime. In this paper, we propose a novel differential network estimator that allows integrating various sources of knowledge beyond data samples. The proposed estimator is scalable to a large number of variables and achieves a sharp asymptotic convergence rate. Empirical experiments on extensive simulated data and four real-world applications (one on neuroimaging and three from functional genomics) show that our approach achieves improved differential network estimation and provides better supports to downstream tasks like classification. Our results highlight significant benefits of integrating group, spatial and anatomic knowledge during differential genetic network identification and brain connectome change discovery. }
}
@InProceedings{pmlr-v151-li22g,
title = { Two-way Sparse Network Inference for Count Data },
author = {Li, Sijia and L\'opez-Garc{\'\i}a, Mart{\'\i}n and Lawrence, Neil D. and Cutillo, Luisa},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10924--10938},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/li22g/li22g.pdf},
url = {https://proceedings.mlr.press/v151/li22g.html},
abstract = { Classically, statistical datasets have a larger number of data points than features ($n > p$). The standard model of classical statistics caters for the case where data points are considered conditionally independent given the parameters. However, for $n \approx p$ or $p > n$ such models are poorly determined. Kalaitzis et al. (2013) introduced the Bigraphical Lasso, an estimator for sparse precision matrices based on the Cartesian product of graphs. Unfortunately, the original Bigraphical Lasso algorithm is not applicable in case of large $p$ and $n$ due to memory requirements. We exploit eigenvalue decomposition of the Cartesian product graph to present a more efficient version of the algorithm which reduces memory requirements from $O(n^2p^2)$ to $O(n^2 +p^2)$. Many datasets in different application fields, such as biology, medicine and social science, come with count data, for which Gaussian based models are not applicable. Our multiway network inference approach can be used for discrete data. Our methodology accounts for the dependencies across both instances and features, reduces the computational complexity for high dimensional data and enables to deal with both discrete and continuous data. Numerical studies on both synthetic and real datasets are presented to showcase the performance of our method. }
}
@InProceedings{pmlr-v151-zhou22e,
title = { Rapid Convergence of Informed Importance Tempering },
author = {Zhou, Quan and Smith, Aaron},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10939--10965},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/zhou22e/zhou22e.pdf},
url = {https://proceedings.mlr.press/v151/zhou22e.html},
abstract = { Informed Markov chain Monte Carlo (MCMC) methods have been proposed as scalable solutions to Bayesian posterior computation on high-dimensional discrete state spaces, but theoretical results about their convergence behavior in general settings are lacking. In this article, we propose a class of MCMC schemes called informed importance tempering (IIT), which combine importance sampling and informed local proposals, and derive generally applicable spectral gap bounds for IIT estimators. Our theory shows that IIT samplers have remarkable scalability when the target posterior distribution concentrates on a small set. Further, both our theory and numerical experiments demonstrate that the informed proposal should be chosen with caution: the performance may be very sensitive to the shape of the target distribution. We find that the “square-root proposal weighting” scheme tends to perform well in most settings. }
}
@InProceedings{pmlr-v151-cohen-karlik22a,
title = { On the Implicit Bias of Gradient Descent for Temporal Extrapolation },
author = {Cohen-Karlik, Edo and Ben David, Avichai and Cohen, Nadav and Globerson, Amir},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10966--10981},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/cohen-karlik22a/cohen-karlik22a.pdf},
url = {https://proceedings.mlr.press/v151/cohen-karlik22a.html},
abstract = { When using recurrent neural networks (RNNs) it is common practice to apply trained models to sequences longer than those seen in training. This “extrapolating” usage deviates from the traditional statistical learning setup where guarantees are provided under the assumption that train and test distributions are identical. Here we set out to understand when RNNs can extrapolate, focusing on a simple case where the data generating distribution is memoryless. We first show that even with infinite training data, there exist RNN models that interpolate perfectly (i.e., they fit the training data) yet extrapolate poorly to longer sequences. We then show that if gradient descent is used for training, learning will converge to perfect extrapolation under certain assumptions on initialization. Our results complement recent studies on the implicit bias of gradient descent, showing that it plays a key role in extrapolation when learning temporal prediction models. }
}
@InProceedings{pmlr-v151-ghosh22a,
title = { Differentiable Bayesian inference of SDE parameters using a pathwise series expansion of Brownian motion },
author = {Ghosh, Sanmitra and Birrell, Paul J. and De Angelis, Daniela},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10982--10998},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/ghosh22a/ghosh22a.pdf},
url = {https://proceedings.mlr.press/v151/ghosh22a.html},
abstract = { By invoking a pathwise series expansion of Brownian motion, we propose to approximate a stochastic differential equation (SDE) with an ordinary differential equation (ODE). This allows us to reformulate Bayesian inference for a SDE as the parameter estimation task for an ODE. Unlike a nonlinear SDE, the likelihood for an ODE model is tractable and its gradient can be obtained using adjoint sensitivity analysis. This reformulation allows us to use an efficient sampler, such as NUTS, that rely on the gradient of the log posterior. Applying the reparameterisation trick, variational inference can also be used for the same estimation task. We illustrate the proposed method on a variety of SDE models. We obtain similar parameter estimates when compared to data augmentation techniques. }
}
@InProceedings{pmlr-v151-pavutnitskiy22a,
title = { Quadric Hypersurface Intersection for Manifold Learning in Feature Space },
author = {Pavutnitskiy, Fedor and Ivanov, Sergei O. and Abramov, Evgeniy and Borovitskiy, Viacheslav and Klochkov, Artem and Vyalov, Viktor and Zaikovskii, Anatolii and Petiushko, Aleksandr},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {10999--11013},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/pavutnitskiy22a/pavutnitskiy22a.pdf},
url = {https://proceedings.mlr.press/v151/pavutnitskiy22a.html},
abstract = { The knowledge that data lies close to a particular submanifold of the ambient Euclidean space may be useful in a number of ways. For instance, one may want to automatically mark any point far away from the submanifold as an outlier or to use the geometry to come up with a better distance metric. Manifold learning problems are often posed in a very high dimension, e.g. for spaces of images or spaces of words. Today, with deep representation learning on the rise in areas such as computer vision and natural language processing, many problems of this kind may be transformed into problems of moderately high dimension, typically of the order of hundreds. Motivated by this, we propose a manifold learning technique suitable for moderately high dimension and large datasets. The manifold is learned from the training data in the form of an intersection of quadric hypersurfaces—simple but expressive objects. At test time, this manifold can be used to introduce a computationally efficient outlier score for arbitrary new data points and to improve a given similarity metric by incorporating the learned geometric structure into it. }
}
@InProceedings{pmlr-v151-wu22f,
title = { Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons },
author = {Wu, Yue and Jin, Tao and Lou, Hao and Xu, Pan and Farnoud, Farzad and Gu, Quanquan},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11014--11036},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/wu22f/wu22f.pdf},
url = {https://proceedings.mlr.press/v151/wu22f.html},
abstract = { In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus, a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from multiple users and improves the users’ average accuracy by maintaining an active set of users. We prove that our algorithm can return the true ranking of items with high probability. We also provide a sample complexity bound for the proposed algorithm, which outperforms the non-active strategies in the literature and close to oracle under mild conditions. Experiments are provided to show the empirical advantage of the proposed methods over the state-of-the-art baselines. }
}
@InProceedings{pmlr-v151-bruns-smith22a,
title = { Outcome Assumptions and Duality Theory for Balancing Weights },
author = {Bruns-Smith, David A. and Feller, Avi},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11037--11055},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/bruns-smith22a/bruns-smith22a.pdf},
url = {https://proceedings.mlr.press/v151/bruns-smith22a.html},
abstract = { We study balancing weight estimators, which reweight outcomes from a source population to estimate missing outcomes in a target population. These estimators minimize the worst-case error by making an assumption about the outcome model. In this paper, we show that this outcome assumption has two immediate implications. First, we can replace the minimax optimization problem for balancing weights with a simple convex loss over the assumed outcome function class. Second, we can replace the commonly-made overlap assumption with a more appropriate quantitative measure, the minimum worst-case bias. Finally, we show conditions under which the weights remain robust when our assumptions on the outcomes are wrong. }
}
@InProceedings{pmlr-v151-ariafar22a,
title = { Predicting the utility of search spaces for black-box optimization: a simple, budget-aware approach },
author = {Ariafar, Setareh and Gilmer, Justin and Nado, Zachary and Snoek, Jasper and Jenatton, Rodolphe and Dahl, George},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11056--11071},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/ariafar22a/ariafar22a.pdf},
url = {https://proceedings.mlr.press/v151/ariafar22a.html},
abstract = { Black box optimization requires specifying a search space to explore for solutions, e.g. a d-dimensional compact space, and this choice is critical for getting the best results at a reasonable budget. Unfortunately, determining a high quality search space can be challenging in many applications. For example, when tuning hyperparameters for machine learning pipelines on a new problem given a limited budget, one must strike a balance between excluding potentially promising regions and keeping the search space small enough to be tractable. The goal of this work is to motivate—through example applications in tuning deep neural networks—the problem of predicting the quality of search spaces conditioned on budgets, as well as to provide a simple scoring method based on a utility function applied to a probabilistic response surface model, similar to Bayesian optimization. We show that the method we present can compute meaningful budget-conditional scores in a variety of situations. We also provide experimental evidence that accurate scores can be useful in constructing and pruning search spaces. Ultimately, we believe scoring search spaces should become standard practice in the experimental workflow for deep learning. }
}
@InProceedings{pmlr-v151-yu22c,
title = { Learning from Multiple Noisy Partial Labelers },
author = {Yu, Peilin and Ding, Tiffany and Bach, Stephen H.},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11072--11095},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/yu22c/yu22c.pdf},
url = {https://proceedings.mlr.press/v151/yu22c.html},
abstract = { Programmatic weak supervision creates models without hand-labeled training data by combining the outputs of heuristic labelers. Existing frameworks make the restrictive assumption that labelers output a single class label. Enabling users to create partial labelers that output subsets of possible class labels would greatly expand the expressivity of programmatic weak supervision. We introduce this capability by defining a probabilistic generative model that can estimate the underlying accuracies of multiple noisy partial labelers without ground truth labels. We show how to scale up learning, for example learning on 100k examples in one minute, a 300$\times$ speed up compared to a naive implementation. We also prove that this class of models is generically identifiable up to label swapping under mild conditions. We evaluate our framework on three text classification and six object classification tasks. On text tasks, adding partial labels increases average accuracy by 8.6 percentage points. On image tasks, we show that partial labels allow us to approach some zero-shot object classification problems with programmatic weak supervision by using class attributes as partial labelers. On these tasks, our framework has accuracy comparable to recent embedding-based zero-shot learning methods, while using only pre-trained attribute detectors. }
}
@InProceedings{pmlr-v151-fasoulakis22a,
title = { Forward Looking Best-Response Multiplicative Weights Update Methods for Bilinear Zero-sum Games },
author = {Fasoulakis, Michail and Markakis, Evangelos and Pantazis, Yannis and Varsos, Constantinos},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11096--11117},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/fasoulakis22a/fasoulakis22a.pdf},
url = {https://proceedings.mlr.press/v151/fasoulakis22a.html},
abstract = { Our work focuses on extra gradient learning algorithms for finding Nash equilibria in bilinear zero-sum games. The proposed method, which can be formally considered as a variant of Optimistic Mirror Descent (Mertikopoulos et al., 2019), uses a large learning rate for the intermediate gradient step which essentially leads to computing (approximate) best response strategies against the profile of the previous iteration. Although counter-intuitive at first sight due to the irrationally large, for an iterative algorithm, intermediate learning step, we prove that the method guarantees last-iterate convergence to an equilibrium. Particularly, we show that the algorithm reaches first an $\eta^{1/\rho}$-approximate Nash equilibrium, with $\rho > 1$, by decreasing the Kullback-Leibler divergence of each iterate by at least $\Omega(\eta^{1+\frac{1}{\rho}})$, for sufficiently small learning rate $\eta$, until the method becomes a contracting map, and converges to the exact equilibrium. Furthermore, we perform experimental comparisons with the optimistic variant of the multiplicative weights update method, by Daskalakis and Panageas (2019) and show that our algorithm has significant practical potential since it offers substantial gains in terms of accelerated convergence. }
}
@InProceedings{pmlr-v151-pedrood22a,
title = { Hypergraph Simultaneous Generators },
author = {Pedrood, Bahman and Domeniconi, Carlotta and Laskey, Kathryn},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11118--11130},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/pedrood22a/pedrood22a.pdf},
url = {https://proceedings.mlr.press/v151/pedrood22a.html},
abstract = { Generative models for affiliation networks condition the edges on the membership of their nodes to communities. The problem of community detection under these models is addressed by inferring the membership parameters from the network structure. Current models make several unrealistic assumptions to make the inference feasible, and are mostly designed to work on regular graphs that cannot handle multi-way connections between nodes. While the models designed for hypergraphs attempt to capture the latter, they add further strict assumptions on the structure and size of hyperedges and are usually computationally intractable for real data. This paper proposes an efficient probabilistic generative model for detecting overlapping communities that process hyperedges without any changes or restrictions on their size. Our model represents the entire state space of the hyperedges, which is exponential in the number of nodes. We develop a mathematical computation reduction scheme that reduces the inference time to linear in the volume of the hypergraph without sacrificing precision. Our experimental results validate the effectiveness and scalability of our model and demonstrate the superiority of our approach over state-of-the-art community detection methods. }
}
@InProceedings{pmlr-v151-dyer22a,
title = { Amortised Likelihood-free Inference for Expensive Time-series Simulators with Signatured Ratio Estimation },
author = {Dyer, Joel and Cannon, Patrick W. and Schmon, Sebastian M.},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11131--11144},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/dyer22a/dyer22a.pdf},
url = {https://proceedings.mlr.press/v151/dyer22a.html},
abstract = { Simulation models of complex dynamics in the natural and social sciences commonly lack a tractable likelihood function, rendering traditional likelihood-based statistical inference impossible. Recent advances in machine learning have introduced novel algorithms for estimating otherwise intractable likelihood functions using a likelihood ratio trick based on binary classifiers. Consequently, efficient likelihood approximations can be obtained whenever good probabilistic classifiers can be constructed. We propose a kernel classifier for sequential data using *path signatures* based on the recently introduced signature kernel. We demonstrate that the representative power of signatures yields a highly performant classifier, even in the crucially important case where sample numbers are low. In such scenarios, our approach can outperform sophisticated neural networks for common posterior inference tasks. }
}
@InProceedings{pmlr-v151-acharya22a,
title = { Robust Training in High Dimensions via Block Coordinate Geometric Median Descent },
author = {Acharya, Anish and Hashemi, Abolfazl and Jain, Prateek and Sanghavi, Sujay and Dhillon, Inderjit S. and Topcu, Ufuk},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11145--11168},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/acharya22a/acharya22a.pdf},
url = {https://proceedings.mlr.press/v151/acharya22a.html},
abstract = { Geometric median (GM) is a classical method in statistics for achieving robust estimation of the uncorrupted data; under gross corruption, it achieves the optimal breakdown point of 1/2. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) in high-dimensional optimization problems. In this paper, we show that by applying GM to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 1/2 for smooth non-convex problems, with non-asymptotic convergence rates comparable to the SGD with GM while resulting in significant speedup in training. We further validate the run-time and robustness of our approach empirically on several popular deep learning tasks. Code available at: https://github.com/anishacharya/BGMD }
}
@InProceedings{pmlr-v151-kallus22a,
title = { Stateful Offline Contextual Policy Evaluation and Learning },
author = {Kallus, Nathan and Zhou, Angela},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11169--11194},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/kallus22a/kallus22a.pdf},
url = {https://proceedings.mlr.press/v151/kallus22a.html},
abstract = { We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions. This model can be thought of as an offline generalization of contextual bandits with resource constraints. We formalize the relevant causal structure of problems such as dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The key insight is that an individual-level response is often not causally affected by the state variable and can therefore easily be generalized across timesteps and states. When this is true, we study implications for (doubly robust) off-policy evaluation and learning by instead leveraging single time-step evaluation, estimating the expectation over a single arrival via data from a population, for fitted-value iteration in a marginal MDP. We study sample complexity and analyze error amplification that leads to the persistence, rather than attenuation, of confounding error over time. In simulations of dynamic and capacitated pricing, we show improved out-of-sample policy performance in this class of relevant problems. }
}
@InProceedings{pmlr-v151-chen22i,
title = { Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation },
author = {Chen, Zaiwei and Theja Maguluri, Siva},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11195--11214},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/chen22i/chen22i.pdf},
url = {https://proceedings.mlr.press/v151/chen22i.html},
abstract = { In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various policy update rules (including natural policy gradient) are considered for policy improvement. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithm framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high variance issue in off-policy learning. As for the policy improvement, we provide a universal analysis that establishes geometric convergence of various policy update rules, which leads to an overall $\Tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity. }
}
@InProceedings{pmlr-v151-hanna22a,
title = { Solving Multi-Arm Bandit Using a Few Bits of Communication },
author = {Hanna, Osama A. and Yang, Lin and Fragouli, Christina},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11215--11236},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/hanna22a/hanna22a.pdf},
url = {https://proceedings.mlr.press/v151/hanna22a.html},
abstract = { The multi-armed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Existing works usually fail to address this issue and can become infeasible in certain applications. In this paper we address the communication problem by optimizing the communication of rewards collected by distributed agents. By providing nearly matching upper and lower bounds, we tightly characterize the number of bits needed per reward for the learner to accurately learn without suffering additional regret. In particular, we establish a generic reward quantization algorithm, QuBan, that can be applied on top of any (no-regret) MAB algorithm to form a new communication-efficient counterpart, that requires only a few (as low as 3) bits to be sent per iteration while preserving the same regret bound. Our lower bound is established via constructing hard instances from a subgaussian distribution. Our theory is further corroborated by numerically experiments. }
}
@InProceedings{pmlr-v151-mokhtarian22a,
title = { Causal Effect Identification with Context-specific Independence Relations of Control Variables },
author = {Mokhtarian, Ehsan and Jamshidi, Fateme and Etesami, Jalal and Kiyavash, Negar},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11237--11246},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/mokhtarian22a/mokhtarian22a.pdf},
url = {https://proceedings.mlr.press/v151/mokhtarian22a.html},
abstract = { We study the problem of causal effect identification from observational distribution given the causal graph and some context-specific independence (CSI) relations. It was recently shown that this problem is NP-hard, and while a sound algorithm to learn the causal effects is proposed in Tikka et al. (2019), no complete algorithm for the task exists. In this work, we propose a sound and complete algorithm for the setting when the CSI relations are limited to observed nodes with no parents in the causal graph. One limitation of the state of the art in terms of its applicability is that the CSI relations among all variables, even unobserved ones, must be given (as opposed to learned). Instead, We introduce a set of graphical constraints under which the CSI relations can be learned from mere observational distribution. This expands the set of identifiable causal effects beyond the state of the art. }
}
@InProceedings{pmlr-v151-liu22h,
title = { Entropy Regularized Optimal Transport Independence Criterion },
author = {Liu, Lang and Pal, Soumik and Harchaoui, Zaid},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11247--11279},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/liu22h/liu22h.pdf},
url = {https://proceedings.mlr.press/v151/liu22h.html},
abstract = { We introduce an independence criterion based on entropy regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic and study its statistical behavior under both the null hypothesis and the alternative hypothesis. The theoretical results involve tools from U-process theory and optimal transport theory. We also offer a random feature type approximation for large-scale problems, as well as a differentiable program implementation for deep learning applications. We present experimental results on existing benchmarks for independence testing, illustrating the interest of the proposed criterion to capture both linear and nonlinear dependencies in synthetic data and real data. }
}
@InProceedings{pmlr-v151-deng22c,
title = { Polynomial Time Reinforcement Learning in Factored State MDPs with Linear Value Functions },
author = {Deng, Zihao and Devic, Siddartha and Juba, Brendan},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11280--11304},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/deng22c/deng22c.pdf},
url = {https://proceedings.mlr.press/v151/deng22c.html},
abstract = { Many reinforcement learning (RL) environments in practice feature enormous state spaces that may be described compactly by a "factored" structure, that may be modeled by Factored Markov Decision Processes (FMDPs). We present the first polynomial time algorithm for RL in Factored State MDPs (generalizing FMDPs) that neither relies on an oracle planner nor requires a linear transition model; it only requires a linear value function with a suitable local basis with respect to the factorization, permitting efficient variable elimination. With this assumption, we can solve this family of Factored State MDPs in polynomial time by constructing an efficient separation oracle for convex optimization. Importantly, and in contrast to prior work on FMDPs, we do not assume that the transitions on various factors are conditionally independent. }
}
@InProceedings{pmlr-v151-padakandla22a,
title = { PAC Learning of Quantum Measurement Classes : Sample Complexity Bounds and Universal Consistency },
author = {Padakandla, Arun and Magner, Abram},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11305--11319},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/padakandla22a/padakandla22a.pdf},
url = {https://proceedings.mlr.press/v151/padakandla22a.html},
abstract = { We formulate a quantum analogue of the fundamental classical PAC learning problem. As on a quantum computer, we model data to be encoded by modifying specific attributes - spin axis of an electron, plane of polarization of a photon - of sub-atomic particles. Any interaction, including reading off, extracting or learning from such data is via quantum measurements, thus leading us to a problem of PAC learning Quantum Measurement Classes. We propose and analyze the sample complexity of a new ERM algorithm that respects quantum non-commutativity. Our study entails that we define the VC dimension of Positive Operator Valued Measure(ments) (POVMs) concept classes. Our sample complexity bounds involve optimizing over partitions of jointly measurable classes. Finally, we identify universally consistent sequences of POVM classes. Technical components of this work include computations involving tensor products, trace and uniform convergence bounds. }
}
@InProceedings{pmlr-v151-shams-azam22a,
title = { Can we Generalize and Distribute Private Representation Learning? },
author = {Shams Azam, Sheikh and Kim, Taejin and Hosseinalipour, Seyyedali and Joe-Wong, Carlee and Bagchi, Saurabh and Brinton, Christopher},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11320--11340},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/shams-azam22a/shams-azam22a.pdf},
url = {https://proceedings.mlr.press/v151/shams-azam22a.html},
abstract = { We study the problem of learning representations that are private yet informative i.e., provide information about intended "ally" targets while hiding sensitive "adversary" attributes. We propose Exclusion-Inclusion Generative Adversarial Network (EIGAN), a generalized private representation learning (PRL) architecture that accounts for multiple ally and adversary attributes unlike existing PRL solutions. While centrally-aggregated dataset is a prerequisite for most PRL techniques, data in real-world is often siloed across multiple distributed nodes unwilling to share the raw data because of privacy concerns. We address this practical constraint by developing D-EIGAN, the first distributed PRL method that learns representations at each node without transmitting the source data. We theoretically analyze the behavior of adversaries under the optimal EIGAN and D-EIGAN encoders and the impact of dependencies among ally and adversary tasks on the optimization objective. Our experiments on various datasets demonstrate the advantages of EIGAN in terms of performance, robustness, and scalability. In particular, EIGAN outperforms the previous state-of-the-art by a significant accuracy margin ($47%$ improvement), and D-EIGAN’s performance is consistently on par with EIGAN under different network settings. }
}
@InProceedings{pmlr-v151-sebenius22a,
title = { Feature Collapsing for Gaussian Process Variable Ranking },
author = {Sebenius, Isaac and Paananen, Topi and Vehtari, Aki},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11341--11355},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/sebenius22a/sebenius22a.pdf},
url = {https://proceedings.mlr.press/v151/sebenius22a.html},
abstract = { At present, there is no consensus on the most effective way to establish feature relevance for Gaussian process models. The most common heuristic, Automatic Relevance Determination, has several downsides; many alternate methods incur unacceptable computational costs. Existing methods based on sensitivity analysis of the posterior predictive distribution are promising, but are heavily biased and show room for improvement. This paper proposes Feature Collapsing as a novel method for performing GP feature relevance determination in an effective, unbiased, and computationally-inexpensive manner compared to existing algorithms. }
}
@InProceedings{pmlr-v151-zhang22g,
title = { Private Sequential Hypothesis Testing for Statisticians: Privacy, Error Rates, and Sample Size },
author = {Zhang, Wanrong and Mei, Yajun and Cummings, Rachel},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11356--11373},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/zhang22g/zhang22g.pdf},
url = {https://proceedings.mlr.press/v151/zhang22g.html},
abstract = { The sequential hypothesis testing problem is a class of statistical analyses where the sample size is not fixed in advance. Instead, the decision-process takes in new observations sequentially to make real-time decisions for testing an alternative hypothesis against a null hypothesis until some stopping criterion is satisfied. In many common applications of sequential hypothesis testing, the data can be highly sensitive and may require privacy protection; for example, sequential hypothesis testing is used in clinical trials, where doctors sequentially collect data from patients and must determine when to stop recruiting patients and whether the treatment is effective. The field of differential privacy has been developed to offer data analysis tools with strong privacy guarantees, and has been commonly applied to machine learning and statistical tasks. In this work, we study the sequential hypothesis testing problem under a slight variant of differential privacy, known as Renyi differential privacy. We present a new private algorithm based on Wald’s Sequential Probability Ratio Test (SPRT) that also gives strong theoretical privacy guarantees. We provide theoretical analysis on statistical performance measured by Type I and Type II error as well as the expected sample size. We also empirically validate our theoretical results on several synthetic databases, showing that our algorithms also perform well in practice. Unlike previous work in private hypothesis testing that focused only on the classical fixed sample setting, our results in the sequential setting allow a conclusion to be reached much earlier, and thus saving the cost of collecting additional samples. }
}
@InProceedings{pmlr-v151-gasanov22a,
title = { FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning },
author = {Gasanov, Elnur and Khaled, Ahmed and Horv\'ath, Samuel and Richtarik, Peter},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11374--11421},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/gasanov22a/gasanov22a.pdf},
url = {https://proceedings.mlr.press/v151/gasanov22a.html},
abstract = { Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation. }
}
@InProceedings{pmlr-v151-xu22e,
title = { Data Appraisal Without Data Sharing },
author = {Xu, Xinlei and Hannun, Awni and Van Der Maaten, Laurens},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11422--11437},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/xu22e/xu22e.pdf},
url = {https://proceedings.mlr.press/v151/xu22e.html},
abstract = { One of the most effective approaches to improving the performance of a machine learning model is to procure additional training data. A model owner seeking relevant training data from a data owner needs to appraise the data before acquiring it. However, without a formal agreement, the data owner does not want to share data. The resulting Catch-22 prevents efficient data markets from forming. This paper proposes adding a data appraisal stage that requires no data sharing between data owners and model owners. Specifically, we use multi-party computation to implement an appraisal function computed on private data. The appraised value serves as a guide to facilitate data selection and transaction. We propose an efficient data appraisal method based on forward influence functions that approximates data value through its first-order loss reduction on the current model. The method requires no additional hyper-parameters or re-training. We show that in private, forward influence functions provide an appealing trade-off between high quality appraisal and required computation, in spite of label noise, class imbalance, and missing data. Our work seeks to inspire an open market that incentivizes efficient, equitable exchange of domain-specific training data. }
}
@InProceedings{pmlr-v151-le22c,
title = { On Global-view Based Defense via Adversarial Attack and Defense Risk Guaranteed Bounds },
author = {Le, Trung and Tuan Bui, Anh and Minh Tri Tue, Le and Zhao, He and Montague, Paul and Tran, Quan and Phung, Dinh},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11438--11460},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/le22c/le22c.pdf},
url = {https://proceedings.mlr.press/v151/le22c.html},
abstract = { It is well-known that deep neural networks (DNNs) are susceptible to adversarial attacks, which presents the most severe fragility of the deep learning system. Despite achieving impressive performance, most of the current state-of-the-art classifiers remain highly vulnerable to carefully crafted imperceptible, adversarial perturbations. Recent research attempts to understand neural network attack and defense have become increasingly urgent and important. While rapid progress has been made on this front, there is still an important theoretical gap in achieving guaranteed bounds on attack/defense models, leaving uncertainty in the quality and certified guarantees of these models. To this end, we systematically address this problem in this paper. More specifically, we formulate attack and defense in a generic setting where there exists a family of adversaries (i.e., attackers) for attacking a family of classifiers (i.e., defenders). We develop a novel class of f-divergences suitable for measuring divergence among multiple distributions. This equips us to study the interactions between attackers and defenders in a countervailing game where we formulate a joint risk on attack and defense schemes. This is followed by our key results on guaranteed upper and lower bounds on this risk that can provide a better understanding of the behaviors of those parties from the attack and defense perspectives, thereby having important implications to both attack and defense sides. Finally, benefited from our theory, we propose an empirical approach that bases on a global view to defend against adversarial attacks. The experimental results conducted on benchmark datasets show that the global view for attack/defense if exploited appropriately can help to improve adversarial robustness. }
}
@InProceedings{pmlr-v151-montasser22a,
title = { Transductive Robust Learning Guarantees },
author = {Montasser, Omar and Hanneke, Steve and Srebro, Nathan},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11461--11471},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/montasser22a/montasser22a.pdf},
url = {https://proceedings.mlr.press/v151/montasser22a.html},
abstract = { We study the problem of adversarially robust learning in the transductive setting. For classes H of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in the VC dimension and is adaptive to the complexity of the perturbation set. This result provides an exponential improvement in dependence on VC dimension over the best known upper bound on the robust error in the inductive setting, at the expense of competing with a more restrictive notion of optimal robust error. }
}
@InProceedings{pmlr-v151-zuo22a,
title = { Online Competitive Influence Maximization },
author = {Zuo, Jinhang and Liu, Xutong and Joe-Wong, Carlee and Lui, John C. S. and Chen, Wei},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11472--11502},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/zuo22a/zuo22a.pdf},
url = {https://proceedings.mlr.press/v151/zuo22a.html},
abstract = { Online influence maximization has attracted much attention as a way to maximize influence spread through a social network while learning the values of unknown network parameters. Most previous works focus on single-item diffusion. In this paper, we introduce a new Online Competitive Influence Maximization (OCIM) problem, where two competing items (e.g., products, news stories) propagate in the same network and influence probabilities on edges are unknown. We adopt a combinatorial multi-armed bandit (CMAB) framework for OCIM, but unlike the non-competitive setting, the important monotonicity property (influence spread increases when influence probabilities on edges increase) no longer holds due to the competitive nature of propagation, which brings a significant new challenge to the problem. We provide a nontrivial proof showing that the Triggering Probability Modulated (TPM) condition for CMAB still holds in OCIM, which is instrumental for our proposed algorithms OCIM-TS and OCIM-OFU to achieve sublinear Bayesian and frequentist regret, respectively. We also design an OCIM-ETC algorithm that requires less feedback and easier offline computation, at the expense of a worse frequentist regret bound. Experimental evaluations demonstrate the effectiveness of our algorithms. }
}
@InProceedings{pmlr-v151-korba22a,
title = { Adaptive Importance Sampling meets Mirror Descent : a Bias-variance Tradeoff },
author = {Korba, Anna and Portier, Fran\c{c}ois},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11503--11527},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/korba22a/korba22a.pdf},
url = {https://proceedings.mlr.press/v151/korba22a.html},
abstract = { Adaptive importance sampling is a widely spread Monte Carlo technique that uses a re-weighting strategy to iteratively estimate the so-called target distribution. A major drawback of adaptive importance sampling is the large variance of the weights which is known to badly impact the accuracy of the estimates. This paper investigates a regularization strategy whose basic principle is to raise the importance weights at a certain power. This regularization parameter, that might evolve between zero and one during the algorithm, is shown (i) to balance between the bias and the variance and (ii) to be connected to the mirror descent framework. Using a kernel density estimate to build the sampling policy, the uniform convergence is established under mild conditions. Finally, several practical ways to choose the regularization parameter are discussed and the benefits of the proposed approach are illustrated empirically. }
}
@InProceedings{pmlr-v151-boffi22a,
title = { The role of optimization geometry in single neuron learning },
author = {Boffi, Nicholas and Tu, Stephen and Slotine, Jean-Jacques},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11528--11549},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/boffi22a/boffi22a.pdf},
url = {https://proceedings.mlr.press/v151/boffi22a.html},
abstract = { Recent numerical experiments have demonstrated that the choice of optimization geometry used during training can impact generalization performance when learning expressive nonlinear model classes such as deep neural networks. These observations have important implications for modern deep learning, but remain poorly understood due to the difficulty of the associated nonconvex optimization. Towards an understanding of this phenomenon, we analyze a family of pseudogradient methods for learning generalized linear models under the square loss – a simplified problem containing both nonlinearity in the model parameters and nonconvexity of the optimization which admits a single neuron as a special case. We prove non-asymptotic bounds on the generalization error that sharply characterize how the interplay between the optimization geometry and the feature space geometry sets the out-of-sample performance of the learned model. Experimentally, selecting the optimization geometry as suggested by our theory leads to improved performance in generalized linear model estimation problems such as nonlinear and nonconvex variants of sparse vector recovery and low-rank matrix sensing. }
}
@InProceedings{pmlr-v151-deng22d,
title = { Learning Tensor Representations for Meta-Learning },
author = {Deng, Samuel and Guo, Yilin and Hsu, Daniel and Mandal, Debmalya},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11550--11580},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/deng22d/deng22d.pdf},
url = {https://proceedings.mlr.press/v151/deng22d.html},
abstract = { We introduce a tensor-based model of shared representation for meta-learning from a diverse set of tasks. Prior works on learning linear representations for meta-learning assume that there is a common shared representation across different tasks, and do not consider the additional task-specific observable side information. In this work, we model the meta-parameter through an order-$3$ tensor, which can adapt to the observed task features of the task. We propose two methods to estimate the underlying tensor. The first method solves a tensor regression problem and works under natural assumptions on the data generating process. The second method uses the method of moments under additional distributional assumptions and has an improved sample complexity in terms of the number of tasks. We also focus on the meta-test phase, and consider estimating task-specific parameters on a new task. Substituting the estimated tensor from the first step allows us estimating the task-specific parameters with very few samples of the new task, thereby showing the benefits of learning tensor representations for meta-learning. Finally, through simulation and several real-world datasets, we evaluate our methods and show that it improves over previous linear models of shared representations for meta-learning. }
}
@InProceedings{pmlr-v151-farhadi22a,
title = { Differentially Private Densest Subgraph },
author = {Farhadi, Alireza and Hajiaghayi, MohammadTaghi and Shi, Elaine},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11581--11597},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/farhadi22a/farhadi22a.pdf},
url = {https://proceedings.mlr.press/v151/farhadi22a.html},
abstract = { Given a graph, the densest subgraph problem asks for a set of vertices such that the average degree among these vertices is maximized. Densest subgraph has numerous applications in learning, e.g., community detection in social networks, link spam detection, correlation mining, bioinformatics, and so on. Although there are efficient algorithms that output either exact or approximate solutions to the densest subgraph problem, existing algorithms may violate the privacy of the individuals in the network, e.g., leaking the existence/non-existence of edges. In this paper, we study the densest subgraph problem in the framework of the differential privacy, and we derive the upper and lower bounds for this problem. We show that there exists a linear-time $\epsilon$-differentially private algorithm that finds a 2-approximation of the densest subgraph with an extra poly-logarithmic additive error. Our algorithm not only reports the approximate density of the densest subgraph, but also reports the vertices that form the dense subgraph. Our upper bound almost matches the famous 2-approximation by Charikar both in performance and in approximation ratio, but we additionally achieve differential privacy. In comparison with Charikar’s algorithm, our algorithm has an extra poly logarithmic additive error. We partly justify the additive error with a new lower bound, showing that for any differentially private algorithm that provides a constant-factor approximation, a sub-logarithmic additive error is inherent. We also practically study our differentially private algorithm on real-world graphs, and we show that in practice the algorithm finds a solution which is very close to the optimal. }
}
@InProceedings{pmlr-v151-zhang22h,
title = { A Manifold View of Adversarial Risk },
author = {Zhang, Wenjia and Zhang, Yikai and Hu, Xiaoling and Goswami, Mayank and Chen, Chao and Metaxas, Dimitris N.},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11598--11614},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/zhang22h/zhang22h.pdf},
url = {https://proceedings.mlr.press/v151/zhang22h.html},
abstract = { The adversarial risk of a machine learning model has been widely studied. Most previous works assume that the data lies in the whole ambient space. We propose to take a new angle and take the manifold assumption into consideration. Assuming data lies in a manifold, we investigate two new types of adversarial risk, the normal adversarial risk due to perturbation along normal direction, and the in-manifold adversarial risk due to perturbation within the manifold. We prove that the classic adversarial risk can be bounded from both sides using the normal and in-manifold adversarial risks. We also show with a surprisingly pessimistic case that the standard adversarial risk can be nonzero even when both normal and in-manifold risks are zero. We finalize the paper with empirical studies supporting our theoretical results. Our results suggest the possibility of improving the robustness of a classifier by only focusing on the normal adversarial risk. }
}
@InProceedings{pmlr-v151-corinzia22a,
title = { Statistical and computational thresholds for the planted k-densest sub-hypergraph problem },
author = {Corinzia, Luca and Penna, Paolo and Szpankowski, Wojciech and Buhmann, Joachim},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11615--11640},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/corinzia22a/corinzia22a.pdf},
url = {https://proceedings.mlr.press/v151/corinzia22a.html},
abstract = { In this work, we consider the problem of recovery a planted k-densest sub-hypergraph on d-uniform hypergraphs. This fundamental problem appears in different contexts, e.g., community detection, average-case complexity, and neuroscience applications as a structural variant of tensor-PCA problem. We provide tight information-theoretic upper and lower bounds for the exact recovery threshold by the maximum-likelihood estimator, as well as algorithmic bounds based on approximate message passing algorithms. The problem exhibits a typical statistical-to-computational gap observed in analogous sparse settings that widen with increasing sparsity of the problem. The bounds show that the signal structure impacts the location of the statistical and computational phase transition that the known existing bounds for the tensor-PCA model do not capture. This effect is due to the generic planted signal prior that this latter model addresses. }
}
@InProceedings{pmlr-v151-babay22a,
title = { Controlling Epidemic Spread using Probabilistic Diffusion Models on Networks },
author = {Babay, Amy E. and Dinitz, Michael and Srinivasan, Aravind and Tsepenekas, Leonidas and Vullikanti, Anil},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11641--11654},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/babay22a/babay22a.pdf},
url = {https://proceedings.mlr.press/v151/babay22a.html},
abstract = { The spread of an epidemic is often modeled by an SIR random process on a social network graph. The MinInfEdge problem for optimal social distancing involves minimizing the expected number of infections, when we are allowed to break at most B edges; similarly the MinInfNode problem involves removing at most B vertices. These are fundamental problems in epidemiology and network science. While a number of heuristics have been considered, the complexity of this problem remains generally open. In this paper, we present two bicriteria approximation algorithms for the MinInfEdge problem, which give the first non-trivial approximations for this problem. The first is based on the cut sparsification result technique of Karger, which works for any graph, when the transmission probabilities are not too small. The second is a Sample Average Approximation (SAA) based algorithm, which we analyze for the Chung-Lu random graph model. We also extend some of our results for the MinInfNode problem. }
}
@InProceedings{pmlr-v151-azari22a,
title = { Equivariant Deep Dynamical Model for Motion Prediction },
author = {Azari, Bahar and Erdogmus, Deniz},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11655--11668},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/azari22a/azari22a.pdf},
url = {https://proceedings.mlr.press/v151/azari22a.html},
abstract = { Learning representations through deep generative modeling is a powerful approach for dynamical modeling to discover the most simplified and compressed underlying description of the data, to then use it for other tasks such as prediction. Most learning tasks have intrinsic symmetries, i.e., the input transformations leave the output unchanged, or the output undergoes a similar transformation. The learning process is, however, usually uninformed of these symmetries. Therefore, the learned representations for individually transformed inputs may not be meaningfully related. In this paper, we propose an SO(3) equivariant deep dynamical model (EqDDM) for motion prediction that learns a structured representation of the input space in the sense that the embedding varies with symmetry transformations. EqDDM is equipped with equivariant networks to parameterize the state-space emission and transition models. We demonstrate the superior predictive performance of the proposed model on various motion data. }
}
@InProceedings{pmlr-v151-ryan22a,
title = { The Fast Kernel Transform },
author = {Ryan, John P. and Ament, Sebastian E. and Gomes, Carla P. and Damle, Anil},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11669--11690},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/ryan22a/ryan22a.pdf},
url = {https://proceedings.mlr.press/v151/ryan22a.html},
abstract = { Kernel methods are a highly effective and widely used collection of modern machine learning algorithms. A fundamental limitation of virtually all such methods are computations involving the kernel matrix that naively scale quadratically (e.g., matrix-vector multiplication) or cubically (solving linear systems) with the size of the dataset N. We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications (MVMs) for datasets in moderate dimensions with quasilinear complexity. Typically, analytically grounded fast multiplication methods require specialized development for specific kernels. In contrast, our scheme is based on auto-differentiation and automated symbolic computations that leverage the analytical structure of the underlying kernel. This allows the FKT to be easily applied to a broad class of kernels, including Gaussian, Matern, and Rational Quadratic covariance functions and Green’s functions, including those of the Laplace and Helmholtz equations. Furthermore, the FKT maintains a high, quantifiable, and controllable level of accuracy—properties that many acceleration methods lack. We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks with comparisons to adjacent methods, and by applying it to scale the stochastic neighborhood embedding (t-SNE) and Gaussian processes to large real-world datasets. }
}
@InProceedings{pmlr-v151-nietert22a,
title = { Outlier-Robust Optimal Transport: Duality, Structure, and Statistical Analysis },
author = {Nietert, Sloan and Goldfeld, Ziv and Cummings, Rachel},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11691--11719},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/nietert22a/nietert22a.pdf},
url = {https://proceedings.mlr.press/v151/nietert22a.html},
abstract = { The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to be minimax optimal for robust estimation under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that lends itself better for analysis compared to previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets. }
}
@InProceedings{pmlr-v151-ortega22a,
title = { Diversity and Generalization in Neural Network Ensembles },
author = {Ortega, Luis A. and Caba\~nas, Rafael and Masegosa, Andres},
booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
pages = {11720--11743},
year = {2022},
editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
volume = {151},
series = {Proceedings of Machine Learning Research},
month = {28--30 Mar},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v151/ortega22a/ortega22a.pdf},
url = {https://proceedings.mlr.press/v151/ortega22a.html},
abstract = { Ensembles are widely used in machine learning and, usually, provide state-of-the-art performance in many prediction tasks. From the very beginning, the diversity of an ensemble has been identified as a key factor for the superior performance of these models. But the exact role that diversity plays in ensemble models is poorly understood, specially in the context of neural networks. In this work, we combine and expand previously published results in a theoretically sound framework that describes the relationship between diversity and ensemble performance for a wide range of ensemble methods. More precisely, we provide sound answers to the following questions: how to measure diversity, how diversity relates to the generalization error of an ensemble, and how diversity is promoted by neural network ensemble algorithms. This analysis covers three widely used loss functions, namely, the squared loss, the cross-entropy loss, and the 0-1 loss; and two widely used model combination strategies, namely, model averaging and weighted majority vote. We empirically validate this theoretical analysis with neural network ensembles. }
}