Proceedings of Machine Learning Research
Proceedings of Thirty Fifth Conference on Learning Theory
Held in London, UK on 02-05 July 2022
Published as Volume 178 by the Proceedings of Machine Learning Research on 28 June 2022.
Volume Edited by:
Po-Ling Loh
Maxim Raginsky
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/v178/
Wed, 08 Feb 2023 10:38:13 +0000
Conference on Learning Theory 2022: Preface
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/preface-loh22a.html
Open Problem: Regret Bounds for Noise-Free Kernel-Based Bandits
Kernel-based bandit is an extensively studied black-box optimization problem, in which the objective function is assumed to live in a known reproducing kernel Hilbert space. While nearly optimal regret bounds (up to logarithmic factors) are established in the noisy setting, surprisingly, less is known about the noise-free setting (when the exact values of the underlying function are accessible without observation noise). We discuss several upper bounds on regret, none of which seems order optimal, and provide a conjecture on the order-optimal regret bound.
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/open-problem-vakili22a.html
Open Problem: Do you pay for Privacy in Online learning?
Online learning, in the mistake bound model, is one of the most fundamental concepts in learning theory and differential privacy is, perhaps, the most widely used statistical concept of privacy in the machine learning community. Thus, defining problems which are online differentially privately learnable is of great interest in learning theory. In this paper, we pose the question of whether the two problems are equivalent from a learning perspective, i.e., is privacy for free in the online learning framework?
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/open-problem-sanyal22a.html
Open Problem: Optimal Best Arm Identification with Fixed-Budget
Best arm identification or pure exploration problems have received much attention in the COLT community since Bubeck et al. (2009) and Audibert et al. (2010). For any bandit instance with a unique best arm, its asymptotic complexity in the so-called fixed-confidence setting has been completely characterized in Garivier and Kaufmann (2016) and Chernoff (1959), while little is known about the asymptotic complexity in its “dual” setting, called the fixed-budget setting. This note discusses the open problems and conjectures about the instance-dependent asymptotic complexity in the fixed-budget setting.
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/open-problem-qin22a.html
Open Problem: Finite-Time Instance Dependent Optimality for Stochastic Online Learning with Feedback Graphs
Both asymptotic and non-asymptotic instance dependent regret bounds are known for the stochastic multi-armed bandit problem. Such regret bounds are known to be tight up to lower order terms in the setting of Gaussian rewards (Garivier et al., 2019). We revisit the related problem of stochastic online learning with feedback graphs, where asymptotically optimal instance dependent algorithms are known. Surprisingly, the notion of optimal finite-time regret is not a uniquely defined property in this context and in general, it is decoupled from the asymptotic rate. We pose two open problems. First we ask for a characterization of the finite time instance-dependent optimal regret. Next, we ask for a characterization of the set of graphs for which the finite time regret is bounded by the asymptotically optimal rate, for reasonable values of the time horizon.
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/open-problem-marinov22a.html
Open Problem: Running time complexity of accelerated $\ell_1$-regularized PageRank
The results in Google Search, Twitter and other popular search engines are traditionally ranked using the Personalized PageRank (PPR) vector. Additionally, there is a plethora of applications beyond the web~\citep{G15} which are modelled using PPR. In recent work by~\cite{ACL06,GM14_ICML,fountoulakis2019variational}, it was shown that small probabilities in the PPR vector, e.g., web pages beyond the first page in Google Search, can be thresholded out automatically by utilizing $\ell_1$-regularization or equivalently by early termination. Both versions result in approximate computation of PPR. The current fastest method for computing the $\ell_1$-regularized PPR uses the proximal gradient method and requires $\tilde{\mathcal{O}}((\alpha \rho)^{-1})$ total running time, where $\alpha$ is the teleportation parameter and $\rho$ is a parameter which controls the level of sparsity in the $\ell_1$-regularized PPR. It is important to note that the running time complexity does not depend on the size of the underlying graph (e.g. the length of the PPR vector). Such a property has become a prerequisite to probe modern large scale networks. A seemingly natural way to build an even faster algorithm for computing the $\ell_1$-regularized PPR is to accelerate the proximal gradient method and consequently reduce the running time complexity to $\tilde{\mathcal{O}}((\sqrt{\alpha} \rho)^{-1})$. This will lead to a speed-up by a factor of $1/\sqrt{\alpha}$ and improve the running time of various network analytic methods which build upon PPR. However, the original analysis of the proximal gradient method in~\cite{fountoulakis2019variational} does not apply to the accelerated version. While we have empirical evidence that indicates accelerated proximal gradient requires less total running time, it is not even clear if acceleration could lead to a worse running time complexity in the worst case.
Thu, 01 Sep 2022 00:00:00 +0000
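The speed-up factor quoted in this abstract can be checked by dividing the two running-time bounds. The worked ratio below ignores the logarithmic factors hidden by $\tilde{\mathcal{O}}$, and the value $\alpha = 0.01$ is a hypothetical choice for illustration only. Whether the accelerated bound actually holds is exactly the open problem.

```latex
\[
\frac{(\alpha\rho)^{-1}}{(\sqrt{\alpha}\,\rho)^{-1}}
= \frac{\sqrt{\alpha}\,\rho}{\alpha\,\rho}
= \frac{1}{\sqrt{\alpha}},
\qquad \text{e.g. } \alpha = 0.01 \;\Rightarrow\; \frac{1}{\sqrt{0.01}} = 10 .
\]
```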
https://proceedings.mlr.press/v178/open-problem-fountoulakis22a.html
Open Problem: Properly learning decision trees in polynomial time?
The authors recently gave an almost-polynomial time membership query algorithm for properly learning decision trees under the uniform distribution~\citep{BLQT21}. The previous fastest algorithm for this problem ran in quasipolynomial time, a consequence of \cite{EH89}'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/open-problem-blanc22a.html
Open Problem: Better Differentially Private Learning Algorithms with Margin Guarantees
The design of efficient differentially private (DP) learning algorithms with dimension-independent learning guarantees has been one of the central challenges in the field of privacy-preserving machine learning. Existing algorithms suffer from weak generalization guarantees, restrictive model assumptions, or high computational cost. In non-private learning, dimension-independent generalization guarantees based on the notion of confidence margin were shown to be the most informative and useful learning guarantees. This motivates a systematic study of DP learning algorithms with confidence-margin generalization guarantees. A recent work has started exploring this direction in the context of linear and kernel-based classification as well as certain classes of neural networks (NNs). Despite several positive results, a number of fundamental questions are still open. We identify two major open problems related to DP margin-based learning algorithms. The first problem relates to the design of algorithms with more favorable computational cost. The second one pertains to the question of achieving margin guarantees for NNs under DP with no explicit dependence on the network size.
Thu, 01 Sep 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/open-problem-bassily22a.html
Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits
We introduce a modification of follow the regularised leader and combine it with the log determinant potential and suitable loss estimators to prove that the minimax regret for adaptive adversarial linear bandits is at most $O(d \sqrt{T \log(T)})$ where $d$ is the dimension and $T$ is the number of rounds. By using exponential weights, we improve this bound to $O(\sqrt{dT\log(kT)})$ when the action set has size $k$. These results confirm an old conjecture. We also show that follow the regularised leader with the entropic barrier and suitable loss estimators has regret against an adaptive adversary of at most $O(d^2 \sqrt{T} \log(T))$ and can be implemented in polynomial time, which improves on the best known bound for an efficient algorithm of $O(d^{7/2} \sqrt{T} \mathrm{poly}(\log(T)))$ by Lee et al. (2020).
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/zimmert22b.html
Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States
We revisit the classical online portfolio selection problem. It is widely assumed that a trade-off between computational complexity and regret is unavoidable, with Cover’s Universal Portfolios algorithm, SOFT-BAYES and ADA-BARRONS currently constituting its state-of-the-art Pareto frontier. In this paper, we present the first efficient algorithm, BISONS, that obtains polylogarithmic regret with memory and per-step running time requirements that are polynomial in the dimension, displacing ADA-BARRONS from the Pareto frontier. Additionally, we resolve a COLT 2020 open problem by showing that a certain Follow-The-Regularized-Leader algorithm with log-barrier regularization suffers an exponentially larger dependence on the dimension than previously conjectured. Thus, we rule out this algorithm as a candidate for the Pareto frontier. We also extend our algorithm and analysis to a more general problem than online portfolio selection, viz. online learning of quantum states with log loss. This algorithm, called SCHRODINGER’S-BISONS, is the first efficient algorithm with polylogarithmic regret for this more general problem.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/zimmert22a.html
Single Trajectory Nonparametric Learning of Nonlinear Dynamics
Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE). More precisely, we give nonasymptotic expected $l^2$-distance bounds between the LSE and the true regression function, where expectation is evaluated on a fresh, counterfactual, trajectory. We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypothesis classes in terms of supremum norm metric entropy and a subgaussian parameter. Next, we relate this subgaussian parameter to the stability of the underlying process using notions from dynamical systems theory. When combined, these developments lead to rate-optimal error bounds that scale as $T^{-1/(2+q)}$ for suitably stable processes and hypothesis classes with metric entropy growth of order $\delta^{-q}$. Here, $T$ is the length of the observed trajectory, $\delta \in \mathbb{R}_+$ is the packing granularity and $q\in (0,2)$ is a complexity term. Finally, we specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS).
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/ziemann22a.html
High-Dimensional Projection Pursuit: Outer Bounds and Applications to Interpolation in Neural Networks
Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$ and, for each such projection, the empirical distribution of the projected points. What does this collection of probability distributions look like when $n,d$ grow large? We consider this question under the null model in which the points are i.i.d. standard Gaussian vectors, focusing on the asymptotic regime in which $n,d\to\infty$, with $n/d\to\alpha\in (0,\infty)$, while $m$ is fixed. Denoting by $\cuF_{m, \alpha}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new outer bounds on $\cuF_{m, \alpha}$. In particular, we characterize the radius of $\cuF_{m,\alpha}$ in terms of Wasserstein distance and prove sharp bounds in terms of Kullback-Leibler divergence and Rényi information dimension. The previous question has application to unsupervised learning methods, such as projection pursuit and independent component analysis. We introduce a version of the same problem that is relevant for supervised learning, and prove a sharp Wasserstein radius bound. As an application, we establish an upper bound on the interpolation threshold of two-layer neural networks with $m$ hidden neurons.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/zhou22a.html
Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies
This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency \citep{zhang2020reinforcement} or have an exponential dependency on $S$ \citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/zhang22a.html
Offline Reinforcement Learning with Realizability and Single-policy Concentrability
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/zhan22a.html
Lattice-Based Methods Surpass Sum-of-Squares in Clustering
Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering $d$-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. ’20; Mao, Wein ’21; Davis, Diaz, Wang ’21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra–Lenstra–Lovász lattice basis reduction method which achieves the statistically-optimal sample complexity of $d+1$ samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/zadik22a.html
Mean-field nonparametric estimation of interacting particle systems
This paper concerns the nonparametric estimation problem of the distribution-state dependent drift vector field in an interacting $N$-particle system. Observing single-trajectory data for each particle, we derive the mean-field rate of convergence for the maximum likelihood estimator (MLE), which depends on both Gaussian complexity and Rademacher complexity of the function class. In particular, when the function class contains $\alpha$-smooth Hölder functions, our rate of convergence is minimax optimal on the order of $N^{-\frac{\alpha}{d+2\alpha}}$. Combining with a Fourier analytical deconvolution estimator, we derive the consistency of MLE for the external force and interaction kernel in the McKean-Vlasov equation.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/yao22a.html
Eigenspace Restructuring: A Principle of Space and Frequency in Neural Networks
Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle resulting from hierarchical localities. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept {\it frequency}, which measures the order of interactions. We show that the topologies from deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept {\it space}, which measures the spatial distance among nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the network’s learnability, empowering it to simultaneously model a much richer class of interactions, including Long-Range-Low-Frequency interactions, Short-Range-High-Frequency interactions, and various interpolations and extrapolations in-between. Additionally, model scaling can improve the resolutions of interpolations and extrapolations and, therefore, the network’s learnability. Finally, we prove a sharp characterization of the generalization error for infinite-width CNNs (a.k.a. C-NTK and CNN-GP) of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can overcome the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/xiao22a.html
Multi-Agent Learning for Iterative Dominance Elimination: Formal Barriers and New Algorithms
Dominated actions are natural (and perhaps the simplest possible) multi-agent generalizations of sub-optimal actions as in standard single-agent decision making. Thus, similar to standard bandit learning, a fundamental learning question in multi-agent systems is whether agents can efficiently eliminate all iteratively dominated actions in an unknown game if they can only observe noisy bandit feedback about the payoff of their played actions. Surprisingly, despite the seemingly simple task, we show a quite negative result: standard no-regret algorithms (including the entire family of Dual Averaging algorithms) provably take exponentially many rounds to eliminate all iteratively dominated actions. Moreover, algorithms with the stronger no-swap-regret guarantee also suffer similar exponential inefficiency. To overcome these barriers, we develop a new algorithm that adjusts Exp3 with Diminishing Historical rewards (termed Exp3-DH); Exp3-DH gradually “forgets” history at carefully tailored rates. We prove that when all agents run Exp3-DH (a.k.a. self-play in multi-agent learning), all iteratively dominated actions can be eliminated within polynomially many rounds. Our experimental results further demonstrate the efficiency of Exp3-DH, and that state-of-the-art bandit algorithms, even those explicitly developed for learning in games, fail to eliminate all iteratively dominated actions efficiently.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/wu22a.html
Non-Convex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares
We consider potentially non-convex optimization problems, for which optimal rates of approximation depend on the dimension of the parameter space and the smoothness of the function to be optimized. In this paper, we propose an algorithm that achieves close to optimal a priori computational guarantees, while also providing a posteriori certificates of optimality. Our general formulation builds on infinite-dimensional sums-of-squares and Fourier analysis, and is instantiated on the minimization of periodic functions.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/woodworth22a.html
Multilevel Optimization for Inverse Problems
Inverse problems occur in a variety of parameter identification tasks in engineering. Such problems are challenging in practice, as they require repeated evaluation of computationally expensive forward models. We introduce a unifying framework of multilevel optimization that can be applied to a wide range of optimization-based solvers. Our framework provably reduces the computational cost associated with evaluating the expensive forward maps stemming from various physical models. To demonstrate the versatility of our analysis, we discuss its implications for various methodologies including multilevel (accelerated, stochastic) gradient descent, a multilevel ensemble Kalman inversion and a multilevel Langevin sampler. We also provide numerical experiments to verify our theoretical findings.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/weissmann22a.html
Random Graph Matching in Geometric Models: the Case of Complete Graphs
This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models. Specifically, given a random permutation $\pi^*$ on $[n]$ and $n$ iid pairs of correlated Gaussian vectors $\{X_{\pi^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, the edge weights are given by $A_{ij}=\kappa(X_i,X_j)$ and $B_{ij}=\kappa(Y_i,Y_j)$ for some link function $\kappa$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $\kappa(x,y)=\langle x, y \rangle$ and Euclidean distance model with $\kappa(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $\pi^*$ when $\sigma=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $\sigma=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of Dai et al. (2019) and Kunisky and Niles-Weed (2022) in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of Umeyama (1988) emerges as a further approximation to the maximum likelihood in the geometric model.
Tue, 28 Jun 2022 00:00:00 +0000
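The observation model in this abstract can be made concrete with a small generator. The following is a minimal sketch, not the authors' code: the abstract only says the pairs are "correlated Gaussian vectors with noise parameter $\sigma$", so the additive form $Y_i = X_{\pi^*(i)} + \sigma Z_i$ used here is an assumption, and the function names `sample_instance`, `dot` and `sq_dist` are hypothetical.

```python
import random


def dot(u, v):
    """Dot-product link function: kappa(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    """Euclidean-distance link function: kappa(x, y) = ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sample_instance(n, d, sigma, kappa=dot, seed=0):
    """Sample weight matrices A, B and the hidden permutation pi_star.

    ASSUMPTION: the correlation is modelled as Y_i = X_{pi*(i)} + sigma * Z_i
    with Z_i standard Gaussian; the abstract does not spell out this form.
    """
    rng = random.Random(seed)
    # Latent points X_1..X_n, i.i.d. standard Gaussian in R^d.
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # Hidden vertex correspondence pi*.
    pi_star = list(range(n))
    rng.shuffle(pi_star)
    # Noisy, relabelled copies of the latent points.
    Y = [[X[pi_star[i]][k] + sigma * rng.gauss(0.0, 1.0) for k in range(d)]
         for i in range(n)]
    # Complete weighted graphs: A_ij = kappa(X_i, X_j), B_ij = kappa(Y_i, Y_j).
    A = [[kappa(X[i], X[j]) for j in range(n)] for i in range(n)]
    B = [[kappa(Y[i], Y[j]) for j in range(n)] for i in range(n)]
    return A, B, pi_star
```

Under this assumed model, setting `sigma=0.0` makes `B` an exact relabelling of `A`, that is `B[i][j] == A[pi_star[i]][pi_star[j]]`, which is a convenient sanity check before testing any recovery procedure.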
https://proceedings.mlr.press/v178/wang22a.html
Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible: there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity, yielding a complexity which scales with the suboptimality gaps and the “reachability” of a state. We show that our algorithm is nearly minimax optimal and, on several examples, that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/wagenmaker22a.html
Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance
We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+\kappa)$-th moment, for some $\kappa \in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly, this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/vural22a.html
Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation
Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the role of the label noise in the training dynamics of a quadratically parametrised model through its continuous time version. We explicitly characterise the solution chosen by the stochastic flow and prove that it implicitly solves a Lasso program. To complete our analysis, we provide nonasymptotic convergence guarantees for the dynamics as well as conditions for support recovery. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and help explain the better performance of stochastic dynamics as observed in practice.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/vivien22a.html
(Nearly) Optimal Private Linear Regression for Sub-Gaussian Data via Adaptive Clipping
We study the problem of differentially private linear regression where each data point is sampled from a fixed sub-Gaussian style distribution. We propose and analyze a one-pass mini-batch stochastic gradient descent method (DP-AMBSSGD) where points in each iteration are sampled without replacement. Noise is added for DP but the noise standard deviation is estimated online. Compared to existing $(\epsilon, \delta)$-DP techniques which have sub-optimal error bounds, DP-AMBSSGD is able to provide nearly optimal error bounds in terms of key parameters like dimensionality $d$, number of points $N$, and the standard deviation $\sigma$ of the noise in observations. For example, when the $d$-dimensional covariates are sampled i.i.d. from the normal distribution, then the excess error of DP-AMBSSGD due to privacy is $\sigma^2 d/N(1+d/(\epsilon^2 N))$, i.e., the error is meaningful when the number of samples satisfies $N\geq d \log d$, which is the standard operative regime for linear regression. In contrast, error bounds for existing efficient methods in this setting are $d^3/(\epsilon^2 N^2)$, even for $\sigma=0$. That is, for constant $\epsilon$, the existing techniques require $N=d^{1.5}$ to provide a non-trivial result.
Tue, 28 Jun 2022 00:00:00 +0000
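The two error expressions quoted in this abstract are easy to compare numerically. The sketch below evaluates them as written, with all constants and logarithmic factors dropped, and reads $\sigma^2 d/N(1+d/(\epsilon^2 N))$ as $(\sigma^2 d/N)\cdot(1+d/(\epsilon^2 N))$; that reading, the function names, and the sample values of $d$, $N$ and $\epsilon$ are assumptions made for illustration.

```python
import math


def dp_ambssgd_excess_error(d, n, eps, sigma):
    """Leading-order expression from the abstract, read as
    (sigma^2 * d / N) * (1 + d / (eps^2 * N)); constants and logs omitted."""
    return (sigma ** 2) * d / n * (1.0 + d / (eps ** 2 * n))


def prior_excess_error(d, n, eps):
    """Expression quoted for existing efficient methods: d^3 / (eps^2 * N^2).
    It does not depend on sigma, so it stays the same even when sigma = 0."""
    return d ** 3 / (eps ** 2 * n ** 2)


if __name__ == "__main__":
    d, eps = 100, 1.0
    n = int(round(d ** 1.5))  # N = d^{1.5}, where the older expression equals 1
    for sigma in (0.0, 1.0):
        new = dp_ambssgd_excess_error(d, n, eps, sigma)
        old = prior_excess_error(d, n, eps)
        print(f"sigma={sigma}: DP-AMBSSGD expression={new:.4f}, prior expression={old:.4f}")
```

At $N = d^{1.5}$ with constant $\epsilon$ the older expression is exactly 1, which matches the abstract's remark that existing techniques need $N = d^{1.5}$ for a non-trivial result, while the DP-AMBSSGD expression vanishes as $\sigma \to 0$.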
https://proceedings.mlr.press/v178/varshney22a.html
Accelerated SGD for Non-Strongly-Convex Least Squares
We consider stochastic approximation for the least squares regression problem in the non-strongly convex setting. We present the first practical algorithm that achieves the optimal prediction error rates in terms of dependence on the noise of the problem, as $O(d/t)$, while accelerating the forgetting of the initial conditions to $O(d/t^2)$. Our new algorithm is based on a simple modification of the accelerated gradient descent. We provide convergence results for both the averaged and the last iterate of the algorithm. In order to describe the tightness of these new bounds, we present a matching lower bound in the noiseless setting and thus show the optimality of our algorithm.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/varre22a.html
Width is Less Important than Depth in ReLU Neural Networks
We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network’s architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.
Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/vardi22a.html
https://proceedings.mlr.press/v178/vardi22a.htmlLearning to Control Linear Systems can be HardIn this paper, we study the statistical difficulty of learning to control linear systems. We focus on two standard benchmarks, the sample complexity of stabilization, and the regret of the online learning of the Linear Quadratic Regulator (LQR). Prior results state that the statistical difficulty for both benchmarks scales polynomially with the system state dimension up to system-theoretic quantities. However, this does not reveal the whole picture. By utilizing minimax lower bounds for both benchmarks, we prove that there exist non-trivial classes of systems for which learning complexity scales dramatically, i.e. exponentially, with the system dimension. This situation arises in the case of underactuated systems, i.e. systems with fewer inputs than states. Such systems are structurally difficult to control and their system theoretic quantities can scale exponentially with the system dimension dominating learning complexity. Under some additional structural assumptions (bounding systems away from uncontrollability), we provide qualitatively matching upper bounds. We prove that learning complexity can be at most exponential with the controllability index of the system, that is the degree of underactuation.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/tsiamis22a.html
https://proceedings.mlr.press/v178/tsiamis22a.htmlRisk bounds for aggregated shallow neural networks using Gaussian priorsAnalysing statistical properties of neural networks is a central topic in statistics and machine learning. However, most results in the literature focus on the properties of the neural network minimizing the training error. The goal of this paper is to consider aggregated neural networks using a Gaussian prior. The departure point of our approach is an arbitrary aggregate satisfying the PAC-Bayesian inequality. The main contribution is a precise nonasymptotic assessment of the estimation error appearing in the PAC-Bayes bound. Our analysis is sharp enough to lead to minimax rates of estimation over Sobolev smoothness classes.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/tinsi22a.html
https://proceedings.mlr.press/v178/tinsi22a.htmlTwo-Sided Weak Submodularity for Matroid Constrained Optimization and RegressionWe study the following problem: Given a variable of interest, we would like to find a best linear predictor for it by choosing a subset of k relevant variables obeying a matroid constraint. This problem is a natural generalization of subset selection problems where it is necessary to spread observations amongst multiple different classes. We derive new, strengthened guarantees for this problem by improving the analysis of the residual random greedy algorithm and by developing a novel distorted local-search algorithm. To quantify our approximation guarantees, we refine the definition of weak submodularity by Das and Kempe (2011) and introduce the notion of an upper submodularity ratio, which we connect to the minimum k-sparse eigenvalue of the covariance matrix. More generally, we look at the problem of maximizing a set function f with lower and upper submodularity ratio $\gamma$ and $\beta$ under a matroid constraint. For this problem, our algorithms have asymptotic approximation guarantee 1/2 and (1 - 1/e) as the function is closer to being submodular. As a second application, we show that the Bayesian A-optimal design objective falls into our framework, leading to new guarantees for this problem as well.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/thiery22a.html
https://proceedings.mlr.press/v178/thiery22a.htmlStochastic linear optimization never overfits with quadratically-bounded losses on general dataThis work provides test error bounds for iterative fixed point methods on linear predictors, specifically stochastic and batch mirror descent (MD) and stochastic temporal difference learning (TD), with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on approximately mixing Markov chains (all prior stochastic TD bounds are in expectation).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/telgarsky22a.html
https://proceedings.mlr.press/v178/telgarsky22a.htmlMinimax Regret on Patterns Using Kullback-Leibler Divergence CoveringThis paper considers the problem of finding a tighter upper bound on the minimax regret of patterns, a class used to study large-alphabet distributions which avoids infinite asymptotic regret and redundancy. Our method for finding upper bounds for minimax regret uses cover numbers with Kullback-Leibler (KL) divergence as the distance. Compared to existing results by Acharya et al. (2013), we are able to improve the power of the exponent on the logarithmic term, giving a minimax regret bound which matches the best known minimax redundancy bound on patterns.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/tang22a.html
https://proceedings.mlr.press/v178/tang22a.htmlTracking Most Significant Arm Switches in BanditsIn \emph{bandits with distribution shifts}, one aims to automatically adapt to unknown changes in reward distribution, and \emph{restart} exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually \emph{severe}. To resolve this question, we propose a new notion of \emph{significant shift}, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of \emph{total variation} $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/suk22a.html
https://proceedings.mlr.press/v178/suk22a.htmlOn characterizations of learnability with computable learnersWe study computable PAC (CPAC) learning as introduced by Agarwal et al. (2020). First, we consider the main open question of finding characterizations of proper and improper CPAC learning. We give a characterization of a closely related notion of *strong* CPAC learning, and we provide a negative answer to the COLT open problem posed by Agarwal et al. (2021) whether all decidably representable PAC learnable classes are improperly CPAC learnable. Second, we consider undecidability of (computable) PAC learnability. We give a simple general argument to exhibit such undecidability, and we initiate a study of the arithmetical complexity of learnability. We briefly discuss the relation to the undecidability result of Ben-David et al. (2019), that motivated the work of Agarwal et al.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/sterkenburg22a.html
https://proceedings.mlr.press/v178/sterkenburg22a.htmlSelf-Consistency of the Fokker Planck EquationThe Fokker-Planck equation (FPE) is the partial differential equation that governs the density evolution of the Ito process and is of great importance to the literature of statistical physics and machine learning. The FPE can be regarded as a continuity equation where the change of the density is completely determined by a time varying velocity field. Importantly, this velocity field also depends on the current density function. As a result, the ground-truth velocity field can be shown to be the solution of a fixed-point equation, a property that we call self-consistency. In this paper, we exploit this concept to design a potential function of the hypothesis velocity fields, and prove that, if such a function diminishes to zero during the training procedure, the trajectory of the densities generated by the hypothesis velocity fields converges to the solution of the FPE in the Wasserstein-2 sense. The proposed potential function is amenable to neural-network based parameterization as the stochastic gradient with respect to the parameter can be efficiently computed. Once a parameterized model, such as a Neural Ordinary Differential Equation, is trained, we can generate the entire solution trajectory of the FPE.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/shen22a.html
https://proceedings.mlr.press/v178/shen22a.htmlThe Implicit Bias of Benign OverfittingThe phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond simple linear regression setups. In this paper, we show that for regression, benign overfitting is “biased” towards certain types of problems, in the sense that its existence on one learning problem precludes its existence on other learning problems. On the negative side, we use this to argue that one should not expect benign overfitting to occur in general, for several natural extensions of the plain linear regression problems studied so far. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we consider a model where an arbitrary input distribution of some fixed dimension k is concatenated with a high-dimensional distribution, and prove that the max-margin predictor (to which gradient-based methods are known to converge in direction) is asymptotically biased towards minimizing the expected \emph{squared hinge loss} w.r.t. the k-dimensional distribution. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/shamir22a.html
https://proceedings.mlr.press/v178/shamir22a.htmlRate-Distortion Theoretic Generalization Bounds for Stochastic Learning AlgorithmsUnderstanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the ’compression error rate’ can be linked to the generalization error both in expectation and with high probability. We show that in the ’lossless compression’ setting, we recover and improve existing mutual information-based bounds, whereas a ’lossy compression’ scheme allows us to link generalization to the rate-distortion dimension - a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/sefidgaran22a.html
https://proceedings.mlr.press/v178/sefidgaran22a.htmlStability vs Implicit Bias of Gradient Methods on Separable Data and BeyondAn influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such methods to generalize well has been attributed to their implicit bias towards large margin predictors, both asymptotically as well as in finite time. We give an additional unified explanation for this generalization and relate it to two simple properties of the optimization objective, that we refer to as realizability and self-boundedness. We introduce a general setting of unconstrained stochastic convex optimization with these properties, and analyze generalization of gradient methods through the lens of algorithmic stability. In this broader setting, we obtain sharp stability bounds for gradient descent and stochastic gradient descent which apply even for a very large number of gradient steps, and use them to derive general generalization bounds for these algorithms. Finally, as direct applications of the general bounds, we return to the setting of linear classification with separable data and establish several novel test loss and test accuracy bounds for gradient descent and stochastic gradient descent for a variety of loss functions with different tail decay rates. In some of these cases, our bounds significantly improve upon the existing generalization error bounds in the literature.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/schliserman22a.html
https://proceedings.mlr.press/v178/schliserman22a.htmlOptimization-Based Separations for Neural NetworksDepth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what are, to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/safran22a.html
https://proceedings.mlr.press/v178/safran22a.htmlOn the Role of Channel Capacity in Learning Gaussian Mixture ModelsThis paper studies the sample complexity of learning the $k$ unknown centers of a balanced Gaussian mixture model (GMM) in $\mathbb{R}^d$ with spherical covariance matrix $\sigma^2\bm{I}$. In particular, we are interested in the following question: what is the maximal noise level $\sigma^2$, for which the sample complexity is essentially the same as when estimating the centers from labeled measurements? To that end, we restrict attention to a Bayesian formulation of the problem, where the centers are uniformly distributed on the sphere $\sqrt{d}\mathcal{S}^{d-1}$. Our main results characterize the \emph{exact noise threshold} $\sigma^2$ below which the GMM learning problem, in the large system limit $d,k\to\infty$, is as easy as learning from labeled observations, and above which it is substantially harder. The threshold occurs at $\frac{\log k}{d} = \frac12\log\left( 1+\frac{1}{\sigma^2} \right)$, which is the capacity of the additive white Gaussian noise (AWGN) channel. Thinking of the set of $k$ centers as a code, this noise threshold can be interpreted as the largest noise level for which the error probability of the code over the AWGN channel is small. Previous works on the GMM learning problem have identified the \emph{minimum distance} between the centers as a key parameter in determining the statistical difficulty of learning the corresponding GMM. While our results are only proved for GMMs whose centers are uniformly distributed over the sphere, they hint that perhaps it is the decoding error probability associated with the center constellation as a channel code that determines the statistical difficulty of learning the corresponding GMM, rather than just the minimum distance.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/romanov22a.html
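The threshold stated in the abstract above can be inverted in closed form: solving $\frac{\log k}{d} = \frac12\log\left(1+\frac{1}{\sigma^2}\right)$ for the noise level gives $\sigma^2 = 1/(k^{2/d}-1)$. The following is a minimal Python sketch of that arithmetic only. The function names are illustrative and do not come from the paper, and natural logarithms are assumed.

```python
import math


def awgn_capacity(sigma2: float) -> float:
    """Capacity in nats per dimension of an AWGN channel with unit signal power."""
    return 0.5 * math.log(1.0 + 1.0 / sigma2)


def noise_threshold(k: int, d: int) -> float:
    """Noise level sigma^2 at which log(k)/d equals the AWGN capacity.

    Assumes k >= 2 so that the rate log(k)/d is strictly positive.
    """
    rate = math.log(k) / d
    # 1/sigma^2 = exp(2 * rate) - 1; expm1 keeps precision when the rate is small.
    return 1.0 / math.expm1(2.0 * rate)


# Illustrative values: k = 1024 centers in dimension d = 20.
sigma2_star = noise_threshold(1024, 20)
print(sigma2_star, awgn_capacity(sigma2_star), math.log(1024) / 20)
```

At the returned noise level the last two printed numbers agree up to floating-point error, which is simply a restatement of the threshold equation.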
https://proceedings.mlr.press/v178/romanov22a.htmlImproved Parallel Algorithm for Minimum Cost Submodular Cover ProblemIn the minimum cost submodular cover problem (MinSMC), we are given a monotone nondecreasing submodular function $f\colon 2^V \rightarrow \mathbb{Z}^+$, a linear cost function $c\colon V\rightarrow \mathbb R^{+}$, and an integer $k\leq f(V)$; the goal is to find a subset $A\subseteq V$ with the minimum cost such that $f(A)\geq k$. The MinSMC can be found at the heart of many machine learning and data mining applications. In this paper, we design a parallel algorithm for the MinSMC that takes at most $O(\frac{\log (km)\log k(\log m+\log\log (mk))}{\varepsilon^4})$ adaptive rounds, and it achieves an approximation ratio of $\frac{H(\min\{\Delta,k\})}{1-5\varepsilon}$ with probability at least $1-3\varepsilon$, where $\Delta=\max_{v\in V}f(v)$, $H(\cdot)$ is the Harmonic number, $m=|V|$, and $\varepsilon$ is a constant in $(0,\frac{1}{5})$.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/ran22a.html
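To make the MinSMC problem statement concrete, here is a sketch of the classical sequential greedy heuristic, which repeatedly picks the element with the best capped marginal gain per unit cost. This is not the parallel algorithm of the paper; it only illustrates the objects $f$, $c$, and $k$ from the abstract. The coverage function and all names in the example are assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, Hashable, List, Set


def greedy_submodular_cover(
    ground: List[Hashable],
    f: Callable[[FrozenSet], int],
    cost: Dict[Hashable, float],
    k: int,
) -> Set[Hashable]:
    """Sequential greedy for min-cost submodular cover (costs must be positive)."""
    chosen: Set[Hashable] = set()
    value = f(frozenset(chosen))
    while value < k:
        best, best_ratio = None, 0.0
        for v in ground:
            if v in chosen:
                continue
            # Marginal gain, capped at k so overshooting earns no extra credit.
            gain = min(f(frozenset(chosen | {v})), k) - min(value, k)
            ratio = gain / cost[v]
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:
            raise ValueError("k is larger than f(V); no feasible cover exists")
        chosen.add(best)
        value = f(frozenset(chosen))
    return chosen


# Illustrative coverage instance: f(A) counts the items covered by the sets in A.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}


def coverage(A: FrozenSet) -> int:
    return len(set().union(*(sets[v] for v in A))) if A else 0


unit_cost = {v: 1.0 for v in sets}
print(greedy_submodular_cover(list(sets), coverage, unit_cost, k=6))
```

Each pass of the loop depends on the previous choice, so the number of adaptive rounds grows with the size of the solution. That sequential dependence is what a low-adaptivity parallel algorithm is designed to avoid.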
https://proceedings.mlr.press/v178/ran22a.htmlClustering with Queries under Semi-Random NoiseThe seminal paper by Mazumdar and Saha (2017a) introduced an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact probabilities of errors of the underlying fully-random oracle. In this work, we develop robust learning methods that tolerate general semi-random noise obtaining qualitatively the same guarantees as the best possible methods in the fully-random model. More specifically, given a set of n points with an unknown underlying partition, we are allowed to query pairs of points u,v to check if they are in the same cluster, but with probability p, the answer may be adversarially chosen. We show that information theoretically O(nk log n /(1-2p)^2) queries suffice to learn any cluster of sufficiently large size. Our main result is a computationally efficient algorithm that can identify large clusters with O(nk log n/ (1-2p)^2) + poly(log n, k, 1/(1-2p)) queries, matching the guarantees of the best known algorithms in the fully-random model. As a corollary of our approach, we develop the first parameter-free algorithm for the fully-random model, answering an open question in Mazumdar and Saha (2017a).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/pia22a.html
https://proceedings.mlr.press/v178/pia22a.htmlThe Structured Abstain Problem and the Lovász HingeThe Lovász hinge is a convex surrogate recently proposed for structured binary classification, in which k binary predictions are made simultaneously and the error is judged by a submodular set function. Despite its wide usage in image segmentation and related problems, its consistency has remained open. We resolve this open question, showing that the Lovász hinge is inconsistent for its desired target unless the set function is modular. Leveraging a recent embedding framework, we instead derive the target loss for which the Lovász hinge is consistent. This target, which we call the structured abstain problem, allows one to abstain on any subset of the k predictions. We derive two link functions, each of which is consistent for all submodular set functions simultaneously.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/nueve22a.html
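The Lovász hinge discussed above is built on the Lovász extension of a set function. As background, the sketch below computes that extension with the standard sorting formula: sort coordinates in decreasing order and weight each one by the marginal value of adding its index. It does not implement the hinge itself, the link functions, or the structured abstain target from the paper, and the example set function is an assumption chosen for illustration.

```python
from typing import Callable, FrozenSet, Sequence


def lovasz_extension(F: Callable[[FrozenSet[int]], float], u: Sequence[float]) -> float:
    """Lovász extension of a set function F (with F(empty set) = 0) at the point u."""
    # Visit coordinates from largest to smallest value of u.
    order = sorted(range(len(u)), key=lambda i: -u[i])
    total = 0.0
    prefix = set()
    prev = F(frozenset())
    for i in order:
        prefix.add(i)
        cur = F(frozenset(prefix))
        # Weight u[i] by the marginal value of adding index i to the prefix.
        total += u[i] * (cur - prev)
        prev = cur
    return total


def any_error(S: FrozenSet[int]) -> float:
    """Illustrative submodular set function: 1 if any prediction is wrong, else 0."""
    return 1.0 if S else 0.0


# For this F and nonnegative u, the extension equals max(u).
print(lovasz_extension(any_error, [0.2, 0.7, 0.5]))
```

On indicator vectors the extension agrees with the set function, and it is convex exactly when the set function is submodular, which is why it is a natural starting point for convex surrogates.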
https://proceedings.mlr.press/v178/nueve22a.htmlOptimal SQ Lower Bounds for Learning Halfspaces with Massart NoiseWe give tight statistical query (SQ) lower bounds for learning halfspaces in the presence of Massart noise. In particular, suppose that all labels are corrupted with probability at most $\eta$. We show that for arbitrary $\eta \in [0,1/2]$ every SQ algorithm achieving misclassification error better than $\eta$ requires queries of superpolynomial accuracy or at least a superpolynomial number of queries. Further, this continues to hold even if the information-theoretically optimal error $\mathrm{OPT}$ is as small as $\exp\left(-\log^c(d)\right)$, where $d$ is the dimension and $0 < c < 1$ is an arbitrary absolute constant, and an overwhelming fraction of examples are noiseless. Our lower bound matches known polynomial time algorithms, which are also implementable in the SQ framework. Previously, such lower bounds only ruled out algorithms achieving error $\mathrm{OPT} + \epsilon$ or error better than $\Omega(\eta)$ or, if $\eta$ is close to $1/2$, error $\eta - o_\eta(1)$, where the term $o_\eta(1)$ is constant in $d$ but going to 0 for $\eta$ approaching $1/2$. As a consequence, we also show that achieving misclassification error better than $1/2$ in the $(A,\alpha)$-Tsybakov model is SQ-hard for $A$ constant and $\alpha$ bounded away from 1.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/nasser22a.html
https://proceedings.mlr.press/v178/nasser22a.htmlPrivate High-Dimensional Hypothesis TestingWe provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for $d$-dimensional Gaussian distributions with known covariance $\Sigma$, we can test whether the distribution comes from $\mathcal{N}(\mu^*, \Sigma)$ for some fixed $\mu^*$ or from some $\mathcal{N}(\mu, \Sigma)$ with total variation distance at least $\alpha$ from $\mathcal{N}(\mu^*, \Sigma)$ with $(\varepsilon, 0)$-differential privacy, using only \[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/3}}{\alpha^{4/3} \cdot \varepsilon^{2/3}} + \frac{1}{\alpha \cdot \varepsilon}\right)\] samples if the algorithm is allowed to be computationally inefficient, and only \[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/4}}{\alpha \cdot \varepsilon}\right)\] samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over $\{-1, 1\}^d$, and tolerant testing. Our results improve over the previous best work of Canonne et al. (2020) for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of $O\left(\frac{\sqrt{d}}{\alpha^2}\right)$ in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of $d$-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size $d$ (Acharya et al., 2018), which refutes a conjectured lower bound of Canonne et al. (2020).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/narayanan22a.html
https://proceedings.mlr.press/v178/narayanan22a.htmlDimension-free convergence rates for gradient Langevin dynamics in RKHSGradient Langevin dynamics (GLD) and stochastic GLD (SGLD) have attracted considerable attention lately, as a way to provide convergence guarantees in a non-convex setting. However, the known rates grow exponentially with the dimension of the space under the dissipative condition. In this work, we provide a convergence analysis of GLD and SGLD when the optimization space is an infinite-dimensional Hilbert space. More precisely, we derive non-asymptotic, dimension-free convergence rates for GLD/SGLD when performing regularized non-convex optimization in a reproducing kernel Hilbert space. Amongst others, the convergence analysis relies on the properties of a stochastic differential equation, its discrete time Galerkin approximation and the geometric ergodicity of the associated Markov chains.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/muzellec22a.html
https://proceedings.mlr.press/v178/muzellec22a.htmlOptimal and instance-dependent guarantees for Markovian linear stochastic approximationWe study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise—covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$—and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/mou22a.html
https://proceedings.mlr.press/v178/mou22a.htmlUniversality of empirical risk minimizationConsider supervised learning from i.i.d. samples $\{(y_i, x_i)\}_{i \le n}$ where $x_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $k = O(1)$ vectors $\theta_1, \dots, \theta_k \in \mathbb{R}^p$, and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n, p \to \infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed, to leading order, under a simpler model in which the feature vectors $x_i$ are replaced by Gaussian vectors $g_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors $x_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors $x_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/montanari22a.html
https://proceedings.mlr.press/v178/montanari22a.htmlWasserstein GANs with Gradient Penalty Compute Congested TransportWasserstein GANs with Gradient Penalty (WGAN-GP) are a very popular method for training generative models to produce high quality synthetic data. While WGAN-GP were initially developed to calculate the Wasserstein 1 distance between generated and real data, recent works (e.g. [23]) have provided empirical evidence that this does not occur, and have argued that WGAN-GP perform well not in spite of this issue, but because of it. In this paper we show for the first time that WGAN-GP compute the minimum of a different optimal transport problem, the so-called congested transport [7]. Congested transport determines the cost of moving one distribution to another under a transport model that penalizes congestion. For WGAN-GP, we find that the congestion penalty has a spatially varying component determined by the sampling strategy used in [12] which acts like a local speed limit, making congestion cost less in some regions than others. This aspect of the congested transport problem is new, in that the congestion penalty turns out to be unbounded and depends on the distributions to be transported, and so we provide the necessary mathematical proofs for this setting. One facet of our discovery is a formula connecting the gradient of solutions to the optimization problem in WGAN-GP to the time averaged momentum of the optimal mass flow. This is in contrast to the gradient of Kantorovich potentials for the Wasserstein 1 distance, which is just the normalized direction of flow. Based on this and other considerations, we speculate on how our results explain the observed performance of WGAN-GP. Beyond applications to GANs, our theorems also point to the possibility of approximately solving large scale congested transport problems using neural network techniques.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/milne22a.html
https://proceedings.mlr.press/v178/milne22a.htmlDamped Online Newton Step for Portfolio SelectionWe revisit the classic online portfolio selection problem, where at each round a learner selects a distribution over a set of portfolios to allocate its wealth. It is known that for this problem a logarithmic regret with respect to Cover’s loss is achievable using the Universal Portfolio Selection algorithm, for example. However, all existing algorithms that achieve a logarithmic regret for this problem have per-round time and space complexities that scale polynomially with the total number of rounds, making them impractical. In this paper, we build on the recent work by Luo et al. (2018) and present the first practical online portfolio selection algorithm with a logarithmic regret and whose per-round time and space complexities depend only logarithmically on the horizon. Behind our approach are two key technical novelties. First, we show that the Damped Online Newton steps can approximate mirror descent iterates well, even when dealing with time-varying regularizers. Second, we present a new meta-algorithm that achieves an adaptive, logarithmic regret (i.e. a logarithmic regret on any sub-interval) for mixable losses.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/mhammedi22b.html
https://proceedings.mlr.press/v178/mhammedi22b.htmlEfficient Projection-Free Online Convex Optimization with Membership OracleIn constrained convex optimization, existing interior point methods do not scale well with the dimension of the ambient space. Alternative approaches such as Projected Gradient Descent only provide a computational benefit for simple convex sets where Euclidean projections can be performed efficiently, such as Euclidean balls. For other more complex sets, the cost of the projections can be too high. To circumvent these issues, alternative methods based on the famous Frank-Wolfe algorithm have been studied and widely used. Such methods use a Linear Optimization Oracle at each iteration instead of Euclidean projections; the former can often be performed efficiently. Such methods have also been extended to the online and stochastic optimization settings. However, the Frank-Wolfe algorithm and its variants do not achieve the optimal performance, in terms of regret or rate, for general convex sets. What is more, the Linear Optimization Oracle they use can still be computationally expensive in some cases. In this paper, we move away from Frank-Wolfe style algorithms and present a new reduction that turns any algorithm $\mathsf{A}$ over a Euclidean ball (where projections are cheap) into an algorithm over a general convex constraint set $\mathcal{C}$ contained within the ball, without sacrificing the performance of the original algorithm $\mathsf{A}$ by much. Our reduction requires $O(T \ln T)$ calls to a Membership Oracle on $\mathcal{C}$ after $T$ rounds, and no linear optimization on $\mathcal{C}$ is needed. Using this reduction, we recover optimal regret bounds [resp. rates], in terms of the number of iterations, in online [resp. stochastic] convex optimization. Our guarantees are also useful in the offline convex optimization setting when the dimension of the ambient space is large.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/mhammedi22a.html
https://proceedings.mlr.press/v178/mhammedi22a.htmlScale-free Unconstrained Online Learning for Curved LossesA sequence of works in unconstrained online convex optimisation has investigated the possibility of adapting simultaneously to the norm U of the comparator and the maximum norm G of the gradients. In full generality, matching upper and lower bounds are known which show that this comes at the unavoidable cost of an additive GU^3, which is not needed when either G or U is known in advance. Surprisingly, recent results by Kempka et al. (2019) show that no such price for adaptivity is needed in the specific case of 1-Lipschitz losses like the hinge loss. We follow up on this observation by showing that there is in fact never a price to pay for adaptivity if we specialise to any of the other common supervised online learning losses: our results cover log loss, (linear and non-parametric) logistic regression, square loss prediction, and (linear and non-parametric) least-squares regression. We also fill in several gaps in the literature by providing matching lower bounds with an explicit dependence on U. In all cases we obtain scale-free algorithms, which are suitably invariant under rescaling of the data. Our general goal is to establish achievable rates without concern for computational efficiency, but for linear logistic regression we also provide an adaptive method that is as efficient as the recent non-adaptive algorithm by Agarwal et al. (2021).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/mayo22a.html
https://proceedings.mlr.press/v178/mayo22a.htmlEfficient Convex Optimization Requires Superlinear MemoryWe show that any memory-constrained, first-order algorithm which minimizes $d$-dimensional, $1$-Lipschitz convex functions over the unit ball to $1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - \delta}$ bits of memory must make at least $\Omega(d^{1 + (4/3)\delta})$ first-order queries (for any constant $\delta \in [0, 1/4]$). Consequently, the performance of such memory-constrained algorithms is a polynomial factor worse than the optimal $\tilde{O}(d)$ query bound for this problem obtained by cutting plane methods that use $\tilde{O}(d^2)$ memory. This resolves one of the open problems in the COLT 2019 open problem publication of Woodworth and Srebro.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/marsden22a.html
https://proceedings.mlr.press/v178/marsden22a.htmlStrategizing against Learners in Bayesian GamesWe study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/mansour22a.html
https://proceedings.mlr.press/v178/mansour22a.htmlPrivate Matrix Approximation and Geometry of Unitary OrbitsConsider the following optimization problem: Given $n \times n$ matrices $A$ and $\Lambda$, maximize $\langle A, U\Lambda U^* \rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $\Lambda$ and, by setting $\Lambda$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users’ private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/mangoubi22a.html
https://proceedings.mlr.press/v178/mangoubi22a.htmlComplete Policy Regret Bounds for Tallying BanditsPolicy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the complete policy regret, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi-armed bandit, which we call the tallying bandit. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $K$ actions and time horizon $T$, we provide an algorithm that with high probability achieves a complete policy regret guarantee of $\tilde{O}(m K \sqrt{T})$, where the $\tilde{O}$ notation hides only logarithmic factors. We additionally prove an $\tilde{\Omega}(\sqrt{m K T})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near optimality of our method.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/malik22a.html
https://proceedings.mlr.press/v178/malik22a.htmlStreaming Algorithms for Ellipsoidal Approximation of Convex PolytopesWe give efficient deterministic one-pass streaming algorithms for finding an ellipsoidal approximation of a symmetric convex polytope. The algorithms are near-optimal in that their approximation factors differ from that of the optimal offline solution only by a factor sub-logarithmic in the aspect ratio of the polytope.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/makarychev22a.html
https://proceedings.mlr.press/v178/makarychev22a.htmlCorralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear BanditsWe consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The Corral algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\tilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is $\mathrm{poly}(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\tilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/luo22b.html
https://proceedings.mlr.press/v178/luo22b.htmlAdaptive Bandit Convex Optimization with Heterogeneous CurvatureWe consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them. While all previous works assume known and homogeneous curvature on these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or even improves existing results for several homogeneous settings, but also leads to surprising results for some heterogeneous settings. For example, while Hazan and Levy (2014) showed that $\tilde{O}(d^{\frac{3}{2}}\sqrt{T})$ regret is achievable for a sequence of $T$ smooth and strongly convex $d$-dimensional functions, our algorithm reveals that the same is achievable even if $T^{\frac{3}{4}}$ of them are not strongly convex, and sometimes even if a constant fraction of them are not strongly convex. Our approach is inspired by the framework of Bartlett et al. (2007) who studied a similar heterogeneous setting but with stronger gradient feedback. Extending their framework to the bandit feedback setting requires novel ideas such as lifting the feasible domain and using a logarithmically homogeneous self-concordant barrier regularizer.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/luo22a.html
https://proceedings.mlr.press/v178/luo22a.htmlGeneralization Bounds via Convex AnalysisSince the celebrated works of Russo and Zou (2016, 2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon’s mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of p-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/lugosi22a.html
https://proceedings.mlr.press/v178/lugosi22a.htmlBetter Private Algorithms for Correlation ClusteringIn machine learning, correlation clustering is an important problem whose goal is to partition the individuals into groups that correlate with their pairwise similarities as much as possible. In this work, we revisit correlation clustering under differential privacy constraints. In particular, we improve previous results and achieve an $\tilde{O}(n^{1.5})$ additive error compared to the optimal cost in expectation on general graphs. As for unweighted complete graphs, we improve the results further and propose a more involved algorithm which achieves $\tilde{O}(n \sqrt{\Delta^*})$ additive error, where $\Delta^*$ is the maximum degree of positive edges among all nodes.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22h.html
https://proceedings.mlr.press/v178/liu22h.htmlOrthogonal Statistical Learning with Self-Concordant LossOrthogonal statistical learning and double machine learning have emerged as general frameworks for two-stage statistical prediction in the presence of a nuisance component. We establish non-asymptotic bounds on the excess risk of orthogonal statistical learning methods with a loss function satisfying a self-concordance property. Our bounds improve upon existing bounds by a dimension factor while lifting the assumption of strong convexity. We illustrate the results with examples from multiple treatment effect estimation and generalized partially linear modeling.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22g.html
https://proceedings.mlr.press/v178/liu22g.htmlWhen Is Partially Observable Reinforcement Learning Not Scary?Partial observability is ubiquitous in applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system. Partially observable RL is notoriously difficult in theory: well-known complexity-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the possible existence of interesting subclasses of POMDPs, which include a large set of partially observable applications in practice while being tractable. In this paper we identify a rich family of tractable POMDPs, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs with non-informative observations. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee a polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning in overcomplete POMDPs (where the number of latent states can be larger than the number of observations) in settings where exploration is necessary.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22f.html
https://proceedings.mlr.press/v178/liu22f.htmlThe Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no CommunicationWe study the stochastic multi-player multi-armed bandit problem. In this problem, there are $m$ players and $K > m$ arms and the players cooperate to maximize their total reward. However, the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $(m+1)$-st best arms. Such guarantees were recently achieved by Pacchiano et al. (2021) and Huang et al. (2021) in a model in which the players are able to implicitly communicate through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck et al. (2021) and enjoys the same strong no-collision property, while our lower bound is completely new.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22e.html
https://proceedings.mlr.press/v178/liu22e.htmlOn Almost Sure Convergence Rates of Stochastic Gradient MethodsThe vast majority of convergence rate analyses for stochastic gradient methods in the literature focus on convergence in expectation, whereas trajectory-wise almost sure convergence is clearly important to ensure that any instantiation of the stochastic algorithms would converge with probability one. Here we provide a unified almost sure convergence rate analysis for stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov’s accelerated gradient (SNAG) methods. We show, for the first time, that the almost sure convergence rates obtained for these stochastic gradient methods on strongly convex functions are arbitrarily close to the optimal convergence rates possible. For non-convex objective functions, we show not only that a weighted average of the squared gradient norms converges to zero almost surely, but also that the last iterates of the algorithms do. We further provide a last-iterate almost sure convergence rate analysis for stochastic gradient methods on weakly convex smooth functions, in contrast with most existing results in the literature that only provide convergence in expectation for a weighted average of the iterates.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22d.html
https://proceedings.mlr.press/v178/liu22d.htmlLearning GMMs with Nearly Optimal Robustness GuaranteesIn this work we solve the problem of robustly learning a high-dimensional Gaussian mixture model with $k$ components from $\epsilon$-corrupted samples up to accuracy $\widetilde{O}(\epsilon)$ in total variation distance for any constant $k$ and with mild assumptions on the mixture. This robustness guarantee is optimal up to polylogarithmic factors. The main challenge is that most earlier works rely on learning individual components in the mixture, but this is impossible in our setting, at least for the types of strong robustness guarantees we are aiming for. Instead we introduce a new framework which we call strong observability that gives us a route to circumvent this obstacle.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22c.html
https://proceedings.mlr.press/v178/liu22c.htmlDifferential privacy and robust statistics in high dimensionsWe introduce a universal framework for characterizing the statistical efficiency of a statistical estimation problem with differential privacy guarantees. Our framework, which we call High-dimensional Propose-Test-Release (HPTR), builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism. Connecting all these together is the concept of resilience, which is central to robust statistical estimation. Resilience guides the design of the algorithm, the sensitivity analysis, and the success probability analysis of the test step in Propose-Test-Release. The key insight is that if we design an exponential mechanism that accesses the data only via one-dimensional and robust statistics, then the resulting local sensitivity can be dramatically reduced. Using resilience, we can provide tight local sensitivity bounds. These tight bounds readily translate into near-optimal utility guarantees in several cases. We give a general recipe for applying HPTR to a given instance of a statistical estimation problem and demonstrate it on canonical problems of mean estimation, linear regression, covariance estimation, and principal component analysis. We introduce a general utility analysis technique that proves that HPTR achieves near-optimal sample complexity under several scenarios studied in the literature.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22b.html
https://proceedings.mlr.press/v178/liu22b.htmlTight query complexity bounds for learning graph partitionsGiven a partition of a graph into connected components, the membership oracle asserts whether any two vertices of the graph lie in the same component or not. We prove that for $n\ge k\ge 2$, learning the components of an $n$-vertex hidden graph with $k$ components requires at least $(k-1)n-\binom k2$ membership queries. Our result improves on the best known information-theoretic bound of $\Omega(n\log k)$ queries, and exactly matches the query complexity of the algorithm introduced by [Reyzin and Srivastava, 2007] for this problem. Additionally, we introduce an oracle that can learn the number of components of $G$ in asymptotically fewer queries than learning the full partition, thus answering another question posed by the same authors. Lastly, we introduce a more applicable version of this oracle, and prove asymptotically tight bounds of $\widetilde\Theta(m)$ queries for both learning and verifying an $m$-edge hidden graph $G$ using it.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/liu22a.html
https://proceedings.mlr.press/v178/liu22a.htmlStatistical Estimation and Online Inference via Local SGDWe analyze the novel Local SGD in federated learning, a multi-round estimation procedure that uses intermittent communication to improve communication efficiency. Under a $2{+}\delta$ moment condition on stochastic gradients, we first establish a functional central limit theorem that shows the averaged iterates of Local SGD converge weakly to a rescaled Brownian motion. We next provide two iterative inference methods: the plug-in and the random scaling. Random scaling constructs an asymptotically pivotal statistic for inference by using the information along the whole Local SGD path. Both methods are communication efficient and applicable to online data. Our results show that Local SGD simultaneously achieves both statistical efficiency and communication efficiency.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/li22b.html
https://proceedings.mlr.press/v178/li22b.htmlROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single AlgorithmWe study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of ROOT-SGD with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) ROOT-SGD converges asymptotically to a Gaussian limit with the Cramér-Rao optimal asymptotic covariance, for a broad range of step-size choices.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/li22a.html
https://proceedings.mlr.press/v178/li22a.htmlCorruption-Robust Contextual Search through Density UpdatesWe study the problem of contextual search in the adversarial noise model. Let $d$ be the dimension of the problem, $T$ be the time horizon and $C$ be the total amount of noise in the system. For the $\epsilon$-ball loss, we give a tight regret bound of $O(C + d \log(1/\epsilon))$, improving over the $O(d^3 \log(1/\epsilon) \log^2(T) + C \log(T) \log(1/\epsilon))$ bound of Krishnamurthy et al. (STOC’21). For the symmetric loss, we give an efficient algorithm with regret $O(C+d \log T)$. In terms of techniques, our algorithms are a departure from previous contextual search models in the sense that they keep track of density functions over the candidate vectors instead of a knowledge set consisting of the candidate vectors consistent with the feedback obtained.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/leme22a.html
https://proceedings.mlr.press/v178/leme22a.htmlMinimax Regret for Partial Monitoring: Infinite Outcomes and Rustichini’s RegretWe show that a version of the generalised information ratio of Lattimore and Gyorgy (2020) determines the asymptotic minimax regret for all finite-action partial monitoring games provided that (a) the standard definition of regret is used but the latent space where the adversary plays is potentially infinite; or (b) the regret introduced by Rustichini (1999) is used and the latent space is finite. Our results are complemented by a number of examples. For any p ∈ [1/2, 1] there exists an infinite partial monitoring game for which the minimax regret over n rounds is n^p up to subpolynomial factors and there exist finite games for which the minimax Rustichini regret is n^(4/7) up to subpolynomial factors.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/lattimore22a.html
https://proceedings.mlr.press/v178/lattimore22a.htmlAn Efficient Minimax Optimal Estimator For Multivariate Convex RegressionWe study the computational aspects of the task of multivariate convex regression in dimension $d \geq 5$. We present the first computationally efficient minimax optimal (up to logarithmic factors) estimators for the tasks of $L$-Lipschitz and $\Gamma$-bounded convex regression under polytopal support. This work is the first to show the existence of efficient minimax optimal estimators for non-Donsker classes whose corresponding Least Squares Estimators are provably minimax suboptimal. The proof of the correctness of these estimators uses a variety of tools from different disciplines, among them empirical process theory, stochastic geometry, and potential theory.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/kur22a.html
https://proceedings.mlr.press/v178/kur22a.htmlPrivate Robust Estimation by Stabilizing Convex RelaxationsWe give the first polynomial time and sample (epsilon, delta)-differentially private (DP) algorithm to estimate the mean, covariance and higher moments in the presence of a constant fraction of adversarial outliers. Our algorithm succeeds for families of distributions that satisfy two well-studied properties in prior works on robust estimation: certifiable subgaussianity of directional moments and certifiable hypercontractivity of degree 2 polynomials. Our recovery guarantees hold in the “right affine-invariant norms”: Mahalanobis distance for mean, multiplicative spectral and relative Frobenius distance guarantees for covariance and injective norms for higher moments. Prior works obtained private robust algorithms for mean estimation of subgaussian distributions with bounded covariance. For covariance estimation, ours is the first efficient algorithm (even in the absence of outliers) that succeeds without any condition-number assumptions. Our algorithms arise from a new framework that provides a general blueprint for modifying convex relaxations for robust estimation to satisfy strong worst-case stability guarantees in the appropriate parameter norms whenever the algorithms produce witnesses of correctness in their run. We verify such guarantees for a modification of standard sum-of-squares (SoS) semidefinite programming relaxations for robust estimation. Our privacy guarantees are obtained by combining stability guarantees with a new “estimate dependent” noise injection mechanism in which noise scales with the eigenvalues of the estimated covariance. We believe this framework will be useful more generally in obtaining DP counterparts of robust estimators. Independently of our work, Ashtiani and Liaw [AL21] also obtained a polynomial time and sample private robust estimation algorithm for Gaussian distributions.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/kothari22a.html
https://proceedings.mlr.press/v178/kothari22a.htmlSampling Approximately Low-Rank Ising Models: MCMC meets Variational MethodsWe consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when all eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/koehler22a.html
https://proceedings.mlr.press/v178/koehler22a.htmlRate of Convergence of Polynomial Networks to Gaussian ProcessesWe examine one-hidden-layer neural networks with random weights. It is well-known that in the limit of infinitely many neurons they simplify to Gaussian processes. For networks with a polynomial activation, we demonstrate that the rate of this convergence in 2-Wasserstein metric is O(1/sqrt(n)), where n is the number of hidden neurons. We suspect this rate is asymptotically sharp. We improve the known convergence rate for other activations, to power-law in n for ReLU and inverse-square-root up to logarithmic factors for erf. We explore the interplay between spherical harmonics, Stein kernels and optimal transport in the non-isotropic setting.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/klukowski22a.html
https://proceedings.mlr.press/v178/klukowski22a.htmlBig-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple ScalesWe provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square-root of the condition numbers of the components. This complexity bound (which we prove is nearly optimal) can improve almost exponentially on that of accelerated gradient methods, which grow as the square root of the condition number of $f$. Additionally, we provide efficient methods for solving stochastic, quadratic variants of this multiscale optimization problem. Rather than learn the decomposition of $f$ (which would be prohibitively expensive), our methods apply a clean recursive “Big-Step-Little-Step” interleaving of standard methods. The resulting algorithms use $\tilde{\mathcal{O}}(d m)$ space, are numerically stable, and open the door to a more fine-grained understanding of the complexity of convex optimization beyond condition number.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/kelner22a.html
https://proceedings.mlr.press/v178/kelner22a.htmlThe Dynamics of Riemannian Robbins-Monro AlgorithmsMany important learning algorithms, such as stochastic gradient methods, are often deployed to solve nonlinear problems on Riemannian manifolds. Motivated by these applications, we propose a family of Riemannian algorithms generalizing and extending the seminal stochastic approximation framework of Robbins and Monro (1951). Compared to their Euclidean counterparts, Riemannian iterative algorithms are much less understood due to the lack of a global linear structure on the manifold. We overcome this difficulty by introducing an extended Fermi coordinate frame which allows us to map the asymptotic behavior of the proposed Riemannian Robbins–Monro (RRM) class of algorithms to that of an associated deterministic dynamical system under very mild assumptions on the underlying manifold. In so doing, we provide a general template of almost sure convergence results that mirrors and extends the existing theory for Euclidean Robbins-Monro schemes, albeit with a significantly more involved analysis that requires a number of new geometric ingredients. We showcase the flexibility of the proposed RRM framework by using it to establish the convergence of a retraction-based analogue of the popular optimistic / extra-gradient methods for solving minimization problems and games, and we provide a unified treatment for their convergence.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/karimi22a.html
https://proceedings.mlr.press/v178/karimi22a.htmlThompson Sampling Achieves $\tilde{O}(\sqrt{T})$ Regret in Linear Quadratic ControlThompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution which is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde{O}(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control, TSAC, that attains $\tilde{O}(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that the TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of the proposed algorithm in several adaptive control tasks.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/kargin22a.html
https://proceedings.mlr.press/v178/kargin22a.htmlComputational-Statistical Gap in Reinforcement LearningReinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample efficient reinforcement learning: MDPs with optimal value function V* and Q* linear in some known low-dimensional features. In this setting, recent works have designed sample efficient algorithms which require a number of samples polynomial in the feature dimension and independent of the size of state space. They however leave finding computationally efficient algorithms as future work and this is considered a major open problem in the community. In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP=RP, no randomized polynomial time algorithm exists for deterministic transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, where we convert a CNF formula into an MDP with deterministic transitions, constant number of actions and low dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation, as the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, but no computationally efficient algorithm exists unless NP=RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/kane22a.html
https://proceedings.mlr.press/v178/kane22a.htmlA Private and Computationally-Efficient Estimator for Unbounded GaussiansWe give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $N(\mu,\Sigma)$ in $\R^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $N(0,\Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/kamath22a.html
https://proceedings.mlr.press/v178/kamath22a.htmlSharper Rates for Separable Minimax and Finite Sum Optimization via Primal-Dual Extragradient MethodsWe design accelerated algorithms with improved rates for several fundamental classes of optimization problems. Our algorithms all build upon techniques related to the analysis of primal-dual extragradient methods via relative Lipschitzness proposed recently by Cohen, Sidford, and Tian ’21. (1) We study separable minimax optimization problems of the form $\min_x \max_y f(x) - g(y) + h(x, y)$, where $f$ and $g$ have smoothness and strong convexity parameters $(L^x, \mu^x)$, $(L^y, \mu^y)$, and $h$ is convex-concave with a $(\Lambda^{xx}, \Lambda^{xy}, \Lambda^{yy})$-blockwise operator norm bounded Hessian. We provide an algorithm using $\tilde{O}(\sqrt{\frac{L^x}{\mu^x}} + \sqrt{\frac{L^y}{\mu^y}} + \frac{\Lambda^{xx}}{\mu^x} + \frac{\Lambda^{xy}}{\sqrt{\mu^x\mu^y}} + \frac{\Lambda^{yy}}{\mu^y})$ gradient queries. Notably, for convex-concave minimax problems with bilinear coupling (e.g. quadratics), where $\Lambda^{xx} = \Lambda^{yy} = 0$, our rate matches a lower bound of Zhang, Hong, and Zhang ’19. (2) We study finite sum optimization problems of the form $\min_x \frac 1 n \sum_{i \in [n]} f_i(x)$, where each $f_i$ is $L_i$-smooth and the overall problem is $\mu$-strongly convex. We provide an algorithm using $\tilde{O}(n + \sum_{i \in [n]} \sqrt{\frac{L_i}{n\mu}} )$ gradient queries. Notably, when the smoothness bounds $\{L_i\}_{i\in[n]}$ are non-uniform, our rate improves upon accelerated SVRG (Lin et al., Frostig et al. ’15) and Katyusha (Allen-Zhu ’17) by up to a $\sqrt{n}$ factor. (3) We generalize our algorithms for minimax and finite sum optimization to solve a natural family of minimax finite sum optimization problems at an accelerated rate, encapsulating both above results up to a logarithmic factor.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/jin22b.html
https://proceedings.mlr.press/v178/jin22b.htmlUnderstanding Riemannian Acceleration via a Proximal Extragradient FrameworkWe contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit “\emph{Accelerated Hybrid Proximal Extragradient}” (A-HPE), a powerful framework for obtaining Euclidean accelerated methods (Monteiro and Svaiter, 2013). Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/jin22a.html
https://proceedings.mlr.press/v178/jin22a.htmlInductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight NormWe provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an \emph{induced regularizer} in the function space as the minimum $\ell_2$ norm of weights of a network required to realize a function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network are single channeled, the induced regularizer for any $K$ is \emph{independent} of the number of output channels $C$. Furthermore, we derive the regularizer is a norm given by a semidefinite program (SDP). (b) In contrast, for multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias \emph{does} depend on $C$. However, for sufficiently large $C$, the induced regularizer is again given by an SDP that is independent of $C$. In particular, the induced regularizer for $K=1$ and $K=D$ (input dimension) are given in closed form as the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients of the linear predictor. We investigate the broader applicability of our theoretical results to implicit regularization from gradient descent on linear and ReLU networks through experiments on MNIST and CIFAR-10 datasets.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/jagadeesan22a.html
https://proceedings.mlr.press/v178/jagadeesan22a.htmlParameter-free Mirror DescentWe develop a modified online mirror descent framework that is suitable for building adaptive and parameter-free algorithms in unbounded domains. We leverage this technique to develop the first unconstrained online linear optimization algorithm achieving an optimal dynamic regret bound, and we further demonstrate that natural strategies based on Follow-the-Regularized-Leader are unable to achieve similar results. We also apply our mirror descent framework to build new parameter-free implicit updates, as well as a simplified and improved unconstrained scale-free algorithm.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/jacobsen22a.html
https://proceedings.mlr.press/v178/jacobsen22a.htmlAdversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret BoundsThis paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In stochastic settings, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: \Delta_i>0} \frac{\log T}{\Delta_i})$ for suboptimality gap $\Delta_i$ of arm $i$ and time horizon $T$. On the other hand, it is shown in Audibert et al. (2007) that the regret bound can be tightened to $O(\sum_{i: \Delta_i>0} (\frac{\sigma_i^2}{\Delta_i} + 1) \log T )$ using the loss variance $\sigma_i^2$ of each arm $i$ in stochastic environments. In this paper, we propose an algorithm based on the follow-the-regularized-leader method, which employs adaptive learning rates that depend on the empirical prediction error of the loss. This is the first BOBW algorithm with gap-variance-dependent bounds, showing that the variance information can be used even in a possibly adversarial environment. Further, the leading constant factor in our gap-variance-dependent bound is only (almost) twice the value for the lower bound. In addition, the proposed algorithm enjoys multiple data-dependent regret bounds in adversarial settings and works well in stochastic settings with adversarial corruptions. Table 1 summarizes the achievable bounds in comparison with UCB-V (Audibert et al., 2007), Tsallis-INF (Zimmert and Seldin, 2021) and LB-INF (Ito, 2021).Tue, 28 Jun 2022 00:00:00 +0000
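As a rough illustration of the two stochastic-setting bounds quoted in the abstract above, the following sketch evaluates both expressions with all hidden constants dropped. The function names, the example gaps and the example variances are assumptions made for this illustration only; they are not taken from the paper.

```python
import math


def gap_bound(gaps, horizon):
    """Sum of log(T) / Delta_i over arms with positive gap (constants dropped)."""
    return sum(math.log(horizon) / g for g in gaps if g > 0)


def gap_variance_bound(gaps, variances, horizon):
    """Sum of (sigma_i^2 / Delta_i + 1) * log(T) over arms with positive gap."""
    return sum(
        (v / g + 1.0) * math.log(horizon)
        for g, v in zip(gaps, variances)
        if g > 0
    )


# Hypothetical instance: arm 0 is optimal, the other arms have small gaps
# and low loss variance.
example_gaps = [0.0, 0.05, 0.1, 0.2]
example_variances = [0.01, 0.001, 0.001, 0.001]
example_horizon = 10_000

plain = gap_bound(example_gaps, example_horizon)
refined = gap_variance_bound(example_gaps, example_variances, example_horizon)
```

On this instance the variance-aware expression is much smaller, because each term $\sigma_i^2/\Delta_i + 1$ stays close to 1 while $1/\Delta_i$ is between 5 and 20.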
https://proceedings.mlr.press/v178/ito22a.html
https://proceedings.mlr.press/v178/ito22a.htmlTowards Optimal Algorithms for Multi-Player Bandits without Collision Sensing InformationWe propose a novel algorithm for multi-player multi-armed bandits without collision sensing information. Our algorithm circumvents two problems shared by all state-of-the-art algorithms: it does not need as an input a lower bound on the minimal expected reward of an arm, and its performance does not scale inversely proportionally to the minimal expected reward. We prove a theoretical regret upper bound to justify these claims. We complement our theoretical results with numerical experiments, showing that the proposed algorithm outperforms state-of-the-art in practice.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/huang22a.html
https://proceedings.mlr.press/v178/huang22a.htmlNear-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian MarginalsWe consider the well-studied problem of learning intersections of halfspaces under the Gaussian distribution in the challenging \emph{agnostic learning} model. Recent work of Diakonikolas et al. (2021) shows that any Statistical Query (SQ) algorithm for agnostically learning the class of intersections of $k$ halfspaces over $\mathbb{R}^n$ to constant excess error either must make queries of tolerance at most $n^{-\tilde{\Omega}(\sqrt{\log k})}$ or must make $2^{n^{\Omega(1)}}$ queries. We strengthen this result by improving the tolerance requirement to $n^{-\tilde{\Omega}(\log k)}$. This lower bound is essentially best possible since an SQ algorithm of Klivans et al. (2008) agnostically learns this class to any constant excess error using $n^{O(\log k)}$ queries of tolerance $n^{-O(\log k)}$. We prove two variants of our lower bound, each of which combines ingredients from Diakonikolas et al. (2021) with (an extension of) a different earlier approach for agnostic SQ lower bounds for the Boolean setting due to Dachman-Soled et al. (2014). Our approach also yields lower bounds for agnostically SQ learning the class of "convex subspace juntas" (studied by Vempala, 2010) and the class of sets with bounded Gaussian surface area; all of these lower bounds are nearly optimal since they essentially match known upper bounds from Klivans et al. (2008).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/hsu22a.html
https://proceedings.mlr.press/v178/hsu22a.htmlRealizable Learning is All You NeedThe equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust and private learning, it’s surprising we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions or general loss, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/hopkins22a.html
https://proceedings.mlr.press/v178/hopkins22a.htmlNear optimal efficient decoding from pooled dataConsider $n$ items, each of which is characterised by one of $d+1$ possible features in $\{0, \ldots, d\}$. We study the inference task of learning these types by queries on subsets, or pools, of the items that only reveal a form of coarsened information on the features - in our case, the sum of all the features in the pool. This is a realistic scenario in situations where one has memory or technical constraints in the data collection process, or where the data is subject to anonymisation. Related prominent problems are the quantitative group testing problem, of which it is a generalisation, as well as the compressed sensing problem, of which it is a special case. In the present article, we are interested in the minimum number of queries needed to efficiently infer the features, in the setting where the feature vector is chosen uniformly while fixing the frequencies, and one of the features, say $0$, is dominant in the sense that the number $k = n^{\theta}, \theta \in (0,1)$, of non-zero features among the items is much smaller than $n$. It is known that in this case, all features can be recovered in exponential time using no more than $O(k)$ queries. However, so far, all \emph{efficient} inference algorithms required at least $\Omega(k\ln n)$ queries, and it was unknown whether this gap is artificial or of a fundamental nature. Here we show that indeed, the previous gap between the information-theoretic and computational bounds is not inherent to the problem by providing an efficient algorithm that succeeds with high probability and employs no more than $O(k)$ measurements. This also solves a prominent open question for the quantitative group testing problem.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/hahn-klimroth22a.html
https://proceedings.mlr.press/v178/hahn-klimroth22a.htmlFaster online calibration without randomization: interval forecasts and the power of two choicesWe study the problem of making calibrated probabilistic forecasts for a binary sequence generated by an adversarial nature. Following the seminal paper of Foster and Vohra (1998), nature is often modeled as an adaptive adversary who sees all activity of the forecaster except the randomization that the forecaster may deploy. A number of papers have proposed randomized forecasting strategies that achieve an $\epsilon$-calibration error rate of $O(1/\sqrt{T})$, which we prove is tight in general. On the other hand, it is well known that it is not possible to be calibrated without randomization, or if nature also sees the forecaster’s randomization; in both cases the calibration error could be $\Omega(1)$. Inspired by the equally seminal works on the power of two choices and imprecise probability theory, we study a small variant of the standard online calibration problem. The adversary gives the forecaster the option of making two nearby probabilistic forecasts, or equivalently an interval forecast of small width, and the endpoint closest to the revealed outcome is used to judge calibration. This power of two choices, or imprecise forecast, accords the forecaster with significant power—we show that a faster $\epsilon$-calibration rate of $O(1/T)$ can be achieved even without deploying any randomization.Tue, 28 Jun 2022 00:00:00 +0000
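The scoring rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's forecasting strategy: `calibration_error` uses one common $\ell_1$-style definition of calibration error, which is an assumption here and may differ from the paper's exact definition, and `resolve_interval` is a hypothetical helper that picks the interval endpoint closest to the revealed outcome.

```python
from collections import defaultdict


def calibration_error(forecasts, outcomes):
    """Weighted average of |p - empirical frequency| over distinct forecasts p."""
    buckets = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        buckets[p].append(y)
    total = len(forecasts)
    return sum(
        len(ys) / total * abs(p - sum(ys) / len(ys))
        for p, ys in buckets.items()
    )


def resolve_interval(lo, hi, outcome):
    """Return the endpoint of [lo, hi] closest to the binary outcome."""
    return hi if abs(hi - outcome) < abs(lo - outcome) else lo


# Hypothetical binary sequence with empirical frequency 0.75.
example_outcomes = [1, 1, 0, 1]
well_calibrated = calibration_error([0.75] * 4, example_outcomes)
poorly_calibrated = calibration_error([0.5] * 4, example_outcomes)
```

A forecaster who always predicts the empirical frequency has zero error under this definition, while a constant forecast of 0.5 on the same sequence is off by 0.25.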
https://proceedings.mlr.press/v178/gupta22b.html
https://proceedings.mlr.press/v178/gupta22b.htmlSharp Constants in Uniformity Testing via the Huber StatisticUniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\eps$-far distribution with $1-\delta$ probability is $n = \Theta(\frac{\sqrt{m \log (1/\delta)}}{\eps^2} + \frac{\log (1/\delta)}{\eps^2})$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$. We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\eps^2}$ in the regime where this term is dominant, unlike all other existing testers.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/gupta22a.html
https://proceedings.mlr.press/v178/gupta22a.htmlOnline Learning to Transport via the Minimal Selection PrincipleMotivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/guo22a.html
https://proceedings.mlr.press/v178/guo22a.htmlHardness of Maximum Likelihood Learning of DPPsDeterminantal Point Processes (DPPs) are a widely used probabilistic model for negatively correlated sets. DPPs are used in Machine Learning applications to select a diverse, yet representative subset of data. In these applications, the parameters of the DPP need to be fit to match the data; typically, we seek a set of parameters that maximize the likelihood of the data. The algorithms used for this task either optimize over a limited family of DPPs, or else use local improvement heuristics that do not provide theoretical guarantees of optimality. It is natural to ask if there exist efficient algorithms for finding a maximum likelihood DPP model for a given data set. In seminal work on DPPs in Machine Learning, Kulesza conjectured in his PhD Thesis (2012) that the problem is NP-complete. In this work we prove Kulesza’s conjecture; moreover, we prove that even computing a $1-\frac{1}{\mathrm{poly} \log N}$-approximation to the maximum log-likelihood of a DPP on a set of $N$ items is NP-complete. At the same time, we also obtain the first polynomial-time algorithm obtaining a nontrivial worst-case approximation to the optimal likelihood: we present a polynomial-time $1/\log m$-approximation algorithm (for data sets of size $m$), which moreover obtains a $1-\frac{1}{\log N}$-approximation if all $N$ elements appear in a $O(1/N)$-fraction of the subsets. In terms of techniques, the hardness result reduces to solving a gap instance of a “vector coloring” problem on a hypergraph obtained from an adaptation of the constructions of Bogdanov, Obata and Trevisan (FOCS 2002), using the strong expanders of Alon and Capalbo (FOCS 2007).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/grigorescu22a.html
https://proceedings.mlr.press/v178/grigorescu22a.htmlPrivate Convex Optimization via Exponential MechanismIn this paper, we study private optimization problems for non-smooth convex functions $F(x)=\mathbb{E}_i f_i(x)$ on $\mathbb{R}^d$. We show that modifying the exponential mechanism by adding an $\ell_2^2$ regularizer to $F(x)$ and sampling from $\pi(x)\propto \exp(-k(F(x)+\mu\|x\|_2^2/2))$ recovers both the known optimal empirical risk and population loss under $(\eps,\delta)$-DP. Furthermore, we show how to implement this mechanism using $\widetilde{O}(n \min(d, n))$ queries to $f_i(x)$ where $n$ is the number of samples/users in the DP-SCO. We also give a (nearly) matching lower bound $\widetilde{\Omega}(n \min(d, n))$ on the number of evaluation queries. Our results utilize the following tools that are of independent interest: \begin{itemize} \item We prove Gaussian Differential Privacy (GDP) of the exponential mechanism if the loss function is strongly convex and the perturbation is Lipschitz. Our privacy bound is \emph{optimal} as it includes the privacy of the Gaussian mechanism as a special case. \item We show how to sample from $\exp(-F(x)-\mu \|x\|^2_2/2)$ for $G$-Lipschitz $F$ with $\eta$ error in TV distance using $\widetilde{O}((G^2/\mu) \log^2(d/\eta))$ unbiased queries to $F(x)$. This is the first sampler whose query complexity has \emph{polylogarithmic dependence} on both dimension $d$ and accuracy $\eta$. \end{itemize}Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/gopi22a.html
https://proceedings.mlr.press/v178/gopi22a.htmlLow-Degree MulticalibrationIntroduced as a notion of algorithmic fairness, multicalibration has proved to be a powerful and versatile concept with implications far beyond its original intent. This stringent notion—that predictions be well-calibrated across a rich class of intersecting subpopulations—provides its strong guarantees at a cost: the computational and sample complexity of learning multicalibrated predictors are high, and grow exponentially with the number of class labels. In contrast, the relaxed notion of multiaccuracy can be achieved more efficiently, yet many of the most desirable properties of multicalibration cannot be guaranteed assuming multiaccuracy alone. This tension raises a key question: \emph{Can we learn predictors with multicalibration-style guarantees at a cost commensurate with multiaccuracy?} In this work, we define and initiate the study of \emph{Low-Degree Multicalibration}. Low-Degree Multicalibration defines a hierarchy of increasingly-powerful multi-group fairness notions that spans multiaccuracy and the original formulation of multicalibration at the extremes. Our main technical contribution demonstrates that key properties of multicalibration, related to fairness and accuracy, actually manifest as low-degree properties. Importantly, we show that low-degree multicalibration can be significantly more efficient than full multicalibration. In the multi-class setting, the sample complexity to achieve low-degree multicalibration improves exponentially (in the number of classes) over full multicalibration. Our work presents compelling evidence that low-degree multicalibration represents a sweet spot, pairing computational and sample efficiency with strong fairness and accuracy guarantees.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/gopalan22a.html
https://proceedings.mlr.press/v178/gopalan22a.htmlCan Q-learning be Improved with Advice?Despite rapid progress in theoretical reinforcement learning (RL) over the last few years, most of the known guarantees are worst-case in nature, failing to take advantage of structure that may be known a priori about a given RL problem at hand. In this paper we address the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal Q-value function, is given to the algorithm. We show that when the predictions about the optimal Q-value function satisfy a reasonably weak condition we call distillation, then we can improve regret bounds by replacing the set of state-action pairs with the set of state-action pairs on which the predictions are grossly inaccurate. This improvement holds for both uniform regret bounds and gap-based ones. Further, we are able to achieve this property with an algorithm that achieves sublinear regret when given arbitrary predictions (i.e., even those which are not a distillation). Our work extends a recent line of work on algorithms with predictions, which has typically focused on simple online problems such as caching and scheduling, to the more complex and general problem of reinforcement learning.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/golowich22a.html
https://proceedings.mlr.press/v178/golowich22a.htmlExact Community Recovery in Correlated Stochastic Block ModelsWe consider the problem of learning latent community structure from multiple correlated networks. We study edge-correlated stochastic block models with two balanced communities, focusing on the regime where the average degree is logarithmic in the number of vertices. Our main result derives the precise information-theoretic threshold for exact community recovery using multiple correlated graphs. This threshold captures the interplay between the community recovery and graph matching tasks. In particular, we uncover and characterize a region of the parameter space where exact community recovery is possible using multiple correlated graphs, even though (1) this is information-theoretically impossible using a single graph and (2) exact graph matching is also information-theoretically impossible. In this regime, we develop a novel algorithm that carefully synthesizes algorithms from the community recovery and graph matching literatures.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/gaudio22a.html
https://proceedings.mlr.press/v178/gaudio22a.htmlNew Projection-free Algorithms for Online Convex Optimization with Adaptive Regret GuaranteesWe present new efficient \emph{projection-free} algorithms for online convex optimization (OCO), where by projection-free we refer to algorithms that avoid computing orthogonal projections onto the feasible set, and instead rely on different and potentially much more efficient oracles. While most state-of-the-art projection-free algorithms are based on the \emph{follow-the-leader} framework, our algorithms are fundamentally different and are based on the \emph{online gradient descent} algorithm with a novel and efficient approach to computing so-called \emph{infeasible projections}. As a consequence, we obtain the first projection-free algorithms which naturally yield \emph{adaptive regret} guarantees, i.e., regret bounds that hold w.r.t. any sub-interval of the sequence. Concretely, when assuming the availability of a linear optimization oracle (LOO) for the feasible set, on a sequence of length $T$, our algorithms guarantee $O(T^{3/4})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret, for the full-information and bandit settings, respectively, using only $O(T)$ calls to the LOO. These bounds match the current state-of-the-art regret bounds for LOO-based projection-free OCO, which are \emph{not adaptive}. We also consider a new natural setting in which the feasible set is accessible through a separation oracle. We present algorithms which, using overall $O(T)$ calls to the separation oracle, guarantee $O(\sqrt{T})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret for the full-information and bandit settings, respectively.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/garber22a.html
https://proceedings.mlr.press/v178/garber22a.htmlApproximate Cluster Recovery from Noisy LabelsDesigning algorithms for machine learning problems targeting beyond worst-case analysis and, in particular, analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic k-means clustering problem in the presence of noisy labels. In this problem, in addition to a set of points and parameter \(k\), we receive cluster labels of each point generated by either an adversarial or a random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the k-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor $1+o(1)$ of the optimum even when the label of each point is perturbed with a large probability (think 99%). In contrast, we show that the side-information with adversarial perturbations is as hard as the original problem even if only a small $\epsilon$ fraction of the labels are perturbed. We complement this negative result by giving a simple algorithm in the case when the adversary is only allowed to perturb an $\epsilon$ fraction of the labels per \emph{each cluster}.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/gamlath22a.html
https://proceedings.mlr.press/v178/gamlath22a.htmlEfficient decentralized multi-agent learning in asymmetric queuing systemsWe study decentralized multi-agent learning in bipartite queuing systems, a standard model for service systems. In particular, N agents request service from K servers in a fully decentralized way, i.e., by running the same algorithm without communication. Previous decentralized algorithms are restricted to symmetric systems, have performance that degrades exponentially in the number of servers, require communication through shared randomness and unique agent identities, and are computationally demanding. In contrast, we provide a simple learning algorithm that, when run decentrally by each agent, leads the queuing system to have efficient performance in general asymmetric bipartite queuing systems while also having additional robustness properties. Along the way, we provide the first UCB-based algorithm for the centralized case of the problem, which resolves an open question by Krishnasamy et al.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/freund22a.html
https://proceedings.mlr.press/v178/freund22a.htmlBenign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear DataBenign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error. In contrast to previous work on benign overfitting that requires linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/frei22a.html
https://proceedings.mlr.press/v178/frei22a.htmlOffline Reinforcement Learning: Fundamental Barriers for Value Function ApproximationWe consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data. Offline RL—particularly when coupled with (value) function approximation to allow for generalization in large or continuous state spaces—is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all $Q$-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits of offline RL are not well understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving that in general, even if both concentrability and realizability are satisfied, any algorithm requires sample complexity either polynomial in the size of the state space or exponential in other parameters to learn a non-trivial policy. Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning, and highlight a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods. 
A consequence of our results for reinforcement learning with linear function approximation is that the separation between online and offline RL can be arbitrarily large, even in constant dimension.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/foster22a.html
https://proceedings.mlr.press/v178/foster22a.htmlThe Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine VarianceWe study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/faw22a.html
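The AdaGrad-Norm update discussed in the entry above can be illustrated with a minimal sketch. The function names, the toy quadratic objective, and the noise model below are illustrative assumptions, not taken from the paper (which treats non-convex smooth objectives); the only part meant to reflect the method is the scalar step size $\eta / b_t$ with $b_t^2 = b_{t-1}^2 + \|g_t\|^2$.

```python
import math
import random


def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=1000):
    """AdaGrad-Norm: a single scalar step size shared by all coordinates.

    grad: callable returning a (possibly stochastic) gradient as a list.
    eta, b0: hypothetical default values chosen for this toy example only.
    """
    x = list(x0)
    b_sq = b0 ** 2
    for _ in range(steps):
        g = grad(x)
        # Accumulate the squared norm of every gradient seen so far.
        b_sq += sum(gi * gi for gi in g)
        step = eta / math.sqrt(b_sq)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def noisy_quadratic_grad(x, rng, sigma=0.1):
    """Gradient of f(x) = 0.5 * ||x||^2 plus noise that grows with ||grad f||.

    This is a toy stand-in for an affine-variance noise model.
    """
    norm = math.sqrt(sum(xi * xi for xi in x))
    scale = sigma * (1.0 + norm)
    return [xi + rng.gauss(0.0, scale) for xi in x]


rng = random.Random(0)
x_final = adagrad_norm(lambda x: noisy_quadratic_grad(x, rng), [5.0, -3.0])
```

Note that no learning-rate schedule is supplied: the step size shrinks on its own as gradient norms accumulate, which is the self-tuning behaviour the abstract refers to.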
https://proceedings.mlr.press/v178/faw22a.htmlHow catastrophic can catastrophic forgetting be in linear regression?To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas – alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when $T$ tasks in $d$ dimensions are presented cyclically for $k$ iterations, we prove an upper bound of $T^2\min\{1/\sqrt{k},d/k\}$ on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the $T^2$ factor can be lifted when tasks are presented in a random ordering.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/evron22a.html
https://proceedings.mlr.press/v178/evron22a.htmlSample-Efficient Reinforcement Learning in the Presence of Exogenous InformationIn real-world reinforcement learning applications the learner’s observation space is ubiquitously high-dimensional with both relevant and irrelevant information about the task at hand. Learning from high-dimensional observations has been the subject of extensive investigation in supervised learning and statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning, the Exogenous Markov Decision Process (ExMDP), in which the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner’s actions, but evolves in an arbitrary, temporally correlated fashion. We provide a new algorithm, OSSR, which learns a near-optimal policy with sample complexity polynomial in the size of the endogenous component and nearly independent of the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation going forward.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/efroni22a.html
https://proceedings.mlr.press/v178/efroni22a.htmlDepth and Feature Learning are Provably Beneficial for Neural Network DiscriminatorsWe construct pairs of distributions $\mu_d, \nu_d$ on $\mathbb{R}^d$ such that the quantity $|\mathbb{E}_{x \sim \mu_d} [F(x)] - \mathbb{E}_{x \sim \nu_d} [F(x)]|$ decreases as $\Omega(1/d^2)$ for some three-layer ReLU network $F$ with polynomial width and weights, while declining exponentially in $d$ if $F$ is any two-layer network with polynomial weights. This shows that deep GAN discriminators are able to distinguish distributions that shallow discriminators cannot. Analogously, we build pairs of distributions $\mu_d, \nu_d$ on $\mathbb{R}^d$ such that $|\mathbb{E}_{x \sim \mu_d} [F(x)] - \mathbb{E}_{x \sim \nu_d} [F(x)]|$ decreases as $\Omega(1/(d\log d))$ for two-layer ReLU networks with polynomial weights, while declining exponentially for bounded-norm functions in the associated RKHS. This confirms that feature learning is beneficial for discriminators. Our bounds are based on Fourier transforms.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/domingo-enrich22a.html
https://proceedings.mlr.press/v178/domingo-enrich22a.htmlFast algorithm for overcomplete order-3 tensor decompositionWe develop the first fast spectral algorithm to decompose a random third-order tensor over $\mathbb{R}^d$ of rank up to $O(d^{3/2}/\mathrm{polylog}(d))$. Our algorithm only involves simple linear algebra operations and can recover all components in time $O(d^{6.05})$ under the current matrix multiplication time. Prior to this work, comparable guarantees could only be achieved via sum-of-squares [Ma, Shi, Steurer 2016]. In contrast, fast algorithms [Hopkins, Schramm, Shi, Steurer 2016] could only decompose tensors of rank at most $O(d^{4/3}/\mathrm{polylog}(d))$. Our algorithmic result rests on two key ingredients: a clean lifting of the third-order tensor to a sixth-order tensor, which can be expressed in the language of tensor networks, and a careful decomposition of the tensor network into a sequence of rectangular matrix multiplications, which allows us to have a fast implementation of the algorithm.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/ding22a.html
https://proceedings.mlr.press/v178/ding22a.htmlRobust Sparse Mean Estimation via Sum of SquaresWe study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb{R}^d$ with ‘certifiably bounded’ $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/diakonikolas22e.html
https://proceedings.mlr.press/v178/diakonikolas22e.htmlNon-Gaussian Component Analysis via Lattice Basis ReductionNon-Gaussian Component Analysis (NGCA) is the following distribution learning problem: Given i.i.d. samples from a distribution on $\R^d$ that is non-Gaussian in a hidden direction $v$ and an independent standard Gaussian in the orthogonal directions, the goal is to approximate the hidden direction $v$. Prior work \citep{DKS17-sq} provided formal evidence for the existence of an information-computation tradeoff for NGCA under appropriate moment-matching conditions on the univariate non-Gaussian distribution $A$. The latter result does not apply when the distribution $A$ is discrete. A natural question is whether information-computation tradeoffs persist in this setting. In this paper, we answer this question in the negative by obtaining a sample and computationally efficient algorithm for NGCA in the regime that $A$ is discrete or nearly discrete, in a well-defined technical sense. The key tool leveraged in our algorithm is the LLL method \citep{LLL82} for lattice basis reduction.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/diakonikolas22d.html
https://proceedings.mlr.press/v178/diakonikolas22d.htmlLearning a Single Neuron with Adversarial Label Noise via Gradient DescentWe study the fundamental problem of learning a single neuron, i.e., a function of the form $\x \mapsto \sigma(\vec w \cdot \x)$ for monotone activations $\sigma:\R \mapsto \R$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\x{}, y) \in \R^d \times \R$ such that there exists $\vec w^\ast \in \R^d$ achieving $F(\vec w^\ast) = \opt$, where $F(\vec w) = \E_{(\x{},y) \sim D}[(\sigma(\vec w\cdot \x) - y)^2]$. The goal of the learner is to output a hypothesis vector $\wt{\vec w}$ such that $F(\wt{\vec w}) = C \, \opt+\eps$ with high probability, where $C$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: (1) For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant factor approximation, even under the Gaussian distribution. Moreover, our algorithm has sample complexity $\wt{O}(d/\eps)$, which is tight within polylogarithmic factors. (2) For the ReLU activation, i.e., $\sigma(t) = \max(0,t)$, we give an efficient algorithm with sample complexity $\wt{O}(d \, \polylog(1/\eps))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\Omega(d/\eps)$. In both settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss.
The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/diakonikolas22c.html
https://proceedings.mlr.press/v178/diakonikolas22c.htmlNear-Optimal Statistical Query Hardness of Learning Halfspaces with Massart NoiseWe study the problem of PAC learning halfspaces with Massart noise. Given labeled samples $(x, y)$ from a distribution $D$ on $\R^{d} \times \{ \pm 1\}$ such that the marginal $D_x$ on the examples is arbitrary and the label $y$ of example $x$ is generated from the target halfspace corrupted by a Massart adversary with flipping probability $\eta(x) \leq \eta \leq 1/2$, the goal is to compute a hypothesis with small misclassification error. The best known $\poly(d, 1/\eps)$-time algorithms for this problem achieve error of $\eta+\eps$, which can be far from the optimal bound of $\opt+\eps$, where $\opt = \E_{x \sim D_x} [\eta(x)]$. While it is known that achieving $\opt+o(1)$ error requires super-polynomial time in the Statistical Query model, a large gap remains between known upper and lower bounds. In this work, we essentially characterize the efficient learnability of Massart halfspaces in the Statistical Query (SQ) model. Specifically, we show that no efficient SQ algorithm for learning Massart halfspaces on $\R^d$ can achieve error better than $\Omega(\eta)$, even if $\opt = 2^{-\log^{c} (d)}$, for any universal constant $c \in (0, 1)$. Furthermore, when the noise upper bound $\eta$ is close to $1/2$, our error lower bound becomes $\eta - o_{\eta}(1)$, where the $o_{\eta}(1)$ term goes to $0$ when $\eta$ approaches $1/2$. Our results provide strong evidence that known learning algorithms for Massart halfspaces are nearly best possible.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/diakonikolas22b.html
https://proceedings.mlr.press/v178/diakonikolas22b.htmlOptimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising ModelsWe establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\eps$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\eps \sqrt{\log(1/\eps)})$. Similarly, we show that no efficient SQ algorithm with access to an $\eps$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\eps \log(1/\eps))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low-dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/diakonikolas22a.html
https://proceedings.mlr.press/v178/diakonikolas22a.htmlNeural Networks can Learn Representations with Gradient DescentSignificant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U: \R^d \to \R^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n\asymp d^2$ and enables transfer learning with sample complexity independent of $d$.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/damian22a.html
https://proceedings.mlr.press/v178/damian22a.htmlEM’s Convergence in Gaussian Latent Tree ModelsWe study the optimization landscape of the log-likelihood function and the convergence of the Expectation-Maximization (EM) algorithm in latent Gaussian tree models, i.e. tree-structured Gaussian graphical models whose leaf nodes are observable and non-leaf nodes are unobservable. We show that the unique non-trivial stationary point of the population log-likelihood is its global maximum, and establish that the expectation-maximization algorithm is guaranteed to converge to it in the single latent variable case. Our results for the landscape of the log-likelihood function in general latent tree models provide support for the extensive practical use of maximum-likelihood-based methods in this setting. Our results for the expectation-maximization algorithm extend an emerging line of work on obtaining global convergence guarantees for this celebrated algorithm. We show our results for the non-trivial stationary points of the log-likelihood by arguing that a certain system of polynomial equations obtained from the EM updates has a unique non-trivial solution. The global convergence of the EM algorithm follows by arguing that all trivial fixed points are higher-order saddle points.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/dagan22b.html
https://proceedings.mlr.press/v178/dagan22b.htmlA bounded-noise mechanism for differential privacyWe present an asymptotically optimal $(\epsilon,\delta)$ differentially private mechanism for answering multiple, adaptively asked, $\Delta$-sensitive queries, settling the conjecture of Steinke and Ullman [2020]. Our algorithm has a significant advantage that it adds independent bounded noise to each query, thus providing an absolute error bound. Additionally, we apply our algorithm in adaptive data analysis, obtaining an improved guarantee for answering multiple queries regarding some underlying distribution using a finite sample. Numerical computations show that the bounded-noise mechanism outperforms the Gaussian mechanism in many standard settings.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/dagan22a.html
https://proceedings.mlr.press/v178/dagan22a.htmlAssemblies of neurons learn to classify well-separated distributionsAn assembly is a large population of neurons whose synchronous firing represents a memory, concept, word, or other cognitive category. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the \emph{Assembly Calculus} (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating arbitrary space-bounded computation, but also of simulating complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate {\em learning} has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated: for example, when they are clusters of similar assemblies, or more generally separable with margin by a linear threshold function. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit.
Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision — all key attributes of learning in a model of the brain. We argue that this learning mechanism, supported by separate sensory pre-processing mechanisms for extracting attributes, such as edges or phonemes, from real world data, can be the basis of biological learning in cortex.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/dabagia22a.html
https://proceedings.mlr.press/v178/dabagia22a.htmlNegative curvature obstructs acceleration for strongly geodesically convex optimization, even with exact first-order oraclesHamilton and Moitra (2021) showed that, in certain regimes, it is not possible to accelerate Riemannian gradient descent in the hyperbolic plane if we restrict ourselves to algorithms which make queries in a (large) bounded domain and which receive gradients and function values corrupted by a (small) amount of noise. We show that acceleration remains unachievable for any deterministic algorithm which receives exact gradient and function-value information (unbounded queries, no noise). Our results hold for a large class of Hadamard manifolds including hyperbolic spaces and the symmetric space $\mathrm{SL}(n) / \mathrm{SO}(n)$ of positive definite $n \times n$ matrices of determinant one. This cements a surprising gap between the complexity of convex optimization and geodesically convex optimization: for hyperbolic spaces, Riemannian gradient descent is optimal on the class of smooth and strongly geodesically convex functions (in the regime where the condition number scales with the radius of the optimization domain). The key idea for proving the lower bound consists of perturbing squared distance functions with sums of bump functions chosen by a resisting oracle.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/criscitiello22a.html
https://proceedings.mlr.press/v178/criscitiello22a.htmlStatistical and Computational Phase Transitions in Group TestingWe study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease within a population of size n, based on the outcomes of pooled tests which return positive whenever there is at least one infected individual in the tested group. We consider two different simple random procedures for assigning individuals to tests: the constant-column design and Bernoulli design. Our first set of results concerns the fundamental statistical limits. For the constant-column design, we give a new information-theoretic lower bound which implies that the proportion of correctly identifiable infected individuals undergoes a sharp “all-or-nothing” phase transition when the number of tests crosses a particular threshold. For the Bernoulli design, we determine the precise number of tests required to solve the associated detection problem (where the goal is to distinguish between a group testing instance and pure noise), improving both the upper and lower bounds of Truong, Aldridge, and Scarlett (2020). For both group testing models, we also study the power of computationally efficient (polynomial-time) inference procedures. We determine the precise number of tests required for the class of low-degree polynomial algorithms to solve the detection problem. This provides evidence for an inherent computational-statistical gap in both the detection and recovery problems at small sparsity levels. Notably, our evidence is contrary to that of Iliopoulos and Zadik (2021), who predicted the absence of a computational-statistical gap in the Bernoulli design.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/coja-oghlan22a.html
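The Bernoulli design and the pooled-test rule described in the group testing entry above can be made concrete with a small simulation. This is only a sketch: the function names and parameters are hypothetical, and the elimination step at the end is a standard textbook observation (anyone in a negative test is healthy), not the inference procedure analysed in the paper.

```python
import random


def bernoulli_design(n, num_tests, q, rng):
    """Bernoulli design: each individual joins each test independently w.p. q."""
    return [[i for i in range(n) if rng.random() < q] for _ in range(num_tests)]


def run_tests(pools, infected):
    """A pooled test is positive iff it contains at least one infected individual."""
    return [any(i in infected for i in pool) for pool in pools]


def definitely_healthy(pools, outcomes):
    """Individuals that appear in at least one negative test cannot be infected."""
    cleared = set()
    for pool, positive in zip(pools, outcomes):
        if not positive:
            cleared.update(pool)
    return cleared


rng = random.Random(0)
pools = bernoulli_design(n=100, num_tests=40, q=0.05, rng=rng)
infected = {4, 27, 63}  # k = 3 infected individuals, chosen arbitrarily
outcomes = run_tests(pools, infected)
cleared = definitely_healthy(pools, outcomes)
```

The detection problem mentioned in the abstract asks whether `outcomes` came from such an instance or from pure noise; the recovery problem asks for `infected` itself.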
https://proceedings.mlr.press/v178/coja-oghlan22a.htmlLearning with metric lossesWe propose a practical algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is “bounded in expectation” (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnostic setting. Our technique is based on metric medoids (a variant of Fréchet means) and presents a significant departure from existing methods, which, as we demonstrate, fail to achieve Bayes-consistency on general instance- and label-space metrics. Our proofs introduce the technique of {\em semi-stable compression}, which may be of independent interest.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/cohen22a.html
https://proceedings.mlr.press/v178/cohen22a.htmlCommunity Recovery in the Degree-Heterogeneous Stochastic Block ModelWe consider the problem of recovering communities in a random directed graph with planted communities. To model real-world directed graphs such as the Twitter or Instagram graphs that exhibit very heterogeneous degree sequences, we introduce the Degree-Heterogeneous Stochastic Block Model (DHSBM), a generalization of the classic Stochastic Block Model (SBM), where the vertex set is partitioned into communities and each vertex $u$ has two (unknown) associated probabilities, $p_u$ and $q_u$, $p_u > q_u$. An arc from $u$ to $v$ is generated with probability $p_u$ if $u$ and $v$ are in the same community and with probability $q_u$ otherwise. Given a graph generated from this model, the goal is to retrieve the communities. The DHSBM makes it possible to generate graphs with planted communities and heterogeneous degree distributions, an important feature of real-world networks. In the case where there are two communities, we present an iterative greedy linear-time algorithm that recovers them whenever $\min_u \frac{p_u - q_u}{\sqrt{p_u}} = \Omega(\sqrt{\log (n)/n})$. We also show that, up to a constant, this condition is necessary. Our results also extend to the standard (undirected) SBM, where $p_u = p$ and $q_u= q$ for all nodes $u$. Our algorithm is the first linear-time algorithm that exactly recovers the communities at the asymptotic information-theoretic threshold, improving over previous near-linear time spectral approaches.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/cohen-addad22a.html
https://proceedings.mlr.press/v178/cohen-addad22a.htmlChained generalisation boundsThis work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/clerico22a.html
https://proceedings.mlr.press/v178/clerico22a.htmlChasing Convex Bodies and Functions with Black-Box AdviceWe consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as \emph{consistency}, while also ensuring worst-case \emph{robustness} even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem’s convexity. The first, $\textsc{Interp}$, achieves $(\sqrt{2}+\epsilon)$-consistency and $\mathcal{O}(\frac{C}{\epsilon^2})$-robustness for any $\epsilon > 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, $\textsc{BdInterp}$, achieves $(1+\epsilon)$-consistency and $\mathcal{O}(\frac{CD}{\epsilon})$-robustness when the problem has bounded diameter $D$. Further, we show that $\textsc{BdInterp}$ achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $\alpha$-polyhedral.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/christianson22a.html
https://proceedings.mlr.press/v178/christianson22a.htmlThe query complexity of sampling from strongly log-concave distributions in one dimensionWe establish the first tight lower bound of $\Omega(\log\log\kappa)$ on the query complexity of sampling from the class of strongly log-concave and log-smooth distributions with condition number $\kappa$ in one dimension. Whereas existing guarantees for MCMC-based algorithms scale polynomially in $\kappa$, we introduce a novel algorithm based on rejection sampling that closes this doubly exponential gap.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/chewi22b.html
https://proceedings.mlr.press/v178/chewi22b.htmlAnalysis of Langevin Monte Carlo from Poincare to Log-SobolevClassically, the continuous-time Langevin diffusion converges exponentially fast to its stationary distribution $\pi$ under the sole assumption that $\pi$ satisfies a Poincaré inequality. Using this fact to provide guarantees for the discrete-time Langevin Monte Carlo (LMC) algorithm, however, is considerably more challenging due to the need for working with chi-squared or Rényi divergences, and prior works have largely focused on strongly log-concave targets. In this work, we provide the first convergence guarantees for LMC assuming that $\pi$ satisfies either a Latała–Oleszkiewicz or modified log-Sobolev inequality, which interpolates between the Poincaré and log-Sobolev settings. Unlike prior works, our results allow for weak smoothness and do not require convexity or dissipativity conditions.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/chewi22a.html
https://proceedings.mlr.press/v178/chewi22a.htmlOptimal Mean Estimation without a VarianceWe study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample $\bm{X} = \{X_i\}_{i = 1}^n$ from a distribution $\mc{D}$ over $\mb{R}^d$ with mean $\mu$ which satisfies the following \emph{weak-moment} assumption for some ${\alpha \in [0, 1]}$: \begin{equation*} \forall \norm{v} = 1: \mb{E}_{X \ts \mc{D}}[\abs{\inp{X - \mu}{v}}^{1 + \alpha}] \leq 1, \end{equation*} and given a target failure probability, $\delta$, our goal is to design an estimator which attains the smallest possible confidence interval as a function of $n,d,\delta$. For the specific case of $\alpha = 1$, foundational work of Lugosi and Mendelson exhibits an estimator achieving \emph{optimal} subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general $\alpha$, and provide a precise characterization of the optimal achievable confidence interval by establishing the following information-theoretic lower bound: \begin{equation*} \Omega \lprp{\sqrt{\frac{d}{n}} + \lprp{\frac{d}{n}}^{\frac{\alpha}{(1 + \alpha)}} + \lprp{\frac{\log 1 / \delta}{n}}^{\frac{\alpha}{(1 + \alpha)}}}. \end{equation*} and devising an estimator matching the aforementioned lower bound up to constants. Moreover, our estimator is computationally efficient.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/cherapanamjeri22a.html
https://proceedings.mlr.press/v178/cherapanamjeri22a.htmlMemorize to generalize: on the necessity of interpolation in high dimensional linear regressionWe examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \real^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/cheng22a.html
https://proceedings.mlr.press/v178/cheng22a.htmlOn the well-spread property and its relation to linear regressionWe consider the robust linear regression model $\bm{y} = X\beta^* + \bm{\eta}$, where an adversary oblivious to the design $X \in \R^{n \times d}$ may choose $\bm{\eta}$ to corrupt all but a (possibly vanishing) fraction of the observations $\bm{y}$ in an arbitrary way. Recent work \cite{d2021consistent, d2021consistentICML} has introduced efficient algorithms for consistent recovery of the parameter vector. These algorithms crucially rely on the design matrix being well-spread (a matrix is well-spread if its column span is far from any sparse vector). In this paper, we show that there exists a family of design matrices lacking well-spreadness such that consistent recovery of the parameter vector in the above robust linear regression model is information-theoretically impossible. We further investigate the average-case time complexity of certifying well-spreadness of random matrices. We show that it is possible to efficiently certify whether a given $n$-by-$d$ Gaussian matrix is well-spread if the number of observations is quadratic in the ambient dimension. We complement this result by showing rigorous evidence —in the form of a lower bound against low-degree polynomials— of the computational hardness of this same certification problem when the number of observations is $o(d^2)$.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/chen22d.html
https://proceedings.mlr.press/v178/chen22d.htmlImproved analysis for a proximal algorithm for samplingWe study the proximal sampler of Lee, Shen, and Tian (2021) and obtain new convergence guarantees under weaker assumptions than strong log-concavity: namely, our results hold for (1) weakly log-concave targets, and (2) targets satisfying isoperimetric assumptions which allow for non-log-concavity. We demonstrate our results by obtaining new state-of-the-art sampling guarantees for several classes of target distributions. We also strengthen the connection between the proximal sampler and the proximal method in optimization by interpreting the former as an entropically regularized Wasserstein gradient flow and the latter as the limit of one.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/chen22c.html
https://proceedings.mlr.press/v178/chen22c.htmlToward Instance-Optimal State Certification With Incoherent MeasurementsWe revisit the basic problem of quantum state certification: given copies of unknown mixed state ρ∈ℂ^{d×d} and the description of a mixed state σ, decide whether σ=ρ or ‖σ−ρ‖_𝗍𝗋 ≥ ϵ. When σ is maximally mixed, this is mixedness testing, and it is known that Ω(d^{Θ(1)}/ϵ^2) copies are necessary, where the exact exponent depends on the type of measurements the learner can make [OW15, BCL20], and in many of these settings there is a matching upper bound [OW15, BOW19, BCL20]. Can one avoid this d^{Θ(1)} dependence for certain kinds of mixed states σ, e.g. ones which are approximately low rank? More ambitiously, does there exist a simple functional f : ℂ^{d×d} → ℝ_{≥0} for which one can show that Θ(f(σ)/ϵ^2) copies are necessary and sufficient for state certification with respect to any σ? Such instance-optimal bounds are known in the context of classical distribution testing, e.g. [VV17]. Here we give the first bounds of this nature for the quantum setting, showing (up to log factors) that the copy complexity for state certification using nonadaptive incoherent measurements is essentially given by the copy complexity for mixedness testing times the fidelity between σ and the maximally mixed state. Surprisingly, our bound differs substantially from instance optimal bounds for the classical problem, demonstrating a qualitative difference between the two settings.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/chen22b.html
https://proceedings.mlr.press/v178/chen22b.htmlPolicy Optimization for Stochastic Shortest PathPolicy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees. In this work, we initiate the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications. We consider a wide range of settings, including stochastic and adversarial environments under full information or bandit feedback, and propose a policy optimization algorithm for each setting that makes use of novel correction terms and/or variants of dilated bonuses (Luo et al., 2021). For most settings, our algorithm is shown to achieve a near-optimal regret bound. One key technical contribution of this work is a new approximation scheme to tackle SSP problems that we call stacked discounted approximation and use in all our proposed algorithms. Unlike the finite-horizon approximation that is heavily used in recent SSP algorithms, our new approximation enables us to learn a near-stationary policy with only logarithmic changes during an episode and could lead to an exponential improvement in space complexity.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/chen22a.html
https://proceedings.mlr.press/v178/chen22a.htmlEfficient Online Linear Control with Stochastic Convex Costs and Unknown DynamicsWe consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate against the best stabilizing linear controller. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/cassel22a.html
https://proceedings.mlr.press/v178/cassel22a.htmlMaking SGD Parameter-FreeWe develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/carmon22a.html
https://proceedings.mlr.press/v178/carmon22a.htmlThe Price of Tolerance in Distribution TestingWe revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, …, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be \[\tilde \Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max \left\{\frac{\varepsilon_1}{\varepsilon_2^2},\left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{\!\!2}\right\}\right),\] providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/canonne22a.html
https://proceedings.mlr.press/v178/canonne22a.htmlStrong Gaussian Approximation for the Sum of Random VectorsThis paper derives a new strong Gaussian approximation bound for the sum of independent random vectors. The approach relies on the optimal transport theory and yields explicit dependence on the dimension size p and the sample size n. This dependence establishes a new fundamental limit for all practical applications of statistical learning theory. Particularly, based on this bound, we prove approximation in distribution for the maximum norm in a high-dimensional setting (p > n).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/buzun22a.html
https://proceedings.mlr.press/v178/buzun22a.htmlKernel interpolation in Sobolev spaces is not consistent in low dimensionsWe consider kernel ridgeless ridge regression with kernels whose associated RKHS is a Sobolev space $H^s$. We show for $d/2<s<3d/4$ that interpolation is not consistent in fixed dimension extending earlier results for the Laplace kernel in odd dimensions and underlining again that benign overfitting is rare in low dimensions. The proof proceeds by deriving sharp bounds on the spectrum of random kernel matrices using results from the theory of radial basis functions which might be of independent interest.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/buchholz22a.html
https://proceedings.mlr.press/v178/buchholz22a.htmlStrong Memory Lower Bounds for Learning Natural ModelsWe give lower bounds on the amount of memory required by one-pass streaming algorithms for solving several natural learning problems. In a setting where examples lie in $\{0,1\}^d$ and the optimal classifier can be encoded using $\kappa$ bits, we show that algorithms which learn to constant error using a near-minimal number of examples, $\tilde O(\kappa)$, must use $\tilde \Omega( d\kappa)$ bits of space. Our space bounds match the dimension of the ambient space of the problem’s natural parametrization, even when it is quadratic in the size of examples and the final classifier. For instance, in the setting of $d$-sparse linear classifiers over degree-2 polynomial features, for which $\kappa=\Theta(d\log d)$, our space lower bound is $\tilde\Omega(d^2)$. Our bounds degrade gracefully with the stream length $N$, generally having the form $\tilde\Omega(d\kappa \cdot \frac{\kappa}{N})$. Bounds of the form $\Omega(d\kappa)$ were known for learning parity and other problems defined over finite fields. Bounds that apply in a narrow range of sample sizes are also known for linear regression. Ours are the first such bounds for problems of the type commonly seen in recent learning applications that apply for a large range of input sizes.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/brown22a.html
https://proceedings.mlr.press/v178/brown22a.htmlThe Query Complexity of Local Search and Brouwer in RoundsWe consider the query complexity of finding a local minimum of a function defined on a graph, where at most $k$ rounds of interaction (aka adaptivity) with the oracle are allowed. Adaptivity is a fundamental concept studied due to the need to parallelize computation and understand the speedups attainable. The query complexity of local search is tightly related to the complexity of computing stationary points of a function, thus bounds for local search can give insights into the performance of algorithms such as gradient descent. We focus on the $d$-dimensional grid $\{1, 2, \ldots, n \}^d$, where the dimension $d \geq 2$ is a constant. Our main contribution is to give algorithms and lower bounds that characterize the trade-off between the number of rounds of adaptivity and the query complexity of local search, when the number of rounds is constant and polynomial in $n$, respectively. The local search analysis also enables us to characterize the query complexity of computing a Brouwer fixed point in rounds. Our proof technique for lower bounding the query complexity in rounds may be of independent interest as an alternative to the classical relational adversary method of Aaronson from the fully adaptive setting.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/branzei22a.html
https://proceedings.mlr.press/v178/branzei22a.htmlMonotone LearningThe amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm $A$ can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to $A$. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye, Gyorfi, and Lugosi (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our general transformation readily implies monotone learners in a variety of contexts: for example, Pestov’s result follows by applying it to \emph{any} Bayes-consistent algorithm (e.g., $k$-Nearest-Neighbours). In fact, our transformation extends Pestov’s result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov’s work, which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. 
This resolves questions asked by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/bousquet22a.html
https://proceedings.mlr.press/v178/bousquet22a.htmlTrace norm regularization for multi-task learning with scarce dataMulti-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/boursier22a.html
https://proceedings.mlr.press/v178/boursier22a.htmlGardner formula for Ising perceptron models at small densitiesWe consider the Ising perceptron model with N spins and M = N*alpha patterns, with a general activation function U that is bounded above. For U bounded away from zero, or U a one-sided threshold function, it was shown by Talagrand (2000, 2011) that for small densities alpha, the free energy of the model converges in the large-N limit to the replica symmetric formula conjectured in the physics literature (Krauth–Mezard 1989, see also Gardner–Derrida 1988). We give a new proof of this result, which covers the more general class of all functions U that are bounded above and satisfy a certain variance bound. The proof uses the (first and second) moment method conditional on the approximate message passing iterates of the model. In order to deduce our main theorem, we also prove a new concentration result for the perceptron model in the case where U is not bounded away from zero.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/bolthausen22a.html
https://proceedings.mlr.press/v178/bolthausen22a.htmlSmoothed Online Learning is as Easy as Statistical LearningMuch of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. Existing results for the smooth setting were known only for binary-valued function classes and were computationally expensive in general; in this paper, we fill these lacunae. In particular, we provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. 
Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/block22a.html
https://proceedings.mlr.press/v178/block22a.htmlUniversal Online Learning: an Optimistically Universal Learning RuleWe study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. The case of unbounded losses is very restrictive since the learnable processes almost surely have to visit a finite number of points and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that the k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both strong and weak settings. This closes all the COLT 2021 open problems posed on universal online learning.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/blanchard22b.html
https://proceedings.mlr.press/v178/blanchard22b.htmlUniversal Online Learning with Bounded Loss: Reduction to Binary ClassificationWe study universal consistency of non-i.i.d. processes in the context of online learning. A stochastic process is said to admit universal consistency if there exists a learner that achieves vanishing average loss for any measurable response function on this process. When the loss function is unbounded, [1] showed that the only processes admitting strong universal consistency are those taking a finite number of values almost surely. However, when the loss function is bounded, the class of processes admitting strong universal consistency is much richer and its characterization could be dependent on the response setting [2]. In this paper, we show that this class of processes is independent from the response setting thereby closing an open question of [3] (Open Problem 3). Specifically, we show that the class of processes that admit universal online learning is the same for binary classification as for multiclass classification with countable number of classes. Consequently, any output setting with bounded loss can be reduced to binary classification. Our reduction is constructive and practical. Indeed, we show that the nearest neighbor algorithm is transported by our construction. For binary classification on a process admitting strong universal learning, we prove that nearest neighbor successfully learns at least all finite unions of intervals.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/blanchard22a.html
https://proceedings.mlr.press/v178/blanchard22a.htmlOn the power of adaptivity in statistical adversariesWe initiate the study of a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the {\sl type} of allowable corruptions (noise model) as well as {\sl when} these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$. We investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature, under a unifying framework that we introduce. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}’$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of {\sl statistical query} algorithms, under all reasonable noise models. We then show that in the specific case of {\sl additive noise}, this equivalence holds for {\sl all} algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/blanc22a.html
https://proceedings.mlr.press/v178/blanc22a.htmlOn the Benefits of Large Learning Rates for Kernel MethodsThis paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. We show that this phenomenon, first observed in the deep learning literature, can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian’s eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this benefit already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/beugnot22a.html
https://proceedings.mlr.press/v178/beugnot22a.htmlOn The Memory Complexity of Uniformity TestingIn this paper we consider the problem of uniformity testing with limited memory. We observe a sequence of independent identically distributed random variables drawn from a distribution $p$ over $[n]$, which is either uniform or is $\varepsilon$-far from uniform under the total variation distance, and our goal is to determine the correct hypothesis. At each time point we are allowed to update the state of a finite-memory machine with $S$ states, where each state of the machine is assigned one of the hypotheses, and we are interested in obtaining an asymptotic probability of error at most $0<\delta<1/2$ uniformly under both hypotheses. The main contribution of this paper is deriving upper and lower bounds on the number of states $S$ needed in order to achieve a constant error probability $\delta$, as a function of $n$ and $\varepsilon$, where our upper bound is $O(\frac{n\log n}{\varepsilon})$ and our lower bound is $\Omega(n+\frac{1}{\varepsilon})$. Prior works in the field have almost exclusively used collision counting for upper bounds, and the Paninski mixture for lower bounds. Somewhat surprisingly, in the limited memory with unlimited samples setup, the optimal solution does not involve counting collisions, and the Paninski prior is not hard, thus different proof techniques are needed in order to attain our bounds.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/berg22a.html
https://proceedings.mlr.press/v178/berg22a.htmlDerivatives and residual distribution of regularized M-estimators with application to adaptive tuningThis paper studies M-estimators with a gradient-Lipschitz loss function regularized with a convex penalty in linear models with Gaussian design matrix and arbitrary noise distribution. A practical example is the robust M-estimator constructed with the Huber loss and the Elastic-Net penalty when the noise distribution has heavy tails. Our main contributions are three-fold. (i) We provide general formulae for the derivatives of regularized M-estimators $\hat\beta(y,X)$ where differentiation is taken with respect to both $X$ and $y$; this reveals a simple differentiability structure shared by all convex regularized M-estimators. (ii) Using these derivatives, we characterize the distribution of the residuals in the intermediate high-dimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized M-estimators. The criterion approximates the out-of-sample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the out-of-sample error. The proposed adaptive criterion does not require knowledge of the noise distribution or of the covariance of the design. Simulated data confirm the theoretical findings, regarding both the distribution of the residuals and the success of the criterion as a proxy of the out-of-sample error. Finally, our results reveal new relationships between the derivatives of $\hat\beta$ and the effective degrees of freedom of the M-estimators, which are of independent interest.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/bellec22a.html
https://proceedings.mlr.press/v178/bellec22a.htmlGeneralization Bounds for Data-Driven Numerical Linear AlgebraData-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main results are closely matching upper and lower bounds on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al. (NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/bartlett22a.html
https://proceedings.mlr.press/v178/bartlett22a.htmlLearning Low Degree HypergraphsWe study the problem of learning a hypergraph via edge detecting queries. In this problem, a learner queries subsets of vertices of a hidden hypergraph and observes whether these subsets contain an edge or not. In general, learning a hypergraph with $m$ edges of maximum size $d$ requires $\Omega((2m/d)^{d/2})$ queries. In this paper, we aim to identify families of hypergraphs that can be learned without suffering from a query complexity that grows exponentially in the size of the edges. We show that hypermatchings and low-degree near-uniform hypergraphs with $n$ vertices are learnable with $\mathrm{poly}(n)$ queries. For learning hypermatchings (hypergraphs of maximum degree $\Delta = 1$), we give an $O(\log^3 n)$-round algorithm with $O(n \log^5 n)$ queries. We complement this upper bound by showing that there are no algorithms with $\mathrm{poly}(n)$ queries that learn hypermatchings in $o(\log \log n)$ adaptive rounds. For hypergraphs with maximum degree $\Delta$ and edge size ratio $\rho$, we give a non-adaptive algorithm with $O((2n)^{\rho \Delta+1} \log^2 n)$ queries. To the best of our knowledge, these are the first algorithms with $\mathrm{poly}(n, m)$ query complexity for learning non-trivial families of hypergraphs that have a super-constant number of edges of arbitrarily large size.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/balkanski22a.html
https://proceedings.mlr.press/v178/balkanski22a.htmlRobustly-reliable learners under poisoning attacksData poisoning attacks, in which an adversary corrupts a training set with the goal of inducing specific desired mistakes, have raised substantial concern: even just the possibility of such an attack can make a user no longer trust the results of a learning system. In this work, we analyze when strong robustness guarantees can be achieved even in the face of such attacks. We define and show how to provide robustly-reliable predictions, in which the predicted label is guaranteed to be correct so long as the adversary has not exceeded a given corruption budget, even in the presence of instance targeted attacks, where the adversary aims to cause a failure on specific test examples. Our guarantees are substantially stronger than those in prior approaches, which were only able to provide certificates that the prediction of the learning algorithm does not change, as opposed to certifying that the prediction is correct, as we do here. Remarkably, we provide a complete characterization of learnability in this setting, in particular, nearly-tight matching upper and lower bounds on the region that can be certified, as well as efficient algorithms for computing this region given an ERM oracle. Moreover, for the case of linear separators over logconcave distributions, we provide efficient truly polynomial time algorithms (i.e., non-oracle algorithms) for such robustly-reliable predictions. We also extend these results to the active setting where the algorithm adaptively asks for labels of specific informative examples, and the difficulty is that the adversary might even be adaptive to this interaction, as well as to the agnostic learning setting where there is no perfect classifier even over the uncorrupted data.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/balcan22a.html
https://proceedings.mlr.press/v178/balcan22a.htmlTowards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte CarloFor the task of sampling from a density $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is possibly non-convex but $L$-gradient Lipschitz, we prove that averaged Langevin Monte Carlo outputs a sample with $\varepsilon$-relative Fisher information after $O(L^2 d^2/\varepsilon^2)$ iterations. This is the sampling analogue of complexity bounds for finding an $\varepsilon$-approximate first-order stationary point in non-convex optimization and therefore constitutes a first step towards the general theory of non-log-concave sampling. We discuss numerous extensions and applications of our result; in particular, it yields a new state-of-the-art guarantee for sampling from distributions which satisfy a Poincaré inequality.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/balasubramanian22a.html
https://proceedings.mlr.press/v178/balasubramanian22a.htmlUniform Stability for First-Order Empirical Risk MinimizationWe consider the problem of designing uniformly stable first-order optimization algorithms for empirical risk minimization. Uniform stability is often used to obtain generalization error bounds for optimization algorithms, and we are interested in a general approach to achieve it. For Euclidean geometry, we suggest a black-box conversion which, given a smooth optimization algorithm, produces a uniformly stable version of the algorithm while maintaining its convergence rate up to logarithmic factors. Using this reduction, we obtain a (nearly) optimal algorithm for smooth optimization with convergence rate $\tilde{O}(1/T^2)$ and uniform stability $O(T^2/n)$, resolving an open problem of Chen et al. (2018); Attia and Koren (2021). For more general geometries, we develop a variant of Mirror Descent for smooth optimization with convergence rate $\tilde{O}(1/T)$ and uniform stability $O(T/n)$, leaving open the question of devising a general conversion method as in the Euclidean case.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/attia22a.html
https://proceedings.mlr.press/v178/attia22a.htmlHierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower BoundsThe Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC’16]. Assuming that the input is an $n$-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs: – With O(n polylog n) space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm by Charikar and Chatziafratis [SODA’17]. – When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an $o(\log n/\log\log n)$ factor, even when allowed polylog(n) passes over the input and exponential time. – In the most stringent setting of polylog(n) space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective values. – Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering requires storing almost the entire input even if allowed exponential time.
Our algorithmic results establish a general structural result that proves that cut sparsifiers of input graphs can preserve the cost of “balanced” hierarchical trees to within some constant factor, and thus can be used in place of the original (dense) graphs when solving HC. Our lower bound results involve establishing a new streaming lower bound for a novel problem “One-vs-Many-Expanders”, which can be of independent interest.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/assadi22a.html
https://proceedings.mlr.press/v178/assadi22a.htmlPrivate and polynomial time algorithms for learning Gaussians and beyondWe present a fairly general framework for reducing $(\varepsilon, \delta)$-differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\tilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/(\alpha \varepsilon) + d\ln(1/\delta) / (\alpha \varepsilon))$, matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath (2021). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman (2021) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\tilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker (2021) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\tilde{O}(d^8)$.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/ashtiani22a.html
https://proceedings.mlr.press/v178/ashtiani22a.htmlFrom Sampling to Optimization on Discrete Domains with Applications to Determinant MaximizationWe establish a connection between sampling and optimization on discrete domains. For a family of distributions $\mu$ defined on size $k$ subsets of a ground set of elements, that is closed under external fields, we show that rapid mixing of natural local random walks implies the existence of simple approximation algorithms to find $\max \mu(\cdot)$. More precisely, we show that if $t$-step down-up random walks have spectral gap at least inverse polynomially large, then $t$-step local search finds $\max \mu(\cdot)$ within a factor of $k^{O(k)}$. As the main application of our result, we show that $2$-step local search achieves a nearly-optimal $k^{O(k)}$-factor approximation for MAP inference on nonsymmetric $k$-DPPs. This is the first nontrivial multiplicative approximation algorithm for this problem. In our main technical result, we show that an exchange inequality, a concept rooted in discrete convex analysis, can be derived from fast mixing of local random walks. We further advance the state of the art on the mixing of random walks for nonsymmetric DPPs and more generally sector-stable distributions, by obtaining the tightest possible bound on the step size needed for polynomial-time mixing of random walks. We bring the step size down by a factor of $2$ compared to prior works, and consequently get a quadratic improvement on the runtime of local search steps; this improvement is potentially of independent interest in sampling applications.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/anari22a.html
https://proceedings.mlr.press/v178/anari22a.htmlStochastic Variance Reduction for Variational Inequality MethodsWe propose stochastic variance reduced algorithms for solving convex-concave saddle point problems, monotone variational inequalities, and monotone inclusions. Our framework applies to extragradient, forward-backward-forward, and forward-reflected-backward methods both in Euclidean and Bregman setups. All proposed methods converge in exactly the same setting as their deterministic counterparts and they either match or improve the best-known complexities for solving structured min-max problems. Our results reinforce the correspondence between variance reduction in variational inequalities and minimization. We also illustrate the improvements of our approach with numerical evaluations on matrix games.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/alacaoglu22a.html
https://proceedings.mlr.press/v178/alacaoglu22a.htmlNon-Linear Reinforcement Learning in Large Action Spaces: Structural Conditions and Sample-efficiency of Posterior SamplingProvably sample-efficient Reinforcement Learning (RL) with rich observations and function approximation has witnessed tremendous recent progress, particularly when the underlying function approximators are linear. In this linear regime, computationally and statistically efficient methods exist where the potentially infinite state and action spaces can be captured through a known feature embedding, with the sample complexity scaling with the (intrinsic) dimension of these features. When the action space is finite, significantly more sophisticated results allow non-linear function approximation under appropriate structural constraints on the underlying RL problem, permitting, for instance, the learning of good features instead of assuming access to them. In this work, we present the first result for non-linear function approximation which holds for general action spaces under a linear embeddability condition, which generalizes all linear and finite action settings. We design a novel optimistic posterior sampling strategy, TS$^3$, for such problems. We further show worst case sample complexity guarantees that scale with a rank parameter of the RL problem, the linear embedding dimension introduced here, and standard measures of function class complexity.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/agarwal22c.html
https://proceedings.mlr.press/v178/agarwal22c.htmlMinimax Regret Optimization for Robust Machine Learning under Distribution ShiftIn this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e., distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions, this method achieves uniformly low regret across all test distributions. We also adapt our technique to have strong guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespread optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be an attractive framework to address a broad range of distribution shift scenarios.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/agarwal22b.html
https://proceedings.mlr.press/v178/agarwal22b.htmlA Sharp Memory-Regret Trade-off for Multi-Pass Streaming BanditsThe stochastic $K$-armed bandit problem has been studied extensively due to its applications in various domains ranging from online advertising to clinical trials. In practice, however, the number of arms can be very large, resulting in large memory requirements for simultaneously processing them. In this paper we consider a streaming setting where the arms are presented in a stream and the algorithm uses limited memory to process these arms. Here, the goal is not only to minimize regret, but also to do so in minimal memory. Previous algorithms for this problem operate in one of two settings: they either use $\Omega(\log \log T)$ passes over the stream \citep{rathod2021reducing, ChaudhuriKa20, Liau+18}, or just a single pass \citep{Maiti+21}. In this paper we study the trade-off between memory and regret when $B$ passes over the stream are allowed, for any $B \geq 1$, and establish \emph{tight} regret upper and lower bounds for any $B$-pass algorithm. Our results uncover a surprising \emph{sharp transition phenomenon}: $O(1)$ memory is sufficient to achieve $\widetilde\Theta\left(T^{\frac{1}{2} + \frac{1}{2^{B+2}-2}}\right)$ regret in $B$ passes, and increasing the memory to any quantity that is $o(K)$ has almost no impact on further reducing this regret, unless we use $\Omega(K)$ memory. Our main technical contribution is our lower bound, which requires the use of \emph{information-theoretic techniques} as well as ideas from \emph{round elimination} to show that the \emph{residual problem} remains challenging over subsequent passes.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/agarwal22a.html
https://proceedings.mlr.press/v178/agarwal22a.htmlThe Role of Interactivity in Structured EstimationWe study high-dimensional sparse estimation under three natural constraints: communication constraints, local privacy constraints, and linear measurements (compressive sensing). Without sparsity assumptions, it has been established that interactivity cannot improve the minimax rates of estimation under these information constraints. The question of whether interactivity helps with natural inference tasks has been a topic of active research. We settle this question in the affirmative for the prototypical problems of high-dimensional sparse mean estimation and compressive sensing, by demonstrating a gap between interactive and noninteractive protocols. We further establish that the gap increases when we have more structured sparsity: for \emph{block sparsity} this gap can be as large as \emph{polynomial} in the dimensionality. Thus, the more structured the sparsity is, the greater is the advantage of interaction. Proving the lower bounds requires a careful breaking of a sum of correlated random variables into independent components using Baranyai’s theorem on decomposition of hypergraphs, which might be of independent interest.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/acharya22b.html
https://proceedings.mlr.press/v178/acharya22b.htmlRobust Estimation for Random GraphsWe study the problem of robustly estimating the parameter $p$ of an Erdős-Rényi random graph on $n$ nodes, where a $\gamma$ fraction of nodes may be adversarially corrupted. After showing the deficiencies of canonical estimators, we design a computationally-efficient spectral algorithm which estimates $p$ up to accuracy $\tilde O(\sqrt{p(1-p)}/n + \gamma\sqrt{p(1-p)} /\sqrt{n}+ \gamma/n)$ for $\gamma < 1/60$. Furthermore, we give an inefficient algorithm with similar accuracy for all $\gamma<1/2$, the information-theoretic limit. Finally, we prove a nearly-matching statistical lower bound, showing that the error of our algorithms is optimal up to logarithmic factors.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/acharya22a.html
https://proceedings.mlr.press/v178/acharya22a.htmlThe merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networksIt is currently known how to characterize functions that neural networks can learn with SGD for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest —non-linear but regular networks— no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$. Our main results characterize a hierarchical property —the merged-staircase property— that is both \emph{necessary and nearly sufficient} for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new “dimension-free” dynamics approximation result that applies to functions defined on a latent space of low-dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.Tue, 28 Jun 2022 00:00:00 +0000
https://proceedings.mlr.press/v178/abbe22a.html
https://proceedings.mlr.press/v178/abbe22a.html