
- title: 'Conference on Learning Theory 2021: Post-conference Preface'
  volume: 134
  URL: https://proceedings.mlr.press/v134/belkin21a.html
  PDF: http://proceedings.mlr.press/v134/belkin21a/belkin21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-10-09-belkin21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: i-iii
  id: belkin21a
  issued:
    date-parts: 
      - 2021
      - 10
      - 9
  firstpage: i
  lastpage: iii
  published: 2021-10-09 00:00:00 +0000
- title: 'Stochastic block model entropy and broadcasting on trees with survey'
  abstract: 'The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [Coja-Oghlan et al. (2017)] and for the classical case of assortative communities but in the dense regime [Deshpande et al. (2016)]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR except for the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [Kanade et al. (2016)], [Mossel et al. (2016)] and [Mossel and Xu (2015)]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/abbe21a.html
  PDF: http://proceedings.mlr.press/v134/abbe21a/abbe21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-abbe21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Emmanuel
    family: Abbe
  - given: Elisabetta
    family: Cornacchia
  - given: Yuzhou
    family: Gu
  - given: Yury
    family: Polyanskiy
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1-25
  id: abbe21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1
  lastpage: 25
  published: 2021-07-21 00:00:00 +0000
- title: 'Regret Minimization in Heavy-Tailed Bandits'
  abstract: 'We revisit the classic regret-minimization problem in the stochastic multi-armed bandit setting when the arm-distributions are allowed to be heavy-tailed. Regret minimization has been well studied in simpler settings of either bounded support reward distributions or distributions that belong to a single parameter exponential family. We work under the much weaker assumption that the moments of order \((1+\epsilon)\) are uniformly bounded by a known constant \(B\), for some given \(\epsilon > 0\). We propose an optimal algorithm that matches the lower bound exactly in the first-order term. We also give a finite-time bound on its regret. We show that our index concentrates faster than the well-known truncated or trimmed empirical mean estimators for the mean of heavy-tailed distributions. Computing our index can be computationally demanding. To address this, we develop a batch-based algorithm that is optimal up to a multiplicative constant depending on the batch size. We hence provide a controlled trade-off between statistical optimality and computational cost.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/agrawal21a.html
  PDF: http://proceedings.mlr.press/v134/agrawal21a/agrawal21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-agrawal21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Shubhada
    family: Agrawal
  - given: Sandeep K.
    family: Juneja
  - given: Wouter M.
    family: Koolen
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 26-62
  id: agrawal21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 26
  lastpage: 62
  published: 2021-07-21 00:00:00 +0000
- title: 'SGD Generalizes Better Than GD (And Regularization Doesn’t Help)'
  abstract: 'We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that $O(1/\epsilon^2)$ iterations suffice for obtaining a solution with $\epsilon$ excess expected risk, we show that with the same number of steps GD may overfit and emit a solution with $\Omega(1)$ generalization error. Moreover, we show that in fact $\Omega(1/\epsilon^4)$ iterations are necessary for GD to match the generalization performance of SGD, which is also tight due to recent work by Bassily et al. (2020). We further discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/amir21a.html
  PDF: http://proceedings.mlr.press/v134/amir21a/amir21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-amir21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Idan
    family: Amir
  - given: Tomer
    family: Koren
  - given: Roi
    family: Livni
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 63-92
  id: amir21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 63
  lastpage: 92
  published: 2021-07-21 00:00:00 +0000
- title: 'The Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications for Profile Maximum Likelihood'
  abstract: 'In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that given $n$ i.i.d. samples from a discrete distribution, achieve an approximation factor of $\exp\left(-O(\sqrt{n} \log n) \right)$, improving upon the previous best-known bound achievable in polynomial time of $\exp(-O(n^{2/3} \log n))$ (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions for a broader range of error parameters. To obtain our results on PML we establish new connections between PML and the well-studied Bethe and Sinkhorn approximations to the permanent (Vontobel, 2012 and 2014). It is known that the PML objective is proportional to the permanent of a certain Vandermonde matrix (Vontobel, 2012) with $\sqrt{n}$ distinct columns, i.e. with non-negative rank at most $\sqrt{n}$. This allows us to show that the convex approximation to computing PML distributions studied in (Charikar, Shiragur and Sidford, 2019) is governed, in part, by the quality of Sinkhorn approximations to the permanent. We show that both Bethe and Sinkhorn permanents are $\exp(O(k \log(N/k)))$ approximations to the permanent of $N \times N$ matrices with non-negative rank at most $k$. This improves upon the previously known bound of $\exp(O(N))$, and combining these insights with careful rounding of the convex relaxation yields our results.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/anari21a.html
  PDF: http://proceedings.mlr.press/v134/anari21a/anari21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-anari21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nima
    family: Anari
  - given: Moses
    family: Charikar
  - given: Kirankumar
    family: Shiragur
  - given: Aaron
    family: Sidford
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 93-158
  id: anari21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 93
  lastpage: 158
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning in Matrix Games can be Arbitrarily Complex'
  abstract: 'Many multi-agent systems with strategic interactions have their desired functionality encoded as the Nash equilibrium of a game, e.g. machine learning architectures such as Generative Adversarial Networks. Directly computing a Nash equilibrium of these games is often impractical or impossible in practice, which has led to the development of numerous learning algorithms with the goal of iteratively converging on a Nash equilibrium. Unfortunately, the dynamics generated by the learning process can be very intricate and instances failing to converge become hard to interpret. In this paper we show that, in a strong sense, this dynamic complexity is inherent to games. Specifically, we prove that replicator dynamics, the continuous-time analogue of Multiplicative Weights Update, even when applied in a very restricted class of games–known as finite matrix games–is rich enough to be able to approximate arbitrary dynamical systems. In the context of machine learning, our results are positive in the sense that they show the nearly boundless dynamic modelling capabilities of current machine learning practices, but also negative in implying that these capabilities may come at the cost of interpretability. As a concrete example, we show how replicator dynamics can effectively reproduce the well-known strange attractor of Lorenz dynamics (the “butterfly effect”) while achieving no regret.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/andrade21a.html
  PDF: http://proceedings.mlr.press/v134/andrade21a/andrade21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-andrade21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gabriel P.
    family: Andrade
  - given: Rafael
    family: Frongillo
  - given: Georgios
    family: Piliouras
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 159-185
  id: andrade21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 159
  lastpage: 185
  published: 2021-07-21 00:00:00 +0000
- title: 'Functions with average smoothness: structure, algorithms, and learning'
  abstract: 'We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds — assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. In addition to the usual average, we also examine a “weak” average that is more forgiving and yields a much wider function class.  Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the {\em empirical} covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/ashlagi21a.html
  PDF: http://proceedings.mlr.press/v134/ashlagi21a/ashlagi21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-ashlagi21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yair
    family: Ashlagi
  - given: Lee-Ad
    family: Gottlieb
  - given: Aryeh
    family: Kontorovich
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 186-236
  id: ashlagi21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 186
  lastpage: 236
  published: 2021-07-21 00:00:00 +0000
- title: 'Adversarially Robust Low Dimensional Representations'
  abstract: 'Many machine learning systems are vulnerable to small perturbations made to inputs either at test time or at training time. This has received much recent interest on the empirical front due to applications where reliability and security are critical. However, theoretical understanding of algorithms that are robust to adversarial perturbations is limited.  In this work we focus on Principal Component Analysis (PCA), a ubiquitous algorithmic primitive in machine learning.  We formulate a natural robust variant of PCA where the goal is to find a low dimensional subspace to represent the given data with minimum projection error, that is in addition robust to small perturbations measured in $\ell_q$ norm (say $q=\infty$). Unlike PCA which is solvable in polynomial time, our formulation is computationally intractable to optimize as it captures a variant of the well-studied sparse PCA objective as a special case. We show the following results: 1. Polynomial time algorithm that is constant factor competitive in the worst-case with respect to the best subspace, in terms of the projection error and the robustness criterion.  2. We show that our algorithmic techniques can also be made robust to adversarial training-time perturbations, in addition to yielding representations that are robust to adversarial perturbations at test time. Specifically, we design algorithms for a strong notion of training-time perturbations, where every point is adversarially perturbed up to a specified amount.  3. We illustrate the broad applicability of our algorithmic techniques in addressing robustness to adversarial perturbations, both at training time and test time. In particular, our adversarially robust PCA primitive leads to computationally efficient and robust algorithms for both unsupervised and supervised learning problems such as clustering and learning adversarially robust classifiers.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/awasthi21a.html
  PDF: http://proceedings.mlr.press/v134/awasthi21a/awasthi21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-awasthi21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Pranjal
    family: Awasthi
  - given: Vaggos
    family: Chatziafratis
  - given: Xue
    family: Chen
  - given: Aravindan
    family: Vijayaraghavan
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 237-325
  id: awasthi21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 237
  lastpage: 325
  published: 2021-07-21 00:00:00 +0000
- title: 'The Last-Iterate Convergence Rate of Optimistic Mirror Descent in Stochastic Variational Inequalities'
  abstract: 'In this paper, we analyze the local convergence rate of optimistic mirror descent methods in stochastic variational inequalities, a class of optimization problems with important applications to learning theory and machine learning. Our analysis reveals an intricate relation between the algorithm’s rate of convergence and the local geometry induced by the method’s underlying Bregman function. We quantify this relation by means of the Legendre exponent, a notion that we introduce to measure the growth rate of the Bregman divergence relative to the ambient norm near a solution. We show that this exponent determines both the optimal step-size policy of the algorithm and the optimal rates attained, explaining in this way the differences observed for some popular Bregman functions (Euclidean projection, negative entropy, fractional power, etc.).'
  volume: 134
  URL: https://proceedings.mlr.press/v134/azizian21a.html
  PDF: http://proceedings.mlr.press/v134/azizian21a/azizian21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-azizian21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Waïss
    family: Azizian
  - given: Franck
    family: Iutzeler
  - given: Jérôme
    family: Malick
  - given: Panayotis
    family: Mertikopoulos
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 326-358
  id: azizian21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 326
  lastpage: 358
  published: 2021-07-21 00:00:00 +0000
- title: 'Optimal Dynamic Regret in Exp-Concave Online Learning'
  abstract: 'We consider the problem of Zinkevich (2003)-style dynamic regret minimization in online learning with \emph{exp-concave} losses. We show that whenever improper learning is allowed, a Strongly Adaptive online learner achieves the dynamic regret of $\tilde O^*(n^{1/3}C_n^{2/3} \vee 1)$ where $C_n$ is the \emph{total variation} (a.k.a. \emph{path length}) of an arbitrary sequence of comparators that may not be known to the learner ahead of time. Achieving this rate was highly nontrivial even for square losses in 1D where the best known upper bound was $O(\sqrt{nC_n} \vee \log n)$ (Yuan and Lamperski, 2019). Our new proof techniques make elegant use of the intricate structures of the primal and dual variables imposed by the KKT conditions and could be of independent interest. Finally, we apply our results to the classical statistical problem of \emph{locally adaptive non-parametric regression} (Mammen, 1991; Donoho and Johnstone, 1998) and obtain a stronger and more flexible algorithm that does not require any statistical assumptions or any hyperparameter tuning.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/baby21a.html
  PDF: http://proceedings.mlr.press/v134/baby21a/baby21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-baby21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dheeraj
    family: Baby
  - given: Yu-Xiang
    family: Wang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 359-409
  id: baby21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 359
  lastpage: 409
  published: 2021-07-21 00:00:00 +0000
- title: 'Spectral Planting and the Hardness of Refuting Cuts, Colorability, and Communities in Random Graphs'
  abstract: 'We study the problem of efficiently refuting the k-colorability of a graph, or equivalently, certifying a lower bound on its chromatic number. We give formal evidence of average-case computational hardness for this problem in sparse random regular graphs, suggesting that there is no polynomial-time algorithm that improves upon a classical spectral algorithm. Our evidence takes the form of a "computationally-quiet planting": we construct a distribution of d-regular graphs that has significantly smaller chromatic number than a typical regular graph drawn uniformly at random, while providing evidence that these two distributions are indistinguishable by a large class of algorithms. We generalize our results to the more general problem of certifying an upper bound on the maximum k-cut.  This quiet planting is achieved by minimizing the effect of the planted structure (e.g. colorings or cuts) on the graph spectrum. Specifically, the planted structure corresponds exactly to eigenvectors of the adjacency matrix. This avoids the pushout effect of random matrix theory, and delays the point at which the planting becomes visible in the spectrum or local statistics. To illustrate this further, we give similar results for a Gaussian analogue of this problem: a quiet version of the spiked model, where we plant an eigenspace rather than adding a generic low-rank perturbation.  Our evidence for computational hardness of distinguishing two distributions is based on three different heuristics: stability of belief propagation, the local statistics hierarchy, and the low-degree likelihood ratio. Of independent interest, our results include general-purpose bounds on the low-degree likelihood ratio for multi-spiked matrix models, and an improved low-degree analysis of the stochastic block model.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bandeira21a.html
  PDF: http://proceedings.mlr.press/v134/bandeira21a/bandeira21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bandeira21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Afonso S.
    family: Bandeira
  - given: Jess
    family: Banks
  - given: Dmitriy
    family: Kunisky
  - given: Christopher
    family: Moore
  - given: Alex
    family: Wein
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 410-473
  id: bandeira21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 410
  lastpage: 473
  published: 2021-07-21 00:00:00 +0000
- title: 'Non-Euclidean Differentially Private Stochastic Convex Optimization'
  abstract: 'Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) w.r.t. the $\ell_2$ norm over a constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting. In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups. For $p=1$, under a standard smoothness assumption, we give a new algorithm with nearly optimal excess risk. This result also extends to general polyhedral norms and feasible sets. For $p\in(1, 2)$, we give two new algorithms, whose central building block is a novel privacy mechanism, which generalizes the Gaussian mechanism. Moreover, we establish a lower bound on the excess risk for this range of $p$, showing a necessary dependence on $\sqrt{d}$, where $d$ is the dimension of the space. Our lower bound implies a sudden transition of the excess risk at $p=1$, where the dependence on $d$ changes from logarithmic to polynomial, resolving an open question in prior work \citep{TTZ15a}. For $p\in (2, \infty)$, noisy SGD attains optimal excess risk in the low-dimensional regime; in particular, this proves the optimality of noisy SGD for $p=\infty$. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bassily21a.html
  PDF: http://proceedings.mlr.press/v134/bassily21a/bassily21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bassily21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Raef
    family: Bassily
  - given: Cristobal
    family: Guzman
  - given: Anupama
    family: Nandi
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 474-499
  id: bassily21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 474
  lastpage: 499
  published: 2021-07-21 00:00:00 +0000
- title: 'Reconstructing weighted voting schemes from partial information about their power indices'
  abstract: 'A number of recent works [Goldberg 2006; O’Donnell and Servedio 2011; De, Diakonikolas, and Servedio 2017; De, Diakonikolas, Feldman, and Servedio 2014] have considered the problem of approximately reconstructing an unknown weighted voting scheme given information about various sorts of “power indices” that characterize the level of control that individual voters have over the final outcome. In the language of theoretical computer science, this is the problem of approximating an unknown linear threshold function (LTF) over ${-1,1}^n$ given some numerical measure (such as the function’s n “Chow parameters,” a.k.a. its degree-1 Fourier coefficients, or the vector of its n Shapley indices) of how much each of the n individual input variables affects the outcome of the function. In this paper we consider the problem of reconstructing an LTF given only partial information about its Chow parameters or Shapley indices; i.e. we are given only the Chow parameters or the Shapley indices corresponding to a subset $S\subseteq [n]$ of the n input variables. A natural goal in this partial information setting is to find an LTF whose Chow parameters or Shapley indices corresponding to indices in S accurately match the given Chow parameters or Shapley indices of the unknown LTF. We refer to this as the Partial Inverse Power Index Problem. Our main results are a polynomial time algorithm for the ($\epsilon$-approximate) Chow Parameters Partial Inverse Power Index Problem and a quasi-polynomial time algorithm for the ($\epsilon$-approximate) Shapley Indices Partial Inverse Power Index Problem.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bennett21a.html
  PDF: http://proceedings.mlr.press/v134/bennett21a/bennett21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bennett21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Huck
    family: Bennett
  - given: Anindya
    family: De
  - given: Rocco
    family: Servedio
  - given: Emmanouil Vasileios
    family: Vlatakis-Gkaragkounis
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 500-565
  id: bennett21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 500
  lastpage: 565
  published: 2021-07-21 00:00:00 +0000
- title: 'Deterministic Finite-Memory Bias Estimation'
  abstract: 'In this paper we consider the problem of estimating a Bernoulli parameter using finite memory. Let $X_1,X_2,\ldots$ be a sequence of independent identically distributed Bernoulli random variables with expectation $\theta$, where $\theta \in [0,1]$. Consider a finite-memory deterministic machine with $S$ states, that updates its state $M_n \in \{1,2,\ldots,S\}$ at each time according to the rule $M_n = f(M_{n-1},X_n)$, where $f$ is a deterministic time-invariant function. Assume that the machine outputs an estimate at each time point according to some fixed mapping from the state space to the unit interval. The quality of the estimation procedure is measured by the asymptotic risk, which is the long-term average of the instantaneous quadratic risk. The main contribution of this paper is an upper bound on the smallest worst-case asymptotic risk any such machine can attain. This bound coincides with a lower bound derived by Leighton and Rivest, implying that $\Theta(1/S)$ is the minimax asymptotic risk for deterministic $S$-state machines. In particular, our result disproves a longstanding $\Theta(\log S/S)$ conjecture for this quantity, also posed by Leighton and Rivest.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/berg21a.html
  PDF: http://proceedings.mlr.press/v134/berg21a/berg21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-berg21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tomer
    family: Berg
  - given: Or
    family: Ordentlich
  - given: Ofer
    family: Shayevitz
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 566-585
  id: berg21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 566
  lastpage: 585
  published: 2021-07-21 00:00:00 +0000
- title: 'Online Learning from Optimal Actions'
  abstract: 'We study the problem of online contextual optimization where, at each period, instead of observing the loss, we observe, after-the-fact, the optimal action an oracle with full knowledge of the objective function would have taken. At each period, the decision-maker has access to a new set of feasible actions to select from and to a new contextual function that affects that period’s loss function. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an all-knowing oracle. We obtain the first regret bound for this problem that is logarithmic in the time horizon. Our results are derived through the development and analysis of a novel algorithmic structure that leverages the underlying geometry of the problem.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/besbes21a.html
  PDF: http://proceedings.mlr.press/v134/besbes21a/besbes21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-besbes21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Omar
    family: Besbes
  - given: Yuri
    family: Fonseca
  - given: Ilan
    family: Lobel
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 586-586
  id: besbes21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 586
  lastpage: 586
  published: 2021-07-21 00:00:00 +0000
- title: 'Majorizing Measures, Sequential Complexities, and Online Learning'
  abstract: 'We introduce the technique of generic chaining and majorizing measures for controlling sequential Rademacher complexity. We relate majorizing measures to the notion of fractional covering numbers, which we show to be dominated in terms of sequential scale-sensitive dimensions in a horizon-independent way, and, under additional complexity assumptions, establish a tight control on worst-case sequential Rademacher complexity in terms of the integral of sequential scale-sensitive dimension. Finally, we establish a tight contraction inequality for worst-case sequential Rademacher complexity. The above constitutes the resolution of a number of outstanding open problems in extending the classical theory of empirical processes to the sequential case, and, in turn, establishes sharp results for online learning.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/block21a.html
  PDF: http://proceedings.mlr.press/v134/block21a/block21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-block21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Adam
    family: Block
  - given: Yuval
    family: Dagan
  - given: Alexander
    family: Rakhlin
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 587-590
  id: block21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 587
  lastpage: 590
  published: 2021-07-21 00:00:00 +0000
- title: 'Robust learning under clean-label attack'
  abstract: 'We study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) \emph{correctly-labeled} examples to the training set to fool the algorithm into making mistakes on \emph{specific} test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on $\epsilon$ in its PAC sample complexity, i.e., $O(1/\epsilon)$. On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to adding at most $t=O(1/\epsilon)$ poison examples, the optimal robust learning sample complexity grows linearly with $t$.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/blum21a.html
  PDF: http://proceedings.mlr.press/v134/blum21a/blum21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-blum21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Avrim
    family: Blum
  - given: Steve
    family: Hanneke
  - given: Jian
    family: Qian
  - given: Han
    family: Shao
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 591-634
  id: blum21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 591
  lastpage: 634
  published: 2021-07-21 00:00:00 +0000
- title: 'Rank-one matrix estimation: analytic time evolution of gradient descent dynamics'
  abstract: 'We consider a rank-one symmetric matrix corrupted by additive noise. The rank-one matrix is formed by an n-component unknown vector on the sphere of radius $\sqrt{n}$, and we consider the problem of estimating this vector from the corrupted matrix in the high dimensional limit of $n$ large, by gradient descent for a quadratic cost function on the sphere. Explicit formulas for the whole time evolution of the overlap between the estimator and unknown vector, as well as the cost, are rigorously derived. In the long time limit we recover the well known spectral phase transition, as a function of the signal-to-noise ratio. The explicit formulas also allow us to point out interesting transient features of the time evolution. Our analysis technique is based on recent progress in random matrix theory and uses local versions of the semi-circle law.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bodin21a.html
  PDF: http://proceedings.mlr.press/v134/bodin21a/bodin21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bodin21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Antoine
    family: Bodin
  - given: Nicolas
    family: Macris
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 635-678
  id: bodin21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 635
  lastpage: 678
  published: 2021-07-21 00:00:00 +0000
- title: 'Multiplayer Bandit Learning, from Competition to Cooperation'
  abstract: 'The stochastic multi-armed bandit model captures the tradeoff between exploration and exploitation. We study the effects of competition and cooperation on this tradeoff. Suppose there are two arms, one predictable and one risky, and two players, Alice and Bob. In every round, each player pulls an arm, receives the resulting reward, and observes the choice of the other player but not their reward. Alice’s utility is $\Gamma_A + \lambda \Gamma_B$ (and similarly for Bob), where $\Gamma_A$ is Alice’s total reward and $\lambda \in [-1, 1]$ is a cooperation parameter. At $\lambda = -1$ the players are competing in a zero-sum game, at $\lambda = 1$, their interests are aligned, and at $\lambda = 0$, they are neutral: each player’s utility is their own reward. The model is related to the economics literature on strategic experimentation, where usually players observe each other’s rewards.  Suppose the predictable arm has success probability $p$ and the risky arm has prior $\mu$. If the discount factor is $\beta$, then the value of $p$ where a single player is indifferent between the arms is the Gittins index $g = g(\mu,\beta) > m$, where $m$ is the mean of the risky arm.  Our first result answers, in this setting, a fundamental question posed by Rothschild \cite{rotschild}. We show that competing and neutral players eventually settle on the same arm (even though it may not be the best arm) in every Nash equilibrium, while this can fail for players with aligned interests.  Moreover, we show that \emph{competing players} explore \emph{less} than a single player: there is $p^* \in (m, g)$ so that for all $p > p^*$, the players stay at the predictable arm. However, the players are not myopic: they still explore for some $p > m$. On the other hand, \emph{cooperating players} (with $\lambda =1$) explore \emph{more} than a single player. 
We also show that \emph{neutral players} learn from each other, receiving strictly higher total rewards than they would playing alone, for all $ p\in (p^*, g)$, where $p^*$ is the threshold above which competing players do not explore.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/branzei21a.html
  PDF: http://proceedings.mlr.press/v134/branzei21a/branzei21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-branzei21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Simina
    family: Branzei
  - given: Yuval
    family: Peres
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 679-723
  id: branzei21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 679
  lastpage: 723
  published: 2021-07-21 00:00:00 +0000
- title: 'Near Optimal Distributed Learning of Halfspaces with Two Parties'
  abstract: '<i>Distributed learning</i> protocols are designed to train on distributed data without gathering it all on a single centralized machine, thus contributing to the efficiency of the system and enhancing its privacy. We study a central problem in distributed learning, called {\it distributed learning of halfspaces}: let $U \subseteq \mathbb{R}^d$ be a known domain of size $n$ and let $h:\mathbb{R}^d\to \mathbb{R}$ be an unknown target affine function.\footnote{In practice, the domain $U$ is defined implicitly by the representation of $d$-dimensional vectors which is used in the protocol.} A set of <i>examples</i> $\{(u,b)\}$ is distributed between several parties, where~$u \in U$ is a point and $b = \mathsf{sign}(h(u)) \in \{\pm 1\}$ is its label. The parties&#39; goal is to agree on a classifier~$f: U\to\{\pm 1\}$ such that~$f(u)=b$ for every input example~$(u,b)$.
We design a protocol for the distributed halfspace learning problem in the two-party setting, communicating only $\tilde O(d\log n)$ bits. To this end, we introduce a new tool called <i>halfspace containers</i>, that is closely related to <i>bracketing numbers</i> in statistics and to <i>hyperplane cuttings</i> in discrete geometry, and allows for a compressed approximate representation of every halfspace. We complement our upper bound result by an almost matching $\tilde \Omega(d\log n)$ lower bound on the communication complexity of any such protocol.
Since the distributed halfspace learning problem is closely related to the <i>convex set disjointness</i> problem in communication complexity and the problem of <i>distributed linear programming</i> in distributed optimization, we also derive upper and lower bounds of $\tilde O(d^2\log n)$ and~$\tilde{\Omega}(d\log n)$ on the communication complexity of both of these basic problems.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/braverman21a.html
  PDF: http://proceedings.mlr.press/v134/braverman21a/braverman21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-braverman21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mark
    family: Braverman
  - given: Gillat
    family: Kol
  - given: Shay
    family: Moran
  - given: Raghuvansh R.
    family: Saxena
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 724-758
  id: braverman21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 724
  lastpage: 758
  published: 2021-07-21 00:00:00 +0000
- title: 'Near-Optimal Entrywise Sampling of Numerically Sparse Matrices'
  abstract: 'Many real-world data sets are sparse or almost sparse. One method to measure this for a matrix $A\in \mathbb{R}^{n\times n}$ is the \emph{numerical sparsity}, denoted $\mathsf{ns}(A)$, defined as the minimum $k\geq 1$ such that $\|a\|_1/\|a\|_2 \leq \sqrt{k}$ for every row and every column $a$ of $A$. This measure of $a$ is smooth and is clearly at most the number of non-zeros in the row/column $a$.  The seminal work of Achlioptas and McSherry (2007) has put forward the question of approximating an input matrix $A$ by entrywise sampling. More precisely, the goal is to quickly compute a sparse matrix $\tilde{A}$ satisfying $\|A - \tilde{A}\|_2 \leq \epsilon \|A\|_2$ (i.e., additive spectral approximation) given an error parameter $\epsilon>0$. The known schemes sample and rescale a small fraction of entries from $A$.  We propose a scheme that sparsifies an almost-sparse matrix $A$ — it produces a matrix $\tilde{A}$ with $O(\epsilon^{-2}\mathsf{ns}(A) \cdot n\ln n)$ non-zero entries with high probability. We also prove that this upper bound on $\mathsf{nnz}(\tilde{A})$ is \emph{tight} up to logarithmic factors. Moreover, our upper bound improves when the spectrum of $A$ decays quickly (roughly replacing $n$ with the stable rank of $A$). Our scheme can be implemented in time $O(\mathsf{nnz}(A))$ when $\|A\|_2$ is given. Previously, a similar upper bound was obtained by Achlioptas et al. (2013), but only for a restricted class of inputs that does not even include symmetric or covariance matrices. Finally, we demonstrate two applications of these sampling techniques, to faster approximate matrix multiplication, and to ridge regression by using sparse preconditioners.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/braverman21b.html
  PDF: http://proceedings.mlr.press/v134/braverman21b/braverman21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-braverman21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vladimir
    family: Braverman
  - given: Robert
    family: Krauthgamer
  - given: Aditya R.
    family: Krishnan
  - given: Shay
    family: Sapir
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 759-773
  id: braverman21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 759
  lastpage: 773
  published: 2021-07-21 00:00:00 +0000
- title: 'Statistical Query Algorithms and Low Degree Tests Are Almost Equivalent'
  abstract: 'Researchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/brennan21a.html
  PDF: http://proceedings.mlr.press/v134/brennan21a/brennan21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-brennan21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Matthew S
    family: Brennan
  - given: Guy
    family: Bresler
  - given: Sam
    family: Hopkins
  - given: Jerry
    family: Li
  - given: Tselil
    family: Schramm
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 774-774
  id: brennan21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 774
  lastpage: 774
  published: 2021-07-21 00:00:00 +0000
- title: 'Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries'
  abstract: 'We investigate the problem of exact cluster recovery using oracle queries.  Previous results show that clusters in Euclidean spaces that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points.  In this work, we study this problem in the more challenging non-convex setting. We introduce a structural characterization of clusters, called $(\beta,\gamma)$-convexity, that can be applied to any finite set of points equipped with a metric (or even a semimetric, as the triangle inequality is not needed).  Using $(\beta,\gamma)$-convexity, we can translate natural density properties of clusters (which include, for instance, clusters that are strongly non-convex in $\mathbb{R}^d$) into a graph-theoretic notion of convexity.  By exploiting this convexity notion, we design a deterministic algorithm that recovers $(\beta,\gamma)$-convex clusters using $O(k^2 \log n + k^2 (\frac{6}{\beta\gamma})^{dens(X)})$ same-cluster queries, where $k$ is the number of clusters and $dens(X)$ is the density dimension of the semimetric.  We show that an exponential dependence on the density dimension is necessary, and we also show that, if we are allowed to make $O(k^2 + k \log n)$ additional queries to a "cluster separation" oracle, then we can recover clusters that have different and arbitrary scales, even when the scale of each cluster is unknown.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bressan21a.html
  PDF: http://proceedings.mlr.press/v134/bressan21a/bressan21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bressan21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Marco
    family: Bressan
  - given: Nicoló
    family: Cesa-Bianchi
  - given: Silvio
    family: Lattanzi
  - given: Andrea
    family: Paudice
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 775-803
  id: bressan21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 775
  lastpage: 803
  published: 2021-07-21 00:00:00 +0000
- title: 'A Law of Robustness for Two-Layers Neural Networks'
  abstract: 'We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with $k$ neurons that perfectly fit the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$ where $n$ is the number of datapoints. In particular, this conjecture implies that overparametrization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure a $O(1)$-Lipschitz network, while mere data fitting of $d$-dimensional data requires only one neuron per $d$ datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime $n \approx d$ (which we also refer to as the undercomplete case, since only $k \leq d$ is relevant here). Finally we prove the conjecture for polynomial activation functions of degree $p$ when $n \approx d^p$. We complement these findings with experimental evidence supporting the conjecture.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bubeck21a.html
  PDF: http://proceedings.mlr.press/v134/bubeck21a/bubeck21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bubeck21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sebastien
    family: Bubeck
  - given: Yuanzhi
    family: Li
  - given: Dheeraj M
    family: Nagaraj
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 804-820
  id: bubeck21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 804
  lastpage: 820
  published: 2021-07-21 00:00:00 +0000
- title: 'Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions'
  abstract: 'We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no collisions at all) are achievable for any number of players and arms. At a high level, the previous strategy heavily relied on a 2-dimensional geometric intuition that was difficult to generalize in higher dimensions, while here we take a more combinatorial route to build the new strategy.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/bubeck21b.html
  PDF: http://proceedings.mlr.press/v134/bubeck21b/bubeck21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-bubeck21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sebastien
    family: Bubeck
  - given: Thomas
    family: Budzinski
  - given: Mark
    family: Sellke
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 821-822
  id: bubeck21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 821
  lastpage: 822
  published: 2021-07-21 00:00:00 +0000
- title: 'Fast Rates for Structured Prediction'
  abstract: 'Discrete supervised learning problems such as classification are often tackled by introducing a continuous surrogate problem akin to regression. Bounding the original error, between estimate and solution, by the surrogate error endows discrete problems with convergence rates already shown for continuous instances. Yet, current approaches do not leverage the fact that discrete problems are essentially predicting a discrete output when continuous problems are predicting a continuous value. In this paper, we tackle this issue for general structured prediction problems, opening the way to “super fast” rates, that is, convergence rates for the excess risk faster than $n^{-1}$, where $n$ is the number of observations, with even exponential rates with the strongest assumptions. We first illustrate it for predictors based on nearest neighbors, generalizing rates known for binary classification to any discrete problem within the framework of structured prediction. We then consider kernel ridge regression where we improve known rates in $n^{-1/4}$ to arbitrarily fast rates, depending on a parameter characterizing the hardness of the problem, thus allowing, under smoothness assumptions, to bypass the curse of dimensionality.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/cabannes21a.html
  PDF: http://proceedings.mlr.press/v134/cabannes21a/cabannes21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-cabannes21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vivien A
    family: Cabannes
  - given: Francis
    family: Bach
  - given: Alessandro
    family: Rudi
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 823-865
  id: cabannes21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 823
  lastpage: 865
  published: 2021-07-21 00:00:00 +0000
- title: 'Thinking Inside the Ball: Near-Optimal Minimization of the Maximal Loss'
  abstract: 'We characterize the complexity of minimizing $\max_{i\in[N]} f_i(x)$ for convex, Lipschitz functions $f_1,\ldots, f_N$. For non-smooth functions, existing methods require $O(N\epsilon^{-2})$ queries to a first-order oracle to compute an $\epsilon$-suboptimal point and $\widetilde{O}(N\epsilon^{-1})$ queries if the $f_i$ are $O(1/\epsilon)$-smooth.  We develop methods with improved complexity bounds of $\widetilde{O}(N\epsilon^{-2/3} + \epsilon^{-8/3})$ in the non-smooth case and $\widetilde{O}(N\epsilon^{-2/3} + \sqrt{N}\epsilon^{-1})$ in the $O(1/\epsilon)$-smooth case.  Our methods consist of a recently proposed ball optimization oracle acceleration algorithm (which we refine) and a careful implementation of said oracle for the softmax function.  We also prove an oracle complexity lower bound scaling as $\Omega(N\epsilon^{-2/3})$, showing that our dependence on $N$ is optimal up to polylogarithmic factors.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/carmon21a.html
  PDF: http://proceedings.mlr.press/v134/carmon21a/carmon21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-carmon21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yair
    family: Carmon
  - given: Arun
    family: Jambulapati
  - given: Yujia
    family: Jin
  - given: Aaron
    family: Sidford
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 866-882
  id: carmon21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 866
  lastpage: 882
  published: 2021-07-21 00:00:00 +0000
- title: 'Optimizing Optimizers: Regret-optimal gradient descent algorithms'
  abstract: 'This paper treats the task of designing optimization algorithms as an optimal control problem. Using regret as a metric for an algorithm’s performance, we study the existence, uniqueness and consistency of regret-optimal algorithms. By providing first-order optimality conditions for the control problem, we show that regret-optimal algorithms must satisfy a specific structure in their dynamics which we show is equivalent to performing \emph{dual-preconditioned gradient descent} on the value function generated by its regret. Using these optimal dynamics, we provide bounds on their rates of convergence to solutions of convex optimization problems. Though closed-form optimal dynamics cannot be obtained in general, we present fast numerical methods for approximating them, generating optimization algorithms which directly optimize their long-term regret. These are benchmarked against commonly used optimization algorithms to demonstrate their effectiveness.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/casgrain21a.html
  PDF: http://proceedings.mlr.press/v134/casgrain21a/casgrain21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-casgrain21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Philippe
    family: Casgrain
  - given: Anastasis
    family: Kratsios
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 883-926
  id: casgrain21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 883
  lastpage: 926
  published: 2021-07-21 00:00:00 +0000
- title: 'When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?'
  abstract: 'We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence.  The first is simply a bound on the loss at initialization.  The second is a data separation condition used in prior analyses.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chatterji21a.html
  PDF: http://proceedings.mlr.press/v134/chatterji21a/chatterji21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chatterji21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Niladri S.
    family: Chatterji
  - given: Philip M.
    family: Long
  - given: Peter
    family: Bartlett
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 927-1027
  id: chatterji21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 927
  lastpage: 1027
  published: 2021-07-21 00:00:00 +0000
- title: 'Breaking The Dimension Dependence in Sparse Distribution Estimation under Communication Constraints'
  abstract: 'We consider the problem of estimating a $d$-dimensional $s$-sparse discrete distribution from its samples observed under a $b$-bit communication constraint. The best-known previous result on $\ell_2$ estimation error for this problem is $O\left( \frac{s\log\left( {d}/{s}\right)}{n2^b}\right)$. Surprisingly, we show that when sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achieve an $\ell_2$ estimation error of $O\left( \frac{s}{n2^b}\right)$. This implies that when $n>n^*(s, d, b)$ the convergence rate does not depend on the ambient dimension $d$ and is the same as knowing the support of the distribution beforehand.
We next ask the question: ``what is the minimum $n^*(s, d, b)$ that allows dimension-free convergence?''.  To upper bound $n^*(s, d, b)$, we develop novel localization schemes to accurately and efficiently localize the unknown support. For the non-interactive setting, we show that $n^*(s, d, b) = O\left( \min \left( {d^2\log^2 d}/{2^b}, {s^4\log^2 d}/{2^b}\right) \right)$. Moreover,  we connect the problem with non-adaptive group testing and obtain a polynomial-time estimation scheme when $n = \tilde{\Omega}\left({s^4\log^4 d}/{2^b}\right)$. This group testing based scheme is adaptive to the sparsity parameter $s$, and hence can be applied without knowing it. For the interactive setting, we propose a novel tree-based estimation scheme and show that the minimum sample-size needed to achieve dimension-free convergence can be further reduced to $n^*(s, d, b) = \tilde{O}\left( {s^2\log^2 d}/{2^b} \right)$.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chen21a.html
  PDF: http://proceedings.mlr.press/v134/chen21a/chen21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chen21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Wei-Ning
    family: Chen
  - given: Peter
    family: Kairouz
  - given: Ayfer
    family: Ozgur
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1028-1059
  id: chen21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1028
  lastpage: 1059
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning and testing junta distributions with subcube conditioning'
  abstract: 'We study the problems of learning and testing junta distributions on $\{-1,1\}^n$ with respect to the uniform distribution, where a distribution $p$ is a $k$-junta if its probability mass function $p(x)$  depends on a subset of at most $k$ variables. The main contribution is an algorithm for finding relevant coordinates in a $k$-junta distribution with subcube conditioning (Bhattacharyya et al. 2018, Canonne et al. 2019). We give two applications: An algorithm for learning $k$-junta distributions with $\tilde{O}(k/\epsilon^2) \log n + O(2^k/\epsilon^2)$ subcube conditioning queries, and  an algorithm for testing $k$-junta distributions with $\tilde{O}((k + \sqrt{n})/\epsilon^2)$ subcube conditioning queries. All our algorithms are optimal up to poly-logarithmic factors.
Our results show that subcube conditioning, as a natural model for accessing high-dimensional distributions, enables significant savings in learning and testing junta distributions compared to the standard sampling model. This addresses an open question posed by Aliakbarpour et al. 2016.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chen21b.html
  PDF: http://proceedings.mlr.press/v134/chen21b/chen21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chen21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xi
    family: Chen
  - given: Rajesh
    family: Jayaram
  - given: Amit
    family: Levi
  - given: Erik
    family: Waingarten
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1060-1113
  id: chen21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1060
  lastpage: 1113
  published: 2021-07-21 00:00:00 +0000
- title: 'Black-Box Control for Linear Dynamical Systems'
  abstract: 'We consider the problem of black-box control: the task of controlling an unknown linear time-invariant dynamical system from a single trajectory without a stabilizing controller. Under the assumption that the system is controllable, we give the first {\it efficient} algorithm that is capable of attaining sublinear regret under the setting of online nonstochastic control. This resolves an open problem since the work of Abbasi-Yadkori and Szepesvari (2011) on the stochastic LQR problem, and in a more challenging setting that allows for adversarial perturbations and adversarially chosen changing convex loss functions. We give finite-time regret bounds for our algorithm on the order of $2^{poly(d)} + \tilde{O}(poly(d) T^{2/3})$ for general nonstochastic control, and $2^{poly(d)} + \tilde{O}(poly(d) \sqrt{T})$ for black-box LQR. To complete the picture, we investigate the complexity of the online black-box control problem and give a matching regret lower bound of $2^{\Omega(d)}$, showing that the exponential cost is inevitable. This lower bound holds even in the noiseless setting, and applies to any black-box control method, randomized or deterministic.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chen21c.html
  PDF: http://proceedings.mlr.press/v134/chen21c/chen21c.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chen21c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xinyi
    family: Chen
  - given: Elad
    family: Hazan
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1114-1143
  id: chen21c
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1114
  lastpage: 1143
  published: 2021-07-21 00:00:00 +0000
- title: 'Query complexity of least absolute deviation regression via robust uniform convergence'
  abstract: 'Consider a regression problem where the learner is given a large collection of $d$-dimensional data points, but can only query a small subset of the real-valued labels. How many queries are needed to obtain a $1+\epsilon$ relative error approximation of the optimum? While this problem has been extensively studied for least squares regression, little is known for other losses. An important example is least absolute deviation regression ($\ell_1$ regression) which enjoys superior robustness to outliers compared to least squares. We develop a new framework for analyzing importance sampling methods in regression problems, which enables us to show that the query complexity of least absolute deviation regression is $\Theta(d/\epsilon^2)$ up to logarithmic factors. We further extend our techniques to show the first bounds on the query complexity for any $\ell_p$ loss with $p\in(1,2)$. As a key novelty in our analysis, we introduce the notion of robust uniform convergence, which is a new approximation guarantee for the empirical loss. While it is inspired by uniform convergence in statistical learning, our approach additionally incorporates a correction term to avoid unnecessary variance due to outliers. This can be viewed as a new connection between statistical learning theory and variance reduction techniques in stochastic optimization, which should be of independent interest.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chen21d.html
  PDF: http://proceedings.mlr.press/v134/chen21d/chen21d.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chen21d.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xue
    family: Chen
  - given: Michal
    family: Derezinski
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1144-1179
  id: chen21d
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1144
  lastpage: 1179
  published: 2021-07-21 00:00:00 +0000
- title: 'Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition'
  abstract: 'We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $O(\sqrt{DT_\star K})$ and $O(\sqrt{DT_\star SA K})$ for the full-information setting and the bandit feedback setting, respectively, where $D$ is the diameter, $T_\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the recent work of Rosenberg and Mansour (2020), which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chen21e.html
  PDF: http://proceedings.mlr.press/v134/chen21e/chen21e.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chen21e.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Liyu
    family: Chen
  - given: Haipeng
    family: Luo
  - given: Chen-Yu
    family: Wei
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1180-1215
  id: chen21e
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1180
  lastpage: 1215
  published: 2021-07-21 00:00:00 +0000
- title: 'Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications'
  abstract: 'We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t,i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t,i}$ is the loss for expert $i$ in round $t$. Our algorithm is based on the Mirror Descent framework with a correction term and a weighted entropy regularizer. While natural, the algorithm has not been studied before and requires a careful analysis. We also generalize the bound to $O\left(\sqrt{(\ln d)\sum_t (\ell_{t,i}-m_{t,i})^2}\right)$ for any prediction vector $m_t$ that the learner receives, and recover or improve many existing results by choosing different $m_t$. Furthermore, we use the same framework to create a master algorithm that combines a set of base algorithms and learns the best one with little overhead. The new guarantee of our master algorithm allows us to derive many new results for both the expert problem and, more generally, Online Linear Optimization.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chen21f.html
  PDF: http://proceedings.mlr.press/v134/chen21f/chen21f.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chen21f.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Liyu
    family: Chen
  - given: Haipeng
    family: Luo
  - given: Chen-Yu
    family: Wei
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1216-1259
  id: chen21f
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1216
  lastpage: 1259
  published: 2021-07-21 00:00:00 +0000
- title: 'Optimal dimension dependence of the Metropolis-Adjusted Langevin Algorithm'
  abstract: 'Conventional wisdom in the sampling literature, backed by a popular diffusion scaling limit, suggests that the mixing time of the Metropolis-Adjusted Langevin Algorithm (MALA) scales as $O(d^{1/3})$, where $d$ is the dimension. However, the diffusion scaling limit requires stringent assumptions on the target distribution and is asymptotic in nature. In contrast, the best known non-asymptotic mixing time bound for MALA on the class of log-smooth and strongly log-concave distributions is $O(d)$. In this work, we establish that the mixing time of MALA on this class of target distributions is $\tilde\Theta(d^{1/2})$ under a warm start. Our upper bound proof introduces a new technique based on a projection characterization of the Metropolis adjustment which reduces the study of MALA to the well-studied discretization analysis of the Langevin SDE and bypasses direct computation of the acceptance probability.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/chewi21a.html
  PDF: http://proceedings.mlr.press/v134/chewi21a/chewi21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-chewi21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sinho
    family: Chewi
  - given: Chen
    family: Lu
  - given: Kwangjun
    family: Ahn
  - given: Xiang
    family: Cheng
  - given: Thibaut
    family: Le Gouic
  - given: Philippe
    family: Rigollet
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1260-1300
  id: chewi21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1260
  lastpage: 1300
  published: 2021-07-21 00:00:00 +0000
- title: 'Online Markov Decision Processes with Aggregate Bandit Feedback'
  abstract: 'We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics. In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback: the trajectory is revealed along with the cumulative loss suffered, rather than the individual losses encountered along the trajectory. Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes. We establish this result via an efficient reduction to a novel bandit learning setting we call Distorted Linear Bandits (DLB), which is a variant of bandit linear optimization where actions chosen by the learner are adversarially distorted before they are committed. We then develop a computationally efficient online algorithm for DLB for which we prove an $O(\sqrt{T})$ regret bound, where $T$ is the number of time steps. Our algorithm is based on online mirror descent with a self-concordant barrier regularization that employs a novel increasing learning rate schedule.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/cohen21a.html
  PDF: http://proceedings.mlr.press/v134/cohen21a/cohen21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-cohen21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alon
    family: Cohen
  - given: Haim
    family: Kaplan
  - given: Tomer
    family: Koren
  - given: Yishay
    family: Mansour
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1301-1329
  id: cohen21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1301
  lastpage: 1329
  published: 2021-07-21 00:00:00 +0000
- title: 'Quantifying Variational Approximation for Log-Partition Function'
  abstract: 'Variational methods, such as mean-field (MF) and tree-reweighted (TRW), provide computationally efficient approximations of the log-partition function for generic graphical models but their approximation ratio is generally not quantified. As the primary contribution of this work, we provide an approach to quantify their approximation ratio for any discrete pairwise graphical model with non-negative potentials through a property of the underlying graph structure $G$. Specifically, we argue that (a variant of) TRW produces an estimate within factor $1/\sqrt{\kappa(G)}$ where $\kappa(G) \in (0,1]$ captures how far $G$ is from tree structure. As a consequence, the approximation ratio is $1$ for trees, $\sqrt{(d+1)/2}$ for graphs with maximum average degree $d$ and $1+1/(2\beta)+o_{\beta\to \infty}(1/\beta)$ for graphs with girth at least $\beta \log N$. The quantity $\kappa(G)$ is the solution of a min-max problem associated with the spanning tree polytope of $G$ that can be evaluated in polynomial time for any graph. We provide a near linear-time variant that achieves an approximation ratio depending on the minimal (across edges) effective resistance of the graph. We connect our results to the graph partition approximation method and thus provide a unified perspective.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/cosson21a.html
  PDF: http://proceedings.mlr.press/v134/cosson21a/cosson21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-cosson21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Romain
    family: Cosson
  - given: Devavrat
    family: Shah
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1330-1357
  id: cosson21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1330
  lastpage: 1357
  published: 2021-07-21 00:00:00 +0000
- title: 'From Local Pseudorandom Generators to Hardness of Learning'
  abstract: 'We prove hardness-of-learning results under a well-studied assumption on the existence of local pseudorandom generators. As we show, this assumption allows us to surpass the current state of the art and prove hardness of various basic problems for which no hardness results were previously known. Our results include: hardness of learning shallow ReLU neural networks under the Gaussian distribution and other distributions; hardness of learning intersections of $\omega(1)$ halfspaces, DNF formulas with $\omega(1)$ terms, and ReLU networks with $\omega(1)$ hidden neurons; hardness of weakly learning deterministic finite automata under the uniform distribution; hardness of weakly learning depth-$3$ Boolean circuits under the uniform distribution, as well as distribution-specific hardness results for learning DNF formulas and intersections of halfspaces. We also establish lower bounds on the complexity of learning intersections of a constant number of halfspaces, and ReLU networks with a constant number of hidden neurons. Moreover, our results imply the hardness of virtually all improper PAC-learning problems (both distribution-free and distribution-specific) that were previously shown hard under other assumptions.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/daniely21a.html
  PDF: http://proceedings.mlr.press/v134/daniely21a/daniely21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-daniely21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Amit
    family: Daniely
  - given: Gal
    family: Vardi
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1358-1394
  id: daniely21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1358
  lastpage: 1394
  published: 2021-07-21 00:00:00 +0000
- title: 'A Statistical Taylor Theorem and Extrapolation of Truncated Densities'
  abstract: 'We show a statistical version of Taylor’s theorem and apply this result to non-parametric density estimation from truncated samples, which is a classical challenge in Statistics [Woodroofe 1985, Stute 1993]. The single-dimensional version of our theorem has the following implication: "For any distribution $P$ on $[0, 1]$ with a smooth log-density function, given samples from the conditional distribution of $P$ on $[a, a + \varepsilon] \subset [0, 1]$, we can efficiently identify an approximation to $P$ over the whole interval $[0, 1]$, with quality of approximation that improves with the smoothness of $P$". To the best of our knowledge, our result is the first in the area of non-parametric density estimation from truncated samples which works under the hard truncation model, where the samples outside some survival set $S$ are never observed, and applies to multiple dimensions. In contrast, previous works assume single-dimensional data where each sample has a different survival set $S$, so that samples from the whole support will ultimately be collected.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/daskalakis21a.html
  PDF: http://proceedings.mlr.press/v134/daskalakis21a/daskalakis21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-daskalakis21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Constantinos
    family: Daskalakis
  - given: Vasilis
    family: Kontonis
  - given: Christos
    family: Tzamos
  - given: Emmanouil
    family: Zampetakis
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1395-1398
  id: daskalakis21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1395
  lastpage: 1398
  published: 2021-07-21 00:00:00 +0000
- title: 'Weak learning convex sets under normal distributions'
  abstract: 'This paper addresses the following natural question: can efficient algorithms weakly learn convex sets under normal distributions? Strong learnability of convex sets under normal distributions is well understood, with near-matching upper and lower bounds given by Klivans et al (2008), but prior to the current work nothing seems to have been known about weak learning. We essentially answer this question, giving near-matching algorithms and lower bounds.  For our positive result, we give a poly(n)-time algorithm that can weakly learn the class of convex sets to advantage $\Omega(1/\sqrt{n})$ using only random examples drawn from the background Gaussian distribution. Our algorithm and analysis are based on a new “density increment” result for convex sets, which we prove using tools from isoperimetry.  We also give an information-theoretic lower bound showing that $O(\log(n)/\sqrt{n})$ advantage is best possible even for algorithms that are allowed to make poly(n) many membership queries.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/de21a.html
  PDF: http://proceedings.mlr.press/v134/de21a/de21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-de21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Anindya
    family: De
  - given: Rocco
    family: Servedio
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1399-1428
  id: de21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1399
  lastpage: 1428
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning sparse mixtures of permutations from noisy information'
  abstract: 'We study the problem of learning an unknown mixture of $k$ permutations over $n$ elements, given access to noisy samples drawn from the unknown mixture. We consider a range of different noise models, including natural variants of the “heat kernel” noise framework and the Mallows model. We give an algorithm which, for each of these noise models, learns the unknown mixture to high accuracy under mild assumptions and runs in $n^{O(\log k)}$ time. Our approach is based on a new procedure that recovers an unknown mixture of permutations from noisy higher-order marginals.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/de21b.html
  PDF: http://proceedings.mlr.press/v134/de21b/de21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-de21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Anindya
    family: De
  - given: Ryan
    family: O’Donnell
  - given: Rocco
    family: Servedio
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1429-1466
  id: de21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1429
  lastpage: 1466
  published: 2021-07-21 00:00:00 +0000
- title: 'Sparse sketches with small inversion bias'
  abstract: 'For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, e.g., in statistics and distributed optimization, when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, then after simple rescaling, the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error obtained as a consequence of the subspace embedding guarantee for sub-gaussian sketches. We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which uses ideas from both data-oblivious sparse embeddings as well as data-aware leverage-based row sampling methods, to get $\epsilon$ inversion bias for sketch size $m=O(d\log d+\sqrt d/\epsilon)$ in time $O(\text{nnz}(A)\log n+md^2)$, where nnz is the number of non-zeros. The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality; and anti-concentration of the Binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/derezinski21a.html
  PDF: http://proceedings.mlr.press/v134/derezinski21a/derezinski21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-derezinski21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Michal
    family: Derezinski
  - given: Zhenyu
    family: Liao
  - given: Edgar
    family: Dobriban
  - given: Michael
    family: Mahoney
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1467-1510
  id: derezinski21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1467
  lastpage: 1510
  published: 2021-07-21 00:00:00 +0000
- title: 'The Sample Complexity of Robust Covariance Testing'
  abstract: 'We study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber’s contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance $\Sigma$, $B$ is a fixed but unknown noise distribution, and $\epsilon>0$ is an arbitrarily small constant representing the proportion of contamination. We want to distinguish between the cases that $\Sigma$ is the identity matrix versus $\gamma$-far from the identity in Frobenius norm. In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. Moreover, this sample upper bound was shown to be best possible, within constant factors. Our main result is that the sample complexity of covariance testing dramatically increases in the contaminated setting. In particular, we prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma = 1/2$. This lower bound is best possible, as $O(d^2)$ samples suffice to even robustly {\em learn} the covariance. The conceptual implication of our result is that, for the natural setting we consider, robust hypothesis testing is at least as hard as robust estimation.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/diakonikolas21a.html
  PDF: http://proceedings.mlr.press/v134/diakonikolas21a/diakonikolas21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-diakonikolas21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Daniel M.
    family: Kane
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1511-1521
  id: diakonikolas21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1511
  lastpage: 1521
  published: 2021-07-21 00:00:00 +0000
- title: 'Agnostic Proper Learning of Halfspaces under Gaussian Marginals'
  abstract: 'We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose running time qualitatively matches that of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other activation functions, yielding the first proper agnostic algorithm for ReLU regression.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/diakonikolas21b.html
  PDF: http://proceedings.mlr.press/v134/diakonikolas21b/diakonikolas21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-diakonikolas21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Daniel M.
    family: Kane
  - given: Vasilis
    family: Kontonis
  - given: Christos
    family: Tzamos
  - given: Nikos
    family: Zarifis
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1522-1551
  id: diakonikolas21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1522
  lastpage: 1551
  published: 2021-07-21 00:00:00 +0000
- title: 'The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals in the SQ Model'
  abstract: 'We study the problem of agnostic learning under the Gaussian distribution in the Statistical Query (SQ) model.  We develop a method for finding hard families of examples for a wide range of concept classes by using LP duality.  For Boolean-valued concept classes, we show that the $L^1$-polynomial regression algorithm is essentially best possible among SQ algorithms, and therefore that the SQ complexity of agnostic learning is closely related to the polynomial degree required to approximate any function from the concept class in $L^1$-norm. Using this characterization along with additional analytic tools, we obtain explicit optimal SQ lower bounds for agnostically learning linear threshold functions and the first non-trivial explicit SQ lower bounds for polynomial threshold functions and intersections of halfspaces. We also develop an analogous theory for agnostically learning real-valued functions, and as an application prove near-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/diakonikolas21c.html
  PDF: http://proceedings.mlr.press/v134/diakonikolas21c/diakonikolas21c.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-diakonikolas21c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Daniel M.
    family: Kane
  - given: Thanasis
    family: Pittas
  - given: Nikos
    family: Zarifis
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1552-1584
  id: diakonikolas21c
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1552
  lastpage: 1584
  published: 2021-07-21 00:00:00 +0000
- title: 'Boosting in the Presence of Massart Noise'
  abstract: 'We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $\eta(x) \leq \eta$, where $\eta<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $\eta$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/diakonikolas21d.html
  PDF: http://proceedings.mlr.press/v134/diakonikolas21d/diakonikolas21d.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-diakonikolas21d.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Russell
    family: Impagliazzo
  - given: Daniel M.
    family: Kane
  - given: Rex
    family: Lei
  - given: Jessica
    family: Sorrell
  - given: Christos
    family: Tzamos
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1585-1644
  id: diakonikolas21d
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1585
  lastpage: 1644
  published: 2021-07-21 00:00:00 +0000
- title: 'Outlier-Robust Learning of Ising Models Under Dobrushin’s Condition'
  abstract: 'We study the problem of learning Ising models satisfying Dobrushin’s condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-2 polynomials of Ising models that may be of independent interest.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/diakonikolas21e.html
  PDF: http://proceedings.mlr.press/v134/diakonikolas21e/diakonikolas21e.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-diakonikolas21e.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Daniel M.
    family: Kane
  - given: Alistair
    family: Stewart
  - given: Yuxin
    family: Sun
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1645-1682
  id: diakonikolas21e
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1645
  lastpage: 1682
  published: 2021-07-21 00:00:00 +0000
- title: 'Random Coordinate Langevin Monte Carlo'
  abstract: 'Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires the computation of the full gradient at each iteration, an expensive operation if the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected to be updated by a multiple of the partial derivative along this direction plus noise, while all other coordinates remain untouched. We investigate the total complexity of RC-LMC and compare it with the classical LMC for log-concave probability distributions. We show that when the gradient of the log-density is Lipschitz, RC-LMC is less expensive than the classical LMC if the log-density is highly skewed for high dimensional problems. Further, when both the gradient and the Hessian of the log-density are Lipschitz, RC-LMC is always cheaper than the classical LMC, by a factor proportional to the square root of the problem dimension. In the latter case, we use an example to demonstrate that our estimate of complexity is sharp with respect to the dimension.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/ding21a.html
  PDF: http://proceedings.mlr.press/v134/ding21a/ding21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-ding21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zhiyan
    family: Ding
  - given: Qin
    family: Li
  - given: Jianfeng
    family: Lu
  - given: Stephen J
    family: Wright
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1683-1710
  id: ding21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1683
  lastpage: 1710
  published: 2021-07-21 00:00:00 +0000
- title: 'On the Stability of Random Matrix Product with Markovian Noise: Application to Linear Stochastic Approximation and TD Learning'
  abstract: 'This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. Such stability results are a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online learning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrix-valued functions and uniform ergodicity of the Markov chains.  Our main contribution is an exponential stability result for the p-th moment of random matrix products, provided that (i) the underlying Markov chain satisfies a super-Lyapunov drift condition, (ii) the growth of the matrix-valued functions is controlled by an appropriately defined function (related to the drift condition). Using this result, we give finite-time p-th moment bounds for constant and decreasing stepsize linear stochastic approximation schemes with Markovian noise on general state space. We illustrate these findings for linear value-function estimation in reinforcement learning. We provide finite-time p-th moment bounds for various members of the temporal difference (TD) family of algorithms.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/durmus21a.html
  PDF: http://proceedings.mlr.press/v134/durmus21a/durmus21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-durmus21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alain
    family: Durmus
  - given: Eric
    family: Moulines
  - given: Alexey
    family: Naumov
  - given: Sergey
    family: Samsonov
  - given: Hoi-To
    family: Wai
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1711-1752
  id: durmus21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1711
  lastpage: 1752
  published: 2021-07-21 00:00:00 +0000
- title: 'Kernel Thinning'
  abstract: 'We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$.  In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error.  Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/dwivedi21a.html
  PDF: http://proceedings.mlr.press/v134/dwivedi21a/dwivedi21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-dwivedi21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Raaz
    family: Dwivedi
  - given: Lester
    family: Mackey
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1753-1753
  id: dwivedi21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1753
  lastpage: 1753
  published: 2021-07-21 00:00:00 +0000
- title: 'Non-asymptotic approximations of neural networks by Gaussian processes'
  abstract: 'We study the extent to which wide neural networks may be approximated by Gaussian processes, when initialized with random weights. It is a well-established fact that as the width of a network goes to infinity, its law converges to that of a Gaussian process. We make this quantitative by establishing explicit convergence rates for the central limit theorem in an infinite-dimensional functional space, metrized with a natural transportation distance. We identify two regimes of interest: when the activation function is polynomial, its degree determines the rate of convergence, while for non-polynomial activations, the rate is governed by the smoothness of the function.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/eldan21a.html
  PDF: http://proceedings.mlr.press/v134/eldan21a/eldan21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-eldan21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ronen
    family: Eldan
  - given: Dan
    family: Mikulincer
  - given: Tselil
    family: Schramm
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1754-1775
  id: eldan21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1754
  lastpage: 1775
  published: 2021-07-21 00:00:00 +0000
- title: 'On the Convergence of Langevin Monte Carlo: The Interplay between Tail Growth and Smoothness'
  abstract: 'We study sampling from a target distribution $\nu_* = e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm.  For any potential function $f$ whose tails behave like $\|x\|^\alpha$ for ${\alpha \in [1,2]}$, and has $\beta$-Hölder continuous gradient, we prove that $\widetilde{\mathcal{O}} \Big(d^{\frac{1}{\beta}+\frac{1+\beta}{\beta}(\frac{2}{\alpha}-{1}_{\{\alpha \neq 1\}})} \epsilon^{-\frac{1}{\beta}}\Big)$ steps are sufficient to reach the $\epsilon$-neighborhood of a $d$-dimensional target distribution $\nu_*$ in KL-divergence.  This bound, in terms of $\epsilon$ dependency, is not directly influenced by the tail growth rate $\alpha$ of the potential function as long as its growth is at least linear, and it only relies on the order of smoothness $\beta$.  One notable consequence of this result is that for potentials with Lipschitz gradient, i.e. $\beta=1$, the above rate recovers the best known rate $\widetilde{\mathcal{O}} (d\epsilon^{-1})$ which was established for strongly convex potentials in terms of $\epsilon$ dependency, but we show that the same rate is achievable for a wider class of potentials that are degenerately convex at infinity.  The growth rate $\alpha$ affects the rate estimate in high dimensions where $d$ is large; furthermore, it recovers the best-known dimension dependency when the tail growth of the potential is quadratic, i.e. $\alpha = 2$, in the current setup.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/erdogdu21a.html
  PDF: http://proceedings.mlr.press/v134/erdogdu21a/erdogdu21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-erdogdu21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Murat A
    family: Erdogdu
  - given: Rasa
    family: Hosseinzadeh
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1776-1822
  id: erdogdu21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1776
  lastpage: 1822
  published: 2021-07-21 00:00:00 +0000
- title: 'Adaptivity in Adaptive Submodularity'
  abstract: 'Adaptive sequential decision making is one of the central challenges in machine learning and artificial intelligence. In such problems, the goal is to design an interactive policy that plans for an action to take, from a finite set of $n$ actions, given some partial observations. It has been shown that in many applications such as active learning, robotics, sequential experimental design, and active detection, the utility function satisfies adaptive submodularity, a notion that generalizes the notion of diminishing returns to policies.  In this paper, we revisit the power of adaptivity in maximizing an adaptive monotone submodular function. We propose an efficient semi-adaptive policy that with $O(\log n \times\log k)$ adaptive rounds of observations can achieve an almost tight $1-1/e-\epsilon$ approximation guarantee with respect to an optimal policy that carries out $k$ actions in a fully sequential manner. To complement our results, we also show that it is impossible to achieve a constant factor approximation with $o(\log n)$ adaptive rounds. We also extend our result to the case of adaptive stochastic minimum cost coverage where the goal is to reach a desired utility $Q$ with the cheapest policy. We first prove the long-standing conjecture by Golovin and Krause and show that the greedy policy achieves the asymptotically tight logarithmic approximation guarantee. We then propose a semi-adaptive policy that provides the same guarantee in polylogarithmic adaptive rounds through a similar information-parallelism scheme. Our results shrink the adaptivity gap in adaptive submodular maximization by an exponential factor.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/esfandiari21a.html
  PDF: http://proceedings.mlr.press/v134/esfandiari21a/esfandiari21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-esfandiari21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hossein
    family: Esfandiari
  - given: Amin
    family: Karbasi
  - given: Vahab
    family: Mirrokni
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1823-1846
  id: esfandiari21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1823
  lastpage: 1846
  published: 2021-07-21 00:00:00 +0000
- title: 'Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization'
  abstract: 'Dimension is an inherent bottleneck to some modern learning tasks, where optimization methods suffer from the size of the data. In this paper, we study non-isotropic distributions of data and develop tools that aim at reducing these dimensional costs by a dependency on an effective dimension rather than the ambient one.  Based on non-asymptotic estimates of the metric entropy of ellipsoids (which prove to generalize to infinite dimensions) and on a chaining argument, our uniform concentration bounds involve an effective dimension instead of the global dimension, improving over existing results.  We show the importance of taking advantage of non-isotropic properties in learning problems with the following applications: i) we improve state-of-the-art results in statistical preconditioning for communication-efficient distributed optimization, ii) we introduce a non-isotropic randomized smoothing for non-smooth optimization. Both applications cover a class of functions that encompasses empirical risk minimization (ERM) for linear models.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/even21a.html
  PDF: http://proceedings.mlr.press/v134/even21a/even21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-even21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mathieu
    family: Even
  - given: Laurent
    family: Massoulie
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1847-1886
  id: even21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1847
  lastpage: 1886
  published: 2021-07-21 00:00:00 +0000
- title: 'Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks'
  abstract: 'This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs), which can be used to analyze neural network training.  In this framework, a DNN is represented by probability measures and functions over its features (that is, the function values of the hidden units over the training data) in the continuous limit, instead of the neural network parameters as most existing studies have done. This new representation overcomes the degenerate situation where all the hidden units essentially have only one meaningful hidden unit in each middle layer, leading to a simpler representation of DNNs. Moreover, we construct a non-linear dynamics called neural feature flow, which captures the evolution of an over-parameterized DNN trained by Gradient Descent. We illustrate the framework via the Residual Network (Res-Net) architecture.  It is shown that when the neural feature flow process converges, it reaches a global minimal solution under suitable conditions.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/fang21a.html
  PDF: http://proceedings.mlr.press/v134/fang21a/fang21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-fang21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cong
    family: Fang
  - given: Jason
    family: Lee
  - given: Pengkun
    family: Yang
  - given: Tong
    family: Zhang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1887-1936
  id: fang21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1887
  lastpage: 1936
  published: 2021-07-21 00:00:00 +0000
- title: 'Sequential prediction under log-loss and misspecification'
  abstract: 'We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, the learner sequentially predicts the (distribution of the) next letter in a sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption that the data-generating distribution belongs to the hypothesis class as well. Here we present results in the more general misspecified case. Due to special properties of the log-loss, the same problem arises in the context of competitive-optimality in density estimation, and model selection.  For the $d$-dimensional Gaussian location hypothesis class, we show that cumulative regrets in the well-specified and misspecified cases asymptotically coincide. In other words, we provide an $o(1)$ characterization of the distribution-free (or PAC) regret in this case – the first such result as far as we know. We recall that the worst-case (or individual-sequence) regret in this case is larger by an additive constant ${d\over 2} + o(1)$.  Surprisingly, neither the traditional Bayesian estimators, nor the Shtarkov’s normalized maximum likelihood achieve the PAC regret and our estimator requires special “robustification” against heavy-tailed data.  In addition, we show two general results for misspecified regret: the existence and uniqueness of the optimal estimator, and the bound sandwiching the misspecified regret between well-specified regrets with (asymptotically) close hypothesis classes.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/feder21a.html
  PDF: http://proceedings.mlr.press/v134/feder21a/feder21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-feder21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Meir
    family: Feder
  - given: Yury
    family: Polyanskiy
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1937-1964
  id: feder21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1937
  lastpage: 1964
  published: 2021-07-21 00:00:00 +0000
- title: 'Convergence rates and approximation results for SGD and its continuous-time counterpart'
  abstract: 'This paper proposes a thorough theoretical analysis of Stochastic Gradient Descent (SGD) with non-increasing step sizes.  First, we show that the recursion defining SGD can be provably approximated by solutions of a time inhomogeneous Stochastic Differential Equation (SDE) using an appropriate coupling. In the specific case of a batch noise we refine our results using recent advances in Stein’s method. Then, motivated by recent analyses of deterministic and stochastic optimization methods by their continuous counterpart, we study the long-time behavior of the continuous processes at hand and establish non-asymptotic bounds. To that purpose, we develop new comparison techniques which are of independent interest. Adapting these techniques to the discrete setting, we show that the same results hold for the corresponding SGD sequences.  In our analysis, we notably improve non-asymptotic bounds in the convex setting for SGD under weaker assumptions than the ones considered in previous works. Finally, we also establish finite-time convergence results under various conditions, including relaxations of the famous Łojasiewicz inequality, which can be applied to a class of non-convex functions.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/fontaine21a.html
  PDF: http://proceedings.mlr.press/v134/fontaine21a/fontaine21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-fontaine21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xavier
    family: Fontaine
  - given: Valentin De
    family: Bortoli
  - given: Alain
    family: Durmus
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 1965-2058
  id: fontaine21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 1965
  lastpage: 2058
  published: 2021-07-21 00:00:00 +0000
- title: 'Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective'
  abstract: 'In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems.  Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/foster21a.html
  PDF: http://proceedings.mlr.press/v134/foster21a/foster21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-foster21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dylan
    family: Foster
  - given: Alexander
    family: Rakhlin
  - given: David
    family: Simchi-Levi
  - given: Yunzong
    family: Xu
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2059-2059
  id: foster21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2059
  lastpage: 2059
  published: 2021-07-21 00:00:00 +0000
- title: 'Efficient Algorithms for Learning from Coarse Labels'
  abstract: 'For many learning problems one may not have access to fine-grained label information; e.g., an image can be labeled as husky, dog, or even animal depending on the expertise of the annotator.  In this work, we formalize these settings and study the problem of learning from such coarse data.  Instead of observing the actual labels from a set $\mathcal{Z}$, we observe coarse labels corresponding to a partition of $\mathcal{Z}$ (or a mixture of partitions).  Our main algorithmic result is that essentially any problem learnable from fine-grained labels can also be learned efficiently when the coarse data are sufficiently informative.  We obtain our result through a generic reduction for answering Statistical Queries (SQ) over fine-grained labels given only coarse labels.  The number of coarse labels required depends polynomially on the information distortion due to coarsening and the number of fine labels $|\mathcal{Z}|$.  We also investigate the case of (infinitely many) real-valued labels focusing on a central problem in censored and truncated statistics: Gaussian mean estimation from coarse data.  We provide an efficient algorithm when the sets in the partition are convex and establish that the problem is NP-hard even for very simple non-convex sets.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/fotakis21a.html
  PDF: http://proceedings.mlr.press/v134/fotakis21a/fotakis21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-fotakis21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dimitris
    family: Fotakis
  - given: Alkis
    family: Kalavasis
  - given: Vasilis
    family: Kontonis
  - given: Christos
    family: Tzamos
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2060-2079
  id: fotakis21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2060
  lastpage: 2079
  published: 2021-07-21 00:00:00 +0000
- title: 'Impossibility of Partial Recovery in the Graph Alignment Problem'
  abstract: 'Random graph alignment refers to recovering the underlying vertex correspondence between two random graphs with correlated edges. This can be viewed as an average-case and noisy version of the well-known graph isomorphism problem. For the correlated Erdős-Rényi model, we prove the first impossibility result for partial recovery in the sparse regime (with constant average degree). Our bound is tight in the noiseless case (the graph isomorphism problem) and we conjecture that it is still tight with noise. Our proof technique relies on a careful application of the probabilistic method to build automorphisms between tree components of a subcritical Erdős-Rényi graph.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/ganassali21a.html
  PDF: http://proceedings.mlr.press/v134/ganassali21a/ganassali21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-ganassali21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Luca
    family: Ganassali
  - given: Laurent
    family: Massoulie
  - given: Marc
    family: Lelarge
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2080-2102
  id: ganassali21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2080
  lastpage: 2102
  published: 2021-07-21 00:00:00 +0000
- title: 'Frank-Wolfe with a Nearest Extreme Point Oracle'
  abstract: 'We consider variants of the classical Frank-Wolfe algorithm for constrained smooth convex minimization, that instead of access to the standard oracle for minimizing a linear function over the feasible set, have access to an oracle that can find an extreme point of the feasible set that is closest in Euclidean distance to a given vector. We first show that for many feasible sets of interest, such an oracle can be implemented with the same complexity as the standard linear optimization oracle. We then show that with such an oracle we can design new Frank-Wolfe variants which enjoy significantly improved complexity bounds in case the set of optimal solutions lies in the convex hull of a subset of extreme points with small diameter (e.g., a low-dimensional face of a polytope). In particular, for many $0\text{–}1$ polytopes, under quadratic growth and strict complementarity conditions, we obtain the first linearly convergent variant with rate that depends only on the dimension of the optimal face and not on the ambient dimension.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/garber21a.html
  PDF: http://proceedings.mlr.press/v134/garber21a/garber21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-garber21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dan
    family: Garber
  - given: Noam
    family: Wolf
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2103-2132
  id: garber21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2103
  lastpage: 2132
  published: 2021-07-21 00:00:00 +0000
- title: 'On Avoiding the Union Bound When Answering Multiple Differentially Private Queries'
  abstract: 'In this work, we study the problem of answering $k$ queries with $(\epsilon, \delta)$-differential privacy, where each query has sensitivity one. We give an algorithm for this task that achieves an expected $\ell_\infty$ error bound of $O(\frac{1}{\epsilon}\sqrt{k \log \frac{1}{\delta}})$, which is known to be tight (Steinke and Ullman, 2016).  A very recent work by Dagan and Kur (2020) provides a similar result, albeit via a completely different approach. One difference between our work and theirs is that our guarantee holds even when $\delta < 2^{-\Omega(k/(\log k)^8)}$ whereas theirs does not apply in this case. On the other hand, the algorithm of Dagan and Kur (2020) has a remarkable advantage that the $\ell_{\infty}$ error bound of $O(\frac{1}{\epsilon}\sqrt{k \log \frac{1}{\delta}})$ holds not only in expectation but always (i.e., with probability one) while we can only get a high probability (or expected) guarantee on the error.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/ghazi21a.html
  PDF: http://proceedings.mlr.press/v134/ghazi21a/ghazi21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-ghazi21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Badih
    family: Ghazi
  - given: Ravi
    family: Kumar
  - given: Pasin
    family: Manurangsi
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2133-2146
  id: ghazi21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2133
  lastpage: 2146
  published: 2021-07-21 00:00:00 +0000
- title: 'Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information'
  abstract: 'In this paper, we examine the Nash equilibrium convergence properties of no-regret learning in general $N$-player games. For concreteness, we focus on the archetypal “follow the regularized leader” (FTRL) family of algorithms, and we consider the full spectrum of uncertainty that the players may encounter – from noisy, oracle-based feedback, to bandit, payoff-based information. In this general context, we establish a comprehensive equivalence between the stability of a Nash equilibrium and its support: a Nash equilibrium is stable and attracting with arbitrarily high probability if and only if it is strict (i.e., each equilibrium strategy has a unique best response). This equivalence extends existing continuous-time versions of the “folk theorem” of evolutionary game theory to a bona fide algorithmic learning setting, and it provides a clear refinement criterion for the prediction of the day-to-day behavior of no-regret learning in games.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/giannou21a.html
  PDF: http://proceedings.mlr.press/v134/giannou21a/giannou21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-giannou21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Angeliki
    family: Giannou
  - given: Emmanouil Vasileios
    family: Vlatakis-Gkaragkounis
  - given: Panayotis
    family: Mertikopoulos
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2147-2148
  id: giannou21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2147
  lastpage: 2148
  published: 2021-07-21 00:00:00 +0000
- title: 'Differentially Private Nonparametric Regression Under a Growth Condition'
  abstract: 'Given a real-valued hypothesis class H, we investigate under what conditions there is a differentially private algorithm which learns an optimal hypothesis from H given i.i.d. data. Inspired by recent results for the related setting of binary classification (Alon et al., 2019; Bun et al., 2020), where it was shown that online learnability of a binary class is necessary and sufficient for its private learnability, Jung et al. (2020) showed that in the setting of regression, online learnability of H is necessary for private learnability. Here online learnability of H is characterized by the finiteness of its eta-sequential fat shattering dimension, sfat_eta(H), for all eta > 0. In terms of sufficient conditions for private learnability, Jung et al. (2020) showed that H is privately learnable if lim_{eta -> 0} sfat_eta(H) is finite, which is a fairly restrictive condition. We show that under the relaxed condition liminf_{eta -> 0} eta * sfat_eta(H) = 0, H is privately learnable, thus establishing the first nonparametric private learnability guarantee for classes H with sfat_eta(H) diverging as eta -> 0. Our techniques involve a novel filtering procedure to output stable hypotheses for nonparametric function classes.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/golowich21a.html
  PDF: http://proceedings.mlr.press/v134/golowich21a/golowich21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-golowich21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Noah
    family: Golowich
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2149-2192
  id: golowich21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2149
  lastpage: 2192
  published: 2021-07-21 00:00:00 +0000
- title: 'Source Identification for Mixtures of Product Distributions'
  abstract: 'We give an algorithm for source identification of a mixture of k product distributions on n bits. This is a fundamental problem in machine learning with many applications. Our algorithm identifies the source parameters of an identifiable mixture, given, as input, approximate values of multilinear moments (derived, for instance, from a sufficiently large sample), using $2^{O(k^2)}n^{O(k)}$ arithmetic operations. Our result is the first explicit bound on the computational complexity of source identification of such mixtures. The running time improves previous results by Feldman, O’Donnell, and Servedio (FOCS 2005) and Chen and Moitra (STOC 2019) that guaranteed only learning the mixture (without parametric identification of the source). Our analysis gives a quantitative version of a qualitative characterization of identifiable sources that is due to Tahmasebi, Motahari, and Maddah-Ali (ISIT 2018).'
  volume: 134
  URL: https://proceedings.mlr.press/v134/gordon21a.html
  PDF: http://proceedings.mlr.press/v134/gordon21a/gordon21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-gordon21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Spencer
    family: Gordon
  - given: Bijan H
    family: Mazaheri
  - given: Yuval
    family: Rabani
  - given: Leonard
    family: Schulman
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2193-2216
  id: gordon21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2193
  lastpage: 2216
  published: 2021-07-21 00:00:00 +0000
- title: 'PAC-Bayes, MAC-Bayes and Conditional Mutual Information: Fast rate bounds that handle general VC classes'
  abstract: 'We give a novel, unified derivation of conditional PAC-Bayesian and mutual information (MI) generalization bounds. We derive conditional MI bounds as an instance, with special choice of prior, of conditional MAC-Bayesian (Mean Approximately Correct) bounds, itself derived from conditional PAC-Bayesian bounds, where ‘conditional’ means that one can use priors conditioned on a joint training and ghost sample.  This allows us to get nontrivial PAC-Bayes and MI-style bounds for general VC classes, something recently shown to be impossible with standard PAC-Bayesian/MI bounds. Second, it allows us to get fast rates of order $O((\text{KL}/n)^{\gamma})$ for $\gamma > 1/2$ if a Bernstein condition holds and for exp-concave losses (with $\gamma=1$), which is impossible with both standard PAC-Bayes generalization and MI bounds. Our work extends the recent work by Steinke and Zakynthinou (2020), who handle MI with VC but neither PAC-Bayes nor fast rates, and Mhammedi et al. (2019), who initiated fast rate PAC-Bayes generalization error bounds but handle neither MI nor general VC classes.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/grunwald21a.html
  PDF: http://proceedings.mlr.press/v134/grunwald21a/grunwald21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-grunwald21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Peter
    family: Grunwald
  - given: Thomas
    family: Steinke
  - given: Lydia
    family: Zakynthinou
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2217-2247
  id: grunwald21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2217
  lastpage: 2247
  published: 2021-07-21 00:00:00 +0000
- title: 'Generalizing Complex Hypotheses on Product Distributions: Auctions, Prophet Inequalities, and Pandora’s Problem'
  abstract: 'This paper explores a theory of generalization for learning problems on product distributions, complementing the existing learning theories in the sense that it does not rely on any complexity measures of the hypothesis classes.  The main contributions are two general sample complexity bounds: (1) $\tilde{O} \big( \frac{nk}{\epsilon^2} \big)$ samples are sufficient and necessary for learning an $\epsilon$-optimal hypothesis in \emph{any problem} on an $n$-dimensional product distribution, whose marginals have finite supports of sizes at most $k$; (2) $\tilde{O} \big( \frac{n}{\epsilon^2} \big)$ samples are sufficient and necessary for any problem on $n$-dimensional product distributions if it satisfies a notion of strong monotonicity from the algorithmic game theory literature.  As applications of these theories, we match the optimal sample complexity for single-parameter revenue maximization (Guo et al., STOC 2019), improve the state-of-the-art for multi-parameter revenue maximization (Gonczarowski and Weinberg, FOCS 2018) and prophet inequality (Correa et al., EC 2019; Rubinstein et al., ITCS 2020), and provide the first and tight sample complexity bound for Pandora’s problem.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/guo21a.html
  PDF: http://proceedings.mlr.press/v134/guo21a/guo21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-guo21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chenghao
    family: Guo
  - given: Zhiyi
    family: Huang
  - given: Zhihao Gavin
    family: Tang
  - given: Xinzhi
    family: Zhang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2248-2288
  id: guo21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2248
  lastpage: 2288
  published: 2021-07-21 00:00:00 +0000
- title: 'Online Learning with Simple Predictors and a Combinatorial Characterization of Minimax in 0/1 Games'
  abstract: 'Which classes can be learned properly in the online model? — that is, by an algorithm that on each round uses a predictor from the concept class. While there are simple and natural cases where improper learning is useful and even necessary, it is natural to ask how complex the improper predictors must be in such cases.  Can one always achieve nearly optimal mistake/regret bounds using "simple" predictors?  In this work, we give a complete characterization of when this is possible, thus settling an open problem which has been studied since the pioneering works of Angluin (1987) and Littlestone (1988).  More precisely, given any concept class C and any hypothesis class H, we provide nearly tight bounds (up to a log factor) on the optimal mistake bounds for online learning C using predictors from H.  Our bound yields an exponential improvement over the previously best known bound by Chase and Freitag (2020).  As applications, we give constructive proofs showing that (i) in the realizable setting, a near-optimal mistake bound (up to a constant factor) can be attained by a sparse majority-vote of proper predictors, and (ii) in the agnostic setting, a near-optimal regret bound (up to a log factor) can be attained by a randomized proper algorithm.  The latter was proven non-constructively by Rakhlin, Sridharan, and Tewari (2015).  It was also achieved by constructive but improper algorithms proposed by Ben-David, Pal, and Shalev-Shwartz (2009) and Rakhlin, Shamir, and Sridharan (2012).  A technical ingredient of our proof which may be of independent interest is a generalization of the celebrated Minimax Theorem (von Neumann, 1928) for binary zero-sum games with arbitrary action-sets: a simple game which fails to satisfy Minimax is "Guess the Larger Number".  In this game, each player picks a natural number and the player who picked the larger number wins.  Equivalently, the payoff matrix of this game is infinite triangular.  
We show that this is the only obstruction: if the payoff matrix does not contain triangular submatrices of unbounded sizes then the Minimax theorem is satisfied.  This generalizes von Neumann’s Minimax Theorem by removing requirements of finiteness (or compactness) of the action-sets, and moreover it captures precisely the types of games of interest in online learning: namely, Littlestone games.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/hanneke21a.html
  PDF: http://proceedings.mlr.press/v134/hanneke21a/hanneke21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-hanneke21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  - given: Roi
    family: Livni
  - given: Shay
    family: Moran
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2289-2314
  id: hanneke21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2289
  lastpage: 2314
  published: 2021-07-21 00:00:00 +0000
- title: 'Shape Matters: Understanding the Implicit Bias of the Noise Covariance'
  abstract: 'The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise — induced by mini-batches or label perturbation — is far more effective than Gaussian noise.  This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al.  We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/haochen21a.html
  PDF: http://proceedings.mlr.press/v134/haochen21a/haochen21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-haochen21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jeff Z.
    family: HaoChen
  - given: Colin
    family: Wei
  - given: Jason
    family: Lee
  - given: Tengyu
    family: Ma
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2315-2357
  id: haochen21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2315
  lastpage: 2357
  published: 2021-07-21 00:00:00 +0000
- title: 'Bounded Memory Active Learning through Enriched Queries'
  abstract: 'The explosive growth of easily-accessible unlabeled data has led to growing interest in \emph{active learning}, a paradigm in which data-hungry learning algorithms adaptively select informative examples in order to lower prohibitively expensive labeling costs. Unfortunately, in standard worst-case models of learning, the active setting often provides no improvement over non-adaptive algorithms. To combat this, a series of recent works have considered a model in which the learner may ask \emph{enriched} queries beyond labels. While such models have seen success in drastically lowering label costs, they tend to come at the expense of requiring large amounts of memory. In this work, we study what families of classifiers can be learned in \emph{bounded memory}. To this end, we introduce a novel streaming-variant of enriched-query active learning along with a natural combinatorial parameter called \emph{lossless sample compression} that is sufficient for learning not only with bounded memory, but in a query-optimal and computationally efficient manner as well. Finally, we give three fundamental examples of classifier families with small, easy-to-compute lossless compression schemes when given access to basic enriched queries: axis-aligned rectangles, decision trees, and halfspaces in two dimensions.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/hopkins21a.html
  PDF: http://proceedings.mlr.press/v134/hopkins21a/hopkins21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-hopkins21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Max
    family: Hopkins
  - given: Daniel
    family: Kane
  - given: Shachar
    family: Lovett
  - given: Michal
    family: Moshkovitz
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2358-2387
  id: hopkins21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2358
  lastpage: 2387
  published: 2021-07-21 00:00:00 +0000
- title: 'Adaptive Learning in Continuous Games: Optimal Regret Bounds and Convergence to Nash Equilibrium'
  abstract: 'In game-theoretic learning, several agents are simultaneously following their individual interests, so the environment is non-stationary from each player’s perspective. In this context, the performance of a learning algorithm is often measured by its regret. However, no-regret algorithms are not created equal in terms of game-theoretic guarantees: depending on how they are tuned, some of them may drive the system to an equilibrium, while others could produce cyclic, chaotic, or otherwise divergent trajectories. To account for this, we propose a range of no-regret policies based on optimistic mirror descent, with the following desirable properties: (\emph{i}) they do not require \emph{any} prior tuning or knowledge of the game; (\emph{ii}) they all achieve $\mathcal{O}(\sqrt{T})$ regret against arbitrary, adversarial opponents; and (\emph{iii}) they converge to the best response against convergent opponents. Also, if employed by all players, then (\emph{iv}) they guarantee $\mathcal{O}(1)$ \emph{social} regret; while (\emph{v}) the induced sequence of play converges to Nash equilibrium with $\mathcal{O}(1)$ \emph{individual} regret in all variationally stable games (a class of games that includes all monotone and convex-concave zero-sum games).'
  volume: 134
  URL: https://proceedings.mlr.press/v134/hsieh21a.html
  PDF: http://proceedings.mlr.press/v134/hsieh21a/hsieh21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-hsieh21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yu-Guan
    family: Hsieh
  - given: Kimon
    family: Antonakopoulos
  - given: Panayotis
    family: Mertikopoulos
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2388-2422
  id: hsieh21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2388
  lastpage: 2422
  published: 2021-07-21 00:00:00 +0000
- title: 'On the Approximation Power of Two-Layer Networks of Random ReLUs'
  abstract: 'This paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper- and lower-bounds for L2-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower-bounds are based on (robust versions of) dimensionality arguments.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/hsu21a.html
  PDF: http://proceedings.mlr.press/v134/hsu21a/hsu21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-hsu21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Daniel
    family: Hsu
  - given: Clayton H
    family: Sanford
  - given: Rocco
    family: Servedio
  - given: Emmanouil Vasileios
    family: Vlatakis-Gkaragkounis
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2423-2461
  id: hsu21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2423
  lastpage: 2461
  published: 2021-07-21 00:00:00 +0000
- title: 'Fast Rates for the Regret of Offline Reinforcement Learning'
  abstract: 'We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest an $O(1/\sqrt{n})$ convergence for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate’s pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/hu21a.html
  PDF: http://proceedings.mlr.press/v134/hu21a/hu21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-hu21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yichun
    family: Hu
  - given: Nathan
    family: Kallus
  - given: Masatoshi
    family: Uehara
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2462-2462
  id: hu21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2462
  lastpage: 2462
  published: 2021-07-21 00:00:00 +0000
- title: 'Streaming k-PCA: Efficient guarantees for Oja’s algorithm, beyond rank-one updates'
  abstract: 'We analyze Oja’s algorithm for streaming $k$-PCA, and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. $d \times d$ symmetric matrices, we show that Oja’s algorithm can obtain an accurate approximation to the subspace of the top $k$ eigenvectors of their expectation using a number of samples that scales polylogarithmically with $d$. Previously, such a result was only known in the case where the updates have rank one.  Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm’s execution.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/huang21a.html
  PDF: http://proceedings.mlr.press/v134/huang21a/huang21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-huang21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: De
    family: Huang
  - given: Jonathan
    family: Niles-Weed
  - given: Rachel
    family: Ward
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2463-2498
  id: huang21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2463
  lastpage: 2498
  published: 2021-07-21 00:00:00 +0000
- title: 'Group testing and local search: is there a computational-statistical gap?'
  abstract: 'Group testing is a fundamental problem in statistical inference with many real-world applications, including the need for massive group testing during the ongoing COVID-19 pandemic. In this paper we study the task of approximate recovery, in which we tolerate having a small number of incorrectly classified items.  One of the most well-known, optimal, and easy to implement testing procedures is the non-adaptive Bernoulli group testing, where all tests are conducted in parallel, and each item is chosen to be part of any certain test independently with some fixed probability. In this setting, there is an observed gap between the number of tests above which recovery is information theoretically possible, and the number of tests required by the currently best known efficient algorithms to succeed. In this paper we seek to understand whether this computational-statistical gap can be closed. Our main contributions are the following: 1. Oftentimes such gaps are explained by a phase transition in the landscape of the solution space of the problem (an Overlap Gap Property (OGP) phase transition). We provide first-moment evidence that, perhaps surprisingly, such a phase transition does not take place throughout the regime for which recovery is information theoretically possible. This fact suggests that the model is in fact amenable to local search algorithms.  2. We prove the complete absence of “bad” local minima for a part of the “hard” regime, a fact which implies an improvement over known theoretical results on the performance of efficient algorithms for approximate recovery without false-negatives.  3. Finally, motivated by the evidence for the absence of the OGP, we present extensive simulations that strongly suggest that a very simple local algorithm known as Glauber Dynamics does indeed succeed, and can be used to efficiently implement the well-known (theoretically optimal) Smallest Satisfying Set (SSS) estimator. 
Given that practical algorithms for this task utilize Branch and Bound and Linear Programming relaxation techniques, our finding could potentially be of practical interest.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/iliopoulos21a.html
  PDF: http://proceedings.mlr.press/v134/iliopoulos21a/iliopoulos21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-iliopoulos21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Fotis
    family: Iliopoulos
  - given: Ilias
    family: Zadik
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2499-2551
  id: iliopoulos21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2499
  lastpage: 2551
  published: 2021-07-21 00:00:00 +0000
- title: 'Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds'
  abstract: 'This paper presents multi-armed bandit (MAB) algorithms that work well in adversarial environments and that offer improved performance by exploiting inherent structures in such environments, such as stochastic generative models, as well as small variations in loss vectors. The fundamental aim of this work is to overcome the limitation of worst-case analyses in MAB contexts. Two basic approaches achieve this purpose: best-of-both-worlds algorithms that work well for both stochastic and adversarial settings, and data-dependent regret bounds that work well depending on certain difficulty indicators w.r.t. loss sequences. One remarkable study w.r.t. the best-of-both-worlds approach deals with the Tsallis-INF algorithm \citep{zimmert2019optimal}, which achieves nearly optimal regret bounds up to small constants in both settings, though such bounds have remained unproven for a special case of a stochastic setting with multiple optimal arms.  This paper offers two particular contributions: (i) We show that the Tsallis-INF algorithm enjoys a regret bound of a logarithmic order in the number of rounds for stochastic environments, even if the best arm is not unique. (ii) We provide a new algorithm with a new \textit{hybrid} regret bound that implies logarithmic regret in the stochastic regime and multiple data-dependent regret bounds in the adversarial regime, including bounds dependent on cumulative loss, total variation, and loss-sequence path-length. Both our proposed algorithm and the Tsallis-INF algorithm are based on a follow-the-regularized-leader (FTRL) framework with a time-varying regularizer. The analyses in this paper rely on \textit{skewed Bregman divergence}, which provides simple expressions of regret bounds for FTRL with a time-varying regularizer.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/ito21a.html
  PDF: http://proceedings.mlr.press/v134/ito21a/ito21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-ito21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Shinji
    family: Ito
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2552-2583
  id: ito21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2552
  lastpage: 2583
  published: 2021-07-21 00:00:00 +0000
- title: 'Double Explore-then-Commit: Asymptotic Optimality and Beyond'
  abstract: 'We study the multi-armed bandit problem with subGaussian rewards. The explore-then-commit (ETC) strategy, which consists of an exploration phase followed by an exploitation phase, is one of the most widely used algorithms in a variety of online decision applications. Nevertheless, it has been shown in \cite{garivier2016explore} that ETC is suboptimal in the asymptotic sense as the horizon grows, and thus, is worse than fully sequential strategies such as Upper Confidence Bound (UCB). In this paper, we show that a variant of the ETC algorithm can actually achieve asymptotic optimality for multi-armed bandit problems as UCB-type algorithms do, and extend it to the batched bandit setting. Specifically, we propose a double explore-then-commit (DETC) algorithm that has two exploration and exploitation phases and prove that DETC achieves the asymptotically optimal regret bound. To our knowledge, DETC is the first non-fully-sequential algorithm that achieves such asymptotic optimality. In addition, we extend DETC to batched bandit problems, where (i) the exploration process is split into a small number of batches and (ii) the round complexity is of central interest. We prove that a batched version of DETC can achieve asymptotic optimality with only a constant round complexity. This is the first batched bandit algorithm that can attain the optimal asymptotic regret bound and optimal round complexity simultaneously.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/jin21a.html
  PDF: http://proceedings.mlr.press/v134/jin21a/jin21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-jin21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tianyuan
    family: Jin
  - given: Pan
    family: Xu
  - given: Xiaokui
    family: Xiao
  - given: Quanquan
    family: Gu
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2584-2633
  id: jin21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2584
  lastpage: 2633
  published: 2021-07-21 00:00:00 +0000
- title: 'Moment Multicalibration for Uncertainty Estimation'
  abstract: 'We show how to achieve the notion of "multicalibration" from Hebert-Johnson et al. (2018) not just for means, but also for variances and other higher moments.  Informally, this means that we can find regression functions which, given a data point, can make point predictions not just for the expectation of its label, but for higher moments of its label distribution as well—and those predictions match the true distribution quantities when averaged not just over the population as a whole, but also when averaged over an enormous number of finely defined subgroups. It yields a principled way to estimate the uncertainty of predictions on many different subgroups—and to diagnose potential sources of unfairness in the predictive power of features across subgroups. As an application, we show that our moment estimates can be used to derive marginal prediction intervals that are simultaneously valid as averaged over all of the (sufficiently large) subgroups for which moment multicalibration has been obtained.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/jung21a.html
  PDF: http://proceedings.mlr.press/v134/jung21a/jung21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-jung21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Christopher
    family: Jung
  - given: Changhwa
    family: Lee
  - given: Mallesh
    family: Pai
  - given: Aaron
    family: Roth
  - given: Rakesh
    family: Vohra
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2634-2678
  id: jung21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2634
  lastpage: 2678
  published: 2021-07-21 00:00:00 +0000
- title: 'Reduced-Rank Regression with Operator Norm Error'
  abstract: 'A common data analysis task is the reduced-rank regression problem: $$\min_{\textrm{rank-}k\ X} \|AX-B\|,$$ where $A \in \mathbb{R}^{n \times c}$ and $B \in \mathbb{R}^{n \times d}$ are given large matrices and $\|\cdot\|$ is some norm. Here the unknown matrix $X \in \mathbb{R}^{c \times d}$ is constrained to be of rank $k$ as it results in a significant parameter reduction of the solution when $c$ and $d$ are large. In the case of Frobenius norm error, there is a standard closed form solution to this problem and a fast algorithm to find a $(1+\varepsilon)$-approximate solution. However, for the important case of operator norm error, no closed form solution is known and the fastest known algorithms take singular value decomposition time. We give the first randomized algorithms for this problem running in time $$(\mathrm{nnz}(A) + \mathrm{nnz}(B) + c^2) \cdot k/\varepsilon^{1.5} + (n+d)k^2/\varepsilon + c^{\omega},$$ up to a polylogarithmic factor involving condition numbers, matrix dimensions, and dependence on $1/\varepsilon$. Here $\mathrm{nnz}(M)$ denotes the number of non-zero entries of a matrix $M$, and $\omega$ is the exponent of matrix multiplication. As both (1) spectral low rank approximation ($A = B$) and (2) linear system solving ($n = c$ and $d = 1$) are special cases, our time cannot be improved by more than a $1/\varepsilon$ factor (up to polylogarithmic factors) without a major breakthrough in linear algebra. Interestingly, known techniques for low rank approximation, such as alternating minimization or sketch-and-solve, provably fail for this problem. Instead, our algorithm uses an existential characterization of a solution, together with Krylov methods, low degree polynomial approximation, and sketching-based preconditioning.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kacham21a.html
  PDF: http://proceedings.mlr.press/v134/kacham21a/kacham21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kacham21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Praneeth
    family: Kacham
  - given: David
    family: Woodruff
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2679-2716
  id: kacham21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2679
  lastpage: 2716
  published: 2021-07-21 00:00:00 +0000
- title: '(Nearly) Dimension Independent Private ERM with AdaGrad Rates via Publicly Estimated Subspaces'
  abstract: 'We revisit the problem of empirical risk minimization (ERM) with differential privacy. We show that noisy AdaGrad, given appropriate knowledge and conditions on the subspace from which gradients can be drawn, achieves a regret comparable to traditional AdaGrad plus a well-controlled term due to noise. We show a convergence rate of $O(\mathrm{tr}(G_T)/T)$, where $G_T$ captures the geometry of the gradient subspace. Since $\mathrm{tr}(G_T)=O(\sqrt{T})$, we can obtain faster rates for convex and Lipschitz functions, compared to the $O(1/\sqrt{T})$ rate achieved by known versions of noisy (stochastic) gradient descent with comparable noise variance. In particular, we show that if the gradients lie in a known constant rank subspace, and assuming algorithmic access to an envelope which bounds decaying sensitivity, one can achieve faster convergence to an excess empirical risk of $\tilde O(1/\epsilon n)$, where $\epsilon$ is the privacy budget and $n$ the number of samples. Letting $p$ be the problem dimension, this result implies that, by running noisy AdaGrad, we can bypass the DP-SGD bound $\tilde O(\sqrt{p}/\epsilon n)$ in $T=(\epsilon n)^{2/(1+2\alpha)}$ iterations, where $\alpha \geq 0$ is a parameter controlling gradient norm decay, instead of the $T=\epsilon^2 n^2$ iterations required by SGD. Our results operate with general convex functions in both constrained and unconstrained minimization.  Along the way, we do a perturbation analysis of noisy AdaGrad, which is of independent interest. Our utility guarantee for the private ERM problem follows as a corollary to the regret guarantee of noisy AdaGrad.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kairouz21a.html
  PDF: http://proceedings.mlr.press/v134/kairouz21a/kairouz21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kairouz21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Peter
    family: Kairouz
  - given: Monica Ribero
    family: Diaz
  - given: Keith
    family: Rush
  - given: Abhradeep
    family: Thakurta
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2717-2746
  id: kairouz21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2717
  lastpage: 2746
  published: 2021-07-21 00:00:00 +0000
- title: 'The Sparse Vector Technique, Revisited'
  abstract: 'We revisit one of the most basic and widely applicable techniques in the literature of differential privacy – the sparse vector technique [Dwork et al., STOC 2009]. This simple algorithm privately tests whether the value of a given query on a database is close to what we expect it to be. It allows asking an unbounded number of queries as long as the answer is close to what we expect, and halts following the first query for which this is not the case.  We suggest an alternative, equally simple, algorithm that can continue testing queries as long as no single individual contributes to the answer of too many queries whose answer deviates substantially from what we expect. Our analysis is subtle and some of its ingredients may be more widely applicable.  In some cases our new algorithm allows us to privately extract much more information from the database than the original.  We demonstrate this by applying our algorithm to the shifting-heavy-hitters problem: On every time step, each of n users gets a new input, and the task is to privately identify all the current heavy-hitters. That is, on time step i, the goal is to identify all data elements x such that many of the users have x as their current input. We present an algorithm for this problem with improved error guarantees over what can be obtained using existing techniques. Specifically, the error of our algorithm depends on the maximal number of times that a single user holds a heavy-hitter as input, rather than the total number of times in which a heavy-hitter exists.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kaplan21a.html
  PDF: http://proceedings.mlr.press/v134/kaplan21a/kaplan21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kaplan21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Haim
    family: Kaplan
  - given: Yishay
    family: Mansour
  - given: Uri
    family: Stemmer
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2747-2776
  id: kaplan21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2747
  lastpage: 2776
  published: 2021-07-21 00:00:00 +0000
- title: 'Asymptotically Optimal Information-Directed Sampling'
  abstract: 'We introduce a simple and efficient algorithm for stochastic linear bandits with finitely many actions that is asymptotically optimal and (nearly) worst-case optimal in finite time. The approach is based on the frequentist information-directed sampling (IDS) framework, with a surrogate for the information gain that is informed by the optimization problem that defines the asymptotic lower bound. Our analysis sheds light on how IDS balances the trade-off between regret and information and uncovers a surprising connection between the recently proposed primal-dual methods and the IDS algorithm. We demonstrate empirically that IDS is competitive with UCB in finite time, and can be significantly better in the asymptotic regime.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kirschner21a.html
  PDF: http://proceedings.mlr.press/v134/kirschner21a/kirschner21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kirschner21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Johannes
    family: Kirschner
  - given: Tor
    family: Lattimore
  - given: Claire
    family: Vernade
  - given: Csaba
    family: Szepesvari
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2777-2821
  id: kirschner21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2777
  lastpage: 2821
  published: 2021-07-21 00:00:00 +0000
- title: 'Hypothesis testing with low-degree polynomials in the Morris class of exponential families'
  abstract: 'Analysis of low-degree polynomial algorithms is a powerful, newly-popular method for predicting computational thresholds in hypothesis testing problems. One limitation of current techniques for this analysis is their restriction to Bernoulli and Gaussian distributions. We expand this range of possibilities by performing the low-degree analysis of hypothesis testing for the Morris class of natural exponential families with quadratic variance function, giving a unified treatment of Gaussian, Poisson, gamma (including exponential and chi-squared), binomial (including Bernoulli), negative binomial (including geometric), and generalized hyperbolic secant distributions. We then give several algorithmic applications.  1. In models where a random signal is observed through coordinatewise-independent noise applied in an exponential family, the success or failure of low-degree polynomials is governed by the z-score overlap, the inner product of z-score vectors with respect to the null distribution of two independent copies of the signal.  2. In the same models, testing with low-degree polynomials exhibits channel monotonicity: the above distributions admit a total ordering by computational cost of hypothesis testing, according to a scalar parameter describing how the variance depends on the mean in an exponential family.  3. In a spiked matrix model with a particular non-Gaussian noise distribution, the low-degree prediction is incorrect unless polynomials with arbitrarily large degree in individual matrix entries are permitted. This shows that polynomials summing over self-avoiding walks and variants thereof, as proposed recently by Ding, Hopkins, and Steurer (2020) for spiked matrix models with heavy-tailed noise, are strictly suboptimal for this model. Thus low-degree polynomials appear to offer a tradeoff between robustness and strong performance fine-tuned to specific models. 
Inspired by this, we suggest that a class of problems requiring "exploration before inference," where an algorithm must first examine the input and then use some intermediate computation to choose a suitable inference subroutine, appears especially difficult for low-degree polynomials.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kunisky21a.html
  PDF: http://proceedings.mlr.press/v134/kunisky21a/kunisky21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kunisky21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dmitriy
    family: Kunisky
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2822-2848
  id: kunisky21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2822
  lastpage: 2848
  published: 2021-07-21 00:00:00 +0000
- title: 'On the Minimal Error of Empirical Risk Minimization'
  abstract: 'We study the minimal squared error of the Empirical Risk Minimization (ERM) procedure in the task of regression, both in random and fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for performance of ERM in Donsker and non-Donsker classes. We also discuss our results through the lens of recent studies on interpolation in overparameterized models.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kur21a.html
  PDF: http://proceedings.mlr.press/v134/kur21a/kur21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kur21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gil
    family: Kur
  - given: Alexander
    family: Rakhlin
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2849-2852
  id: kur21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2849
  lastpage: 2852
  published: 2021-07-21 00:00:00 +0000
- title: '**Paper retracted by author request (see pdf for retraction notice from the authors)** Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping'
  abstract: 'We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell_2$-regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).'
  volume: 134
  URL: https://proceedings.mlr.press/v134/kuzborskij21a.html
  PDF: http://proceedings.mlr.press/v134/kuzborskij21a/kuzborskij21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-kuzborskij21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilja
    family: Kuzborskij
  - given: Csaba
    family: Szepesvari
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2853-2890
  id: kuzborskij21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2853
  lastpage: 2890
  published: 2021-07-21 00:00:00 +0000
- title: 'Projected Stochastic Gradient Langevin Algorithms for Constrained Sampling and Non-Convex Learning'
  abstract: 'Langevin algorithms are gradient descent methods with additive noise. They have been used for decades in Markov Chain Monte Carlo (MCMC) sampling, optimization, and learning. Their convergence properties for unconstrained non-convex optimization and learning problems have been studied widely in the last few years. Other work has examined projected Langevin algorithms for sampling from log-concave distributions restricted to convex compact sets. For learning and optimization, log-concave distributions correspond to convex losses. In this paper, we analyze the case of non-convex losses with compact convex constraint sets and IID external data variables. We term the resulting method the projected stochastic gradient Langevin algorithm (PSGLA). We show the algorithm achieves a deviation of $O(T^{-1/4} (\log T)^{1/2})$ from its target distribution in 1-Wasserstein distance. For optimization and learning, we show that the algorithm achieves $\epsilon$-suboptimal solutions, on average, provided that it is run for a time that is polynomial in $\epsilon$ and slightly super-exponential in the problem dimension.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/lamperski21a.html
  PDF: http://proceedings.mlr.press/v134/lamperski21a/lamperski21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-lamperski21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Andrew
    family: Lamperski
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2891-2937
  id: lamperski21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2891
  lastpage: 2937
  published: 2021-07-21 00:00:00 +0000
- title: 'Improved Regret for Zeroth-Order Stochastic Convex Bandits'
  abstract: 'We present an efficient algorithm for stochastic bandit convex optimisation with no assumptions on smoothness or strong convexity and for which the regret is bounded by $O(d^{4.5} \sqrt{n}\, \mathrm{polylog}(n))$, where $n$ is the number of interactions and $d$ is the dimension.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/lattimore21a.html
  PDF: http://proceedings.mlr.press/v134/lattimore21a/lattimore21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-lattimore21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tor
    family: Lattimore
  - given: Andras
    family: Gyorgy
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2938-2964
  id: lattimore21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2938
  lastpage: 2964
  published: 2021-07-21 00:00:00 +0000
- title: 'Mirror Descent and the Information Ratio'
  abstract: 'We establish a connection between the stability of mirror descent and the information ratio by Russo and Van Roy (2014). Our analysis shows that mirror descent with suitable loss estimators and exploratory distributions enjoys the same bound on the adversarial regret as the bounds on the Bayesian regret for information-directed sampling. Along the way, we develop the theory for information-directed sampling and provide an efficient algorithm for adversarial bandits for which the regret upper bound matches exactly the best known information-theoretic upper bound. Keywords: Bandits, partial monitoring, mirror descent, information theory.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/lattimore21b.html
  PDF: http://proceedings.mlr.press/v134/lattimore21b/lattimore21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-lattimore21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tor
    family: Lattimore
  - given: Andras
    family: Gyorgy
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2965-2992
  id: lattimore21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2965
  lastpage: 2992
  published: 2021-07-21 00:00:00 +0000
- title: 'Structured Logconcave Sampling with a Restricted Gaussian Oracle'
  abstract: 'We give algorithms for sampling several structured logconcave families to high accuracy. We further develop a reduction framework, inspired by proximal point methods in convex optimization, which bootstraps samplers for regularized densities to generically improve dependences on problem conditioning $\kappa$ from polynomial to linear. A key ingredient in our framework is the notion of a "restricted Gaussian oracle" (RGO) for $g: \mathbb{R}^d \rightarrow \mathbb{R}$, which is a sampler for distributions whose negative log-likelihood sums a quadratic (in a multiple of the identity) and $g$. By combining our reduction framework with our new samplers, we obtain the following bounds for sampling structured distributions to total variation distance $\epsilon$.
- For composite densities $\exp(-f(x) - g(x))$, where $f$ has condition number $\kappa$ and the convex (but possibly non-smooth) $g$ admits an RGO, we obtain a mixing time of $O(\kappa d \log^3\tfrac{\kappa d}{\epsilon})$, matching the state-of-the-art non-composite bound of Lee et al. No composite samplers with better mixing than general-purpose logconcave samplers were previously known.
- For logconcave finite sums $\exp(-F(x))$, where $F(x) = \tfrac{1}{n}\sum_{i \in [n]} f_i(x)$ has condition number $\kappa$, we give a sampler querying $\widetilde{O}(n + \kappa\max(d, \sqrt{nd}))$ gradient oracles\footnote{For convenience of exposition, the $\widetilde{O}$ notation hides logarithmic factors in the dimension $d$, problem conditioning $\kappa$, desired accuracy $\epsilon$, and summand count $n$ (when applicable). A first-order (gradient) oracle for $f:\mathbb{R}^d \rightarrow \mathbb{R}$ returns $(f(x), \nabla f(x))$ on input $x$, and a zeroth-order (value) oracle only returns $f(x)$.} to $\{f_i\}_{i \in [n]}$. No high-accuracy samplers with nontrivial gradient query complexity were previously known.
- For densities with condition number $\kappa$, we give an algorithm obtaining mixing time $O(\kappa d \log^2\tfrac{\kappa d}{\epsilon})$, improving on Lee et al. by a logarithmic factor with a significantly simpler analysis. We also show that a zeroth-order algorithm attains the same query complexity.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/lee21a.html
  PDF: http://proceedings.mlr.press/v134/lee21a/lee21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-lee21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yin Tat
    family: Lee
  - given: Ruoqi
    family: Shen
  - given: Kevin
    family: Tian
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 2993-3050
  id: lee21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 2993
  lastpage: 3050
  published: 2021-07-21 00:00:00 +0000
- title: 'Stochastic Approximation for Online Tensorial Independent Component Analysis'
  abstract: 'Independent component analysis (ICA) has been a popular dimension reduction tool in statistical machine learning and signal processing. In this paper, we present a convergence analysis for an online tensorial ICA algorithm, by viewing the problem as a nonconvex stochastic approximation problem. For estimating one component, we provide a dynamics-based analysis to prove that our online tensorial ICA algorithm with a specific choice of stepsize achieves a sharp finite-sample error bound. In particular, under a mild assumption on the data-generating distribution and a scaling condition such that $d^4/T$ is sufficiently small up to a polylogarithmic factor of data dimension $d$ and sample size $T$, a sharp finite-sample error bound of $\tilde{O}(\sqrt{d/T})$ can be obtained.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/li21a.html
  PDF: http://proceedings.mlr.press/v134/li21a/li21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-li21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chris Junchi
    family: Li
  - given: Michael
    family: Jordan
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3051-3106
  id: li21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3051
  lastpage: 3106
  published: 2021-07-21 00:00:00 +0000
- title: 'Softmax Policy Gradient Methods Can Take Exponential Time to Converge'
  abstract: 'The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take \[ \frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} \text{ iterations} \] to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/li21b.html
  PDF: http://proceedings.mlr.press/v134/li21b/li21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-li21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gen
    family: Li
  - given: Yuting
    family: Wei
  - given: Yuejie
    family: Chi
  - given: Yuantao
    family: Gu
  - given: Yuxin
    family: Chen
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3107-3110
  id: li21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3107
  lastpage: 3110
  published: 2021-07-21 00:00:00 +0000
- title: 'Exponentially Improved Dimensionality Reduction for l1: Subspace Embeddings and Independence Testing'
  abstract: 'Despite many applications, dimensionality reduction in the $\ell_1$-norm is much less understood than in the Euclidean norm. We give two new oblivious dimensionality reduction techniques for the $\ell_1$-norm which improve exponentially over prior ones:
- We design a distribution over random matrices $S \in \mathbb{R}^{r \times n}$, where $r = 2^{\textrm{poly}(d/(\varepsilon \delta))}$, such that given any matrix $A \in \mathbb{R}^{n \times d}$, with probability at least $1-\delta$, simultaneously for all $x$, $\|SAx\|_1 = (1 \pm \varepsilon)\|Ax\|_1$. Note that $S$ is linear, does not depend on $A$, and maps $\ell_1$ into $\ell_1$. Our distribution provides an exponential improvement on the previous best known map of Wang and Woodruff (SODA, 2019), which required $r = 2^{2^{\Omega(d)}}$, even for constant $\varepsilon$ and $\delta$. Our bound is optimal, up to a polynomial factor in the exponent, given a known $2^{\textrm{poly}(d)}$ lower bound for constant $\varepsilon$ and $\delta$.
- We design a distribution over matrices $S \in \mathbb{R}^{k \times n}$, where $k = 2^{O(q^2)}(\varepsilon^{-1} q \log d)^{O(q)}$, such that given any $q$-mode tensor $A \in (\mathbb{R}^{d})^{\otimes q}$, one can estimate the entrywise $\ell_1$-norm $\|A\|_1$ from $S(A)$. Moreover, $S = S^1 \otimes S^2 \otimes \cdots \otimes S^q$ and so given vectors $u_1, \ldots, u_q \in \mathbb{R}^d$, one can compute $S(u_1 \otimes u_2 \otimes \cdots \otimes u_q)$ in time $2^{O(q^2)}(\varepsilon^{-1} q \log d)^{O(q)}$, which is much faster than the $d^q$ time required to form $u_1 \otimes u_2 \otimes \cdots \otimes u_q$. Our linear map gives a streaming algorithm for independence testing using space $2^{O(q^2)}(\varepsilon^{-1} q \log d)^{O(q)}$, improving the previous doubly exponential $(\varepsilon^{-1} \log d)^{q^{O(q)}}$ space bound of Braverman and Ostrovsky (STOC, 2010).
For subspace embeddings, we also study the setting when $A$ is itself drawn from distributions with independent entries, and obtain a polynomial embedding dimension. For independence testing, we also give algorithms for any distance measure with a polylogarithmic-sized sketch and satisfying an approximate triangle inequality.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/li21c.html
  PDF: http://proceedings.mlr.press/v134/li21c/li21c.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-li21c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yi
    family: Li
  - given: David
    family: Woodruff
  - given: Taisuke
    family: Yasuda
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3111-3195
  id: li21c
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3111
  lastpage: 3195
  published: 2021-07-21 00:00:00 +0000
- title: 'A Priori Generalization Analysis of the Deep Ritz Method for Solving High Dimensional Elliptic Partial Differential Equations'
  abstract: 'This paper concerns the a priori generalization analysis of the Deep Ritz Method (DRM) [W. E and B. Yu, 2017], a popular neural-network-based method for solving high dimensional partial differential equations.  We derive the generalization error bounds of two-layer neural networks in the framework of the DRM for solving two prototype elliptic PDEs: Poisson equation and static Schrödinger equation on the $d$-dimensional unit hypercube. Specifically, we prove that the convergence rates of generalization errors are independent of the dimension $d$, under the a priori assumption that the exact solutions of the PDEs lie in a suitable low-complexity space called spectral Barron space. Moreover, we give sufficient conditions on the forcing term and the potential function which guarantee that the solutions are spectral Barron functions. We achieve this by developing a new solution theory for the PDEs on the spectral Barron space, which can be viewed as an analog of the classical Sobolev regularity theory for PDEs.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/lu21a.html
  PDF: http://proceedings.mlr.press/v134/lu21a/lu21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-lu21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yulong
    family: Lu
  - given: Jianfeng
    family: Lu
  - given: Min
    family: Wang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3196-3241
  id: lu21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3196
  lastpage: 3241
  published: 2021-07-21 00:00:00 +0000
- title: 'Corruption-robust exploration in episodic reinforcement learning'
  abstract: 'We initiate the study of episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of multi-armed bandits. We provide a framework which modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on “optimism in the face of uncertainty”, by complementing them with principles from “action elimination”. Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms which (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees which degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) as well as linear MDP settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit feedback model for episodic reinforcement learning.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/lykouris21a.html
  PDF: http://proceedings.mlr.press/v134/lykouris21a/lykouris21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-lykouris21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Thodoris
    family: Lykouris
  - given: Max
    family: Simchowitz
  - given: Alex
    family: Slivkins
  - given: Wen
    family: Sun
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3242-3245
  id: lykouris21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3242
  lastpage: 3245
  published: 2021-07-21 00:00:00 +0000
- title: 'Approximation Algorithms for Socially Fair Clustering'
  abstract: 'We present an $(e^{O(p)} (\log \ell) / (\log \log \ell))$-approximation algorithm for socially fair clustering with the $\ell_p$-objective. In this problem, we are given a set of points in a metric space. Each point belongs to one (or several) of $\ell$ groups. The goal is to find a $k$-medians, $k$-means, or, more generally, $\ell_p$-clustering that is simultaneously good for all of the groups. More precisely, we need to find a set of $k$ centers $C$ so as to minimize the maximum over all groups $j$ of $\sum_{u \in \text{group } j} d(u, C)^p$.  The socially fair clustering problem was independently proposed by Abbasi, Bhaskara, and Venkatasubramanian (2021) and Ghadiri, Samadi, and Vempala (2021). Our algorithm improves and generalizes their $O(\ell)$-approximation algorithms for the problem.  The natural LP relaxation for the problem has an integrality gap of $\Omega(\ell)$. In order to obtain our result, we introduce a strengthened LP relaxation and show that it has an integrality gap of $\Theta((\log \ell) / (\log \log \ell))$ for a fixed $p$. Additionally, we present a bicriteria approximation algorithm, which generalizes the bicriteria approximation of Abbasi et al. (2021).'
  volume: 134
  URL: https://proceedings.mlr.press/v134/makarychev21a.html
  PDF: http://proceedings.mlr.press/v134/makarychev21a/makarychev21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-makarychev21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yury
    family: Makarychev
  - given: Ali
    family: Vakilian
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3246-3264
  id: makarychev21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3246
  lastpage: 3264
  published: 2021-07-21 00:00:00 +0000
- title: 'The Connection Between Approximation, Depth Separation and Learnability in Neural Networks'
  abstract: 'Several recent works have shown separation results between deep neural networks, and hypothesis classes with inferior approximation capacity such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks. In this work we study the intricate connection between learnability and approximation capacity. We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target. Specifically, we show that a necessary condition for a function to be learnable by gradient descent on deep neural networks is to be able to approximate the function, at least in a weak sense, with shallow neural networks. We also show that a class of functions can be learned by an efficient statistical query algorithm if and only if it can be approximated in a weak sense by some kernel class. We give several examples of functions which demonstrate depth separation, and conclude that they cannot be efficiently learned, even by a hypothesis class that can efficiently approximate them.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/malach21a.html
  PDF: http://proceedings.mlr.press/v134/malach21a/malach21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-malach21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Eran
    family: Malach
  - given: Gilad
    family: Yehudai
  - given: Shai
    family: Shalev-Shwartz
  - given: Ohad
    family: Shamir
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3265-3295
  id: malach21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3265
  lastpage: 3295
  published: 2021-07-21 00:00:00 +0000
- title: 'Random Graph Matching with Improved Noise Robustness'
  abstract: 'Graph matching, also known as network alignment, refers to finding a bijection between the vertex sets of two given graphs so as to maximally align their edges. This fundamental computational problem arises frequently in multiple fields such as computer vision and biology. Recently, there has been a plethora of work studying efficient algorithms for graph matching under probabilistic models. In this work, we propose a new algorithm for graph matching: Our algorithm associates each vertex with a signature vector using a multistage procedure and then matches a pair of vertices from the two graphs if their signature vectors are close to each other. We show that, for two Erdős–Rényi graphs with edge correlation $1-\alpha$, our algorithm recovers the underlying matching exactly with high probability when $\alpha \le 1 / (\log \log n)^C$, where $n$ is the number of vertices in each graph and $C$ denotes a positive universal constant. This improves the condition $\alpha \le 1 / (\log n)^C$ achieved in previous work.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/mao21a.html
  PDF: http://proceedings.mlr.press/v134/mao21a/mao21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-mao21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cheng
    family: Mao
  - given: Mark
    family: Rudelson
  - given: Konstantin
    family: Tikhomirov
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3296-3329
  id: mao21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3296
  lastpage: 3329
  published: 2021-07-21 00:00:00 +0000
- title: 'Improved Analysis of the Tsallis-INF Algorithm in Stochastically Constrained Adversarial Bandits and Stochastic Bandits with Adversarial Corruptions'
  abstract: 'We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021). We show that in adversarial regimes with a $(\Delta,C,T)$ self-bounding constraint the algorithm achieves a $\mathcal{O}\left(\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)^2}\right)+\sqrt{C\left(\sum_{i\neq i^*}\frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{C\sum_{i\neq i^*}\frac{1}{\Delta_i}}\right)}\right)$ regret bound, where $T$ is the time horizon, $K$ is the number of arms, $\Delta_i$ are the suboptimality gaps, $i^*$ is the best arm, $C$ is the corruption magnitude, and $\log_+(x) = \max\left(1,\log x\right)$. The regime includes stochastic bandits, stochastically constrained adversarial bandits, and stochastic bandits with adversarial corruptions as special cases. Additionally, we provide a general analysis, which allows us to achieve the same kind of improvement for generalizations of Tsallis-INF to other settings beyond multi-armed bandits.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/masoudian21a.html
  PDF: http://proceedings.mlr.press/v134/masoudian21a/masoudian21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-masoudian21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Saeed
    family: Masoudian
  - given: Yevgeny
    family: Seldin
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3330-3350
  id: masoudian21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3330
  lastpage: 3350
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning with invariances in random features and kernel models'
  abstract: 'A number of machine learning tasks entail a high degree of invariance: the data distribution does not change if we act on the data with a certain group of transformations. For instance, labels of images are invariant under translations of the images. Certain neural network architectures (for instance, convolutional networks) are believed to owe their success to the fact that they exploit such invariance properties. With the objective of quantifying the gain achieved by invariant architectures, we introduce two classes of models: invariant random features and invariant kernel methods. The latter includes, as a special case, the neural tangent kernel for convolutional networks with global average pooling. We consider uniform covariate distributions on the sphere and hypercube and a general invariant target function. We characterize the test error of invariant methods in a high-dimensional regime in which the sample size and number of hidden units scale as polynomials in the dimension, for a class of groups that we call ‘degeneracy $\alpha$’, with $\alpha \leq 1$. We show that exploiting invariance in the architecture saves a $d^\alpha$ factor ($d$ stands for the dimension) in sample size and number of hidden units to achieve the same test error as for unstructured architectures. Finally, we show that output symmetrization of an unstructured kernel estimator does not give a significant statistical improvement; on the other hand, data augmentation with an unstructured kernel estimator is equivalent to an invariant kernel estimator and enjoys the same improvement in statistical efficiency.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/mei21a.html
  PDF: http://proceedings.mlr.press/v134/mei21a/mei21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-mei21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Song
    family: Mei
  - given: Theodor
    family: Misiakiewicz
  - given: Andrea
    family: Montanari
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3351-3418
  id: mei21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3351
  lastpage: 3418
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning to Sample from Censored Markov Random Fields'
  abstract: 'We study the problem of learning Censored Markov Random Fields (abbreviated CMRFs), which are Markov Random Fields where some of the nodes are censored (i.e. not observed). We assume the CMRF is high temperature but, crucially, make no assumption about its structure. This makes structure learning impossible. Nevertheless we introduce a new definition, which we call learning to sample, that circumvents this obstacle. We give an algorithm that can learn to sample from a distribution within $\epsilon n$ earthmover distance of the target distribution for any $\epsilon > 0$. We obtain stronger results when we additionally assume high girth, as well as computational lower bounds showing that these are essentially optimal.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/moitra21a.html
  PDF: http://proceedings.mlr.press/v134/moitra21a/moitra21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-moitra21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ankur
    family: Moitra
  - given: Elchanan
    family: Mossel
  - given: Colin P
    family: Sandon
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3419-3451
  id: moitra21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3419
  lastpage: 3451
  published: 2021-07-21 00:00:00 +0000
- title: 'Adversarially Robust Learning with Unknown Perturbation Sets'
  abstract: 'We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks, in different interaction models, in terms of the VC and Littlestone dimensions of the hypothesis class of predictors, and without any assumptions on the perturbation set.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/montasser21a.html
  PDF: http://proceedings.mlr.press/v134/montasser21a/montasser21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-montasser21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Omar
    family: Montasser
  - given: Steve
    family: Hanneke
  - given: Nathan
    family: Srebro
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3452-3482
  id: montasser21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3452
  lastpage: 3482
  published: 2021-07-21 00:00:00 +0000
- title: 'A Theory of Heuristic Learnability'
  abstract: 'Which concepts can we learn efficiently on average? In this paper, we investigate the capability of a natural average-case learning framework, heuristic PAC (heurPAC) learning, to answer this and some other related questions. Roughly speaking, we say that a concept class is heurPAC learnable if there exists a learning algorithm that given $n, s\in\mathbb{N}$ and $\epsilon,\delta,\eta\in(0,1]$ as input, learns all but an $\eta$ fraction of $n$-input target functions represented as $s$-bit strings in the class from passively collected examples and then outputs an $\epsilon$-close hypothesis with failure probability at most $\delta$ in time polynomial in $n$, $s$, $\epsilon^{-1}$, $\delta^{-1}$, and $\eta^{-1}$, where each example is generated according to some example distribution.  First, we establish a positive learnability result. Specifically, we show that a simple Fourier-based algorithm heurPAC learns $\Omega(\log n)$-junta functions on the uniform distribution, which is a central open question in the original PAC learning model. Our technical contribution is to introduce the notion of elusive functions that captures hard-to-learn cases and to establish a polynomial relation between the running time and the fraction of such elusive functions. Second, we present clear relations between heurPAC learnability and cryptography. Particularly, we show that for any efficiently evaluated class $\mathscr{C}$, (1) if $\mathscr{C}$ is not heurPAC learnable, then an auxiliary-input one-way function (AIOWF) exists; (2) if $\mathscr{C}$ is not heurPAC learnable on the uniform distribution, then an infinitely-often one-way function (io-OWF) exists. As a corollary, we also present new characterizations for AIOWF and io-OWF based on heurPAC learnability, which are conceptually stronger than the previous ones based on average-case learnability for fixed parameters.
These results show that our framework might yield heuristic learners with theoretical guarantees for broader classes than the usual PAC learning framework, and any efficiently evaluated class has a potential for such a heuristic learner or a secure cryptographic primitive. Through this paper, we suggest further research toward the win-win “learning vs. cryptography” paradigm.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/nanashima21a.html
  PDF: http://proceedings.mlr.press/v134/nanashima21a/nanashima21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-nanashima21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mikito
    family: Nanashima
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3483-3525
  id: nanashima21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3483
  lastpage: 3525
  published: 2021-07-21 00:00:00 +0000
- title: 'Information-Theoretic Generalization Bounds for Stochastic Gradient Descent'
  abstract: 'We study the generalization properties of the popular stochastic optimization method known as stochastic gradient descent (SGD) for optimizing general non-convex loss functions. Our main contribution is providing upper bounds on the generalization error that depend on local statistics of the stochastic gradients evaluated along the path of iterates calculated by SGD. The key factors our bounds depend on are the variance of the gradients (with respect to the data distribution), the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output. Our key technical tool is combining the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/neu21a.html
  PDF: http://proceedings.mlr.press/v134/neu21a/neu21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-neu21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gergely
    family: Neu
  - given: Gintare Karolina
    family: Dziugaite
  - given: Mahdi
    family: Haghifam
  - given: Daniel M.
    family: Roy
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3526-3545
  id: neu21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3526
  lastpage: 3545
  published: 2021-07-21 00:00:00 +0000
- title: 'It was “all” for “nothing”: sharp phase transitions for noiseless discrete channels'
  abstract: 'We prove a phase transition known as the “all-or-nothing” phenomenon for noiseless discrete channels. This class of models includes the Bernoulli group testing model and the planted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenon for such models was only known in a limited range of parameters. Our work extends the results to all signals with sublinear sparsity.  Our main technique is to show that for such models, the “all” half of all-or-nothing implies the “nothing” half, so that a proof of “all” can be turned into a proof of “nothing.” Since the “all” half can often be proven by straightforward means, our equivalence gives a powerful and general approach towards establishing the existence of this phenomenon in other contexts.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/nilesweed21a.html
  PDF: http://proceedings.mlr.press/v134/nilesweed21a/nilesweed21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-nilesweed21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jonathan
    family: Niles-Weed
  - given: Ilias
    family: Zadik
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3546-3547
  id: nilesweed21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3546
  lastpage: 3547
  published: 2021-07-21 00:00:00 +0000
- title: 'SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality'
  abstract: 'We propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both number of samples and dimensions are large. This framework applies, for any fixed stepsize, to the least squares problem with random data (finite-sum). Using this new framework, we show that the dynamics of SGD become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which is also verified experimentally. Finally, when input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates (i.e., the complexity of an algorithm averaged over all possible inputs). These rates show significant improvement over classical worst-case complexities.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/paquette21a.html
  PDF: http://proceedings.mlr.press/v134/paquette21a/paquette21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-paquette21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Courtney
    family: Paquette
  - given: Kiwon
    family: Lee
  - given: Fabian
    family: Pedregosa
  - given: Elliot
    family: Paquette
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3548-3626
  id: paquette21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3548
  lastpage: 3626
  published: 2021-07-21 00:00:00 +0000
- title: 'Provable Memorization via Deep Neural Networks using Sub-linear Parameters'
  abstract: 'It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $O(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width 3) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of works on the benefits of depth for function approximation. We also provide empirical results that support our theoretical findings.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/park21a.html
  PDF: http://proceedings.mlr.press/v134/park21a/park21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-park21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sejun
    family: Park
  - given: Jaeho
    family: Lee
  - given: Chulhee
    family: Yun
  - given: Jinwoo
    family: Shin
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3627-3661
  id: park21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3627
  lastpage: 3661
  published: 2021-07-21 00:00:00 +0000
- title: 'Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle'
  abstract: 'Motivated by applications in crowdsourced entity resolution in databases, signed edge prediction in social networks, and correlation clustering, Mazumdar and Saha [NIPS 2017] proposed an elegant theoretical model for studying clustering with a faulty oracle. In this model, given a set of $n$ items which belong to $k$ unknown groups (or clusters), our goal is to recover the clusters by asking pairwise queries to an oracle. This oracle can answer queries of the form “do items $u$ and $v$ belong to the same cluster?”. However, the answer to each pairwise query errs with probability $\epsilon$, for some $\epsilon\in(0,\frac12)$. Mazumdar and Saha provided two algorithms under this model: one algorithm is query-optimal but time-inefficient (i.e., running in quasi-polynomial time), while the other is time-efficient (i.e., running in polynomial time) but query-suboptimal. Larsen, Mitzenmacher and Tsourakakis [WWW 2020] then gave a new time-efficient algorithm for the special case of $2$ clusters, which is query-optimal if the bias $\delta:=1-2\epsilon$ of the model is large. It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters and other regimes of $\delta$.  In this paper, we make progress on the above question and provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when information-theoretic recovery is possible. Our algorithm is built on a connection to the stochastic block model.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/peng21a.html
  PDF: http://proceedings.mlr.press/v134/peng21a/peng21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-peng21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Pan
    family: Peng
  - given: Jiapeng
    family: Zhang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3662-3680
  id: peng21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3662
  lastpage: 3680
  published: 2021-07-21 00:00:00 +0000
- title: 'Towards a Dimension-Free Understanding of Adaptive Linear Control'
  abstract: 'We study the problem of adaptive control of the linear quadratic regulator for systems in very high, or even infinite dimension. We demonstrate that while sublinear regret requires finite dimensional inputs, the ambient state dimension of the system need not be bounded in order to perform online control. We provide the first regret bounds for LQR which hold for infinite dimensional systems, replacing dependence on ambient dimension with more natural notions of problem complexity. Our guarantees arise from a novel perturbation bound for certainty equivalence which scales with the prediction error in estimating the system parameters, without requiring consistent parameter recovery in more stringent measures like the operator norm. When specialized to finite dimensional settings, our bounds recover near optimal dimension and time horizon dependence.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/perdomo21a.html
  PDF: http://proceedings.mlr.press/v134/perdomo21a/perdomo21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-perdomo21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Juan C
    family: Perdomo
  - given: Max
    family: Simchowitz
  - given: Alekh
    family: Agarwal
  - given: Peter
    family: Bartlett
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3681-3770
  id: perdomo21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3681
  lastpage: 3770
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning from Censored and Dependent Data: The case of Linear Dynamics'
  abstract: 'Observations from dynamical systems often exhibit irregularities, such as censoring, where values are recorded only if they fall within a certain range. Censoring is ubiquitous in practice, due to saturating sensors, limit-of-detection effects, and image frame effects, and, combined with the temporal dependencies within the data, it makes the task of system identification particularly challenging.  In light of recent developments on learning linear dynamical systems (LDSs), and on censored statistics with independent data, we revisit the decades-old problem of learning an LDS from censored observations (Lee and Maddala (1985), Zeger and Brookmeyer (1986)). Here, the learner observes the state $x_t \in \mathbb{R}^d$ if and only if $x_t$ belongs to some set $S_t \subseteq \mathbb{R}^d$. We develop the first computationally and statistically efficient algorithm for learning the system, assuming only oracle access to the sets $S_t$. Our algorithm, Stochastic Online Newton with Switching Gradients, is a novel second-order method that builds on the Online Newton Step (ONS) of Hazan et al. (2007). Our Switching-Gradient scheme does not always use (stochastic) gradients of the function we want to optimize, which we call the censor-aware function. Instead, in each iteration, it performs a simple test to decide whether to use the censor-aware function or another, censor-oblivious function for obtaining a stochastic gradient.  In our analysis, we consider a “generic” Online Newton method, which uses arbitrary vectors instead of gradients, and we prove an error bound for it. This can be used to appropriately design these vectors, leading to our Switching-Gradient scheme. This framework significantly deviates from the recent long line of works on censored statistics (e.g., Daskalakis et al. (2018), Kontonis et al. (2019), Daskalakis et al. (2019)), which apply Stochastic Gradient Descent (SGD) and whose analyses reduce to establishing conditions for off-the-shelf SGD bounds.
Our approach enables us to relax these conditions and gives rise to phenomena that might appear counterintuitive given the previous works. Specifically, our method makes progress even when the current “survival probability” is exponentially small. We believe that our analysis framework will have applications in more settings where the data are subject to censoring.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/plevrakis21a.html
  PDF: http://proceedings.mlr.press/v134/plevrakis21a/plevrakis21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-plevrakis21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Orestis
    family: Plevrakis
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3771-3787
  id: plevrakis21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3771
  lastpage: 3787
  published: 2021-07-21 00:00:00 +0000
- title: 'Adaptive Discretization for Adversarial Lipschitz Bandits'
  abstract: 'Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces such as the [0,1] interval, where similar actions are guaranteed to have similar rewards. A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions thereof. The goal is to take advantage of "nicer" problem instances, while retaining near-optimal worst-case performance. While the stochastic version of the problem is well-understood, the general version with adversarial rewards is not. We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds. In particular, we recover the worst-case optimal regret bound for the adversarial version, and the instance-dependent regret bound for the stochastic version.  A version with full proofs (and additional results) appears at arxiv.org/abs/2006.12367v2.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/podimata21a.html
  PDF: http://proceedings.mlr.press/v134/podimata21a/podimata21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-podimata21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chara
    family: Podimata
  - given: Alex
    family: Slivkins
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3788-3805
  id: podimata21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3788
  lastpage: 3805
  published: 2021-07-21 00:00:00 +0000
- title: 'Exponential savings in agnostic active learning through abstention'
  abstract: 'We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying a price marginally smaller than the average loss of a random guess (1/2), exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under model misspecification.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/puchkin21a.html
  PDF: http://proceedings.mlr.press/v134/puchkin21a/puchkin21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-puchkin21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nikita
    family: Puchkin
  - given: Nikita
    family: Zhivotovskiy
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3806-3832
  id: puchkin21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3806
  lastpage: 3832
  published: 2021-07-21 00:00:00 +0000
- title: 'Exponential Weights Algorithms for Selective Learning'
  abstract: 'We study the selective learning problem introduced by Qiao and Valiant (2019), in which the learner observes $n$ labeled data points one at a time. At a time of its choosing, the learner selects a window length $w$ and a model $\hat\ell$ from the model class $\mathcal{L}$, and then labels the next $w$ data points using $\hat\ell$. The \emph{excess risk} incurred by the learner is defined as the difference between the average loss of $\hat\ell$ over those $w$ data points and the smallest possible average loss among all models in $\mathcal{L}$ over those $w$ data points.  We give an improved algorithm, termed the \emph{hybrid exponential weights} algorithm, that achieves an expected excess risk of $O((\log\log|\mathcal{L}| + \log\log n)/\log n)$. This result gives a doubly exponential improvement in the dependence on $|\mathcal{L}|$ over the best known bound of $O(\sqrt{|\mathcal{L}|/\log n})$. We complement the positive result with an almost matching lower bound, which suggests the worst-case optimality of the algorithm.  We also study a more restrictive family of learning algorithms that are \emph{bounded-recall} in the sense that when a prediction window of length $w$ is chosen, the learner’s decision only depends on the most recent $w$ data points. We analyze an exponential weights variant of the ERM algorithm in Qiao and Valiant (2019). This new algorithm achieves an expected excess risk of $O(\sqrt{\log |\mathcal{L}|/\log n})$, which is shown to be nearly optimal among all bounded-recall learners. Our analysis builds on a generalized version of the selective mean prediction problem in Drucker (2013); Qiao and Valiant (2019), which may be of independent interest.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/qiao21a.html
  PDF: http://proceedings.mlr.press/v134/qiao21a/qiao21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-qiao21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mingda
    family: Qiao
  - given: Gregory
    family: Valiant
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3833-3858
  id: qiao21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3833
  lastpage: 3858
  published: 2021-07-21 00:00:00 +0000
- title: 'Average-Case Communication Complexity of Statistical Problems'
  abstract: 'We study statistical problems, such as planted clique, its variants, and sparse principal component analysis in the context of average-case communication complexity. Our motivation is to understand the statistical-computational trade-offs in streaming, sketching, and query-based models. Communication complexity is the main tool for proving lower bounds in these models, yet many prior results do not hold in an average-case setting. We provide a general reduction method that preserves the input distribution for problems involving a random graph or matrix with planted structure. Then, we derive two-party and multi-party communication lower bounds for detecting or finding planted cliques, bipartite cliques, and related problems. As a consequence, we obtain new bounds on the query complexity in the edge-probe, vector-matrix-vector, matrix-vector, linear sketching, and $\mathbb{F}_2$-sketching models. Many of these results are nearly tight, and we use our techniques to provide simple proofs of some known lower bounds for the edge-probe model.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/rashtchian21a.html
  PDF: http://proceedings.mlr.press/v134/rashtchian21a/rashtchian21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-rashtchian21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cyrus
    family: Rashtchian
  - given: David
    family: Woodruff
  - given: Peng
    family: Ye
  - given: Hanlin
    family: Zhu
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3859-3886
  id: rashtchian21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3859
  lastpage: 3886
  published: 2021-07-21 00:00:00 +0000
- title: 'Learning to Stop with Surprisingly Few Samples'
  abstract: 'We consider a discounted infinite horizon optimal stopping problem. If the underlying distribution is known a priori, the solution of this problem is obtained via dynamic programming (DP) and is given by a well-known threshold rule. When information on this distribution is lacking, a natural (though naive) approach is “explore-then-exploit,” whereby the unknown distribution or its parameters are estimated over an initial exploration phase, and this estimate is then used in the DP to determine actions over the residual exploitation phase. We show: (i) with proper tuning, this approach leads to performance comparable to the full-information DP solution; and (ii) despite common wisdom on the sensitivity of such “plug-in” approaches in DP due to propagation of estimation errors, a surprisingly “short” (logarithmic in the horizon) exploration horizon suffices to obtain said performance. In cases where the underlying distribution is heavy-tailed, these observations are even more pronounced: a single-sample exploration phase suffices.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/russo21a.html
  PDF: http://proceedings.mlr.press/v134/russo21a/russo21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-russo21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Daniel
    family: Russo
  - given: Assaf
    family: Zeevi
  - given: Tianyi
    family: Zhang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3887-3888
  id: russo21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3887
  lastpage: 3888
  published: 2021-07-21 00:00:00 +0000
- title: 'The Effects of Mild Over-parameterization on the Optimization Landscape of Shallow ReLU Neural Networks'
  abstract: 'We study the effects of mild over-parameterization on the optimization landscape of a simple ReLU neural network of the form $\mathbf{x}\mapsto\sum_{i=1}^k\max\{0,\mathbf{w}_i^{\top}\mathbf{x}\}$, in a well-studied teacher-student setting where the target values are generated by the same architecture, and when directly optimizing over the population squared loss with respect to Gaussian inputs. We prove that while the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons, it is not even \emph{locally convex} after any amount of over-parameterization. Moreover, related desirable properties (e.g., one-point strong convexity and the Polyak-{Ł}ojasiewicz condition) also do not hold even locally. On the other hand, we establish that the objective remains one-point strongly convex in \emph{most} directions (suitably defined), and show an optimization guarantee under this property. For the non-global minima, we prove that adding even just a single neuron will turn a non-global minimum into a saddle point. This holds under some technical conditions which we validate empirically.  These results provide a possible explanation for why recovering a global minimum becomes significantly easier when we over-parameterize, even if the amount of over-parameterization is very moderate.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/safran21a.html
  PDF: http://proceedings.mlr.press/v134/safran21a/safran21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-safran21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Itay M
    family: Safran
  - given: Gilad
    family: Yehudai
  - given: Ohad
    family: Shamir
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3889-3934
  id: safran21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3889
  lastpage: 3934
  published: 2021-07-21 00:00:00 +0000
- title: 'Almost sure convergence rates for Stochastic Gradient Descent and Stochastic Heavy Ball'
  abstract: 'We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, otherwise known as the momentum method) for the general stochastic approximation problem.  For SGD, in the convex and smooth setting, we provide the first \emph{almost sure} asymptotic convergence \emph{rates} for a weighted average of the iterates. More precisely, we show that the convergence rate of the function values is arbitrarily close to $o(1/\sqrt{k})$, and is exactly $o(1/k)$ in the so-called overparametrized case. We show that these results still hold when using a decreasing step size version of stochastic line search and stochastic Polyak stepsizes, thereby giving the first proof of convergence of these methods in the non-overparametrized regime.  Using a substantially different analysis, we show that these rates hold for SHB as well, but at the last iterate. This distinction is important because it is the last iterate of SGD and SHB which is used in practice. We also show that the last iterate of SHB converges to a minimizer \emph{almost surely}. Additionally, we prove that the function values of the deterministic HB converge at a $o(1/k)$ rate, which is faster than the previously known $O(1/k)$.  Finally, in the nonconvex setting, we prove similar rates on the lowest gradient norm along the trajectory of SGD.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/sebbouh21a.html
  PDF: http://proceedings.mlr.press/v134/sebbouh21a/sebbouh21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-sebbouh21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Othmane
    family: Sebbouh
  - given: Robert M
    family: Gower
  - given: Aaron
    family: Defazio
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3935-3971
  id: sebbouh21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3935
  lastpage: 3971
  published: 2021-07-21 00:00:00 +0000
- title: 'Lazy OCO: Online Convex Optimization on a Switching Budget'
  abstract: 'We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary.  In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d. losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings.  Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/sherman21a.html
  PDF: http://proceedings.mlr.press/v134/sherman21a/sherman21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-sherman21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Uri
    family: Sherman
  - given: Tomer
    family: Koren
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3972-3988
  id: sherman21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3972
  lastpage: 3988
  published: 2021-07-21 00:00:00 +0000
- title: 'Johnson-Lindenstrauss Transforms with Best Confidence'
  abstract: 'The seminal result of Johnson and Lindenstrauss on random embeddings has been intensively studied in applied and theoretical computer science. Despite that vast body of literature, we still lack a complete understanding of the statistical properties of random projections; a particularly intriguing question is: why are the theoretical bounds so far behind the empirically observed performance? Motivated by this question, this work develops Johnson-Lindenstrauss distributions with optimal, data-oblivious, statistical confidence bounds. These bounds are numerically best possible, for any given data dimension, embedding dimension, and distortion tolerance. They improve upon prior works in terms of statistical accuracy, and exactly determine the no-go regimes for data-oblivious approaches. Furthermore, the projection matrices are efficiently samplable.  The construction relies on orthogonal matrices, and the proof uses certain elegant properties of the unit sphere. In particular, the following techniques introduced in this work are of independent interest: a) a compact expression for the projection distortion in terms of singular eigenvalues of the projection matrix, b) a parametrization linking the unit sphere and the Dirichlet distribution and c) anti-concentration bounds for the Dirichlet distribution.  Besides the technical contribution, the paper presents applications and numerical evaluation along with a working implementation in Python (shared as a GitHub repository).'
  volume: 134
  URL: https://proceedings.mlr.press/v134/skorski21a.html
  PDF: http://proceedings.mlr.press/v134/skorski21a/skorski21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-skorski21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Maciej
    family: Skorski
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 3989-4007
  id: skorski21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 3989
  lastpage: 4007
  published: 2021-07-21 00:00:00 +0000
- title: 'Efficient Bandit Convex Optimization: Beyond Linear Losses'
  abstract: 'We study the problem of online learning with bandit feedback, where a learner aims to minimize a sequence of adversarially generated loss functions, while only observing the value of each function at a single point. When the loss functions chosen by the adversary are convex and quadratic, we develop a new algorithm which achieves the optimal regret rate of $O(T^{1/2})$. Furthermore, our algorithm satisfies three important desiderata: (a) it is practical and can be efficiently implemented for high dimensional problems, (b) the regret bound holds with high probability even against adaptive adversaries whose decisions can depend on the learner’s previous actions, and (c) it is robust to model mis-specification; that is, the regret bound degrades gracefully when the loss functions deviate from convex quadratics.  To the best of our knowledge, ours is the first algorithm for bandit convex optimization with quadratic losses which is efficiently implementable and achieves optimal regret guarantees. Existing algorithms for this problem either have sub-optimal regret guarantees or are computationally expensive and do not scale well to high-dimensional problems.  Our algorithm is a bandit version of the classical regularized Newton’s method. It involves estimation of gradients and Hessians of the loss functions from single point feedback. A key caveat of working with Hessian estimates however is that they typically have large variance. In this work, we show that it is nonetheless possible to finesse this caveat by relying on the idea of “focus regions”, where we restrict the algorithm to choose actions from a subset of the action space in which the variance of our estimates can be controlled.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/suggala21a.html
  PDF: http://proceedings.mlr.press/v134/suggala21a/suggala21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-suggala21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Arun Sai
    family: Suggala
  - given: Pradeep
    family: Ravikumar
  - given: Praneeth
    family: Netrapalli
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4008-4067
  id: suggala21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4008
  lastpage: 4067
  published: 2021-07-21 00:00:00 +0000
- title: 'On Empirical Bayes Variational Autoencoder: An Excess Risk Bound'
  abstract: 'In this paper, we consider variational autoencoders (VAE) via empirical Bayes estimation, referred to as Empirical Bayes Variational Autoencoders (EBVAE), which is a general framework including popular VAE methods as special cases. Despite the widespread use of VAE, its theoretical aspects are less explored in the literature. Motivated by this, we establish a general theoretical framework for analyzing the excess risk associated with EBVAE under the setting of density estimation, covering both parametric and nonparametric cases, through the lens of M-estimation. As an application, we analyze the excess risk of the commonly-used EBVAE with Gaussian models and highlight the importance of covariance matrices of Gaussian encoders and decoders in obtaining a good statistical guarantee, shedding light on the empirical observations reported in the literature.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/tang21a.html
  PDF: http://proceedings.mlr.press/v134/tang21a/tang21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-tang21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Rong
    family: Tang
  - given: Yun
    family: Yang
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4068-4125
  id: tang21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4068
  lastpage: 4125
  published: 2021-07-21 00:00:00 +0000
- title: 'Machine Unlearning via Algorithmic Stability'
  abstract: 'We study the problem of machine unlearning and identify a notion of algorithmic stability, Total Variation (TV) stability, which we argue is suitable for the goal of exact unlearning. For convex risk minimization problems, we design TV-stable algorithms based on noisy Stochastic Gradient Descent (SGD). Our key contribution is the design of corresponding efficient unlearning algorithms, which are based on constructing a near-maximal coupling of Markov chains for the noisy SGD procedure. To understand the trade-offs between accuracy and unlearning efficiency, we give upper and lower bounds on the excess empirical and population risk of TV-stable algorithms for convex risk minimization. Our techniques generalize to arbitrary non-convex functions, and our algorithms are differentially private as well.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/ullah21a.html
  PDF: http://proceedings.mlr.press/v134/ullah21a/ullah21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-ullah21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Enayat
    family: Ullah
  - given: Tung
    family: Mai
  - given: Anup
    family: Rao
  - given: Ryan A.
    family: Rossi
  - given: Raman
    family: Arora
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4126-4142
  id: ullah21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4126
  lastpage: 4142
  published: 2021-07-21 00:00:00 +0000
- title: 'A Dimension-free Computational Upper-bound for Smooth Optimal Transport Estimation'
  abstract: 'It is well-known that plug-in statistical estimation of optimal transport suffers from the curse of dimensionality. Despite recent efforts to improve the rate of estimation with the smoothness of the problem, the computational complexity of these recently proposed methods still degrade exponentially with the dimension. In this paper, thanks to an infinite-dimensional sum-of-squares representation, we derive a statistical estimator of smooth optimal transport which achieves a precision $\varepsilon$ from $\tilde{O}(\varepsilon^{-2})$ independent and identically distributed samples from the distributions, for a computational cost of $\tilde{O}(\varepsilon^{-4})$ when the smoothness increases, hence yielding dimension-free statistical \emph{and} computational rates, with potentially exponentially dimension-dependent constants.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/vacher21a.html
  PDF: http://proceedings.mlr.press/v134/vacher21a/vacher21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-vacher21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Adrien
    family: Vacher
  - given: Boris
    family: Muzellec
  - given: Alessandro
    family: Rudi
  - given: Francis
    family: Bach
  - given: Francois-Xavier
    family: Vialard
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4143-4173
  id: vacher21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4143
  lastpage: 4173
  published: 2021-07-21 00:00:00 +0000
- title: 'Robust Online Convex Optimization in the Presence of Outliers'
  abstract: 'We consider online convex optimization when a number k of data points are outliers that may be corrupted. We model this by introducing the notion of robust regret, which measures the regret only on rounds that are not outliers. The aim for the learner is to achieve small robust regret, without knowing where the outliers are. If the outliers are chosen adversarially, we show that a simple filtering strategy on extreme gradients incurs O(k) overhead compared to the usual regret bounds, and that this is unimprovable, which means that k needs to be sublinear in the number of rounds. We further ask which additional assumptions would allow for a linear number of outliers. It turns out that the usual benign cases of independently, identically distributed (i.i.d.) observations or strongly convex losses are not sufficient. However, combining i.i.d. observations with the assumption that outliers are those observations that are in an extreme quantile of the distribution, does lead to sublinear robust regret, even though the expected number of outliers is linear.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/vanerven21a.html
  PDF: http://proceedings.mlr.press/v134/vanerven21a/vanerven21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-vanerven21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tim
    family: van Erven
  - given: Sarah
    family: Sachs
  - given: Wouter M
    family: Koolen
  - given: Wojciech
    family: Kotlowski
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4174-4194
  id: vanerven21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4174
  lastpage: 4194
  published: 2021-07-21 00:00:00 +0000
- title: 'Size and Depth Separation in Approximating Benign Functions with Neural Networks'
  abstract: 'When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions.  However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially-bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these conditions “benign”, and explore the benefits of size and depth for approximation of benign functions with ReLU networks.  As we show, this problem is more challenging than the corresponding problem for non-benign functions.  We give complexity-theoretic barriers to showing depth-lower-bounds: Proving existence of a benign function that cannot be approximated by polynomial-sized networks of depth $4$ would settle longstanding open problems in computational complexity. It implies that beyond depth $4$ there is a barrier to showing depth-separation for benign functions, even between networks of constant depth and networks of nonconstant depth.  We also study size-separation, namely, whether there are benign functions that can be approximated with networks of size $O(s(d))$, but not with networks of size $O(s’(d))$. We show a complexity-theoretic barrier to proving such results beyond size $O(d\log^2(d))$, but also show an explicit benign function, that can be approximated with networks of size $O(d)$ and not with networks of size $o(d/\log d)$. For approximation in the $L_\infty$ sense we achieve such separation already between size $O(d)$ and size $o(d)$.  Moreover, we show superpolynomial size lower bounds and barriers to such lower bounds, depending on the assumptions on the function.  Our size-separation results rely on an analysis of size lower bounds for Boolean functions, which is of independent interest: We show linear size lower bounds for computing explicit Boolean functions (such as set disjointness) with neural networks and threshold circuits.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/vardi21a.html
  PDF: http://proceedings.mlr.press/v134/vardi21a/vardi21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-vardi21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gal
    family: Vardi
  - given: Daniel
    family: Reichman
  - given: Toniann
    family: Pitassi
  - given: Ohad
    family: Shamir
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4195-4223
  id: vardi21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4195
  lastpage: 4223
  published: 2021-07-21 00:00:00 +0000
- title: 'Implicit Regularization in ReLU Networks with the Square Loss'
  abstract: 'Understanding the implicit regularization (or implicit bias) of gradient descent has recently been a very active research area. However, the implicit regularization in nonlinear neural networks is still poorly understood, especially for regression losses such as the square loss. Perhaps surprisingly, we prove that even for a single ReLU neuron, it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters (although on the positive side, we show it can be characterized approximately). For one hidden-layer networks, we prove a similar result, where in general it is impossible to characterize implicit regularization properties in this manner, except for the “balancedness” property identified in Du et al. (2018). Our results suggest that a more general framework than the one considered so far may be needed to understand implicit regularization for nonlinear predictors, and provide some clues on what this framework should be.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/vardi21b.html
  PDF: http://proceedings.mlr.press/v134/vardi21b/vardi21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-vardi21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gal
    family: Vardi
  - given: Ohad
    family: Shamir
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4224-4258
  id: vardi21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4224
  lastpage: 4258
  published: 2021-07-21 00:00:00 +0000
- title: 'Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games'
  abstract: 'We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent’s best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), symmetric (players taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/wei21a.html
  PDF: http://proceedings.mlr.press/v134/wei21a/wei21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-wei21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chen-Yu
    family: Wei
  - given: Chung-Wei
    family: Lee
  - given: Mengxiao
    family: Zhang
  - given: Haipeng
    family: Luo
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4259-4299
  id: wei21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4259
  lastpage: 4299
  published: 2021-07-21 00:00:00 +0000
- title: 'Non-stationary Reinforcement Learning without Prior Knowledge: an Optimal Black-box Approach'
  abstract: 'We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment, importantly without any prior knowledge on the degree of non-stationarity.  By plugging different algorithms into our black-box, we provide a list of examples showing that our approach not only recovers recent results for (contextual) multi-armed bandits achieved by very specialized algorithms, but also significantly improves the state of the art for (generalized) linear bandits, episodic MDPs, and infinite-horizon MDPs in various ways. Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/wei21b.html
  PDF: http://proceedings.mlr.press/v134/wei21b/wei21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-wei21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chen-Yu
    family: Wei
  - given: Haipeng
    family: Luo
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4300-4354
  id: wei21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4300
  lastpage: 4354
  published: 2021-07-21 00:00:00 +0000
- title: 'On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function'
  abstract: 'We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides a restricted, “local” access to the MDP: The planner can ask for random transitions from previously returned states and arbitrary actions, and the features are also only accessible for the states that are encountered in this process. As opposed to previous work (e.g. Lattimore et al. (2020)) where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem when the action-value function of the optimal policy is linearly realizable requires an exponential number of queries, either in $H$ (the horizon of the MDP) or $d$ (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work, we establish that $\poly(H,d)$ planning is possible with state value function realizability whenever the action set has a constant size. In particular, we present the TensorPlan algorithm which uses $\poly((dH/\delta)^A)$ simulator queries to find a $\delta$-optimal policy relative to any deterministic policy for which the value function is linearly realizable with some bounded parameter (with a known bound). This is the first algorithm to give a polynomial query complexity guarantee using only linear-realizability of a single competing value function. Whether the computation cost is similarly bounded remains an interesting open question. We also extend the upper bound to the near-realizable case and to the infinite-horizon discounted MDP setup. The upper bounds are complemented by a lower bound which states that in the infinite-horizon episodic setting, planners that achieve constant suboptimality need exponentially many queries, either in the dimension or the number of actions.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/weisz21a.html
  PDF: http://proceedings.mlr.press/v134/weisz21a/weisz21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-weisz21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gellert
    family: Weisz
  - given: Philip
    family: Amortila
  - given: Barnabás
    family: Janzer
  - given: Yasin
    family: Abbasi-Yadkori
  - given: Nan
    family: Jiang
  - given: Csaba
    family: Szepesvari
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4355-4385
  id: weisz21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4355
  lastpage: 4385
  published: 2021-07-21 00:00:00 +0000
- title: 'The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication'
  abstract: 'We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/woodworth21a.html
  PDF: http://proceedings.mlr.press/v134/woodworth21a/woodworth21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-woodworth21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Blake E
    family: Woodworth
  - given: Brian
    family: Bullins
  - given: Ohad
    family: Shamir
  - given: Nathan
    family: Srebro
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4386-4437
  id: woodworth21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4386
  lastpage: 4437
  published: 2021-07-21 00:00:00 +0000
- title: 'Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap'
  abstract: 'This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB),  which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions that are not dominated by any other actions. We show when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{min}}\right)$ regret due to over-exploration where $\Delta_{min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$  regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing the dependency on $\frac{|Z_{mul}|}{\Delta_{min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/xu21a.html
  PDF: http://proceedings.mlr.press/v134/xu21a/xu21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-xu21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Haike
    family: Xu
  - given: Tengyu
    family: Ma
  - given: Simon
    family: Du
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4438-4472
  id: xu21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4438
  lastpage: 4472
  published: 2021-07-21 00:00:00 +0000
- title: 'Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation'
  abstract: 'Policy optimization methods are popular reinforcement learning (RL) algorithms, because their incremental and on-policy nature makes them more stable than the value-based counterparts. However, the same properties also make them slow to converge and sample inefficient, as the on-policy nature is not amenable to data reuse and the incremental updates result in a large iteration complexity before a near optimal policy is discovered. These empirical findings are mirrored in theory in the recent work of Agarwal et al. (2020), which provides a policy optimization method PC-PG that provably finds near optimal policies in the linear MDP model and is robust to model misspecification, but suffers from an extremely poor sample complexity compared with value-based techniques.  In this paper, we propose a new algorithm, COPOE, that overcomes the poor sample complexity of PC-PG while retaining the robustness to model misspecification.  COPOE makes several important algorithmic enhancements to PC-PG, such as enabling data reuse, and uses more refined analysis techniques, which we expect to be more broadly applicable to designing new RL algorithms.  The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PC-PG to $\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with value-based techniques.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/zanette21a.html
  PDF: http://proceedings.mlr.press/v134/zanette21a/zanette21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-zanette21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Andrea
    family: Zanette
  - given: Ching-An
    family: Cheng
  - given: Alekh
    family: Agarwal
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4473-4525
  id: zanette21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4473
  lastpage: 4525
  published: 2021-07-21 00:00:00 +0000
- title: 'Improved Algorithms for Efficient Active Learning Halfspaces with Massart and Tsybakov Noise'
  abstract: 'We give a computationally-efficient PAC active learning algorithm for $d$-dimensional homogeneous halfspaces that can tolerate Massart noise (Massart and Nedelec, 2006) and Tsybakov noise (Tsybakov, 2004). Specialized to the $\eta$-Massart noise setting, our algorithm achieves an information-theoretically near-optimal label complexity of $\tilde{O}\left( \frac{d}{(1-2\eta)^2} \mathrm{polylog}(\frac1\epsilon) \right)$ under a wide range of unlabeled data distributions (specifically, the family of “structured distributions” defined in Diakonikolas et al. (2020)). Under the more challenging Tsybakov noise condition, we identify two subfamilies of noise conditions, under which our efficient algorithm provides label complexity guarantees strictly lower than passive learning algorithms.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/zhang21a.html
  PDF: http://proceedings.mlr.press/v134/zhang21a/zhang21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-zhang21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chicheng
    family: Zhang
  - given: Yinan
    family: Li
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4526-4527
  id: zhang21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4526
  lastpage: 4527
  published: 2021-07-21 00:00:00 +0000
- title: 'Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon'
  abstract: 'Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity.  We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, total reward bounded by $1$, where the agent plays for $K$ episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. The new bonus only requires tweaking the constants to ensure optimism and thus is significantly simpler than existing bonus constructions. We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \poly\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of contextual bandits. Notably, this result 1) exponentially improves the state-of-the-art polynomial-time algorithms by Dann et al. [2019], Zanette et al. [2019], and Zhang et al. [2020] in terms of the dependency on $H$, and 2) exponentially improves the running time in [Wang et al. 2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/zhang21b.html
  PDF: http://proceedings.mlr.press/v134/zhang21b/zhang21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-zhang21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zihan
    family: Zhang
  - given: Xiangyang
    family: Ji
  - given: Simon
    family: Du
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4528-4531
  id: zhang21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4528
  lastpage: 4531
  published: 2021-07-21 00:00:00 +0000
- title: 'Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes'
  abstract: 'We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. For the fixed-horizon episodic setting with inhomogeneous transition kernels, we propose a new, computationally efficient algorithm that uses the basis kernels to approximate value functions. We show that the new algorithm, which we call ${\text{UCRL-VTR}^{+}}$, attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the number of basis kernels, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that ${\text{UCRL-VTR}^{+}}$ is minimax optimal up to logarithmic factors. At the core of our results are (1) a weighted least squares estimator for the unknown transitional probability; and (2) a new Bernstein-type concentration inequality for self-normalized vector-valued martingales with bounded increments. Together, these new tools enable tight control of the Bellman error and lead to a nearly minimax regret. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for RL with linear function approximation.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/zhou21a.html
  PDF: http://proceedings.mlr.press/v134/zhou21a/zhou21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-zhou21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dongruo
    family: Zhou
  - given: Quanquan
    family: Gu
  - given: Csaba
    family: Szepesvari
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4532-4576
  id: zhou21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4532
  lastpage: 4576
  published: 2021-07-21 00:00:00 +0000
- title: 'A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network'
  abstract: 'While over-parameterization is widely believed to be crucial for the success of optimization for neural networks, most existing theories on over-parameterization do not fully explain the reason—they either work in the Neural Tangent Kernel regime where neurons don’t move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly over-parameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly over-parameterized two-layer neural networks. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an over-parameterized two-layer neural network will converge to one of the teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is a new characterization of the local optimization landscape—we show the gradient satisfies a special case of the Lojasiewicz property, which is different from local strong convexity or PL conditions used in previous work.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/zhou21b.html
  PDF: http://proceedings.mlr.press/v134/zhou21b/zhou21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-zhou21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mo
    family: Zhou
  - given: Rong
    family: Ge
  - given: Chi
    family: Jin
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4577-4632
  id: zhou21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4577
  lastpage: 4632
  published: 2021-07-21 00:00:00 +0000
- title: 'Benign Overfitting of Constant-Stepsize SGD for Linear Regression'
  abstract: 'There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/zou21a.html
  PDF: http://proceedings.mlr.press/v134/zou21a/zou21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-zou21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Difan
    family: Zou
  - given: Jingfeng
    family: Wu
  - given: Vladimir
    family: Braverman
  - given: Quanquan
    family: Gu
  - given: Sham
    family: Kakade
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4633-4635
  id: zou21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4633
  lastpage: 4635
  published: 2021-07-21 00:00:00 +0000
- title: 'Open Problem: Are all VC-classes CPAC learnable?'
  abstract: 'A few years ago, it was shown that there exist basic statistical learning problems whose learnability cannot be determined within ZFC [Ben-David, Hrubes, Moran, Shpilka, Yehudayoff, 2017]. Such independence, and the implied impossibility of characterizing learnability of a class by any combinatorial parameter, stems from the basic definitions viewing learners as arbitrary functions. That level of generality not only results in unprovability issues but is also problematic from the perspective of modeling practical machine learning, where learners and predictors are computable objects. In light of that, it is natural to consider learnability by algorithms that output computable predictors (both learners and predictors are then representable as finite objects). A recent study [Agarwal, Ananthakrishnan, Ben-David, Lechner and Urner, 2020] initiated a theory of such models of learning. It proposed the notion of CPAC learnability, by adding some basic computability requirements into a PAC learning framework. As a first step towards a characterization of learnability in the CPAC framework, Agarwal et al. showed that CPAC learnability of a binary hypothesis class is not implied by the finiteness of its VC-dimension anymore, as far as proper learners are concerned. A major remaining open question is whether a similar result holds also for improper learning. Namely, does there exist a computable concept class consisting of computable classifiers, that has a finite VC-dimension but no computable learner can PAC learn it (even if the learner is not restricted to output a hypothesis that is a member of the class)? Another implied interesting question concerns coming up with combinatorial characterizations of learnability for computable learners.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/open-problem-agarwal21b.html
  PDF: http://proceedings.mlr.press/v134/agarwal21b/agarwal21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-open-problem-agarwal21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sushant
    family: Agarwal
  - given: Nivasini
    family: Ananthakrishnan
  - given: Shai
    family: Ben-David
  - given: Tosca
    family: Lechner
  - given: Ruth
    family: Urner
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4636-4641
  id: open-problem-agarwal21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4636
  lastpage: 4641
  published: 2021-07-21 00:00:00 +0000
- title: 'Open Problem: Is There an Online Learning Algorithm That Learns Whenever Online Learning Is Possible?'
  abstract: 'The open problem asks whether there exists an online learning algorithm for binary classification that guarantees, for all target concepts, to make a sublinear number of mistakes, under only the assumption that the (possibly random) sequence of points X admits the existence of such a learning algorithm for that sequence.  As a secondary problem, it also asks whether a specific concise condition completely determines whether a given (possibly random) sequence of points X admits the existence of online learning algorithms guaranteeing a sublinear number of mistakes for all target concepts.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/open-problem-hanneke21b.html
  PDF: http://proceedings.mlr.press/v134/hanneke21b/hanneke21b.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-open-problem-hanneke21b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4642-4646
  id: open-problem-hanneke21b
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4642
  lastpage: 4646
  published: 2021-07-21 00:00:00 +0000
- title: 'Open Problem: Tight Online Confidence Intervals for RKHS Elements'
  abstract: 'Confidence intervals are a crucial building block in the analysis of various online learning problems. The analysis of kernel-based bandit and reinforcement learning problems utilizes confidence intervals applicable to the elements of a reproducing kernel Hilbert space (RKHS).  However, the existing confidence bounds do not appear to be tight, resulting in suboptimal regret bounds.  In fact, the existing regret bounds for several kernelized bandit algorithms (e.g., GP-UCB, GP-TS, and their variants) may fail to even be sublinear. It is unclear whether the suboptimal regret bound is a fundamental shortcoming of these algorithms or an artifact of the proof, and the main challenge seems to stem from the online (sequential) nature of the observation points. We formalize the question of online confidence intervals in the RKHS setting and survey the existing results.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/open-problem-vakili21a.html
  PDF: http://proceedings.mlr.press/v134/vakili21a/vakili21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-open-problem-vakili21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sattar
    family: Vakili
  - given: Jonathan
    family: Scarlett
  - given: Tara
    family: Javidi
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4647-4652
  id: open-problem-vakili21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4647
  lastpage: 4652
  published: 2021-07-21 00:00:00 +0000
- title: 'Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?'
  abstract: 'We propose matrix norm inequalities that extend the Recht and Ré (2012) conjecture on a noncommutative AM-GM inequality, by supplementing it with another inequality that accounts for single-shuffle in stochastic finite-sum minimization. Single-shuffle is a popular without-replacement sampling scheme that shuffles only once in the beginning, but has not been studied in the Recht-Ré conjecture and the follow-up literature. Instead of focusing on general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products satisfy a series of spectral norm inequalities that imply “single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD and GD” in special cases.'
  volume: 134
  URL: https://proceedings.mlr.press/v134/open-problem-yun21a.html
  PDF: http://proceedings.mlr.press/v134/yun21a/yun21a.pdf
  edit: https://github.com/mlresearch//v134/edit/gh-pages/_posts/2021-07-21-open-problem-yun21a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Fourth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chulhee
    family: Yun
  - given: Suvrit
    family: Sra
  - given: Ali
    family: Jadbabaie
  editor: 
  - given: Mikhail
    family: Belkin
  - given: Samory
    family: Kpotufe
  page: 4653-4658
  id: open-problem-yun21a
  issued:
    date-parts: 
      - 2021
      - 7
      - 21
  firstpage: 4653
  lastpage: 4658
  published: 2021-07-21 00:00:00 +0000
