- title: 'Conference on Learning Theory 2019: Preface'
  abstract: 'Preface to the proceedings of the 32nd Conference on Learning Theory.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/beygelzimer19a.html
  PDF: http://proceedings.mlr.press/v99/beygelzimer19a/beygelzimer19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-beygelzimer19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1-2
  id: beygelzimer19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1
  lastpage: 2
  published: 2019-06-25 00:00:00 +0000
- title: 'Inference under Information Constraints: Lower Bounds from Chi-Square Contraction'
  abstract: 'Multiple users getting one sample each from an unknown  distribution seek to enable a central server to conduct statistical inference. However, each player can only provide limited amount of information about its sample to the server.  We propose a unified framework to study such distributed inference problems under local information constraints.  We model the local information constraints by a set of channels $\mathcal{W}$: each player chooses a channel from $\mathcal{W}$, and then passes their data through this channel before transmitting the output to the server. The goal in this distributed setting is to understand the blow-up in data requirement imposed by the information constraints, compared to the centralized setting where all data samples are available to the server.  We introduce two notions of \emph{chi-square fluctuations} which provide bounds for the  average distance and the distance to the  average of a local perturbation. When information constraints are imposed, by the standard data-processing inequality, pairwise distances contract and so do our chi-square fluctuations. We provide a precise characterization of this contraction for discrete $k$-ary distributions and use it to obtain to general lower bounds for distribution learning and testing under information constraints. Our results involve  notions of minmax and maxmin chi-square fluctuations, where  the maximum is over the choice of channels and the minimum is over perturbations. The former emerges when considering public-coin protocols for testing and is bounded in terms of Frobenius norm of a positive semidefinite matrix $H$, a function of the channel family $\mathcal{W}$. The latter appears for private-coin protocols and is bounded by the nuclear norm of $H$ which is smaller than its Frobenius norm, establishing a separation between the sample complexity of testing using public and private coins.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/acharya19a.html
  PDF: http://proceedings.mlr.press/v99/acharya19a/acharya19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-acharya19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jayadev
    family: Acharya
  - given: Clément L
    family: Canonne
  - given: Himanshu
    family: Tyagi
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3-17
  id: acharya19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3
  lastpage: 17
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning in Non-convex Games with an Optimization Oracle'
  abstract: 'We consider  online learning in an adversarial, non-convex setting under the assumption that the learner has an access to an offline optimization oracle.  In the general setting of prediction with expert advice, Hazan and Koren established that in the optimization-oracle model, online learning requires exponentially more computation than statistical learning.  In this paper we show that by slightly strengthening the oracle model, the online and the statistical learning models become computationally equivalent. Our result holds for any Lipschitz and bounded (but not necessarily convex) function.  As an application we demonstrate how the offline oracle enables efficient computation of an equilibrium in non-convex games, that include GAN (generative adversarial networks) as a special case.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/agarwal19a.html
  PDF: http://proceedings.mlr.press/v99/agarwal19a/agarwal19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-agarwal19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Naman
    family: Agarwal
  - given: Alon
    family: Gonen
  - given: Elad
    family: Hazan
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 18-29
  id: agarwal19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 18
  lastpage: 29
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning to Prune: Speeding up Repeated Computations'
  abstract: 'Algorithms often must solve sequences of closely related problems. If the algorithm runs a standard procedure with worst-case runtime guarantees on each instance, it will fail to take advantage of valuable structure shared across the problem instances. When a commuter drives from work to home, for example, there are typically only a handful of routes that will ever be the shortest path. A naïve algorithm that does not exploit this common structure may spend most of its time checking roads that will never be in the shortest path. More generally, we can often ignore large swaths of the search space that will likely never contain an optimal solution. We present an algorithm that learns to maximally prune the search space on repeated computations, thereby reducing runtime while provably outputting the correct solution each period with high probability. Our algorithm employs a simple explore-exploit technique resembling those used in online algorithms, though our setting is quite different. We prove that, with respect to our model of pruning search spaces, our approach is optimal up to constant factors. Finally, we illustrate the applicability of our model and algorithm to three classic problems: shortest-path routing, string search, and linear programming. We present experiments confirming that our simple algorithm is effective at significantly reducing the runtime of solving repeated computations.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/alabi19a.html
  PDF: http://proceedings.mlr.press/v99/alabi19a/alabi19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-alabi19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Daniel
    family: Alabi
  - given: Adam Tauman
    family: Kalai
  - given: Katrina
    family: Liggett
  - given: Cameron
    family: Musco
  - given: Christos
    family: Tzamos
  - given: Ellen
    family: Vitercik
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 30-33
  id: alabi19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 30
  lastpage: 33
  published: 2019-06-25 00:00:00 +0000
- title: 'Towards Testing Monotonicity of Distributions Over General Posets'
  abstract: 'In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders.  A distribution $p$ over a poset is {\em monotone} if, for any pair of domain elements $x$ and $y$ such that $x \preceq y$, $p(x) \leq p(y)$. To understand the sample complexity of this problem, we introduce a new property called \emph{bigness} over a finite domain, where the distribution is $T$-big if the minimum probability for any domain element is at least $T$. We establish a lower bound of $\Omega(n/\log n)$ for testing bigness of distributions on  domains of size $n$. We then build on  these lower bounds to give $\Omega(n/\log{n})$ lower bounds for  testing monotonicity over a matching poset of size $n$ and significantly improved lower bounds over the hypercube poset. We give sublinear sample complexity bounds for testing bigness and for testing monotonicity over the matching poset.   We then give a number of tools for analyzing upper bounds on the sample complexity of the monotonicity testing problem. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/aliakbarpour19a.html
  PDF: http://proceedings.mlr.press/v99/aliakbarpour19a/aliakbarpour19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-aliakbarpour19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Maryam
    family: Aliakbarpour
  - given: Themis
    family: Gouleakis
  - given: John
    family: Peebles
  - given: Ronitt
    family: Rubinfeld
  - given: Anak
    family: Yodpinyanee
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 34-82
  id: aliakbarpour19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 34
  lastpage: 82
  published: 2019-06-25 00:00:00 +0000
- title: 'Testing Mixtures of Discrete Distributions'
  abstract: ' There has been significant study on the sample complexity of testing properties of distributions over large domains.  For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size $n$, distinguishing the uniform distribution from distributions that are far from uniform in $\ell_1$-distance uses only $O(\sqrt{n})$ samples.  However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small.  In this case, one must distinguish if samples are coming from a distribution that is $\epsilon$-close to uniform from the case where the distribution is $(1-\epsilon)$-far from uniform.  The latter task requires nearly linear in $n$ samples (Valiant, 2008; Valiant and Valiant, 2017a). In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families.   In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known \emph{a priori}.  Focusing on the identity and closeness testing problems leads to the following mixture testing question:  Given samples of distributions $p, q_1,q_2$, can we test if $p$ is a mixture of $q_1$ and $q_2$?  We consider this general question in various scenarios that differ in terms of how the tester can access the distributions, and show that indeed this problem is more tractable.  Our results  show that the sample complexity of our testers are exactly the same as for the classical non-mixture case. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/aliakbarpour19b.html
  PDF: http://proceedings.mlr.press/v99/aliakbarpour19b/aliakbarpour19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-aliakbarpour19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Maryam
    family: Aliakbarpour
  - given: Ravi
    family: Kumar
  - given: Ronitt
    family: Rubinfeld
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 83-114
  id: aliakbarpour19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 83
  lastpage: 114
  published: 2019-06-25 00:00:00 +0000
- title: 'Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT'
  abstract: 'We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. A crucial intermediate step is proving a non-asymptotic martingale central limit theorem (CLT), i.e., establishing the rates of convergence of a multivariate martingale difference sequence to a normal random vector, which might be of independent interest. We obtain the explicit rates for the multivariate martingale CLT using a combination of Stein?s method and Lindeberg?s argument, which is then used in conjunction with a non-asymptotic analysis of averaged SGD proposed in [PJ92]. Our results have potentially interesting consequences for computing confidence intervals for parameter estimation with SGD and constructing hypothesis tests with SGD that are valid in a non-asymptotic sense'
  volume: 99
  URL: https://proceedings.mlr.press/v99/anastasiou19a.html
  PDF: http://proceedings.mlr.press/v99/anastasiou19a/anastasiou19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-anastasiou19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Andreas
    family: Anastasiou
  - given: Krishnakumar
    family: Balasubramanian
  - given: Murat A.
    family: Erdogdu
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 115-137
  id: anastasiou19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 115
  lastpage: 137
  published: 2019-06-25 00:00:00 +0000
- title: 'Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes'
  abstract: 'We consider the variant of the stochastic multi-armed bandit problem where the stochastic reward distributions may change abruptly several times. In contrast to previous work, we are able to achieve (nearly) optimal mini-max regret bounds without knowing the number of changes. For this setting, we propose an algorithm called ADSWITCH and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. Our regret bound is the first optimal bound for an algorithm that is not tuned with respect to the number of changes.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/auer19a.html
  PDF: http://proceedings.mlr.press/v99/auer19a/auer19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-auer19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Peter
    family: Auer
  - given: Pratik
    family: Gajane
  - given: Ronald
    family: Ortner
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 138-158
  id: auer19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 138
  lastpage: 158
  published: 2019-06-25 00:00:00 +0000
- title: 'Achieving Optimal Dynamic Regret for Non-stationary Bandits without Prior Information'
  abstract: 'This joint extended abstract introduces and compares the results of (Auer et al., 2019) and (Chen et al., 2019), both of which resolve the problem of achieving optimal dynamic regret for non-stationary bandits without prior information on the non-stationarity. Specifically, Auer et al. (2019) resolve the problem for the traditional multi-armed bandits setting, while Chen et al. (2019) give a solution for the more general contextual bandits setting. Both works extend the key idea of (Auer et al., 2018) developed for a simpler two-armed setting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/auer19b.html
  PDF: http://proceedings.mlr.press/v99/auer19b/auer19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-auer19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Peter
    family: Auer
  - given: Yifang
    family: Chen
  - given: Pratik
    family: Gajane
  - given: Chung-Wei
    family: Lee
  - given: Haipeng
    family: Luo
  - given: Ronald
    family: Ortner
  - given: Chen-Yu
    family: Wei
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 159-163
  id: auer19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 159
  lastpage: 163
  published: 2019-06-25 00:00:00 +0000
- title: 'A Universal Algorithm for Variational Inequalities Adaptive to Smoothness and Noise'
  abstract: 'We consider variational inequalities coming from monotone operators, a setting that includes convex minimization and convex-concave saddle-point problems.  We assume an access to potentially noisy unbiased values of the monotone operators and assess convergence through a compatible gap function which corresponds to the standard optimality criteria in the aforementioned subcases. We present a  universal algorithm for these inequalities based on the Mirror-Prox algorithm. Concretely, our algorithm \emph{simultaneously} achieves the optimal rates for the smooth/non-smooth, and noisy/noiseless settings. This is done without any prior knowledge of these properties, and in the  general set-up of arbitrary norms and compatible Bregman divergences. For convex minimization and convex-concave saddle-point problems, this leads to new adaptive algorithms. Our method relies on a novel yet simple adaptive choice of the step-size, which can be seen as  the appropriate  extension of AdaGrad to  handle constrained problems.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/bach19a.html
  PDF: http://proceedings.mlr.press/v99/bach19a/bach19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bach19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Francis
    family: Bach
  - given: Kfir Y
    family: Levy
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 164-194
  id: bach19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 164
  lastpage: 194
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning Two Layer Rectified Neural Networks in Polynomial Time'
  abstract: 'We consider the following fundamental problem in the study of neural networks: given input examples $x \in \mathbb{R}^d$ and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. We consider two-layer networks, mapping $\mathbb{R}^d$ to $\mathbb{R}^m$, with a single hidden layer and $k$ non-linear activation units $f(\cdot)$, where $f(x) = \max \{x , 0\}$ is the ReLU activation function. Such a network is specified by two weight matrices, $\mathbf{U}^* \in \mathbb{R}^{m \times k}, \mathbf{V}^* \in \mathbb{R}^{k \times d}$, such that the label of an example $x \in \mathbb{R}^{d}$ is given by $\mathbf{U}^* f(\mathbf{V}^* x)$, where $f(\cdot)$ is applied coordinate-wise. Given $n$ samples $x^1,…,x^n \in \mathbb{R}^d$ as a matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$ and the label $\mathbf{U}^* f(\mathbf{V}^* \mathbf{X})$ of the network on these samples, our goal is to recover the weight matrices $\mathbf{U}^*$ and $\mathbf{V}^*$. More generally, our labels $\mathbf{U}^* f(\mathbf{V}^* \mathbf{X})$  may be corrupted by noise, and instead we observe $\mathbf{U}^* f(\mathbf{V}^* \mathbf{X}) + \mathbf{E}$ where $\mathbf{E}$ is some noise matrix. Even in this case, we may still be interested in recovering good approximations to the weight matrices $\mathbf{U}^*$ and $\mathbf{V}^*$.  In this work, we develop algorithms and hardness results under varying assumptions on the input and noise. Although the problem is NP-hard even for $k=2$, by assuming Gaussian marginals over the input $\mathbf{X}$ we are able to develop polynomial time algorithms for the approximate recovery of $\mathbf{U}^*$ and $\mathbf{V}^*$. Perhaps surprisingly, in the noiseless case our algorithms recover $\mathbf{U}^*,\mathbf{V}^*$ \textit{exactly}, i.e. with no error, in \textit{strongly} polynomial time. To the best of the our knowledge, this is the first algorithm to accomplish exact recovery for the ReLU activation function. For the noisy case, we give the first polynomial time algorithm that approximately recovers the weights in the presence of mean-zero noise $\mathbf{E}$. Our algorithms generalize to a larger class of \textit{rectified} activation functions, $f(x) = 0$ when $x\leq 0$, and $f(x) > 0$ otherwise. Although our polynomial time results require $\mathbf{U}^*$ to have full column rank, we also give a fixed-parameter tractable algorithm (in $k$) when $\mathbf{U}^*$ does not have this property. Lastly, we give a  fixed-parameter tractable algorithm for more arbitrary noise matrices $\mathbf{E}$, so long as they are independent of $\mathbf{X}$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/bakshi19a.html
  PDF: http://proceedings.mlr.press/v99/bakshi19a/bakshi19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bakshi19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ainesh
    family: Bakshi
  - given: Rajesh
    family: Jayaram
  - given: David P
    family: Woodruff
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 195-268
  id: bakshi19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 195
  lastpage: 268
  published: 2019-06-25 00:00:00 +0000
- title: 'Private Center Points and Learning of Halfspaces'
  abstract: ' We present a private agnostic learner for halfspaces over an arbitrary finite domain $X\subset \R^d$ with sample complexity $\mathsf{poly}(d,2^{\log^*|X|})$. The building block for this learner is a differentially private algorithm for locating an approximate center point of $m>\mathsf{poly}(d,2^{\log^*|X|})$ points – a high dimensional generalization of the median function. Our construction establishes a relationship between these two problems that is reminiscent of the relation between the median and learning one-dimensional thresholds [Bun et al. FOCS ’15]. This relationship suggests that the problem of privately locating a center point may have further applications in the design of differentially private algorithms. We also provide a lower bound on the sample complexity for privately finding a point in the convex hull. For approximate  differential privacy, we show a lower bound of $m=\Omega(d+\log^*|X|)$, whereas for pure differential privacy $m=\Omega(d\log|X|)$. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/beimel19a.html
  PDF: http://proceedings.mlr.press/v99/beimel19a/beimel19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-beimel19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Amos
    family: Beimel
  - given: Shay
    family: Moran
  - given: Kobbi
    family: Nissim
  - given: Uri
    family: Stemmer
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 269-282
  id: beimel19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 269
  lastpage: 282
  published: 2019-06-25 00:00:00 +0000
- title: 'Lower bounds for testing graphical models: colorings and antiferromagnetic Ising models'
  abstract: ' We study the identity testing problem in the context of spin systems or undirected graphical models, where it takes the following form: given the parameter specification of the model $M$ and a sampling oracle for the  distribution $\mu_{M^*}$ of an unknown model $M^*$, can we efficiently determine if the two models $M$ and $M^*$ are the same? We consider identity testing for both soft-constraint and hard-constraint systems. In particular, we prove hardness results in two prototypical cases, the \emph{Ising model} and \emph{proper colorings}, and explore whether identity testing is easier than structure learning. For the ferromagnetic (attractive) Ising model, Daskalasis et al. (2018) presented a polynomial time algorithm for identity testing. We prove hardness results in the antiferromagnetic (repulsive) setting in the same regime of parameters where structure learning is known to require a super-polynomial number of samples. Specifically, for $n$-vertex graphs of maximum degree $d$, we prove that if $|\beta| d = \omega(\log{n})$ (where $\beta$ is the inverse temperature parameter), then there is no identity testing algorithm  for the antiferromagnetic Ising model  that runs in polynomial time unless $RP\!=\!NP$. We also establish computational lower bounds for a broader set of parameters under the (randomized) exponential time hypothesis. In our proofs, we  use random graphs as gadgets; this is inspired by similar constructions in seminal works on the hardness of approximate counting. In the hard-constraint setting, we present hardness results  for identity testing for proper colorings.   Our results are based on the presumed hardness of \textsc{#BIS}, the problem of (approximately) counting independent sets in bipartite graphs. In particular, we prove that identity testing for colorings is hard in the same range of parameters  where structure learning is known to be hard, which in turn matches the parameter regime for  NP-hardness of the corresponding decision problem. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/bezakova19a.html
  PDF: http://proceedings.mlr.press/v99/bezakova19a/bezakova19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bezakova19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ivona
    family: Bezáková
  - given: Antonio
    family: Blanca
  - given: Zongchen
    family: Chen
  - given: Daniel
    family: Štefankovič
  - given: Eric
    family: Vigoda
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 283-298
  id: bezakova19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 283
  lastpage: 298
  published: 2019-06-25 00:00:00 +0000
- title: 'Approximate Guarantees for Dictionary Learning'
  abstract: 'In the dictionary learning (or sparse coding) problem, we are given a collection of signals (vectors in $\mathbb{R}^d$), and the goal is to find a "basis" in which the signals have a sparse (approximate) representation. The problem has received a lot of attention in signal processing, learning, and theoretical computer science. The problem is formalized as factorizing a matrix $X (d \times n)$ (whose columns are the signals) as $X = AY$, where $A$ has a prescribed number $m$ of columns (typically $m \ll n$), and $Y$ has columns that are $k$-sparse (typically $k \ll d$). Most of the known theoretical results involve assuming that the columns of the unknown $A$ have certain incoherence properties, and that the coefficient matrix $Y$ has random (or partly random) structure.  The goal of our work is to understand what can be said in the absence of such assumptions. Can we still find $A$ and $Y$ such that $X \approx AY$? We show that this is possible, if we allow violating the bounds on $m$ and $k$ by appropriate factors that depend on $k$ and the desired approximation. Our results rely on an algorithm for what we call the threshold correlation problem, which turns out to be related to hypercontractive norms of matrices. We also show that our algorithmic ideas apply to a setting in which some of the columns of $X$ are outliers, thus giving similar guarantees even in this challenging setting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/bhaskara19a.html
  PDF: http://proceedings.mlr.press/v99/bhaskara19a/bhaskara19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bhaskara19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Aditya
    family: Bhaskara
  - given: Wai Ming
    family: Tai
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 299-317
  id: bhaskara19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 299
  lastpage: 317
  published: 2019-06-25 00:00:00 +0000
- title: 'The Optimal Approximation Factor in Density Estimation'
  abstract: ' Consider the following problem: given two arbitrary densities $q_1,q_2$ and a sample-access to an unknown target density $p$, find which of the $q_i$’s is closer to $p$ in total variation. A remarkable result due to Yatracos shows that this problem is tractable in the following sense: there exists an algorithm that uses  $O(\epsilon^{-2})$ samples from $p$ and outputs $q_i$ such that with high probability, $TV(q_i,p) \leq 3\cdot OPT + \epsilon$, where $OPT= \min\{TV(q_1,p),TV(q_2,p)\}$. Moreover, this result extends to any finite class of densities $\mathcal{Q}$: there exists an algorithm that outputs the best density in $\mathcal{Q}$ up to a multiplicative approximation factor of 3. We complement and extend this result by showing that: (i) the factor 3 can not be improved if one restricts the algorithm to output a density from $\mathcal{Q}$, and (ii) if one allows the algorithm to output arbitrary densities (e.g. a mixture of densities from $\mathcal{Q}$),  then the approximation factor can be reduced to 2, which is optimal. In particular this demonstrates an advantage of improper learning over proper in this setup. We develop two approaches to achieve the optimal approximation factor of $2$: an adaptive one and a static one. Both approaches are based on a geometric point of view of the problem and rely on estimating surrogate metrics to the total variation. Our sample complexity bounds exploit techniques from {\it Adaptive Data Analysis}.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/bousquet19a.html
  PDF: http://proceedings.mlr.press/v99/bousquet19a/bousquet19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bousquet19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Olivier
    family: Bousquet
  - given: Daniel
    family: Kane
  - given: Shay
    family: Moran
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 318-341
  id: bousquet19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 318
  lastpage: 341
  published: 2019-06-25 00:00:00 +0000
- title: 'Sorted Top-k in Rounds'
  abstract: ' We consider the sorted top-$k$ problem whose goal is to recover the top-$k$ items with the correct order out of $n$ items using pairwise comparisons. In many applications, multiple rounds of interaction can be costly. We restrict our attention to algorithms with a constant number of rounds $r$ and try to minimize the sample complexity, i.e. the number of comparisons. When the comparisons are noiseless, we characterize how the optimal sample complexity depends on the number of rounds (up to a polylogarithmic factor for general $r$ and up to a constant factor for $r=1$ or 2). In particular, the sample complexity is $\Theta(n^2)$ for $r=1$, $\Theta(n\sqrt{k} + n^{4/3})$ for $r=2$ and $\tilde{\Theta}\left(n^{2/r} k^{(r-1)/r} + n\right)$ for $r \geq 3$.  We extend our results of sorted top-$k$ to the noisy case where each comparison is correct with probability $2/3$. When $r=1$ or 2, we show that the sample complexity gets an extra $\Theta(\log(k))$ factor when we transition from the noiseless case to the noisy case. We also prove new results for top-$k$ and sorting in the noisy case. We believe our techniques can be generally useful for understanding the trade-off between round complexities and sample complexities of rank aggregation problems.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/braverman19a.html
  PDF: http://proceedings.mlr.press/v99/braverman19a/braverman19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-braverman19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mark
    family: Braverman
  - given: Jieming
    family: Mao
  - given: Yuval
    family: Peres
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 342-382
  id: braverman19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 342
  lastpage: 382
  published: 2019-06-25 00:00:00 +0000
- title: 'Multi-armed Bandit Problems with Strategic Arms'
  abstract: ' We study a strategic version of the multi-armed bandit problem, where each arm is an individual strategic agent and we, the principal, pull one arm each round. When pulled, the arm receives some private reward $v_a$ and can choose an amount $x_a$ to pass on to the principal (keeping $v_a-x_a$ for itself). All non-pulled arms get reward $0$. Each strategic arm tries to maximize its own utility over the course of $T$ rounds. Our goal is to design an algorithm for the principal incentivizing these arms to pass on as much of their private rewards as possible. When private rewards are stochastically drawn each round ($v_a^t \leftarrow D_a$), we show that: \begin{itemize} \item Algorithms that perform well in the classic adversarial multi-armed bandit setting necessarily perform poorly: For all algorithms that guarantee low regret in an adversarial setting, there exist distributions $D_1,\ldots,D_k$ and an $o(T)$-approximate Nash equilibrium for the arms where the principal receives reward $o(T)$.  \item There exists an algorithm for the principal that induces a game among the arms where each arm has a dominant strategy. Moreover, for every $o(T)$-approximate Nash equilibrium, the principal receives expected reward $\mu’T - o(T)$, where $\mu’$ is the second-largest of the means $\mathbb{E}[D_{a}]$. This algorithm maintains its guarantee if the arms are non-strategic ($x_a = v_a$), and also if there is a mix of strategic and non-strategic arms. \end{itemize}'
  volume: 99
  URL: https://proceedings.mlr.press/v99/braverman19b.html
  PDF: http://proceedings.mlr.press/v99/braverman19b/braverman19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-braverman19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mark
    family: Braverman
  - given: Jieming
    family: Mao
  - given: Jon
    family: Schneider
  - given: S. Matthew
    family: Weinberg
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 383-416
  id: braverman19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 383
  lastpage: 416
  published: 2019-06-25 00:00:00 +0000
- title: 'Universality of Computational Lower Bounds for Submatrix Detection'
  abstract: 'In the general submatrix detection problem, the task is to detect the presence of a small $k \times k$ submatrix with entries sampled from a distribution $\mathcal{P}$ in an $n \times n$ matrix of samples from $\mathcal{Q}$. This formulation includes a number of well-studied problems, such as biclustering when $\mathcal{P}$ and $\mathcal{Q}$ are Gaussians and the planted dense subgraph formulation of community detection when the submatrix is a principal minor and $\mathcal{P}$ and $\mathcal{Q}$ are Bernoulli random variables. These problems all seem to exhibit a universal phenomenon: there is a statistical-computational gap depending on $\mathcal{P}$ and $\mathcal{Q}$ between the minimum $k$ at which this task can be solved and the minimum $k$ at which it can be solved in polynomial time. Our main result is to tightly characterize this computational barrier as a tradeoff between $k$ and the KL divergences between $\mathcal{P}$ and $\mathcal{Q}$ through average-case reductions from the planted clique conjecture. These computational lower bounds hold given mild assumptions on $\mathcal{P}$ and $\mathcal{Q}$ arising naturally from classical binary hypothesis testing. Our results recover and generalize the planted clique lower bounds for Gaussian biclustering in Ma and Wu (2015) and Brennan et al. (2018) and for the sparse and general regimes of planted dense subgraph in Hajek et al. (2015) and Brennan et al. (2018). This yields the first universality principle for computational lower bounds obtained through average-case reductions.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/brennan19a.html
  PDF: http://proceedings.mlr.press/v99/brennan19a/brennan19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-brennan19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Matthew
    family: Brennan
  - given: Guy
    family: Bresler
  - given: Wasim
    family: Huleihel
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 417-468
  id: brennan19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 417
  lastpage: 468
  published: 2019-06-25 00:00:00 +0000
- title: 'Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong Hardness'
  abstract: 'In the past decade, sparse principal component analysis has emerged as an archetypal problem for illustrating statistical-computational tradeoffs. This trend has largely been driven by a line of research aiming to characterize the average-case complexity of sparse PCA through reductions from the planted clique ($\textsc{pc}$) conjecture. All previous reductions either fail to show tight computational lower bounds matching existing algorithms or show lower bounds for formulations of sparse PCA other than its canonical generative model, the spiked covariance model. Also, these lower bounds all quickly degrade given weak forms of the $\textsc{pc}$ conjecture. We give a reduction from $\textsc{pc}$ that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. We also show the surprising result that even a mild improvement in the signal strength needed by the best known polynomial-time sparse PCA algorithms would imply that the hardness threshold for $\textsc{pc}$ is $\text{polylog}(N)$, rather than on the order $N^{1/2}$ as is widely conjectured. This is the first instance of a suboptimal hardness assumption implying optimal lower bounds for another problem in unsupervised learning.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/brennan19b.html
  PDF: http://proceedings.mlr.press/v99/brennan19b/brennan19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-brennan19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Matthew
    family: Brennan
  - given: Guy
    family: Bresler
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 469-470
  id: brennan19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 469
  lastpage: 470
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning rates for Gaussian mixtures under group action'
  abstract: 'We study the pointwise maximum likelihood estimation rates for a class of Gaussian mixtures that are invariant under the action of some isometry group. This model is also known as multi-reference alignment, where random rotations of a given vector are observed, up to Gaussian noise. We completely characterize the speed of the maximum likelihood estimator, by giving a comprehensive description of the likelihood geometry of the model. We show that the unknown parameter can always be decomposed into two components, one of which can be estimated at the fast rate $n^{-1/2}$, the other one being estimated at the slower rate $n^{-1/4}$. We provide an algebraic description and a geometric interpretation of these facts.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/brunel19a.html
  PDF: http://proceedings.mlr.press/v99/brunel19a/brunel19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-brunel19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Victor-Emmanuel
    family: Brunel
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 471-491
  id: brunel19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 471
  lastpage: 491
  published: 2019-06-25 00:00:00 +0000
- title: 'Near-optimal method for highly smooth convex optimization'
  abstract: 'We propose a near-optimal method for highly smooth convex optimization. More precisely, in the oracle model where one obtains the $p^{th}$ order Taylor expansion of a function at the query point, we propose a method with rate of convergence $\tilde{O}(1/k^{\frac{ 3p +1}{2}})$ after $k$ queries to the oracle for any convex function whose $p^{th}$ order derivative is Lipschitz.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/bubeck19a.html
  PDF: http://proceedings.mlr.press/v99/bubeck19a/bubeck19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bubeck19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sébastien
    family: Bubeck
  - given: Qijia
    family: Jiang
  - given: Yin Tat
    family: Lee
  - given: Yuanzhi
    family: Li
  - given: Aaron
    family: Sidford
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 492-507
  id: bubeck19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 492
  lastpage: 507
  published: 2019-06-25 00:00:00 +0000
- title: 'Improved Path-length Regret Bounds for Bandits'
  abstract: 'We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit. We first show that the seemingly suboptimal path-length bound of (Wei and Luo, 2018) is in fact not improvable for adaptive adversary. Despite this negative result, we then develop two new algorithms, one that strictly improves over (Wei and Luo, 2018) with a smaller path-length measure, and the other which improves over (Wei and Luo, 2018) for oblivious adversary when the path-length is large. Our algorithms are based on the well-studied optimistic mirror descent framework, but importantly with several novel techniques, including new optimistic predictions, a slight bias towards recently selected arms, and the use of a hybrid regularizer similar to that of (Bubeck et al., 2018). Furthermore, we extend our results to linear bandit by showing a reduction to obtaining dynamic regret for a full-information problem, followed by a further reduction to convex body chasing. As a consequence we obtain new dynamic regret results as well as the first path-length regret bounds for general linear bandit.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/bubeck19b.html
  PDF: http://proceedings.mlr.press/v99/bubeck19b/bubeck19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-bubeck19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sébastien
    family: Bubeck
  - given: Yuanzhi
    family: Li
  - given: Haipeng
    family: Luo
  - given: Chen-Yu
    family: Wei
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 508-528
  id: bubeck19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 508
  lastpage: 528
  published: 2019-06-25 00:00:00 +0000
- title: 'Optimal Learning of Mallows Block Model'
  abstract: 'The Mallows model, introduced in the seminal paper of Mallows 1957, is one of the most fundamental ranking distribution over the symmetric group $S_m$. To analyze more complex ranking data, several studies considered the Generalized Mallows model defined by Fligner and Verducci 1986. Despite the significant research interest of ranking distributions, the exact sample complexity of estimating the parameters of a Mallows and a Generalized Mallows Model is not well-understood. The main result of the paper is a tight sample complexity bound for learning Mallows and Generalized Mallows Model. We approach the learning problem by analyzing a more general model which interpolates between the single parameter Mallows Model and the $m$ parameter Mallows model. We call our model Mallows Block Model – referring to the Block Models that are a popular model in theoretical statistics. Our sample complexity analysis gives tight bound for learning the Mallows Block Model for any number of blocks. We provide essentially matching lower bounds for our sample complexity results. As a corollary of our analysis, it turns out that, if the central ranking is known, one single sample from the Mallows Block Model is sufficient to estimate the spread parameters with error that goes to zero as the size of the permutations goes to infinity. In addition, we calculate the exact rate of the parameter estimation error.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/busa-fekete19a.html
  PDF: http://proceedings.mlr.press/v99/busa-fekete19a/busa-fekete19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-busa-fekete19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Robert
    family: Busa-Fekete
  - given: Dimitris
    family: Fotakis
  - given: Balázs
    family: Szörényi
  - given: Manolis
    family: Zampetakis
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 529-532
  id: busa-fekete19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 529
  lastpage: 532
  published: 2019-06-25 00:00:00 +0000
- title: 'Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret'
  abstract: ' Gaussian processes (GP) are a stochastic processes, used  as Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to high-dimensional functions, as their per-iteration time and space cost is at least \emph{quadratic} in the number of dimensions $d$ and iterations $t$. Given a set of $A$ alternatives to choose from, the overall runtime $O(t^3 A)$ is prohibitive.  In this paper, we introduce BKB (\textit{budgeted kernelized bandit}), a new approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and remarkably no assumption on the input space or covariance of the GP. We combine a kernelized linear bandit algorithm (GP-UCB)  leverage score sampling as a randomized matrix sketching and  prove that selecting inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from \emph{variance starvation}, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most $\widetilde{O}(d_{eff})$ points, where $d_{eff}$ is the \emph{effective} dimension of the explored space, which is typically much smaller than both $d$ and $t$. This greatly reduces the dimensionality of the problem, thus leading to a $O(T A d_{eff}^2)$ runtime and $O(A d_{eff})$ space complexity.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/calandriello19a.html
  PDF: http://proceedings.mlr.press/v99/calandriello19a/calandriello19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-calandriello19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Daniele
    family: Calandriello
  - given: Luigi
    family: Carratino
  - given: Alessandro
    family: Lazaric
  - given: Michal
    family: Valko
  - given: Lorenzo
    family: Rosasco
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 533-557
  id: calandriello19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 533
  lastpage: 557
  published: 2019-06-25 00:00:00 +0000
- title: 'Disagreement-Based Combinatorial Pure Exploration: Sample Complexity Bounds and an Efficient Algorithm'
  abstract: 'We design new algorithms for the combinatorial pure exploration problem in the multi-arm bandit framework. In this problem, we are given $K$ distributions and a collection of subsets $\mathcal{V} \subset 2^{[K]}$ of these distributions, and we would like to find the subset $v \in \mathcal{V}$ that has largest mean, while collecting, in a sequential fashion, as few samples from the distributions as possible. In both the fixed budget and fixed confidence settings, our algorithms achieve new sample-complexity bounds that provide polynomial improvements on previous results in some settings. Via an information-theoretic lower bound, we show that no approach based on uniform sampling can improve on ours in any regime, yielding the first interactive algorithms for this problem with this basic property.  Computationally, we show how to efficiently implement our fixed confidence algorithm whenever $\mathcal{V}$ supports efficient linear optimization.  Our results involve precise concentration-of-measure arguments and a new algorithm for linear programming with exponentially many constraints.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/cao19a.html
  PDF: http://proceedings.mlr.press/v99/cao19a/cao19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cao19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tongyi
    family: Cao
  - given: Akshay
    family: Krishnamurthy
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 558-588
  id: cao19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 558
  lastpage: 588
  published: 2019-06-25 00:00:00 +0000
- title: 'A Rank-1 Sketch for Matrix Multiplicative Weights'
  abstract: 'We show that a simple randomized sketch of the matrix  multiplicative weight (MMW) update enjoys (in expectation) the same  regret bounds as MMW, up to a small constant factor. Unlike MMW, where  every step requires full matrix exponentiation, our steps require only a  single product of the form $e^A b$, which the Lanczos method  approximates efficiently. Our key technique is to view the sketch as a  \emph{randomized mirror projection}, and perform mirror descent analysis  on the \emph{expected projection}. Our sketch solves the online  eigenvector problem, improving the best known complexity bounds by  $\Omega(\log^5 n)$. We also apply this sketch to semidefinite  programming in saddle-point form, yielding a simple primal-dual scheme  with guarantees matching the best in the literature.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/carmon19a.html
  PDF: http://proceedings.mlr.press/v99/carmon19a/carmon19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-carmon19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yair
    family: Carmon
  - given: John C
    family: Duchi
  - given: Sidford
    family: Aaron
  - given: Tian
    family: Kevin
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 589-623
  id: carmon19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 589
  lastpage: 623
  published: 2019-06-25 00:00:00 +0000
- title: 'On the Computational Power of Online Gradient Descent'
  abstract: 'We prove that the evolution of weight vectors in online gradient descent can encode arbitrary polynomial-space computations, even in very simple learning settings. Our results imply that, under weak complexity-theoretic assumptions, it is impossible to reason efficiently about the fine-grained behavior of online gradient descent.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/chatziafratis19a.html
  PDF: http://proceedings.mlr.press/v99/chatziafratis19a/chatziafratis19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-chatziafratis19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vaggos
    family: Chatziafratis
  - given: Tim
    family: Roughgarden
  - given: Joshua R.
    family: Wang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 624-662
  id: chatziafratis19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 624
  lastpage: 662
  published: 2019-06-25 00:00:00 +0000
- title: 'Active Regression via Linear-Sample Sparsification'
  abstract: ' We present an approach that improves the sample complexity for a variety of curve fitting problems, including active learning for linear regression, polynomial regression, and continuous sparse Fourier transforms.  In the active linear regression problem, one would like to estimate the least squares solution $\beta^*$ minimizing $\|X\beta - y\|_2$ given the entire unlabeled dataset $X \in \mathbb{R}^{n \times d}$ but only observing a small number of labels $y_i$.  We show that $O(d)$ labels suffice to find a constant factor approximation $\widetilde{\beta}$: \[ \mathbb{E}[\|{X} \widetilde{\beta} - y \|_2^2] \leq 2 \mathbb{E}[\|X \beta^* - y\|_2^2]. \]{This} improves on the best previous result of $O(d \log d)$ from leverage score sampling.  We also present results for the \emph{inductive} setting, showing when $\widetilde{\beta}$ will generalize to fresh samples; these apply to continuous settings such as polynomial regression.  Finally, we show how the techniques yield improved results for the non-linear sparse Fourier transform setting.   '
  volume: 99
  URL: https://proceedings.mlr.press/v99/chen19a.html
  PDF: http://proceedings.mlr.press/v99/chen19a/chen19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-chen19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xue
    family: Chen
  - given: Eric
    family: Price
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 663-695
  id: chen19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 663
  lastpage: 695
  published: 2019-06-25 00:00:00 +0000
- title: 'A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free'
  abstract: 'We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves $\mathcal{O}(\min\{\sqrt{KST}, K^{\frac{1}{3}}\Delta ^{\frac{1}{3}}T^{\frac{2}{3}}\})$ dynamic regret for a contextual bandit problem with $T$ rounds, $K$ actions, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O} (\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce {\it replay phases}, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/chen19b.html
  PDF: http://proceedings.mlr.press/v99/chen19b/chen19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-chen19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yifang
    family: Chen
  - given: Chung-Wei
    family: Lee
  - given: Haipeng
    family: Luo
  - given: Chen-Yu
    family: Wei
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 696-726
  id: chen19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 696
  lastpage: 726
  published: 2019-06-25 00:00:00 +0000
- title: 'Faster Algorithms for High-Dimensional Robust Covariance Estimation'
  abstract: ' We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal error guarantees for several natural structured distributions.  Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance. Given $N = \tilde{\Omega}(d^2/\epsilon^2)$ samples from a $d$-dimensional Gaussian distribution, an $\epsilon$-fraction of which may be arbitrarily corrupted, our algorithm runs in time $\tilde{O}(d^{3.26})/\mathrm{poly}(\epsilon)$ and approximates the unknown covariance matrix to optimal error up to a logarithmic factor. Previous robust algorithms with comparable error guarantees all have runtimes $\tilde{\Omega}(d^{2 \omega})$ when $\epsilon = \Omega(1)$, where $\omega$ is the exponent of matrix multiplication. We also provide evidence that improving the running time of our algorithm may require new algorithmic techniques. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/cheng19a.html
  PDF: http://proceedings.mlr.press/v99/cheng19a/cheng19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cheng19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yu
    family: Cheng
  - given: Ilias
    family: Diakonikolas
  - given: Rong
    family: Ge
  - given: David P.
    family: Woodruff
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 727-757
  id: cheng19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 727
  lastpage: 757
  published: 2019-06-25 00:00:00 +0000
- title: 'Testing Symmetric Markov Chains Without Hitting'
  abstract: 'We study the problem of identity testing of symmetric markov chains. In this setting, we are given access to a single trajectory from a markov chain with unknown transition matrix $\bm{Q}$ and the goal is to determine whether $\bm{Q} = \bm{P}$ for some known matrix $\bm{P}$ or $\text{Dist}(\bm{P}, \bm{Q}) \geq \epsilon$ where $\text{Dist}$ is suitably defined. In recent work by Daskalakis et al, 2018, it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of $\bm{P}$ which may be arbitrarily large. In this paper, we propose an algorithm that avoids this dependence on hitting time thus enabling efficient testing of markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of markov chains.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/cherapanamjeri19a.html
  PDF: http://proceedings.mlr.press/v99/cherapanamjeri19a/cherapanamjeri19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cherapanamjeri19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yeshwanth
    family: Cherapanamjeri
  - given: Peter L.
    family: Bartlett
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 758-785
  id: cherapanamjeri19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 758
  lastpage: 785
  published: 2019-06-25 00:00:00 +0000
- title: 'Fast Mean Estimation with Sub-Gaussian Rates'
  abstract: 'We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^{3.5}+n^2d)$ for $n$ i.i.d. samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins (2018), which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/cherapanamjeri19b.html
  PDF: http://proceedings.mlr.press/v99/cherapanamjeri19b/cherapanamjeri19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cherapanamjeri19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yeshwanth
    family: Cherapanamjeri
  - given: Nicolas
    family: Flammarion
  - given: Peter L.
    family: Bartlett
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 786-806
  id: cherapanamjeri19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 786
  lastpage: 806
  published: 2019-06-25 00:00:00 +0000
- title: 'Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games'
  abstract: 'We establish that algorithmic experiments in zero-sum games “fail miserably” to confirm the unique, sharp prediction of maxmin equilibration. Contradicting nearly a century of economic thought that treats zero-sum games nearly axiomatically as the exemplar symbol of economic stability, we prove that no meaningful prediction can be made about the  day-to-day behavior of online learning dynamics in zero-sum games. Concretely, Multiplicative Weights Updates (MWU) with constant step-size is \emph{Lyapunov chaotic} in the dual (payoff) space. Simply put, let’s assume that an observer asks the agents playing Matching-Pennies whether they prefer Heads or Tails (and by how much in terms of aggregate payoff so far). The range of possible answers consistent with any arbitrary small set of initial conditions blows up exponentially with time everywhere in the payoff space. This result is \emph{robust} both \emph{algorithmically} as well as \emph{game theoretically}. \textbf{Algorithmic robustness:} Chaos is robust to agents using any of a general sub-family of Follow-the-Regularized-Leader (FTRL) algorithms, the well known regret-minimizing dynamics, even when agents mix-and-match dynamics, use different or slowly decreasing step-sizes. \textbf{Game theoretic robustness:} Chaos is robust to all affine variants of zero-sum games (strictly competitive games), network variants with arbitrary large number of agents and even to competitive settings beyond these. Our result is in stark contrast with the time-average convergence of online learning to (approximate) Nash equilibrium, a result widely reported as “(weak) convergence to equilibrium”.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/cheung19a.html
  PDF: http://proceedings.mlr.press/v99/cheung19a/cheung19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cheung19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yun Kuen
    family: Cheung
  - given: Georgios
    family: Piliouras
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 807-834
  id: cheung19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 807
  lastpage: 834
  published: 2019-06-25 00:00:00 +0000
- title: 'Pure entropic regularization for metrical task systems'
  abstract: 'We show that on every $n$-point HST metric, there is a randomized online algorithm for metrical task systems (MTS) that is $1$-competitive for service costs and $O(\log n)$-competitive for movement costs. In general, these refined guarantees are optimal up to the implicit constant. While an $O(\log n)$-competitive algorithm for MTS on HST metrics was developed by Bubeck et al. (2018), that approach could only establish an $O((\log n)^2)$-competitive ratio when the service costs are required to be $O(1)$-competitive. Our algorithm is an instantiation of online mirror descent with the regularizer derived from a multiscale conditional entropy. In fact, our algorithm satisfies a set of even more refined guarantees; we are able to exploit this property to combine it with known random embedding theorems and obtain, for {\em any} $n$-point metric space, a randomized algorithm that is $1$-competitive for service costs and $O((\log n)^2)$-competitive for movement costs.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/coester19a.html
  PDF: http://proceedings.mlr.press/v99/coester19a/coester19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-coester19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Christian
    family: Coester
  - given: James R.
    family: Lee
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 835-848
  id: coester19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 835
  lastpage: 848
  published: 2019-06-25 00:00:00 +0000
- title: 'A near-optimal algorithm for approximating the John Ellipsoid'
  abstract: 'We develop a simple and efficient algorithm for approximating the John Ellipsoid of a symmetric polytope.  Our algorithm is near optimal in the sense that our time complexity matches the current best verification algorithm.  Experimental results suggest that our algorithm significantly outperforms existing algorithms. We also provide the MATLAB code for further research. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/cohen19a.html
  PDF: http://proceedings.mlr.press/v99/cohen19a/cohen19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cohen19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Michael B.
    family: Cohen
  - given: Ben
    family: Cousins
  - given: Yin Tat
    family: Lee
  - given: Xin
    family: Yang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 849-873
  id: cohen19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 849
  lastpage: 873
  published: 2019-06-25 00:00:00 +0000
- title: 'Artificial Constraints and Hints for Unbounded Online Learning'
  abstract: 'We provide algorithms that guarantees regret $R_T(u)\le \tilde O(G\|u\|^3 +  G(\|u\|+1)\sqrt{T})$ or $R_T(u)\le \tilde O(G\|u\|^3T^{1/3} + GT^{1/3}+ G\|u\|\sqrt{T})$ for online convex optimization with $G$-Lipschitz losses for any comparison point $u$ without prior knowledge of either $G$ or $\|u\|$. Previous algorithms dispense with the $O(\|u\|^3)$ term at the expense of knowledge of one or both of these parameters, while a lower bound shows that some additional penalty term over $G\|u\|\sqrt{T}$ is necessary. Previous penalties were \emph{exponential} while our bounds are polynomial in all quantities. Further, given a known bound $\|u\|\le D$, our same techniques allow us to design algorithms that adapt optimally to the unknown value of $\|u\|$ without requiring knowledge of $G$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/cutkosky19a.html
  PDF: http://proceedings.mlr.press/v99/cutkosky19a/cutkosky19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cutkosky19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ashok
    family: Cutkosky
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 874-894
  id: cutkosky19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 874
  lastpage: 894
  published: 2019-06-25 00:00:00 +0000
- title: 'Combining Online Learning Guarantees'
  abstract: 'We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. Our method is embarrassingly simple: just add the iterates. This trick can generate efficient algorithms that adapt to many norms simultaneously, as well as providing diagonal-style algorithms that still maintain dimension-free guarantees. We then proceed to show how a variant on this idea yields a black-box procedure for generating \emph{optimistic} online learning algorithms. This yields the first optimistic regret guarantees in the unconstrained setting and generically increases adaptivity. Further, our optimistic algorithms are guaranteed to do no worse than their non-optimistic counterparts regardless of the quality of the optimistic estimates provided to the algorithm.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/cutkosky19b.html
  PDF: http://proceedings.mlr.press/v99/cutkosky19b/cutkosky19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-cutkosky19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ashok
    family: Cutkosky
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 895-913
  id: cutkosky19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 895
  lastpage: 913
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning from Weakly Dependent Data under Dobrushin’s Condition'
  abstract: 'Statistical learning theory has largely focused on learning and generalization given independent and identically distributed (i.i.d.) samples. Motivated by applications involving time-series data, there has been a growing literature on learning and generalization in settings where data is sampled from an ergodic process. This work has also developed complexity measures, which appropriately extend the notion of Rademacher complexity to bound the generalization error and learning rates of hypothesis classes in this setting. Rather than time-series data, our work is  motivated  by settings where data  is sampled on a network or a spatial domain, and thus do not fit well within the framework of prior work. We provide learning and generalization bounds for data that are complexly dependent, yet their distribution satisfies the standard Dobrushin’s condition. Indeed, we show that the standard complexity measures of Gaussian and Rademacher complexities and VC dimension are sufficient measures of complexity for the purposes of bounding the generalization error and learning rates of hypothesis classes in our setting. Moreover, our generalization bounds only degrade by constant factors compared to their i.i.d. analogs, and our learnability bounds degrade by log factors in the size of the training set.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/dagan19a.html
  PDF: http://proceedings.mlr.press/v99/dagan19a/dagan19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-dagan19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yuval
    family: Dagan
  - given: Constantinos
    family: Daskalakis
  - given: Nishanth
    family: Dikkala
  - given: Siddhartha
    family: Jayanti
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 914-928
  id: dagan19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 914
  lastpage: 928
  published: 2019-06-25 00:00:00 +0000
- title: 'Space lower bounds for linear prediction in the streaming model'
  abstract: 'We show that fundamental learning tasks, such as  finding an approximate linear separator or linear regression, require memory at least  \emph{quadratic} in the dimension, in a natural streaming setting. This  implies that such problems cannot be solved (at least in this setting) by  scalable memory-efficient streaming algorithms. Our  results build  on a memory lower bound for a simple linear-algebraic problem  – finding approximate null vectors – and utilize the estimates on the packing of the Grassmannian, the manifold of all linear subspaces of fixed dimension. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/dagan19b.html
  PDF: http://proceedings.mlr.press/v99/dagan19b/dagan19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-dagan19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yuval
    family: Dagan
  - given: Gil
    family: Kur
  - given: Ohad
    family: Shamir
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 929-954
  id: dagan19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 929
  lastpage: 954
  published: 2019-06-25 00:00:00 +0000
- title: 'Computationally and Statistically Efficient Truncated Regression'
  abstract: ' We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = \vec{w}^{\rm T} \vec{x}+{\varepsilon}$ and its corresponding vector of covariates $\vec{x} \in \mathbb{R}^k$ are only revealed if the dependent variable falls in some subset $S \subseteq \mathbb{R}$; otherwise the existence of the pair $(\vec{x},y)$ is hidden. This problem has remained a challenge since the early works of Tobin 1958, Amemiya et al. 1973, HAusman et al 1977, Breen et al. 1996, its applications are abundant, and its history dates back even further to the work of Galton 1897, Pearson 1908, Lee 1915, and Fisher 1931. While consistent estimators of the regression coefficients have been identified, the error rates are not well-understood, especially in high-dimensional settings. Under a “thickness assumption” about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vector $\vec{w}$ from $n$ revealed samples that attains $\ell_2$ error $\tilde{O}(\sqrt{k / n})$, almost recovering the guarantees of least squares in the standard (untruncated) linear regression setting. Our estimator uses Projected Stochastic Gradient Descent (PSGD) on the negative log-likelihood of the truncated sample, and only needs oracle access to the set $S$, which may otherwise be arbitrary, and in particular may be non-convex. PSGD must be restricted to an appropriately defined convex cone to guarantee that the negative log-likelihood is strongly convex, which in turn is established using concentration of matrices on variables with sub-exponential tails. We perform experiments on simulated data to illustrate the accuracy of our estimator. As a corollary of our work, we show that SGD provably learns the parameters of single-layer neural networks with noisy Relu activation functions (see Nair and Hinton 2010, Bengio et al. 2013, Gulcehre et al. 2016), given linearly many, in the number of network parameters, input-output pairs in the realizable setting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/daskalakis19a.html
  PDF: http://proceedings.mlr.press/v99/daskalakis19a/daskalakis19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-daskalakis19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Constantinos
    family: Daskalakis
  - given: Themis
    family: Gouleakis
  - given: Christos
    family: Tzamos
  - given: Manolis
    family: Zampetakis
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 955-960
  id: daskalakis19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 955
  lastpage: 960
  published: 2019-06-25 00:00:00 +0000
- title: 'Reconstructing Trees from Traces'
  abstract: ' We study the problem of learning a node-labeled tree given independent traces from an appropriately defined deletion channel. This problem, tree trace reconstruction, generalizes string trace reconstruction, which corresponds to the tree being a path. For many classes of trees, including complete trees and spiders, we provide algorithms that reconstruct the labels using only a polynomial number of traces. This exhibits a stark contrast to known results on string trace reconstruction, which require exponentially many traces, and where a central open problem is to determine whether a polynomial number of traces suffice. Our techniques combine novel combinatorial and complex analytic methods.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/davies19a.html
  PDF: http://proceedings.mlr.press/v99/davies19a/davies19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-davies19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sami
    family: Davies
  - given: Miklos Z.
    family: Racz
  - given: Cyrus
    family: Rashtchian
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 961-978
  id: davies19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 961
  lastpage: 978
  published: 2019-06-25 00:00:00 +0000
- title: 'Is your function low dimensional?'
  abstract: 'We study the problem of testing if a function depends on a small number of linear directions of its input data. We call a function $f$  a \emph{linear $k$-junta} if it is completely determined by some $k$-dimensional subspace of the input space. In this paper, we study the problem of testing whether a given $n$ variable function $f : \mathbb{R}^n \to \{0,1\}$, is a linear $k$-junta or $\epsilon$-far from all linear $k$-juntas, where the closeness is measured with respect to the Gaussian measure on $\mathbb{R}^n$. Linear $k$-juntas are a common generalization of two fundamental classes from Boolean function analysis (both of which have been studied in property testing) \textbf{1.} $k$- juntas which are functions on the Boolean cube which depend on at most k of the variables and \textbf{2.}  intersection of $k$ halfspaces, a fundamental geometric concept class.   We show that the class of linear $k$-juntas is not testable, but adding a surface area constraint makes it testable: we give a $\mathsf{poly}(k \cdot s/\epsilon)$-query non-adaptive tester for linear $k$-juntas with surface area at most $s$. We show that the polynomial dependence on $s$ is necessary. Moreover, we show that if the function is a linear $k$-junta with surface area at most $s$,  we give a $(s \cdot k)^{O(k)}$-query non-adaptive algorithm to learn the function \emph{up to a rotation of the basis}.  In particular, this implies that we can test the class of intersections of $k$ halfspaces in $\mathbb{R}^n$ with query complexity independent of $n$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/de19a.html
  PDF: http://proceedings.mlr.press/v99/de19a/de19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-de19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Anindya
    family: De
  - given: Elchanan
    family: Mossel
  - given: Joe
    family: Neeman
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 979-993
  id: de19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 979
  lastpage: 993
  published: 2019-06-25 00:00:00 +0000
- title: 'Computational Limitations in Robust Classification and Win-Win Results'
  abstract: 'We continue the study of statistical/computational tradeoffs in learning robust classifiers, following the recent work of Bubeck, Lee, Price and Razenshteyn who showed examples of classification tasks where (a) an efficient robust classifier exists, in the small-perturbation regime; (b) a non-robust classifier can be learned efficiently; but (c) it is computationally hard to learn a robust classifier, assuming the hardness of factoring large numbers. Indeed, the question of whether a robust classifier for their task exists in the large perturbation regime seems related to important open questions in computational number theory. In this work, we extend their work in three directions. First, we demonstrate classification tasks where computationally efficient robust classification is impossible, even when computationally unbounded robust classifiers exist. For this, we rely on the existence of average-case hard functions, requiring no cryptographic assumptions. Second, we show hard-to-robustly-learn classification tasks in the large-perturbation regime. Namely, we show that even though an efficient classifier that is very robust (namely, tolerant to large perturbations) exists, it is computationally hard to learn any non-trivial robust classifier. Our first construction relies on the existence of one-way functions, a minimal assumption in cryptography, and the second on the hardness of the learning parity with noise problem. In the latter setting, not only does a non-robust classifier exist, but also an efficient algorithm that generates fresh new labeled samples given access to polynomially many training examples (termed as generation by Kearns et al. (1994)). Third, we show that any such counterexample implies the existence of cryptographic primitives such as one-way functions or even forms of public-key encryption. This leads us to a win-win scenario: either we can quickly learn an efficient robust classifier, or we can construct new instances of popular and useful cryptographic primitives.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/degwekar19a.html
  PDF: http://proceedings.mlr.press/v99/degwekar19a/degwekar19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-degwekar19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Akshay
    family: Degwekar
  - given: Preetum
    family: Nakkiran
  - given: Vinod
    family: Vaikuntanathan
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 994-1028
  id: degwekar19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 994
  lastpage: 1028
  published: 2019-06-25 00:00:00 +0000
- title: 'Fast determinantal point processes via distortion-free intermediate sampling'
  abstract: ' Given a fixed $n\times d$ matrix $\mathbf{X}$, where $n\gg d$, we study the complexity of sampling from a  distribution over all subsets of rows where the probability of a subset is proportional to the squared volume of the parallelepiped spanned by the rows (a.k.a. a determinantal point process). In this task, it is important to minimize the preprocessing cost of the procedure (performed once) as well as the  sampling cost (performed repeatedly). To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O\big(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n\big)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$. We achieve this by introducing a new \textit{regularized} determinantal point process (R-DPP), which serves as an intermediate distribution in the sampling procedure by reducing the number of rows from $n$ to $\text{poly}(d)$. Crucially, this intermediate distribution does not distort the probabilities of the target sample. Our key novelty in defining the R-DPP  is the use of a Poisson random variable for controlling the probabilities of different subset sizes, leading to new determinantal formulas such as the normalization constant for this distribution. Our algorithm has applications in many diverse areas where determinantal point processes have been used, such as  machine learning, stochastic optimization, data summarization and low-rank matrix reconstruction.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/derezinski19a.html
  PDF: http://proceedings.mlr.press/v99/derezinski19a/derezinski19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-derezinski19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Michał
    family: Dereziński
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1029-1049
  id: derezinski19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1029
  lastpage: 1049
  published: 2019-06-25 00:00:00 +0000
- title: 'Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression'
  abstract: 'In experimental design, we are given a large collection of vectors, each with a hidden response value that we assume derives from an underlying linear model, and we wish to pick a small subset of the vectors such that querying the corresponding responses will lead to a good estimator of the model.   A classical approach in statistics is to assume the responses are linear, plus zero-mean i.i.d. Gaussian noise, in which case the goal is to provide an unbiased estimator with smallest mean squared error (A-optimal design).   A related approach, more common in computer science, is to assume the responses are arbitrary but fixed, in which case the goal is to estimate the least squares solution using few responses, as quickly as possible, for worst-case inputs.    Despite many attempts, characterizing the relationship between these two approaches has proven elusive.   We address this by proposing a framework for experimental  design where the responses are produced by an arbitrary unknown distribution.   We show that there is an efficient randomized experimental design procedure that achieves strong variance bounds for an unbiased estimator using few responses  in this general model. Nearly tight bounds for the classical A-optimality criterion, as well as improved bounds for worst-case responses, emerge as special cases of this result.   In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d. importance sampling method: inverse score sampling.   A key novelty of our analysis is in developing new expected error bounds for worst-case regression by controlling the tail behavior of i.i.d. sampling via the jointness of volume sampling.   Our result motivates a new minimax-optimality criterion for experimental design with unbiased estimators, which can be viewed as an extension of both A-optimal design and sampling for worst-case regression. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/derezinski19b.html
  PDF: http://proceedings.mlr.press/v99/derezinski19b/derezinski19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-derezinski19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Michał
    family: Dereziński
  - given: Kenneth L.
    family: Clarkson
  - given: Michael W.
    family: Mahoney
  - given: Manfred K.
    family: Warmuth
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1050-1069
  id: derezinski19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1050
  lastpage: 1069
  published: 2019-06-25 00:00:00 +0000
- title: 'Communication and Memory Efficient Testing of Discrete Distributions'
  abstract: 'We study distribution testing with communication and memory constraints  in the following computational models: (1) The {\em one-pass streaming model}  where the goal is to minimize the sample complexity of the protocol subject to a memory constraint, and (2) A {\em distributed model} where the data samples reside at multiple machines and the goal is to minimize the communication cost of the protocol. In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). Moreover, we show nearly-tight lower bounds on (1) the sample complexity of any one-pass streaming tester for uniformity, subject to the memory constraint,  and (2) the communication cost of any uniformity testing protocol,  in a restricted “one-pass” model of communication.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/diakonikolas19a.html
  PDF: http://proceedings.mlr.press/v99/diakonikolas19a/diakonikolas19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-diakonikolas19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Themis
    family: Gouleakis
  - given: Daniel M.
    family: Kane
  - given: Sankeerth
    family: Rao
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1070-1106
  id: diakonikolas19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1070
  lastpage: 1106
  published: 2019-06-25 00:00:00 +0000
- title: 'Testing Identity of Multidimensional Histograms'
  abstract: 'We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p: D \to \mathbb{R}_+$, where $D \subseteq \mathbb{R}^d$, is called a $k$-histogram if there exists a partition of the domain  into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental nonparametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, let $q$ be an unknown $d$-dimensional $k$-histogram distribution in fixed dimension $d$,  and $p$ be an explicitly given $d$-dimensional $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p-q\|_1 \geq \epsilon$. We design an algorithm for this hypothesis testing problem with sample complexity $O((\sqrt{k}/\epsilon^2) 2^{d/2} \log^{2.5 d}(k/\epsilon))$ that runs in sample-polynomial time. Our algorithm is robust to model misspecification, i.e., succeeds even if $q$ is only promised  to be {\em close} to a $k$-histogram. Moreover, for $k = 2^{\Omega(d)}$, we show a sample complexity lower bound of $(\sqrt{k}/\epsilon^2) \cdot \Omega(\log(k)/d)^{d-1}$ when $d\geq 2$.  That is, for any fixed dimension $d$, our upper and lower bounds are nearly matching. Prior to our work, the sample complexity of the $d=1$ case was well-understood, but no algorithm with sub-learning sample complexity was known, even for $d=2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/diakonikolas19b.html
  PDF: http://proceedings.mlr.press/v99/diakonikolas19b/diakonikolas19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-diakonikolas19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Daniel M.
    family: Kane
  - given: John
    family: Peebles
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1107-1131
  id: diakonikolas19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1107
  lastpage: 1131
  published: 2019-06-25 00:00:00 +0000
- title: 'Lower Bounds for Parallel and Randomized Convex Optimization'
  abstract: 'We study the question of whether parallelization in the exploration of the feasible set can be used to speed up convex optimization, in the local oracle model of computation. We show that the answer is negative for both deterministic and randomized algorithms applied to essentially any of the interesting geometries and nonsmooth, weakly-smooth, or smooth objective functions. In particular, we show that it is not possible to obtain a polylogarithmic (in the sequential complexity of the problem) number of parallel rounds  with a polynomial (in the dimension) number of queries per round. In the majority of these settings and when the dimension of the space is polynomial in the inverse target accuracy, our lower bounds match the oracle complexity of sequential convex optimization, up to at most a logarithmic factor in the dimension, which makes them (nearly) tight. Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean ($\ell_2$) and $\ell_\infty$ spaces. Our work provides a more general and streamlined approach for proving lower bounds in the setting of parallel convex optimization. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/diakonikolas19c.html
  PDF: http://proceedings.mlr.press/v99/diakonikolas19c/diakonikolas19c.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-diakonikolas19c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jelena
    family: Diakonikolas
  - given: Cristóbal
    family: Guzmán
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1132-1157
  id: diakonikolas19c
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1132
  lastpage: 1157
  published: 2019-06-25 00:00:00 +0000
- title: 'On the Performance of Thompson Sampling on Logistic Bandits'
  abstract: 'We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball.  While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is independent of $\beta$.  Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$.  We also establish a $\tilde{O}(\sqrt{d\eta T}/\Delta)$ bound that applies more broadly, where $\Delta$ is the worst-case optimal log-odds and $\eta$ is the “fragility dimension,” a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others.  We demonstrate that the fragility dimension plays an essential role by showing that, for any $\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\Delta)\cdot T^{1-\epsilon}$ regret.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/dong19a.html
  PDF: http://proceedings.mlr.press/v99/dong19a/dong19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-dong19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Shi
    family: Dong
  - given: Tengyu
    family: Ma
  - given: Benjamin
    family: Van Roy
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1158-1160
  id: dong19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1158
  lastpage: 1160
  published: 2019-06-25 00:00:00 +0000
- title: 'Lower Bounds for Locally Private Estimation via Communication Complexity'
  abstract: 'We develop lower bounds for estimation under local privacy constraints—including differential privacy and its relaxations to approximate or Rényi differential privacy—by showing an equivalence between private estimation and communication-restricted estimation problems. Our results apply to arbitrarily interactive privacy mechanisms, and they also give sharp lower bounds for all levels of differential privacy protections, that is, privacy mechanisms with privacy levels $\varepsilon \in [0, \infty)$.  As a particular consequence of our results, we show that the minimax mean-squared error for estimating the mean of a bounded or Gaussian random vector in $d$ dimensions scales as $\frac{d}{n} \cdot \frac{d}{ \min\{\varepsilon, \varepsilon^2\}}$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/duchi19a.html
  PDF: http://proceedings.mlr.press/v99/duchi19a/duchi19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-duchi19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: John
    family: Duchi
  - given: Ryan
    family: Rogers
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1161-1191
  id: duchi19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1161
  lastpage: 1191
  published: 2019-06-25 00:00:00 +0000
- title: 'Sharp Analysis for Nonconvex SGD Escaping from Saddle Points'
  abstract: 'In this paper, we give a sharp analysis for  Stochastic Gradient Descent (SGD)  and  prove that SGD is able to efficiently escape from  saddle points and find an $(\epsilon, O(\epsilon^{0.5}))$-approximate second-order stationary point in $\tilde{O}(\epsilon^{-3.5})$ stochastic gradient computations for generic nonconvex optimization problems, when the objective function satisfies  gradient-Lipschitz, Hessian-Lipschitz, and dispersive noise assumptions. This  result subverts the classical belief that  SGD requires at least $O(\epsilon^{-4})$ stochastic gradient computations for obtaining an $(\epsilon,O(\epsilon^{0.5}))$-approximate second-order stationary point. Such SGD rate matches, up to a polylogarithmic factor of problem-dependent parameters, the rate of most accelerated  nonconvex stochastic optimization algorithms that adopt additional techniques, such as Nesterov’s momentum acceleration, negative curvature search, as well as quadratic and cubic regularization tricks. Our novel analysis gives new insights into nonconvex SGD and can be potentially generalized to a broad class of stochastic optimization algorithms.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/fang19a.html
  PDF: http://proceedings.mlr.press/v99/fang19a/fang19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-fang19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cong
    family: Fang
  - given: Zhouchen
    family: Lin
  - given: Tong
    family: Zhang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1192-1234
  id: fang19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1192
  lastpage: 1234
  published: 2019-06-25 00:00:00 +0000
- title: 'Achieving the Bayes Error Rate in Stochastic Block Model by SDP, Robustly'
  abstract: 'We study the statistical performance of the semidefinite programming (SDP) relaxation approach for clustering under the binary symmetric Stochastic Block Model (SBM). We show that the SDP achieves an error rate of the form  $\exp\left[-(1-o(1))\frac{n I}{2}\right]$, where $I$ is an appropriate information-theoretic measure of the signal-to-noise ratio. This bound matches the minimax lower bound on the optimal Bayes error rate for this problem, and improves upon existing results that are sub-optimal by a multiplicative constant in the exponent. As a corollary, our result implies that SDP achieves the optimal exact recovery threshold with the correct leading constant. We further show that this error rate of SDP is robust; that is, it remains unchanged under the so-called semirandom model where the graph is modified by a monotone adversary, as well as under the setting with heterogeneous edge probabilities. Our proof is based on a novel primal-dual analysis of the SDP.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/fei19a.html
  PDF: http://proceedings.mlr.press/v99/fei19a/fei19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-fei19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yingjie
    family: Fei
  - given: Yudong
    family: Chen
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1235-1269
  id: fei19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1235
  lastpage: 1269
  published: 2019-06-25 00:00:00 +0000
- title: 'High probability generalization bounds for uniformly stable algorithms with nearly optimal rate'
  abstract: 'Algorithmic stability is a classical approach to understanding and analysis of the generalization error of learning algorithms. A notable weakness of most stability-based generalization bounds is that they hold only in expectation. Generalization with high probability has been established in a landmark paper of Bousquet and Elisseeff (2001) albeit at the expense of an additional $\sqrt{n}$ factor in the bound. Specifically, their bound on the estimation error of any $\gamma$-uniformly stable learning algorithm on $n$ samples and range in $[0,1]$ is $O(\gamma \sqrt{n \log(1/\delta)} + \sqrt{\log(1/\delta)/n})$ with probability  $\geq 1-\delta$. The $\sqrt{n}$ overhead makes the bound vacuous in the common settings where $\gamma \geq 1/\sqrt{n}$. A stronger bound was recently proved by the authors (Feldman and Vondrak, 2018) that reduces the overhead to at most $O(n^{1/4})$. Still, both of these results give optimal generalization bounds only when $\gamma = O(1/n)$. We prove a nearly tight bound of $O(\gamma \log(n)\log(n/\delta) + \sqrt{\log(1/\delta)/n})$ on the estimation error of any $\gamma$-uniformly stable algorithm. It implies that for algorithms that are uniformly stable with $\gamma = O(1/\sqrt{n})$, estimation error is essentially the same as the sampling error. Our result leads to the first high-probability generalization bounds for multi-pass stochastic gradient descent and regularized ERM for stochastic convex problems with nearly optimal rate — resolving open problems in prior work. Our proof technique is new and we introduce several analysis tools that might find additional applications.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/feldman19a.html
  PDF: http://proceedings.mlr.press/v99/feldman19a/feldman19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-feldman19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vitaly
    family: Feldman
  - given: Jan
    family: Vondrak
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1270-1279
  id: feldman19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1270
  lastpage: 1279
  published: 2019-06-25 00:00:00 +0000
- title: 'Sum-of-squares meets square loss: Fast rates for agnostic tensor completion'
  abstract: 'We study tensor completion in the agnostic setting. In the classical tensor completion problem, we receive $n$ entries of an unknown rank-$r$ tensor and wish to exactly complete the remaining entries. In agnostic tensor completion, we make \emph{no assumption} on the rank of the unknown tensor, but attempt to predict unknown entries as well as the best rank-$r$ tensor. For agnostic learning of third-order tensors with the square loss, we give the first polynomial time algorithm that obtains a “fast” (i.e., $O(1/n)$-type) rate improving over the rate obtained by reduction to matrix completion. Our prediction error rate to compete with the best $d\times{}d\times{}d$ tensor of rank-$r$ is $\tilde{O}(r^{2}d^{3/2}/n)$. We also obtain an exact oracle inequality that trades off estimation and approximation error.  Our algorithm is based on the degree-six sum-of-squares relaxation of the tensor nuclear norm. The key feature of our analysis is to show that a certain characterization for the subgradient of the tensor nuclear norm can be encoded in the sum-of-squares proof system. This unlocks the standard toolbox for localization of empirical processes under the square loss, and allows us to establish restricted eigenvalue-type guarantees for various tensor regression models, with tensor completion as a special case. The new analysis of the relaxation complements Barak and Moitra (2016), who gave slow rates for agnostic tensor completion, and Potechin and Steurer (2017), who gave exact recovery guarantees for the noiseless setting. Our techniques are user-friendly, and we anticipate that they will find use elsewhere.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/foster19a.html
  PDF: http://proceedings.mlr.press/v99/foster19a/foster19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-foster19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dylan J.
    family: Foster
  - given: Andrej
    family: Risteski
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1280-1318
  id: foster19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1280
  lastpage: 1318
  published: 2019-06-25 00:00:00 +0000
- title: 'The Complexity of Making the Gradient Small in Stochastic Convex Optimization'
  abstract: 'We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points $(\|\nabla F(x)\|\leq\epsilon$ in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent “recursive regularization” technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/foster19b.html
  PDF: http://proceedings.mlr.press/v99/foster19b/foster19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-foster19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dylan J.
    family: Foster
  - given: Ayush
    family: Sekhari
  - given: Ohad
    family: Shamir
  - given: Nathan
    family: Srebro
  - given: Karthik
    family: Sridharan
  - given: Blake
    family: Woodworth
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1319-1345
  id: foster19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1319
  lastpage: 1345
  published: 2019-06-25 00:00:00 +0000
- title: 'Statistical Learning with a Nuisance Component'
  abstract: 'We provide excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target model depends on an unknown model that must be to be estimated from data (a “nuisance model”). We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target model and one for the nuisance model. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning literature to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class. We characterize conditions on the metric entropy such that oracle rates—rates of the same order as if we knew the nuisance model—are achieved. We also analyze the rates achieved by specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance in the literature: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/foster19c.html
  PDF: http://proceedings.mlr.press/v99/foster19c/foster19c.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-foster19c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dylan J.
    family: Foster
  - given: Vasilis
    family: Syrgkanis
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1346-1348
  id: foster19c
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1346
  lastpage: 1348
  published: 2019-06-25 00:00:00 +0000
- title: 'On the Regret Minimization of Nonconvex Online Gradient Ascent for Online PCA'
  abstract: 'In this paper we focus on the problem of Online Principal Component Analysis in the regret minimization framework. For this problem, all existing regret minimization algorithms for the fully-adversarial setting are based on a positive semidefinite convex relaxation, and hence require quadratic memory and SVD computation (either thin of full) on each iteration, which amounts to at least quadratic runtime per iteration. This is in stark contrast to a corresponding stochastic i.i.d. variant of the problem, which was studied extensively lately, and admits very efficient gradient ascent algorithms that work directly on the natural non-convex formulation of the problem, and hence require only linear memory and linear runtime per iteration. This raises the question: \textit{can non-convex online gradient ascent algorithms be shown to minimize regret in online adversarial settings?} In this paper we take a step forward towards answering this question. We introduce an \textit{adversarially-perturbed spiked-covariance model} in which, each data point is assumed to follow a fixed stochastic distribution with a non-zero spectral gap in the covariance matrix,  but is then perturbed with some adversarial vector. This model is a natural extension of a well studied standard \textit{stochastic} setting that allows for non-stationary (adversarial) patterns to arise in the data  and hence, might serve as a significantly better approximation for real-world data-streams. We show that in an interesting regime of parameters, when the non-convex online gradient ascent algorithm is initialized with a “warm-start" vector, it provably minimizes the regret with high probability. We further discuss the possibility of computing such a “warm-start" vector, and also the use of regularization to obtain fast regret rates. Our theoretical findings are supported by empirical experiments on both synthetic and real-world data.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/garber19a.html
  PDF: http://proceedings.mlr.press/v99/garber19a/garber19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-garber19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dan
    family: Garber
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1349-1373
  id: garber19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1349
  lastpage: 1373
  published: 2019-06-25 00:00:00 +0000
- title: 'Optimal Tensor Methods in Smooth Convex and Uniformly ConvexOptimization'
  abstract: 'We consider convex optimization problems with the objective function having Lipshitz-continuous $p$-th order derivative, where $p\geq 1$. We propose a new tensor method, which closes the gap between the lower  $\Omega\left(\e^{-\frac{2}{3p+1}} \right)$ and upper $O\left(\e^{-\frac{1}{p+1}} \right)$ iteration complexity bounds for this class of optimization problems. We also consider uniformly convex functions, and show how the proposed method can be accelerated under this additional assumption. Moreover, we introduce a $p$-th order condition number which naturally arises in the complexity analysis of tensor methods under this assumption.  Finally, we make a numerical study of the proposed optimal method and show that in practice it is faster than the best known accelerated tensor method. We also compare the performance of tensor methods for $p=2$ and $p=3$ and show that the 3rd-order method is superior to the 2nd-order method in practice. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/gasnikov19a.html
  PDF: http://proceedings.mlr.press/v99/gasnikov19a/gasnikov19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-gasnikov19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alexander
    family: Gasnikov
  - given: Pavel
    family: Dvurechensky
  - given: Eduard
    family: Gorbunov
  - given: Evgeniya
    family: Vorontsova
  - given: Daniil
    family: Selikhanovych
  - given: César A.
    family: Uribe
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1374-1391
  id: gasnikov19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1374
  lastpage: 1391
  published: 2019-06-25 00:00:00 +0000
- title: 'Near Optimal Methods for Minimizing Convex Functions with Lipschitz $p$-th Derivatives'
  abstract: 'In this merged paper, we consider the problem of minimizing a convex function with Lipschitz-continuous $p$-th order derivatives. Given an oracle which when queried at a point returns the first $p$-derivatives of the function at that point we provide some methods which compute an $\e$ approximate minimizer in $O\left(\e^{-\frac{2}{3p+1}} \right)$ iterations. These methods match known lower bounds up to polylogarithmic factors for constant $p$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/gasnikov19b.html
  PDF: http://proceedings.mlr.press/v99/gasnikov19b/gasnikov19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-gasnikov19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alexander
    family: Gasnikov
  - given: Pavel
    family: Dvurechensky
  - given: Eduard
    family: Gorbunov
  - given: Evgeniya
    family: Vorontsova
  - given: Daniil
    family: Selikhanovych
  - given: César A.
    family: Uribe
  - given: Bo
    family: Jiang
  - given: Haoyue
    family: Wang
  - given: Shuzhong
    family: Zhang
  - given: Sébastien
    family: Bubeck
  - given: Qijia
    family: Jiang
  - given: Yin Tat
    family: Lee
  - given: Yuanzhi
    family: Li
  - given: Aaron
    family: Sidford
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1392-1393
  id: gasnikov19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1392
  lastpage: 1393
  published: 2019-06-25 00:00:00 +0000
- title: 'Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization'
  abstract: 'Variance reduction techniques like SVRG provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, these techniques can also find a first-order stationary point (with small gradient). However, in nonconvex optimization it is often crucial to find a second-order stationary point (with small gradient and almost PSD hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an $\epsilon$-second-order stationary point using only $\widetilde{O}(n^{2/3}/\epsilon^2+n/\epsilon^{1.5})$ stochastic gradients. To our best knowledge, this is the first second-order guarantee for a simple variant of SVRG. The running time almost matches the known guarantees for finding $\epsilon$-first-order stationary points.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/ge19a.html
  PDF: http://proceedings.mlr.press/v99/ge19a/ge19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-ge19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Rong
    family: Ge
  - given: Zhize
    family: Li
  - given: Weiyao
    family: Wang
  - given: Xiang
    family: Wang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1394-1448
  id: ge19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1394
  lastpage: 1448
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning Ising Models with Independent Failures'
  abstract: ' We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability $p$. Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node by node basis. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/goel19a.html
  PDF: http://proceedings.mlr.press/v99/goel19a/goel19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-goel19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Surbhi
    family: Goel
  - given: Daniel M.
    family: Kane
  - given: Adam R.
    family: Klivans
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1449-1469
  id: goel19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1449
  lastpage: 1469
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning Neural Networks with Two Nonlinear Layers in Polynomial Time'
  abstract: ' We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU).  The algorithm succeeds with respect to {\em any} distribution on the unit ball in $n$ dimensions (hidden weight vectors in the first layer have unit norm).  This is the first efficient algorithm for learning a general class of neural networks with more than one nonlinear layer that makes no restrictions on the VC-dimension of the network. Algorithms for learning relaxations of our model (e.g., allowing larger weight vectors in the first layer) would lead to breakthroughs on notoriously hard problems in Boolean function learning.  Thus, our results are “best possible” with respect to current techniques. Our algorithm– {\em Alphatron}– is an iterative update rule that combines isotonic regression with kernel methods.  We use this algorithm to give a simple reduction for translating PAC learning algorithms to the more general, real-valued setting of {\em probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d. noise-tolerance.  This substantially improves many longstanding results for PAC learning Boolean functions. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/goel19b.html
  PDF: http://proceedings.mlr.press/v99/goel19b/goel19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-goel19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Surbhi
    family: Goel
  - given: Adam R.
    family: Klivans
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1470-1499
  id: goel19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1470
  lastpage: 1499
  published: 2019-06-25 00:00:00 +0000
- title: 'When can unlabeled data improve the learning rate?'
  abstract: 'In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the \emph{minimax} learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate examples where these conditions can be met, in particular showing rate changes from $1/\sqrt{\ell}$ to $e^{-c\ell}$ and from $1/\sqrt{\ell}$ to $1/\ell$. These results improve our understanding of what is and isn’t possible in semi-supervised learning.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/gopfert19a.html
  PDF: http://proceedings.mlr.press/v99/gopfert19a/gopfert19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-gopfert19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Christina
    family: Göpfert
  - given: Shai
    family: Ben-David
  - given: Olivier
    family: Bousquet
  - given: Sylvain
    family: Gelly
  - given: Ilya
    family: Tolstikhin
  - given: Ruth
    family: Urner
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1500-1518
  id: gopfert19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1500
  lastpage: 1518
  published: 2019-06-25 00:00:00 +0000
- title: 'Sampling and Optimization on Convex Sets in Riemannian Manifolds of Non-Negative Curvature'
  abstract: 'The Euclidean space notion of convex sets (and functions) generalizes to Riemannian manifolds in a natural sense and is called geodesic convexity. Extensively studied computational problems such as convex optimization and sampling in convex sets also have meaningful counterparts in the manifold setting. Geodesically convex optimization is a well-studied problem with ongoing research and considerable recent interest in machine learning and theoretical computer science. In this paper, we study sampling and convex optimization problems over manifolds of non-negative curvature proving polynomial running time in the dimension and other relevant parameters. Our algorithms assume a warm start. We first present a random walk based sampling algorithm and then combine it with simulated annealing for solving convex optimization problems. To our knowledge, these are the first algorithms in the general setting of positively curved manifolds with provable polynomial guarantees under reasonable assumptions, and the first study of the connection between sampling and optimization in this setting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/goyal19a.html
  PDF: http://proceedings.mlr.press/v99/goyal19a/goyal19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-goyal19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Navin
    family: Goyal
  - given: Abhishek
    family: Shetty
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1519-1561
  id: goyal19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1519
  lastpage: 1561
  published: 2019-06-25 00:00:00 +0000
- title: 'Better Algorithms for Stochastic Bandits with Adversarial Corruptions'
  abstract: 'We study the stochastic multi-armed bandits problem in the presence of adversarial corruption. We present a new algorithm for this problem whose regret is nearly optimal, substantially improving upon previous work. Our algorithm is agnostic to the level of adversarial contamination and can tolerate a significant amount of corruption with virtually no degradation in performance.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/gupta19a.html
  PDF: http://proceedings.mlr.press/v99/gupta19a/gupta19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-gupta19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Anupam
    family: Gupta
  - given: Tomer
    family: Koren
  - given: Kunal
    family: Talwar
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1562-1578
  id: gupta19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1562
  lastpage: 1578
  published: 2019-06-25 00:00:00 +0000
- title: 'Tight analyses for non-smooth stochastic gradient descent'
  abstract: 'Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after $T$ steps of stochastic gradient descent, the error of the final iterate is $O(\log(T)/T)$ \emph{with high probability}. We also construct a function from this class for which the error of the final iterate of \emph{deterministic} gradient descent is $\Omega(\log(T)/T)$. This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). An intermediate step of our analysis proves that the suffix averaging method achieves error $O(1/T)$ \emph{with high probability}, which is optimal (for any first-order optimization method). This improves results of Rakhlin et al. (2012) and Hazan and Kale (2014), both of which achieved error $O(1/T)$, but only in expectation, and achieved a high probability error bound of $O(\log \log(T)/T)$, which is suboptimal.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/harvey19a.html
  PDF: http://proceedings.mlr.press/v99/harvey19a/harvey19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-harvey19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nicholas J. A.
    family: Harvey
  - given: Christopher
    family: Liaw
  - given: Yaniv
    family: Plan
  - given: Sikander
    family: Randhawa
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1579-1613
  id: harvey19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1579
  lastpage: 1613
  published: 2019-06-25 00:00:00 +0000
- title: 'Reasoning in Bayesian Opinion Exchange Networks Is PSPACE-Hard'
  abstract: ' We study the Bayesian model of opinion exchange of fully rational agents arranged on a network. In this model, the agents receive private signals that are indicative of an unknown state of the world.  Then, they repeatedly announce the state of the world they consider most likely to their neighbors, at the same time updating their beliefs based on their neighbors’ announcements. This model is extensively studied in economics since the work of Aumann (1976) and Geanakoplos and Polemarchakis (1982). It is known that the agents eventually agree with high probability on any network. It is often argued that the computations needed by agents in this model are difficult, but prior to our results there was no rigorous work showing this hardness.  We show that it is  $\mathsf{PSPACE}$-hard for the agents to compute their actions in this model. Furthermore, we show that it is equally difficult even to approximate an agent’s posterior: It is $\mathsf{PSPACE}$-hard to distinguish between the posterior being almost entirely concentrated on one state of the world or another.  '
  volume: 99
  URL: https://proceedings.mlr.press/v99/hazla19a.html
  PDF: http://proceedings.mlr.press/v99/hazla19a/hazla19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-hazla19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jan
    family: Hązła
  - given: Ali
    family: Jadbabaie
  - given: Elchanan
    family: Mossel
  - given: M. Amin
    family: Rahimian
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1614-1648
  id: hazla19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1614
  lastpage: 1648
  published: 2019-06-25 00:00:00 +0000
- title: 'How Hard is Robust Mean Estimation?'
  abstract: 'Robust mean estimation is the problem of estimating the mean $\mu \in \mathbb{R}^d$ of a $d$-dimensional distribution $D$ from a list of independent samples, an $\varepsilon$-fraction of which have been arbitrarily corrupted by a malicious adversary. Recent algorithmic progress has resulted in the first polynomial-time algorithms which achieve \emph{dimension-independent} rates of error: for instance, if $D$ has covariance $I$, in polynomial-time one may find $\hat{\mu}$ with $\|\mu - \hat{\mu}\| \leq O(\sqrt{\varepsilon})$. However, error rates achieved by current polynomial-time algorithms, while dimension-independent, are sub-optimal in many natural settings, such as when $D$ is sub-Gaussian, or has bounded $4$-th moments. In this work we give worst-case complexity-theoretic evidence that improving on the error rates of current polynomial-time algorithms for robust mean estimation may be computationally intractable in natural settings. We show that several natural approaches to improving error rates of current polynomial-time robust mean estimation algorithms would imply efficient algorithms for the small-set expansion problem, refuting Raghavendra and Steurer’s small-set expansion hypothesis (so long as $P \neq NP$). We also give the first direct reduction to the robust mean estimation problem, starting from a plausible but nonstandard variant of the small-set expansion problem.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/hopkins19a.html
  PDF: http://proceedings.mlr.press/v99/hopkins19a/hopkins19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-hopkins19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Samuel B.
    family: Hopkins
  - given: Jerry
    family: Li
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1649-1682
  id: hopkins19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1649
  lastpage: 1682
  published: 2019-06-25 00:00:00 +0000
- title: 'A Robust Spectral Algorithm for Overcomplete Tensor Decomposition'
  abstract: 'We give a spectral algorithm for decomposing overcomplete order-4 tensors, so long as their components satisfy an algebraic non-degeneracy condition that holds for nearly all (all but an algebraic set of measure $0$) tensors over $(\mathbb{R}^d)^{\otimes 4}$ with rank $n \le d^2$. Our algorithm is robust to adversarial perturbations of bounded spectral norm. Our algorithm is inspired by one which uses the sum-of-squares semidefinite programming hierarchy (Ma, Shi, and Steurer STOC’16), and we achieve comparable robustness and overcompleteness guarantees under similar algebraic assumptions. However, our algorithm avoids semidefinite programming and may be implemented as a series of basic linear-algebraic operations. We consequently obtain a much faster running time than semidefinite programming methods: our algorithm runs in time $\tilde O(n^2d^3) \le \tilde O(d^7)$, which is subquadratic in the input size $d^4$ (where we have suppressed factors related to the condition number of the input tensor).'
  volume: 99
  URL: https://proceedings.mlr.press/v99/hopkins19b.html
  PDF: http://proceedings.mlr.press/v99/hopkins19b/hopkins19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-hopkins19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Samuel B.
    family: Hopkins
  - given: Tselil
    family: Schramm
  - given: Jonathan
    family: Shi
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1683-1722
  id: hopkins19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1683
  lastpage: 1722
  published: 2019-06-25 00:00:00 +0000
- title: 'Sample-Optimal Low-Rank Approximation of Distance Matrices'
  abstract: 'A distance matrix $A \in \mathbb{R}^{n \times m}$ represents all pairwise distances, $A_{i,j} = d(x_i,y_j)$, between two point sets $x_1,\dotsc,x_n$ and $y_1,\dotsc,y_m$ in an arbitrary metric space $(\mathcal{Z},d)$. Such matrices arise in various computational contexts such as learning image manifolds, handwriting recognition, and multi-dimensional unfolding. In this work we study algorithms for low-rank approximation of distance matrices. Recent work by Bakshi and Woodruff (NeurIPS 2018) showed it is possible to compute a rank-$k$ approximation of a distance matrix in time $O((n+m)^{1+\gamma}) \mathrm{poly}(k,1/\epsilon)$, where $\epsilon>0$ is an error parameter and $\gamma>0$ is an arbitrarily small constant. Notably, their bound is sublinear in the matrix size, which is unachieveable for general matrices. We present an algorithm that is both simpler and more efficient. It reads only $O((n+m)k/\epsilon)$ entries of the input matrix, and has a running time of $O(n+m) \cdot \mathrm{poly}(k,1/\epsilon)$. We complement the sample complexity of our algorithm with a matching lower bound on the number of entries that must be ready by any algorithm. We provide experimental results to validate the approximation quality and running time of our algorithm'
  volume: 99
  URL: https://proceedings.mlr.press/v99/indyk19a.html
  PDF: http://proceedings.mlr.press/v99/indyk19a/indyk19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-indyk19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Pitor
    family: Indyk
  - given: Ali
    family: Vakilian
  - given: Tal
    family: Wagner
  - given: David P
    family: Woodruff
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1723-1751
  id: indyk19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1723
  lastpage: 1751
  published: 2019-06-25 00:00:00 +0000
- title: 'Making the Last Iterate of SGD Information Theoretically Optimal'
  abstract: 'Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) \emph{averages} of iterates and obtains information theoretically optimal bounds on suboptimality, the \emph{last point} of SGD is, by far, the most preferred choice in practice. The best known results for last point of SGD (Shamir and Zhang, 2013) however, are suboptimal compared to information theoretic lower bounds by a $\log T$ factor, where $T$ is the number of iterations. Harvey et. al (2018) shows that in fact, this additional $\log T$ factor is tight for standard step size sequences of $\OTheta{\frac{1}{\sqrt{t}}}$ and $\OTheta{\frac{1}{t}}$ for non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) when applied to non-smooth, convex functions, the best known step-size sequences still lead to $O(\log T)$-suboptimal convergence rates (on the final iterate). The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of \emph{last point} of SGD as well as GD. We achieve this by designing a modification scheme, that converts one sequence of step sizes to another so that the last point of SGD/GD with modified sequence has the same suboptimality guarantees as the average of SGD/GD with original sequence. We also show that our result holds with high-probability. We validate our results through simulations which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/jain19a.html
  PDF: http://proceedings.mlr.press/v99/jain19a/jain19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-jain19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Prateek
    family: Jain
  - given: Dheeraj
    family: Nagaraj
  - given: Praneeth
    family: Netrapalli
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1752-1755
  id: jain19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1752
  lastpage: 1755
  published: 2019-06-25 00:00:00 +0000
- title: 'Accuracy-Memory Tradeoffs and Phase Transitions in Belief Propagation'
  abstract: 'The analysis of Belief Propagation and other algorithms for the {\em reconstruction problem} plays a key role in the analysis of community detection in inference on graphs, phylogenetic reconstruction in bioinformatics, and the cavity method in statistical physics.  We prove a conjecture of Evans, Kenyon, Peres, and Schulman (2000) which states that any bounded memory message passing algorithm is statistically much weaker than Belief Propagation for the reconstruction problem. More formally, any recursive algorithm with bounded memory for the  reconstruction problem on the trees with the binary symmetric channel has a phase transition strictly below the Belief Propagation threshold, also known as the Kesten-Stigum bound. The proof combines in novel fashion tools from recursive reconstruction, information theory, and optimal transport, and also establishes an asymptotic normality result for BP and other message-passing algorithms near the critical threshold.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/jain19b.html
  PDF: http://proceedings.mlr.press/v99/jain19b/jain19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-jain19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vishesh
    family: Jain
  - given: Frederic
    family: Koehler
  - given: Jingbo
    family: Liu
  - given: Elchanan
    family: Mossel
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1756-1771
  id: jain19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1756
  lastpage: 1771
  published: 2019-06-25 00:00:00 +0000
- title: 'The implicit bias of gradient descent on nonseparable data'
  abstract: 'Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\cO(\nicefrac{\ln\ln t }{\ln t})$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\cO(\nicefrac{(\ln t)^2}{\sqrt{t}})$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/ji19a.html
  PDF: http://proceedings.mlr.press/v99/ji19a/ji19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-ji19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ziwei
    family: Ji
  - given: Matus
    family: Telgarsky
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1772-1798
  id: ji19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1772
  lastpage: 1798
  published: 2019-06-25 00:00:00 +0000
- title: 'An Optimal High-Order Tensor Method for Convex Optimization'
  abstract: 'This paper is concerned with finding an  optimal algorithm for minimizing a composite convex objective function. The basic setting is that the objective is the sum of two convex functions: the first function is smooth with up to the d-th order derivative information available, and the second function is possibly non-smooth, but its proximal tensor mappings can be computed approximately in an efficient manner. The problem is to find – in that setting – the best possible (optimal) iteration complexity for convex optimization. Along that line, for the smooth case (without the second non-smooth part in the objective), Nesterov (1983) proposed an optimal algorithm for the first-order methods (d=1) with iteration complexity O( 1 / k^2 ).  A high-order tensor algorithm with iteration complexity of O( 1 / k^{d+1} ) was proposed by Baes (2009) and Nesterov (2018). In this paper, we propose a new high-order tensor algorithm for the general composite case, with the iteration complexity of O( 1 / k^{(3d+1)/2} ), which matches the lower bound for the d-th order methods as established in Nesterov (2018) and Shamir et al. (2018), and hence is optimal. Our approach is based on the  Accelerated Hybrid Proximal Extragradient (A-HPE) framework proposed in Monteiro and Svaiter (2013), where a bisection procedure is installed for each A-HPE iteration. At each bisection step a proximal tensor subproblem is approximately solved, and the total number of bisection steps per A-HPE iteration is bounded by a logarithmic factor in the precision required.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/jiang19a.html
  PDF: http://proceedings.mlr.press/v99/jiang19a/jiang19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-jiang19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Bo
    family: Jiang
  - given: Haoyue
    family: Wang
  - given: Shuzhong
    family: Zhang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1799-1801
  id: jiang19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1799
  lastpage: 1801
  published: 2019-06-25 00:00:00 +0000
- title: 'Parameter-Free Online Convex Optimization with Sub-Exponential Noise'
  abstract: ' We consider the problem of unconstrained online convex optimization (OCO) with sub-exponential noise, a strictly more general problem than the standard OCO. In this setting, the learner receives a subgradient of the loss functions corrupted by sub-exponential noise and strives to achieve optimal regret guarantee, without knowledge of the competitor norm, i.e., in a parameter-free way. Recently, Cutkosky and Boahen (COLT 2017) proved that, given unbounded subgradients, it is impossible to guarantee a sublinear regret due to an exponential penalty. This paper shows that it is possible to go around the lower bound by allowing the observed subgradients to be unbounded via stochastic noise. However, the presence of unbounded noise in unconstrained OCO is challenging; existing algorithms do not provide near-optimal regret bounds or fail to have a guarantee. So, we design a novel parameter-free OCO algorithm for Banach space, which we call BANCO, via a reduction to betting on noisy coins. We show that BANCO achieves the optimal regret rate in our problem. Finally, we show the application of our results to obtain a parameter-free locally private stochastic subgradient descent algorithm, and the connection to the law of iterated logarithms.  '
  volume: 99
  URL: https://proceedings.mlr.press/v99/jun19a.html
  PDF: http://proceedings.mlr.press/v99/jun19a/jun19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-jun19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Kwang-Sung
    family: Jun
  - given: Francesco
    family: Orabona
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1802-1823
  id: jun19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1802
  lastpage: 1823
  published: 2019-06-25 00:00:00 +0000
- title: 'Sample complexity of partition identification using multi-armed bandits'
  abstract: 'Given a vector of probability distributions, or arms, each of which can be sampled independently, we consider the problem of identifying the partition to which this vector belongs from a finitely partitioned universe of such vector of distributions. We study this as a pure exploration problem in multi-armed bandit settings and develop sample complexity bounds on the total mean number of samples required for identifying the correct partition with high probability.  This framework subsumes  well studied problems such as finding the best arm or the best few arms. We consider distributions belonging to the single parameter exponential family and primarily consider partitions  where the vector of means of arms lie either in a given  set or its complement. The sets considered  correspond to distributions where there exists  a mean above a specified threshold, where the set is a half space and where either the set or its complement  is a polytope, or more generally, a convex set.  In these settings, we characterize the lower bounds on mean number of samples for each arm  highlighting  their dependence on the problem geometry. Further, inspired by the lower bounds, we propose algorithms  that can match these bounds asymptotically with decreasing probability of error.  Applications of this framework may be diverse.  We briefly discuss a few associated with  simulation in finance.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/juneja19a.html
  PDF: http://proceedings.mlr.press/v99/juneja19a/juneja19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-juneja19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sandeep
    family: Juneja
  - given: Subhashini
    family: Krishnasamy
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1824-1852
  id: juneja19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1824
  lastpage: 1852
  published: 2019-06-25 00:00:00 +0000
- title: 'Privately Learning High-Dimensional Distributions'
  abstract: 'We present novel, computationally efficient, and differentially private algorithms for two fundamental high-dimensional learning problems: learning a multivariate Gaussian and learning a product distribution over the Boolean hypercube in total variation distance.  The sample complexity of our algorithms nearly matches the sample complexity of the optimal non-private learners for these tasks in a wide range of parameters, showing that privacy comes essentially for free for these problems.  In particular, in contrast to previous approaches, our algorithm for learning Gaussians does not require strong a priori bounds on the range of the parameters.  Our algorithms introduce a novel technical approach to reducing the sensitivity of the estimation procedure that we call recursive private preconditioning.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/kamath19a.html
  PDF: http://proceedings.mlr.press/v99/kamath19a/kamath19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-kamath19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gautam
    family: Kamath
  - given: Jerry
    family: Li
  - given: Vikrant
    family: Singhal
  - given: Jonathan
    family: Ullman
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1853-1902
  id: kamath19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1853
  lastpage: 1902
  published: 2019-06-25 00:00:00 +0000
- title: 'On Communication Complexity of Classification Problems'
  abstract: 'This work studies distributed learning in the spirit of Yao’s model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. To naturally fit into the framework of learning theory, the players can send each other examples (as well as bits) where each example/bit costs one unit of communication.   This enables a uniform treatment of infinite classes such as half-spaces in $\R^d$, which are ubiquitous in machine learning.  We study several fundamental questions in this model.  For example, we provide combinatorial characterizations of the classes that can be learned with efficient communication in the proper-case as well as in the improper-case. These findings imply unconditional separations in this context between various  learning tasks, e.g. realizable versus agnostic learning, proper versus improper learning, etcetera. %They also imply lower bounds that match the performance %of algorithm from previous works. The derivation of these results hinges on a type of decision problems we term “{\it realizability problems}” where the goal is deciding whether a distributed input sample is consistent with an hypothesis from a pre-specified class. From a technical perspective,  the protocols we devise (i.e. the upper bounds) are based on ideas from machine learning and the impossibility results (i.e. the lower bounds) are based on ideas from communication complexity.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/kane19a.html
  PDF: http://proceedings.mlr.press/v99/kane19a/kane19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-kane19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Daniel
    family: Kane
  - given: Roi
    family: Livni
  - given: Shay
    family: Moran
  - given: Amir
    family: Yehudayoff
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1903-1943
  id: kane19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1903
  lastpage: 1943
  published: 2019-06-25 00:00:00 +0000
- title: 'Non-asymptotic Analysis of Biased Stochastic Approximation Scheme'
  abstract: 'Stochastic approximation (SA) is a key method used in statistical learning. Recently, its non-asymptotic convergence analysis has been considered in many papers. However, most of the prior analyses are made under restrictive assumptions such as unbiased gradient estimates and convex objective function, which significantly limit their applications to sophisticated tasks such as online and reinforcement learning. These restrictions are all essentially relaxed in this work. In particular, we analyze a general SA scheme to minimize a non-convex, smooth objective function. We consider update procedure whose drift term depends on a state-dependent Markov chain and the mean field is not necessarily of gradient type, covering approximate second-order method and allowing asymptotic bias for the one-step updates. We illustrate these settings with the online EM algorithm and the policy-gradient method for average reward maximization in reinforcement learning.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/karimi19a.html
  PDF: http://proceedings.mlr.press/v99/karimi19a/karimi19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-karimi19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Belhal
    family: Karimi
  - given: Blazej
    family: Miasojedow
  - given: Eric
    family: Moulines
  - given: Hoi-To
    family: Wai
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1944-1974
  id: karimi19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1944
  lastpage: 1974
  published: 2019-06-25 00:00:00 +0000
- title: 'Discrepancy, Coresets, and Sketches in Machine Learning'
  abstract: 'This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy of logistic regression, sigmoid activation loss, matrix covariance, kernel density and any analytic function of the dot product or the squared distance. Our result resolves a long-standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple deterministic algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between class discrepancy, coreset complexity, learnability, and streaming algorithms.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/karnin19a.html
  PDF: http://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-karnin19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zohar
    family: Karnin
  - given: Edo
    family: Liberty
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1975-1993
  id: karnin19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1975
  lastpage: 1993
  published: 2019-06-25 00:00:00 +0000
- title: 'Bandit Principal Component Analysis'
  abstract: 'We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment’s choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of $\Omega(d\sqrt{T})$.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/kotlowski19a.html
  PDF: http://proceedings.mlr.press/v99/kotlowski19a/kotlowski19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-kotlowski19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Wojciech
    family: Kotłowski
  - given: Gergely
    family: Neu
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 1994-2024
  id: kotlowski19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 1994
  lastpage: 2024
  published: 2019-06-25 00:00:00 +0000
- title: 'Contextual bandits with continuous actions: Smoothing, zooming, and adapting'
  abstract: 'We study contextual bandit learning for any competitor policy class and continuous action space.  We obtain two qualitatively different regret bounds: one competes with a smoothed version of the policy class under no continuity assumptions, while the other requires standard Lipschitz assumptions. Both bounds exhibit data-dependent “zooming" behavior and, with no tuning, yield improved guarantees for benign problems. We also study adapting to unknown smoothness parameters, establishing a price-of-adaptivity and deriving optimal adaptive algorithms that require no additional information.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/krishnamurthy19a.html
  PDF: http://proceedings.mlr.press/v99/krishnamurthy19a/krishnamurthy19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-krishnamurthy19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Akshay
    family: Krishnamurthy
  - given: John
    family: Langford
  - given: Aleksandrs
    family: Slivkins
  - given: Chicheng
    family: Zhang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2025-2027
  id: krishnamurthy19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2025
  lastpage: 2027
  published: 2019-06-25 00:00:00 +0000
- title: 'Distribution-Dependent Analysis of Gibbs-ERM Principle'
  abstract: 'Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as SGLD and —to some extent— SGD), while it also arises in other contexts, including PAC-Bayesian theory, and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal to understand the interplay between the data-generating distribution and learning in large hypothesis spaces. Our main results are \emph{distribution-dependent} upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the \emph{effective dimension} $\text{tr}\left(\boldsymbol{H}^{\star} (\boldsymbol{H}^{\star} + \lambda \boldsymbol{I})^{-1}\right)$ of the problem, where $\boldsymbol{H}^{\star}$ is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. 2017, and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the \emph{localized} excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the \emph{global} excess risk, by characterizing probabilities of local minima (and their complement) under Gibbs densities, a results which might be of independent interest.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/kuzborskij19a.html
  PDF: http://proceedings.mlr.press/v99/kuzborskij19a/kuzborskij19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-kuzborskij19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilja
    family: Kuzborskij
  - given: Nicolò
    family: Cesa-Bianchi
  - given: Csaba
    family: Szepesvári
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2028-2054
  id: kuzborskij19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2028
  lastpage: 2054
  published: 2019-06-25 00:00:00 +0000
- title: 'Global Convergence of the EM Algorithm for Mixtures of Two Component Linear Regression'
  abstract: 'The Expectation-Maximization algorithm is perhaps the most broadly used algorithm for inference of latent variable problems. A theoretical understanding of its performance, however, largely remains lacking. Recent results established that EM enjoys global convergence for Gaussian Mixture Models. For Mixed Linear Regression, however, only local convergence results have been established, and those only for the high SNR regime. We show here that EM converges for mixed linear regression with two components (it is known that it may fail to converge for three or more), and moreover that this convergence holds for random initialization. Our analysis reveals that EM exhibits very different behavior in Mixed Linear Regression from its behavior in Gaussian Mixture Models, and hence our proofs require the development of several new ideas.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/kwon19a.html
  PDF: http://proceedings.mlr.press/v99/kwon19a/kwon19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-kwon19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jeongyeol
    family: Kwon
  - given: Wei
    family: Qian
  - given: Constantine
    family: Caramanis
  - given: Yudong
    family: Chen
  - given: Damek
    family: Davis
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2055-2110
  id: kwon19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2055
  lastpage: 2110
  published: 2019-06-25 00:00:00 +0000
- title: 'An Information-Theoretic Approach to Minimax Regret in Partial Monitoring'
  abstract: ' We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under finite-action partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then generalise the information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian regret bounds and combine them with the minimax theorem to derive minimax regret bounds for various partial monitoring settings. The highlight is a clean analysis of ‘easy’ and ‘hard’ finite partial monitoring, with new regret bounds that are independent of arbitrarily large game-dependent constants and eliminate the logarithmic dependence on the horizon for easy games that appeared in earlier work. The power of the generalised machinery is further demonstrated by proving that the minimax regret for $k$-armed adversarial bandits is at most $\sqrt{2kn}$, improving on existing results by a factor of 2. Finally, we provide a simple analysis of the cops and robbers game, also improving best known constants. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/lattimore19a.html
  PDF: http://proceedings.mlr.press/v99/lattimore19a/lattimore19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-lattimore19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tor
    family: Lattimore
  - given: Csaba
    family: Szepesvári
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2111-2139
  id: lattimore19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2111
  lastpage: 2139
  published: 2019-06-25 00:00:00 +0000
- title: 'Solving Empirical Risk Minimization in the Current Matrix Multiplication Time'
  abstract: 'Many convex problems in machine learning and computer science share the same form:  \begin{align*} \min_{x} \sum_{i} f_i( A_i x + b_i), \end{align*} where $f_i$ are convex functions on $\R^{n_i}$ with constant $n_i$, $A_i \in \R^{n_i \times d}$, $b_i \in \R^{n_i}$ and $\sum_i n_i = n$. This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time \begin{align*} O^* (  ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} ) \log (n / \delta)  ) \end{align*} where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy.  Note that the runtime has only a log dependence on the condition numbers or other data dependent parameters and these are captured in $\delta$.  For the current bound $\omega \sim 2.38$ [Vassilevska Williams’12, Le Gall’14] and $\alpha \sim 0.31$ [Le Gall, Urrutia’18], our runtime $O^* ( n^{\omega} \log (n / \delta))$ matches the current best for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman’18] proved that all the current known techniques can not give a better $\omega$ below $2.168$ which is larger than our $2+1/6$. Our result generalizes the very recent result of solving linear programs in the  current matrix multiplication time [Cohen, Lee, Song’19] to a more broad class of problems. Our algorithm proposes two concepts which are different from [Cohen, Lee, Song’19] :\ $\bullet$ We give a robust deterministic central path method, whereas the previous one is a stochastic central path which updates weights by a random sparse vector. \ $\bullet$ We propose an efficient data-structure to maintain the central path of interior point methods even when the weights update vector is dense. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/lee19a.html
  PDF: http://proceedings.mlr.press/v99/lee19a/lee19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-lee19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yin Tat
    family: Lee
  - given: Zhao
    family: Song
  - given: Qiuyi
    family: Zhang
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2140-2157
  id: lee19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2140
  lastpage: 2157
  published: 2019-06-25 00:00:00 +0000
- title: 'On Mean Estimation for General Norms with Statistical Queries'
  abstract: 'We study the problem of mean estimation for high-dimensional distributions given access to a statistical query oracle. For a normed space $X = (\mathbb{R}^d, \|\cdot\|_X)$ and a distribution supported on vectors $x \in \mathbb{R}^d$ with $\|x\|_{X} \leq 1$, the task is to output an estimate $\hat{\mu} \in \mathbb{R}^d$ which is $\varepsilon$-close in the distance induced by $\|\cdot\|_X$ to the true mean of the distribution. We obtain sharp upper and lower bounds for the statistical query complexity of this problem when the the underlying norm is \emph{symmetric} as well as for Schatten-$p$ norms, answering two questions raised by Feldman, Guzmán, and Vempala (SODA 2017).'
  volume: 99
  URL: https://proceedings.mlr.press/v99/li19a.html
  PDF: http://proceedings.mlr.press/v99/li19a/li19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-li19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jerry
    family: Li
  - given: Aleksandar
    family: Nikolov
  - given: Ilya
    family: Razenshteyn
  - given: Erik
    family: Waingarten
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2158-2172
  id: li19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2158
  lastpage: 2172
  published: 2019-06-25 00:00:00 +0000
- title: 'Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits'
  abstract: 'We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$,  the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we  (1) show that the minimax expected regret is $\Omega(\sqrt{dT \log T \log n})$ for every algorithm, and  (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors.  Our algorithmic result saves two $\sqrt{\log T}$ factors from previous analysis, and our information-theoretical lower bound also improves previous results by one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from classical multi-armed bandits in which no logarithmic $T$ term is present in minimax regret. Our proof techniques include variable confidence levels and a careful analysis of layer sizes of SupLinUCB on the upper bound side,  and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower bound side. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/li19b.html
  PDF: http://proceedings.mlr.press/v99/li19b/li19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-li19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yingkai
    family: Li
  - given: Yining
    family: Wang
  - given: Yuan
    family: Zhou
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2173-2174
  id: li19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2173
  lastpage: 2174
  published: 2019-06-25 00:00:00 +0000
- title: 'Sharp Theoretical Analysis for Nonparametric Testing under Random Projection'
  abstract: 'A common challenge in nonparametric inference is its high computational complexity when data volume is large. In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. In the specific kernel ridge regression setup, a simple distance-based test statistic is proposed. Notably, we derive the minimum number of random projections that is sufficient for achieving testing optimality in terms of the minimax rate. As a by-product, the lower bound of projection dimension for minimax optimal estimation derived in Yang (2017) is proven to be sharp. One technical contribution is to establish upper bounds for a range of tail sums of empirical kernel eigenvalues. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/liu19a.html
  PDF: http://proceedings.mlr.press/v99/liu19a/liu19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-liu19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Meimei
    family: Liu
  - given: Zuofeng
    family: Shang
  - given: Guang
    family: Cheng
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2175-2209
  id: liu19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2175
  lastpage: 2209
  published: 2019-06-25 00:00:00 +0000
- title: 'Combinatorial Algorithms for Optimal Design'
  abstract: 'In an optimal design problem, we are given a set of linear experiments $v_1,…,v_n\in \mathbb{R}^d$ and $k \geq d$, and our goal is to select a set or a multiset $S \subseteq [n]$ of size $k$ such that $\Phi((\sum_{i \in S} v_i v_i^\top )^{-1})$ is minimized. When $\Phi(M) = Determinant(M)^{1/d}$, the problem is known as the D-optimal design problem, and when $\Phi(M) = Trace(M)$, it is known as the A-optimal design problem. One of the most common heuristics used in practice to solve these problems is the local search heuristic, also known as the Fedorov’s exchange method (Fedorov, 1972). This is due to its simplicity and its empirical performance (Cook and Nachtrheim, 1980; Miller and Nguyen, 1994; Atkinson et al., 2007).  However, despite its wide usage no theoretical bound has been proven for this algorithm. In this paper, we bridge this gap and prove approximation guarantees for the local search algorithms for D-optimal design and A-optimal design problems. We show that the local search algorithms are asymptotically optimal when $\frac{k}{d}$ is large. In addition to this, we also prove similar approximation guarantees for the greedy algorithms for D-optimal design and A-optimal design problems when $\frac{k}{d}$ is large.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/madan19a.html
  PDF: http://proceedings.mlr.press/v99/madan19a/madan19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-madan19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vivek
    family: Madan
  - given: Mohit
    family: Singh
  - given: Uthaipon
    family: Tantipongpipat
  - given: Weijun
    family: Xie
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2210-2258
  id: madan19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2210
  lastpage: 2258
  published: 2019-06-25 00:00:00 +0000
- title: 'Nonconvex sampling with the Metropolis-adjusted Langevin algorithm'
  abstract: 'The Langevin Markov chain algorithms are widely deployed methods to sample from distributions in challenging high-dimensional and non-convex statistics and machine learning applications. Despite this, current bounds for the Langevin algorithms are worse than those of competing algorithms in many important situations, for instance when sampling from weakly log-concave distributions, or when sampling or optimizing non-convex log-densities. We obtain improved bounds in many of these situations, showing that the Metropolis-adjusted Langevin algorithm (MALA) is faster than the best bounds for its competitor algorithms when the target distribution satisfies weak third- and fourth- order regularity properties associated with the input data. In many settings, our regularity conditions are weaker than the usual Euclidean operator norm regularity properties, allowing us to show faster bounds for a much larger class of distributions than would be possible with the usual Euclidean operator norm approach,  including in statistics and machine learning applications where the data satisfy a certain incoherence condition. In particular, we show that using our regularity conditions one can obtain faster bounds for applications which include sampling problems in Bayesian logistic regression with weakly convex priors, and the nonconvex optimization problem of learning linear classifiers with zero-one loss functions. Our main technical contribution is an analysis of the Metropolis acceptance probability of MALA in terms of its “energy-conservation error," and a bound for this error in terms of third- and fourth- order regularity conditions. The combination of this higher-order analysis of the energy conservation error with the conductance method is key to obtaining bounds which have a sub-linear dependence on the dimension $d$ in the non-strongly logconcave setting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/mangoubi19a.html
  PDF: http://proceedings.mlr.press/v99/mangoubi19a/mangoubi19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-mangoubi19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Oren
    family: Mangoubi
  - given: Nisheeth K
    family: Vishnoi
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2259-2293
  id: mangoubi19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2259
  lastpage: 2293
  published: 2019-06-25 00:00:00 +0000
- title: 'Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance'
  abstract: 'We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/marteau-ferey19a.html
  PDF: http://proceedings.mlr.press/v99/marteau-ferey19a/marteau-ferey19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-marteau-ferey19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ulysse
    family: Marteau-Ferey
  - given: Dmitrii
    family: Ostrovskii
  - given: Francis
    family: Bach
  - given: Alessandro
    family: Rudi
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2294-2340
  id: marteau-ferey19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2294
  lastpage: 2340
  published: 2019-06-25 00:00:00 +0000
- title: 'Planting trees in graphs, and finding them back'
  abstract: 'In this paper we study the two inference problems of detection and reconstruction in the context of planted structures in sparse Erdős-Rényi random graphs $\mathcal G(n,\lambda/n)$ with fixed average degree $\lambda>0$. Motivated by a problem of communication security, we focus on the case where the planted structure consists in the addition of a tree graph.  In the case of planted line graphs, we establish the following phase diagram for detection and reconstruction. In a low density region where the average degree $\lambda$ of the original graph is below some critical value $\lambda_c=1$, both detection and reconstruction go from  impossible to easy as the line length $K$ crosses some critical value $K^*=\ln(n)/\ln(1/\lambda)$, where $n$ is the number of nodes in the graph. In a high density region where $\lambda>\lambda_c$, detection goes from impossible to easy as $K$ goes from $o(\sqrt{n})$ to $\omega(\sqrt{n})$. In contrast, reconstruction remains impossible so long as $K=o(n)$.  We then consider planted $D$-ary trees of varying depth $h$ and $2\le D\le O(1)$. For these we identify a low-density region $\lambda<\lambda_D$, where $\lambda_D$ is the threshold for emergence of the $D$-core in Erdős-Rényi random graphs $\mathcal G(n,\lambda/n)$ for which the following holds. There is a threshold $h*=g(D)\ln(\ln(n))$ with the following properties. Detection goes from impossible to feasible as $h$ crosses $h*$. Interestingly, we show that only partial reconstruction is feasible at best for $h\ge h*$. We conjecture a similar picture to hold for $D$-ary trees as for lines in the high-density region $\lambda>\lambda_D$, but confirm only the following part of this picture: Detection is easy for $D$-ary trees of size $\omega(\sqrt{n})$, while at best only partial reconstruction is feasible for $D$-ary trees of any size $o(n)$. These results provide a clear contrast with the corresponding picture for detection and reconstruction of {\em low rank} planted structures, such as dense subgraphs and block communities.  In the examples we study, there is i) an absence of hard phases for both detection and reconstruction, and ii) a discrepancy between detection and reconstruction, the latter being impossible for a wide range of parameters where detection is easy. The latter property does not hold for previously studied low rank planted structures.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/massoulie19a.html
  PDF: http://proceedings.mlr.press/v99/massoulie19a/massoulie19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-massoulie19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Laurent
    family: Massoulié
  - given: Ludovic
    family: Stephan
  - given: Don
    family: Towsley
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2341-2371
  id: massoulie19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2341
  lastpage: 2371
  published: 2019-06-25 00:00:00 +0000
- title: 'Uniform concentration and symmetrization for weak interactions'
  abstract: 'The method to derive uniform bounds with Gaussian and Rademacher complexities is extended to the case where the sample average is replaced by a nonlinear statistic. Tight bounds are obtained for U-statistics, smoothened L-statistics and error functionals of l2-regularized algorithms.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/maurer19a.html
  PDF: http://proceedings.mlr.press/v99/maurer19a/maurer19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-maurer19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Andreas
    family: Maurer
  - given: Massimiliano
    family: Pontil
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2372-2387
  id: maurer19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2372
  lastpage: 2387
  published: 2019-06-25 00:00:00 +0000
- title: 'Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit'
  abstract: 'We consider learning two layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in $\mathbb{R}^D$ (where $D$ is the number of parameters associated to each neuron). This evolution can be defined through a partial differential equation or, equivalently, as the gradient flow in the Wasserstein space of probability distributions. Earlier work shows that (under some regularity assumptions), the mean field description is accurate as soon as the number of hidden units is much larger than the dimension $D$. In this paper we establish stronger and more general approximation guarantees. First of all, we show that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimensions. Next, we generalize this analysis to the case of unbounded activation functions, which was not covered by earlier bounds.  We extend our results to noisy stochastic gradient descent. Finally, we show that kernel ridge regression can be recovered as a special limit of  the mean field analysis. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/mei19a.html
  PDF: http://proceedings.mlr.press/v99/mei19a/mei19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-mei19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Song
    family: Mei
  - given: Theodor
    family: Misiakiewicz
  - given: Andrea
    family: Montanari
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2388-2464
  id: mei19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2388
  lastpage: 2464
  published: 2019-06-25 00:00:00 +0000
- title: 'Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem'
  abstract: 'We consider the combinatorial multi-armed bandit (CMAB) problem, where the reward function is nonlinear. In this setting, the agent chooses a batch of arms on each round and receives feedback from each arm of the batch. The reward that the agent aims to maximize is a function of the selected arms and their expectations. In many applications, the reward function is highly nonlinear, and the performance of existing algorithms relies on a global Lipschitz constant to encapsulate the function’s nonlinearity. This may lead to loose regret bounds, since by itself, a large gradient does not necessarily cause a large regret, but only in regions where the uncertainty in the reward’s parameters is high.  To overcome this problem, we introduce a new smoothness criterion, which we term \emph{Gini-weighted smoothness}, that takes into account both the nonlinearity of the reward and concentration properties of the arms. We show that a linear dependence of the regret in the batch size in existing algorithms can be replaced by this smoothness parameter. This, in turn, leads to much tighter regret bounds when the smoothness parameter is batch-size independent. For example, in the probabilistic maximum coverage (PMC) problem, that has many applications, including influence maximization, diverse recommendations and more, we achieve dramatic improvements in the upper bounds. We also prove matching lower bounds for the PMC problem and show that our algorithm is tight, up to a logarithmic factor in the problem’s parameters.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/merlis19a.html
  PDF: http://proceedings.mlr.press/v99/merlis19a/merlis19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-merlis19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nadav
    family: Merlis
  - given: Shie
    family: Mannor
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2465-2489
  id: merlis19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2465
  lastpage: 2489
  published: 2019-06-25 00:00:00 +0000
- title: 'Lipschitz Adaptivity with Multiple Learning Rates in Online Learning'
  abstract: 'We aim to design adaptive online learning algorithms that take advantage of any special structure that might be present in the learning task at hand, with as little manual tuning by the user as possible. A fundamental obstacle that comes up in the design of such adaptive algorithms is to calibrate a so-called step-size or learning rate hyperparameter depending on variance, gradient norms, etc. A recent technique promises to overcome this difficulty by maintaining multiple learning rates in parallel. This technique has been applied in the MetaGrad algorithm for online convex optimization and the Squint algorithm for prediction with expert advice. However, in both cases the user still has to provide in advance a Lipschitz hyperparameter that bounds the norm of the gradients. Although this hyperparameter is typically not available in advance, tuning it correctly is crucial: if it is set too small, the methods may fail completely; but if it is taken too large, performance deteriorates significantly. In the present work we remove this Lipschitz hyperparameter by designing new versions of MetaGrad and Squint that adapt to its optimal value automatically. We achieve this by dynamically updating the set of active learning rates. For MetaGrad, we further improve the computational efficiency of handling constraints on the domain of prediction, and we remove the need to specify the number of rounds in advance.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/mhammedi19a.html
  PDF: http://proceedings.mlr.press/v99/mhammedi19a/mhammedi19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-mhammedi19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zakaria
    family: Mhammedi
  - given: Wouter M
    family: Koolen
  - given: Tim
    family: Van Erven
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2490-2511
  id: mhammedi19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2490
  lastpage: 2511
  published: 2019-06-25 00:00:00 +0000
- title: 'VC Classes are Adversarially Robustly Learnable, but Only Improperly'
  abstract: 'We study the question of learning an adversarially robust predictor. We show that any hypothesis class $\mathcal{H}$ with finite VC dimension is robustly PAC learnable with an \emph{improper} learning rule. The requirement of being improper is necessary as we exhibit examples of hypothesis classes $\mathcal{H}$ with finite VC dimension that are \emph{not} robustly PAC learnable with any \emph{proper} learning rule.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/montasser19a.html
  PDF: http://proceedings.mlr.press/v99/montasser19a/montasser19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-montasser19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Omar
    family: Montasser
  - given: Steve
    family: Hanneke
  - given: Nathan
    family: Srebro
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2512-2530
  id: montasser19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2512
  lastpage: 2530
  published: 2019-06-25 00:00:00 +0000
- title: 'Affine Invariant Covariance Estimation for Heavy-Tailed Distributions'
  abstract: 'In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. We prove that the proposed estimator $\widehat{\mathbf{S}}$ admits an \textit{affine-invariant} bound of the form \[ (1-\varepsilon) \mathbf{S} \preccurlyeq \widehat{\mathbf{S}} \preccurlyeq (1+\varepsilon) \mathbf{S} \]{in} high probability, where $\mathbf{S}$ is the unknown covariance matrix, and $\preccurlyeq$ is the positive semidefinite order on symmetric matrices. The result only requires the existence of fourth-order moments, and allows for $\varepsilon = O(\sqrt{\kappa^4 d\log(d/\delta)/n})$ where $\kappa^4$ is a measure of kurtosis of the distribution, $d$ is the dimensionality of the space, $n$ is the sample size, and $1-\delta$ is the desired confidence level. More generally, we can allow for regularization with level $\lambda$, then $d$ gets replaced with the degrees of freedom number. Denoting $\text{cond}(\mathbf{S})$ the condition number of $\mathbf{S}$, the computational cost of the novel estimator is $O(d^2 n + d^3\log(\text{cond}(\mathbf{S})))$, which is comparable to the cost of the sample covariance estimator in the statistically interesing regime $n \ge d$. We consider applications of our estimator to eigenvalue estimation with relative error, and to ridge regression with heavy-tailed random design.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/ostrovskii19a.html
  PDF: http://proceedings.mlr.press/v99/ostrovskii19a/ostrovskii19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-ostrovskii19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dmitrii M.
    family: Ostrovskii
  - given: Alessandro
    family: Rudi
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2531-2550
  id: ostrovskii19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2531
  lastpage: 2550
  published: 2019-06-25 00:00:00 +0000
- title: 'Stochastic Gradient Descent Learns State Equations with Nonlinear Activations'
  abstract: 'We study discrete time dynamical systems governed by the state equation $h_{t+1}=\phi(Ah_t+Bu_t)$. Here $A,B$ are weight matrices, $\phi$ is an activation function, and $u_t$ is the input data. This relation is the backbone of recurrent neural networks (e.g. LSTMs) which have broad applications in sequential learning tasks. We utilize stochastic gradient descent to learn the weight matrices from a finite input/state trajectory $\{u_t,h_t\}_{t=0}^N$. We prove that SGD estimate linearly converges to the ground truth weights while using near-optimal sample size. Our results apply to increasing activations whose derivatives are bounded away from zero. The analysis is based on i) a novel SGD convergence result with nonlinear activations and ii) careful statistical characterization of the state vector. Numerical experiments verify the fast convergence of SGD on ReLU and leaky ReLU in consistence with our theory.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/oymak19a.html
  PDF: http://proceedings.mlr.press/v99/oymak19a/oymak19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-oymak19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Samet
    family: Oymak
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2551-2579
  id: oymak19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2551
  lastpage: 2579
  published: 2019-06-25 00:00:00 +0000
- title: 'A Theory of Selective Prediction'
  abstract: 'We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence $x_1, \ldots, x_n$ of length $n$.  Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some $t < n$ and $m \le n - t$, after seeing $t$ observations we predict the average of $x_{t+1}, \ldots, x_{t+m}$. This particular problem was first studied in Drucker (2013) and referred to as the “density prediction game”. We show that the expected squared error of our prediction can be bounded by $O(\frac{1}{\log n})$ and prove a matching lower bound, which resolves an open question raised in Drucker (2013). This result holds for any sequence (that is not adaptive to when the prediction is made, or the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight several open directions for future work.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/qiao19a.html
  PDF: http://proceedings.mlr.press/v99/qiao19a/qiao19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-qiao19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mingda
    family: Qiao
  - given: Gregory
    family: Valiant
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2580-2594
  id: qiao19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2580
  lastpage: 2594
  published: 2019-06-25 00:00:00 +0000
- title: 'Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon'
  abstract: '  We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. The lower bound holds for any choice of kernel bandwidth, even if selected based on data. The result supports the empirical observation that minimum-norm interpolation (that is, exact fit to training data) in RKHS generalizes well for some high-dimensional datasets, but not for low-dimensional ones.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/rakhlin19a.html
  PDF: http://proceedings.mlr.press/v99/rakhlin19a/rakhlin19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-rakhlin19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alexander
    family: Rakhlin
  - given: Xiyu
    family: Zhai
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2595-2623
  id: rakhlin19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2595
  lastpage: 2623
  published: 2019-06-25 00:00:00 +0000
- title: 'Classification with unknown class-conditional label noise on non-compact feature spaces'
  abstract: 'We investigate the problem of classification in the presence of unknown class-conditional label noise in which the labels observed by the learner have been corrupted with some unknown class dependent probability. In order to obtain finite sample rates, previous approaches to classification with unknown class-conditional label noise have required that the regression function is close to its extrema on sets of large measure. We shall consider this problem in the setting of non-compact metric spaces, where the regression function need not attain its extrema. In this setting we determine the minimax optimal learning rates (up to logarithmic factors). The rate displays interesting threshold behaviour: When the regression function approaches its extrema at a sufficient rate, the optimal learning rates are of the same order as those obtained in the label-noise free setting. If the regression function approaches its extrema more gradually then classification performance necessarily degrades. In addition, we present an adaptive algorithm which attains these rates without prior knowledge of either the distributional parameters or the local density. This identifies for the first time a scenario in which finite sample rates are achievable in the label noise setting, but they differ from the optimal rates without label noise. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/reeve19a.html
  PDF: http://proceedings.mlr.press/v99/reeve19a/reeve19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-reeve19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Henry
    family: Reeve
  - given: 
    family: Kabán
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2624-2651
  id: reeve19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2624
  lastpage: 2651
  published: 2019-06-25 00:00:00 +0000
- title: 'The All-or-Nothing Phenomenon in Sparse Linear Regression'
  abstract: 'We study the problem of recovering a hidden binary $k$-sparse $p$-dimensional vector $\beta$ from $n$ noisy linear observations $Y=X\beta+W$ where $X_{ij}$ are i.i.d.  $\mathcal{N}(0,1)$ and $W_i$ are i.i.d.  $\mathcal{N}(0,\sigma^2)$. A closely related  hypothesis testing problem is to distinguish the pair $(X,Y)$ generated from this structured model from a corresponding null model where $(X,Y)$ consist of purely independent Gaussian entries. In the low sparsity $k=o(p)$ and high signal to noise ratio $k/\sigma^2=\Omega\left(1\right)$ regime, we establish an “All-or-Nothing” information-theoretic phase transition at a critical sample size $n^*=2 k\log \left(p/k\right) /\log \left(1+k/\sigma^2\right)$, resolving a conjecture of [GamarnikZadik17]. Specifically, we show that if $\liminf_{p\rightarrow \infty} n/n^*>1$, then the maximum likelihood estimator almost perfectly recovers the hidden vector with high probability and moreover the true hypothesis can be detected with a vanishing error probability. Conversely, if $\limsup_{p\rightarrow \infty} n/n^*<1$, then it becomes information-theoretically impossible even to  recover an arbitrarily small but fixed fraction of the hidden vector support, or to test hypotheses strictly better than random guess. Our proof of the impossibility result builds upon two key techniques, which could be of independent interest. First, we use a conditional second moment method to upper bound the Kullback-Leibler (KL) divergence between the structured and the null model. Second, inspired by the celebrated area theorem, we establish a lower bound to the minimum mean squared estimation error of the hidden vector in terms of the KL divergence between the two models.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/reeves19a.html
  PDF: http://proceedings.mlr.press/v99/reeves19a/reeves19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-reeves19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Galen
    family: Reeves
  - given: Jiaming
    family: Xu
  - given: Ilias
    family: Zadik
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2652-2663
  id: reeves19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2652
  lastpage: 2663
  published: 2019-06-25 00:00:00 +0000
- title: 'Depth Separations in Neural Networks: What is Actually Being Separated?'
  abstract: 'Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $\mathcal{O}(1)$-Lipschitz functions only when the target accuracy $\epsilon$ is at most $\text{poly}(1/d)$). In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it \emph{is} possible to approximate $\mathcal{O}(1)$-Lipschitz radial functions with depth $2$, size $\text{poly}(d)$ networks, for every constant $\epsilon$. We complement it by showing that approximating such functions is also possible with depth $2$, size $\text{poly}(1/\epsilon)$ networks, for every constant $d$. Finally, we show that it is not possible to have polynomial dependence in both $d,1/\epsilon$ simultaneously. Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$-Lipschitz functions with constant accuracy – if at all possible – one would need fundamentally different techniques than existing ones in the literature.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/safran19a.html
  PDF: http://proceedings.mlr.press/v99/safran19a/safran19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-safran19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Itay
    family: Safran
  - given: Ronen
    family: Eldan
  - given: Ohad
    family: Shamir
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2664-2666
  id: safran19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2664
  lastpage: 2666
  published: 2019-06-25 00:00:00 +0000
- title: 'How do infinite width bounded norm networks look in function space?'
  abstract: 'We consider the question of what functions can be captured by ReLU networks with an unbounded number of units (infinite width), but where the overall network Euclidean norm (sum of squares of all weights in the system, except for an unregularized bias term for each unit) is bounded; or equivalently what is the minimal norm required to approximate a given function. For functions $f:\mathbb R \rightarrow\mathbb R$ and a single hidden layer, we show that the minimal network norm for representing $f$ is $\max(\int \lvert f”(x) \rvert \mathrm{d} x, \lvert  f’(-\infty) + f’(+\infty) \rvert)$, and hence the minimal norm fit for a sample is given by a linear spline interpolation.  '
  volume: 99
  URL: https://proceedings.mlr.press/v99/savarese19a.html
  PDF: http://proceedings.mlr.press/v99/savarese19a/savarese19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-savarese19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Pedro
    family: Savarese
  - given: Itay
    family: Evron
  - given: Daniel
    family: Soudry
  - given: Nathan
    family: Srebro
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2667-2690
  id: savarese19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2667
  lastpage: 2690
  published: 2019-06-25 00:00:00 +0000
- title: 'Exponential Convergence Time of Gradient Descent for    One-Dimensional Deep Linear Neural Networks'
  abstract: 'We study the dynamics of gradient descent on objective  functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar  parameters $w_1,\ldots,w_k$), which arise in the context of  training depth-$k$ linear neural networks. We prove that for standard  random initializations, and under mild assumptions on $f$, the number of  iterations required for convergence scales exponentially with the depth  $k$. We also show empirically that this phenomenon can occur in higher  dimensions, where each $w_i$ is a matrix. This highlights a potential  obstacle in understanding the convergence of gradient-based methods for  deep linear neural networks, where $k$ is large.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/shamir19a.html
  PDF: http://proceedings.mlr.press/v99/shamir19a/shamir19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-shamir19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ohad
    family: Shamir
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2691-2713
  id: shamir19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2691
  lastpage: 2713
  published: 2019-06-25 00:00:00 +0000
- title: 'Learning Linear Dynamical Systems with Semi-Parametric Least Squares'
  abstract: 'We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, \emph{semi-parametric} noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies.  % We then demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not not incur a worst-case dependence on the rate at which these dependencies decay.  % The algorithm is provably consistent even for systems which satisfy the weaker \emph{marginal stability} condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/simchowitz19a.html
  PDF: http://proceedings.mlr.press/v99/simchowitz19a/simchowitz19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-simchowitz19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Max
    family: Simchowitz
  - given: Ross
    family: Boczar
  - given: Benjamin
    family: Recht
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2714-2802
  id: simchowitz19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2714
  lastpage: 2802
  published: 2019-06-25 00:00:00 +0000
- title: 'Finite-Time Error Bounds For Linear Stochastic Approximation andTD Learning'
  abstract: 'We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., deviation of the output of the algorithm from the equilibrium point of an associated ordinary differential equation (ODE). We obtain finite-time bounds on the mean-square error in the case of constant step-size algorithms by considering the drift of an appropriately chosen Lyapunov function. The Lyapunov function can be interpreted either in terms of Stein’s method to obtain bounds on steady-state performance or in terms of Lyapunov stability theory for linear ODEs. We also provide a comprehensive treatment of the moments of the square of the 2-norm of the approximation error. Our analysis yields the following results: (i) for a given step-size, we show that the lower-order moments can be made small as a function of the step-size and can be upper-bounded by the moments of a Gaussian random variable; (ii) we show that the higher-order moments beyond a threshold may be infinite in steady-state; and (iii) we characterize the number of samples needed for the finite-time bounds to be of the same order as the steady-state bounds. As a by-product of our analysis, we also solve the open problem of obtaining finite-time bounds for the performance of temporal difference learning algorithms with linear function approximation and a constant step-size, without requiring a projection step or an i.i.d. noise assumption.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/srikant19a.html
  PDF: http://proceedings.mlr.press/v99/srikant19a/srikant19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-srikant19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: R.
    family: Srikant
  - given: Lei
    family: Ying
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2803-2830
  id: srikant19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2803
  lastpage: 2830
  published: 2019-06-25 00:00:00 +0000
- title: 'Robustness of Spectral Methods for Community Detection'
  abstract: 'The present work is concerned with community detection. Specifically, we consider a random graph drawn according to the stochastic block model: its vertex set is partitioned into blocks, or communities, and edges are placed randomly and independently of each other with probability depending only on the communities of their two endpoints. In this context, our aim is to recover the community labels better than by random guess, based only on the observation of the graph. In the sparse case, where edge probabilities are in $O(1/n)$, we introduce a new spectral method based on the distance matrix $D^{(\ell)}$, where $D^{(\ell)}_{ij} = 1$ iff the graph distance between $i$ and $j$, noted $d(i, j)$ is equal to $\ell$. We show that when $\ell \sim c\log(n)$ for carefully chosen $c$, the eigenvectors associated to the largest eigenvalues of $D^{(\ell)}$ provide enough information to perform non-trivial community recovery with high probability, provided we are above the so-called Kesten-Stigum threshold. This yields an efficient algorithm for community detection, since computation of the matrix $D^{(\ell)}$ can be done in $O(n^{1+\kappa})$ operations for a small constant $\kappa$.  We then study the sensitivity of the eigendecomposition of $D^{(\ell)}$ when we allow an adversarial perturbation of the edges of $G$. We show that when the considered perturbation does not affect more than $O(n^\varepsilon)$ vertices for some small $\varepsilon > 0$, the highest eigenvalues and their corresponding eigenvectors incur negligible perturbations, which allows us to still perform efficient recovery. Our proposed spectral method therefore: i) is robust to larger perturbations than prior spectral methods,  while semi-definite programming (or SDP) methods can tolerate yet larger perturbations; ii) achieves  non-trivial detection down to the KS threshold, which is conjectured to be optimal and is beyond reach of  existing SDP approaches; iii) is faster than SDP approaches.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/stephan19a.html
  PDF: http://proceedings.mlr.press/v99/stephan19a/stephan19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-stephan19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ludovic
    family: Stephan
  - given: Laurent
    family: Massoulié
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2831-2860
  id: stephan19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2831
  lastpage: 2860
  published: 2019-06-25 00:00:00 +0000
- title: 'Maximum Entropy Distributions: Bit Complexity and Stability'
  abstract: 'Maximum entropy distributions with discrete support in $m$ dimensions  arise in machine learning, statistics, information theory, and theoretical computer science. While structural and computational properties of max-entropy distributions have been extensively studied,  basic questions such as: Do max-entropy distributions over a large support  (e.g., $2^m$) with a specified marginal vector have succinct descriptions (polynomial-size in the input description)? and: Are entropy maximizing distributions “stable” under the perturbation of the marginal vector? have resisted a rigorous resolution. Here we show that these questions are related and resolve both of them. Our main result shows  a ${\rm poly}(m, \log 1/\varepsilon)$ bound on the bit complexity of $\varepsilon$-optimal dual solutions to the maximum entropy convex program – for very general support sets and with no restriction on the marginal vector. Applications of this result include polynomial time algorithms to compute  max-entropy distributions over several new and old polytopes for any marginal vector in a unified manner, a  polynomial time algorithm to compute the Brascamp-Lieb constant in the rank-1 case. The proof of this result allows us to show that changing the marginal vector by $\delta$ changes the max-entropy distribution in the total variation distance roughly by a factor of ${\rm poly}(m, \log 1/\delta)\sqrt{\delta}$ – even when the size of the support set is exponential. Together, our results put max-entropy distributions on a mathematically sound footing –  these distributions are  robust and computationally feasible models for data.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/straszak19a.html
  PDF: http://proceedings.mlr.press/v99/straszak19a/straszak19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-straszak19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Damian
    family: Straszak
  - given: Nisheeth K.
    family: Vishnoi
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2861-2891
  id: straszak19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2861
  lastpage: 2891
  published: 2019-06-25 00:00:00 +0000
- title: 'Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression'
  abstract: 'We study the problem of robust linear regression with response variable corruptions. We consider the oblivious adversary model, where the adversary corrupts a fraction of the responses in complete ignorance of the data. We provide a nearly linear time estimator which consistently estimates the true regression vector, even with $1-o(1)$ fraction of corruptions. Existing results in this setting either don’t guarantee consistent estimates or can only handle a small fraction of corruptions. We also extend our estimator to robust sparse linear regression and show that similar guarantees hold in this setting. Finally, we apply our estimator to the problem of linear regression with heavy-tailed noise and show that our estimator consistently estimates the regression vector even when the noise has unbounded variance (e.g., Cauchy distribution), for which most existing results don’t even apply. Our estimator is based on a novel variant of outlier removal via hard thresholding in which the threshold is chosen adaptively and crucially relies on randomness to escape bad fixed points of the non-convex hard thresholding operation.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/suggala19a.html
  PDF: http://proceedings.mlr.press/v99/suggala19a/suggala19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-suggala19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Arun Sai
    family: Suggala
  - given: Kush
    family: Bhatia
  - given: Pradeep
    family: Ravikumar
  - given: Prateek
    family: Jain
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2892-2897
  id: suggala19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2892
  lastpage: 2897
  published: 2019-06-25 00:00:00 +0000
- title: 'Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches'
  abstract: 'We study the sample complexity of model-based reinforcement learning (henceforth RL) in general contextual decision processes that require strategic exploration to find a near-optimal policy. We design new algorithms for RL with a generic model class and analyze their statistical properties. Our algorithms have sample complexity governed by a new structural parameter called the witness rank, which we show to be small in several settings of interest, including factored MDPs. We also show that the witness rank is never larger than the recently proposed Bellman rank parameter governing the sample complexity of the model-free algorithm OLIVE (Jiang et al., 2017), the only other provably sample-efficient algorithm for global exploration at this level of generality. Focusing on the special case of factored MDPs, we prove an exponential lower bound for a general class of model-free approaches, including OLIVE, which, when combined with our algorithmic results, demonstrates exponential separation between model-based and model-free RL in some rich-observation settings.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/sun19a.html
  PDF: http://proceedings.mlr.press/v99/sun19a/sun19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-sun19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Wen
    family: Sun
  - given: Nan
    family: Jiang
  - given: Akshay
    family: Krishnamurthy
  - given: Alekh
    family: Agarwal
  - given: John
    family: Langford
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2898-2933
  id: sun19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2898
  lastpage: 2933
  published: 2019-06-25 00:00:00 +0000
- title: 'Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions'
  abstract: 'We provide a novel computer-assisted technique for systematically analyzing first-order methods for optimization. In contrast with previous works, the approach is particularly suited for handling sublinear convergence rates and stochastic oracles. The technique relies on semidefinite programming and potential functions. It allows simultaneously obtaining worst-case guarantees on the behavior of those algorithms, and assisting in choosing appropriate parameters for tuning their worst-case performances. The technique also benefits from comfortable tightness guarantees, meaning that unsatisfactory results can be improved only by changing the setting. We use the approach for analyzing deterministic and stochastic first-order methods under different assumptions on the nature of the stochastic noise. Among others, we treat unstructured noise with bounded variance, different noise models arising in over-parametrized expectation minimization problems, and randomized block-coordinate descent schemes.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/taylor19a.html
  PDF: http://proceedings.mlr.press/v99/taylor19a/taylor19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-taylor19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Adrien
    family: Taylor
  - given: Francis
    family: Bach
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2934-2992
  id: taylor19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2934
  lastpage: 2992
  published: 2019-06-25 00:00:00 +0000
- title: 'The Relative Complexity of Maximum Likelihood Estimation, MAP Estimation, and Sampling'
  abstract: 'We prove that, for a broad range of problems, maximum-a-posteriori (MAP) estimation and approximate sampling of the posterior are at least as computationally difficult as maximum-likelihood (ML) estimation. By way of illustration, we show how hardness results for ML estimation of mixtures of Gaussians and topic models carry over to MAP estimation and approximate sampling under commonly used priors.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/tosh19a.html
  PDF: http://proceedings.mlr.press/v99/tosh19a/tosh19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-tosh19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Christopher
    family: Tosh
  - given: Sanjoy
    family: Dasgupta
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 2993-3035
  id: tosh19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 2993
  lastpage: 3035
  published: 2019-06-25 00:00:00 +0000
- title: 'The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint'
  abstract: 'The effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL).  Motivated by recent empirical success of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR).  We show that for policy evaluation, a simple model-based plugin method requires asymptotically less samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of state dimension.  For policy optimization, we study a simple family of problem instances and show that nominal (certainty equivalence principle) control also requires several factors of state and input dimension fewer samples than the policy gradient method to reach the same level of control performance on these instances.  Furthermore, the gap persists even when employing baselines commonly used in practice.  To the best of our knowledge, this is the first theoretical result which demonstrates a separation in the sample complexity between model-based and model-free methods on a continuous control task.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/tu19a.html
  PDF: http://proceedings.mlr.press/v99/tu19a/tu19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-tu19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Stephen
    family: Tu
  - given: Benjamin
    family: Recht
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3036-3083
  id: tu19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3036
  lastpage: 3083
  published: 2019-06-25 00:00:00 +0000
- title: 'Theoretical guarantees for sampling and inference in generative models with latent diffusions'
  abstract: 'We introduce and study a class of probabilistic generative models, where the latent object is a finite-dimensional diffusion process on a finite time interval and the observed variable is drawn conditionally on the terminal point of the diffusion. We make the following contributions: We provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control. We quantify the expressiveness of diffusion-based generative models. Specifically, we show that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback–Leibler divergence to the target distribution. Finally, we present and analyze a scheme for unbiased simulation of generative models with latent diffusions and provide bounds on the variance of the resulting estimators. This scheme can be implemented as a deep generative model with a random number of layers. '
  volume: 99
  URL: https://proceedings.mlr.press/v99/tzen19a.html
  PDF: http://proceedings.mlr.press/v99/tzen19a/tzen19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-tzen19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Belinda
    family: Tzen
  - given: Maxim
    family: Raginsky
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3084-3114
  id: tzen19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3084
  lastpage: 3114
  published: 2019-06-25 00:00:00 +0000
- title: 'Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds'
  abstract: 'We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs.  %by training a neural network with a single hidden layer of nonlinear gates. We give an agnostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in $2$-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, for any $k$, the size of the network and number of iterations needed are both bounded by $n^{O(k)}\log(1/\epsilon)$. The core of our analysis is the following existence theorem, which is of independent interest:  for any $\epsilon > 0$, any bounded function that has a degree $k$ polynomial approximation with error $\epsilon_0$ (in $2$-norm), can be approximated to within error $\epsilon_0 + \epsilon$ as a linear combination of $n^{O(k)}\cdot \mbox{poly}(1/\epsilon)$ {\em randomly chosen} gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to $k$. In particular, this applies to training networks  of unbiased sigmoids and ReLUs.  We also rigorously explain the empirical finding that gradient descent discovers lower frequency Fourier components before higher frequency components.  We complement this result with nearly matching lower bounds in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function with queries of tolerance some inverse polynomial in the input dimensionality $n$ must use $n^{\Omega(k)}$ queries even when the target functions are restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input distribution is uniform over the unit sphere; for this class the information-theoretic lower bound is only $\Theta(k \log n)$.  Our approach for both parts is based on spherical harmonics. We view gradient descent as an operator on the space of functions, and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/vempala19a.html
  PDF: http://proceedings.mlr.press/v99/vempala19a/vempala19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-vempala19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Santosh
    family: Vempala
  - given: John
    family: Wilmes
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3115-3117
  id: vempala19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3115
  lastpage: 3117
  published: 2019-06-25 00:00:00 +0000
- title: 'Estimation of smooth densities in Wasserstein distance'
  abstract: 'The Wasserstein distances are a set of metrics on probability distributions supported on $\mathbb{R}^d$ with applications throughout statistics and machine learning. Often, such distances are used in the context of variational problems, in which the statistician employs in place of an unknown measure a proxy constructed on the basis of independent samples. This raises the basic question of how well measures can be approximated in Wasserstein distance. While it is known that an empirical measure comprising i.i.d. samples is rate-optimal for general measures, no improved results were known for measures possessing smooth densities. We prove the first minimax rates for estimation of smooth densities for general Wasserstein distances, thereby showing how the curse of dimensionality can be alleviated for sufficiently regular measures. We also show how to construct discretely supported measures, suitable for computational purposes, which enjoy improved rates. Our approach is based on novel bounds between the Wasserstein distances and suitable Besov norms, which may be of independent interest.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/weed19a.html
  PDF: http://proceedings.mlr.press/v99/weed19a/weed19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-weed19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jonathan
    family: Weed
  - given: Quentin
    family: Berthet
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3118-3119
  id: weed19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3118
  lastpage: 3119
  published: 2019-06-25 00:00:00 +0000
- title: 'Estimating the Mixing Time of Ergodic Markov Chains'
  abstract: 'We address the problem of estimating the mixing time $t_{\mathsf{mix}}$ of an arbitrary ergodic finite Markov chain from a single trajectory of length $m$. The reversible case was addressed by Hsu et al. [2018+], who left the general case as an open problem. In the reversible case, the analysis is greatly facilitated by the fact that the Markov operator is self-adjoint, and Weyl’s inequality allows for a dimension-free perturbation analysis of the empirical eigenvalues. As Hsu et al. point out, in the absence of reversibility (and hence, the non-symmetry of the pair probabilities matrix), the existing perturbation analysis has a worst-case exponential dependence on the number of states $d$. Furthermore, even if an eigenvalue perturbation analysis with better dependence on $d$ were available, in the non-reversible case the connection between the spectral gap and the mixing time is not nearly as straightforward as in the reversible case. Our key insight is to estimate the pseudo-spectral gap instead, which allows us to overcome the loss of self-adjointness and to achieve a polynomial dependence on $d$ and the minimal stationary probability $\pi_\star$. Additionally, in the reversible case, we obtain simultaneous nearly (up to logarithmic factors) minimax rates in $t_{\mathsf{mix}}$ and precision $\varepsilon$, closing a gap in Hsu et al., who treated $\varepsilon$ as constant in the lower bounds. Finally, we construct fully empirical confidence intervals for the pseudo-spectral gap, which shrink to zero at a rate of roughly $1/\sqrt m$, and improve the state of the art in even the reversible case.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/wolfer19a.html
  PDF: http://proceedings.mlr.press/v99/wolfer19a/wolfer19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-wolfer19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Geoffrey
    family: Wolfer
  - given: Aryeh
    family: Kontorovich
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3120-3159
  id: wolfer19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3120
  lastpage: 3159
  published: 2019-06-25 00:00:00 +0000
- title: 'Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the $O(1/T)$ Convergence Rate'
  abstract: 'Stochastic approximation (SA) is a classical approach for stochastic convex optimization. Previous studies have demonstrated that the convergence rate of SA can be improved by introducing either smoothness or strong convexity condition. In this paper, we make use of smoothness and strong convexity simultaneously to boost the convergence rate. Let $\lambda$ be the modulus of strong convexity,  $\kappa$ be the condition number, $F_*$ be the minimal risk, and $\alpha>1$ be some small constant. First, we demonstrate that, in expectation, an $O(1/[\lambda T^\alpha] + \kappa F_*/T)$ risk bound is attainable when $T = \Omega(\kappa^\alpha)$. Thus, when $F_*$ is small, the convergence rate could be faster than $O(1/[\lambda T])$ and approaches $O(1/[\lambda T^\alpha])$ in the ideal case. Second, to further benefit from small risk, we show that, in expectation, an $O(1/2^{T/\kappa}+F_*)$ risk bound is achievable. Thus, the excess risk reduces exponentially until reaching $O(F_*)$, and if $F_*=0$, we obtain a global linear convergence. Finally, we emphasize that our proof is constructive and each risk bound is equipped with an efficient stochastic algorithm attaining that bound.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/zhang19a.html
  PDF: http://proceedings.mlr.press/v99/zhang19a/zhang19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-zhang19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Lijun
    family: Zhang
  - given: Zhi-Hua
    family: Zhou
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3160-3179
  id: zhang19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3160
  lastpage: 3179
  published: 2019-06-25 00:00:00 +0000
- title: 'Open Problem: Is Margin Sufficient for Non-Interactive Private Distributed Learning?'
  abstract: 'We ask whether every class of Boolean functions that has polynomial margin complexity can be PAC learned efficiently by a non-interactive locally differentially private algorithm.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/daniely19a.html
  PDF: http://proceedings.mlr.press/v99/daniely19a/daniely19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-daniely19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Amit
    family: Daniely
  - given: Vitaly
    family: Feldman
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3180-3184
  id: daniely19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3180
  lastpage: 3184
  published: 2019-06-25 00:00:00 +0000
- title: 'Open Problem: How fast can a multiclass test set be overfit?'
  abstract: 'We ask how many measurements of the accuracy on a multiclass benchmark are needed to achieve a given amount of overfitting.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/feldman19b.html
  PDF: http://proceedings.mlr.press/v99/feldman19b/feldman19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-feldman19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vitaly
    family: Feldman
  - given: Roy
    family: Frostig
  - given: Moritz
    family: Hardt
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3185-3189
  id: feldman19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3185
  lastpage: 3189
  published: 2019-06-25 00:00:00 +0000
- title: 'Open Problem: Do Good Algorithms Necessarily Query Bad Points?'
  abstract: 'Folklore results in the theory of Stochastic Approximation indicates the (minimax) optimality of Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951) with polynomially decaying stepsizes and iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992) for classes of stochastic convex optimization. Basing of these folkore results and some recent developments, this manuscript considers a more subtle question: does any algorithm necessarily (information theoretically) have to query iterates that are sub-optimal infinitely often?'
  volume: 99
  URL: https://proceedings.mlr.press/v99/ge19b.html
  PDF: http://proceedings.mlr.press/v99/ge19b/ge19b.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-ge19b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Rong
    family: Ge
  - given: Prateek
    family: Jain
  - given: Sham M.
    family: Kakade
  - given: Rahul
    family: Kidambi
  - given: Dheeraj M.
    family: Nagaraj
  - given: Praneeth
    family: Netrapalli
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3190-3193
  id: ge19b
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3190
  lastpage: 3193
  published: 2019-06-25 00:00:00 +0000
- title: 'Open Problem: Risk of Ruin in Multiarmed Bandits'
  abstract: 'We formalize a particular class of problems called \textit{survival multiarmed bandits} (S-MAB), which constitutes a modified version of \textit{budgeted multiarmed bandits} (B-MAB) where a true \textit{risk of ruin} must be considered, bringing it closer to \textit{risk-averse multiarmed bandits} (RA-MAB). In a S-MAB, pulling an arm can result in both positive and negative rewards. The agent has an initial budget that evolves in time with the received rewards. The goal is finding a good \textit{exploration-exploitation-safety} trade-off, maximizing rewards while minimizing the probability of getting ruined (i.e. hitting a negative budget). Such simple and until now neglected modification in the MAB statement changes the way to approach the problem, asking for adapted algorithms and specific analytical tools, and also make it more likely related to some important real-world applications. We are interested in the following open problems which stem from such new MAB definition: (a) how can the regret be meaningfully defined in formal terms for a S-MAB given its multiobjective optimization nature? (b) can a S-MAB be reduced to a RA-MAB or a B-MAB, transferring their theoretical guarantees? (c) what kind of method or strategy must an agent follow to optimally solve a S-MAB?'
  volume: 99
  URL: https://proceedings.mlr.press/v99/perotto19a.html
  PDF: http://proceedings.mlr.press/v99/perotto19a/perotto19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-perotto19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Filipo S.
    family: Perotto
  - given: Mathieu
    family: Bourgais
  - given: Bruno C.
    family: Silva
  - given: Laurent
    family: Vercouter
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3194-3197
  id: perotto19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3194
  lastpage: 3197
  published: 2019-06-25 00:00:00 +0000
- title: 'Open Problem: Monotonicity of Learning'
  abstract: 'We pose the question to what extent a learning algorithm behaves monotonically in the following sense: does it perform better, in expectation, when adding one instance to the training set? We focus on empirical risk minimization and illustrate this property with several examples, two where it does hold and two where it does not. We also relate it to the notion of PAC-learnability.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/viering19a.html
  PDF: http://proceedings.mlr.press/v99/viering19a/viering19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-viering19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Tom
    family: Viering
  - given: Alexander
    family: Mey
  - given: Marco
    family: Loog
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3198-3201
  id: viering19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3198
  lastpage: 3201
  published: 2019-06-25 00:00:00 +0000
- title: 'Open Problem: The Oracle Complexity of Convex Optimization with Limited Memory'
  abstract: 'We note that known methods achieving the optimal oracle complexity for first order convex optimization require quadratic memory, and ask whether this is necessary, and more broadly seek to characterize the minimax number of first order queries required to optimize a convex Lipschitz function subject to a memory constraint.'
  volume: 99
  URL: https://proceedings.mlr.press/v99/woodworth19a.html
  PDF: http://proceedings.mlr.press/v99/woodworth19a/woodworth19a.pdf
  edit: https://github.com/mlresearch//v99/edit/gh-pages/_posts/2019-06-25-woodworth19a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Thirty-Second Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Blake
    family: Woodworth
  - given: Nathan
    family: Srebro
  editor: 
  - given: Alina
    family: Beygelzimer
  - given: Daniel
    family: Hsu
  page: 3202-3210
  id: woodworth19a
  issued:
    date-parts: 
      - 2019
      - 6
      - 25
  firstpage: 3202
  lastpage: 3210
  published: 2019-06-25 00:00:00 +0000