- title: 'Conference on Learning Theory 2018: Preface'
abstract: 'Preface to the proceedings of the 31st Conference On Learning Theory.'
volume: 75
URL: https://proceedings.mlr.press/v75/bubeck18a.html
PDF: http://proceedings.mlr.press/v75/bubeck18a/bubeck18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-bubeck18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Sébastien
family: Bubeck
- given: Philippe
family: Rigollet
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1-1
id: bubeck18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1
lastpage: 1
published: 2018-07-03 00:00:00 +0000
- title: 'Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations'
abstract: 'We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $\tilde{O}(dr^{2})$ random linear measurements of a rank $r$ positive semidefinite matrix $X^{\star}$, we can recover $X^{\star}$ by parameterizing it by $UU^\top$ with $U\in \mathbb R^{d\times d}$ and minimizing the squared loss, even if $r \ll d$. We prove that, starting from a small initialization, gradient descent approximately recovers $X^{\star}$ in $\tilde{O}(\sqrt{r})$ iterations. This resolves the conjecture of Gunasekar et al. (2017) under the restricted isometry property. With some technical modifications, the technique also applies to the analysis of one-hidden-layer neural networks with quadratic activations.'
volume: 75
URL: https://proceedings.mlr.press/v75/li18a.html
PDF: http://proceedings.mlr.press/v75/li18a/li18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-li18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yuanzhi
family: Li
- given: Tengyu
family: Ma
- given: Hongyang
family: Zhang
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 2-47
id: li18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 2
lastpage: 47
published: 2018-07-03 00:00:00 +0000
- title: 'Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure'
abstract: 'Recently, research in unsupervised learning has gravitated towards exploring statistical-computational gaps induced by sparsity. A line of work initiated in Berthet and Rigollet (2013) has aimed to explain these gaps through reductions to conjecturally hard problems from complexity theory. However, the delicate nature of average-case reductions has limited the development of techniques and often led to weaker hardness results that only apply to algorithms robust to different noise distributions or that do not need to know the parameters of the problem. We introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. Our new lower bounds include:
- **Planted Independent Set:** We show tight lower bounds for detecting a planted independent set of size $k$ in a sparse Erdős-Rényi graph of size $n$ with edge density $\tilde{\Theta}(n^{-\alpha})$.
- **Planted Dense Subgraph:** If $p > q$ are the edge densities inside and outside of the community, we show the first lower bounds for the general regime $q = \tilde{\Theta}(n^{-\alpha})$ and $p - q = \tilde{\Theta}(n^{-\gamma})$ where $\gamma \ge \alpha$, matching the lower bounds predicted in Chen and Xu (2016). Our lower bounds apply to a deterministic community size $k$, resolving a question raised in Hajek et al. (2015).
- **Biclustering:** We show strong lower bounds for Gaussian biclustering as a simple hypothesis testing problem to detect a uniformly at random planted flat $k \times k$ submatrix.
- **Sparse Rank-1 Submatrix:** We show that detection in the sparse spiked Wigner model is often harder than biclustering, and are able to obtain two different tight lower bounds for these problems with different reductions from planted clique.
- **Sparse PCA:** We give a reduction between rank-1 submatrix and sparse PCA to obtain tight lower bounds in the less sparse regime $k \gg \sqrt{n}$, when the spectral algorithm is optimal over the SDP. We give an alternate reduction recovering the lower bounds of Berthet and Rigollet (2013) and Gao et al. (2017) in the simple hypothesis testing variant of sparse PCA. We also observe a subtlety in the complexity of sparse PCA that arises when the planted vector is biased.
- **Subgraph Stochastic Block Model:** We introduce a model where two small communities are planted in an Erdős-Rényi graph of the same average edge density and give tight lower bounds yielding different hard regimes than planted dense subgraph.

Our results demonstrate that, despite the delicate nature of average-case reductions, using natural problems as intermediates can often be beneficial, as is the case in worst-case complexity. Our main technical contribution is to introduce a set of techniques for average-case reductions that: (1) maintain the level of signal in an instance of a problem; (2) alter its planted structure; and (3) map two initial high-dimensional distributions simultaneously to two target distributions approximately under total variation. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we consider.'
volume: 75
URL: https://proceedings.mlr.press/v75/brennan18a.html
PDF: http://proceedings.mlr.press/v75/brennan18a/brennan18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-brennan18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Matthew
family: Brennan
- given: Guy
family: Bresler
- given: Wasim
family: Huleihel
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 48-166
id: brennan18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 48
lastpage: 166
published: 2018-07-03 00:00:00 +0000
- title: 'Logistic Regression: The Importance of Being Improper'
abstract: 'Learning linear predictors with the logistic loss—both in stochastic and online settings—is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing “fast rates” for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter (2012) when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We also show that the improved dependence on predictor norm is near-optimal. Leveraging this improved dependency on the predictor norm yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $\tilde{O}(\sqrt{n})$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric and even nonparametric settings.'
volume: 75
URL: https://proceedings.mlr.press/v75/foster18a.html
PDF: http://proceedings.mlr.press/v75/foster18a/foster18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-foster18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Dylan J.
family: Foster
- given: Satyen
family: Kale
- given: Haipeng
family: Luo
- given: Mehryar
family: Mohri
- given: Karthik
family: Sridharan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 167-208
id: foster18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 167
lastpage: 208
published: 2018-07-03 00:00:00 +0000
- title: 'Actively Avoiding Nonsense in Generative Models'
abstract: 'A generative model may generate utter nonsense when it is fit to maximize the likelihood of observed data. This happens due to “model error,” i.e., when the true data generating distribution does not fit within the class of generative models being learned. To address this, we propose a model of active distribution learning using a binary invalidity oracle that identifies some examples as clearly invalid, together with random positive examples sampled from the true distribution. The goal is to maximize the likelihood of the positive examples subject to the constraint of (almost) never generating examples labeled invalid by the oracle. Guarantees are agnostic, i.e., relative to the best distribution in a class of probability distributions. We first show that proper learning may require exponentially many queries to the invalidity oracle. We then give an improper distribution learning algorithm that uses only polynomially many queries.'
volume: 75
URL: https://proceedings.mlr.press/v75/hanneke18a.html
PDF: http://proceedings.mlr.press/v75/hanneke18a/hanneke18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-hanneke18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Steve
family: Hanneke
- given: Adam Tauman
family: Kalai
- given: Gautam
family: Kamath
- given: Christos
family: Tzamos
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 209-227
id: hanneke18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 209
lastpage: 227
published: 2018-07-03 00:00:00 +0000
- title: 'A Faster Approximation Algorithm for the Gibbs Partition Function'
abstract: 'We consider the problem of estimating the partition function $Z(\beta)=\sum_x \exp(-\beta H(x))$ of a Gibbs distribution with a Hamiltonian $H(\cdot)$, or more precisely the logarithm of the ratio $q=\ln(Z(0)/Z(\beta))$. It has been recently shown how to approximate $q$ with high probability assuming the existence of an oracle that produces samples from the Gibbs distribution for a given parameter value in $[0,\beta]$. The current best known approach due to Huber (2015) uses $O(q\ln n\cdot[\ln q + \ln \ln n+\varepsilon^{-2}])$ oracle calls on average where $\varepsilon$ is the desired accuracy of approximation and $H(\cdot)$ is assumed to lie in $\{0\}\cup[1,n]$. We improve the complexity to $O(q\ln n\cdot\varepsilon^{-2})$ oracle calls. We also show that the same complexity can be achieved if exact oracles are replaced with approximate sampling oracles that are within $O(\frac{\varepsilon^2}{q\ln n})$ variation distance from exact oracles. Finally, we prove a lower bound of $\Omega(q\cdot \varepsilon^{-2})$ oracle calls under a natural model of computation.'
volume: 75
URL: https://proceedings.mlr.press/v75/kolmogorov18a.html
PDF: http://proceedings.mlr.press/v75/kolmogorov18a/kolmogorov18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-kolmogorov18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Vladimir
family: Kolmogorov
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 228-249
id: kolmogorov18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 228
lastpage: 249
published: 2018-07-03 00:00:00 +0000
- title: 'Exponential Convergence of Testing Error for Stochastic Gradient Methods'
abstract: 'We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing \emph{loss} (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing \emph{error} (classification error) converges exponentially fast if low-noise conditions are assumed. To achieve these rates of convergence we show sharper high-probability bounds with respect to the number of observations for stochastic gradient descent.'
volume: 75
URL: https://proceedings.mlr.press/v75/pillaud-vivien18a.html
PDF: http://proceedings.mlr.press/v75/pillaud-vivien18a/pillaud-vivien18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-pillaud-vivien18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Loucas
family: Pillaud-Vivien
- given: Alessandro
family: Rudi
- given: Francis
family: Bach
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 250-296
id: pillaud-vivien18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 250
lastpage: 296
published: 2018-07-03 00:00:00 +0000
- title: 'Size-Independent Sample Complexity of Neural Networks'
abstract: 'We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.'
volume: 75
URL: https://proceedings.mlr.press/v75/golowich18a.html
PDF: http://proceedings.mlr.press/v75/golowich18a/golowich18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-golowich18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Noah
family: Golowich
- given: Alexander
family: Rakhlin
- given: Ohad
family: Shamir
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 297-299
id: golowich18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 297
lastpage: 299
published: 2018-07-03 00:00:00 +0000
- title: 'Underdamped Langevin MCMC: A non-asymptotic analysis'
abstract: 'We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present an MCMC algorithm based on its discretization and show that it achieves $\varepsilon$ error (in 2-Wasserstein distance) in $\mathcal{O}(\sqrt{d}/\varepsilon)$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is $\mathcal{O}(d/\varepsilon^2)$ steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.'
volume: 75
URL: https://proceedings.mlr.press/v75/cheng18a.html
PDF: http://proceedings.mlr.press/v75/cheng18a/cheng18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-cheng18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Xiang
family: Cheng
- given: Niladri S.
family: Chatterji
- given: Peter L.
family: Bartlett
- given: Michael I.
family: Jordan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 300-323
id: cheng18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 300
lastpage: 323
published: 2018-07-03 00:00:00 +0000
- title: 'Online Variance Reduction for Stochastic Optimization'
abstract: 'Modern stochastic optimization methods often rely on uniform sampling which is agnostic to the underlying characteristics of the data. This might degrade the convergence by yielding estimates that suffer from a high variance. A possible remedy is to employ non-uniform \emph{importance sampling} techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting which poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling data points, it naturally extends to selecting coordinates or even blocks thereof. Empirical validations underline the benefits of our method in several settings.'
volume: 75
URL: https://proceedings.mlr.press/v75/borsos18a.html
PDF: http://proceedings.mlr.press/v75/borsos18a/borsos18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-borsos18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Zalan
family: Borsos
- given: Andreas
family: Krause
- given: Kfir Y.
family: Levy
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 324-357
id: borsos18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 324
lastpage: 357
published: 2018-07-03 00:00:00 +0000
- title: 'Information Directed Sampling and Bandits with Heteroscedastic Noise'
abstract: 'In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and to satisfy a tail bound uniformly on the domain; a restrictive assumption for many applications. In this work, we consider bandits with heteroscedastic noise, where we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret analysis framework that is similar to the Bayesian framework of Russo and Van Roy (2014), and we prove a new high-probability regret bound for general, possibly randomized policies, which depends on a quantity we refer to as the regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) to minimize the regret-information ratio over all possible action sampling distributions. This further relies on concentration inequalities for online least squares regression in separable Hilbert spaces, which we generalize to the case of heteroscedastic noise. We then formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding novel algorithms for Bayesian optimization. We also prove frequentist regret bounds, which in the homoscedastic case recover known bounds for UCB, but can be much better when the noise is heteroscedastic. Empirically, we demonstrate, in a linear setting with heteroscedastic noise, that some of our methods can outperform UCB and Thompson Sampling, while staying competitive when the noise is homoscedastic.'
volume: 75
URL: https://proceedings.mlr.press/v75/kirschner18a.html
PDF: http://proceedings.mlr.press/v75/kirschner18a/kirschner18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-kirschner18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Johannes
family: Kirschner
- given: Andreas
family: Krause
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 358-384
id: kirschner18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 358
lastpage: 384
published: 2018-07-03 00:00:00 +0000
- title: 'Testing Symmetric Markov Chains From a Single Trajectory'
abstract: 'Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a {\em single trajectory of a Markov Chain.} In particular, we observe a single trajectory $X_0,\ldots,X_t,\ldots$ of an unknown, symmetric, and finite state Markov Chain $\cal M$. We do not control the starting state $X_0$, and we cannot restart the chain. Given our single trajectory, the goal is to test whether $\cal M$ is identical to a model Markov Chain ${\cal M}'$, or far from it under an appropriate notion of difference. We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos (1978), which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain ${\cal M}'$ are $\tilde{O}(n)$ in the size of the state space $n$.'
volume: 75
URL: https://proceedings.mlr.press/v75/daskalakis18a.html
PDF: http://proceedings.mlr.press/v75/daskalakis18a/daskalakis18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-daskalakis18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Constantinos
family: Daskalakis
- given: Nishanth
family: Dikkala
- given: Nick
family: Gravin
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 385-409
id: daskalakis18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 385
lastpage: 409
published: 2018-07-03 00:00:00 +0000
- title: 'Detection limits in the high-dimensional spiked rectangular model'
abstract: 'We study the problem of detecting the presence of a single unknown spike in a rectangular data matrix, in a high-dimensional regime where the spike has fixed strength and the aspect ratio of the matrix converges to a finite limit. This setup includes Johnstone’s spiked covariance model. We analyze the likelihood ratio of the spiked model against an “all noise” null model of reference, and show it has asymptotically Gaussian fluctuations in a region below—but in general not up to—the so-called BBP threshold from random matrix theory. Our result parallels earlier findings of Onatski et al. (2013) and Johnstone-Onatski (2015) for spherical spikes. We present a probabilistic approach capable of treating generic product priors. In particular, sparsity in the spike is allowed. Our approach operates through the principle of the cavity method from spin-glass theory. The question of the maximal parameter region where asymptotic normality is expected to hold is left open. This region, not necessarily given by BBP, is shaped by the prior in a non-trivial way. We conjecture that this is the entire paramagnetic phase of an associated spin-glass model, and is defined by the vanishing of the replica-symmetric solution of Lesieur et al. (2015).'
volume: 75
URL: https://proceedings.mlr.press/v75/el-alaoui18a.html
PDF: http://proceedings.mlr.press/v75/el-alaoui18a/el-alaoui18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-el-alaoui18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Ahmed
family: El Alaoui
- given: Michael I.
family: Jordan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 410-438
id: el-alaoui18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 410
lastpage: 438
published: 2018-07-03 00:00:00 +0000
- title: 'Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification'
abstract: 'We prove that the ordinary least-squares (OLS) estimator attains nearly minimax optimal performance for the identification of linear dynamical systems from a single observed trajectory. Our upper bound relies on a generalization of Mendelson’s small-ball method to dependent data, eschewing the use of standard mixing-time arguments. Our lower bounds reveal that these upper bounds match up to logarithmic factors. In particular, we capture the correct signal-to-noise behavior of the problem, showing that \emph{more unstable} linear systems are \emph{easier} to estimate. This behavior is qualitatively different from arguments which rely on mixing-time calculations that suggest that unstable systems are more difficult to estimate. We generalize our technique to provide bounds for a more general class of linear response time-series. '
volume: 75
URL: https://proceedings.mlr.press/v75/simchowitz18a.html
PDF: http://proceedings.mlr.press/v75/simchowitz18a/simchowitz18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-simchowitz18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Max
family: Simchowitz
- given: Horia
family: Mania
- given: Stephen
family: Tu
- given: Michael I.
family: Jordan
- given: Benjamin
family: Recht
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 439-473
id: simchowitz18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 439
lastpage: 473
published: 2018-07-03 00:00:00 +0000
- title: 'Active Tolerant Testing'
abstract: 'In this work, we show that for a nontrivial hypothesis class $\mathcal C$, we can estimate the distance of a target function $f$ to $\mathcal C$ (estimate the error rate of the best $h\in \mathcal C$) using substantially fewer labeled examples than would be needed to actually {\em learn} a good $h \in \mathcal C$. Specifically, we show that for the class $\mathcal C$ of unions of $d$ intervals on the line, in the active learning setting in which we have access to a pool of unlabeled examples drawn from an arbitrary underlying distribution $\mathcal D$, we can estimate the error rate of the best $h \in \mathcal C$ to an additive error $\epsilon$ with a number of label requests that is {\em independent of $d$} and depends only on $\epsilon$. In particular, we make $O(\frac{1}{\epsilon^6}\log \frac{1}{\epsilon})$ label queries to an unlabeled pool of size $O(\frac{d}{\epsilon^2}\log \frac{1}{\epsilon})$. This task of estimating the distance of an unknown $f$ to a given class $\mathcal C$ is called {\em tolerant testing} or {\em distance estimation} in the testing literature, usually studied in a membership query model and with respect to the uniform distribution. Our work extends that of Balcan et al. (2012) who solved the {\em non}-tolerant testing problem for this class (distinguishing the zero-error case from the case that the best hypothesis in the class has error greater than $\epsilon$). We also consider the related problem of estimating the performance of a given learning algorithm $\mathcal A$ in this setting. That is, given a large pool of unlabeled examples drawn from distribution $\mathcal D$, can we, from only a few label queries, estimate how well $\mathcal A$ would perform if the entire dataset were labeled and given as training data to $\mathcal A$? 
We focus on $k$-Nearest Neighbor style algorithms, and also show how our results can be applied to the problem of hyperparameter tuning (selecting the best value of $k$ for the given learning problem).'
volume: 75
URL: https://proceedings.mlr.press/v75/blum18a.html
PDF: http://proceedings.mlr.press/v75/blum18a/blum18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-blum18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Avrim
family: Blum
- given: Lunjia
family: Hu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 474-497
id: blum18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 474
lastpage: 497
published: 2018-07-03 00:00:00 +0000
- title: 'Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods'
abstract: 'The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $E$ in $\mathbb{R}^n$ so that data points projected onto $E$ follow a non-Gaussian distribution. Vempala and Xiao (2011) proposed a local search algorithm, and showed that it was able to estimate $E$ accurately with polynomial time and sample complexity, if the dimension of $E$ is treated as a constant and with the assumption that all one-dimensional marginals of the non-Gaussian distribution over $E$ have non-Gaussian moments. In this paper, we propose a simple spectral algorithm called \textsc{Reweighted PCA}, and prove that it possesses the same guarantee. The principle that underlies this approach is a new characterization of multivariate Gaussian distributions.'
volume: 75
URL: https://proceedings.mlr.press/v75/tan18a.html
PDF: http://proceedings.mlr.press/v75/tan18a/tan18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-tan18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yan Shuo
family: Tan
- given: Roman
family: Vershynin
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 498-534
id: tan18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 498
lastpage: 534
published: 2018-07-03 00:00:00 +0000
- title: 'Calibrating Noise to Variance in Adaptive Data Analysis'
abstract: ' Datasets are often used multiple times and each successive analysis may depend on the outcome of previous analyses. Standard techniques for ensuring generalization and statistical validity do not account for this adaptive dependence. A recent line of work studies the challenges that arise from such adaptive data reuse by considering the problem of answering a sequence of “queries” about the data distribution where each query may depend arbitrarily on answers to previous queries. The strongest results obtained for this problem rely on differential privacy – a strong notion of algorithmic stability with the important property that it “composes” well when data is reused. However the notion is rather strict, as it requires stability under replacement of an arbitrary data element. The simplest algorithm is to add Gaussian (or Laplace) noise to distort the empirical answers. However, analysing this technique using differential privacy yields suboptimal accuracy guarantees when the queries have low variance. Here we propose a relaxed notion of stability based on KL divergence that also composes adaptively. We show that our notion of stability implies a bound on the mutual information between the dataset and the output of the algorithm and then derive new generalization guarantees implied by bounded mutual information. We demonstrate that a simple and natural algorithm based on adding noise scaled to the standard deviation of the query provides our notion of stability. This implies an algorithm that can answer statistical queries about the dataset with substantially improved accuracy guarantees for low-variance queries. The only previous approach that provides such accuracy guarantees is based on a more involved differentially private median-of-means algorithm and its analysis exploits stronger “group” stability of the algorithm. '
volume: 75
URL: https://proceedings.mlr.press/v75/feldman18a.html
PDF: http://proceedings.mlr.press/v75/feldman18a/feldman18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-feldman18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Vitaly
family: Feldman
- given: Thomas
family: Steinke
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 535-544
id: feldman18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 535
lastpage: 544
published: 2018-07-03 00:00:00 +0000
- title: 'Accelerating Stochastic Gradient Descent for Least Squares Regression'
abstract: 'There is widespread sentiment that fast gradient methods (e.g. Nesterov’s acceleration, conjugate gradient, heavy ball) are not effective for the purposes of stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical errors. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes this conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.'
volume: 75
URL: https://proceedings.mlr.press/v75/jain18a.html
PDF: http://proceedings.mlr.press/v75/jain18a/jain18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-jain18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Prateek
family: Jain
- given: Sham M.
family: Kakade
- given: Rahul
family: Kidambi
- given: Praneeth
family: Netrapalli
- given: Aaron
family: Sidford
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 545-604
id: jain18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 545
lastpage: 604
published: 2018-07-03 00:00:00 +0000
- title: 'Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints'
abstract: 'We study the generalization errors of \emph{non-convex} regularized ERM procedures using Stochastic Gradient Langevin Dynamics (SGLD). Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian theory respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_N}\right)$, where $L$ is the Lipschitz parameter, $\beta$ is the inverse temperature, and $T_N$ is the sum of step sizes. For PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step decays exponentially through time, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the optimization trajectory. Our bounds have reasonable dependence on aggregated step sizes, and do not explicitly depend on dimensions, norms or other capacity measures of the parameter. The bounds characterize how the noise in the algorithm itself controls the statistical learning behavior in non-convex problems, without uniform convergence in the hypothesis space, which sheds light on the effect of training algorithms on the generalization error for deep neural networks.'
volume: 75
URL: https://proceedings.mlr.press/v75/mou18a.html
PDF: http://proceedings.mlr.press/v75/mou18a/mou18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-mou18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Wenlong
family: Mou
- given: Liwei
family: Wang
- given: Xiyu
family: Zhai
- given: Kai
family: Zheng
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 605-638
id: mou18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 605
lastpage: 638
published: 2018-07-03 00:00:00 +0000
- title: 'Optimal approximation of continuous functions by very deep ReLU networks'
abstract: 'We consider approximations of general continuous functions on finite-dimensional cubes by general deep ReLU neural networks and study the approximation rates with respect to the modulus of continuity of the function and the total number of weights $W$ in the network. We establish the complete phase diagram of feasible approximation rates and show that it includes two distinct phases. One phase corresponds to slower approximations that can be achieved with constant-depth networks and continuous weight assignments. The other phase provides faster approximations at the cost of depths necessarily growing as a power law $L\sim W^{\alpha}, 0<\alpha\le 1,$ and with necessarily discontinuous weight assignments. In particular, we prove that constant-width fully-connected networks of depth $L\sim W$ provide the fastest possible approximation rate $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-2/\nu})))$ that cannot be achieved with less deep networks. '
volume: 75
URL: https://proceedings.mlr.press/v75/yarotsky18a.html
PDF: http://proceedings.mlr.press/v75/yarotsky18a/yarotsky18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-yarotsky18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Dmitry
family: Yarotsky
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 639-649
id: yarotsky18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 639
lastpage: 649
published: 2018-07-03 00:00:00 +0000
- title: 'Averaging Stochastic Gradient Descent on Riemannian Manifolds'
abstract: 'We consider the minimization of a function defined on a Riemannian manifold $\mathcal{M}$ accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on $\mathcal{M}$ to an averaged iterate sequence with a robust and fast $O(1/n)$ convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming $k$-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.'
volume: 75
URL: https://proceedings.mlr.press/v75/tripuraneni18a.html
PDF: http://proceedings.mlr.press/v75/tripuraneni18a/tripuraneni18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-tripuraneni18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Nilesh
family: Tripuraneni
- given: Nicolas
family: Flammarion
- given: Francis
family: Bach
- given: Michael I.
family: Jordan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 650-687
id: tripuraneni18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 650
lastpage: 687
published: 2018-07-03 00:00:00 +0000
- title: 'Fitting a Putative Manifold to Noisy Data'
abstract: 'In the present work, we give a solution to the following question from manifold learning. Suppose data belonging to a high-dimensional Euclidean space are drawn independently, identically distributed from a measure supported on a low-dimensional twice differentiable embedded manifold $M$, and corrupted by a small amount of Gaussian noise. How can we produce a manifold $M''$ whose Hausdorff distance to $M$ is small and whose reach is not much smaller than the reach of $M$?'
volume: 75
URL: https://proceedings.mlr.press/v75/fefferman18a.html
PDF: http://proceedings.mlr.press/v75/fefferman18a/fefferman18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-fefferman18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Charles
family: Fefferman
- given: Sergei
family: Ivanov
- given: Yaroslav
family: Kurylev
- given: Matti
family: Lassas
- given: Hariharan
family: Narayanan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 688-720
id: fefferman18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 688
lastpage: 720
published: 2018-07-03 00:00:00 +0000
- title: 'Private Sequential Learning'
abstract: 'We formulate a private learning model to study an intrinsic tradeoff between privacy and query complexity in sequential learning. Our model involves a learner who aims to determine a scalar value, $v^*$, by sequentially querying an external database and receiving binary responses. In the meantime, an adversary observes the learner’s queries, though not the responses, and tries to infer from them the value of $v^*$. The objective of the learner is to obtain an accurate estimate of $v^*$ using only a small number of queries, while simultaneously protecting her privacy by making $v^*$ provably difficult to learn for the adversary. Our main results provide tight upper and lower bounds on the learner’s query complexity as a function of desired levels of privacy and estimation accuracy. We also construct explicit query strategies whose complexity is optimal up to an additive constant.'
volume: 75
URL: https://proceedings.mlr.press/v75/tsitsiklis18a.html
PDF: http://proceedings.mlr.press/v75/tsitsiklis18a/tsitsiklis18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-tsitsiklis18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: John
family: Tsitsiklis
- given: Kuang
family: Xu
- given: Zhi
family: Xu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 721-727
id: tsitsiklis18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 721
lastpage: 727
published: 2018-07-03 00:00:00 +0000
- title: 'Optimal Errors and Phase Transitions in High-Dimensional Generalized Linear Models'
abstract: 'Generalized linear models (GLMs) arise in high-dimensional machine learning, statistics, communications and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes or benchmark models in neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal inference and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and dimensions are large and their ratio is fixed. Non-rigorous predictions for the optimal inference and generalization errors existed for special cases of GLMs, e.g. for the perceptron in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance, and locate the associated sharp phase transitions separating learnable and non-learnable regions.'
volume: 75
URL: https://proceedings.mlr.press/v75/barbier18a.html
PDF: http://proceedings.mlr.press/v75/barbier18a/barbier18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-barbier18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Jean
family: Barbier
- given: Florent
family: Krzakala
- given: Nicolas
family: Macris
- given: Léo
family: Miolane
- given: Lenka
family: Zdeborová
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 728-731
id: barbier18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 728
lastpage: 731
published: 2018-07-03 00:00:00 +0000
- title: 'Exact and Robust Conformal Inference Methods for Predictive Machine Learning with Dependent Data'
abstract: 'We extend conformal inference to general settings that allow for time series data. Our proposal is developed as a randomization method and accounts for potential serial dependence by including block structures in the permutation scheme. As a result, the proposed method retains the exact, model-free validity when the data are i.i.d. or more generally exchangeable, similar to usual conformal inference methods. When exchangeability fails, as is the case for common time series data, the proposed approach is approximately valid under weak assumptions on the conformity score.'
volume: 75
URL: https://proceedings.mlr.press/v75/chernozhukov18a.html
PDF: http://proceedings.mlr.press/v75/chernozhukov18a/chernozhukov18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-chernozhukov18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Victor
family: Chernozhukov
- given: Kaspar
family: Wüthrich
- given: Yinchu
family: Zhu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 732-749
id: chernozhukov18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 732
lastpage: 749
published: 2018-07-03 00:00:00 +0000
- title: 'Nonstochastic Bandits with Composite Anonymous Feedback'
abstract: 'We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.'
volume: 75
URL: https://proceedings.mlr.press/v75/cesa-bianchi18a.html
PDF: http://proceedings.mlr.press/v75/cesa-bianchi18a/cesa-bianchi18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-cesa-bianchi18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Nicolò
family: Cesa-Bianchi
- given: Claudio
family: Gentile
- given: Yishay
family: Mansour
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 750-773
id: cesa-bianchi18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 750
lastpage: 773
published: 2018-07-03 00:00:00 +0000
- title: 'Lower Bounds for Higher-Order Convex Optimization'
abstract: 'State-of-the-art methods in mathematical optimization employ higher-order derivative information. We explore the limitations of higher-order optimization and prove that even for convex optimization, a polynomial dependence on the approximation guarantee and higher-order smoothness parameters is necessary. This refutes the hope that higher-order smoothness and higher-order derivatives can lead to dimension free polynomial time algorithms for convex optimization. As a special case, we show Nesterov’s accelerated cubic regularization method and higher-order methods to be nearly tight.'
volume: 75
URL: https://proceedings.mlr.press/v75/agarwal18a.html
PDF: http://proceedings.mlr.press/v75/agarwal18a/agarwal18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-agarwal18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Naman
family: Agarwal
- given: Elad
family: Hazan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 774-792
id: agarwal18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 774
lastpage: 792
published: 2018-07-03 00:00:00 +0000
- title: 'Log-concave sampling: Metropolis-Hastings algorithms are fast!'
abstract: 'We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by running a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step to ensure the correct stationary distribution. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds reveal that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $\delta$ for a density with condition number $\kappa$, we show that MALA requires $\mathcal{O} \big(\kappa d \log(1/\delta) \big)$ steps, as compared to the $\mathcal{O} \big(\kappa^2 d/\delta^2 \big)$ steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for a zeroth-order method Metropolized random walk (MRW) and show that it mixes $\mathcal{O}(\kappa d)$ slower than MALA.'
volume: 75
URL: https://proceedings.mlr.press/v75/dwivedi18a.html
PDF: http://proceedings.mlr.press/v75/dwivedi18a/dwivedi18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-dwivedi18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Raaz
family: Dwivedi
- given: Yuansi
family: Chen
- given: Martin J
family: Wainwright
- given: Bin
family: Yu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 793-797
id: dwivedi18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 793
lastpage: 797
published: 2018-07-03 00:00:00 +0000
- title: 'Incentivizing Exploration by Heterogeneous Users'
abstract: 'We consider the problem of incentivizing exploration with heterogeneous agents. In this problem, $N$ bandit arms provide vector-valued outcomes equal to an unknown arm-specific attribute vector, perturbed by independent noise. Agents arrive sequentially and choose arms to pull based on their own private and heterogeneous linear utility functions over attributes and the estimates of the arms’ attribute vectors derived from observations of other agents’ past pulls. Agents are myopic and selfish and thus would choose the arm with maximum estimated utility. A principal, knowing only the distribution from which agents’ preferences are drawn, but not the specific draws, can offer arm-specific incentive payments to encourage agents to explore underplayed arms. The principal seeks to minimize the total expected cumulative regret incurred by agents relative to their best arms, while also making a small expected cumulative payment. We propose an algorithm that incentivizes arms played infrequently in the past whose probability of being played in the next round would be small without incentives. Under the assumption that each arm is preferred by at least a fraction $p > 0$ of agents, we show that this algorithm achieves expected cumulative regret of $O(N e^{2/p} + N \log^3(T))$, using expected cumulative payments of $O(N^2 e^{2/p})$. If $p$ is known or the distribution over agent preferences is discrete, the exponential term $e^{2/p}$ can be replaced with suitable polynomials in $N$ and $1/p$. For discrete preferences, the regret’s dependence on $T$ can be eliminated entirely, giving constant (depending only polynomially on $N$ and $1/p$) expected regret and payments. This constant regret stands in contrast to the $\Theta(\log(T))$ dependence of regret in standard multi-armed bandit problems. It arises because even unobserved heterogeneity in agent preferences causes exploitation of arms to also explore arms fully; succinctly, heterogeneity provides free exploration.'
volume: 75
URL: https://proceedings.mlr.press/v75/chen18a.html
PDF: http://proceedings.mlr.press/v75/chen18a/chen18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-chen18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Bangrui
family: Chen
- given: Peter
family: Frazier
- given: David
family: Kempe
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 798-818
id: chen18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 798
lastpage: 818
published: 2018-07-03 00:00:00 +0000
- title: 'Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms'
abstract: 'We study the problem of robustly learning multi-dimensional histograms. A $d$-dimensional function $h: D \to \mathbb{R}$ is called a $k$-histogram if there exists a partition of the domain $D \subseteq \mathbb{R}^d$ into $k$ axis-aligned rectangles such that $h$ is constant within each such rectangle. Let $f: D \to \mathbb{R}$ be a $d$-dimensional probability density function and suppose that $f$ is $\mathrm{OPT}$-close, in $L_1$-distance, to an unknown $k$-histogram (with unknown partition). Our goal is to output a hypothesis that is $O(\mathrm{OPT}) + \epsilon$ close to $f$, in $L_1$-distance. We give an algorithm for this learning problem that uses $n = \tilde{O}_d(k/\epsilon^2)$ samples and runs in time $\tilde{O}_d(n)$. For any fixed dimension, our algorithm has optimal sample complexity, up to logarithmic factors, and runs in near-linear time. Prior to our work, the time complexity of the $d=1$ case was well-understood, but significant gaps in our understanding remained even for $d=2$.'
volume: 75
URL: https://proceedings.mlr.press/v75/diakonikolas18a.html
PDF: http://proceedings.mlr.press/v75/diakonikolas18a/diakonikolas18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-diakonikolas18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Ilias
family: Diakonikolas
- given: Jerry
family: Li
- given: Ludwig
family: Schmidt
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 819-842
id: diakonikolas18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 819
lastpage: 842
published: 2018-07-03 00:00:00 +0000
- title: 'Time-Space Tradeoffs for Learning Finite Functions from Random Evaluations, with Applications to Polynomials'
abstract: 'We develop an extension of recent analytic methods for obtaining time-space tradeoff lower bounds for problems of learning from uniformly random labelled examples. With our methods we can obtain bounds for learning concept classes of finite functions from random evaluations even when the sample space of random inputs can be significantly smaller than the concept class of functions and the function values can be from an arbitrary finite set. At the core of our results, we reduce the time-space complexity of learning from random evaluations to the question of how much the corresponding evaluation matrix amplifies the 2-norms of “almost uniform” probability distributions. To analyze the latter, we formulate it as a semidefinite program, and we analyze its dual. In order to handle function values from arbitrary finite sets, we apply this norm amplification analysis to complex matrices. As applications that follow from our new techniques, we show that any algorithm that learns $n$-variate polynomial functions of degree at most $d$ over $\mathbb{F}_2$ with success at least $2^{-O(n)}$ from evaluations on randomly chosen inputs either requires space $\Omega(nm/d)$ or $2^{\Omega(n/d)}$ time where $m=(n/d)^{\Theta(d)}$ is the dimension of the space of such polynomials. These bounds are asymptotically optimal for polynomials of arbitrary constant degree since they match the tradeoffs achieved by natural learning algorithms for the problems. We extend these results to learning polynomials of degree at most $d$ over any odd prime field $\mathbb{F}_p$ where we show that $\Omega((mn/d)\log p)$ space or time $p^{\Omega(n/d)}$ is required. To derive our bounds for learning polynomials over finite fields, we show that an analysis of the dual of the corresponding semidefinite program follows from an understanding of the distribution of the bias of all degree $d$ polynomials with respect to uniformly random inputs.'
volume: 75
URL: https://proceedings.mlr.press/v75/beame18a.html
PDF: http://proceedings.mlr.press/v75/beame18a/beame18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-beame18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Paul
family: Beame
- given: Shayan
family: Oveis Gharan
- given: Xin
family: Yang
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 843-856
id: beame18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 843
lastpage: 856
published: 2018-07-03 00:00:00 +0000
- title: 'Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability'
abstract: 'We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003). For a particular local optimum of the empirical risk, with an \textit{arbitrary initialization}, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short \textit{recurrence time}; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long \textit{escape time}. We call this phenomenon \textit{empirical metastability}. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality. '
volume: 75
URL: https://proceedings.mlr.press/v75/tzen18a.html
PDF: http://proceedings.mlr.press/v75/tzen18a/tzen18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-tzen18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Belinda
family: Tzen
- given: Tengyuan
family: Liang
- given: Maxim
family: Raginsky
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 857-875
id: tzen18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 857
lastpage: 875
published: 2018-07-03 00:00:00 +0000
- title: 'Hardness of Learning Noisy Halfspaces using Polynomial Thresholds'
abstract: 'We prove the hardness of weakly learning halfspaces in the presence of adversarial noise using polynomial threshold functions (PTFs). In particular, we prove that for any constants $d \in \mathbb{Z}^+$ and $\epsilon > 0$, it is NP-hard to decide: given a set of $\{-1,1\}$-labeled points in $\mathbb{R}^n$ whether (YES Case) there exists a halfspace that classifies $(1-\epsilon)$-fraction of the points correctly, or (NO Case) any degree-$d$ PTF classifies at most $(1/2 + \epsilon)$-fraction of the points correctly. This strengthens to all constant degrees the previous NP-hardness of learning using degree-$2$ PTFs shown by Diakonikolas et al. (2011). The latter result had remained the only progress over the works of Feldman et al. (2006) and Guruswami et al. (2006) ruling out weak proper learning of adversarially noisy halfspaces.'
volume: 75
URL: https://proceedings.mlr.press/v75/bhattacharyya18a.html
PDF: http://proceedings.mlr.press/v75/bhattacharyya18a/bhattacharyya18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-bhattacharyya18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Arnab
family: Bhattacharyya
- given: Suprovat
family: Ghoshal
- given: Rishi
family: Saket
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 876-917
id: bhattacharyya18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 876
lastpage: 917
published: 2018-07-03 00:00:00 +0000
- title: 'Best of both worlds: Stochastic & adversarial best-arm identification'
abstract: 'We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: \emph{Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards?} First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.'
volume: 75
URL: https://proceedings.mlr.press/v75/abbasi-yadkori18a.html
PDF: http://proceedings.mlr.press/v75/abbasi-yadkori18a/abbasi-yadkori18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-abbasi-yadkori18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yasin
family: Abbasi-Yadkori
- given: Peter
family: Bartlett
- given: Victor
family: Gabillon
- given: Alan
family: Malek
- given: Michal
family: Valko
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 918-949
id: abbasi-yadkori18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 918
lastpage: 949
published: 2018-07-03 00:00:00 +0000
- title: 'Learning Patterns for Detection with Multiscale Scan Statistics'
abstract: 'This paper addresses detecting anomalous patterns in images, time-series, and tensor data when the location and scale of the pattern, and the pattern itself, are unknown a priori. The multiscale scan statistic convolves the proposed pattern with the image at various scales and returns the maximum of the resulting tensor. Scale-corrected multiscale scan statistics apply different standardizations at each scale, and the limiting distribution under the null hypothesis—that the data is only noise—is known for smooth patterns. We consider the problem of simultaneously learning and detecting the anomalous pattern from a dictionary of smooth patterns and a database of many tensors. To this end, we show that the multiscale scan statistic is a subexponential random variable, and prove a chaining lemma for standardized suprema, which may be of independent interest. Then by averaging the statistics over the database of tensors we can learn the pattern and obtain Bernstein-type error bounds. We also provide a construction of an $\epsilon$-net of the location and scale parameters, providing a computationally tractable approximation with similar error bounds.'
volume: 75
URL: https://proceedings.mlr.press/v75/sharpnack18a.html
PDF: http://proceedings.mlr.press/v75/sharpnack18a/sharpnack18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-sharpnack18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: James
family: Sharpnack
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 950-969
id: sharpnack18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 950
lastpage: 969
published: 2018-07-03 00:00:00 +0000
- title: 'Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk'
abstract: 'We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular we consider two models, one in which the task is to invert a generative neural network given access to its last layer and another in which the task is to invert a generative neural network given only compressive linear observations of its last layer. We establish that in both cases, in suitable regimes of network layer sizes and under a randomness assumption on the network weights, the non-convex objective function given by empirical risk minimization does not have any spurious stationary points. That is, we establish that with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. Hence, there are no local minima, saddle points, or other stationary points outside these neighborhoods. These results constitute the first theoretical guarantees which establish the favorable global geometry of these non-convex optimization problems, and they bridge the gap between the empirical success of enforcing deep generative priors and a rigorous understanding of non-linear inverse problems.'
volume: 75
URL: https://proceedings.mlr.press/v75/hand18a.html
PDF: http://proceedings.mlr.press/v75/hand18a/hand18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-hand18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Paul
family: Hand
- given: Vladislav
family: Voroninski
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 970-978
id: hand18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 970
lastpage: 978
published: 2018-07-03 00:00:00 +0000
- title: 'Small-loss bounds for online learning with partial information'
abstract: 'We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems where the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so called “small-loss” $o(\alpha L^{\star})$ regret bounds with high probability, where $\alpha$ is the independence number of the graph, and $L^{\star}$ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e. utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications such as semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), as well as learning with slowly changing (shifting) comparators. In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $\tilde{O}(\sqrt{dL^{\star}})$ for actual regret, where $d$ is the number of actions, answering open questions of Neu. Previous bounds for bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal $\tilde{O}(\sqrt{\kappa L^{\star}})$ regret guarantee for fixed feedback graphs with clique-partition number at most $\kappa$.'
volume: 75
URL: https://proceedings.mlr.press/v75/lykouris18a.html
PDF: http://proceedings.mlr.press/v75/lykouris18a/lykouris18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-lykouris18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Thodoris
family: Lykouris
- given: Karthik
family: Sridharan
- given: Éva
family: Tardos
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 979-986
id: lykouris18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 979
lastpage: 986
published: 2018-07-03 00:00:00 +0000
- title: 'Empirical bounds for functions with weak interactions'
abstract: 'We provide sharp empirical estimates of expectation, variance and normal approximation for a class of statistics whose variation in any argument does not change too much when another argument is modified. Examples of such weak interactions are furnished by U- and V-statistics, Lipschitz L-statistics and various error functionals of $\ell_2$-regularized algorithms and Gibbs algorithms.'
volume: 75
URL: https://proceedings.mlr.press/v75/maurer18a.html
PDF: http://proceedings.mlr.press/v75/maurer18a/maurer18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-maurer18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Andreas
family: Maurer
- given: Massimiliano
family: Pontil
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 987-1010
id: maurer18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 987
lastpage: 1010
published: 2018-07-03 00:00:00 +0000
- title: 'Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression'
abstract: 'High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as Lasso, are very popular for sparse recovery in these settings. The Restricted Eigenvalue (RE) condition is among the weakest, and hence most general, conditions in the literature imposed on the Gram matrix that guarantee nice statistical properties for the Lasso estimator. It is hence natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $X \in \mathbb{R}^{n \times d}$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $X \Phi^\top \Phi$, where $\Phi \in \mathbb{R}^{m \times d}$ is a subgaussian random matrix, with high probability, satisfies the RE condition. This construction allows incorporating a fixed matrix that has an easily {\em verifiable} condition into the design process, and allows for generation of {\em compressed} design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $X$.'
volume: 75
URL: https://proceedings.mlr.press/v75/kasiviswanathan18a.html
PDF: http://proceedings.mlr.press/v75/kasiviswanathan18a/kasiviswanathan18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-kasiviswanathan18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Shiva Prasad
family: Kasiviswanathan
- given: Mark
family: Rudelson
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1011-1041
id: kasiviswanathan18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1011
lastpage: 1041
published: 2018-07-03 00:00:00 +0000
- title: 'Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent'
abstract: 'Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern \emph{nonconvex} applications, including training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov’s AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, matching the best known convergence rate, which is faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first direct acceleration (single-loop) algorithm that is provably faster than GD in general nonconvex setting—all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases on each step even for nonconvex functions, and (2) a novel framework called \emph{improve or localize}, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.'
volume: 75
URL: https://proceedings.mlr.press/v75/jin18a.html
PDF: http://proceedings.mlr.press/v75/jin18a/jin18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-jin18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Chi
family: Jin
- given: Praneeth
family: Netrapalli
- given: Michael I.
family: Jordan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1042-1085
id: jin18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1042
lastpage: 1085
published: 2018-07-03 00:00:00 +0000
- title: 'Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing'
abstract: 'We consider the problem of minimizing a convex objective function $F$ when one can only evaluate its noisy approximation $\hat{F}$. Unless one assumes some structure on the noise, $\hat{F}$ may be an arbitrary nonconvex function, making the task of minimizing $F$ intractable. To overcome this, prior work has often focused on the case when $F(x)-\hat{F}(x)$ is uniformly-bounded. In this paper we study the more general case when the noise has magnitude $\alpha F(x) + \beta$ for some $\alpha, \beta > 0$, and present a polynomial time algorithm that finds an approximate minimizer of $F$ for this noise model. Previously, Markov chains, such as the stochastic gradient Langevin dynamics, have been used to arrive at approximate solutions to these optimization problems. However, for the noise model considered in this paper, no single temperature allows such a Markov chain to both mix quickly and concentrate near the global minimizer. We bypass this by combining “simulated annealing” with the stochastic gradient Langevin dynamics, and gradually decreasing the temperature of the chain in order to approach the global minimizer. As a corollary, one can approximately minimize a nonconvex function that is close to a convex function; however, the closeness can deteriorate as one moves away from the optimum.'
volume: 75
URL: https://proceedings.mlr.press/v75/mangoubi18a.html
PDF: http://proceedings.mlr.press/v75/mangoubi18a/mangoubi18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-mangoubi18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Oren
family: Mangoubi
- given: Nisheeth K.
family: Vishnoi
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1086-1124
id: mangoubi18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1086
lastpage: 1124
published: 2018-07-03 00:00:00 +0000
- title: 'Learning Mixtures of Linear Regressions with Nearly Optimal Complexity'
abstract: 'Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of several unknown linear regression components, where the identity of the generating component is also unknown. Previous works either make strong assumptions on the data distribution or have high complexity. This paper proposes a fixed parameter tractable algorithm for the problem under general conditions, which achieves global convergence with sample complexity scaling nearly linearly in the dimension. In particular, different from previous works that require the data to be from the standard Gaussian, the algorithm allows the data from Gaussians with different covariances. When the condition number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting.'
volume: 75
URL: https://proceedings.mlr.press/v75/li18b.html
PDF: http://proceedings.mlr.press/v75/li18b/li18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-li18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yuanzhi
family: Li
- given: Yingyu
family: Liang
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1125-1144
id: li18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1125
lastpage: 1144
published: 2018-07-03 00:00:00 +0000
- title: 'Detecting Correlations with Little Memory and Communication'
abstract: 'We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise correlations with optimal sample complexity, the number of required memory/communication bits is at least quadratic in the dimension. Our results substantially improve those of Shamir (2014), which studied a similar question in a much more restricted setting. To the best of our knowledge, these are the first provable sample/memory/communication trade-offs for a practical estimation problem, using standard distributions, and in the natural regime where the memory/communication budget is larger than the size of a single data point. To derive our theorems, we prove a new information-theoretic result, which may be relevant for studying other information-constrained learning problems.'
volume: 75
URL: https://proceedings.mlr.press/v75/dagan18a.html
PDF: http://proceedings.mlr.press/v75/dagan18a/dagan18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-dagan18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yuval
family: Dagan
- given: Ohad
family: Shamir
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1145-1198
id: dagan18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1145
lastpage: 1198
published: 2018-07-03 00:00:00 +0000
- title: 'Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning'
abstract: 'Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as “lock-in probability”. We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.'
volume: 75
URL: https://proceedings.mlr.press/v75/dalal18a.html
PDF: http://proceedings.mlr.press/v75/dalal18a/dalal18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-dalal18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Gal
family: Dalal
- given: Gugan
family: Thoppe
- given: Balázs
family: Szörényi
- given: Shie
family: Mannor
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1199-1233
id: dalal18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1199
lastpage: 1233
published: 2018-07-03 00:00:00 +0000
- title: 'Near-Optimal Sample Complexity Bounds for Maximum Likelihood Estimation of Multivariate Log-concave Densities'
abstract: 'We study the problem of learning multivariate log-concave densities with respect to a global loss function. We obtain the first upper bound on the sample complexity of the maximum likelihood estimator (MLE) for a log-concave density on $\mathbb{R}^d$, for all $d \geq 4$. Prior to this work, no finite sample upper bound was known for this estimator in more than $3$ dimensions. In more detail, we prove that for any $d \geq 1$ and $\epsilon>0$, given $\tilde{O}_d((1/\epsilon)^{(d+3)/2})$ samples drawn from an unknown log-concave density $f_0$ on $\mathbb{R}^d$, the MLE outputs a hypothesis $h$ that with high probability is $\epsilon$-close to $f_0$, in squared Hellinger loss. A sample complexity lower bound of $\Omega_d((1/\epsilon)^{(d+1)/2})$ was previously known for any learning algorithm that achieves this guarantee. We thus establish that the sample complexity of the log-concave MLE is near-optimal, up to an $\tilde{O}(1/\epsilon)$ factor.'
volume: 75
URL: https://proceedings.mlr.press/v75/carpenter18a.html
PDF: http://proceedings.mlr.press/v75/carpenter18a/carpenter18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-carpenter18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Timothy
family: Carpenter
- given: Ilias
family: Diakonikolas
- given: Anastasios
family: Sidiropoulos
- given: Alistair
family: Stewart
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1234-1262
id: carpenter18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1234
lastpage: 1262
published: 2018-07-03 00:00:00 +0000
- title: 'More Adaptive Algorithms for Adversarial Bandits'
abstract: 'We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss {\it and} logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.'
volume: 75
URL: https://proceedings.mlr.press/v75/wei18a.html
PDF: http://proceedings.mlr.press/v75/wei18a/wei18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-wei18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Chen-Yu
family: Wei
- given: Haipeng
family: Luo
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1263-1291
id: wei18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1263
lastpage: 1291
published: 2018-07-03 00:00:00 +0000
- title: 'Efficient Convex Optimization with Membership Oracles'
abstract: ' We consider the problem of minimizing a convex function over a convex set given access only to an evaluation oracle for the function and a membership oracle for the set. We give a simple algorithm which solves this problem with $\tilde{O}(n^{2})$ oracle calls and $\tilde{O}(n^{3})$ additional arithmetic operations. Using this result, we obtain more efficient reductions among the five basic oracles for convex sets and functions defined by Grötschel, Lovász, and Schrijver (1988). '
volume: 75
URL: https://proceedings.mlr.press/v75/lee18a.html
PDF: http://proceedings.mlr.press/v75/lee18a/lee18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-lee18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yin Tat
family: Lee
- given: Aaron
family: Sidford
- given: Santosh S.
family: Vempala
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1292-1294
id: lee18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1292
lastpage: 1294
published: 2018-07-03 00:00:00 +0000
- title: 'A General Approach to Multi-Armed Bandits Under Risk Criteria'
abstract: 'Different risk-related criteria have received recent interest in learning problems, where typically each case is treated in a customized manner. In this paper we provide a more systematic approach to analyzing such risk criteria within a stochastic multi-armed bandit (MAB) formulation. We identify a set of general conditions that yield a simple characterization of the oracle rule (which serves as the regret benchmark), and facilitate the design of upper confidence bound (UCB) learning policies. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.'
volume: 75
URL: https://proceedings.mlr.press/v75/cassel18a.html
PDF: http://proceedings.mlr.press/v75/cassel18a/cassel18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-cassel18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Asaf
family: Cassel
- given: Shie
family: Mannor
- given: Assaf
family: Zeevi
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1295-1306
id: cassel18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1295
lastpage: 1306
published: 2018-07-03 00:00:00 +0000
- title: 'An Optimal Learning Algorithm for Online Unconstrained Submodular Maximization'
abstract: ' We consider a basic problem at the interface of two fundamental fields: {\em submodular optimization} and {\em online learning}. In the {\em online unconstrained submodular maximization (online USM) problem}, there is a universe $[n]=\{1,2,\ldots,n\}$ and a sequence of $T$ nonnegative (not necessarily monotone) submodular functions arrive over time. The goal is to design a computationally efficient online algorithm, which chooses a subset of $[n]$ at each time step as a function only of the past, such that the accumulated value of the chosen subsets is as close as possible to the maximum total value of a fixed subset in hindsight. Our main result is a polynomial-time no-$\frac12$-regret algorithm for this problem, meaning that for every sequence of nonnegative submodular functions, the algorithm’s expected total value is at least $\frac12$ times that of the best subset in hindsight, up to an error term sublinear in $T$. The factor of $\tfrac 12$ cannot be improved upon by any polynomial-time online algorithm when the submodular functions are presented as value oracles. Previous work on the offline problem implies that picking a subset uniformly at random in each time step achieves zero $\frac14$-regret. A byproduct of our techniques is an explicit subroutine for the two-experts problem that has an unusually strong regret guarantee: the total value of its choices is comparable to twice the total value of either expert on rounds it did not pick that expert. This subroutine may be of independent interest. '
volume: 75
URL: https://proceedings.mlr.press/v75/roughgarden18a.html
PDF: http://proceedings.mlr.press/v75/roughgarden18a/roughgarden18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-roughgarden18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Tim
family: Roughgarden
- given: Joshua R.
family: Wang
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1307-1325
id: roughgarden18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1307
lastpage: 1325
published: 2018-07-03 00:00:00 +0000
- title: 'The Mean-Field Approximation: Information Inequalities, Algorithms, and Complexity'
abstract: 'The mean field approximation to the Ising model is a canonical variational tool that is used for analysis and inference in Ising models. We provide a simple and optimal bound for the KL error of the mean field approximation for Ising models on general graphs, and extend it to higher order Markov random fields. Our bound improves on previous bounds obtained in work in the graph limit literature by Borgs, Chayes, Lovász, Sós, and Vesztergombi and recent works by Basak and Mukherjee, and Eldan. Our bound is tight up to lower order terms. Building on the methods used to prove the bound, along with techniques from combinatorics and optimization, we study the algorithmic problem of estimating the (variational) free energy for Ising models and general Markov random fields. For a graph $G$ on $n$ vertices and interaction matrix $J$ with Frobenius norm $\|J\|_F$, we provide algorithms that approximate the free energy within an additive error of $\epsilon n \|J\|_F$ in time $\exp(\mathrm{poly}(1/\epsilon))$. We also show that approximation within $(n \|J\|_F)^{1-\delta}$ is NP-hard for every $\delta > 0$. Finally, we provide more efficient approximation algorithms, which find the optimal mean field approximation, for ferromagnetic Ising models and for Ising models satisfying Dobrushin’s condition. '
volume: 75
URL: https://proceedings.mlr.press/v75/jain18b.html
PDF: http://proceedings.mlr.press/v75/jain18b/jain18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-jain18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Vishesh
family: Jain
- given: Frederic
family: Koehler
- given: Elchanan
family: Mossel
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1326-1347
id: jain18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1326
lastpage: 1347
published: 2018-07-03 00:00:00 +0000
- title: 'Approximation beats concentration? An approximation view on inference with smooth radial kernels'
abstract: 'Positive definite kernels and their associated Reproducing Kernel Hilbert Spaces provide a mathematically compelling and practically competitive framework for learning from data. In this paper we take the approximation theory point of view to explore various aspects of smooth kernels related to their inferential properties. We analyze eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors and “Fourier” coefficients of functions in the kernel space restricted to a discrete set of data points. We also investigate the fitting capacity of kernels, giving explicit bounds on the fat shattering dimension of the balls in Reproducing Kernel Hilbert Spaces. Interestingly, the same properties that make kernels very effective approximators for functions in their “native” kernel space also limit their capacity to represent arbitrary functions. We discuss various implications, including those for gradient descent type methods. It is important to note that most of our bounds are measure independent. Moreover, at least in moderate dimension, the bounds for eigenvalues are much tighter than the bounds which can be obtained from the usual matrix concentration results. For example, we see that eigenvalues of kernel matrices show nearly exponential decay with constants depending only on the kernel and the domain. We call this the “approximation beats concentration” phenomenon, as even when the data are sampled from a probability distribution, some of their aspects are better understood in terms of approximation theory. '
volume: 75
URL: https://proceedings.mlr.press/v75/belkin18a.html
PDF: http://proceedings.mlr.press/v75/belkin18a/belkin18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-belkin18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Mikhail
family: Belkin
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1348-1361
id: belkin18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1348
lastpage: 1361
published: 2018-07-03 00:00:00 +0000
- title: 'Non-Convex Matrix Completion Against a Semi-Random Adversary'
abstract: ' Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies crucially on the assumption that each entry of the matrix is observed with exactly the same probability $p$, which is not realistic in practice. In this paper, we investigate a more realistic semi-random model, where the probability of observing each entry is {\em at least} $p$. Even with this mild semi-random perturbation, we can construct counter-examples where existing non-convex algorithms get stuck in bad local optima. In light of the negative results, we propose a pre-processing step that tries to re-weight the semi-random input, so that it becomes “similar” to a random input. We give a nearly-linear time algorithm for this problem, and show that after our pre-processing, all the local minima of the non-convex objective can be used to approximately recover the underlying ground-truth matrix. '
volume: 75
URL: https://proceedings.mlr.press/v75/cheng18b.html
PDF: http://proceedings.mlr.press/v75/cheng18b/cheng18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-cheng18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yu
family: Cheng
- given: Rong
family: Ge
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1362-1394
id: cheng18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1362
lastpage: 1394
published: 2018-07-03 00:00:00 +0000
- title: 'The Vertex Sample Complexity of Free Energy is Polynomial'
abstract: 'The free energy is a key quantity which is associated to Markov random fields. Classical results in statistical physics show how, given an analytic formula of the free energy, it is possible to compute many key quantities associated with Markov random fields including quantities such as magnetization and the location of various phase transitions. Given a massive Markov random field on $n$ nodes, can a small sample from it provide a rough approximation to the free energy $\mathcal{F}_n = \log{Z_n}$? Results in the graph limit literature by Borgs, Chayes, Lov{á}sz, S{ó}s, and Vesztergombi show that for Ising models on $n$ nodes and interactions of strength $\Theta(1/n)$, an $\epsilon$ approximation to $\log Z_n / n$ can be achieved by sampling a randomly induced model on $2^{O(1/\epsilon^2)}$ nodes. We show that the sampling complexity of this problem is {\em polynomial in }$1/\epsilon$. We further show a polynomial dependence on $\epsilon$ cannot be avoided. Our results are very general as they apply to higher order Markov random fields. For Markov random fields of order $r$, we obtain an algorithm that achieves $\epsilon$ approximation using a number of samples polynomial in $r$ and $1/\epsilon$ and running time that is $2^{O(1/\epsilon^2)}$ up to polynomial factors in $r$ and $\epsilon$. For ferromagnetic Ising models, the running time is polynomial in $1/\epsilon$. Our results are intimately connected to recent research on the regularity lemma and property testing, where the interest is in finding which properties can tested within $\epsilon$ error in time polynomial in $1/\epsilon$. In particular, our proofs build on results of Alon, de la Vega, Kannan and Karpinski, who also introduced the notion of polynomial vertex sample complexity. Another critical ingredient of the proof is an effective bound by the authors of this paper relating the variational free energy and the free energy. '
volume: 75
URL: https://proceedings.mlr.press/v75/jain18c.html
PDF: http://proceedings.mlr.press/v75/jain18c/jain18c.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-jain18c.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Vishesh
family: Jain
- given: Frederic
family: Koehler
- given: Elchanan
family: Mossel
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1395-1419
id: jain18c
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1395
lastpage: 1419
published: 2018-07-03 00:00:00 +0000
- title: 'Efficient Algorithms for Outlier-Robust Regression'
abstract: ' We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution ${\mathcal{D}}$ and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to ${\mathcal{D}}$, assuming that the marginal distribution of $\mathcal{D}$ over the input space is \emph{certifiably hypercontractive}. This natural property is satisfied by many well-studied distributions such as Gaussian, strongly log-concave distributions and, uniform distribution on the hypercube among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method for parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the input corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets. '
volume: 75
URL: https://proceedings.mlr.press/v75/klivans18a.html
PDF: http://proceedings.mlr.press/v75/klivans18a/klivans18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-klivans18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Adam
family: Klivans
- given: Pravesh K.
family: Kothari
- given: Raghu
family: Meka
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1420-1430
id: klivans18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1420
lastpage: 1430
published: 2018-07-03 00:00:00 +0000
- title: 'Action-Constrained Markov Decision Processes With Kullback-Leibler Cost'
abstract: 'This paper concerns computation of optimal policies in which the one-step reward function contains a cost term that models Kullback-Leibler divergence with respect to nominal dynamics. This technique was introduced by Todorov in 2007, where it was shown under general conditions that the solution to the average-reward optimality equations reduce to a simple eigenvector problem. Since then many authors have sought to apply this technique to control problems and models of bounded rationality in economics. A crucial assumption is that the input process is essentially unconstrained. For example, if the nominal dynamics include randomness from nature (e.g., the impact of wind on a moving vehicle), then the optimal control solution does not respect the exogenous nature of this disturbance. This paper introduces a technique to solve a more general class of action-constrained MDPs. The main idea is to solve an entire parameterized family of MDPs, in which the parameter is a scalar weighting the one-step reward function. The approach is new and practical even in the original unconstrained formulation.'
volume: 75
URL: https://proceedings.mlr.press/v75/busic18a.html
PDF: http://proceedings.mlr.press/v75/busic18a/busic18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-busic18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Ana
family: Bušić
- given: Sean
family: Meyn
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1431-1444
id: busic18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1431
lastpage: 1444
published: 2018-07-03 00:00:00 +0000
- title: 'Fundamental Limits of Weak Recovery with Applications to Phase Retrieval'
abstract: 'In phase retrieval we want to recover an unknown signal $\boldsymbol x\in\mathbb C^d$ from $n$ quadratic measurements of the form $y_i = |⟨\boldsymbol a_i,\boldsymbol x⟩|^2+w_i$ where $\boldsymbol a_i\in \mathbb C^d$ are known sensing vectors and $w_i$ is measurement noise. We ask the following \emph{weak recovery} question: what is the minimum number of measurements $n$ needed to produce an estimator $\hat{\boldsymbol x}(\boldsymbol y)$ that is positively correlated with the signal $\boldsymbol x$? We consider the case of Gaussian vectors $\boldsymbol a_i$. We prove that – in the high-dimensional limit – a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For $n\le d-o(d)$ no estimator can do significantly better than random and achieve a strictly positive correlation. For $n\ge d+o(d)$ a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well. Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li. Both the upper and lower bound generalize beyond phase retrieval to measurements $y_i$ produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm. '
volume: 75
URL: https://proceedings.mlr.press/v75/mondelli18a.html
PDF: http://proceedings.mlr.press/v75/mondelli18a/mondelli18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-mondelli18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Marco
family: Mondelli
- given: Andrea
family: Montanari
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1445-1450
id: mondelli18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1445
lastpage: 1450
published: 2018-07-03 00:00:00 +0000
- title: 'Cutting plane methods can be extended into nonconvex optimization'
abstract: 'We show that it is possible to obtain an $O(\epsilon^{-4/3})$ runtime — including computational cost — for finding $\epsilon$-stationary points of nonconvex functions using cutting plane methods. This improves on the best known epsilon dependence achieved by cubic regularized Newton of $O(\epsilon^{-3/2})$ as proved by Nesterov and Polyak (2006). Our techniques utilize the convex until proven guilty principle proposed by Carmon, Duchi, Hinder, and Sidford (2017). '
volume: 75
URL: https://proceedings.mlr.press/v75/hinder18a.html
PDF: http://proceedings.mlr.press/v75/hinder18a/hinder18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-hinder18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Oliver
family: Hinder
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1451-1454
id: hinder18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1451
lastpage: 1454
published: 2018-07-03 00:00:00 +0000
- title: 'An Analysis of the t-SNE Algorithm for Data Visualization'
abstract: 'A first line of attack in exploratory data analysis is \emph{data visualization}, i.e., generating a 2-dimensional representation of data that makes \emph{clusters} of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The \emph{t-SNE} heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the \emph{de facto} standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the “ground-truth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in \emph{partially} recovering cluster structure even when the above deterministic condition is not met.'
volume: 75
URL: https://proceedings.mlr.press/v75/arora18a.html
PDF: http://proceedings.mlr.press/v75/arora18a/arora18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-arora18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Sanjeev
family: Arora
- given: Wei
family: Hu
- given: Pravesh K.
family: Kothari
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1455-1462
id: arora18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1455
lastpage: 1462
published: 2018-07-03 00:00:00 +0000
- title: 'Adaptivity to Smoothness in X-armed bandits'
abstract: 'We study the stochastic continuum-armed bandit problem from the angle of adaptivity to \emph{unknown regularity} of the reward function $f$. We prove that there exists no strategy for the cumulative regret that adapts optimally to the \emph{smoothness} of $f$. We show however that such minimax optimal adaptive strategies exist if the learner is given \emph{extra-information} about $f$. Finally, we complement our positive results with matching lower bounds.'
volume: 75
URL: https://proceedings.mlr.press/v75/locatelli18a.html
PDF: http://proceedings.mlr.press/v75/locatelli18a/locatelli18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-locatelli18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Andrea
family: Locatelli
- given: Alexandra
family: Carpentier
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1463-1492
id: locatelli18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1463
lastpage: 1492
published: 2018-07-03 00:00:00 +0000
- title: 'Black-Box Reductions for Parameter-free Online Learning in Banach Spaces'
abstract: 'We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce optimization over a constrained domain to unconstrained optimization. All of our reductions run as fast as online gradient descent. We use our new techniques to improve upon the previously best regret bounds for parameter-free learning, and do so for arbitrary norms.'
volume: 75
URL: https://proceedings.mlr.press/v75/cutkosky18a.html
PDF: http://proceedings.mlr.press/v75/cutkosky18a/cutkosky18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-cutkosky18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Ashok
family: Cutkosky
- given: Francesco
family: Orabona
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1493-1529
id: cutkosky18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1493
lastpage: 1529
published: 2018-07-03 00:00:00 +0000
- title: 'A Data Prism: Semi-verified learning in the small-alpha regime'
abstract: 'We consider a simple model of unreliable or crowdsourced data where there is an underlying set of $n$ binary variables, each “evaluator” contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $r$ of the variables, and the learner is given the true value of a \emph{constant} number of variables. We show that, provided an $\alpha$-fraction of the evaluators are “good” (either correct, or with independent noise rate $p < 1/2$), then the true values of a $(1-\eps)$ fraction of the $n$ underlying variables can be deduced as long as $r > \log_{2-2p}(1/\alpha)$. For example, if the fraction of “good” evaluators is larger than $1/16$ and there is no noise in their responses, then accurate recovery is possible provided each worker evaluates a random set of $4$ items. This result is optimal in that if $r \leq \log_{2-2p}(1/\alpha)$ the large dataset can contain no information. This setting can be viewed as an instance of the \emph{semi-verified} learning model introduced by Charikar, Steinhardt, and Valiant, which explores the tradeoff between the number of items evaluated by each worker and the fraction of “good” evaluators. In the standard adversarial setting, our algorithm requires $\tilde{O}\left(n^{\log_{2-2p}(1/\alpha)}\right)$ evaluators. However, the algorithm runs in near linear time, $\tilde{O}_{r,\eps}(n)$, and hence would require only a near-linear number of evaluations in the weaker model in which the adversary’s responses to each $r$-tuple of items are independent of the set of evaluations collected. These settings and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment. This extreme parameter regime, where the fraction of reliable data is small (inverse exponential in the amount of data provided by each source), is relevant to a number of practical settings. 
For example, the setting where you collect a dataset on customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged together with the preferences of the demographic of interest for a \emph{constant} (polynomial in $1/\alpha$ but independent of $n$), number of randomly selected items, to recover an accurate estimate of the entire set of preferences, even if the fraction of the original dataset contributed by the demographic of interest is inverse exponential in the number of preferences supplied by each customer. In this sense, our results can be viewed as a “data prism” allowing one to extract the behavior of specific cohorts from a large, mixed, dataset.'
volume: 75
URL: https://proceedings.mlr.press/v75/meister18a.html
PDF: http://proceedings.mlr.press/v75/meister18a/meister18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-meister18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Michela
family: Meister
- given: Gregory
family: Valiant
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1530-1546
id: meister18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1530
lastpage: 1546
published: 2018-07-03 00:00:00 +0000
- title: 'A Direct Sum Result for the Information Complexity of Learning'
abstract: ' How many bits of information are required to PAC learn a class of hypotheses of VC dimension $d$? The mathematical setting we follow is that of Bassily et al., where the value of interest is the mutual information $\mathrm{I}(S;A(S))$ between the input sample $S$ and the hypothesis outputted by the learning algorithm $A$. We introduce a class of functions of VC dimension $d$ over the domain $\mathcal{X}$ with information complexity at least $\Omega \left(d\log \log \frac{|\mathcal{X}|}{d}\right)$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $d=1$. The above result is in fact a special case of a more general phenomenon we explore. We define the notion of {\em information complexity} of a given class of functions $\cH$. Intuitively, it is the minimum amount of information that an algorithm for $\mathcal{X}$ must retain about its input to ensure consistency and properness. We prove a direct sum result for information complexity in this context; roughly speaking, the information complexity sums when combining several classes. '
volume: 75
URL: https://proceedings.mlr.press/v75/nachum18a.html
PDF: http://proceedings.mlr.press/v75/nachum18a/nachum18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-nachum18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Ido
family: Nachum
- given: Jonathan
family: Shafer
- given: Amir
family: Yehudayoff
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1547-1568
id: nachum18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1547
lastpage: 1568
published: 2018-07-03 00:00:00 +0000
- title: 'Online learning over a finite action set with limited switching'
abstract: 'We study the value of switching actions in the Prediction From Experts (PFE) problem and Adversarial Multi-Armed Bandits (MAB) problem. First, we revisit the well-studied and practically motivated setting of PFE with switching costs. Many algorithms are known to achieve the minimax optimal order of $O(\sqrt{T \log n})$ in \textit{expectation} for both regret and number of switches, where $T$ is the number of iterations and $n$ the number of actions. However, no \textit{high probability} guarantees are known. Our main technical contribution is the first algorithms which with high probability achieve this optimal order for both regret and number of switches. This settles an open problem of [Devroye et al., 2015], directly implies the first high probability guarantees for several problems of interest, and is efficiently adaptable to the related problem of online combinatorial optimization with limited switching. \par Next, to investigate the value of switching actions at a more granular level, we introduce the setting of \textit{switching budgets}, in which the algorithm is limited to $S \leq T$ switches between actions. This entails a limited number of free switches, in contrast to the unlimited number of expensive switches allowed in the switching cost setting. Using the above result and several reductions, we unify previous work and completely characterize the complexity of this switching budget setting up to small polylogarithmic factors: for both the PFE and MAB problems, for all switching budgets $S \leq T$, and for both expectation and high probability guarantees. For PFE, we show that the optimal rate is of order $\tilde{\Theta}(\sqrt{T\log n})$ for $S = \Omega(\sqrt{T\log n})$, and $\min(\tilde{\Theta}(\tfrac{T\log n}{S}), T)$ for $S = O(\sqrt{T \log n})$. 
Interestingly, the bandit setting does not exhibit such a phase transition; instead we show the minimax rate decays steadily as $\min(\tilde{\Theta}(\tfrac{T\sqrt{n}}{\sqrt{S}}), T)$ for all ranges of $S \leq T$. These results recover and generalize the known minimax rates for the (arbitrary) switching cost setting.'
volume: 75
URL: https://proceedings.mlr.press/v75/altschuler18a.html
PDF: http://proceedings.mlr.press/v75/altschuler18a/altschuler18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-altschuler18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Jason
family: Altschuler
- given: Kunal
family: Talwar
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1569-1573
id: altschuler18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1569
lastpage: 1573
published: 2018-07-03 00:00:00 +0000
- title: 'Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent'
abstract: 'We study \emph{smoothed online convex optimization}, a version of online convex optimization where the learner incurs a penalty for changing her actions between rounds. Given a $\Omega(\sqrt{d})$ lower bound on the competitive ratio of any online algorithm, where $d$ is the dimension of the action space, we ask under what conditions this bound can be beaten. We introduce a novel algorithmic framework for this problem, Online Balanced Descent (OBD), which works by iteratively projecting the previous point onto a carefully chosen level set of the current cost function so as to balance the switching costs and hitting costs. We demonstrate the generality of the OBD framework by showing how, with different choices of “balance,” OBD can improve upon state-of-the-art performance guarantees for both competitive ratio and regret; in particular, OBD is the first algorithm to achieve a dimension-free competitive ratio, $3 + O(1/\alpha)$, for locally polyhedral costs, where $\alpha$ measures the “steepness” of the costs. We also prove bounds on the dynamic regret of OBD when the balance is performed in the dual space that are dimension-free and imply that OBD has sublinear static regret.'
volume: 75
URL: https://proceedings.mlr.press/v75/chen18b.html
PDF: http://proceedings.mlr.press/v75/chen18b/chen18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-chen18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Niangjun
family: Chen
- given: Gautam
family: Goel
- given: Adam
family: Wierman
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1574-1594
id: chen18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1574
lastpage: 1594
published: 2018-07-03 00:00:00 +0000
- title: 'Faster Rates for Convex-Concave Games'
abstract: 'We consider the use of no-regret algorithms to compute equilibria for particular classes of convex-concave games. While standard regret bounds would lead to convergence rates on the order of $O(T^{-1/2})$, recent work \citep{RS13,SALS15} has established $O(1/T)$ rates by taking advantage of a particular class of optimistic prediction algorithms. In this work we go further, showing that for a particular class of games one achieves a $O(1/T^2)$ rate, and we show how this applies to the Frank-Wolfe method and recovers a similar bound \citep{D15}. We also show that such no-regret techniques can even achieve a linear rate, $O(\exp(-T))$, for equilibrium computation under additional curvature assumptions. '
volume: 75
URL: https://proceedings.mlr.press/v75/abernethy18a.html
PDF: http://proceedings.mlr.press/v75/abernethy18a/abernethy18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-abernethy18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Jacob
family: Abernethy
- given: Kevin A.
family: Lai
- given: Kfir Y.
family: Levy
- given: Jun-Kun
family: Wang
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1595-1625
id: abernethy18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1595
lastpage: 1625
published: 2018-07-03 00:00:00 +0000
- title: '$\ell_1$ Regression using Lewis Weights Preconditioning and Stochastic Gradient Descent'
abstract: 'We present preconditioned stochastic gradient descent (SGD) algorithms for the $\ell_1$ minimization problem $\min_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{A}} \boldsymbol{\mathit{x}} - \boldsymbol{\mathit{b}}\|_1$ in the overdetermined case, where there are far more constraints than variables. Specifically, we have $\boldsymbol{\mathit{A}} \in \mathbb{R}^{n \times d}$ for $n \gg d$. Commonly known as the Least Absolute Deviations problem, $\ell_1$ regression can be used to solve many important combinatorial problems, such as minimum cut and shortest path. SGD-based algorithms are appealing for their simplicity and practical efficiency. Our primary insight is that careful preprocessing can yield preconditioned matrices $\tilde{\boldsymbol{\mathit{A}}}$ with strong properties (besides good condition number and low-dimension) that allow for faster convergence of gradient descent. In particular, we precondition using Lewis weights to obtain an isotropic matrix with fewer rows and strong upper bounds on all row norms. We leverage these conditions to find a good initialization, which we use along with recent smoothing reductions and accelerated stochastic gradient descent algorithms to achieve $\epsilon$ relative error in $\widetilde{O}(nnz(\boldsymbol{\mathit{A}}) + d^{2.5} \epsilon^{-2})$ time with high probability, where $nnz(\boldsymbol{\mathit{A}})$ is the number of non-zeros in $\boldsymbol{\mathit{A}}$. This improves over the previous best result using gradient descent for $\ell_1$ regression. We also match the best known running times for interior point methods in several settings. 
Finally, we also show that if our original matrix $\boldsymbol{\mathit{A}}$ is approximately isotropic and the row norms are approximately equal, we can give an algorithm that avoids using fast matrix multiplication and obtains a running time of $\widetilde{O}(nnz(\boldsymbol{\mathit{A}}) + s d^{1.5}\epsilon^{-2} + d^2\epsilon^{-2})$, where $s$ is the maximum number of non-zeros in a row of $\boldsymbol{\mathit{A}}$. In this setting, we beat the best interior point methods for certain parameter regimes.'
volume: 75
URL: https://proceedings.mlr.press/v75/durfee18a.html
PDF: http://proceedings.mlr.press/v75/durfee18a/durfee18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-durfee18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: David
family: Durfee
- given: Kevin A.
family: Lai
- given: Saurabh
family: Sawlani
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1626-1656
id: durfee18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1626
lastpage: 1656
published: 2018-07-03 00:00:00 +0000
- title: 'Optimal Single Sample Tests for Structured versus Unstructured Network Data'
abstract: 'We study the problem of testing, using only a single sample, between mean field distributions (like Curie-Weiss, Erdős-Rényi) and structured Gibbs distributions (like Ising model on sparse graphs and Exponential Random Graphs). Our goal is to test without knowing the parameter values of the underlying models: only the structure of dependencies is known. We develop a new approach that applies to both the Ising and Exponential Random Graph settings based on a general and natural statistical test. The test can distinguish the hypotheses with high probability above a certain threshold in the (inverse) temperature parameter, and is optimal in that below the threshold no test can distinguish the hypotheses. The thresholds do not correspond to the presence of long-range order in the models. By aggregating information at a global scale, our test works even at very high temperatures. The proofs are based on distributional approximation and sharp concentration of quadratic forms, when restricted to Hamming spheres. The restriction to Hamming spheres is necessary, since otherwise any scalar statistic is useless without explicit knowledge of the temperature parameter. At the same time, this restriction changes the behavior of the functions under consideration, making it hard to directly apply standard methods (i.e., Stein’s method) for concentration of weakly dependent variables. Instead, we carry out an additional tensorization argument using a Markov chain that respects the symmetry of the Hamming sphere.'
volume: 75
URL: https://proceedings.mlr.press/v75/bresler18a.html
PDF: http://proceedings.mlr.press/v75/bresler18a/bresler18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-bresler18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Guy
family: Bresler
- given: Dheeraj
family: Nagaraj
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1657-1690
id: bresler18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1657
lastpage: 1690
published: 2018-07-03 00:00:00 +0000
- title: 'A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation'
abstract: 'Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a \emph{simple and explicit finite time analysis} of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. A final section of the paper shows that all of our main results extend to the study of a variant of Q-learning applied to optimal stopping problems.'
volume: 75
URL: https://proceedings.mlr.press/v75/bhandari18a.html
PDF: http://proceedings.mlr.press/v75/bhandari18a/bhandari18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-bhandari18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Jalaj
family: Bhandari
- given: Daniel
family: Russo
- given: Raghav
family: Singal
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1691-1692
id: bhandari18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1691
lastpage: 1692
published: 2018-07-03 00:00:00 +0000
- title: 'Privacy-preserving Prediction'
abstract: 'Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression. We first describe a simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all differentially private prediction algorithms.'
volume: 75
URL: https://proceedings.mlr.press/v75/dwork18a.html
PDF: http://proceedings.mlr.press/v75/dwork18a/dwork18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-dwork18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Cynthia
family: Dwork
- given: Vitaly
family: Feldman
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1693-1702
id: dwork18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1693
lastpage: 1702
published: 2018-07-03 00:00:00 +0000
- title: 'An Estimate Sequence for Geodesically Convex Optimization'
abstract: 'We propose a Riemannian version of Nesterov’s Accelerated Gradient algorithm (\textsc{Ragd}), and show that for \emph{geodesically} smooth and strongly convex problems, within a neighborhood of the minimizer whose radius depends on the condition number as well as the sectional curvature of the manifold, \textsc{Ragd} converges to the minimizer with acceleration. Unlike the algorithm in (Liu et al., 2017) that requires the exact solution to a nonlinear equation which in turn may be intractable, our algorithm is constructive and computationally tractable. Our proof exploits a new estimate sequence and a novel bound on the nonlinear metric distortion, both ideas may be of independent interest.'
volume: 75
URL: https://proceedings.mlr.press/v75/zhang18a.html
PDF: http://proceedings.mlr.press/v75/zhang18a/zhang18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-zhang18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Hongyi
family: Zhang
- given: Suvrit
family: Sra
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1703-1723
id: zhang18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1703
lastpage: 1723
published: 2018-07-03 00:00:00 +0000
- title: 'The Externalities of Exploration and How Data Diversity Helps Exploitation'
abstract: 'Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain individuals or groups. Motivated by these concerns, we initiate the study of the externalities of exploration—the undesirable side effects that the presence of one party may impose on another—under the linear contextual bandits model. We introduce the notion of a group externality, measuring the extent to which the presence of one population of users (the majority) impacts the rewards of another (the minority). We show that this impact can, in some cases, be negative, and that, in a certain sense, no algorithm can avoid it. We then move on to study externalities at the individual level, interpreting the act of exploration as an externality imposed on the current user of a system by future users. This drives us to ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm that always chooses the action that currently looks optimal. We improve on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most $\tilde{O}(T^{1/3})$. Returning to group-level effects, we show that under the same conditions, negative group externalities essentially vanish if one runs the greedy algorithm. Together, our results uncover a sharp contrast between the high externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.'
volume: 75
URL: https://proceedings.mlr.press/v75/raghavan18a.html
PDF: http://proceedings.mlr.press/v75/raghavan18a/raghavan18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-raghavan18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Manish
family: Raghavan
- given: Aleksandrs
family: Slivkins
- given: Jennifer Wortman
family: Vaughan
- given: Zhiwei Steven
family: Wu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1724-1738
id: raghavan18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1724
lastpage: 1738
published: 2018-07-03 00:00:00 +0000
- title: 'Efficient Contextual Bandits in Non-stationary Worlds'
abstract: 'Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.'
volume: 75
URL: https://proceedings.mlr.press/v75/luo18a.html
PDF: http://proceedings.mlr.press/v75/luo18a/luo18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-luo18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Haipeng
family: Luo
- given: Chen-Yu
family: Wei
- given: Alekh
family: Agarwal
- given: John
family: Langford
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1739-1776
id: luo18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1739
lastpage: 1776
published: 2018-07-03 00:00:00 +0000
- title: 'Langevin Monte Carlo and JKO splitting'
abstract: 'Algorithms based on discretizing Langevin diffusion are popular tools for sampling from high-dimensional distributions. We develop novel connections between such Monte Carlo algorithms, the theory of Wasserstein gradient flow, and the operator splitting approach to solving PDEs. In particular, we show that a proximal version of the Unadjusted Langevin Algorithm corresponds to a scheme that alternates between solving the gradient flows of two specific functionals on the space of probability measures. Using this perspective, we derive some new non-asymptotic results on the convergence properties of this algorithm.'
volume: 75
URL: https://proceedings.mlr.press/v75/bernton18a.html
PDF: http://proceedings.mlr.press/v75/bernton18a/bernton18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-bernton18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Espen
family: Bernton
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1777-1798
id: bernton18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1777
lastpage: 1798
published: 2018-07-03 00:00:00 +0000
- title: 'Subpolynomial trace reconstruction for random strings and arbitrary deletion probability'
abstract: 'The deletion-insertion channel takes as input a bit string ${\bf x}\in \{0,1\}^{n}$, and outputs a string where bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $\bf x$ from many independent outputs (called “traces”) of the deletion-insertion channel applied to $\bf x$. We show that if $\bf x$ is chosen uniformly at random, then $\exp(O(\log^{1/3} n))$ traces suffice to reconstruct $\bf x$ with high probability. For the deletion channel with deletion probability $q<1/2$ the earlier upper bound was $\exp(O(\log^{1/2} n))$. The case of $q\geq 1/2$ or the case where insertions are allowed has not been previously analysed, and therefore the earlier upper bound was as for worst-case strings, i.e., $\exp(O( n^{1/3}))$. A key ingredient in our proof is a delicate two-step alignment procedure where we estimate the location in each trace corresponding to a given bit of $\bf x$. The alignment is done by viewing the strings as random walks, and comparing the increments in the walk associated with the input string and the trace, respectively.'
volume: 75
URL: https://proceedings.mlr.press/v75/holden18a.html
PDF: http://proceedings.mlr.press/v75/holden18a/holden18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-holden18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Nina
family: Holden
- given: Robin
family: Pemantle
- given: Yuval
family: Peres
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1799-1840
id: holden18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1799
lastpage: 1840
published: 2018-07-03 00:00:00 +0000
- title: 'An explicit analysis of the entropic penalty in linear programming'
abstract: 'Solving linear programs by using entropic penalization has recently attracted new interest in the optimization community, since this strategy forms the basis for the fastest-known algorithms for the optimal transport problem, with many applications in modern large-scale machine learning. Crucial to these applications has been an analysis of how quickly solutions to the penalized program approach true optima to the original linear program. More than 20 years ago, Cominetti and San Martín showed that this convergence is exponentially fast; however, their proof is asymptotic and does not give any indication of how accurately the entropic program approximates the original program for any particular choice of the penalization parameter. We close this long-standing gap in the literature regarding entropic penalization by giving a new proof of the exponential convergence, valid for any linear program. Our proof is non-asymptotic, yields explicit constants, and has the virtue of being extremely simple. We provide matching lower bounds and show that the entropic approach does not lead to a near-linear time approximation scheme for the linear assignment problem.'
volume: 75
URL: https://proceedings.mlr.press/v75/weed18a.html
PDF: http://proceedings.mlr.press/v75/weed18a/weed18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-weed18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Jonathan
family: Weed
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1841-1855
id: weed18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1841
lastpage: 1855
published: 2018-07-03 00:00:00 +0000
- title: 'Efficient active learning of sparse halfspaces'
abstract: 'We study the problem of efficient PAC active learning of homogeneous linear classifiers (halfspaces) in $\mathbb{R}^d$, where the goal is to learn a halfspace with low error using as few label queries as possible. Under the extra assumption that there is a $t$-sparse halfspace that performs well on the data ($t \ll d$), we would like our active learning algorithm to be {\em attribute efficient}, i.e. to have label requirements sublinear in $d$. In this paper, we provide a computationally efficient algorithm that achieves this goal. Under certain distributional assumptions on the data, our algorithm achieves a label complexity of $O(t \cdot \mathrm{polylog}(d, \frac 1 \epsilon))$. In contrast, existing algorithms in this setting are either computationally inefficient, or subject to label requirements polynomial in $d$ or $\frac 1 \epsilon$.'
volume: 75
URL: https://proceedings.mlr.press/v75/zhang18b.html
PDF: http://proceedings.mlr.press/v75/zhang18b/zhang18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-zhang18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Chicheng
family: Zhang
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1856-1880
id: zhang18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1856
lastpage: 1880
published: 2018-07-03 00:00:00 +0000
- title: 'Marginal Singularity, and the Benefits of Labels in Covariate-Shift'
abstract: 'We present new minimax results that concisely capture the relative benefits of source and target labeled data, under {covariate-shift}. Namely, we show that, in general classification settings, the benefits of target labels are controlled by a \emph{transfer-exponent} $\gamma$ that encodes how \emph{singular} $Q$ is locally w.r.t. $P$, and interestingly allows situations where transfer did not seem possible under previous insights. In fact, our new minimax analysis – in terms of $\gamma$ – reveals a \emph{continuum of regimes} ranging from situations where target labels have little benefit, to regimes where target labels dramatically improve classification. We then show that a recently proposed semi-supervised procedure can be extended to adapt to unknown $\gamma$, and therefore requests target labels only when beneficial, while achieving nearly minimax transfer rates. '
volume: 75
URL: https://proceedings.mlr.press/v75/kpotufe18a.html
PDF: http://proceedings.mlr.press/v75/kpotufe18a/kpotufe18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-kpotufe18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Samory
family: Kpotufe
- given: Guillaume
family: Martinet
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1882-1886
id: kpotufe18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1882
lastpage: 1886
published: 2018-07-03 00:00:00 +0000
- title: 'Learning Single-Index Models in Gaussian Space'
abstract: 'We consider regression problems where the response is a smooth but non-linear function of a $k$-dimensional projection of $p$ normally-distributed covariates, contaminated with additive Gaussian noise. The goal is to recover the range of the $k$-dimensional projection, i.e., the index space. This model is called the multi-index model, and the $k=1$ case is called the single-index model. For the single-index model, we characterize the population landscape of a natural semi-parametric maximum likelihood objective in terms of the link function and prove that it has no spurious local minima. We also propose and analyze an efficient iterative procedure that recovers the index space up to error $\epsilon$ using a sample size $\tilde{O}(p^{O(R^2/\mu)} + p/\epsilon^2)$, where $R$ and $\mu$, respectively, parameterize the smoothness of the link function and the signal strength. When a multi-index model is incorrectly specified as a single-index model, we prove that essentially the same procedure, with sample size $\tilde{O}(p^{O(kR^2/\mu)} + p/\epsilon^2)$, returns a vector that is $\epsilon$-close to being completely in the index space.'
volume: 75
URL: https://proceedings.mlr.press/v75/dudeja18a.html
PDF: http://proceedings.mlr.press/v75/dudeja18a/dudeja18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-dudeja18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Rishabh
family: Dudeja
- given: Daniel
family: Hsu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1887-1930
id: dudeja18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1887
lastpage: 1930
published: 2018-07-03 00:00:00 +0000
- title: 'Hidden Integrality of SDP Relaxations for Sub-Gaussian Mixture Models'
abstract: 'We consider the problem of finding discrete clustering structures under Sub-Gaussian Mixture Models. We establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded by the error of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models. A special case of this result shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our result establishes sufficient conditions for the SDP to correctly recover the cluster memberships of at least $(1-\delta)$ fraction of the points for any $\delta\in(0,1)$. Error bounds for estimating cluster centers can also be derived directly from our results.'
volume: 75
URL: https://proceedings.mlr.press/v75/fei18a.html
PDF: http://proceedings.mlr.press/v75/fei18a/fei18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-fei18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yingjie
family: Fei
- given: Yudong
family: Chen
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1931-1965
id: fei18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1931
lastpage: 1965
published: 2018-07-03 00:00:00 +0000
- title: 'Counting Motifs with Graph Sampling'
abstract: 'Applied researchers often construct a network from data that has been collected from a random sample of nodes, with the goal to infer properties of the parent network from the sampled version. Two of the most widely used sampling schemes are *subgraph sampling*, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and *neighborhood sampling*, where we additionally observe the edges between the sampled vertices and their neighbors. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for parent graph $G$ with maximal degree $d$, for any connected motif $h$ on $k$ vertices, to estimate the number of copies of $h$ in $G$, denoted by $s=\mathsf{s}(h,G)$, with a multiplicative error of $\epsilon$, - For subgraph sampling, the optimal sampling ratio $p$ is $\Theta_{k}(\max\{ (s\epsilon^2)^{-\frac{1}{k}}, \; \frac{d^{k-1}}{s\epsilon^{2}} \})$, which only depends on the size of the motif but
*not* its actual topology. Furthermore, we show that Horvitz-Thompson type estimators are universally optimal for any connected motifs. - For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieving the sampling ratio $O_{k}(\min\{ (\frac{d}{s\epsilon^2})^{\frac{1}{k-1}}, \; \sqrt{\frac{d^{k-2}}{s\epsilon^2}}\})$, which again only depends on the size of $h$. This is shown to be optimal for all motifs with at most $4$ vertices and cliques of all sizes.

The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These results allow us to quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.'
volume: 75
URL: https://proceedings.mlr.press/v75/klusowski18a.html
PDF: http://proceedings.mlr.press/v75/klusowski18a/klusowski18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-klusowski18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Jason M.
family: Klusowski
- given: Yihong
family: Wu
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 1966-2011
id: klusowski18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 1966
lastpage: 2011
published: 2018-07-03 00:00:00 +0000
- title: 'Approximate Nearest Neighbors in Limited Space'
abstract: 'We consider the $(1+\epsilon)$-approximate nearest neighbor search problem: given a set $X$ of $n$ points in a $d$-dimensional space, build a data structure that, given any query point $y$, finds a point $x \in X$ whose distance to $y$ is at most $(1+\epsilon) \min_{x \in X} \|x-y\|$ for an accuracy parameter $\epsilon \in (0,1)$. Our main result is a data structure that occupies only $O(\epsilon^{-2} n \log(n) \log(1/\epsilon))$ bits of space, assuming all point coordinates are integers in the range $\{-n^{O(1)} \ldots n^{O(1)}\}$, i.e., the coordinates have $O(\log n)$ bits of precision. This improves over the best previously known space bound of $O(\epsilon^{-2} n \log(n)^2)$, obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points $X$, and provide almost tight upper and lower bounds for the space complexity of this problem. '
volume: 75
URL: https://proceedings.mlr.press/v75/indyk18a.html
PDF: http://proceedings.mlr.press/v75/indyk18a/indyk18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-indyk18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Piotr
family: Indyk
- given: Tal
family: Wagner
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 2012-2036
id: indyk18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 2012
lastpage: 2036
published: 2018-07-03 00:00:00 +0000
- title: 'Breaking the $1/\sqrt{n}$ Barrier: Faster Rates for Permutation-based Models in Polynomial Time'
abstract: 'Many applications, including rank aggregation and crowd-labeling, can be modeled in terms of a bivariate isotonic matrix with unknown permutations acting on its rows and columns. We consider the problem of estimating such a matrix based on noisy observations of a subset of its entries, and design and analyze a polynomial-time algorithm that improves upon the state of the art. In particular, our results imply that any such $n \times n$ matrix can be estimated efficiently in the normalized Frobenius norm at rate $\widetilde{\mathcal O}(n^{-3/4})$, thus narrowing the gap between $\widetilde{\mathcal O}(n^{-1})$ and $\widetilde{\mathcal O}(n^{-1/2})$, which were hitherto the rates of the most statistically and computationally efficient methods, respectively.'
volume: 75
URL: https://proceedings.mlr.press/v75/mao18a.html
PDF: http://proceedings.mlr.press/v75/mao18a/mao18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-mao18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Cheng
family: Mao
- given: Ashwin
family: Pananjady
- given: Martin J.
family: Wainwright
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 2037-2042
id: mao18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 2037
lastpage: 2042
published: 2018-07-03 00:00:00 +0000
- title: 'Unleashing Linear Optimizers for Group-Fair Learning and Optimization'
abstract: 'Most systems and learning algorithms optimize average performance or average loss – one reason being computational complexity. However, many objectives of practical interest are more complex than simply average loss. This arises, for example, when balancing performance or loss with fairness across people. We prove that, from a computational perspective, optimizing arbitrary objectives that take into account performance over a small number of groups is not significantly harder to optimize than average performance. Our main result is a polynomial-time reduction that uses a linear optimizer to optimize an arbitrary (Lipschitz continuous) function of performance over a (constant) number of possibly-overlapping groups. This includes fairness objectives over small numbers of groups, and we further point out that other existing notions of fairness such as individual fairness can be cast as convex optimization and hence more standard convex techniques can be used. Beyond learning, our approach applies to multi-objective optimization, more generally.'
volume: 75
URL: https://proceedings.mlr.press/v75/alabi18a.html
PDF: http://proceedings.mlr.press/v75/alabi18a/alabi18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-alabi18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Daniel
family: Alabi
- given: Nicole
family: Immorlica
- given: Adam
family: Kalai
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 2043-2066
id: alabi18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 2043
lastpage: 2066
published: 2018-07-03 00:00:00 +0000
- title: 'The Many Faces of Exponential Weights in Online Learning'
abstract: 'A standard introduction to online learning might place Online Gradient Descent at its center and then proceed to develop generalizations and extensions like Online Mirror Descent and second-order methods. Here we explore the alternative approach of putting Exponential Weights (EW) first. We show that many standard methods and their regret bounds then follow as a special case by plugging in suitable surrogate losses and playing the EW posterior mean. For instance, we easily recover Online Gradient Descent by using EW with a Gaussian prior on linearized losses, and, more generally, all instances of Online Mirror Descent based on regular Bregman divergences also correspond to EW with a prior that depends on the mirror map. Furthermore, appropriate quadratic surrogate losses naturally give rise to Online Gradient Descent for strongly convex losses and to Online Newton Step. We further interpret several recent adaptive methods (iProd, Squint, and a variation of Coin Betting for experts) as a series of closely related reductions to exp-concave surrogate losses that are then handled by Exponential Weights. Finally, a benefit of our EW interpretation is that it opens up the possibility of sampling from the EW posterior distribution instead of playing the mean. As already observed by Bubeck and Eldan, this recovers the best-known rate in Online Bandit Linear Optimization.'
volume: 75
URL: https://proceedings.mlr.press/v75/hoeven18a.html
PDF: http://proceedings.mlr.press/v75/hoeven18a/hoeven18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-hoeven18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Dirk
family: Hoeven
- given: Tim
family: Erven
- given: Wojciech
family: Kotłowski
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 2067-2092
id: hoeven18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 2067
lastpage: 2092
published: 2018-07-03 00:00:00 +0000
- title: 'Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem'
abstract: 'We study sampling as optimization in the space of measures. We focus on gradient flow-based optimization, with the Langevin dynamics as a case study. We investigate the source of the bias of the unadjusted Langevin algorithm (ULA) in discrete time, and consider how to remove or reduce the bias. We point out that the difficulty is that the heat flow is exactly solvable, but neither its forward nor backward method is implementable in general, except for Gaussian data. We propose the symmetrized Langevin algorithm (SLA), which should have a smaller bias than ULA, at the price of implementing a proximal gradient step in space. We show that SLA is in fact consistent for a Gaussian target measure, whereas ULA is not. We also illustrate various algorithms explicitly for a Gaussian target measure with Gaussian data, including gradient descent, proximal gradient, and Forward-Backward, and show that they are all consistent.'
volume: 75
URL: https://proceedings.mlr.press/v75/wibisono18a.html
PDF: http://proceedings.mlr.press/v75/wibisono18a/wibisono18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-wibisono18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Andre
family: Wibisono
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 2093-3027
id: wibisono18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 2093
lastpage: 3027
published: 2018-07-03 00:00:00 +0000
- title: 'Online Learning: Sufficient Statistics and the Burkholder Method'
abstract: 'We uncover a fairly general principle in online learning: If a regret inequality can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that 1) can be used algorithmically to achieve the regret bound and 2) only depends on these sufficient statistics, not the entire data sequence, so that the online strategy is only required to keep the sufficient statistics in memory. This characterization is achieved by bringing the full power of the Burkholder Method—originally developed for certifying probabilistic martingale inequalities—to bear on the online learning setting. To demonstrate the scope and effectiveness of the Burkholder method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We also present a linear-time/space prediction strategy for parameter-free supervised learning with linear classes and general smooth norms.'
volume: 75
URL: https://proceedings.mlr.press/v75/foster18b.html
PDF: http://proceedings.mlr.press/v75/foster18b/foster18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-foster18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Dylan J.
family: Foster
- given: Alexander
family: Rakhlin
- given: Karthik
family: Sridharan
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3028-3064
id: foster18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3028
lastpage: 3064
published: 2018-07-03 00:00:00 +0000
- title: 'Minimax Bounds on Stochastic Batched Convex Optimization'
abstract: 'We study the stochastic batched convex optimization problem, in which we use many \emph{parallel} observations to optimize a convex function given limited rounds of interaction. In each of $M$ rounds, an algorithm may query for information at $n$ points, and after issuing all $n$ queries, it receives unbiased noisy function and/or (sub)gradient evaluations at the $n$ points. After $M$ such rounds, the algorithm must output an estimator. We provide lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for Lipschitz convex and smooth strongly convex functions. Our rates of convergence (nearly) achieve the fully sequential rate once $M = O(d \log \log n)$, where $d$ is the problem dimension, but the rates may exponentially degrade as the dimension $d$ increases, in distinction from fully sequential settings.'
volume: 75
URL: https://proceedings.mlr.press/v75/duchi18a.html
PDF: http://proceedings.mlr.press/v75/duchi18a/duchi18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-duchi18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: John
family: Duchi
- given: Feng
family: Ruan
- given: Chulhee
family: Yun
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3065-3162
id: duchi18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3065
lastpage: 3162
published: 2018-07-03 00:00:00 +0000
- title: 'Geometric Lower Bounds for Distributed Parameter Estimation under Communication Constraints'
abstract: 'We consider parameter estimation in distributed networks, where each sensor in the network observes an independent sample from an underlying distribution and has $k$ bits to communicate its sample to a centralized processor which computes an estimate of a desired parameter. We develop lower bounds for the minimax risk of estimating the underlying parameter under squared $\ell_2$ loss for a large class of distributions. Our results show that under mild regularity conditions, the communication constraint reduces the effective sample size by a factor of $d$ when $k$ is small, where $d$ is the dimension of the estimated parameter. Furthermore, this penalty reduces at most exponentially with increasing $k$, which is the case for some models, e.g., estimating high-dimensional distributions. For other models, however, we show that the sample size reduction is remediated only linearly with increasing $k$, e.g., when some sub-Gaussian structure is available. We apply our results to the distributed setting with product Bernoulli model, multinomial model, and dense/sparse Gaussian location models which recover or strengthen existing results. Our approach significantly deviates from existing approaches for developing information-theoretic lower bounds for communication-efficient estimation. We circumvent the need for strong data processing inequalities used in prior work and develop a geometric approach which builds on a new representation of the communication constraint. This approach allows us to strengthen and generalize existing results with simpler and more transparent proofs.'
volume: 75
URL: https://proceedings.mlr.press/v75/han18a.html
PDF: http://proceedings.mlr.press/v75/han18a/han18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-han18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yanjun
family: Han
- given: Ayfer
family: Özgür
- given: Tsachy
family: Weissman
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3163-3188
id: han18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3163
lastpage: 3188
published: 2018-07-03 00:00:00 +0000
- title: 'Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance'
abstract: 'We present \emph{Local Moment Matching (LMM)}, a unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. We construct an efficiently computable estimator that achieves the minimax rates in estimating the distribution up to permutation, and show that the plug-in approach of our unlabeled distribution estimator is “universal” in estimating symmetric functionals of discrete distributions. Instead of doing best polynomial approximation explicitly as in existing literature of functional estimation, the plug-in approach conducts polynomial approximation implicitly and attains the optimal sample complexity for the entropy, power sum and support size functionals.'
volume: 75
URL: https://proceedings.mlr.press/v75/han18b.html
PDF: http://proceedings.mlr.press/v75/han18b/han18b.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-han18b.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Yanjun
family: Han
- given: Jiantao
family: Jiao
- given: Tsachy
family: Weissman
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3189-3221
id: han18b
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3189
lastpage: 3221
published: 2018-07-03 00:00:00 +0000
- title: 'Iterate Averaging as Regularization for Stochastic Gradient Descent'
abstract: 'We propose and analyze a variant of the classic Polyak–Ruppert averaging scheme, broadly used in stochastic gradient methods. Rather than a uniform average of the iterates, we consider a weighted average, with weights decaying in a geometric fashion. In the context of linear least-squares regression, we show that this averaging scheme has the same regularizing effect, and indeed is asymptotically equivalent, to ridge regression. In particular, we derive finite-sample bounds for the proposed approach that match the best known results for regularized stochastic gradient methods.'
volume: 75
URL: https://proceedings.mlr.press/v75/neu18a.html
PDF: http://proceedings.mlr.press/v75/neu18a/neu18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-neu18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Gergely
family: Neu
- given: Lorenzo
family: Rosasco
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3222-3242
id: neu18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3222
lastpage: 3242
published: 2018-07-03 00:00:00 +0000
- title: 'Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form'
abstract: 'Semidefinite programs (SDPs) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer–Monteiro factorization approach for solving SDPs. For a large class of SDPs, upon random perturbation of the cost matrix, with high probability, we show that all approximate second-order stationary points are approximate global optima for the penalty formulation of appropriately rank-constrained SDPs, as long as the number of constraints scales sub-quadratically with the desired rank. Our result is based on a simple penalty function formulation of the rank-constrained SDP along with a smoothed analysis to avoid worst-case cost matrices. We particularize our results to two applications, namely, Max-Cut and matrix completion.'
volume: 75
URL: https://proceedings.mlr.press/v75/bhojanapalli18a.html
PDF: http://proceedings.mlr.press/v75/bhojanapalli18a/bhojanapalli18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-bhojanapalli18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Srinadh
family: Bhojanapalli
- given: Nicolas
family: Boumal
- given: Prateek
family: Jain
- given: Praneeth
family: Netrapalli
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3243-3270
id: bhojanapalli18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3243
lastpage: 3270
published: 2018-07-03 00:00:00 +0000
- title: 'Certified Computation from Unreliable Datasets'
abstract: 'A wide range of learning tasks require human input in labeling massive data. The collected data, though, are usually of low quality and contain inaccuracies and errors. As a result, modern science and business face the problem of learning from unreliable data sets. In this work, we provide a generic approach that is based on \textit{verification} of only a few records of the data set to guarantee high-quality learning outcomes for various optimization objectives. Our method identifies small sets of critical records and verifies their validity. We show that many problems need only $\text{poly}(1/\varepsilon)$ verifications to ensure that the output of the computation is at most a factor of $(1 \pm \varepsilon)$ away from the truth. For any given instance, we provide an \textit{instance optimal} solution that verifies the minimum possible number of records to approximately certify correctness. Then, using this instance optimal formulation of the problem, we prove our main result: “every function that satisfies some Lipschitz continuity condition can be certified with a small number of verifications”. We show that the required Lipschitz continuity condition is satisfied even by some NP-complete problems, which illustrates the generality and importance of this theorem. In case this certification step fails, an invalid record will be identified. Removing these records and repeating until success guarantees that the result will be accurate and will depend only on the verified records. Surprisingly, as we show, for several computation tasks more efficient methods are possible. These methods always guarantee that the produced result is not affected by the invalid records, since any invalid record that affects the output will be detected and verified.'
volume: 75
URL: https://proceedings.mlr.press/v75/gouleakis18a.html
PDF: http://proceedings.mlr.press/v75/gouleakis18a/gouleakis18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-gouleakis18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Themis
family: Gouleakis
- given: Christos
family: Tzamos
- given: Manolis
family: Zampetakis
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3271-3294
id: gouleakis18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3271
lastpage: 3294
published: 2018-07-03 00:00:00 +0000
- title: 'Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon'
abstract: 'In reinforcement learning (RL), problems with long planning horizons are perceived as very challenging. The recent advances in PAC RL, however, show that the sample complexity of RL does not depend on planning horizon except at a superficial level. How can we explain such a difference? Noting that the technical assumptions in these upper bounds might have hidden away the challenges of long horizons, we ask the question: \emph{can we prove a lower bound with a horizon dependence when such assumptions are removed?} We also provide a few observations on the desired characteristics of the lower bound construction.'
volume: 75
URL: https://proceedings.mlr.press/v75/jiang18a.html
PDF: http://proceedings.mlr.press/v75/jiang18a/jiang18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-jiang18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Nan
family: Jiang
- given: Alekh
family: Agarwal
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3395-3398
id: jiang18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3395
lastpage: 3398
published: 2018-07-03 00:00:00 +0000
- title: 'Open problem: Improper learning of mixtures of Gaussians'
abstract: 'We ask whether there exists an efficient unsupervised learning algorithm for mixtures of Gaussians in the over-complete case (the number of mixture components is larger than the dimension). The notion of learning is taken to be worst-case compression-based, to allow for improper learning.'
volume: 75
URL: https://proceedings.mlr.press/v75/hazan18a.html
PDF: http://proceedings.mlr.press/v75/hazan18a/hazan18a.pdf
edit: https://github.com/mlresearch//v75/edit/gh-pages/_posts/2018-07-03-hazan18a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the 31st Conference On Learning Theory'
publisher: 'PMLR'
author:
- given: Elad
family: Hazan
- given: Roi
family: Livni
editor:
- given: Sébastien
family: Bubeck
- given: Vianney
family: Perchet
- given: Philippe
family: Rigollet
page: 3399-3402
id: hazan18a
issued:
date-parts:
- 2018
- 7
- 3
firstpage: 3399
lastpage: 3402
published: 2018-07-03 00:00:00 +0000