Proceedings of Machine Learning ResearchProceedings of Thirty Third Conference on Learning Theory on 09-12 July 2020
Published as Volume 125 by the Proceedings of Machine Learning Research on 15 July 2020.
Volume Edited by:
Jacob Abernethy
Shivani Agarwal
Series Editors:
Neil D. Lawrence
Mark Reid
http://proceedings.mlr.press/v125/
Sat, 29 Aug 2020 12:53:05 +0000Sat, 29 Aug 2020 12:53:05 +0000Jekyll v3.9.0Wasserstein Control of Mirror Langevin Monte Carlo Discretized Langevin diffusions are efficient Monte Carlo methods for sampling from high dimensional target densities that are log-Lipschitz-smooth and (strongly) log-concave. In particular, the Euclidean Langevin Monte Carlo sampling algorithm has received much attention lately, leading to a detailed understanding of its non-asymptotic convergence properties and of the role that smoothness and log-concavity play in the convergence rate. Distributions that do not possess these regularity properties can be addressed by considering a Riemannian Langevin diffusion with a metric capturing the local geometry of the log-density. However, the Monte Carlo algorithms derived from discretizations of such Riemannian Langevin diffusions are notoriously difficult to analyze. In this paper, we consider Langevin diffusions on a Hessian-type manifold and study a discretization that is closely related to the mirror-descent scheme. We establish for the first time a non-asymptotic upper-bound on the sampling error of the resulting Hessian Riemannian Langevin Monte Carlo algorithm. This bound is measured according to a Wasserstein distance induced by a Riemannian metric ground cost capturing the squared Hessian structure and closely related to a self-concordance-like condition. The upper-bound implies, for instance, that the iterates contract toward a Wasserstein ball around the target density whose radius is made explicit. Our theory recovers existing Euclidean results and can cope with a wide variety of Hessian metrics related to highly non-flat geometries. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/zhang20a.html
http://proceedings.mlr.press/v125/zhang20a.htmlNearly Non-Expansive Bounds for Mahalanobis Hard Thresholding Given a vector $w \in \mathbb{R}^p$ and a positive semi-definite matrix $A \in \mathbb{R}^{p\times p}$, we study the expansion ratio bound for the following defined Mahalanobis hard thresholding operator of $w$: \[ \mathcal{H}_{A,k}(w):=\argmin_{\|\theta\|_0\le k} \frac{1}{2}\|\theta - w\|^2_A, \]{where} $k\le p$ is the desired sparsity level. The core contribution of this paper is to prove that for any $\bar k$-sparse vector $\bar w$ with $\bar k < k$, the estimation error $\|\mathcal{H}_{A,k}(w) - \bar w\|_A$ satisfies \[ \|\mathcal{H}_{A,k}(w) - \bar w\|^2_A \le \left(1+ \mathcal{O}\left(\kappa(A,2k) \sqrt{\frac{\bar k }{k - \bar k}}\right)\right) \|{w} - \bar w\|^2_A, \]{where} $\kappa(A,2k)$ is the restricted strong condition number of $A$ over $(2k)$-sparse subspace. This estimation error bound is nearly non-expansive when $k$ is sufficiently larger than $\bar k$. Specially when $A$ is the identity matrix such that $\kappa(A,2k)\equiv1$, our bound recovers the previously known nearly non-expansive bounds for Euclidean hard thresholding operator. We further show that such a bound extends to an approximate version of $\mathcal{H}_{A,k}(w)$ estimated by Hard Thresholding Pursuit (HTP) algorithm. We demonstrate the applicability of these bounds to the mean squared error analysis of HTP and its novel extension based on preconditioning method. Numerical evidence is provided to support our theory and demonstrate the superiority of the proposed preconditioning HTP algorithm. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/yuan20a.html
http://proceedings.mlr.press/v125/yuan20a.htmlLearning a Single Neuron with Gradient Methods We consider the fundamental problem of learning a single neuron $\mathbf{x}\mapsto \sigma(\mathbf{w}^\top\mathbf{x})$ in a realizable setting, using standard gradient methods with random initialization, and under general families of input distributions and activations. On the one hand, we show that some assumptions on both the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go significantly beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/yehudai20a.html
http://proceedings.mlr.press/v125/yehudai20a.htmlNon-asymptotic Analysis for Nonparametric Testing We develop a non-asymptotic framework for hypothesis testing in nonparametric regression where the true regression function belongs to a Sobolev space. Our statistical guarantees are exact in the sense that Type I and II errors are controlled for any finite sample size. Meanwhile, one proposed test is shown to achieve minimax rate optimality in the asymptotic sense. An important consequence of this non-asymptotic theory is a new and practically useful formula for selecting the optimal smoothing parameter in the testing statistic. Extensions of our results to general reproducing kernel Hilbert spaces and non-Gaussian error regression are also discussed.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/yang20a.html
http://proceedings.mlr.press/v125/yang20a.htmlTree-projected gradient descent for estimating gradient-sparse parameters on graphs We study estimation of a gradient-sparse parameter vector $\boldsymbol{\theta}^* \in \mathbb{R}^p$, having strong gradient-sparsity $s^*:=\|\nabla_G \boldsymbol{\theta}^*\|_0$ on an underlying graph $G$. Given observations $Z_1,\ldots,Z_n$ and a smooth, convex loss function $\mathcal{L}$ for which $\boldsymbol{\theta}^*$ minimizes the population risk $\mathbb{E}[\mathcal{L}(\boldsymbol{\theta};Z_1,\ldots,Z_n)]$, we propose to estimate $\boldsymbol{\theta}^*$ by a projected gradient descent algorithm that iteratively and approximately projects gradient steps onto spaces of vectors having small gradient-sparsity over low-degree spanning trees of $G$. We show that, under suitable restricted strong convexity and smoothness assumptions for the loss, the resulting estimator achieves the squared-error risk $\frac{s^*}{n} \log (1+\frac{p}{s^*})$ up to a multiplicative constant that is independent of $G$. In contrast, previous polynomial-time algorithms have only been shown to achieve this guarantee in more specialized settings, or under additional assumptions for $G$ and/or the sparsity pattern of $\nabla_G \boldsymbol{\theta}^*$. As applications of our general framework, we apply our results to the examples of linear models and generalized linear models with random design.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/xu20a.html
http://proceedings.mlr.press/v125/xu20a.htmlLearning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium In this work, we develop provably efficient reinforcement learning algorithms for two-player zero-sum Markov games with simultaneous moves. We consider a family of Markov games where the reward function and transition kernel possess a linear structure. Two settings are studied: In the offline setting, we control both players and the goal is to find the Nash Equilibrium efficiently by minimizing the worst-case duality gap. In the online setting, we control a single player and play against an arbitrary opponent; the goal is to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an $\tilde O(\sqrt{d^3 H^3 T} )$ upper bound on the duality gap and regret, without requiring additional assumptions on the sampling model. We highlight that our setting requires overcoming several new challenges that are absent in MDPs or turn-based Markov games. In particular, to achieve optimism under the simultaneous-move games, we construct both upper and lower confidence bounds of the value function, and then derive the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash Equilibrium of this general-sum game is computationally hard, our algorithm instead solves for a Coarse Correlated Equilibrium (CCE), which can be obtained efficiently via linear programming. To our best knowledge, such a CCE-based mechanism for implementing optimism has not appeared in the literature and might be of interest in its own right.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/xie20a.html
http://proceedings.mlr.press/v125/xie20a.htmlKernel and Rich Regimes in Overparametrized Models A recent line of work studies overparametrized neural networks in the “kernel regime,” i.e. when during training the network behaves as a kernelized linear predictor, and thus, training with gradient descent has the effect of finding the corresponding minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized networks can induce rich implicit biases that are not RKHS norms. Building on an observation by \citet{chizat2018note}, we show how the \textbf{\textit{scale of the initialization}} controls the transition between the “kernel” (aka lazy) and “rich” (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a family of simple depth-$D$ linear networks that exhibit an interesting and meaningful transition between the kernel and rich regimes, and highlight an interesting role for the \emph{width} of the models. We further demonstrate this transition empirically for matrix factorization and multilayer non-linear networks.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/woodworth20a.html
http://proceedings.mlr.press/v125/woodworth20a.htmlTaking a hint: How to leverage loss predictors in contextual bandits? We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predicted losses relative to the realized losses, denoted as $\mathcal{E} \leq T$, is relatively small. We provide a complete answer to this question, with upper and lower bounds for various settings: adversarial and stochastic environments, known and unknown $\mathcal{E}$, and single and multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^\frac{1}{4}\})$ when $\mathcal{E}$ is known, in contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}T^\frac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key \emph{action remapping} technique for optimal regret with known $\mathcal{E}$, 2) computationally efficient implementation of Catoni’s robust mean estimator via an ERM oracle in the stochastic setting with optimal regret, 3) an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/wei20a.html
http://proceedings.mlr.press/v125/wei20a.htmlActive Learning for Identification of Linear Dynamical Systems We propose an algorithm to actively estimate the parameters of a linear dynamical system. Given complete control over the system’s input, our algorithm adaptively chooses the inputs to accelerate estimation. We show a finite time bound quantifying the estimation rate our algorithm attains and prove matching upper and lower bounds which guarantee its asymptotic optimality, up to constants. In addition, we show that this optimal rate is unattainable when using Gaussian noise to excite the system, even with optimally tuned covariance, and analyze several examples where our algorithm provably improves over rates obtained by playing noise. Our analysis critically relies on a novel result quantifying the error in estimating the parameters of a dynamical system when arbitrary periodic inputs are being played. We conclude with numerical examples that illustrate the effectiveness of our algorithm in practice. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/wagenmaker20a.html
http://proceedings.mlr.press/v125/wagenmaker20a.htmlOpen Problem: Fast and Optimal Online Portfolio SelectionOnline portfolio selection has received much attention in the COLT community since its introduction by Cover, but all state-of-the-art methods fall short in at least one of the following ways: they are either i) computationally infeasible; or ii) they do not guarantee optimal regret; or iii) they assume the gradients are bounded, which is unnecessary and cannot be guaranteed. We are interested in a natural follow-the-regularized-leader (FTRL) approach based on the log barrier regularizer, which is computationally feasible. The open problem we put before the community is to formally prove whether this approach achieves the optimal regret. Resolving this question will likely lead to new techniques to analyse FTRL algorithms. There are also interesting technical connections to self-concordance, which has previously been used in the context of bandit convex optimization.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/van-erven20a.html
http://proceedings.mlr.press/v125/van-erven20a.htmlBalancing Gaussian vectors in high dimension Motivated by problems in controlled experiments, we study the discrepancy of random matrices with continuous entries where the number of columns $n$ is much larger than the number of rows $m$. Our first result shows that if $\omega(1) = m = o(n)$, a matrix with i.i.d. standard Gaussian entries has discrepancy $\Theta(\sqrt{n} \, 2^{-n/m})$ with high probability. This provides sharp guarantees for Gaussian discrepancy in a regime that had not been considered before in the existing literature. Our results also apply to a more general family of random matrices with continuous i.i.d. entries, assuming that $m = O(n/\log{n})$. The proof is non-constructive and is an application of the second moment method. Our second result is algorithmic and applies to random matrices whose entries are i.i.d. and have a Lipschitz density. We present a randomized polynomial-time algorithm that achieves discrepancy $e^{-\Omega(\log^2(n)/m)}$ with high probability, provided that $m = O(\sqrt{\log{n}})$. In the one-dimensional case, this matches the best known algorithmic guarantees due to Karmarkar–Karp. For higher dimensions $2 \leq m = O(\sqrt{\log{n}})$, this establishes the first efficient algorithm achieving discrepancy smaller than $O( \sqrt{m} )$. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/turner20a.html
http://proceedings.mlr.press/v125/turner20a.htmlEstimation and Inference with Trees and Forests in High Dimensions Regression Trees [Breiman et al. 1984] and Random Forests [Breiman 2001], are one of the most widely used estimation methods by machine learning practitioners. Despite their widespread use, their theoretical underpinnings are far from being fully understood. Recent breakthrough advances have shown that such greedily built trees are asymptotically consistent [Biau et al. 2010, Denil et al. 2014, Scornet et al. 2015] in the low dimensional regime, where the number of features is a constant, independent of the sample size. Also, the works of [Mentch et al. 2016, Wager et al. 2018] provide asymptotic normality results for honest versions of Random Forests. In this work, we analyze the performance of regression trees and forests with binary features in the high-dimensional regime, where the number of features can grow exponentially with the number of samples. We show that trees and forests built greedily based on the celebrated CART criterion, provably adapt to sparsity: when only a subset $R$, of size $r$, of the features are relevant, then the mean squared error of appropriately shallow trees, or fully grown honest forests, scales exponentially only with the number of relevant features and depends only logarithmically on the overall number of features. More precisely, we identify three regimes, each providing different dependence on the number of relevant features. When the relevant variables are “weakly” relevant (in the sense that there is not strong separation between the relevant and irrelevant variables in terms of their ability to reduce variance), then shallow trees achieve “slow rates” on the mean squared error of the order of $2^r/\sqrt{n}$, when variables are independent, and $1/n^{1/(r+2)}$, when variables are dependent. When the relevant variables are “strongly” relevant, in that there is a separation in their ability to reduce variance as compared to the irrelevant ones, by a constant $\beta_{\min}$, then we show that greedily built shallow trees and fully grown honest forests can achieve fast parametric mean squared error rates of the order of $2^r/(\beta_{\min}\,{n})$. When variables are strongly relevant, we also show that the predictions of sub-sampled honest forests have an asymptotically normal distribution centered around their true values and whose variance scales at most as $O(2^r \log(n)/(\beta_{\min}\,{n}))$. Thus, sub-sampled honest forests are provably a data-adaptive method for non-parametric inference, that adapts to the latent sparsity dimension of the data generating distribution, as opposed to classical non-parametric regression approaches. Our results show that, at least for the case of binary features, forest based algorithms can offer immense improvement on the statistical power of non-parametric hypothesis tests in high-dimensional regimes. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/syrgkanis20a.html
http://proceedings.mlr.press/v125/syrgkanis20a.htmlOpen Problem: Information Complexity of VC LearningUniform convergence approaches learning by studying the complexity of hypothesis classes. In particular, hypothesis classes with bounded Vapnik-Chervonenkis dimension exhibit strong uniform convergence, that is, any hypothesis in the class has low generalization error. On the other hand, a long line of work studies the information complexity of a learning algorithm, as it is connected to several desired properties, including generalization. We ask whether all VC classes admit a learner with low information complexity which achieves the generalization bounds guaranteed by uniform convergence. Specifically, since we know that this is not possible if we consider proper and consistent learners and measure information complexity in terms of the mutual information (Bassily et al., 2018), we are interested in learners with low information complexity measured in terms of the recently introduced notion of CMI (Steinke and Zakynthinou, 2020). Can we obtain tight bounds on the information complexity of a learning algorithm for a VC class (via CMI), thus exactly retrieving the known generalization bounds implied for this class by uniform convergence?Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/steinke20b.html
http://proceedings.mlr.press/v125/steinke20b.htmlReasoning About Generalization via Conditional Mutual Information We provide an information-theoretic framework for studying the generalization properties of machine learning algorithms. Our framework ties together existing approaches, including uniform convergence bounds and recent methods for adaptive data analysis. Specifically, we use Conditional Mutual Information (CMI) to quantify how well the input (i.e., the training data) can be recognized given the output (i.e., the trained model) of the learning algorithm. We show that bounds on CMI can be obtained from VC dimension, compression schemes, differential privacy, and other methods. We then show that bounded CMI implies various forms of generalization.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/steinke20a.html
http://proceedings.mlr.press/v125/steinke20a.htmlImproper Learning for Non-Stochastic Control We consider the problem of controlling a possibly unknown linear dynamical system with adversarial perturbations, adversarially chosen convex loss functions, and partially observed states, known as non-stochastic control. We introduce a controller parametrization based on the denoised observations, and prove that applying online gradient descent to this parametrization yields a new controller which attains sublinear regret vs. a large class of closed-loop policies. In the fully-adversarial setting, our controller attains an optimal regret bound of $\sqrt{T}$-when the system is known, and, when combined with an initial stage of least-squares estimation, $T^{2/3}$ when the system is unknown; both yield the first sublinear regret for the partially observed setting. Our bounds are the first in the non-stochastic control setting that compete with \emph{all} stabilizing linear dynamical controllers, not just state feedback. Moreover, in the presence of semi-adversarial noise containing both stochastic and adversarial components, our controller attains the optimal regret bounds of $\mathrm{poly}(\log T)$ when the system is known, and $\sqrt{T}$ when unknown. To our knowledge, this gives the first end-to-end $\sqrt{T}$ regret for online Linear Quadratic Gaussian controller, and applies in a more general setting with adversarial losses and semi-adversarial noise.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/simchowitz20a.html
http://proceedings.mlr.press/v125/simchowitz20a.htmlLogistic Regression Regret: What’s the Catch? We address the problem of the achievable regret rates with online logistic regression. We derive lower bounds with logarithmic regret under $L_1$, $L_2$, and $L_\infty$ constraints on the parameter values. The bounds are dominated by $d/2 \log T$, where $T$ is the horizon and $d$ is the dimensionality of the parameter space. We show their achievability for $d=o(T^{1/3})$ in all these cases with Bayesian methods, that achieve them up to a $d/2 \log d$ term. Interesting different behaviors are shown for larger dimensionality. Specifically, on the negative side, if $d = \Omega(\sqrt{T})$, any algorithm is guaranteed regret of $\Omega(d \log T)$ (greater than $\Theta(\sqrt{T})$) under $L_\infty$ constraints on the parameters (and the example features). On the positive side, under $L_1$ constraints on the parameters, there exist Bayesian algorithms that can achieve regret that is sub-linear in $d$ for the asymptotically larger values of $d$. For $L_2$ constraints, it is shown that for large enough $d$, the regret remains linear in $d$ but no longer logarithmic in $T$. Adapting the \emph{redundancy-capacity\/} theorem from information theory, we demonstrate a principled methodology based on grids of parameters to derive lower bounds. Grids are also utilized to derive some upper bounds. Our results strengthen results by Kakade and Ng (2005) and Foster et al. (2018) for upper bounds for this problem, introduce novel lower bounds, and adapt a methodology that can be used to obtain such bounds for other related problems. They also give a novel characterization of the asymptotic behavior when the dimension of the parameter space is allowed to grow with $T$. They additionally strengthen connections to the information theory literature, demonstrating that the actual regret for logistic regression depends on the richness of the parameter class, where even within this problem, richer classes lead to greater regret.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/shamir20a.html
http://proceedings.mlr.press/v125/shamir20a.htmlA Nearly Optimal Variant of the Perceptron Algorithm for the Uniform Distribution on the Unit Sphere We show a simple perceptron-like algorithm to learn origin-centered halfspaces in $\mathbb{R}^n$ with accuracy $1-\epsilon$ and confidence $1-\delta$ in time $\mathcal{O}\left(\frac{n^2}{\epsilon}\left(\log \frac{1}{\epsilon}+\log \frac{1}{\delta}\right)\right)$ using $\mathcal{O}\left(\frac{n}{\epsilon}\left(\log \frac{1}{\epsilon}+\log \frac{1}{\delta}\right)\right)$ labeled examples drawn uniformly from the unit $n$-sphere. This improves upon algorithms given in Baum(1990), Long(1994) and Servedio(1999). The time and sample complexity of our algorithm match the lower bounds given in Long(1995) up to logarithmic factors. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/schmalhofer20a.html
http://proceedings.mlr.press/v125/schmalhofer20a.htmlHow Good is SGD with Random Shuffling? We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting has been investigated in several recent works, but the optimal error rates remain unclear. In this paper, we provide lower bounds on the expected optimization error with these heuristics (using SGD with any constant step size), which elucidate their advantages and disadvantages. In particular, we prove that after $k$ passes over $n$ individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2+1/nk^3\right)$, which partially corresponds to recently derived upper bounds. Moreover, if the functions are only shuffled once, then the lower bound increases to $\Omega(1/nk^2)$. Since there are strictly smaller upper bounds for repeated reshuffling, this proves an inherent performance gap between SGD with single shuffling and repeated shuffling. As a more minor contribution, we also provide a non-asymptotic $\Omega(1/k^2)$ lower bound (independent of $n$) for the incremental gradient method, when no random shuffling takes place. Finally, we provide an indication that our lower bounds are tight, by proving matching upper bounds for univariate quadratic functions.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/safran20a.html
http://proceedings.mlr.press/v125/safran20a.htmlTsallis-INF for Decoupled Exploration and Exploitation in Multi-armed Bandits We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best of both worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal $O(\sqrt{KT})$ regret bound, slightly improving on the result of Avner et al.. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al.. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/rouyer20a.html
http://proceedings.mlr.press/v125/rouyer20a.htmlList Decodable Subspace Recovery Learning from data in the presence of outliers is a fundamental problem in statistics. In this work, we study robust statistics in the presence of overwhelming outliers for the fundamental problem of subspace recovery. Given a dataset where an $\alpha$ fraction (less than half) of the data is distributed uniformly in an unknown $k$ dimensional subspace in $d$ dimensions, and with no additional assumptions on the remaining data, the goal is to recover a succinct list of $O(\frac{1}{\alpha})$ subspaces one of which is nontrivially correlated with the planted subspace. We provide the first polynomial time algorithm for the ’list decodable subspace recovery’ problem, and subsume it under a more general framework of list decoding over distributions that are "certifiably resilient" capturing state of the art results for list decodable mean estimation and regression.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/raghavendra20a.html
http://proceedings.mlr.press/v125/raghavendra20a.htmlFinite-Time Analysis of Asynchronous Stochastic Approximation and $Q$-Learning We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous $Q$-learning. The resulting bound matches the sharpest available bound for synchronous $Q$-learning, and improves over previous known bounds for asynchronous $Q$-learning.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/qu20a.html
http://proceedings.mlr.press/v125/qu20a.htmlCovariance-adapting algorithm for semi-bandits with application to sparse outcomes We investigate \emph{stochastic combinatorial semi-bandits}, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in the standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed \emph{sub-Gaussian} family. We alleviate this issue by instead considering a new general family of \emph{sub-exponential} distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the regret on this family, that is parameterized by the \emph{unknown} covariance matrix, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/perrault20a.html
http://proceedings.mlr.press/v125/perrault20a.htmlAdaptive Submodular Maximization under Stochastic Item Costs Constrained maximization of non-decreasing utility functions with submodularity-like properties is at the core of several AI and ML applications including viral marketing, pool-based active learning, adaptive feature selection, and sensor deployment. In this work, we develop adaptive policies for maximizing such functions when both the utility function and item costs may be stochastic. First, we study maximization of an adaptive weak submodular function which is submodular in an approximate and probabilistic sense, subject to a stochastic fractional knapsack constraint, which requires total expected item cost to be within a given capacity. We present the $\beta$-GREEDY policy for this problem; our approximation guarantee for it recovers many known greedy maximization guarantees as special cases. Next, we study maximization of an adaptive submodular function, which is submodular in a probabilistic sense, subject to a stochastic knapsack constraint, which requires the total item cost to be within a given capacity with probability one. We present the MIX policy for this problem; our approximation guarantee for it is the first known approximation guarantee for maximizing a non-linear utility function subject to a stochastic knapsack constraint. Using alternative parameterizations of MIX, we also derive the first known approximation guarantees for maximizing an adaptive submodular function subject to a deterministic knapsack constraint. Our guarantees are powered by an innovative differential analysis technique which models the $\beta$-GREEDY policy using a continuous-time stochastic reward process of a particle whose reward is related to the optimal utility through a differential inequality. The solution to this inequality yields our $\beta$-GREEDY guarantee. We combine differential analysis with a variety of other ideas to derive our MIX guarantees.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/parthasarathy20a.html
http://proceedings.mlr.press/v125/parthasarathy20a.htmlCostly Zero Order Oracles We study optimization with an approximate zero order oracle where there is a cost $c(\epsilon)$ associated with querying the oracle with $\epsilon$ accuracy. We consider the task of reconstructing a linear function: given a linear function $f : X \subseteq \R^d \rightarrow [-1,1]$ defined on a not-necessarily-convex set $X$, the goal is to reconstruct $\hat f$ such that $\vert{f(x) - \hat f(x)}\vert \leq \epsilon$ for all $x \in X$. We show that this can be done with cost $O(d \cdot c(\epsilon/d))$. The algorithm is based on a (poly-time computable) John-like theorem for simplices instead of ellipsoids, which may be of independent interest. This question is motivated by optimization with threshold queries, which are common in economic applications such as pricing. With threshold queries, approximating a number up to precision $\epsilon$ requires $c(\epsilon) = \log(1/\epsilon)$. For this, our algorithm has cost $O(d \log(d/\epsilon))$ which matches the $\Omega(d \log(1/\epsilon))$ lower bound up to log factors. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/paes-leme20a.html
http://proceedings.mlr.press/v125/paes-leme20a.htmlEfficient and robust algorithms for adversarial linear contextual bandits We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm are allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d. at random from a known distribution, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best known lower bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results on this problem setting.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/neu20b.html
http://proceedings.mlr.press/v125/neu20b.htmlFast Rates for Online Prediction with Abstention In the setting of sequential prediction of individual $(0, 1)$-sequences with expert advice, we show that by allowing the learner to abstain from the prediction by paying a cost marginally smaller than $0.5$ (say, $0.49$), it is possible to achieve expected regret bounds that are independent of the time horizon T. We exactly characterize the dependence on the abstention cost $c$ and the number of experts N by providing matching upper and lower bounds of order $\frac{log N} {1−2c}$, which is to be contrasted with the best possible rate of $\sqrt{T\log N}$ that is available without the option to abstain. We also discuss various extensions of our model, including a setting where the sequence of abstention costs can change arbitrarily over time, where we show regret bounds interpolating between the slow and the fast rates mentioned above, under some natural assumptions on the sequence of abstention costsWed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/neu20a.html
http://proceedings.mlr.press/v125/neu20a.htmlExtending Learnability to Auxiliary-Input Cryptographic Primitives and Meta-PAC Learning We investigate the meaning of efficient learnability from several different perspectives. The purpose is to give new insights into central problems in computational learning theory (CoLT). Specifically, we discuss the following two questions related to efficient PAC learnability. First, we investigate the gap between PAC learnability for polynomial-size circuits and weak cryptographic primitives taking auxiliary-input. Applebaum et al. observed that such a weak primitive is enough to show the hardness of PAC learning. However, the opposite direction is still unknown. In this paper, we introduce the following two notions: (1) a variant model of PAC learning whose hardness corresponds to auxiliary-input one-way functions; (2) a variant of a hitting set generator corresponding to the hardness of PAC learning. The equivalence gives a clearer insight into the gap between the hardness of learning and weak cryptographic primitives. Second, we discuss why proving efficient learnability is difficult. This question is natural because few classes are known to be polynomially learnable at present. In this paper, we formulate a task of determining efficient learnability as a meta-PAC learning problem and show that our meta-PAC learning is exactly as hard as PAC learning. Our result insists on one possibility: a hard-to-learn instance itself yields the hardness of proving efficient learnability. Our technical contribution is to give (1) a general framework for translating the hardness of PAC learning into auxiliary-input primitives, and (2) a new formulation to discuss the hardness of determining efficient learnability. Our work yields new important frontiers related to CoLT, including investigation of the learning hierarchy. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/nanashima20a.html
http://proceedings.mlr.press/v125/nanashima20a.htmlOn Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration We undertake a precise study of the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging for solving a linear system $\bar{A} \theta = \bar{b}$. When the matrix $\bar{A}$ is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity. The CLT characterizes the exact asymptotic covariance matrix, which is the sum of the classical Polyak-Ruppert covariance and a correction term that scales with the step size. Under assumptions on the tail of the noise distribution, we prove a non-asymptotic concentration inequality whose main term matches the covariance in CLT in any direction, up to universal constants. When the matrix $\bar{A}$ is not Hurwitz but only has non-negative real parts in its eigenvalues, we prove that the averaged LSA procedure actually achieves an $O(1/T)$ rate in mean-squared error. Our results provide a more refined understanding of linear stochastic approximation in both the asymptotic and non-asymptotic settings. We also show various applications of the main results, including the study of momentum-based stochastic gradient methods as well as temporal difference algorithms in reinforcement learning.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/mou20a.html
http://proceedings.mlr.press/v125/mou20a.htmlParallels Between Phase Transitions and Circuit Complexity? In many natural average-case problems, there are or there are believed to be critical values in the parameter space where the structure of the space of solutions changes in a fundamental way. These phase transitions are often believed to coincide with drastic changes in the computational complexity of the associated problem. In this work, we study the circuit complexity of inference in the broadcast tree model, which has important applications in phylogenetic reconstruction and close connections to community detection. We establish a number of qualitative connections between phase transitions and circuit complexity in this model. Specifically we show that there is a $\mathbf{TC}^0$ circuit that competes with the Bayes optimal predictor in some range of parameters above the Kesten-Stigum bound. We also show that there is a $16$ label broadcast tree model beneath the Kesten-Stigum bound in which it is possible to accurately guess the label of the root, but beating random guessing is $\mathbf{NC}^1$-hard on average. The key to locating phase transitions is often to study some intrinsic notions of complexity associated with belief propagation \— e.g. where do linear statistics fail, or when is the posterior sensitive to noise? Ours is the first work to study the complexity of belief propagation in a way that is grounded in circuit complexity.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/moitra20a.html
http://proceedings.mlr.press/v125/moitra20a.htmlInformation Theoretic Optimal Learning of Gaussian Graphical Models What is the optimal number of independent observations from which a sparse Gaussian Graphical Model can be correctly recovered? Information-theoretic arguments provide a lower bound on the minimum number of samples necessary to perfectly identify the support of any multivariate normal distribution as a function of model parameters. For a model defined on a sparse graph with $p$ nodes, a maximum degree $d$ and minimum normalized edge strength $\kappa$, this necessary number of samples scales at least as $d \log p/\kappa^2$. The sample complexity requirements of existing methods for perfect graph reconstruction exhibit dependency on additional parameters that do not enter in the lower bound. The question of whether the lower bound is tight and achievable by a polynomial time algorithm remains open. In this paper, we constructively answer this question and propose an algorithm, termed DICE, whose sample complexity matches the information-theoretic lower bound up to a universal constant factor. We also propose a related algorithm SLICE that has a slightly higher sample complexity, but can be implemented as a mixed integer quadratic program which makes it attractive in practice. Importantly, SLICE retains a critical advantage of DICE in that its sample complexity only depends on quantities present in the information theoretic lower bound. We anticipate that this result will stimulate future search of computationally efficient sample-optimal algorithms.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/misra20a.html
http://proceedings.mlr.press/v125/misra20a.htmlLipschitz and Comparator-Norm Adaptivity in Online Learning We study Online Convex Optimization in the unbounded setting where neither predictions nor gradient are constrained. The goal is to simultaneously adapt to both the sequence of gradients and the comparator. We first develop parameter-free and scale-free algorithms for a simplified setting with hints. We present two versions: the first adapts to the squared norms of both comparator and gradients separately using $O(d)$ time per round, the second adapts to their squared inner products (which measure variance only in the comparator direction) in time $O(d^3)$ per round. We then generalize two prior reductions to the unbounded setting; one to not need hints, and a second to deal with the range ratio problem (which already arises in prior work). We discuss their optimality in light of prior and new lower bounds. We apply our methods to obtain sharper regret bounds for scale-invariant online prediction with linear models. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/mhammedi20a.html
http://proceedings.mlr.press/v125/mhammedi20a.htmlTight Lower Bounds for Combinatorial Multi-Armed Bandits The Combinatorial Multi-Armed Bandit problem is a sequential decision-making problem in which an agent selects a set of arms on each round, observes feedback for each of these arms and aims to maximize a known reward function of the arms it chose. While previous work proved regret upper bounds in this setting for general reward functions, only a few works provided matching lower bounds, all for specific reward functions. In this work, we prove regret lower bounds for combinatorial bandits that hold under mild assumptions for all smooth reward functions. We derive both problem-dependent and problem-independent bounds and show that the recently proposed Gini-weighted smoothness parameter (Merlis and Mannor, 2019) also determines the lower bounds for monotone reward functions. Notably, this implies that our lower bounds are tight up to log-factors.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/merlis20a.html
http://proceedings.mlr.press/v125/merlis20a.htmlOpen Problem: Average-Case Hardness of Hypergraphic Planted Clique DetectionWe note the significance of hypergraphic planted clique (HPC) detection in the investigation of computational hardness for a range of tensor problems. We ask if more evidence for the computational hardness of HPC detection can be developed. In particular, we conjecture if it is possible to establish the equivalence of the computational hardness between HPC and PC detection.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/luo20a.html
http://proceedings.mlr.press/v125/luo20a.htmlBetter Algorithms for Estimating Non-Parametric Models in Crowd-Sourcing and Rank Aggregation Motivated by applications in crowd-sourcing and rank aggregation, a recent line of work has studied the problem of estimating an $n \times n$ bivariate isotonic matrix with an unknown permutation acting on its rows (and possibly another unknown permutation acting on its columns) from partial and noisy observations. There are wide and persistent computational vs. statistical gaps for this problem. It is known that the minimax optimal rate is $\widetilde{O}(n^{-1})$ when error is measured in average squared Frobenius norm. However the best known polynomial time computable estimator due to \cite{coltpaper} achieves the rate $\widetilde{O}(n^{-\frac{3}{4}})$, and this is the natural barrier to approaches based on using local statistics to figure out the relative order of pairs of rows without using information from the rest of the matrix. Here we introduce a framework for exploiting global information in shape-constrained estimation problems. In the case when only the rows are permuted, we give an algorithm that achieves error rate $O(n^{-1 + o(1)})$, which essentially closes the computational vs. statistical gap for this problem. When both the rows and columns are permuted, we give an improved algorithm that achieves error rate $O(n^{-\frac{5}{6} + o(1)})$. Additionally, all of our algorithms run in nearly linear time. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/liu20a.html
http://proceedings.mlr.press/v125/liu20a.htmlNear-Optimal Algorithms for Minimax Optimization This paper resolves a longstanding open question pertaining to the design of near-optimal first-order algorithms for smooth and strongly-convex-strongly-concave minimax problems. Current state-of-the-art first-order algorithms find an approximate Nash equilibrium using $\tilde{O}(\kappa_x+\kappa_y)$ or $\tilde{O}(\text{min}\{\kappa_x\sqrt{\kappa_y}, \sqrt{\kappa_x}\kappa_y\})$ gradient evaluations, where $\kappa_x$ and $\kappa_y$ are the condition numbers for the strong-convexity and strong-concavity assumptions. A gap still remains between these results and the best existing lower bound $\tilde{\Omega}(\sqrt{\kappa_x\kappa_y})$. This paper presents the first algorithm with $\tilde{O}(\sqrt{\kappa_x\kappa_y})$ gradient complexity, matching the lower bound up to logarithmic factors. Our algorithm is designed based on an accelerated proximal point method and an accelerated solver for minimax proximal steps. It can be easily extended to the settings of strongly-convex-concave, convex-concave, nonconvex-strongly-concave, and nonconvex-concave functions. This paper also presents algorithms that match or outperform all existing methods in these settings in terms of gradient complexity, up to logarithmic factors.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/lin20a.html
http://proceedings.mlr.press/v125/lin20a.htmlLearning Entangled Single-Sample Gaussians in the Subset-of-Signals Model In the setting of entangled single-sample distributions, the goal is to estimate some common parameter shared by a family of $n$ distributions, given one single sample from each distribution. This paper studies mean estimation for entangled single-sample Gaussians that have a common mean but different unknown variances. We propose the subset-of-signals model where an unknown subset of $m$ variances are bounded by 1 while there are no assumptions on the other variances. In this model, we analyze a simple and natural method based on iteratively averaging the truncated samples, and show that the method achieves error $O \left(\frac{\sqrt{n\ln n}}{m}\right)$ with high probability when $m=\Omega(\sqrt{n\ln n})$, slightly improving existing bounds for this range of $m$. We further prove lower bounds, showing that the error is $\Omega\left(\left(\frac{n}{m^4}\right)^{1/2}\right)$ when $m$ is between $\Omega(\ln n)$ and $O(n^{1/4})$, and the error is $\Omega\left(\left(\frac{n}{m^4}\right)^{1/6}\right)$ when $m$ is between $\Omega(n^{1/4})$ and $O(n^{1 - \epsilon})$ for an arbitrarily small $\epsilon>0$, improving existing lower bounds and extending to a wider range of $m$.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/liang20b.html
http://proceedings.mlr.press/v125/liang20b.htmlOn the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of $d = n^{\alpha}$, $\alpha\in(0,1)$, for the input dimension $d$ and sample size $n$. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/liang20a.html
http://proceedings.mlr.press/v125/liang20a.htmlLearning Over-Parametrized Two-Layer Neural Networks beyond NTK We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/li20a.html
http://proceedings.mlr.press/v125/li20a.htmlA Fast Spectral Algorithm for Mean Estimation with Sub-Gaussian Rates We study the algorithmic problem of estimating the mean of a heavy-tailed random vector in R^d, given n i.i.d. samples. The goal is to design an efficient estimator that attains the optimal sub-gaussian error bound, only assuming that the random vector has bounded mean and covariance. Polynomial-time solutions to this problem are known but have high runtime due to their use of semi-definite programming (SDP). Moreover, conceptually, it remains open whether convex relaxation is truly necessary for this problem. In this work, we show that it is possible to go beyond SDP and achieve better computational efficiency. In particular, we provide a spectral algorithm that achieves the optimal statistical performance and runs in time O ( n^2 d ), improving upon the previous fastest runtime O( n^{3.5}+ n^2 d ) by Cherapanamjeri et.al. (COLT ’19). Our algorithm is spectral in that it only requires (approximate) eigenvector computations, which can be implemented very efficiently by, for example, power iteration or the Lanczos method. At the core of our algorithm is a novel connection between the furthest hyperplane problem introduced by Karnin et. al. (COLT ’12) and a structural lemma on heavy-tailed distributions by Lugosi and Mendelson (Ann. Stat. ’19). This allows us to iteratively reduce the estimation error at a geometric rate using only the information derived from the top singular vector of the data matrix, leading to a significantly faster running time.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/lei20a.html
http://proceedings.mlr.press/v125/lei20a.htmlAn $\widetilde\mathcal{O}(m/\varepsilon^3.5)$-Cost Algorithm for Semidefinite Programs with Diagonal ConstraintsWe provide a first-order algorithm for semidefinite programs (SDPs) with diagonal constraints on the matrix variable. Our algorithm outputs an $\varepsilon$-optimal solution with a run time of $\widetilde{\mathcal{O}}(m/\varepsilon^{3.5})$, where $m$ is the number of non-zero entries in the cost matrix. This improves upon the previous best run time of $\widetilde{\mathcal{O}}(m/\varepsilon^{4.5})$ by Arora and Kale. As a corollary of our result, given an instance of the Max-Cut problem with $n$ vertices and $m \gg n$ edges, our algorithm returns a $(1 - \varepsilon)\alpha_{GW}$ cut in the faster time of $\widetilde{\mathcal{O}}(m/\varepsilon^{3.5})$, where $\alpha_{GW} \approx 0.878567$ is the approximation ratio by Goemans and Williamson. Our key technical contribution is to combine an approximate variant of the Arora-Kale framework of mirror descent for SDPs with the idea of trading off exact computations in every iteration for variance-reduced estimations in most iterations, only periodically resetting the accumulated error with exact computations. This idea, along with the constructed estimator, are of possible independent interest for other problems that use the mirror descent framework. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/lee20c.html
http://proceedings.mlr.press/v125/lee20c.htmlLogsmooth Gradient Concentration and Tighter Runtimes for Metropolized Hamiltonian Monte Carlo We show that the gradient norm $\norm{\nabla f(x)}$ for $x \sim \exp(-f(x))$, where $f$ is strongly convex and smooth, concentrates tightly around its mean. This removes a barrier in the prior state-of-the-art analysis for the well-studied Metropolized Hamiltonian Monte Carlo (HMC) algorithm for sampling from a strongly logconcave distribution. We correspondingly demonstrate that Metropolized HMC mixes in $\tilde{O}(\kappa d)$ iterations, improving upon the $\tilde{O}(\kappa^{1.5}\sqrt{d} + \kappa d)$ runtime of prior work by a factor $(\kappa/d)^{1/2}$ when the condition number $\kappa$ is large. Our mixing time analysis introduces several techniques which to our knowledge have not appeared in the literature and may be of independent interest, including restrictions to a nonconvex set with good conductance behavior, and a new reduction technique for boosting a constant-accuracy total variation guarantee under weak warmness assumptions. This is the first high-accuracy mixing time result for logconcave distributions using only first-order function information which achieves linear dependence on $\kappa$; we also give evidence that this dependence is likely to be necessary for standard Metropolized first-order methods.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/lee20b.html
http://proceedings.mlr.press/v125/lee20b.htmlA Closer Look at Small-loss Bounds for Bandits with Graph Feedback We study {\it small-loss} bounds for adversarial multi-armed bandits with graph feedback, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities, instead of the total number of rounds. We derive the first small-loss bound for general strongly observable graphs, resolving an open problem of Lykouris et al. (2018). Specifically, we develop an algorithm with regret $\mathcal{\tilde{O}}(\sqrt{\kappa L_*})$ where $\kappa$ is the clique partition number and $L_*$ is the loss of the best arm, and for the special case of self-aware graphs where every arm has a self-loop, we improve the regret to $\mathcal{\tilde{O}}(\min\{\sqrt{\alpha T}, \sqrt{\kappa L_*}\})$ where $\alpha \leq \kappa$ is the independence number. Our results significantly improve and extend those by Lykouris et al. (2018) who only consider self-aware undirected graphs. Furthermore, we also take the first attempt at deriving small-loss bounds for weakly observable graphs. We first prove that no typical small-loss bounds are achievable in this case, and then propose algorithms with alternative small-loss bounds in terms of the loss of some specific subset of arms. A surprising side result is that $\mathcal{\tilde{O}}(\sqrt{T})$ regret is achievable even for weakly observable graphs as long as the best arm has a self-loop. Our algorithms are based on the Online Mirror Descent framework but require a suite of novel techniques that might be of independent interest. Moreover, all our algorithms can be made parameter-free without the knowledge of the environment.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/lee20a.html
http://proceedings.mlr.press/v125/lee20a.htmlExploration by Optimisation in Partial Monitoring We provide a novel algorithm for adversarial k-action d-outcome partial monitoring that is adaptive, intuitive and efficient. The highlight is that for the non-degenerate locally observable games, the n-round minimax regret is bounded by 6m k^(3/2) sqrt(n log(k)), where m is the number of signals. This matches the best known information-theoretic upper bound derived via Bayesian minimax duality. The same algorithm also achieves near-optimal regret for full information, bandit and globally observable games. High probability bounds and simple experiments are also provided.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/lattimore20a.html
http://proceedings.mlr.press/v125/lattimore20a.htmlThe EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians We consider the problem of spherical Gaussian Mixture models with $k \geq 3$ components when the components are well separated. A fundamental previous result established that separation of $\Omega(\sqrt{\log k})$ is necessary and sufficient for identifiability of the parameters with \textit{polynomial} sample complexity (Regev and Vijayaraghavan, 2017). In the same context, we show that $\tilde{O} (kd/\epsilon^2)$ samples suffice for any $\epsilon \lesssim 1/k$, closing the gap from polynomial to linear, and thus giving the first optimal sample upper bound for the parameter estimation of well-separated Gaussian mixtures. We accomplish this by proving a new result for the Expectation-Maximization (EM) algorithm: we show that EM converges locally, under separation $\Omega(\sqrt{\log k})$. The previous best-known guarantee required $\Omega(\sqrt{k})$ separation (Yan, et al., 2017). Unlike prior work, our results do not assume or use prior knowledge of the (potentially different) mixing weights or variances of the Gaussian components. Furthermore, our results show that the finite-sample error of EM does not depend on non-universal quantities such as pairwise distances between means of Gaussian components.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kwon20a.html
http://proceedings.mlr.press/v125/kwon20a.htmlOn Suboptimality of Least Squares with Application to Estimation of Convex Bodies We develop a technique for establishing lower bounds on the sample complexity of Least Squares (or, Empirical Risk Minimization) for large classes of functions. As an application, we settle an open problem regarding optimality of Least Squares in estimating a convex set from noisy support function measurements in dimension $d\geq 6$. Specifically, we establish that Least Squares is mimimax sub-optimal, and achieves a rate of $\tilde{\Theta}_d(n^{-2/(d-1)})$ whereas the minimax rate is $\Theta_d(n^{-4/(d+3)})$.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kur20a.html
http://proceedings.mlr.press/v125/kur20a.htmlOpen Problem: Tight Convergence of SGD in Constant DimensionStochastic Gradient Descent (SGD) is one of the most popular optimization methods in machine learning and has been studied extensively since the early 50’s. However, our understanding of this fundamental algorithm is still lacking in certain aspects. We point out to a gap that remains between the known upper and lower bounds for the expected suboptimality of the last SGD point whenever the dimension is a constant independent of the number of SGD iterations $T$, and in particular, that the gap is still unaddressed even in the one dimensional case. For the latter, we provide evidence that the correct rate is $\Theta(1/\sqrt{T})$ and conjecture that the same applies in any (constant) dimension.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/koren20a.html
http://proceedings.mlr.press/v125/koren20a.htmlNew Potential-Based Bounds for Prediction with Expert Advice This work addresses the classic machine learning problem of online prediction with expert advice. We consider the finite-horizon version of this zero-sum, two-person game. Using verification arguments from optimal control theory, we view the task of finding better lower and upper bounds on the value of the game (regret) as the problem of finding better sub- and supersolutions of certain partial differential equations (PDEs). These sub- and supersolutions serve as the potentials for player and adversary strategies, which lead to the corresponding bounds. To get explicit bounds, we use closed-form solutions of specific PDEs. Our bounds hold for any given number of experts and horizon; in certain regimes (which we identify) they improve upon the previous state of the art. For two and three experts, our bounds provide the optimal leading order term. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kobzar20a.html
http://proceedings.mlr.press/v125/kobzar20a.htmlInformation Directed Sampling for Linear Partial Monitoring Partial monitoring is a rich framework for sequential decision making under uncertainty that generalizes many well known bandit models, including linear, combinatorial and dueling bandits. We introduce {\em information directed sampling} (IDS) for stochastic partial monitoring with a linear reward and observation structure. IDS achieves adaptive worst-case regret rates that depend on precise observability conditions of the game. Moreover, we prove lower bounds that classify the minimax regret of all finite games into four possible regimes. IDS achieves the optimal rate in all cases up to logarithmic factors, without tuning any hyper-parameters. We further extend our results to the contextual and the kernelized setting, which significantly increases the range of possible applications.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kirschner20a.html
http://proceedings.mlr.press/v125/kirschner20a.htmlUniversal Approximation with Deep Narrow Networks The classical Universal Approximation Theorem holds for neural networks of arbitrary width and bounded depth. Here we consider the natural ‘dual’ scenario for networks of bounded width and arbitrary depth. Precisely, let $n$ be the number of inputs neurons, $m$ be the number of output neurons, and let $\rho$ be any nonaffine continuous function, with a continuous nonzero derivative at some point. Then we show that the class of neural networks of arbitrary depth, width $n + m + 2$, and activation function $\rho$, is dense in $C(K; \mathbb{R}^m)$ for $K \subseteq \mathbb{R}^n$ with $K$ compact. This covers every activation function possible to use in practice, and also includes polynomial activation functions, which is unlike the classical version of the theorem, and provides a qualitative difference between deep narrow networks and shallow wide networks. We then consider several extensions of this result. In particular we consider nowhere differentiable activation functions, density in noncompact domains with respect to the $L^p$-norm, and how the width may be reduced to just $n + m + 1$ for ‘most’ activation functions.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kidger20a.html
http://proceedings.mlr.press/v125/kidger20a.htmlOnline Learning with Vector Costs and Bandits with Knapsacks We introduce online learning with vector costs ($OLVC_p$) where in each time step $t \in \{1,\ldots, T\}$, we need to play an action $i \in \{1,\ldots,n\}$ that incurs an unknown vector cost in $[0,1]^d$. The goal of the online algorithm is to minimize the $\ell_p$ norm of the sum of its cost vectors. This captures the classical online learning setting for $d=1$, and is interesting for general $d$ because of applications like online scheduling where we want to balance the load between different machines (dimensions). We study $OLVC_p$ in both stochastic and adversarial arrival settings, and give a general procedure to reduce the problem from $d$ dimensions to a single dimension. This allows us to use classical online learning algorithms in both full and bandit feedback models to obtain (near) optimal results. In particular, we obtain a single algorithm (up to the choice of learning rate) that gives sublinear regret for stochastic arrivals and a tight $O(\min\{p, \log d\})$ competitive ratio for adversarial arrivals. The $OLVC_p$ problem also occurs as a natural subproblem when trying to solve the popular Bandits with Knapsacks (BWK) problem. This connection allows us to use our $OLVC_p$ techniques to obtain (near) optimal results for BWK in both stochastic and adversarial settings. In particular, we obtain a tight $O(\log d \cdot \log T)$ competitive ratio algorithm for adversarial BWK, which improves over the $O(d \cdot \log T)$ competitive ratio algorithm of Immorlica et al. (2019).Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kesselheim20a.html
http://proceedings.mlr.press/v125/kesselheim20a.htmlPrivately Learning Thresholds: Closing the Exponential Gap We study the sample complexity of learning threshold functions under the constraint of differential privacy. It is assumed that each labeled example in the training data is the information of one individual and we would like to come up with a generalizing hypothesis $h$ while guaranteeing differential privacy for the individuals. Intuitively, this means that any single labeled example in the training data should not have a significant effect on the choice of the hypothesis. This problem has received much attention recently; unlike the non-private case, where the sample complexity is independent of the domain size and just depends on the desired accuracy and confidence, for private learning the sample complexity must depend on the domain size $X$ (even for approximate differential privacy). Alon et al. (STOC 2019) showed a lower bound of $\Omega(\log^*|X|)$ on the sample complexity and Bun et al. (FOCS 2015) presented an approximate-private learner with sample complexity $\tilde{O}\left(2^{\log^*|X|}\right)$. In this work we reduce this gap significantly, almost settling the sample complexity. We first present a new upper bound (algorithm) of $\tilde{O}\left(\left(\log^*|X|\right)^2\right)$ on the sample complexity and then present an improved version with sample complexity $\tilde{O}\left(\left(\log^*|X|\right)^{1.5}\right)$. Our algorithm is constructed for the related interior point problem, where the goal is to find a point between the largest and smallest input elements. It is based on selecting an input-dependent hash function and using it to embed the database into a domain whose size is reduced logarithmically; this results in a new database, an interior point of which can be used to generate an interior point of the original database in a differentially private manner.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kaplan20a.html
http://proceedings.mlr.press/v125/kaplan20a.htmlApproximate is Good Enough: Probabilistic Variants of Dimensional and Margin Complexity We present and study approximate notions of dimensional and margin complexity, which correspond to the minimal dimension or norm of an embedding required to {\em approximate}, rather then exactly represent, a given hypothesis class. We show that such notions are not only sufficient for learning using linear predictors or a kernel, but unlike the exact variants, are also necessary. Thus they are better suited for discussing limitations of linear or kernel methods.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kamath20b.html
http://proceedings.mlr.press/v125/kamath20b.htmlPrivate Mean Estimation of Heavy-Tailed Distributions We give new upper and lower bounds on the minimax sample complexity of differentially private mean estimation of distributions with bounded $k$-th moments. Roughly speaking, in the univariate case, we show that $$n = \Theta\left(\frac{1}{\alpha^2} + \frac{1}{\alpha^{\frac{k}{k-1}}\varepsilon}\right)$$ samples are necessary and sufficient to estimate the mean to $\alpha$-accuracy under $\varepsilon$-differential privacy, or any of its common relaxations. This result demonstrates a qualitatively different behavior compared to estimation absent privacy constraints, for which the sample complexity is identical for all $k \geq 2$. We also give algorithms for the multivariate setting whose sample complexity is a factor of $O(d)$ larger than the univariate case.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kamath20a.html
http://proceedings.mlr.press/v125/kamath20a.htmlFinite Time Analysis of Linear Two-timescale Stochastic Approximation with Markovian Noise Linear two-timescale stochastic approximation (SA) scheme is an important class of algorithms which has become popular in reinforcement learning (RL), particularly for the policy evaluation problem. Recently, a number of works have been devoted to establishing the finite time analysis of the scheme, especially under the Markovian (non-i.i.d.) noise settings that are ubiquitous in practice. In this paper, we provide a finite-time analysis for linear two timescale SA. Our bounds show that there is no discrepancy in the convergence rate between Markovian and martingale noise, only the constants are affected by the mixing time of the Markov chain. With an appropriate step size schedule, the transient term in the expected error bound is $o(1/k^c)$ and the steady-state term is ${\cal O}(1/k)$, where $c>1$ and $k$ is the iteration number. Furthermore, we present an asymptotic expansion of the expected error with a matching lower bound of $\Omega(1/k)$. A simple numerical experiment is presented to support our theory.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/kaledin20a.html
http://proceedings.mlr.press/v125/kaledin20a.htmlProvably efficient reinforcement learning with linear function approximation Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where \emph{function approximation} must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a “simulator” or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)—a classical algorithm frequently studied in the linear setting—achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/jin20a.html
http://proceedings.mlr.press/v125/jin20a.htmlGradient descent follows the regularization path for general losses Recent work across many machine learning disciplines has highlighted that standard descent methods, even without explicit regularization, do not merely minimize the training error, but also exhibit an \emph{implicit bias}. This bias is typically towards a certain regularized solution, and relies upon the details of the learning process, for instance the use of the cross-entropy loss. In this work, we show that for empirical risk minimization over linear predictors with \emph{arbitrary} convex, strictly decreasing losses, if the risk does not attain its infimum, then the gradient-descent path and the \emph{algorithm-independent} regularization path converge to the same direction (whenever either converges to a direction). Using this result, we provide a justification for the widely-used exponentially-tailed losses (such as the exponential loss or the logistic loss): while this convergence to a direction for exponentially-tailed losses is necessarily to the maximum-margin direction, other losses such as polynomially-tailed losses may induce convergence to a direction with a poor margin. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/ji20a.html
http://proceedings.mlr.press/v125/ji20a.htmlEfficient improper learning for online logistic regression We consider the setting of online logistic regression and consider the regret with respect to the $\ell_2$-ball of radius $B$. It is known (see Hazan et al. (2014)) that any proper algorithm which has logarithmic regret in the number of samples (denoted $n$) necessarily suffers an exponential multiplicative constant in $B$. In this work, we design an efficient improper algorithm that avoids this exponential constant while preserving a logarithmic regret. Indeed, Foster et al. (2018) showed that the lower bound does not apply to improper algorithms and proposed a strategy based on exponential weights with prohibitive computational complexity. Our new algorithm based on regularized empirical risk minimization with surrogate losses satisfies a regret scaling as $O(B\log(Bn))$ with a per-round time-complexity of order $O(d^2 + \log(n))$.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/jezequel20a.html
http://proceedings.mlr.press/v125/jezequel20a.htmlRobust causal inference under covariate shift via worst-case subpopulation treatment effects We propose a notion of worst-case treatment effect (WTE) across all subpopulations of a given size, a conservative notion of topline treatment effect. Compared to the average treatment effect (ATE) that solely relies on the covariate distribution of collected data, WTE is robust to unanticipated covariate shifts, and ensures reliable inference uniformly over underrepresented minority groups. We develop a semiparametrically efficient estimator for the WTE, leveraging machine learning-based estimates of heterogenous treatment effects and propensity scores. By virtue of satisfying a key (Neyman) orthogonality property, our estimator enjoys central limit behavior—oracle rates with true nuisance parameters—even when estimates of nuisance parameters converge at slower-than-parameteric rates. In particular, this allows using black-box machine learning methods to construct asymptotically exact confidence intervals for the WTE. For both observational and randomized studies, we prove that our estimator achieves the \emph{optimal} asymptotic variance, by establishing a semiparametric efficiency lower bound. On real datasets, we illustrate the non-robustness of ATE under even small amounts distributional shift, and demonstrate that WTE allows us to guard against brittle findings that are invalidated by unanticipated covariate shifts. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/jeong20a.html
http://proceedings.mlr.press/v125/jeong20a.htmlPrecise Tradeoffs in Adversarial Training for Linear Regression Despite breakthrough performance, modern learning models are known to be highly vulnerable to small adversarial perturbations in their inputs. While a wide variety of recent \emph{adversarial training} methods have been effective at improving robustness to perturbed inputs (robust accuracy), often this benefit is accompanied by a decrease in accuracy on benign inputs (standard accuracy), leading to a tradeoff between often competing objectives. Complicating matters further, recent empirical evidence suggest that a variety of other factors (size and quality of training data, model size, etc.) affect this tradeoff in somewhat surprising ways. In this paper we provide a precise and comprehensive understanding of the role of adversarial training in the context of linear regression with Gaussian features. In particular, we characterize the fundamental tradeoff between the accuracies achievable by any algorithm regardless of computational power or size of the training data. Furthermore, we precisely characterize the standard/robust accuracy and the corresponding tradeoff achieved by a contemporary mini-max adversarial training approach in a high-dimensional regime where the number of data points and the parameters of the model grow in proportion to each other. Our theory for adversarial training algorithms also facilitates the rigorous study of how a variety of factors (size and quality of training data, model overparametrization etc.) affect the tradeoff between these two competing accuracies. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/javanmard20a.html
http://proceedings.mlr.press/v125/javanmard20a.htmlExtrapolating the profile of a finite population We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =\omega(k/\log k)$, it is possible to consistently estimate in total variation the \emph{profile} of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $\Theta(1/\log k)$. Our estimator is based on Wolfowitz’s minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/jana20a.html
http://proceedings.mlr.press/v125/jana20a.htmlSmooth Contextual Bandits: Bridging the Parametric and Non-differentiable Regret Regimes We study a nonparametric contextual bandit problem where the expected reward functions belong to a Hölder class with smoothness parameter $\beta$. We show how this interpolates between two extremes that were previously studied in isolation: non-differentiable bandits ($\beta\leq1$), where rate-optimal regret is achieved by running separate non-contextual bandits in different context regions, and parametric-response bandits (satisfying $\beta=\infty$), where rate-optimal regret can be achieved with minimal or no exploration due to infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and non-differentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/hu20a.html
http://proceedings.mlr.press/v125/hu20a.htmlNoise-tolerant, Reliable Active Classification with Comparison Queries With the explosion of massive, widely available unlabeled data in the past years, finding label and time efficient, robust learning algorithms has become ever more important in theory and in practice. We study the paradigm of active learning, in which algorithms with access to large pools of data may adaptively choose what samples to label in the hope of exponentially increasing efficiency. By introducing comparisons, an additional type of query comparing two points, we provide the first time and query efficient algorithms for learning non-homogeneous linear separators robust to bounded (Massart) noise. We further provide algorithms for a generalization of the popular Tsybakov low noise condition, and show how comparisons provide a strong reliability guarantee that is often impractical or impossible with only labels - returning a classifier that makes no errors with high probability.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/hopkins20a.html
http://proceedings.mlr.press/v125/hopkins20a.htmlA Greedy Anytime Algorithm for Sparse PCA The taxing computational effort that is involved in solving some high-dimensional statistical problems, in particular problems involving non-convex optimization, has popularized the development and analysis of algorithms that run efficiently (polynomial-time) but with no general guarantee on statistical consistency. In light of the ever-increasing compute power and decreasing costs, a more useful characterization of algorithms is by their ability to calibrate the invested computational effort with various characteristics of the input at hand and with the available computational resources. We propose a new greedy algorithm for the $\ell_0$-sparse PCA problem which supports the calibration principle. We provide both a rigorous analysis of our algorithm in the spiked covariance model, as well as simulation results and comparison with other existing methods. Our findings show that our algorithm recovers the spike in SNR regimes where all polynomial-time algorithms fail while running in a reasonable parallel-time on a cluster.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/holtzman20a.html
http://proceedings.mlr.press/v125/holtzman20a.htmlNear-Optimal Methods for Minimizing Star-Convex Functions and Beyond In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $$\gamma \in (0,1]$$: $$\gamma = 1$$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $$\gamma$$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $$\epsilon$$-approximate minimizer of a smooth $$\gamma$$-quasar-convex function with at most $$O(\gamma^{-1} \epsilon^{-1/2} \log(\gamma^{-1} \epsilon^{-1}))$$ total function and gradient evaluations. We also derive a lower bound of $$\Omega(\gamma^{-1} \epsilon^{-1/2})$$ on the worst-case number of gradient evaluations required by any deterministic first-order method, showing that, up to a logarithmic factor, no deterministic first-order method can improve upon ours.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/hinder20a.html
http://proceedings.mlr.press/v125/hinder20a.htmlFaster Projection-free Online Learning In many online learning problems the computational bottleneck for gradient-based methods is the projection operation. For this reason, in many problems the most efficient algorithms are based on the Frank-Wolfe method, which replaces projections by linear optimization. In the general case, however, online projection-free methods require more iterations than projection-based methods: the best known regret bound scales as $T^{3/4}$. Despite significant work on various variants of the Frank-Wolfe method, this bound has remained unchanged for a decade. In this paper we give an efficient projection-free algorithm that guarantees $T^{2/3}$ regret for general online convex optimization with smooth cost functions and one linear optimization computation per iteration. As opposed to previous Frank-Wolfe approaches, our algorithm is derived using the Follow-the-Perturbed-Leader method and is analyzed using an online primal-dual framework.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/hazan20a.html
http://proceedings.mlr.press/v125/hazan20a.htmlBessel Smoothing and Multi-Distribution Property Estimation We consider a basic problem in statistical learning: estimating properties of multiple discrete distributions. Denoting by $\Delta_k$ the standard simplex over $[k]:=\{0,1,\ldots, k\}$, a property of $d$ distributions is a mapping from $\Delta_k^d$ to $\mathbb R$. These properties include well-known distribution characteristics such as Shannon entropy and support size ($d=1$), and many important divergence measures between distributions ($d=2$). The primary problem being considered is to learn the property value of an $\emph{unknown}$ $d$-tuple of distributions from its sample. The study of such problems dates back to the works of Efron and Thisted (1976b); Thisted and Efron (1987); Good (1953b); Carlton (1969), and has been pushed forward steadily during the past decades. Surprisingly, before our work, the general landscape of this fundamental learning problem was insufficiently understood, and nearly all the existing results are for the special case $d\le 2$. Our first main result provides a near-linear-time computable algorithm that, given independent samples from any collection of distributions and for a broad class of multi-distribution properties, learns the property as well as the empirical plug-in estimator that uses samples with logarithmic-factor larger sizes. As a corollary of this, for any $\varepsilon>0$ and fixed $d\in \mathbb Z^+$, a $d$-distribution property over $[k]$ that is Lipschitz and additively separable can be learned to an accuracy of $\varepsilon$ using a sample of size $\mathcal{O}(k/(\varepsilon^3\sqrt{\log k}))$, with high probability. Our second result addresses a closely related problem– tolerant independence testing: One receives samples from the unknown joint and marginal distributions, and attempts to infer the $\ell_1$ distance between the joint distribution and the product distribution of the marginals. We show that this testing problem also admits a sample complexity sub-linear in the alphabet sizes, demonstrating the broad applicability of our approach.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/hao20a.html
http://proceedings.mlr.press/v125/hao20a.htmlLocally Private Hypothesis Selection We initiate the study of hypothesis selection under local differential privacy. Given samples from an unknown probability distribution $p$ and a set of $k$ probability distributions $\mathcal{Q}$, we aim to output, under the constraints of $\varepsilon$-differential privacy, a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution. This is a generalization of the classic problem of $k$-wise simple hypothesis testing, which corresponds to when $p \in \mathcal{Q}$, and we wish to identify $p$. Absent privacy constraints, this problem requires $O(\log k)$ samples from $p$, and it was recently shown that the same complexity is achievable under (central) differential privacy. However, the naive approach to this problem under local differential privacy would require $\tilde O(k^2)$ samples. We first show that the constraint of local differential privacy incurs an exponential increase in cost: any algorithm for this problem requires at least $\Omega(k)$ samples. Second, for the special case of $k$-wise simple hypothesis testing, we provide a non-interactive algorithm which nearly matches this bound, requiring $\tilde O(k)$ samples. Finally, we provide sequentially interactive algorithms for the general case, requiring $\tilde O(k)$ samples and only $O(\log \log k)$ rounds of interactivity. Our algorithms are achieved through a reduction to maximum selection with adversarial comparators, a problem of independent interest for which we initiate study in the parallel setting. For this problem, we provide a family of algorithms for each number of allowed rounds of interaction $t$, as well as lower bounds showing that they are near-optimal for every $t$. Notably, our algorithms result in exponential improvements on the round complexity of previous methods.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/gopi20a.html
http://proceedings.mlr.press/v125/gopi20a.htmlLast Iterate is Slower than Averaged Iterate in Smooth Convex-Concave Saddle Point Problems In this paper we study the smooth convex-concave saddle point problem. Specifically, we analyze the last iterate convergence properties of the Extragradient (EG) algorithm. It is well known that the ergodic (averaged) iterates of EG converge at a rate of $O(1/T)$ (Nemirovski, 2004). In this paper, we show that the last iterate of EG converges at a rate of $O(1/\sqrt{T})$. To the best of our knowledge, this is the first paper to provide a convergence rate guarantee for the last iterate of EG for the smooth convex-concave saddle point problem. Moreover, we show that this rate is tight by proving a lower bound of $\Omega(1/\sqrt{T})$ for the last iterate. This lower bound therefore shows a quadratic separation of the convergence rates of ergodic and last iterates in smooth convex-concave saddle point problems.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/golowich20a.html
http://proceedings.mlr.press/v125/golowich20a.htmlNo-Regret Prediction in Marginally Stable Systems We consider the problem of online prediction in a marginally stable linear dynamical system subject to bounded adversarial or (non-isotropic) stochastic perturbations. This poses two challenges. Firstly, the system is in general unidentifiable, so recent and classical results on parameter recovery do not apply. Secondly, because we allow the system to be marginally stable, the state can grow polynomially with time; this causes standard regret bounds in online convex optimization to be vacuous. In spite of these challenges, we show that the online least-squares algorithm achieves sublinear regret (improvable to polylogarithmic in the stochastic setting), with polynomial dependence on the system’s parameters. This requires a refined regret analysis, including a structural lemma showing the current state of the system to be a small linear combination of past states, even if the state grows polynomially. By applying our techniques to learning an autoregressive filter, we also achieve logarithmic regret in the partially observed setting under Gaussian noise, with polynomial dependence on the memory of the associated Kalman filter.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/ghai20a.html
http://proceedings.mlr.press/v125/ghai20a.htmlAsymptotic Errors for High-Dimensional Convex Penalized Linear Regression beyond Gaussian Matrices We consider the problem of learning a coefficient vector $\bf x_0 \in \mathbb R^N$ from noisy linear observations $\mathbf{y} = \mathbf{F}{\mathbf{x}_{0}}+\mathbf{w} \in \mathbb R^M$ in high dimensional limit $M,N \to \infty$ with $\alpha \equiv M/N$ fixed. We provide a rigorous derivation of an explicit formula —first conjectured using heuristics method from statistical physics— for the asymptotic mean squared error obtained by penalized convex estimators such as the LASSO or the elastic net, for a sequence of very generic random matrix $\mathbf{F}$ corresponding to rotationally invariant data matrices of arbitrary spectrum. The proof is based on a convergence analysis of an oracle version of vector approximate message-passing (oracle-VAMP) and on the properties of its state evolution equations. Our method leverages on and highlights the link between vector approximate message-passing, Douglas-Rachford splitting and proximal descent algorithms, extending previous results obtained with i.i.d. matrices for a large class of problems. We illustrate our results on some concrete examples and show that even though they are asymptotic, our predictions agree remarkably well with numerics even for very moderate sizes.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/gerbelot20a.html
http://proceedings.mlr.press/v125/gerbelot20a.htmlOn the Convergence of Stochastic Gradient Descent with Low-Rank Projections for Convex Low-Rank Matrix Problems We revisit the use of Stochastic Gradient Descent (SGD) for solving convex optimization problems that serve as highly popular convex relaxations for many important low-rank matrix recovery problems such as matrix completion, phase retrieval, and more. The computational limitation of applying SGD to solving these relaxations in large-scale is the need to compute a potentially high-rank singular value decomposition (SVD) on each iteration in order to enforce the low-rank-promoting constraint. We begin by considering a simple and natural sufficient condition so that these relaxations indeed admit low-rank solutions. This condition is also necessary for a certain notion of low-rank-robustness to hold. Our main result shows that under this condition which involves the eigenvalues of the gradient vector at optimal points, SGD with mini-batches, when initialized with a “warm-start" point, produces iterates that are low-rank with high probability, and hence only a low-rank SVD computation is required on each iteration. This suggests that SGD may indeed be practically applicable to solving large-scale convex relaxations of low-rank matrix recovery problems. Our theoretical results are accompanied with supporting preliminary empirical evidence. As a side benefit, our analysis is quite simple and short.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/garber20a.html
http://proceedings.mlr.press/v125/garber20a.htmlFrom tree matching to sparse graph alignment In this paper we consider alignment of sparse graphs, for which we introduce the Neighborhood Tree Matching Algorithm (NTMA). For correlated Erdős-R{é}nyi random graphs, we prove that the algorithm returns – in polynomial time – a positive fraction of correctly matched vertices, and a vanishing fraction of mismatches. This result holds with average degree of the graphs in $O(1)$ and correlation parameter $s$ that can be bounded away from $1$, conditions under which random graph alignment is particularly challenging. As a byproduct of the analysis we introduce a matching metric between trees and characterize it for several models of correlated random trees. These results may be of independent interest, yielding for instance efficient tests for determining whether two random trees are correlated or independent.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/ganassali20a.html
http://proceedings.mlr.press/v125/ganassali20a.htmlRigorous Guarantees for Tyler’s M-Estimator via Quantum Expansion Estimating the shape of an elliptical distribution is a fundamental problem in statistics. One estimator for the shape matrix, Tyler’s M-estimator, has been shown to have many appealing asymptotic properties. It performs well in numerical experiments and can be quickly computed in practice by a simple iterative procedure. Despite the many years the estimator has been studied in the statistics community, there was neither a non-asymptotic bound on the rate of the estimator nor a proof that the iterative procedure converges in polynomially many steps. Here we observe a surprising connection between Tyler’s M-estimator and operator scaling, which has been intensively studied in recent years in part because of its connections to the Brascamp-Lieb inequality in analysis. We use this connection, together with novel results on quantum expanders, to show that Tyler’s M-estimator has the optimal rate up to factors logarithmic in the dimension, and that in the generative model the iterative procedure has a linear convergence rate even without regularization.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/franks20a.html
http://proceedings.mlr.press/v125/franks20a.htmlEfficient Parameter Estimation of Truncated Boolean Product Distributions We study the problem of estimating the parameters of a Boolean product distribution in $d$ dimensions, when the samples are truncated by a set $S \subset \{0, 1\}^d$ accessible through a membership oracle. This is the first time that the computational and statistical complexity of learning from truncated samples is considered in a discrete setting. We introduce a natural notion of \emph{fatness} of the truncation set $S$, under which truncated samples reveal enough information about the true distribution. We show that if the truncation set is sufficiently fat, samples from the true distribution can be generated from truncated samples. A stunning consequence is that virtually any statistical task (e.g., learning in total variation distance, parameter estimation, uniformity or identity testing) that can be performed efficiently for Boolean product distributions, can also be performed from truncated samples, with a small increase in sample complexity. We generalize our approach to ranking distributions over $d$ alternatives, where we show how fatness implies efficient parameter estimation of Mallows models from truncated samples. Exploring the limits of learning discrete models from truncated samples, we identify three natural conditions that are necessary for efficient identifiability: (i) the truncation set $S$ should be rich enough; (ii) $S$ should be accessible through membership queries; and (iii) the truncation by $S$ should leave enough randomness in all directions. By carefully adapting the Stochastic Gradient Descent approach of (Daskalakis et al., FOCS 2018), we show that these conditions are also sufficient for efficient learning of truncated Boolean product distributions.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/fotakis20a.html
http://proceedings.mlr.press/v125/fotakis20a.htmlOpen Problem: Model Selection for Contextual BanditsIn statistical learning, algorithms for model selection allow the learner to adapt to the complexity of the best hypothesis class in a sequence. We ask whether similar guarantees are possible for contextual bandit learning.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/foster20a.html
http://proceedings.mlr.press/v125/foster20a.htmlEmbedding Dimension of Polyhedral Losses A common technique in supervised learning with discrete losses, such as 0-1 loss, is to optimize a convex surrogate loss over Rd, calibrated with respect to the original loss. In particular, recent work has investigated embedding the original predictions (e.g. labels) as points in Rd, showing an equivalence to using polyhedral surrogates. In this work, we study the notion of the embedding dimension of a given discrete loss: the minimum dimension d such that an embedding exists. We characterize d-embeddability for all d, with a particularly tight characterization for d=1 (embedding into the real line), and useful necessary conditions for d>1 in the form of a quadratic feasibility program. We illustrate our results with novel lower bounds for abstain loss.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/finocchiaro20a.html
http://proceedings.mlr.press/v125/finocchiaro20a.htmlRoot-n-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman ranks, we develop an online learning algorithm that learns the optimal value function while at the same time achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed in (Jiang et al., 2017), known as OLIVE. One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods from (Dudik et al., 2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first root-n-regret result for reinforcement learning in stochastic MDPs with general value function approximation.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/dong20a.html
http://proceedings.mlr.press/v125/dong20a.htmlConsistent recovery threshold of hidden nearest neighbor graphs Motivated by applications such as discovering strong ties in social networks and assembling genome subsequences in biology, we study the problem of recovering a hidden $2k$-nearest neighbor (NN) graph in an $n$-vertex complete graph, whose edge weights are independent and distributed according to $P_n$ for edges in the hidden $2k$-NN graph and $Q_n$ otherwise. The special case of Bernoulli distributions corresponds to a variant of the Watts-Strogatz small-world graph. We focus on two types of asymptotic recovery guarantees as $n\to \infty$: (1) exact recovery: all edges are classified correctly with probability tending to one; (2) almost exact recovery: the expected number of misclassified edges is $o(nk)$. We show that the maximum likelihood estimator achieves (1) exact recovery for $2 \le k \le n^{o(1)}$ if $ \liminf \frac{2\alpha_n}{\log n}>1$; (2) almost exact recovery for $ 1 \le k \le o\left( \frac{\log n}{\log \log n} \right)$ if $ \liminf \frac{kD(P_n||Q_n)}{\log n}>1, $ where $\alpha_n \triangleq -2 \log \int \sqrt{d P_n d Q_n}$ is the Rényi divergence of order $\frac{1}{2}$ and $D(P_n||Q_n)$ is the Kullback-Leibler divergence. Under mild distributional assumptions, these conditions are shown to be information-theoretically necessary for any algorithm to succeed. A key challenge in the analysis is the enumeration of $2k$-NN graphs that differ from the hidden one by a given number of edges. We also analyze several computationally efficient algorithms and provide sufficient conditions under which they achieve exact/almost exact recovery. In particular, we develop a polynomial-time algorithm that attains the threshold for exact recovery under the small-world model.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/ding20a.html
http://proceedings.mlr.press/v125/ding20a.htmlAlgorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks We study the problem of PAC learning one-hidden-layer ReLU networks with $k$ hidden units on $\mathbb{R}^d$ under Gaussian marginals in the presence of additive label noise. For the case of positive coefficients, we give the first polynomial-time algorithm for this learning problem for $k$ up to $\tilde{O}(\sqrt{\log d})$. Previously, no polynomial time algorithm was known, even for $k=3$. This answers an open question posed by Klivans (2017). Importantly, our algorithm does not require any assumptions about the rank of the weight matrix and its complexity is independent of its condition number. On the negative side, for the more general task of PAC learning one-hidden-layer ReLU networks with arbitrary real coefficients, we prove a Statistical Query lower bound of $d^{\Omega(k)}$. Thus, we provide a separation between the two classes in terms of efficient learnability. Our upper and lower bounds are general, extending to broader families of activation functions. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/diakonikolas20d.html
http://proceedings.mlr.press/v125/diakonikolas20d.htmlLearning Halfspaces with Massart Noise Under Structured Distributions We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach is extremely simple: We identify a smooth {\em non-convex} surrogate loss with the property that any approximate stationary point of this loss defines a halfspace that is close to the target halfspace. Given this structural result, we can use SGD to solve the underlying learning problem.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/diakonikolas20c.html
http://proceedings.mlr.press/v125/diakonikolas20c.htmlApproximation Schemes for ReLU Regression We consider the fundamental problem of ReLU regression, where the goal is to output the best fitting ReLU with respect to square loss given access to draws from some unknown distribution. We give the first efficient, constant-factor approximation algorithm for this problem assuming the underlying distribution satisfies some weak concentration and anti-concentration conditions (and includes, for example, all log-concave distributions). This solves the main open problem of Goel et al., who proved hardness results for any exact algorithm for ReLU regression (up to an additive $\epsilon$). Using more sophisticated techniques, we can improve our results and obtain a polynomial-time approximation scheme for any subgaussian distribution. Given the aforementioned hardness results, these guarantees can not be substantially improved. Our main insight is a new characterization of {\em surrogate losses} for nonconvex activations. While prior work had established the existence of convex surrogates for monotone activations, we show that properties of the underlying distribution actually induce strong convexity for the loss, allowing us to relate the global minimum to the activation’s {\em Chow parameters}. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/diakonikolas20b.html
http://proceedings.mlr.press/v125/diakonikolas20b.htmlHalpern Iteration for Near-Optimal and Parameter-Free Monotone Inclusion and Strong Solutions to Variational Inequalities We leverage the connections between nonexpansive maps, monotone Lipschitz operators, and proximal mappings to obtain near-optimal (i.e., optimal up to poly-log factors in terms of iteration complexity) and parameter-free methods for solving monotone inclusion problems. These results immediately translate into near-optimal guarantees for approximating strong solutions to variational inequality problems, approximating convex-concave min-max optimization problems, and minimizing the norm of the gradient in min-max optimization problems. Our analysis is based on a novel and simple potential-based proof of convergence of Halpern iteration, a classical iteration for finding fixed points of nonexpansive maps. Additionally, we provide a series of algorithmic reductions that highlight connections between different problem classes and lead to lower bounds that certify near-optimality of the studied methods.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/diakonikolas20a.html
http://proceedings.mlr.press/v125/diakonikolas20a.htmlHigh probability guarantees for stochastic convex optimization Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on “light-tail” noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/davis20a.html
http://proceedings.mlr.press/v125/davis20a.htmlPAC learning with stable and private predictions We study binary classification algorithms for which the prediction on any point is not too sensitive to individual examples in the dataset. Specifically, we consider the notions of uniform stability (Bousquet and Elisseeff, 2001) and prediction privacy (Dwork and Feldman, 2018). Previous work on these notions shows how they can be achieved in the standard PAC model via simple aggregation of models trained on disjoint subsets of data. Unfortunately, this approach leads to a significant overhead in terms of sample complexity. Here we demonstrate several general approaches to stable and private prediction that either eliminate or significantly reduce the overhead. Specifically, we demonstrate that for any class $C$ of VC dimension $d$ there exists a $\gamma$-uniformly stable algorithm for learning $C$ with excess error $\alpha$ using $\tilde O(d/(\alpha\gamma) + d/\alpha^2)$ samples. We also show that this bound is nearly tight. For $\eps$-differentially private prediction we give two new algorithms: one using $\tilde O(d/(\alpha^2\eps))$ samples and another one using $\tilde O(d^2/(\alpha\eps) + d/\alpha^2)$ samples. The best previously known bounds for these problems are $O(d/(\alpha^2\gamma))$ and $O(d/(\alpha^3\eps))$, respectively.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/dagan20a.html
http://proceedings.mlr.press/v125/dagan20a.htmlOptimal Group Testing In the group testing problem, which goes back to the work of Dorfman (1943), we aim to identify a small set of $k\sim n^\theta$ infected individuals out of a population size $n$, $0<\theta<1$.We avail ourselves to a test procedure that can test a group of individuals, with the test returning a positive result iff at least one individual in the group is infected. All tests are conducted in parallel. The aim is to devise a test design with as few tests as possible so that the infected individuals can be identified with high probability. We establish an explicit sharp information-theoretic/algorithmic phase transition $m_{inf}$, showing that with more than $\minf$ tests the infected individuals can be identified in polynomial time, while this is impossible with fewer tests. In addition, we obtain an optimal two-stage adaptive group testing scheme. These results resolve problems prominently posed in [Aldridge et al. 2019, Johnson et al. 2018, Mézard and Toninelli 2011].Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/coja-oghlan20a.html
http://proceedings.mlr.press/v125/coja-oghlan20a.htmlPessimism About Unknown Unknowns Inspires Conservatism If we could define the set of all bad outcomes, we could hard-code an agent which avoids them; however, in sufficiently complex environments, this is infeasible. We do not know of any general-purpose approaches in the literature to avoiding novel failure modes. Motivated by this, we define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models. We call this agent pessimistic, since it optimizes assuming the worst case. A scalar parameter tunes the agent’s pessimism by changing the size of the set of world-models taken into account. Our first main contribution is: given an assumption about the agent’s model class, a sufficiently pessimistic agent does not cause “unprecedented events” with probability $1-\delta$, whether or not designers know how to precisely specify those precedents they are concerned with. Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy we would like to improve. Our other main contribution is that the agent’s policy’s value approaches at least that of the mentor, while the probability of deferring to the mentor goes to 0. In high-stakes environments, we might like advanced artificial agents to pursue goals cautiously, which is a non-trivial problem even if the agent were allowed arbitrary computing power; we present a formal solution.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/cohen20a.html
http://proceedings.mlr.press/v125/cohen20a.htmlODE-Inspired Analysis for the Biological Version of Oja’s Rule in Solving Streaming PCA Oja’s rule [Oja, Journal of mathematical biology 1982] is a well-known biologically-plausible algorithm using a Hebbian-type synaptic update rule to solve streaming principal component analysis (PCA). Computational neuroscientists have known that this biological version of Oja’s rule converges to the top eigenvector of the covariance matrix of the input in the limit. However, prior to this work, it was open to prove any convergence rate guarantee. In this work, we give the first convergence rate analysis for the biological version of Oja’s rule in solving streaming PCA. Moreover, our convergence rate matches the information theoretical lower bound up to logarithmic factors and outperforms the state-of-the-art upper bound for streaming PCA. Furthermore, we develop a novel framework inspired by ordinary differential equations (ODE) to analyze general stochastic dynamics. The framework abandons the traditional \textit{step-by-step} analysis and instead analyzes a stochastic dynamic in \textit{one-shot} by giving a closed-form solution to the entire dynamic. The one-shot framework allows us to apply stopping time and martingale techniques to have a flexible and precise control on the dynamic. We believe that this general framework is powerful and should lead to effective yet simple analysis for a large class of problems with stochastic dynamics.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/chou20a.html
http://proceedings.mlr.press/v125/chou20a.htmlImplicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In presence of hidden low-dimensional structures, the resulting margin is independent of the ambiant dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/chizat20a.html
http://proceedings.mlr.press/v125/chizat20a.htmlGradient descent algorithms for Bures-Wasserstein barycenters We study first order methods to compute the barycenter of a probability distribution $P$ over the space of probability measures with finite second moment. We develop a framework to derive global rates of convergence for both gradient descent and stochastic gradient descent despite the fact that the barycenter functional is not geodesically convex. Our analysis overcomes this technical hurdle by employing a Polyak-Ł{}ojasiewicz (PL) inequality and relies on tools from optimal transport and metric geometry. In turn, we establish a PL inequality when $P$ is supported on the Bures-Wasserstein manifold of Gaussian probability measures. It leads to the first global rates of convergence for first order methods in this context. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/chewi20a.html
http://proceedings.mlr.press/v125/chewi20a.htmlThe Influence of Shape Constraints on the Thresholding Bandit Problem We investigate the stochastic \emph{Thresholding Bandit problem} (\textit{TBP}) under several \emph{shape constraints}. On top of (i) the vanilla, unstructured \textit{TBP}, we consider the case where (ii) the sequence of arm’s means $(\mu_k)_k$ is monotonically increasing \textit{MTBP}, (iii) the case where $(\mu_k)_k$ is unimodal \textit{UTBP} and (iv) the case where $(\mu_k)_k$ is concave \textit{CTBP}. In the \textit{TBP} problem the aim is to output, at the end of the sequential game, the set of arms whose means are above a given threshold. The regret is the highest gap between a misclassified arm and the threshold. In the fixed budget setting, we provide \emph{problem independent} minimax rates for the expected regret in all settings, as well as associated algorithms. We prove that the minimax rates for the regret are (i) $\sqrt{\log(K)K/T}$ for \textit{TBP}, (ii) $\sqrt{\log(K)/T}$ for \textit{MTBP}, (iii) $\sqrt{K/T}$ for \textit{UTBP} and (iv) $\sqrt{\log\log K/T}$ for \textit{CTBP}, where $K$ is the number of arms and $T$ is the budget. These rates demonstrate that \textit{the dependence on $K$} of the minimax regret varies significantly depending on the shape constraint. This highlights the fact that the shape constraints modify fundamentally the nature of the \textit{TBP}.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/cheshire20a.html
http://proceedings.mlr.press/v125/cheshire20a.htmlLearning Polynomials in Few Relevant Dimensions Polynomial regression is a basic primitive in learning and statistics. In its most basic form the goal is to fit a degree $d$ polynomial to a response variable $y$ in terms of an $n$-dimensional input vector $x$. This is extremely well-studied with many applications and has sample and runtime complexity $\Theta(n^d)$. Can one achieve better runtime if the intrinsic dimension of the data is much smaller than the ambient dimension $n$? Concretely, we are given samples $(x,y)$ where $y$ is a degree at most $d$ polynomial in an unknown $r$-dimensional projection (the relevant dimensions) of $x$. This can be seen both as a generalization of phase retrieval and as a special case of learning multi-index models where the link function is an unknown low-degree polynomial. Note that without distributional assumptions, this is at least as hard as junta learning. In this work we consider the important case where the covariates are Gaussian. We give an algorithm that learns the polynomial within accuracy $\epsilon$ with sample complexity that is roughly $N = O_{r,d}(n \log^2(1/\epsilon) (\log n)^d)$ and runtime $O_{r,d}(N n^2)$. Prior to our work, no such results were known even for the case of $r=1$. We introduce a new \emph{filtered PCA} approach to get a warm start for the true subspace and use \emph{geodesic SGD} to boost to arbitrary accuracy; our techniques may be of independent interest, especially for problems dealing with subspace recovery or analyzing SGD on manifolds.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/chen20a.html
http://proceedings.mlr.press/v125/chen20a.htmlBounds in query learning We introduce new combinatorial quantities for concept classes, and prove lower and upper bounds for learning complexity in several models of learning in terms of various combinatorial quantities. In the setting of equivalence plus membership queries, we give an algorithm which learns a class in polynomially many queries whenever any such algorithm exists. Our approach is flexible and powerful enough to give new and very short proofs of the efficient learnability of several prominent examples (e.g. regular languages and regular $\omega$-languages), in some cases also producing new bounds on the number of queries.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/chase20a.html
http://proceedings.mlr.press/v125/chase20a.htmlThe estimation error of general first order methods Modern large-scale statistical models require the estimation of thousands to millions of parameters. This is often accomplished by iterative algorithms such as gradient descent, projected gradient descent or their accelerated versions. What are the fundamental limits of these approaches? This question is well understood from an optimization viewpoint when the underlying objective is convex. Work in this area characterizes the gap to global optimality as a function of the number of iterations. However, these results have only indirect implications on the gap to \emph{statistical} optimality. Here we consider two families of high-dimensional estimation problems: high-dimensional regression and low-rank matrix estimation, and introduce a class of ‘general first order methods’ that aim at efficiently estimating the underlying parameters. This class of algorithms is broad enough to include classical first order optimization (for convex and non-convex objectives), but also other types of algorithms. Under a random design assumption, we derive lower bounds on the estimation error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge. These lower bounds are optimal in the sense that there exist algorithms in this class whose estimation error matches the lower bounds up to asymptotically negligible terms. We illustrate our general results through applications to sparse phase retrieval and sparse principal component analysis.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/celentano20a.html
http://proceedings.mlr.press/v125/celentano20a.htmlEfficient, Noise-Tolerant, and Private Learning via Boosting We introduce a simple framework for designing private boosting algorithms. We give natural conditions under which these algorithms are differentially private, efficient, and noise-tolerant PAC learners. To demonstrate our framework, we use it to construct noise-tolerant and private PAC learners for large-margin halfspaces whose sample complexity does not depend on the dimension. We give two sample complexity bounds for our large-margin halfspace learner. One bound is based only on differential privacy, and uses this guarantee as an asset for ensuring generalization. This first bound illustrates a general methodology for obtaining PAC learners from privacy, which may be of independent interest. The second bound uses standard techniques from the theory of large-margin classification (the fat-shattering dimension) to match the best known sample complexity for differentially private learning of large-margin halfspaces, while additionally tolerating random label noise.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bun20a.html
http://proceedings.mlr.press/v125/bun20a.htmlHighly smooth minimization of non-smooth problems We establish improved rates for structured \emph{non-smooth} optimization problems by means of near-optimal higher-order accelerated methods. In particular, given access to a standard oracle model that provides a $p^{th}$ order Taylor expansion of a \emph{smoothed} version of the function, we show how to achieve $\eps$-optimality for the \emph{original} problem in $\tilde{O}_p\pa{\eps^{-\frac{2p+2}{3p+1}}}$ calls to the oracle. Furthermore, when $p=3$, we provide an efficient implementation of the near-optimal accelerated scheme that achieves an $O(\eps^{-4/5})$ iteration complexity, where each iteration requires $\tilde{O}(1)$ calls to a linear system solver. Thus, we go beyond the previous $O(\eps^{-1})$ barrier in terms of $\eps$ dependence, and in the case of $\ell_\infty$ regression and $\ell_1$-SVM, we establish overall improvements for some parameter settings in the moderate-accuracy regime. Our results also lead to improved high-accuracy rates for minimizing a large class of convex quartic polynomials.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bullins20a.html
http://proceedings.mlr.press/v125/bullins20a.htmlNon-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit problem. The model assumes no communication and no shared randomness at all between the players, and furthermore when two (or more) players select the same action this results in a maximal loss. We prove the first $\sqrt{T}$-type regret guarantee for this problem, assuming only two players, and under the feedback model where collisions are announced to the colliding players. We also prove the first sublinear regret guarantee for the feedback model where collision information is not available, namely $T^{1-\frac{1}{2m}}$ where $m$ is the number of players.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bubeck20c.html
http://proceedings.mlr.press/v125/bubeck20c.htmlHow to Trap a Gradient Flow We consider the problem of finding an $\varepsilon$-approximate stationary point of a smooth function on a compact domain of $\R^d$. In contrast with dimension-free approaches such as gradient descent, we focus here on the case where $d$ is finite, and potentially small. This viewpoint was explored in 1993 by Vavasis, who proposed an algorithm which, for {\em any fixed finite dimension $d$}, improves upon the $O(1/\varepsilon^2)$ oracle complexity of gradient descent. For example for $d=2$, Vavasis’ approach obtains the complexity $O(1/\varepsilon)$. Moreover for $d=2$ he also proved a lower bound of $\Omega(1/\sqrt{\varepsilon})$ for deterministic algorithms (we extend this result to randomized algorithms). Our main contribution is an algorithm, which we call {\em gradient flow trapping} (GFT), and the analysis of its oracle complexity. In dimension $d=2$, GFT closes the gap with Vavasis’ lower bound (up to a logarithmic factor), as we show that it has complexity $O\left(\sqrt{\frac{\log(1/\varepsilon)}{\varepsilon}}\right)$. In dimension $d=3$, we show a complexity of $O\left(\frac{\log(1/\varepsilon)}{\varepsilon}\right)$, improving upon Vavasis’ $O\left(1 / \varepsilon^{1.2} \right)$. In higher dimensions, GFT has the remarkable property of being a {\em logarithmic parallel depth} strategy, in stark contrast with the polynomial depth of gradient descent or Vavasis’ algorithm. In this higher dimensional regime, the total work of GFT improves quadratically upon the only other known polylogarithmic depth strategy for this problem, namely naive grid search.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bubeck20b.html
http://proceedings.mlr.press/v125/bubeck20b.htmlCoordination without communication: optimal regret in two players multi-armed bandits We consider two agents playing simultaneously the same stochastic three-armed bandit problem. The two agents are cooperating but they cannot communicate. Under the assumption that shared randomness is available, we propose a strategy with no collisions at all between the players (with very high probability), and with near-optimal regret $O(\sqrt{T \log(T)})$. We also argue that the extra logarithmic term $\sqrt{\log(T)}$ should be necessary by proving a lower bound for a full information variant of the problem.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bubeck20a.html
http://proceedings.mlr.press/v125/bubeck20a.htmlID3 Learns Juntas for Smoothed Product Distributions In recent years, there are many attempts to understand popular heuristics. An example of such heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns in polynomial time $k$-Juntas, in the smoothed analysis model of Kalai and Teng (2008). That is, we show a learnability result when the observed distribution is a “noisy” variant of the original distribution.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/brutzkus20a.html
http://proceedings.mlr.press/v125/brutzkus20a.htmlA Corrective View of Neural Networks: Representation, Memorization and Learning We develop a \emph{corrective mechanism} for neural network approximation: the total available non-linear units are divided into multiple groups and the first group approximates the function under consideration, the second approximates the error in approximation produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together and so on. This technique yields several new representation and learning results for neural networks: 1. Two-layer neural networks in the random features regime (RF) can memorize arbitrary labels for $n$ arbitrary points in $\mathbb{R}^d$ with $\tilde{O}(\tfrac{n}{\theta^4})$ ReLUs, where $\theta$ is the minimum distance between two different points. This bound can be shown to be optimal in $n$ up to logarithmic factors. 2. Two-layer neural networks with ReLUs and smoothed ReLUs can represent functions with an error of at most $\epsilon$ with $O(C(a,d)\epsilon^{-1/(a+1)})$ units for $a \in \mathbb{N}\cup\{0\}$ when the function has $\Theta(ad)$ bounded derivatives. In certain cases $d$ can be replaced with effective dimension $q \ll d$. Our results indicate that neural networks with only a single nonlinear layer are surprisingly powerful with regards to representation, and show that in contrast to what is suggested in recent work, depth is not needed in order to represent highly smooth functions. 3. Gradient Descent on the recombination weights of a two-layer random features network with ReLUs and smoothed ReLUs can learn low degree polynomials up to squared error $\epsilon$ with $\mathrm{subpoly}(1/\epsilon)$ units. Even though deep networks can approximate these polynomials with $\mathrm{polylog}(1/\epsilon)$ units, existing \emph{learning} bounds for this problem require $\mathrm{poly}(1/\epsilon)$ units. To the best of our knowledge, our results give the first sub-polynomial learning guarantees for this problem. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bresler20a.html
http://proceedings.mlr.press/v125/bresler20a.htmlReducibility and Statistical-Computational Gaps from Secret Leakage Inference problems with conjectured statistical-computational gaps are ubiquitous throughout modern statistics, computer science, statistical physics and discrete probability. While there has been success evidencing these gaps from the failure of restricted classes of algorithms, progress towards a more traditional reduction-based approach to computational complexity in statistical inference has been limited. These average-case problems are each tied to a different natural distribution, high-dimensional structure and conjecturally hard parameter regime, leaving reductions among them technically challenging. Despite a flurry of recent success in developing such techniques, existing reductions have largely been limited to inference problems with similar structure – primarily mapping among problems representable as a sparse submatrix signal plus a noise matrix, which is similar to the common starting hardness assumption of planted clique ($\textsc{pc}$). The insight in this work is that a slight generalization of the planted clique conjecture – secret leakage planted clique ($\textsc{pc}_\rho$), wherein a small amount of information about the hidden clique is revealed – gives rise to a variety of new average-case reduction techniques, yielding a web of reductions relating statistical problems with very different structure. Based on generalizations of the planted clique conjecture to specific forms of $\textsc{pc}_\rho$, we deduce tight statistical-computational tradeoffs for a diverse range of problems including robust sparse mean estimation, mixtures of sparse linear regressions, robust sparse linear regression, tensor PCA, variants of dense $k$-block stochastic block models, negatively correlated sparse PCA, semirandom planted dense subgraph, detection in hidden partition models and a universality principle for learning sparse mixtures. This gives the first reduction-based evidence for a number of conjectured statistical-computational gaps. We introduce a number of new average-case reduction techniques that also reveal novel connections to combinatorial designs based on the incidence geometry of $\mathbb{F}_r^t$ and to random matrix theory. In particular, we show a convergence result between Wishart and inverse Wishart matrices that may be of independent interest. The specific hardness conjectures for $\textsc{pc}_\rho$ implying our statistical-computational gaps all are in correspondence with natural graph problems such as $k$-partite, bipartite and hypergraph variants of $\textsc{pc}$. Hardness in a $k$-partite hypergraph variant of $\textsc{pc}$ is the strongest of these conjectures and sufficient to establish all of our computational lower bounds. We also give evidence for our $\textsc{pc}_\rho$ hardness conjectures from the failure of low-degree polynomials and statistical query algorithms. Our work raises a number of open problems and suggests that previous technical obstacles to average-case reductions may have arisen because planted clique is not the right starting point. An expanded set of hardness assumptions, such as $\textsc{pc}_\rho$, may be a key first step towards a more complete theory of reductions among statistical problems.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/brennan20a.html
http://proceedings.mlr.press/v125/brennan20a.htmlThe Gradient Complexity of Linear Regression We investigate the computational complexity of several basic linear algebra primitives, including largest eigenvector computation and linear regression, in the computational model that allows access to the data via a matrix-vector product oracle. We show that for polynomial accuracy, $\Theta(d)$ calls to the oracle are necessary and sufficient even for a randomized algorithm. Our lower bound is based on a reduction to estimating the least eigenvalue of a random Wishart matrix. This simple distribution enables a concise proof, leveraging a few key properties of the random Wishart ensemble.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/braverman20a.html
http://proceedings.mlr.press/v125/braverman20a.htmlSharper Bounds for Uniformly Stable Algorithms Deriving generalization bounds for stable algorithms is a classical question in learning theory taking its roots in the early works by Vapnik and Chervonenkis (1974) and Rogers and Wagner (1978). In a series of recent breakthrough papers by Feldman and Vondrak (2018, 2019), it was shown that the best known high probability upper bounds for uniformly stable learning algorithms due to Bousquet and Elisseef (2002) are sub-optimal in some natural regimes. To do so, they proved two generalization bounds that significantly outperform the simple generalization bound of Bousquet and Elisseef (2002). Feldman and Vondrak also asked if it is possible to provide sharper bounds and prove corresponding high probability lower bounds. This paper is devoted to these questions: firstly, inspired by the original arguments of Feldman and Vondrak (2019), we provide a short proof of the moment bound that implies the generalization bound stronger than both recent results in Feldman and Vondrak (2018, 2019). Secondly, we prove general lower bounds, showing that our moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. Our main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bousquet20b.html
http://proceedings.mlr.press/v125/bousquet20b.htmlProper Learning, Helly Number, and an Optimal SVM Bound The classical PAC sample complexity bounds are stated for any Empirical Risk Minimizer (ERM) and contain an extra logarithmic factor $\log(1/\epsilon)$ which is known to be necessary for ERM in general. It has been recently shown by Hanneke (2016) that the optimal sample complexity of PAC learning for any VC class C does not include this log factor and is achieved by a particular improper learning algorithm, which outputs a specific majority-vote of hypotheses in C. This leaves the question of when this bound can be achieved by proper learning algorithms, which are restricted to always output a hypothesis from C. In this paper we aim to characterize the classes for which the optimal sample complexity can be achieved by a proper learning algorithm. We identify that these classes can be characterized by the dual Helly number, which is a combinatorial parameter that arises in discrete geometry and abstract convexity. In particular, under general conditions on C, we show that the dual Helly number is bounded if and only if there is a proper learner that obtains the optimal dependence on $\epsilon$. As further implications of our techniques we resolve a long-standing open problem posed by Vapnik and Chervonenkis (1974) on the performance of the Support Vector Machine by proving that the sample complexity of SVM in the realizable case is $\Theta((n/\epsilon)+(1/\epsilon)\log(1/\delta))$, where $n$ is the dimension. This gives the first optimal PAC bound for Halfspaces achieved by a proper learning algorithm, and moreover is computationally efficient.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bousquet20a.html
http://proceedings.mlr.press/v125/bousquet20a.htmlSelfish Robustness and Equilibria in Multi-Player Bandits Motivated by cognitive radios, stochastic multi-player multi-armed bandits gained a lot of interest recently. In this class of problems, several players simultaneously pull arms and encounter a collision – with 0 reward – if some of them pull the same arm at the same time. While the cooperative case where players maximize the collective reward (obediently following some fixed protocol) has been mostly considered, robustness to malicious players is a crucial and challenging concern. Existing approaches consider only the case of adversarial jammers whose objective is to blindly minimize the collective reward. We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare. We provide the first algorithm robust to selfish players (a.k.a. Nash equilibrium) with a logarithmic regret, when the arm performance is observed. When collisions are also observed, Grim Trigger type of strategies enable some implicit communication-based algorithms and we construct robust algorithms in two different settings: the homogeneous (with a regret comparable to the centralized optimal one) and heterogeneous cases (for an adapted and relevant notion of regret). We also provide impossibility results when only the reward is observed or when arm means vary arbitrarily among players.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/boursier20a.html
http://proceedings.mlr.press/v125/boursier20a.htmlHardness of Identity Testing for Restricted Boltzmann Machines and Potts models We study identity testing for restricted Boltzmann machines (RBMs), and more generally for undirected graphical models. Given sample access to the Gibbs distribution corresponding to an unknown or hidden model $M^*$ and given an explicit model $M$, can we distinguish if either $M = M^*$ or if they are (statistically) far apart? Daskalakis et al. (2018) presented a polynomial-time algorithm for identity testing for the ferromagnetic (attractive) Ising model. In contrast, for the antiferromagnetic (repulsive) Ising model, Bezáková et al. (2019) proved that unless $RP=NP$ there is no identity testing algorithm when $\beta d=\omega(\log{n})$, where $d$ is the maximum degree of the visible graph and $\beta$ is the largest edge weight (in absolute value). We prove analogous hardness results for RBMs (i.e., mixed Ising models on bipartite graphs), even when there are no latent variables or an external field. Specifically, we show that if $RP\neq NP$, then when $\beta d=\omega(\log{n})$ there is no polynomial-time algorithm for identity testing for RBMs; when $\beta d =O(\log{n})$ there is an efficient identity testing algorithm that utilizes the structure learning algorithm of Klivans and Meka (2017). In addition, we prove similar lower bounds for purely ferromagnetic RBMs with inconsistent external fields, and for the ferromagnetic Potts model. Previous hardness results for identity testing of Bezáková et al. (2019) utilized the hardness of finding the maximum cuts, which corresponds to the ground states of the antiferromagnetic Ising model. Since RBMs are on bipartite graphs such an approach is not feasible. We instead introduce a novel methodology to reduce from the corresponding approximate counting problem and utilize the phase transition that is exhibited by RBMs and the mean-field Potts model. We believe that our method is general, and that it can be used to establish the hardness of identity testing for other spin systems.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/blanca20a.html
http://proceedings.mlr.press/v125/blanca20a.htmlImplicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards “simple” models. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/blanc20a.html
http://proceedings.mlr.press/v125/blanc20a.htmlFree Energy Wells and Overlap Gap Property in Sparse PCA We study a variant of the sparse PCA (principal component analysis) problem in the “hard” regime, where the inference task is possible yet no polynomial-time algorithm is known to exist. Prior work, based on the low-degree likelihood ratio, has conjectured a precise expression for the best possible (sub-exponential) runtime throughout the hard regime. Following instead a statistical physics inspired point of view, we show bounds on the depth of free energy wells for various Gibbs measures naturally associated to the problem. These free energy wells imply hitting time lower bounds that corroborate the low-degree conjecture: we show that a class of natural MCMC (Markov chain Monte Carlo) methods (with worst-case initialization) cannot solve sparse PCA with less than the conjectured runtime. These lower bounds apply to a wide range of values for two tuning parameters: temperature and sparsity misparametrization. Finally, we prove that the Overlap Gap Property (OGP), a structural property that implies failure of certain local search algorithms, holds in a significant part of the hard regime.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/ben-arous20a.html
http://proceedings.mlr.press/v125/ben-arous20a.htmlComplexity Guarantees for Polyak Steps with Momentum In smooth strongly convex optimization, knowledge of the strong convexity parameter is critical for obtaining simple methods with accelerated rates. In this work, we study a class of methods, based on Polyak steps, where this knowledge is substituted by that of the optimal value, $f_*$. We first show slightly improved convergence bounds than previously known for the classical case of simple gradient descent with Polyak steps, we then derive an accelerated gradient method with Polyak steps and momentum, along with convergence guarantees.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/barre20a.html
http://proceedings.mlr.press/v125/barre20a.htmlCalibrated Surrogate Losses for Adversarially Robust Classification Adversarially robust classification seeks a classifier that is insensitive to adversarial perturbations of test patterns. This problem is often formulated via a minimax objective, where the target loss is the worst-case value of the 0-1 loss subject to a bound on the size of perturbation. Recent work has proposed convex surrogates for the adversarial 0-1 loss, in an effort to make optimization more tractable. In this work, we consider the question of which surrogate losses are \emph{calibrated} with respect to the adversarial 0-1 loss, meaning that minimization of the former implies minimization of the latter. We show that no convex surrogate loss is calibrated with respect to the adversarial 0-1 loss when restricted to the class of linear models. We further introduce a class of nonconvex losses and offer necessary and sufficient conditions for losses in this class to be calibrated. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bao20a.html
http://proceedings.mlr.press/v125/bao20a.htmlFinite Regret and Cycles with Fixed Step-Size via Alternating Gradient Descent-Ascent Gradient descent is arguably one of the most popular online optimization methods with a wide array of applications. However, the standard implementation where agents simultaneously update their strategies yields several undesirable properties; strategies diverge away from equilibrium and regret grows over time. In this paper, we eliminate these negative properties by considering a different implementation to obtain $O\left( \nicefrac{1}{T}\right)$ time-average regret via arbitrary fixed step-size. We obtain this surprising property by having agents take turns when updating their strategies. In this setting, we show that an agent that uses gradient descent with any linear loss function obtains bounded regret – regardless of how their opponent updates their strategies. Furthermore, we show that in adversarial settings that agents’ strategies are bounded and cycle when both are using the alternating gradient descent algorithm.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/bailey20a.html
http://proceedings.mlr.press/v125/bailey20a.htmlActive Local Learning In this work we consider active {\em local learning}: given a query point $x$, and active access to an unlabeled training set $S$, output the prediction $h(x)$ of a near-optimal $h \in H$ using significantly fewer labels than would be needed to actually learn $h$ fully. In particular, the number of label queries should be independent of the complexity of $H$, and the function $h$ should be well-defined, independent of $x$. This immediately also implies an algorithm for {\em distance estimation}: estimating the value $opt(H)$ from many fewer labels than needed to actually learn a near-optimal $h \in H$, by running local learning on a few random query points and computing the average error. For the hypothesis class consisting of functions supported on the interval $[0,1]$ with Lipschitz constant bounded by $L$, we present an algorithm that makes $O(({1 / \epsilon^6}) \log(1/\epsilon))$ label queries from an unlabeled pool of $O(({L / \epsilon^4})\log(1/\epsilon))$ samples. It estimates the distance to the best hypothesis in the class to an additive error of $\epsilon$ for an arbitrary underlying distribution. We further generalize our algorithm to more than one dimensions. We emphasize that the number of labels used is independent of the complexity of the hypothesis class which is linear in $L$ in the one-dimensional case. Furthermore, we give an algorithm to locally estimate the values of a near-optimal function at a few query points of interest with number of labels independent of $L$. We also consider the related problem of approximating the minimum error that can be achieved by the Nadaraya-Watson estimator under a linear diagonal transformation with eigenvalues coming from a small range. For a $d$-dimensional pointset of size $N$, our algorithm achieves an additive approximation of $\epsilon$, makes $\tilde{O}({d}/{\epsilon^2})$ queries and runs in $\tilde{O}({d^2}/{\epsilon^{d+4}}+{dN}/{\epsilon^2})$ time. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/backurs20a.html
http://proceedings.mlr.press/v125/backurs20a.htmlEstimating Principal Components under Adversarial Perturbations Robustness is a key requirement for widespread deployment of machine learning algorithms, and has received much attention in both statistics and computer science. We study a natural model of robustness for high-dimensional statistical estimation problems that we call the {\em adversarial perturbation model}. An adversary can perturb {\em every} sample arbitrarily up to a specified magnitude $\delta$ measured in some $\ell_q$ norm, say $\ell_\infty$. Our model is motivated by emerging paradigms such as {\em low precision machine learning} and {\em adversarial training}. We study the classical problem of estimating the top-$r$ principal subspace of the Gaussian covariance matrix in high dimensions, under the adversarial perturbation model. We design a computationally efficient algorithm that given corrupted data, recovers an estimate of the top-$r$ principal subspace with error that depends on a robustness parameter $\kappa$ that we identify. This parameter corresponds to the $q \to 2$ operator norm of the projector onto the principal subspace, and generalizes well-studied analytic notions of sparsity. Additionally, in the absence of corruptions, our algorithmic guarantees recover existing bounds for problems such as sparse PCA and its higher rank analogs. We also prove that the above dependence on the parameter $\kappa$ is almost optimal asymptotically, not just in a minimax sense, but remarkably for {\em every} instance of the problem. This {\em instance-optimal} guarantee shows that the $q \to 2$ operator norm of the subspace essentially {\em characterizes} the estimation error under adversarial perturbations.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/awasthi20a.html
http://proceedings.mlr.press/v125/awasthi20a.htmlData-driven confidence bands for distributed nonparametric regression Gaussian Process Regression and Kernel Ridge Regression are popular nonparametric regression approaches. Unfortunately, they suffer from high computational complexity rendering them inapplicable to the modern massive datasets. To that end a number of approximations have been suggested, some of them allowing for a distributed implementation. One of them is the divide and conquer approach, splitting the data into a number of partitions, obtaining the local estimates and finally averaging them. In this paper we suggest a novel computationally efficient fully data-driven algorithm, quantifying uncertainty of this method, yielding frequentist $L_2$-confidence bands. We rigorously demonstrate validity of the algorithm. Another contribution of the paper is a minimax-optimal high-probability bound for the averaged estimator, complementing and generalizing the known risk bounds.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/avanesov20a.html
http://proceedings.mlr.press/v125/avanesov20a.htmlSecond-Order Information in Non-Convex Stochastic Optimization: Power and Limitations We design an algorithm which finds an $\epsilon$-approximate stationary point (with $\|\nabla F(x)\|\le \epsilon$) using $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and—surprisingly—that it cannot be improved using stochastic $p$th order methods for any $p\ge 2$, even when the first $p$ derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding $(\epsilon,\gamma)$-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/arjevani20a.html
http://proceedings.mlr.press/v125/arjevani20a.htmlDimension-Free Bounds for Chasing Convex Functions We consider the problem of chasing convex functions, where functions arrive over time. The player takes actions after seeing the function, and the goal is to achieve a small function cost for these actions, as well as a small cost for moving between actions. While the general problem requires a polynomial dependence on the dimension, we show how to get dimension-independent bounds for well-behaved functions. In particular, we consider the case where the convex functions are $\kappa$-well-conditioned, and give an algorithm that achieves an $O(\sqrt \kappa)$-competitiveness. Moreover, when the functions are supported on $k$-dimensional affine subspaces—e.g., when the function are the indicators of some affine subspaces—we get $O(\min(k, \sqrt{k \log T}))$-competitive algorithms for request sequences of length $T$. We also show some lower bounds, that well-conditioned functions require $\Omega(\kappa^{1/3})$-competitiveness, and $k$-dimensional functions require $\Omega(\sqrt{k})$-competitiveness.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/argue20a.html
http://proceedings.mlr.press/v125/argue20a.htmlPan-Private Uniformity Testing A centrally differentially private algorithm maps raw data to differentially private outputs. In contrast, a locally differentially private algorithm may only access data through public interaction with data holders, and this interaction must be a differentially private function of the data. We study the intermediate model of \emph{pan-privacy}. Unlike a locally private algorithm, a pan-private algorithm receives data in the clear. Unlike a centrally private algorithm, the algorithm receives data one element at a time and must maintain a differentially private internal state while processing this stream. First, we show that pan-privacy against multiple intrusions on the internal state is equivalent to sequentially interactive local privacy. Next, we contextualize pan-privacy against a single intrusion by analyzing the sample complexity of uniformity testing over domain $[k]$. Focusing on the dependence on $k$, centrally private uniformity testing has sample complexity $\Theta(\sqrt{k})$, while noninteractive locally private uniformity testing has sample complexity $\Theta(k)$. We show that the sample complexity of pan-private uniformity testing is $\Theta(k^{2/3})$. By a new $\Omega(k)$ lower bound for the sequentially interactive setting, we also separate pan-private from sequentially interactive locally private and multi-intrusion pan-private uniformity testing.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/amin20a.html
http://proceedings.mlr.press/v125/amin20a.htmlWinnowing with Gradient Descent The performance of multiplicative updates is typically logarithmic in the number of features when the targets are sparse. Strikingly, we show that the same property can also be achieved with gradient descent updates. We obtain this result by rewriting the non-negative weights $w_i$ of multiplicative updates by $u_i^2$ and then performing a gradient descent step w.r.t. the new $u_i$ parameters. We apply this method to the Winnow update, the Hedge update, and the unnormalized and normalized exponentiated gradient (EG) updates for linear regression. When the original weights $w_i$ are scaled to sum to one (as done for Hedge and normalized EG), then in the corresponding reparameterized update, the $u_i$ parameters are now divided by $\Vert\mathbf{u}\Vert_2$ after the gradient descent step. We show that these reparameterizations closely track the original multiplicative updates by proving in each case the same online regret bounds (albeit in some cases, with slightly different constants). As a side, our work exhibits a simple two-layer linear neural network that, when trained with gradient descent, can experimentally solve a certain sparse linear problem (known as the Hadamard problem) with exponentially fewer examples than any kernel method.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/amid20a.html
http://proceedings.mlr.press/v125/amid20a.htmlHierarchical Clustering: A 0.585 Revenue Approximation Hierarchical Clustering trees have been widely accepted as a useful form of clustering data, resulting in a prevalence of adopting fields including phylogenetics, image analysis, bioinformatics and more. Recently, Dasgupta (STOC 16’) initiated the analysis of these types of algorithms through the lenses of approximation. Later, the dual problem was considered by Moseley and Wang (NIPS 17’) dubbing it the Revenue goal function. In this problem, given a nonnegative weight $w_{ij}$ for each pair $i,j \in [n]=\{1,2, \ldots ,n\}$, the objective is to find a tree $T$ whose set of leaves is $[n]$ that maximizes the function $\sum_{i<j \in [n]} w_{ij} (n -|T_{ij}|)$, where $|T_{ij}|$ is the number of leaves in the subtree rooted at the least common ancestor of $i$ and $j$. In our work we consider the revenue goal function and prove the following results. First, we prove the existence of a bisection (i.e., a tree of depth $2$ in which the root has two children, each being a parent of $n/2$ leaves) which approximates the general optimal tree solution up to a factor of $\frac{1}{2}$ (which is tight). Second, we apply this result in order to prove a $\frac{2}{3}p$ approximation for the general revenue problem, where $p$ is defined as the approximation ratio of the \textsc{Max-Uncut Bisection} problem. Since $p$ is known to be at least $0.8776$ (Austrin et al., 2016) (Wu et al., 2015), we get a $0.585$ approximation algorithm for the revenue problem. This improves a sequence of earlier results which culminated in an $0.4246$-approximation guarantee (Ahmadian et al., 2019).Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/alon20b.html
http://proceedings.mlr.press/v125/alon20b.htmlClosure Properties for Private Classification and Online Prediction Let H be a class of boolean functions and consider a composed class H’ that is derived from H using some arbitrary aggregation rule (for example, H’ may be the class of all 3-wise majority-votes of functions in H). We upper bound the Littlestone dimension of H’ in terms of that of H. As a corollary, we derive closure properties for online learning and private PAC learning. The derived bounds on the Littlestone dimension exhibit an undesirable exponential dependence. For private learning, we prove close to optimal bounds that circumvents this suboptimal dependency. The improved bounds on the sample complexity of private learning are derived algorithmically via transforming a private learner for the original class H to a private learner for the composed class H’. Using the same ideas we show that any (proper or improper) private algorithm that learns a class of functions H in the realizable case (i.e., when the examples are labeled by some function in the class) can be transformed to a private algorithm that learns the class H in the agnostic case.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/alon20a.html
http://proceedings.mlr.press/v125/alon20a.htmlFrom Nesterov’s Estimate Sequence to Riemannian Acceleration We propose the first global accelerated gradient method for Riemannian manifolds. Toward establishing our results, we revisit Nesterov’s estimate sequence technique and develop a conceptually simple alternative from first principles. We then extend our analysis to Riemannian acceleration, localizing the key difficulty into “metric distortion.” We control this distortion via a novel geometric inequality, which enables us to formulate and analyze global Riemannian acceleration.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/ahn20a.html
http://proceedings.mlr.press/v125/ahn20a.htmlModel-Based Reinforcement Learning with a Generative Model is Minimax Optimal This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model based planning: is this naïve “plug-in” approach — where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP — non-asymptotically, minimax optimal? Our main result answers this question positively. With regards to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results, we show that using \emph{any} high accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/agarwal20b.html
http://proceedings.mlr.press/v125/agarwal20b.htmlOptimality and Approximation with Policy Gradient Methods in Markov Decision Processes Policy gradient (PG) methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regards to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) “tabular” policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. In the \emph{tabular setting}, our main results are: 1) convergence rate to global optimum for direct parameterization and projected gradient ascent 2) an asymptotic convergence to global optimum for softmax policy parameterization and PG; and a convergence rate with additional entropy regularization, and 3) dimension-free convergence to global optimum for softmax policy parameterization and Natural Policy Gradient (NPG) method with exact gradients. In \emph{function approximation}, we further analyze NPG with exact as well as inexact gradients under certain smoothness assumptions on the policy parameterization and establish rates of convergence in terms of the quality of the initial state distribution. One insight of this work is in formalizing how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place PG methods under a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/agarwal20a.html
http://proceedings.mlr.press/v125/agarwal20a.htmlDistributed Signal Detection under Communication Constraints Independent draws from a $d$-dimensional spherical Gaussian distribution are distributed across users, each holding one sample. A central server seeks to distinguish between the two hypotheses: the distribution has zero mean, or the mean has $\ell_2$-norm at least $\varepsilon$, a pre-specified threshold. However, the users can each transmit at most $\ell$ bits to the server. This is the problem of detecting whether an observed signal is simply white noise in a distributed setting. We study this distributed testing problem with and without the availability of a common randomness shared by the users. We design schemes with and without such shared randomness which achieve sample complexities. We then obtain lower bounds for protocols with public randomness, tight when $\ell=O(1)$. We finally conclude with several conjectures and open problems.Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/acharya20b.html
http://proceedings.mlr.press/v125/acharya20b.htmlDomain Compression and its Application to Randomness-Optimal Distributed Goodness-of-Fit We study goodness-of-fit of discrete distributions in the distributed setting, where samples are divided between multiple users who can only release a limited amount of information about their samples due to various information constraints. Recently, a subset of the authors showed that having access to a common random seed (i.e., shared randomness) leads to a significant reduction in the sample complexity of this problem. In this work, we provide a complete understanding of the interplay between the amount of shared randomness available, the stringency of information constraints, and the sample complexity of the testing problem by characterizing a tight trade-off between these three parameters. We provide a general distributed goodness-of-fit protocol that as a function of the amount of shared randomness interpolates smoothly between the private- and public-coin sample complexities. We complement our upper bound with a general framework to prove lower bounds on the sample complexity of this testing problems under limited shared randomness. Finally, we instantiate our bounds for the two archetypal information constraints of communication and local privacy, and show that our sample complexity bounds are optimal as a function of all the parameters of the problem, including the amount of shared randomness. A key component of our upper bounds is a new primitive of \textit{domain compression}, a tool that allows us to map distributions to a much smaller domain size while preserving their pairwise distances, using a limited amount of randomness. Wed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/acharya20a.html
http://proceedings.mlr.press/v125/acharya20a.htmlConference on Learning Theory 2020: PrefacePreface to the proceedings of the 32nd Conference on Learning TheoryWed, 15 Jul 2020 00:00:00 +0000
http://proceedings.mlr.press/v125/abernethy20a.html
http://proceedings.mlr.press/v125/abernethy20a.html