Proceedings of Machine Learning Research
Proceedings of the Thirty-Fourth Conference on Learning Theory
Held in Boulder, Colorado, USA, 15–19 August 2021
Published as Volume 134 of the Proceedings of Machine Learning Research on 21 July 2021.
Volume Edited by:
Mikhail Belkin
Samory Kpotufe
Series Editors:
Neil D. Lawrence
Mark Reid
https://proceedings.mlr.press/v134/
https://proceedings.mlr.press/v134/2021-07-21-kuzborskij21a.html
Conference on Learning Theory 2021: Post-conference Preface (published 9 October 2021)
https://proceedings.mlr.press/v134/belkin21a.html
Benign Overfitting of Constant-Stepsize SGD for Linear Regression

There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression.
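For concreteness, the procedure this abstract analyzes, constant-stepsize SGD with iterate averaging for least squares, can be sketched in a few lines. This is only an illustrative toy (the step size, problem dimensions, and averaging over all iterates are assumptions of the sketch, not choices made in the paper):

```python
import numpy as np

def sgd_iterate_average(X, y, stepsize=0.01, n_passes=200, rng=None):
    """Constant-stepsize SGD for least squares, returning the average of all iterates."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)              # initial iterate
    w_sum = np.zeros(d)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            g = (X[i] @ w - y[i]) * X[i]   # stochastic gradient of 0.5*(x_i^T w - y_i)^2
            w = w - stepsize * g
            w_sum += w
            t += 1
    return w_sum / t             # averaged iterate

# Noiseless synthetic regression: averaged SGD drives the empirical risk down.
rng = np.random.default_rng(1)
n, d = 50, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star
w_bar = sgd_iterate_average(X, y, rng=rng)
```
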
https://proceedings.mlr.press/v134/zou21a.html
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network

While over-parameterization is widely believed to be crucial for the success of optimizing neural networks, most existing theories on over-parameterization do not fully explain the reason—they either work in the Neural Tangent Kernel regime where neurons don’t move much, or require an enormous number of neurons. In practice, when the data is generated by a teacher neural network, even mildly over-parameterized neural networks can achieve 0 loss and recover the directions of the teacher neurons. In this paper we develop a local convergence theory for mildly over-parameterized two-layer neural networks. We show that as long as the loss is already lower than a threshold (polynomial in the relevant parameters), all student neurons in an over-parameterized two-layer neural network will converge to one of the teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is a new characterization of the local optimization landscape—we show that the gradient satisfies a special case of the Łojasiewicz property, which is different from the local strong convexity or PL conditions used in previous work.
https://proceedings.mlr.press/v134/zhou21b.html
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

We study reinforcement learning (RL) with linear function approximation, where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. For the fixed-horizon episodic setting with inhomogeneous transition kernels, we propose a new, computationally efficient algorithm that uses the basis kernels to approximate value functions. We show that the new algorithm, which we call ${\text{UCRL-VTR}^{+}}$, attains an $\tilde O(dH\sqrt{T})$ regret, where $d$ is the number of basis kernels, $H$ is the length of the episode, and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that ${\text{UCRL-VTR}^{+}}$ is minimax optimal up to logarithmic factors. At the core of our results are (1) a weighted least squares estimator for the unknown transition probability; and (2) a new Bernstein-type concentration inequality for self-normalized vector-valued martingales with bounded increments. Together, these new tools enable tight control of the Bellman error and lead to a nearly minimax regret. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for RL with linear function approximation.
https://proceedings.mlr.press/v134/zhou21a.html
Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to its long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions pose (at most) little additional difficulty in terms of sample complexity. We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, and total reward bounded by 1, where the agent plays for $K$ episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. The new bonus only requires tweaking the constants to ensure optimism and is thus significantly simpler than existing bonus constructions. We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \mathrm{poly}\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of contextual bandits. Notably, this result 1) exponentially improves the state-of-the-art polynomial-time algorithms of Dann et al. (2019), Zanette et al. (2019), and Zhang et al. (2020) in terms of the dependency on $H$, and 2) exponentially improves the running time of Wang et al. (2020) and significantly improves the dependency on $S$, $A$, and $K$ in sample complexity.
https://proceedings.mlr.press/v134/zhang21b.html
Improved Algorithms for Efficient Active Learning Halfspaces with Massart and Tsybakov Noise

We give a computationally efficient PAC active learning algorithm for $d$-dimensional homogeneous halfspaces that can tolerate Massart noise (Massart and Nedelec, 2006) and Tsybakov noise (Tsybakov, 2004). Specialized to the $\eta$-Massart noise setting, our algorithm achieves an information-theoretically near-optimal label complexity of $\tilde{O}\left( \frac{d}{(1-2\eta)^2} \mathrm{polylog}(\frac{1}{\epsilon}) \right)$ under a wide range of unlabeled data distributions (specifically, the family of “structured distributions” defined in Diakonikolas et al. (2020)). Under the more challenging Tsybakov noise condition, we identify two subfamilies of noise conditions under which our efficient algorithm provides label complexity guarantees strictly lower than those of passive learning algorithms.
https://proceedings.mlr.press/v134/zhang21a.html
Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation

Policy optimization methods are popular reinforcement learning (RL) algorithms, because their incremental and on-policy nature makes them more stable than their value-based counterparts. However, the same properties also make them slow to converge and sample inefficient, as the on-policy nature is not amenable to data reuse and the incremental updates result in a large iteration complexity before a near-optimal policy is discovered. These empirical findings are mirrored in theory in the recent work of Agarwal et al. (2020), which provides a policy optimization method, PC-PG, that provably finds near-optimal policies in the linear MDP model and is robust to model misspecification, but suffers from an extremely poor sample complexity compared with value-based techniques. In this paper, we propose a new algorithm, COPOE, that overcomes the poor sample complexity of PC-PG while retaining its robustness to model misspecification. COPOE makes several important algorithmic enhancements to PC-PG, such as enabling data reuse, and uses more refined analysis techniques, which we expect to be more broadly applicable to designing new RL algorithms. The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PC-PG to $\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with value-based techniques.
https://proceedings.mlr.press/v134/zanette21a.html
Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions that are not dominated by any other actions. We show that when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that only scales with the sum of the inverses of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed that all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{\min}}\right)$ regret due to over-exploration, where $\Delta_{\min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{\min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing that the dependency on $\frac{|Z_{mul}|}{\Delta_{\min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.
https://proceedings.mlr.press/v134/xu21a.html
The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm.
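The intermittent-communication setting is easy to simulate. The sketch below runs Local SGD, one natural algorithm in this setting (not necessarily the optimal one established in the paper), on a toy scalar quadratic, with $M$ machines, $R$ rounds, and $K$ local stochastic gradient steps per round; all parameter values are illustrative assumptions:

```python
import numpy as np

def local_sgd(mu, M=4, R=10, K=5, stepsize=0.5, noise=1.0, rng=None):
    """Local SGD for the scalar objective f(w) = 0.5 * E[(w - Z)^2], Z ~ N(mu, noise^2):
    each of M machines takes K local stochastic gradient steps per round,
    then the machines synchronize by averaging, for R communication rounds."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = 0.0                                        # shared iterate after each sync
    for _ in range(R):
        local = np.full(M, w)                      # broadcast the shared iterate
        for _ in range(K):
            z = mu + noise * rng.standard_normal(M)
            local -= stepsize * (local - z)        # grad of 0.5*(w - z)^2 is (w - z)
        w = local.mean()                           # one round of communication
    return w

w_hat = local_sgd(mu=3.0)   # should land near the minimizer w* = mu = 3
```
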
https://proceedings.mlr.press/v134/woodworth21a.html
On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides a restricted, “local” access to the MDP: the planner can ask for random transitions from previously returned states and arbitrary actions, and the features are also only accessible for the states that are encountered in this process. As opposed to previous work (e.g., Lattimore et al., 2020) where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem, where the action-value function of the optimal policy is linearly realizable, requires an exponential number of queries, either in $H$ (the horizon of the MDP) or $d$ (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work, we establish that $\mathrm{poly}(H,d)$ planning is possible with state-value function realizability whenever the action set has a constant size. In particular, we present the TensorPlan algorithm, which uses $\mathrm{poly}((dH/\delta)^A)$ simulator queries to find a $\delta$-optimal policy relative to any deterministic policy for which the value function is linearly realizable with some bounded parameter (with a known bound). This is the first algorithm to give a polynomial query complexity guarantee using only linear realizability of a single competing value function. Whether the computation cost is similarly bounded remains an interesting open question. We also extend the upper bound to the near-realizable case and to the infinite-horizon discounted MDP setup. The upper bounds are complemented by a lower bound which states that in the infinite-horizon episodic setting, planners that achieve constant suboptimality need exponentially many queries, either in the dimension or the number of actions.
https://proceedings.mlr.press/v134/weisz21a.html
Non-stationary Reinforcement Learning without Prior Knowledge: an Optimal Black-box Approach

We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment, importantly without any prior knowledge of the degree of non-stationarity. By plugging different algorithms into our black-box, we provide a list of examples showing that our approach not only recovers recent results for (contextual) multi-armed bandits achieved by very specialized algorithms, but also significantly improves the state of the art for (generalized) linear bandits, episodic MDPs, and infinite-horizon MDPs in various ways. Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$, where $T$ is the number of rounds and $L$ and $\Delta$ are, respectively, the number and amount of changes of the world, while previous works only obtain suboptimal bounds and/or require knowledge of $L$ and $\Delta$.
https://proceedings.mlr.press/v134/wei21b.html
Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games

We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent’s best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), symmetric (players taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.
https://proceedings.mlr.press/v134/wei21a.html
Implicit Regularization in ReLU Networks with the Square Loss

Understanding the implicit regularization (or implicit bias) of gradient descent has recently been a very active research area. However, the implicit regularization in nonlinear neural networks is still poorly understood, especially for regression losses such as the square loss. Perhaps surprisingly, we prove that even for a single ReLU neuron, it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters (although on the positive side, we show it can be characterized approximately). For one-hidden-layer networks, we prove a similar result, where in general it is impossible to characterize implicit regularization properties in this manner, except for the “balancedness” property identified in Du et al. (2018). Our results suggest that a more general framework than the one considered so far may be needed to understand implicit regularization for nonlinear predictors, and provide some clues on what this framework should be.
https://proceedings.mlr.press/v134/vardi21b.html
Size and Depth Separation in Approximating Benign Functions with Neural Networks

When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions. However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially-bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these conditions “benign”, and explore the benefits of size and depth for approximation of benign functions with ReLU networks. As we show, this problem is more challenging than the corresponding problem for non-benign functions. We give complexity-theoretic barriers to showing depth lower bounds: proving the existence of a benign function that cannot be approximated by polynomial-sized networks of depth $4$ would settle longstanding open problems in computational complexity. This implies that beyond depth $4$ there is a barrier to showing depth separation for benign functions, even between networks of constant depth and networks of nonconstant depth. We also study size separation, namely, whether there are benign functions that can be approximated with networks of size $O(s(d))$, but not with networks of size $O(s'(d))$. We show a complexity-theoretic barrier to proving such results beyond size $O(d\log^2(d))$, but also show an explicit benign function that can be approximated with networks of size $O(d)$ and not with networks of size $o(d/\log d)$. For approximation in the $L_\infty$ sense we achieve such separation already between size $O(d)$ and size $o(d)$. Moreover, we show superpolynomial size lower bounds and barriers to such lower bounds, depending on the assumptions on the function. Our size-separation results rely on an analysis of size lower bounds for Boolean functions, which is of independent interest: we show linear size lower bounds for computing explicit Boolean functions (such as set disjointness) with neural networks and threshold circuits.
https://proceedings.mlr.press/v134/vardi21a.html
Robust Online Convex Optimization in the Presence of Outliers

We consider online convex optimization when a number $k$ of data points are outliers that may be corrupted. We model this by introducing the notion of robust regret, which measures the regret only on rounds that are not outliers. The aim for the learner is to achieve small robust regret, without knowing where the outliers are. If the outliers are chosen adversarially, we show that a simple filtering strategy on extreme gradients incurs $O(k)$ overhead compared to the usual regret bounds, and that this is unimprovable, which means that $k$ needs to be sublinear in the number of rounds. We further ask which additional assumptions would allow for a linear number of outliers. It turns out that the usual benign cases of independently, identically distributed (i.i.d.) observations or strongly convex losses are not sufficient. However, combining i.i.d. observations with the assumption that outliers are those observations that are in an extreme quantile of the distribution does lead to sublinear robust regret, even though the expected number of outliers is linear.
https://proceedings.mlr.press/v134/vanerven21a.html
A Dimension-free Computational Upper-bound for Smooth Optimal Transport Estimation

It is well known that plug-in statistical estimation of optimal transport suffers from the curse of dimensionality. Despite recent efforts to improve the rate of estimation with the smoothness of the problem, the computational complexity of these recently proposed methods still degrades exponentially with the dimension. In this paper, thanks to an infinite-dimensional sum-of-squares representation, we derive a statistical estimator of smooth optimal transport which achieves a precision $\varepsilon$ from $\tilde{O}(\varepsilon^{-2})$ independent and identically distributed samples from the distributions, for a computational cost of $\tilde{O}(\varepsilon^{-4})$ when the smoothness increases, hence yielding dimension-free statistical \emph{and} computational rates, with potentially exponentially dimension-dependent constants.
https://proceedings.mlr.press/v134/vacher21a.html
Machine Unlearning via Algorithmic Stability

We study the problem of machine unlearning and identify a notion of algorithmic stability, Total Variation (TV) stability, which, we argue, is suitable for the goal of exact unlearning. For convex risk minimization problems, we design TV-stable algorithms based on noisy Stochastic Gradient Descent (SGD). Our key contribution is the design of corresponding efficient unlearning algorithms, which are based on constructing a near-maximal coupling of Markov chains for the noisy SGD procedure. To understand the trade-offs between accuracy and unlearning efficiency, we give upper and lower bounds on the excess empirical and population risk of TV-stable algorithms for convex risk minimization. Our techniques generalize to arbitrary non-convex functions, and our algorithms are differentially private as well.
https://proceedings.mlr.press/v134/ullah21a.html
On Empirical Bayes Variational Autoencoder: An Excess Risk Bound

In this paper, we consider variational autoencoders (VAE) via empirical Bayes estimation, referred to as Empirical Bayes Variational Autoencoders (EBVAE), which is a general framework including popular VAE methods as special cases. Despite the widespread use of VAEs, their theoretical aspects are less explored in the literature. Motivated by this, we establish a general theoretical framework for analyzing the excess risk associated with EBVAE under the setting of density estimation, covering both parametric and nonparametric cases, through the lens of M-estimation. As an application, we analyze the excess risk of the commonly used EBVAE with Gaussian models and highlight the importance of the covariance matrices of Gaussian encoders and decoders in obtaining a good statistical guarantee, shedding light on the empirical observations reported in the literature.
https://proceedings.mlr.press/v134/tang21a.html
Efficient Bandit Convex Optimization: Beyond Linear Losses

We study the problem of online learning with bandit feedback, where a learner aims to minimize a sequence of adversarially generated loss functions, while only observing the value of each function at a single point. When the loss functions chosen by the adversary are convex and quadratic, we develop a new algorithm which achieves the optimal regret rate of $O(T^{1/2})$. Furthermore, our algorithm satisfies three important desiderata: (a) it is practical and can be efficiently implemented for high-dimensional problems, (b) the regret bound holds with high probability even against adaptive adversaries whose decisions can depend on the learner’s previous actions, and (c) it is robust to model mis-specification; that is, the regret bound degrades gracefully when the loss functions deviate from convex quadratics. To the best of our knowledge, ours is the first algorithm for bandit convex optimization with quadratic losses which is efficiently implementable and achieves optimal regret guarantees. Existing algorithms for this problem either have sub-optimal regret guarantees or are computationally expensive and do not scale well to high-dimensional problems. Our algorithm is a bandit version of the classical regularized Newton’s method. It involves estimation of gradients and Hessians of the loss functions from single-point feedback. A key caveat of working with Hessian estimates, however, is that they typically have large variance. In this work, we show that it is nonetheless possible to finesse this caveat by relying on the idea of “focus regions”, where we restrict the algorithm to choose actions from a subset of the action space in which the variance of our estimates can be controlled.
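Estimating a gradient from single-point function evaluations, which this abstract builds on, is classically done with a sphere-smoothing estimator. The sketch below is the standard Flaxman-Kalai-McMahan-style estimator, not the paper's full regularized-Newton method, and the smoothing radius is an illustrative assumption:

```python
import numpy as np

def one_point_gradient(f, w, delta=0.01, rng=None):
    """Classical single-point gradient estimate: sample u uniformly from the
    unit sphere and return (d/delta) * f(w + delta*u) * u, which is an unbiased
    estimate of the gradient of a smoothed version of f at w."""
    rng = np.random.default_rng() if rng is None else rng
    d = w.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)       # uniform direction on the sphere
    return (d / delta) * f(w + delta * u) * u

# Sanity check on a linear function f(x) = a.x, whose gradient is exactly a:
# averaging many one-point estimates should recover a.
rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
f = lambda x: a @ x
est = np.mean([one_point_gradient(f, np.zeros(3), rng=rng) for _ in range(20000)], axis=0)
```

As the abstract notes, the analogous single-point Hessian estimates have much larger variance, which is what the paper's “focus regions” are designed to control.
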
https://proceedings.mlr.press/v134/suggala21a.html
Johnson-Lindenstrauss Transforms with Best Confidence

The seminal result of Johnson and Lindenstrauss on random embeddings has been intensively studied in applied and theoretical computer science. Despite this vast body of literature, we still lack a complete understanding of the statistical properties of random projections; a particularly intriguing question is: why are the theoretical bounds that far behind the empirically observed performance? Motivated by this question, this work develops Johnson-Lindenstrauss distributions with optimal, data-oblivious, statistical confidence bounds. These bounds are numerically best possible, for any given data dimension, embedding dimension, and distortion tolerance. They improve upon prior works in terms of statistical accuracy, as well as exactly determine the no-go regimes for data-oblivious approaches. Furthermore, the projection matrices are efficiently samplable. The construction relies on orthogonal matrices, and the proof uses certain elegant properties of the unit sphere. In particular, the following techniques introduced in this work are of independent interest: a) a compact expression for the projection distortion in terms of the singular values of the projection matrix, b) a parametrization linking the unit sphere and the Dirichlet distribution, and c) anti-concentration bounds for the Dirichlet distribution. Besides the technical contribution, the paper presents applications and numerical evaluation along with a working implementation in Python (shared as a GitHub repository).
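For context, the classical Gaussian Johnson-Lindenstrauss projection and an empirical check of pairwise-distance distortion look like this. The paper's optimal-confidence construction instead relies on orthogonal matrices; the Gaussian version below is only the textbook baseline, with illustrative dimensions:

```python
from itertools import combinations

import numpy as np

def jl_project(X, k, rng=None):
    """Project rows of X from dimension d down to k with a random Gaussian
    matrix scaled by 1/sqrt(k) -- the classical JL construction."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = X.shape[1]
    P = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ P

def max_distortion(X, Y):
    """Worst-case relative change in squared pairwise distances under the embedding."""
    worst = 0.0
    for i, j in combinations(range(len(X)), 2):
        orig = np.sum((X[i] - X[j]) ** 2)
        proj = np.sum((Y[i] - Y[j]) ** 2)
        worst = max(worst, abs(proj / orig - 1.0))
    return worst

# Embed 20 points from dimension 1000 into dimension 500 and measure distortion.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 1000))
Y = jl_project(X, 500, rng=rng)
```
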
https://proceedings.mlr.press/v134/skorski21a.html
Lazy OCO: Online Convex Optimization on a Switching Budget

We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting, but only with an adaptive adversary. In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d. losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings. Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.
https://proceedings.mlr.press/v134/sherman21a.html
Almost Sure Convergence Rates for Stochastic Gradient Descent and Stochastic Heavy Ball

We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, otherwise known as the momentum method) for the general stochastic approximation problem. For SGD, in the convex and smooth setting, we provide the first \emph{almost sure} asymptotic convergence \emph{rates} for a weighted average of the iterates. More precisely, we show that the convergence rate of the function values is arbitrarily close to $o(1/\sqrt{k})$, and is exactly $o(1/k)$ in the so-called overparametrized case. We show that these results still hold when using a decreasing step size version of stochastic line search and stochastic Polyak stepsizes, thereby giving the first proof of convergence of these methods in the non-overparametrized regime. Using a substantially different analysis, we show that these rates hold for SHB as well, but at the last iterate. This distinction is important because it is the last iterate of SGD and SHB which is used in practice. We also show that the last iterate of SHB converges to a minimizer \emph{almost surely}. Additionally, we prove that the function values of the deterministic HB converge at a $o(1/k)$ rate, which is faster than the previously known $O(1/k)$. Finally, in the nonconvex setting, we prove similar rates for the lowest gradient norm along the trajectory of SGD.
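The stochastic heavy ball iteration studied in this abstract is one line per step. A minimal sketch on a noisy quadratic follows; the step size, momentum, noise level, and test problem are illustrative assumptions, not values from the paper:

```python
import numpy as np

def shb(grad, w0, stepsize=0.1, momentum=0.9, n_steps=2000, noise=0.01, rng=None):
    """Stochastic heavy ball: w_{t+1} = w_t - stepsize * g_t + momentum * (w_t - w_{t-1}),
    where g_t is a noisy gradient. Returns the last iterate, which is the one
    the paper's SHB rates are stated for."""
    rng = np.random.default_rng(0) if rng is None else rng
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(n_steps):
        g = grad(w) + noise * rng.standard_normal(w.shape)   # stochastic gradient
        w, w_prev = w - stepsize * g + momentum * (w - w_prev), w
    return w

# Noisy quadratic f(w) = 0.5 * ||w - 1||^2, whose gradient is w - 1:
# the last iterate should end up near the minimizer (1, 1, 1).
w_last = shb(lambda v: v - 1.0, np.zeros(3))
```
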
https://proceedings.mlr.press/v134/sebbouh21a.html
The Effects of Mild Over-parameterization on the Optimization Landscape of Shallow ReLU Neural Networks

We study the effects of mild over-parameterization on the optimization landscape of a simple ReLU neural network of the form $\mathbf{x}\mapsto\sum_{i=1}^k\max\{0,\mathbf{w}_i^{\top}\mathbf{x}\}$, in a well-studied teacher-student setting where the target values are generated by the same architecture, and when directly optimizing over the population squared loss with respect to Gaussian inputs. We prove that while the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons, it is not even \emph{locally convex} after any amount of over-parameterization. Moreover, related desirable properties (e.g., one-point strong convexity and the Polyak-Łojasiewicz condition) also do not hold even locally. On the other hand, we establish that the objective remains one-point strongly convex in \emph{most} directions (suitably defined), and show an optimization guarantee under this property. For the non-global minima, we prove that adding even just a single neuron will turn a non-global minimum into a saddle point. This holds under some technical conditions which we validate empirically. These results provide a possible explanation for why recovering a global minimum becomes significantly easier when we over-parameterize, even if the amount of over-parameterization is very moderate.
https://proceedings.mlr.press/v134/safran21a.html
https://proceedings.mlr.press/v134/safran21a.htmlLearning to Stop with Surprisingly Few SamplesWe consider a discounted infinite horizon optimal stopping problem. If the underlying distribution is known a priori, the solution of this problem is obtained via dynamic programming (DP) and is given by a well-known threshold rule. When information on this distribution is lacking, a natural (though naive) approach is “explore-then-exploit,” whereby the unknown distribution or its parameters are estimated over an initial exploration phase, and this estimate is then used in the DP to determine actions over the residual exploitation phase. We show: (i) with proper tuning, this approach leads to performance comparable to the full information DP solution; and (ii) despite common wisdom on the sensitivity of such “plug-in” approaches in DP due to propagation of estimation errors, a surprisingly “short” (logarithmic in the horizon) exploration horizon suffices to obtain said performance. In cases where the underlying distribution is heavy-tailed, these observations are even more pronounced: a single sample exploration phase suffices.Wed, 21 Jul 2021 00:00:00 +0000
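A minimal sketch of the explore-then-exploit scheme above, under the simplifying assumption of i.i.d. offers with discount factor $\beta$ (for i.i.d. offers the DP threshold solves $\tau = \beta\,\mathbb{E}[\max(X, \tau)]$; the function names below are illustrative, not from the paper):

```python
import random

def plugin_threshold(samples, beta, iters=200):
    # Fixed point of t = beta * E[max(X, t)] under the empirical distribution.
    # The map t -> beta * E[max(X, t)] is a beta-contraction, so iterating converges.
    t = 0.0
    for _ in range(iters):
        t = beta * sum(max(x, t) for x in samples) / len(samples)
    return t

def explore_then_exploit(offers, beta, explore_len):
    # Exploration phase: only observe the first explore_len offers
    # (logarithmic in the horizon, per the result above), then plug the
    # empirical estimate into the DP threshold rule.
    t_hat = plugin_threshold(offers[:explore_len], beta)
    for k in range(explore_len, len(offers)):
        if offers[k] >= t_hat:  # exploit: stop at the first offer clearing the threshold
            return k, offers[k]
    return len(offers) - 1, offers[-1]
```

For uniform offers on $[0,1]$ and $\beta = 0.9$, the exact threshold is $(1-\sqrt{0.19})/0.9 \approx 0.63$, and the plug-in estimate approaches it as the exploration window grows.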
https://proceedings.mlr.press/v134/russo21a.html
https://proceedings.mlr.press/v134/russo21a.htmlAverage-Case Communication Complexity of Statistical ProblemsWe study statistical problems, such as planted clique, its variants, and sparse principal component analysis in the context of average-case communication complexity. Our motivation is to understand the statistical-computational trade-offs in streaming, sketching, and query-based models. Communication complexity is the main tool for proving lower bounds in these models, yet many prior results do not hold in an average-case setting. We provide a general reduction method that preserves the input distribution for problems involving a random graph or matrix with planted structure. Then, we derive two-party and multi-party communication lower bounds for detecting or finding planted cliques, bipartite cliques, and related problems. As a consequence, we obtain new bounds on the query complexity in the edge-probe, vector-matrix-vector, matrix-vector, linear sketching, and $\mathbb{F}_2$-sketching models. Many of these results are nearly tight, and we use our techniques to provide simple proofs of some known lower bounds for the edge-probe model.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/rashtchian21a.html
https://proceedings.mlr.press/v134/rashtchian21a.htmlExponential Weights Algorithms for Selective LearningWe study the selective learning problem introduced by Qiao and Valiant (2019), in which the learner observes $n$ labeled data points one at a time. At a time of its choosing, the learner selects a window length $w$ and a model $\hat\ell$ from the model class $\mathcal{L}$, and then labels the next $w$ data points using $\hat\ell$. The \emph{excess risk} incurred by the learner is defined as the difference between the average loss of $\hat\ell$ over those $w$ data points and the smallest possible average loss among all models in $\mathcal{L}$ over those $w$ data points. We give an improved algorithm, termed the \emph{hybrid exponential weights} algorithm, that achieves an expected excess risk of $O((\log\log|\mathcal{L}| + \log\log n)/\log n)$. This result gives a doubly exponential improvement in the dependence on $|\mathcal{L}|$ over the best known bound of $O(\sqrt{|\mathcal{L}|/\log n})$. We complement the positive result with an almost matching lower bound, which suggests the worst-case optimality of the algorithm. We also study a more restrictive family of learning algorithms that are \emph{bounded-recall} in the sense that when a prediction window of length $w$ is chosen, the learner’s decision only depends on the most recent $w$ data points. We analyze an exponential weights variant of the ERM algorithm in Qiao and Valiant (2019). This new algorithm achieves an expected excess risk of $O(\sqrt{\log |\mathcal{L}|/\log n})$, which is shown to be nearly optimal among all bounded-recall learners. Our analysis builds on a generalized version of the selective mean prediction problem in Drucker (2013); Qiao and Valiant (2019), which may be of independent interest.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/qiao21a.html
https://proceedings.mlr.press/v134/qiao21a.htmlExponential savings in agnostic active learning through abstentionWe show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying a price marginally smaller than 1/2, the average loss of a random guess, then exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under model misspecification.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/puchkin21a.html
https://proceedings.mlr.press/v134/puchkin21a.htmlAdaptive Discretization for Adversarial Lipschitz BanditsLipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces such as the [0,1] interval, where similar actions are guaranteed to have similar rewards. A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions thereof. The goal is to take advantage of "nicer" problem instances, while retaining near-optimal worst-case performance. While the stochastic version of the problem is well-understood, the general version with adversarial rewards is not. We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds. In particular, we recover the worst-case optimal regret bound for the adversarial version, and the instance-dependent regret bound for the stochastic version. A version with full proofs (and additional results) appears at arxiv.org/abs/2006.12367v2.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/podimata21a.html
https://proceedings.mlr.press/v134/podimata21a.htmlLearning from Censored and Dependent Data: The case of Linear DynamicsObservations from dynamical systems often exhibit irregularities, such as censoring, where values are recorded only if they fall within a certain range. Censoring is ubiquitous in practice due to saturating sensors, limit-of-detection effects, and image frame effects; combined with the temporal dependencies within the data, it makes the task of system identification particularly challenging. In light of recent developments on learning linear dynamical systems (LDSs), and on censored statistics with independent data, we revisit the decades-old problem of learning an LDS from censored observations (Lee and Maddala (1985), Zeger and Brookmeyer (1986)). Here, the learner observes the state $x_t \in \mathbb{R}^d$ if and only if $x_t$ belongs to some set $S_t \subseteq \mathbb{R}^d$. We develop the first computationally and statistically efficient algorithm for learning the system, assuming only oracle access to the sets $S_t$. Our algorithm, Stochastic Online Newton with Switching Gradients, is a novel second-order method that builds on the Online Newton Step (ONS) of Hazan et al. (2007). Our Switching-Gradient scheme does not always use (stochastic) gradients of the function we want to optimize, which we call the censor-aware function. Instead, in each iteration, it performs a simple test to decide whether to use the censor-aware function, or another censor-oblivious function, for getting a stochastic gradient. In our analysis, we consider a “generic” Online Newton method, which uses arbitrary vectors instead of gradients, and we prove an error bound for it. This can be used to appropriately design these vectors, leading to our Switching-Gradient scheme. This framework significantly deviates from the recent long line of works on censored statistics (e.g., Daskalakis et al. (2018); Kontonis et al. (2019); Daskalakis et al. (2019)), which apply Stochastic Gradient Descent (SGD), and whose analysis reduces to establishing conditions for off-the-shelf SGD bounds. Our approach enables us to relax these conditions, and gives rise to phenomena that might appear counterintuitive given the previous works. Specifically, our method makes progress even when the current “survival probability” is exponentially small. We believe that our analysis framework will have applications in more settings where the data are subject to censoring.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/plevrakis21a.html
https://proceedings.mlr.press/v134/plevrakis21a.htmlTowards a Dimension-Free Understanding of Adaptive Linear ControlWe study the problem of adaptive control of the linear quadratic regulator for systems in very high, or even infinite dimension. We demonstrate that while sublinear regret requires finite dimensional inputs, the ambient state dimension of the system need not be bounded in order to perform online control. We provide the first regret bounds for LQR which hold for infinite dimensional systems, replacing dependence on ambient dimension with more natural notions of problem complexity. Our guarantees arise from a novel perturbation bound for certainty equivalence which scales with the prediction error in estimating the system parameters, without requiring consistent parameter recovery in more stringent measures like the operator norm. When specialized to finite dimensional settings, our bounds recover near optimal dimension and time horizon dependence.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/perdomo21a.html
https://proceedings.mlr.press/v134/perdomo21a.htmlTowards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty OracleMotivated by applications in crowdsourced entity resolution in databases, signed edge prediction in social networks, and correlation clustering, Mazumdar and Saha [NIPS 2017] proposed an elegant theoretical model for studying clustering with a faulty oracle. In this model, given a set of $n$ items which belong to $k$ unknown groups (or clusters), our goal is to recover the clusters by asking pairwise queries to an oracle, which can answer queries of the form “do items $u$ and $v$ belong to the same cluster?”. However, the answer to each pairwise query errs with probability $\epsilon$, for some $\epsilon\in(0,\frac12)$. Mazumdar and Saha provided two algorithms under this model: one is query-optimal but time-inefficient (i.e., it runs in quasi-polynomial time), while the other is time-efficient (i.e., it runs in polynomial time) but query-suboptimal. Larsen, Mitzenmacher and Tsourakakis [WWW 2020] then gave a new time-efficient algorithm for the special case of $2$ clusters, which is query-optimal if the bias $\delta:=1-2\epsilon$ of the model is large. It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters and other regimes of $\delta$. In this paper, we make progress on the above question and provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when information-theoretic recovery is possible. Our algorithm is built on a connection to the stochastic block model.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/peng21a.html
https://proceedings.mlr.press/v134/peng21a.htmlProvable Memorization via Deep Neural Networks using Sub-linear ParametersIt is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $O(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width 3) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of works on the benefits of depth for function approximation. We also provide empirical results that support our theoretical findings.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/park21a.html
https://proceedings.mlr.press/v134/park21a.htmlSGD in the Large: Average-case Analysis, Asymptotics, and Stepsize CriticalityWe propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the dimension are large. This framework applies for any fixed stepsize to the least squares problem with random data (finite-sum). Using this new framework, we show that the dynamics of SGD become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when the input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates (i.e., the complexity of an algorithm averaged over all possible inputs). These rates show significant improvement over classical worst-case complexities.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/paquette21a.html
https://proceedings.mlr.press/v134/paquette21a.htmlOpen Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?We propose matrix norm inequalities that extend the Recht and Ré (2012) conjecture on a noncommutative AM-GM inequality, by supplementing it with another inequality that accounts for single-shuffle in stochastic finite-sum minimization. Single-shuffle is a popular without-replacement sampling scheme that shuffles only once in the beginning, but has not been studied in the Recht-Ré conjecture and the follow-up literature. Instead of focusing on general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products satisfy a series of spectral norm inequalities that imply “single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD and GD” in special cases.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/open-problem-yun21a.html
https://proceedings.mlr.press/v134/open-problem-yun21a.htmlOpen Problem: Tight Online Confidence Intervals for RKHS ElementsConfidence intervals are a crucial building block in the analysis of various online learning problems. The analysis of kernel-based bandit and reinforcement learning problems utilize confidence intervals applicable to the elements of a reproducing kernel Hilbert space (RKHS). However, the existing confidence bounds do not appear to be tight, resulting in suboptimal regret bounds. In fact, the existing regret bounds for several kernelized bandit algorithms (e.g., GP-UCB, GP-TS, and their variants) may fail to even be sublinear. It is unclear whether the suboptimal regret bound is a fundamental shortcoming of these algorithms or an artifact of the proof, and the main challenge seems to stem from the online (sequential) nature of the observation points. We formalize the question of online confidence intervals in the RKHS setting and overview the existing results. Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/open-problem-vakili21a.html
https://proceedings.mlr.press/v134/open-problem-vakili21a.htmlOpen Problem: Is There an Online Learning Algorithm That Learns Whenever Online Learning Is Possible?The open problem asks whether there exists an online learning algorithm for binary classification that guarantees, for all target concepts, to make a sublinear number of mistakes, under only the assumption that the (possibly random) sequence of points X allows that such a learning algorithm can exist for that sequence. As a secondary problem, it also asks whether a specific concise condition completely determines whether a given (possibly random) sequence of points X admits the existence of online learning algorithms guaranteeing a sublinear number of mistakes for all target concepts.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/open-problem-hanneke21b.html
https://proceedings.mlr.press/v134/open-problem-hanneke21b.htmlOpen Problem: Are all VC-classes CPAC learnable?A few years ago, it was shown that there exist basic statistical learning problems whose learnability cannot be determined within ZFC [Ben-David, Hrubes, Moran, Shpilka, Yehudayoff, 2017]. Such independence, and the implied impossibility of characterizing learnability of a class by any combinatorial parameter, stems from the basic definitions viewing learners as arbitrary functions. That level of generality not only results in unprovability issues but is also problematic from the perspective of modeling practical machine learning, where learners and predictors are computable objects. In light of that, it is natural to consider learnability by algorithms that output computable predictors (both learners and predictors are then representable as finite objects). A recent study [Agarwal, Ananthakrishnan, Ben-David, Lechner and Urner, 2020] initiated a theory of such models of learning. It proposed the notion of CPAC learnability, by adding some basic computability requirements into a PAC learning framework. As a first step towards a characterization of learnability in the CPAC framework, Agarwal et al. showed that CPAC learnability of a binary hypothesis class is no longer implied by the finiteness of its VC-dimension, as far as proper learners are concerned. A major remaining open question is whether a similar result holds also for improper learning. Namely, does there exist a computable concept class consisting of computable classifiers that has a finite VC-dimension but that no computable learner can PAC learn (even if the learner is not restricted to output a hypothesis that is a member of the class)? Another interesting question concerns coming up with combinatorial characterizations of learnability for computable learners.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/open-problem-agarwal21b.html
https://proceedings.mlr.press/v134/open-problem-agarwal21b.htmlIt was “all” for “nothing”: sharp phase transitions for noiseless discrete channelsWe prove a phase transition known as the “all-or-nothing” phenomenon for noiseless discrete channels. This class of models includes the Bernoulli group testing model and the planted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenon for such models was only known in a limited range of parameters. Our work extends the results to all signals with sublinear sparsity. Our main technique is to show that for such models, the “all” half of all-or-nothing implies the “nothing” half, so that a proof of “all” can be turned into a proof of “nothing.” Since the “all” half can often be proven by straightforward means, our equivalence gives a powerful and general approach towards establishing the existence of this phenomenon in other contexts.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/nilesweed21a.html
https://proceedings.mlr.press/v134/nilesweed21a.htmlInformation-Theoretic Generalization Bounds for Stochastic Gradient DescentWe study the generalization properties of the popular stochastic optimization method known as stochastic gradient descent (SGD) for optimizing general non-convex loss functions. Our main contribution is providing upper bounds on the generalization error that depend on local statistics of the stochastic gradients evaluated along the path of iterates calculated by SGD. The key factors our bounds depend on are the variance of the gradients (with respect to the data distribution), the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output. Our key technical tool is a combination of the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/neu21a.html
https://proceedings.mlr.press/v134/neu21a.htmlA Theory of Heuristic LearnabilityWhich concepts can we learn efficiently on average? In this paper, we investigate the capability of a natural average-case learning framework, heuristic PAC (heurPAC) learning, to answer this and some other related questions. Roughly speaking, we say that a concept class is heurPAC learnable if there exists a learning algorithm that, given $n, s\in\mathbb{N}$ and $\epsilon,\delta,\eta\in(0,1]$ as input, learns all but an $\eta$ fraction of $n$-input target functions represented as $s$-bit strings in the class from passively collected examples and then outputs an $\epsilon$-close hypothesis with failure probability at most $\delta$ in time polynomial in $n$, $s$, $\epsilon^{-1}$, $\delta^{-1}$, and $\eta^{-1}$, where each example is generated according to some example distribution. First, we establish a positive learnability result. Specifically, we show that a simple Fourier-based algorithm heurPAC learns $\Omega(\log n)$-junta functions on the uniform distribution, which is a central open question in the original PAC learning model. Our technical contribution is to introduce the notion of elusive functions that captures hard-to-learn cases and to establish a polynomial relation between the running time and the fraction of such elusive functions. Second, we present clear relations between heurPAC learnability and cryptography. In particular, we show that for any efficiently evaluated class $\mathscr{C}$, (1) if $\mathscr{C}$ is not heurPAC learnable, then an auxiliary-input one-way function (AIOWF) exists; (2) if $\mathscr{C}$ is not heurPAC learnable on the uniform distribution, then an infinitely-often one-way function (io-OWF) exists. As a corollary, we also present new characterizations of AIOWF and io-OWF based on heurPAC learnability, which are conceptually stronger than the previous ones based on average-case learnability for fixed parameters.
These results show that our framework might yield heuristic learners with theoretical guarantees for broader classes than the usual PAC learning framework, and any efficiently evaluated class has a potential for such a heuristic learner or a secure cryptographic primitive. Through this paper, we suggest further research toward the win-win “learning vs. cryptography” paradigm.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/nanashima21a.html
https://proceedings.mlr.press/v134/nanashima21a.htmlAdversarially Robust Learning with Unknown Perturbation SetsWe study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks, in different interaction models, in terms of the VC and Littlestone dimensions of the hypothesis class of predictors, and without any assumptions on the perturbation set.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/montasser21a.html
https://proceedings.mlr.press/v134/montasser21a.htmlLearning to Sample from Censored Markov Random FieldsWe study the problem of learning Censored Markov Random Fields (abbreviated CMRFs), which are Markov Random Fields where some of the nodes are censored (i.e. not observed). We assume the CMRF is high temperature but, crucially, make no assumption about its structure. This makes structure learning impossible. Nevertheless we introduce a new definition, which we call learning to sample, that circumvents this obstacle. We give an algorithm that can learn to sample from a distribution within $\epsilon n$ earthmover distance of the target distribution for any $\epsilon > 0$. We obtain stronger results when we additionally assume high girth, as well as computational lower bounds showing that these are essentially optimal.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/moitra21a.html
https://proceedings.mlr.press/v134/moitra21a.htmlLearning with invariances in random features and kernel modelsA number of machine learning tasks entail a high degree of invariance: the data distribution does not change if we act on the data with a certain group of transformations. For instance, labels of images are invariant under translations of the images. Certain neural network architectures —for instance, convolutional networks—are believed to owe their success to the fact that they exploit such invariance properties. With the objective of quantifying the gain achieved by invariant architectures, we introduce two classes of models: invariant random features and invariant kernel methods. The latter includes, as a special case, the neural tangent kernel for convolutional networks with global average pooling. We consider uniform covariates distributions on the sphere and hypercube and a general invariant target function. We characterize the test error of invariant methods in a high-dimensional regime in which the sample size and number of hidden units scale as polynomials in the dimension, for a class of groups that we call ‘degeneracy $\alpha$’, with $\alpha \leq 1$. We show that exploiting invariance in the architecture saves a $d^\alpha$ factor ($d$ stands for the dimension) in sample size and number of hidden units to achieve the same test error as for unstructured architectures. Finally, we show that output symmetrization of an unstructured kernel estimator does not give a significant statistical improvement; on the other hand, data augmentation with an unstructured kernel estimator is equivalent to an invariant kernel estimator and enjoys the same improvement in statistical efficiency.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/mei21a.html
https://proceedings.mlr.press/v134/mei21a.htmlImproved Analysis of the Tsallis-INF Algorithm in Stochastically Constrained Adversarial Bandits and Stochastic Bandits with Adversarial CorruptionsWe derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021). We show that in adversarial regimes with a $(\Delta,C,T)$ self-bounding constraint the algorithm achieves a regret bound of $\mathcal{O}\left(\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)^2}\right)+\sqrt{C\left(\sum_{i\neq i^*}\frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{C\sum_{i\neq i^*}\frac{1}{\Delta_i}}\right)}\right)$, where $T$ is the time horizon, $K$ is the number of arms, $\Delta_i$ are the suboptimality gaps, $i^*$ is the best arm, $C$ is the corruption magnitude, and $\log_+(x) = \max\left(1,\log x\right)$. The regime includes stochastic bandits, stochastically constrained adversarial bandits, and stochastic bandits with adversarial corruptions as special cases. Additionally, we provide a general analysis, which allows one to achieve the same kind of improvement for generalizations of Tsallis-INF to other settings beyond multi-armed bandits.Wed, 21 Jul 2021 00:00:00 +0000
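Tsallis-INF is follow-the-regularized-leader with $1/2$-Tsallis-entropy regularization and importance-weighted loss estimates. A compact sketch, assuming learning rate $\eta_t = 1/\sqrt{t}$ and computing the normalization by bisection (one standard implementation choice, not necessarily the authors'):

```python
import math
import random

def tsallis_probs(L_hat, eta):
    # Solve p_i = (eta * (L_hat[i] - x))**-2 with sum(p) = 1 by bisection on x.
    # sum(p) is increasing in x for x < min(L_hat), which brackets the root:
    K = len(L_hat)
    lo = min(L_hat) - math.sqrt(K) / eta   # here every p_i <= 1/K, so sum <= 1
    hi = min(L_hat) - 1.0 / eta            # here the leading arm has p = 1, so sum >= 1
    for _ in range(100):
        x = (lo + hi) / 2
        s = sum((eta * (l - x)) ** -2 for l in L_hat)
        lo, hi = (x, hi) if s < 1 else (lo, x)
    x = (lo + hi) / 2
    p = [(eta * (l - x)) ** -2 for l in L_hat]
    z = sum(p)
    return [pi / z for pi in p]  # renormalize to remove residual bisection error

def tsallis_inf(loss_fn, K, T, seed=0):
    rng = random.Random(seed)
    L_hat = [0.0] * K  # cumulative importance-weighted loss estimates
    for t in range(1, T + 1):
        p = tsallis_probs(L_hat, eta=1.0 / math.sqrt(t))
        arm = rng.choices(range(K), weights=p)[0]
        L_hat[arm] += loss_fn(t, arm) / p[arm]  # unbiased loss estimator
    return L_hat
```

When all cumulative estimates are equal, the sketch plays uniformly at random; as gaps open up, the inverse-square weights concentrate on the leading arm while keeping polynomial (rather than exponential) exploration of the others.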
https://proceedings.mlr.press/v134/masoudian21a.html
https://proceedings.mlr.press/v134/masoudian21a.htmlRandom Graph Matching with Improved Noise RobustnessGraph matching, also known as network alignment, refers to finding a bijection between the vertex sets of two given graphs so as to maximally align their edges. This fundamental computational problem arises frequently in multiple fields such as computer vision and biology. Recently, there has been a plethora of work studying efficient algorithms for graph matching under probabilistic models. In this work, we propose a new algorithm for graph matching: Our algorithm associates each vertex with a signature vector using a multistage procedure and then matches a pair of vertices from the two graphs if their signature vectors are close to each other. We show that, for two Erdős–Rényi graphs with edge correlation $1-\alpha$, our algorithm recovers the underlying matching exactly with high probability when $\alpha \le 1 / (\log \log n)^C$, where $n$ is the number of vertices in each graph and $C$ denotes a positive universal constant. This improves the condition $\alpha \le 1 / (\log n)^C$ achieved in previous work.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/mao21a.html
https://proceedings.mlr.press/v134/mao21a.htmlThe Connection Between Approximation, Depth Separation and Learnability in Neural NetworksSeveral recent works have shown separation results between deep neural networks, and hypothesis classes with inferior approximation capacity such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks. In this work we study the intricate connection between learnability and approximation capacity. We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target. Specifically, we show that a necessary condition for a function to be learnable by gradient descent on deep neural networks is to be able to approximate the function, at least in a weak sense, with shallow neural networks. We also show that a class of functions can be learned by an efficient statistical query algorithm if and only if it can be approximated in a weak sense by some kernel class. We give several examples of functions which demonstrate depth separation, and conclude that they cannot be efficiently learned, even by a hypothesis class that can efficiently approximate them.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/malach21a.html
https://proceedings.mlr.press/v134/malach21a.htmlApproximation Algorithms for Socially Fair ClusteringWe present an $e^{O(p)} \cdot (\log \ell) / (\log \log \ell)$-approximation algorithm for socially fair clustering with the $\ell_p$-objective. In this problem, we are given a set of points in a metric space. Each point belongs to one (or several) of $\ell$ groups. The goal is to find a $k$-medians, $k$-means, or, more generally, $\ell_p$-clustering that is simultaneously good for all of the groups. More precisely, we need to find a set of $k$ centers $C$ so as to minimize the maximum over all groups $j$ of $\sum_{u \in \text{group } j} d(u, C)^p$. The socially fair clustering problem was independently proposed by Abbasi, Bhaskara, and Venkatasubramanian (2021) and Ghadiri, Samadi, and Vempala (2021). Our algorithm improves and generalizes their $O(\ell)$-approximation algorithms for the problem. The natural LP relaxation for the problem has an integrality gap of $\Omega(\ell)$. In order to obtain our result, we introduce a strengthened LP relaxation and show that it has an integrality gap of $\Theta((\log \ell) / (\log \log \ell))$ for a fixed $p$. Additionally, we present a bicriteria approximation algorithm, which generalizes the bicriteria approximation of Abbasi et al. (2021).Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/makarychev21a.html
https://proceedings.mlr.press/v134/makarychev21a.htmlCorruption-robust exploration in episodic reinforcement learningWe initiate the study of episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of multi-armed bandits. We provide a framework which modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on “optimism in the face of uncertainty”, by complementing them with principles from “action elimination”. Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms which (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees which degrade gracefully with the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) as well as linear MDP settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit feedback model for episodic reinforcement learning.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/lykouris21a.html
https://proceedings.mlr.press/v134/lykouris21a.htmlA Priori Generalization Analysis of the Deep Ritz Method for Solving High Dimensional Elliptic Partial Differential EquationsThis paper concerns the a priori generalization analysis of the Deep Ritz Method (DRM) [W. E and B. Yu, 2017], a popular neural-network-based method for solving high dimensional partial differential equations. We derive the generalization error bounds of two-layer neural networks in the framework of the DRM for solving two prototype elliptic PDEs: Poisson equation and static Schrödinger equation on the $d$-dimensional unit hypercube. Specifically, we prove that the convergence rates of generalization errors are independent of the dimension $d$, under the a priori assumption that the exact solutions of the PDEs lie in a suitable low-complexity space called spectral Barron space. Moreover, we give sufficient conditions on the forcing term and the potential function which guarantee that the solutions are spectral Barron functions. We achieve this by developing a new solution theory for the PDEs on the spectral Barron space, which can be viewed as an analog of the classical Sobolev regularity theory for PDEs.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/lu21a.html
https://proceedings.mlr.press/v134/lu21a.htmlExponentially Improved Dimensionality Reduction for l1: Subspace Embeddings and Independence TestingDespite many applications, dimensionality reduction in the $\ell_1$-norm is much less understood than in the Euclidean norm. We give two new oblivious dimensionality reduction techniques for the $\ell_1$-norm which improve exponentially over prior ones:
- We design a distribution over random matrices $S \in \mathbb{R}^{r \times n}$, where $r = 2^{\textrm{poly}(d/(\varepsilon \delta))}$, such that given any matrix $A \in \mathbb{R}^{n \times d}$, with probability at least $1-\delta$, simultaneously for all $x$, $\|SAx\|_1 = (1 \pm \varepsilon)\|Ax\|_1$. Note that $S$ is linear, does not depend on $A$, and maps $\ell_1$ into $\ell_1$. Our distribution provides an exponential improvement on the previous best known map of Wang and Woodruff (SODA, 2019), which required $r = 2^{2^{\Omega(d)}}$, even for constant $\varepsilon$ and $\delta$. Our bound is optimal, up to a polynomial factor in the exponent, given a known $2^{\textrm{poly}(d)}$ lower bound for constant $\varepsilon$ and $\delta$.
- We design a distribution over matrices $S \in \mathbb{R}^{k \times n}$, where $k = 2^{O(q^2)}(\varepsilon^{-1} q \log d)^{O(q)}$, such that given any $q$-mode tensor $A \in (\mathbb{R}^{d})^{\otimes q}$, one can estimate the entrywise $\ell_1$-norm $\|A\|_1$ from $S(A)$. Moreover, $S = S^1 \otimes S^2 \otimes \cdots \otimes S^q$ and so given vectors $u_1, \ldots, u_q \in \mathbb{R}^d$, one can compute $S(u_1 \otimes u_2 \otimes \cdots \otimes u_q)$ in time $2^{O(q^2)}(\varepsilon^{-1} q \log d)^{O(q)}$, which is much faster than the $d^q$ time required to form $u_1 \otimes u_2 \otimes \cdots \otimes u_q$. Our linear map gives a streaming algorithm for independence testing using space $2^{O(q^2)}(\varepsilon^{-1} q \log d)^{O(q)}$, improving the previous doubly exponential $(\varepsilon^{-1} \log d)^{q^{O(q)}}$ space bound of Braverman and Ostrovsky (STOC, 2010).
For subspace embeddings, we also study the setting when $A$ is itself drawn from distributions with independent entries, and obtain a polynomial embedding dimension. For independence testing, we also give algorithms for any distance measure with a polylogarithmic-sized sketch and satisfying an approximate triangle inequality.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/li21c.html
https://proceedings.mlr.press/v134/li21c.htmlSoftmax Policy Gradient Methods Can Take Exponential Time to ConvergeThe softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take \[ \frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} \] iterations to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.Wed, 21 Jul 2021 00:00:00 +0000
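For intuition, the update the abstract analyzes can be sketched on a single-state MDP (i.e., a bandit with known rewards), where the exact softmax PG step has a simple closed form; the three-action reward vector and stepsize below are illustrative, not the paper's lower-bound construction:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_pg_step(theta, r, eta):
    """One exact-gradient ascent step on V(theta) = E_{a ~ pi_theta}[r_a]
    under softmax parameterization (single-state MDP, for brevity)."""
    pi = softmax(theta)
    grad = pi * (r - pi @ r)  # dV/dtheta_a = pi_a * (r_a - E_pi[r])
    return theta + eta * grad

r = np.array([1.0, 0.9, 0.1])  # three actions, echoing the three-action MDP
theta = np.zeros(3)            # benign (uniform) initialization
for _ in range(5000):
    theta = softmax_pg_step(theta, r, eta=1.0)
pi = softmax(theta)            # slowly concentrates on the best action
```

Even in this trivial setting, the probability mass moves toward the best action only at a rate governed by the reward gaps, which is the dynamic the paper's exponential lower bound amplifies across states.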
https://proceedings.mlr.press/v134/li21b.html
https://proceedings.mlr.press/v134/li21b.htmlStochastic Approximation for Online Tensorial Independent Component AnalysisIndependent component analysis (ICA) has been a popular dimension reduction tool in statistical machine learning and signal processing. In this paper, we present a convergence analysis for an online tensorial ICA algorithm, by viewing the problem as a nonconvex stochastic approximation problem. For estimating one component, we provide a dynamics-based analysis to prove that our online tensorial ICA algorithm with a specific choice of stepsize achieves a sharp finite-sample error bound. In particular, under a mild assumption on the data-generating distribution and a scaling condition such that $d^4/T$ is sufficiently small up to a polylogarithmic factor of data dimension $d$ and sample size $T$, a sharp finite-sample error bound of $\tilde{O}(\sqrt{d/T})$ can be obtained.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/li21a.html
https://proceedings.mlr.press/v134/lee21a.htmlStructured Logconcave Sampling with a Restricted Gaussian OracleWe give algorithms for sampling several structured logconcave families to high accuracy. We further develop a reduction framework, inspired by proximal point methods in convex optimization, which bootstraps samplers for regularized densities to generically improve dependences on problem conditioning $\kappa$ from polynomial to linear. A key ingredient in our framework is the notion of a "restricted Gaussian oracle" (RGO) for $g: \mathbb{R}^d \rightarrow \mathbb{R}$, which is a sampler for distributions whose negative log-likelihood sums a quadratic (in a multiple of the identity) and $g$. By combining our reduction framework with our new samplers, we obtain the following bounds for sampling structured distributions to total variation distance $\epsilon$.
For composite densities $\exp(-f(x) - g(x))$, where $f$ has condition number $\kappa$ and convex (but possibly non-smooth) $g$ admits an RGO, we obtain a mixing time of $O(\kappa d \log^3\tfrac{\kappa d}{\epsilon})$, matching the state-of-the-art non-composite bound of Lee et al. No composite samplers with better mixing than general-purpose logconcave samplers were previously known.
For logconcave finite sums $\exp(-F(x))$, where $F(x) = \tfrac{1}{n}\sum_{i \in [n]} f_i(x)$ has condition number $\kappa$, we give a sampler querying $\widetilde{O}(n + \kappa\max(d, \sqrt{nd}))$ gradient oracles\footnote{For convenience of exposition, the $\widetilde{O}$ notation hides logarithmic factors in the dimension $d$, problem conditioning $\kappa$, desired accuracy $\epsilon$, and summand count $n$ (when applicable). A first-order (gradient) oracle for $f:\mathbb{R}^d \rightarrow \mathbb{R}$ returns $(f(x), \nabla f(x))$ on input $x$, and a zeroth-order (value) oracle only returns $f(x)$.} to $\{f_i\}_{i \in [n]}$. No high-accuracy samplers with nontrivial gradient query complexity were previously known.
For densities with condition number $\kappa$, we give an algorithm obtaining mixing time $O(\kappa d \log^2\tfrac{\kappa d}{\epsilon})$, improving on Lee et al. by a logarithmic factor with a significantly simpler analysis. We also show a zeroth-order algorithm attains the same query complexity.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/lee21a.html
https://proceedings.mlr.press/v134/lee21a.htmlMirror Descent and the Information RatioWe establish a connection between the stability of mirror descent and the information ratio by Russo and Van Roy (2014). Our analysis shows that mirror descent with suitable loss estimators and exploratory distributions enjoys the same bound on the adversarial regret as the bounds on the Bayesian regret for information-directed sampling. Along the way, we develop the theory for information-directed sampling and provide an efficient algorithm for adversarial bandits for which the regret upper bound matches exactly the best known information-theoretic upper bound. Keywords: Bandits, partial monitoring, mirror descent, information theory.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/lattimore21b.html
https://proceedings.mlr.press/v134/lattimore21b.htmlImproved Regret for Zeroth-Order Stochastic Convex BanditsWe present an efficient algorithm for stochastic bandit convex optimisation with no assumptions on smoothness or strong convexity and for which the regret is bounded by $O(d^{4.5} \sqrt{n}\,\mathrm{polylog}(n))$, where $n$ is the number of interactions and $d$ is the dimension.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/lattimore21a.html
https://proceedings.mlr.press/v134/lattimore21a.htmlProjected Stochastic Gradient Langevin Algorithms for Constrained Sampling and Non-Convex LearningLangevin algorithms are gradient descent methods with additive noise. They have been used for decades in Markov Chain Monte Carlo (MCMC) sampling, optimization, and learning. Their convergence properties for unconstrained non-convex optimization and learning problems have been studied widely in the last few years. Other work has examined projected Langevin algorithms for sampling from log-concave distributions restricted to convex compact sets. For learning and optimization, log-concave distributions correspond to convex losses. In this paper, we analyze the case of non-convex losses with compact convex constraint sets and IID external data variables. We term the resulting method the projected stochastic gradient Langevin algorithm (PSGLA). We show the algorithm achieves a deviation of $O(T^{-1/4}(\log T)^{1/2})$ from its target distribution in 1-Wasserstein distance. For optimization and learning, we show that the algorithm achieves $\epsilon$-suboptimal solutions, on average, provided that it is run for a time that is polynomial in $\epsilon$ and slightly super-exponential in the problem dimension.Wed, 21 Jul 2021 00:00:00 +0000
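The update the abstract describes (a gradient step, additive Gaussian noise, then projection onto the constraint set) can be sketched as follows; the ball constraint, the toy non-convex loss, and the inverse temperature `beta` are illustrative choices, not the paper's:

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the compact convex set {||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def psgla(grad, x0, eta, beta, steps, rng):
    """PSGLA-style sketch:
    x <- Proj(x - eta * grad(x) + sqrt(2 * eta / beta) * N(0, I))."""
    x = x0.copy()
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        x = project_ball(x - eta * grad(x) + np.sqrt(2.0 * eta / beta) * noise)
    return x

rng = np.random.default_rng(0)
# non-convex loss f(x) = ||x||^4 / 4 - ||x||^2 / 2, with grad = (||x||^2 - 1) x
grad = lambda x: (x @ x - 1.0) * x
x = psgla(grad, x0=np.ones(2), eta=0.01, beta=50.0, steps=5000, rng=rng)
```

The projection keeps every iterate inside the constraint set regardless of the noise, which is what distinguishes this from unconstrained Langevin dynamics.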
https://proceedings.mlr.press/v134/lamperski21a.html
https://proceedings.mlr.press/v134/lamperski21a.htmlOn the Minimal Error of Empirical Risk MinimizationWe study the minimal squared error of the Empirical Risk Minimization (ERM) procedure in the task of regression, both in random and fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for performance of ERM in Donsker and non-Donsker classes. We also discuss our results through the lens of recent studies on interpolation in overparameterized models.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/kur21a.html
https://proceedings.mlr.press/v134/kur21a.htmlHypothesis testing with low-degree polynomials in the Morris class of exponential familiesAnalysis of low-degree polynomial algorithms is a powerful, newly-popular method for predicting computational thresholds in hypothesis testing problems. One limitation of current techniques for this analysis is their restriction to Bernoulli and Gaussian distributions. We expand this range of possibilities by performing the low-degree analysis of hypothesis testing for the Morris class of natural exponential families with quadratic variance function, giving a unified treatment of Gaussian, Poisson, gamma (including exponential and chi-squared), binomial (including Bernoulli), negative binomial (including geometric), and generalized hyperbolic secant distributions. We then give several algorithmic applications. 1. In models where a random signal is observed through coordinatewise-independent noise applied in an exponential family, the success or failure of low-degree polynomials is governed by the z-score overlap, the inner product of z-score vectors with respect to the null distribution of two independent copies of the signal. 2. In the same models, testing with low-degree polynomials exhibits channel monotonicity: the above distributions admit a total ordering by computational cost of hypothesis testing, according to a scalar parameter describing how the variance depends on the mean in an exponential family. 3. In a spiked matrix model with a particular non-Gaussian noise distribution, the low-degree prediction is incorrect unless polynomials with arbitrarily large degree in individual matrix entries are permitted. This shows that polynomials summing over self-avoiding walks and variants thereof, as proposed recently by Ding, Hopkins, and Steurer (2020) for spiked matrix models with heavy-tailed noise, are strictly suboptimal for this model. 
Thus low-degree polynomials appear to offer a tradeoff between robustness and strong performance fine-tuned to specific models. Inspired by this, we suggest that a class of problems requiring "exploration before inference," where an algorithm must first examine the input and then use some intermediate computation to choose a suitable inference subroutine, appears especially difficult for low-degree polynomials.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/kunisky21a.html
https://proceedings.mlr.press/v134/kunisky21a.htmlAsymptotically Optimal Information-Directed SamplingWe introduce a simple and efficient algorithm for stochastic linear bandits with finitely many actions that is asymptotically optimal and (nearly) worst-case optimal in finite time. The approach is based on the frequentist information-directed sampling (IDS) framework, with a surrogate for the information gain that is informed by the optimization problem that defines the asymptotic lower bound. Our analysis sheds light on how IDS balances the trade-off between regret and information and uncovers a surprising connection between the recently proposed primal-dual methods and the IDS algorithm. We demonstrate empirically that IDS is competitive with UCB in finite time, and can be significantly better in the asymptotic regime.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/kirschner21a.html
https://proceedings.mlr.press/v134/kirschner21a.htmlThe Sparse Vector Technique, RevisitedWe revisit one of the most basic and widely applicable techniques in the literature of differential privacy – the sparse vector technique [Dwork et al., STOC 2009]. This simple algorithm privately tests whether the value of a given query on a database is close to what we expect it to be. It allows asking an unbounded number of queries as long as the answer is close to what we expect, and halts following the first query for which this is not the case. We suggest an alternative, equally simple, algorithm that can continue testing queries as long as no single individual contributes to the answer of too many queries whose answer deviates substantially from what we expect. Our analysis is subtle and some of its ingredients may be more widely applicable. In some cases our new algorithm allows one to privately extract much more information from the database than the original. We demonstrate this by applying our algorithm to the shifting-heavy-hitters problem: On every time step, each of n users gets a new input, and the task is to privately identify all the current heavy-hitters. That is, on time step i, the goal is to identify all data elements x such that many of the users have x as their current input. We present an algorithm for this problem with improved error guarantees over what can be obtained using existing techniques. Specifically, the error of our algorithm depends on the maximal number of times that a single user holds a heavy-hitter as input, rather than the total number of times in which a heavy-hitter exists.Wed, 21 Jul 2021 00:00:00 +0000
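The classic algorithm being revisited can be sketched as the textbook AboveThreshold procedure (this is the standard version, not the paper's new variant; the Laplace scales shown are the usual ones for sensitivity-1 queries):

```python
import numpy as np

def above_threshold(queries, database, threshold, epsilon, rng):
    """AboveThreshold / sparse vector sketch: privately scan a stream of
    sensitivity-1 queries and halt at the first noisily-above-threshold one."""
    noisy_threshold = threshold + rng.laplace(scale=2.0 / epsilon)
    for i, q in enumerate(queries):
        if q(database) + rng.laplace(scale=4.0 / epsilon) >= noisy_threshold:
            return i  # first query whose answer deviates from what we expect
    return None  # every answer stayed (noisily) below the threshold
```

The key property is that privacy cost is paid only for the single halting query, no matter how many below-threshold queries were answered before it.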
https://proceedings.mlr.press/v134/kaplan21a.html
https://proceedings.mlr.press/v134/kaplan21a.html(Nearly) Dimension Independent Private ERM with AdaGrad Rates via Publicly Estimated SubspacesWe revisit the problem of empirical risk minimization (ERM) with differential privacy. We show that noisy AdaGrad, given appropriate knowledge and conditions on the subspace from which gradients can be drawn, achieves a regret comparable to traditional AdaGrad plus a well-controlled term due to noise. We show a convergence rate of $O(\mathrm{tr}(G_T)/T)$, where $G_T$ captures the geometry of the gradient subspace. Since $\mathrm{tr}(G_T)=O(\sqrt{T})$ we can obtain faster rates for convex and Lipschitz functions, compared to the $O(1/\sqrt{T})$ rate achieved by known versions of noisy (stochastic) gradient descent with comparable noise variance. In particular, we show that if the gradients lie in a known constant rank subspace, and assuming algorithmic access to an envelope which bounds decaying sensitivity, one can achieve faster convergence to an excess empirical risk of $\tilde O(1/\epsilon n)$, where $\epsilon$ is the privacy budget and $n$ the number of samples. Letting $p$ be the problem dimension, this result implies that, by running noisy Adagrad, we can bypass the DP-SGD bound $\tilde O(\sqrt{p}/\epsilon n)$ in $T=(\epsilon n)^{2/(1+2\alpha)}$ iterations, where $\alpha \geq 0$ is a parameter controlling gradient norm decay, instead of the rate achieved by SGD of $T=\epsilon^2n^2$. Our results operate with general convex functions in both constrained and unconstrained minimization. Along the way, we do a perturbation analysis of noisy AdaGrad, which is of independent interest. Our utility guarantee for the private ERM problem follows as a corollary to the regret guarantee of noisy AdaGrad.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/kairouz21a.html
https://proceedings.mlr.press/v134/kairouz21a.htmlReduced-Rank Regression with Operator Norm ErrorA common data analysis task is the reduced-rank regression problem: $$\min_{\textrm{rank-}k X} \|AX-B\|,$$ where $A \in \mathbb{R}^{n \times c}$ and $B \in \mathbb{R}^{n \times d}$ are given large matrices and $\|\cdot\|$ is some norm. Here the unknown matrix $X \in \mathbb{R}^{c \times d}$ is constrained to be of rank $k$ as it results in a significant parameter reduction of the solution when $c$ and $d$ are large. In the case of Frobenius norm error, there is a standard closed form solution to this problem and a fast algorithm to find a $(1+\varepsilon)$-approximate solution. However, for the important case of operator norm error, no closed form solution is known and the fastest known algorithms take singular value decomposition time. We give the first randomized algorithms for this problem running in time $$(\mathrm{nnz}(A) + \mathrm{nnz}(B) + c^2) \cdot k/\varepsilon^{1.5} + (n+d)k^2/\varepsilon + c^{\omega},$$ up to a polylogarithmic factor involving condition numbers, matrix dimensions, and dependence on $1/\varepsilon$. Here $\mathrm{nnz}(M)$ denotes the number of non-zero entries of a matrix $M$, and $\omega$ is the exponent of matrix multiplication. As both (1) spectral low rank approximation ($A = B$) and (2) linear system solving ($n = c$ and $d = 1$) are special cases, our time cannot be improved by more than a $1/\varepsilon$ factor (up to polylogarithmic factors) without a major breakthrough in linear algebra. Interestingly, known techniques for low rank approximation, such as alternating minimization or sketch-and-solve, provably fail for this problem. Instead, our algorithm uses an existential characterization of a solution, together with Krylov methods, low degree polynomial approximation, and sketching-based preconditioning.Wed, 21 Jul 2021 00:00:00 +0000
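For the Frobenius-norm case the abstract mentions, one standard closed form (project $B$ onto the column space of $A$, truncate to rank $k$, pull back through the pseudoinverse) can be sketched as below; function and variable names are ours:

```python
import numpy as np

def reduced_rank_regression(A, B, k):
    """Standard closed-form Frobenius-norm solution sketch:
    X = A^+ [P_A B]_k, where P_A projects onto col(A) and [.]_k is the
    best rank-k approximation (truncated SVD)."""
    A_pinv = np.linalg.pinv(A)
    P_B = A @ A_pinv @ B                      # project B onto col(A)
    U, s, Vt = np.linalg.svd(P_B, full_matrices=False)
    P_B_k = (U[:, :k] * s[:k]) @ Vt[:k]       # best rank-k approx of P_A B
    return A_pinv @ P_B_k                     # rank-<=k minimizer X
```

Note that this route runs through a full SVD, which is exactly the cost the paper's randomized operator-norm algorithms are designed to beat.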
https://proceedings.mlr.press/v134/kacham21a.html
https://proceedings.mlr.press/v134/kacham21a.htmlMoment Multicalibration for Uncertainty EstimationWe show how to achieve the notion of "multicalibration" from Hebert-Johnson et al. (2018) not just for means, but also for variances and other higher moments. Informally, this means that we can find regression functions which, given a data point, can make point predictions not just for the expectation of its label, but for higher moments of its label distribution as well—and those predictions match the true distribution quantities when averaged not just over the population as a whole, but also when averaged over an enormous number of finely defined subgroups. It yields a principled way to estimate the uncertainty of predictions on many different subgroups—and to diagnose potential sources of unfairness in the predictive power of features across subgroups. As an application, we show that our moment estimates can be used to derive marginal prediction intervals that are simultaneously valid as averaged over all of the (sufficiently large) subgroups for which moment multicalibration has been obtained.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/jung21a.html
https://proceedings.mlr.press/v134/jung21a.htmlDouble Explore-then-Commit: Asymptotic Optimality and BeyondWe study the multi-armed bandit problem with sub-Gaussian rewards. The explore-then-commit (ETC) strategy, which consists of an exploration phase followed by an exploitation phase, is one of the most widely used algorithms in a variety of online decision applications. Nevertheless, it has been shown in \cite{garivier2016explore} that ETC is suboptimal in the asymptotic sense as the horizon grows, and thus, is worse than fully sequential strategies such as Upper Confidence Bound (UCB). In this paper, we show that a variant of ETC algorithm can actually achieve the asymptotic optimality for multi-armed bandit problems as UCB-type algorithms do and extend it to the batched bandit setting. Specifically, we propose a double explore-then-commit (DETC) algorithm that has two exploration and exploitation phases and prove that DETC achieves the asymptotically optimal regret bound. To our knowledge, DETC is the first non-fully-sequential algorithm that achieves such asymptotic optimality. In addition, we extend DETC to batched bandit problems, where (i) the exploration process is split into a small number of batches and (ii) the round complexity is of central interest. We prove that a batched version of DETC can achieve the asymptotic optimality with only a constant round complexity. This is the first batched bandit algorithm that can attain the optimal asymptotic regret bound and optimal round complexity simultaneously.Wed, 21 Jul 2021 00:00:00 +0000
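The vanilla ETC baseline that DETC builds on can be sketched as follows (the arm means, noise level, and exploration budget `m` are illustrative; DETC itself adds a second explore/exploit round, which is not shown here):

```python
import numpy as np

def explore_then_commit(pull, n_arms, m, horizon):
    """Vanilla ETC sketch: sample each arm m times, then commit to the
    empirically best arm for the remaining rounds."""
    totals = np.zeros(n_arms)
    reward = 0.0
    for t in range(m * n_arms):           # exploration phase
        a = t % n_arms
        r = pull(a)
        totals[a] += r
        reward += r
    best = int(np.argmax(totals))         # commit phase
    for _ in range(horizon - m * n_arms):
        reward += pull(best)
    return best, reward

rng = np.random.default_rng(1)
means = [0.2, 0.8]                        # Gaussian (hence sub-Gaussian) rewards
pull = lambda a: rng.normal(means[a], 0.1)
best, reward = explore_then_commit(pull, n_arms=2, m=50, horizon=1000)
```

The suboptimality the abstract cites comes from fixing `m` in advance; DETC's second exploration phase is what recovers the asymptotically optimal regret.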
https://proceedings.mlr.press/v134/jin21a.html
https://proceedings.mlr.press/v134/jin21a.htmlParameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret BoundsThis paper presents multi-armed bandit (MAB) algorithms that work well in adversarial environments and that offer improved performance by exploiting inherent structures in such environments, such as stochastic generative models, as well as small variations in loss vectors. The fundamental aim of this work is to overcome the limitation of worst-case analyses in MAB contexts. Two basic approaches achieve this purpose: best-of-both-worlds algorithms that work well for both stochastic and adversarial settings, and data-dependent regret bounds that work well depending on certain difficulty indicators w.r.t. loss sequences. One remarkable study of the best-of-both-worlds approach is the Tsallis-INF algorithm \citep{zimmert2019optimal}, which achieves nearly optimal regret bounds up to small constants in both settings, though such bounds have remained unproven for a special case of a stochastic setting with multiple optimal arms. This paper offers two particular contributions: (i) We show that the Tsallis-INF algorithm enjoys a regret bound of a logarithmic order in the number of rounds for stochastic environments, even if the best arm is not unique. (ii) We provide a new algorithm with a new \textit{hybrid} regret bound that implies logarithmic regret in the stochastic regime and multiple data-dependent regret bounds in the adversarial regime, including bounds dependent on cumulative loss, total variation, and loss-sequence path-length. Both our proposed algorithm and the Tsallis-INF algorithm are based on a follow-the-regularized-leader (FTRL) framework with a time-varying regularizer. The analyses in this paper rely on \textit{skewed Bregman divergence}, which provides simple expressions of regret bounds for FTRL with a time-varying regularizer.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/ito21a.html
https://proceedings.mlr.press/v134/ito21a.htmlGroup testing and local search: is there a computational-statistical gap?Group testing is a fundamental problem in statistical inference with many real-world applications, including the need for massive group testing during the ongoing COVID-19 pandemic. In this paper we study the task of approximate recovery, in which we tolerate having a small number of incorrectly classified items. One of the most well-known, optimal, and easy to implement testing procedures is the non-adaptive Bernoulli group testing, where all tests are conducted in parallel, and each item is chosen to be part of any certain test independently with some fixed probability. In this setting, there is an observed gap between the number of tests above which recovery is information theoretically possible, and the number of tests required by the currently best known efficient algorithms to succeed. In this paper we seek to understand whether this computational-statistical gap can be closed. Our main contributions are the following: 1. Oftentimes such gaps are explained by a phase transition in the landscape of the solution space of the problem (an Overlap Gap Property (OGP) phase transition). We provide first-moment evidence that, perhaps surprisingly, such a phase transition does not take place throughout the regime for which recovery is information theoretically possible. This fact suggests that the model is in fact amenable to local search algorithms. 2. We prove the complete absence of “bad” local minima for a part of the “hard” regime, a fact which implies an improvement over known theoretical results on the performance of efficient algorithms for approximate recovery without false-negatives. 3. 
Finally, motivated by the evidence for the absence of the OGP, we present extensive simulations that strongly suggest that a very simple local algorithm known as Glauber Dynamics does indeed succeed, and can be used to efficiently implement the well-known (theoretically optimal) Smallest Satisfying Set (SSS) estimator. Given that practical algorithms for this task utilize Branch and Bound and Linear Programming relaxation techniques, our finding could potentially be of practical interest.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/iliopoulos21a.html
https://proceedings.mlr.press/v134/iliopoulos21a.htmlStreaming k-PCA: Efficient guarantees for Oja’s algorithm, beyond rank-one updatesWe analyze Oja’s algorithm for streaming $k$-PCA, and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. $d \times d$ symmetric matrices, we show that Oja’s algorithm can obtain an accurate approximation to the subspace of the top $k$ eigenvectors of their expectation using a number of samples that scales polylogarithmically with $d$. Previously, such a result was only known in the case where the updates have rank one. Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm’s execution.Wed, 21 Jul 2021 00:00:00 +0000
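Oja's algorithm in its block form for $k$-PCA can be sketched as a gradient step followed by QR re-orthonormalization; the diagonal expectation, noise model, and stepsize below are a toy stand-in for the i.i.d. symmetric-matrix stream the abstract describes:

```python
import numpy as np

def oja_kpca(stream, d, k, eta, rng):
    """Oja-style streaming k-PCA sketch: take a gradient step on the current
    basis with each incoming symmetric matrix, then re-orthonormalize via QR."""
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for A in stream:
        W, _ = np.linalg.qr(W + eta * (A @ W))
    return W  # d x k orthonormal basis approximating the top-k eigenspace

rng = np.random.default_rng(0)
d, k = 8, 2
Sigma = np.diag([5.0, 4.0] + [0.1] * (d - 2))

def sample():
    # i.i.d. symmetric matrix with expectation Sigma (beyond rank-one updates)
    M = rng.standard_normal((d, d))
    return Sigma + 0.05 * (M + M.T) / 2

W = oja_kpca((sample() for _ in range(3000)), d, k, eta=0.05, rng=rng)
```

Each update touches only the current $d \times k$ basis and the incoming sample, which is what makes the method streaming; the paper's contribution is the sample-complexity analysis of exactly this kind of update beyond rank-one inputs.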
https://proceedings.mlr.press/v134/huang21a.html
https://proceedings.mlr.press/v134/huang21a.htmlFast Rates for the Regret of Offline Reinforcement LearningWe study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest a $O(1/\sqrt{n})$ convergence for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate’s pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/hu21a.html
https://proceedings.mlr.press/v134/hu21a.htmlOn the Approximation Power of Two-Layer Networks of Random ReLUsThis paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper- and lower-bounds for L2-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower-bounds are based on (robust versions of) dimensionality arguments.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/hsu21a.html
https://proceedings.mlr.press/v134/hsu21a.htmlAdaptive Learning in Continuous Games: Optimal Regret Bounds and Convergence to Nash EquilibriumIn game-theoretic learning, several agents are simultaneously following their individual interests, so the environment is non-stationary from each player’s perspective. In this context, the performance of a learning algorithm is often measured by its regret. However, no-regret algorithms are not created equal in terms of game-theoretic guarantees: depending on how they are tuned, some of them may drive the system to an equilibrium, while others could produce cyclic, chaotic, or otherwise divergent trajectories. To account for this, we propose a range of no-regret policies based on optimistic mirror descent, with the following desirable properties: (\emph{i}) they do not require \emph{any} prior tuning or knowledge of the game; (\emph{ii}) they all achieve $\mathcal{O}(\sqrt{T})$ regret against arbitrary, adversarial opponents; and (\emph{iii}) they converge to the best response against convergent opponents. Also, if employed by all players, then (\emph{iv}) they guarantee $\mathcal{O}(1)$ \emph{social} regret; while (\emph{v}) the induced sequence of play converges to Nash equilibrium with $\mathcal{O}(1)$ \emph{individual} regret in all variationally stable games (a class of games that includes all monotone and convex-concave zero-sum games).Wed, 21 Jul 2021 00:00:00 +0000
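The Euclidean member of the optimistic mirror descent family can be sketched as optimistic gradient descent on a bilinear game, where it converges while plain gradient descent-ascent cycles; this is an illustrative instance, not the paper's adaptively tuned policies:

```python
import numpy as np

def optimistic_gd(A, eta, steps):
    """Optimistic gradient sketch for the bilinear game min_x max_y x^T A y:
    each player steps with 2*g_t - g_{t-1}, a one-step gradient prediction."""
    n, m = A.shape
    x, y = np.full(n, 0.5), np.full(m, 0.5)
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(steps):
        gx, gy = A @ y, A.T @ x
        x = x - eta * (2 * gx - gx_prev)   # minimizing player
        y = y + eta * (2 * gy - gy_prev)   # maximizing player
        gx_prev, gy_prev = gx, gy
    return x, y

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # cyclic game: plain GDA spirals out
x, y = optimistic_gd(A, eta=0.1, steps=2000)
```

The extra `2*g_t - g_{t-1}` term is the "optimism": it predicts the next gradient from the previous one, which damps the rotation that makes naive gradient play diverge on such games.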
https://proceedings.mlr.press/v134/hsieh21a.html
Bounded Memory Active Learning through Enriched Queries
The explosive growth of easily-accessible unlabeled data has led to growing interest in \emph{active learning}, a paradigm in which data-hungry learning algorithms adaptively select informative examples in order to lower prohibitively expensive labeling costs. Unfortunately, in standard worst-case models of learning, the active setting often provides no improvement over non-adaptive algorithms. To combat this, a series of recent works have considered a model in which the learner may ask \emph{enriched} queries beyond labels. While such models have seen success in drastically lowering label costs, they tend to come at the expense of requiring large amounts of memory. In this work, we study what families of classifiers can be learned in \emph{bounded memory}. To this end, we introduce a novel streaming variant of enriched-query active learning along with a natural combinatorial parameter called \emph{lossless sample compression} that is sufficient for learning not only with bounded memory, but in a query-optimal and computationally efficient manner as well. Finally, we give three fundamental examples of classifier families with small, easy-to-compute lossless compression schemes when given access to basic enriched queries: axis-aligned rectangles, decision trees, and halfspaces in two dimensions.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/hopkins21a.html
Shape Matters: Understanding the Implicit Bias of the Noise Covariance
The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate that parameter-dependent noise (induced by mini-batches or label perturbation) is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an overparameterized setting, SGD with label noise recovers the sparse ground truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/haochen21a.html
Online Learning with Simple Predictors and a Combinatorial Characterization of Minimax in 0/1 Games
Which classes can be learned properly in the online model, that is, by an algorithm that on each round uses a predictor from the concept class? While there are simple and natural cases where improper learning is useful and even necessary, it is natural to ask how complex the improper predictors must be in such cases. Can one always achieve nearly optimal mistake/regret bounds using "simple" predictors? In this work, we give a complete characterization of when this is possible, thus settling an open problem which has been studied since the pioneering works of Angluin (1987) and Littlestone (1988). More precisely, given any concept class C and any hypothesis class H, we provide nearly tight bounds (up to a log factor) on the optimal mistake bounds for online learning C using predictors from H. Our bound yields an exponential improvement over the previously best known bound by Chase and Freitag (2020). As applications, we give constructive proofs showing that (i) in the realizable setting, a near-optimal mistake bound (up to a constant factor) can be attained by a sparse majority-vote of proper predictors, and (ii) in the agnostic setting, a near-optimal regret bound (up to a log factor) can be attained by a randomized proper algorithm. The latter was proven non-constructively by Rakhlin, Sridharan, and Tewari (2015). It was also achieved by constructive but improper algorithms proposed by Ben-David, Pál, and Shalev-Shwartz (2009) and Rakhlin, Shamir, and Sridharan (2012). A technical ingredient of our proof, which may be of independent interest, is a generalization of the celebrated Minimax Theorem (von Neumann, 1928) for binary zero-sum games with arbitrary action sets. A simple game which fails to satisfy Minimax is "Guess the Larger Number". In this game, each player picks a natural number and the player who picked the larger number wins. Equivalently, the payoff matrix of this game is an infinite triangular matrix. We show that this is the only obstruction: if the payoff matrix does not contain triangular submatrices of unbounded sizes, then the Minimax theorem is satisfied. This generalizes von Neumann’s Minimax Theorem by removing requirements of finiteness (or compactness) of the action sets, and moreover it captures precisely the types of games of interest in online learning: namely, Littlestone games.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/hanneke21a.html
Generalizing Complex Hypotheses on Product Distributions: Auctions, Prophet Inequalities, and Pandora’s Problem
This paper explores a theory of generalization for learning problems on product distributions, complementing the existing learning theories in the sense that it does not rely on any complexity measures of the hypothesis classes. The main contributions are two general sample complexity bounds: (1) $\tilde{O} \big( \frac{nk}{\epsilon^2} \big)$ samples are sufficient and necessary for learning an $\epsilon$-optimal hypothesis in \emph{any problem} on an $n$-dimensional product distribution, whose marginals have finite supports of sizes at most $k$; (2) $\tilde{O} \big( \frac{n}{\epsilon^2} \big)$ samples are sufficient and necessary for any problem on $n$-dimensional product distributions if it satisfies a notion of strong monotonicity from the algorithmic game theory literature. As applications of these theories, we match the optimal sample complexity for single-parameter revenue maximization (Guo et al., STOC 2019), improve the state-of-the-art for multi-parameter revenue maximization (Gonczarowski and Weinberg, FOCS 2018) and prophet inequality (Correa et al., EC 2019; Rubinstein et al., ITCS 2020), and provide the first and tight sample complexity bound for Pandora’s problem.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/guo21a.html
PAC-Bayes, MAC-Bayes and Conditional Mutual Information: Fast rate bounds that handle general VC classes
We give a novel, unified derivation of conditional PAC-Bayesian and mutual information (MI) generalization bounds. We derive conditional MI bounds as an instance, with special choice of prior, of conditional MAC-Bayesian (Mean Approximately Correct) bounds, itself derived from conditional PAC-Bayesian bounds, where ‘conditional’ means that one can use priors conditioned on a joint training and ghost sample. This allows us to get nontrivial PAC-Bayes and MI-style bounds for general VC classes, something recently shown to be impossible with standard PAC-Bayesian/MI bounds. Second, it allows us to get fast rates of order $O((\text{KL}/n)^{\gamma})$ for $\gamma > 1/2$ if a Bernstein condition holds and for exp-concave losses (with $\gamma=1$), which is impossible with both standard PAC-Bayes generalization and MI bounds. Our work extends the recent work by Steinke and Zakynthinou (2020) who handle MI with VC but neither PAC-Bayes nor fast rates and Mhammedi et al. (2019) who initiated fast rate PAC-Bayes generalization error bounds but handle neither MI nor general VC classes.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/grunwald21a.html
Source Identification for Mixtures of Product Distributions
We give an algorithm for source identification of a mixture of $k$ product distributions on $n$ bits. This is a fundamental problem in machine learning with many applications. Our algorithm identifies the source parameters of an identifiable mixture, given, as input, approximate values of multilinear moments (derived, for instance, from a sufficiently large sample), using $2^{O(k^2)}n^{O(k)}$ arithmetic operations. Our result is the first explicit bound on the computational complexity of source identification of such mixtures. The running time improves previous results by Feldman, O’Donnell, and Servedio (FOCS 2005) and Chen and Moitra (STOC 2019) that guaranteed only learning the mixture (without parametric identification of the source). Our analysis gives a quantitative version of a qualitative characterization of identifiable sources that is due to Tahmasebi, Motahari, and Maddah-Ali (ISIT 2018).
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/gordon21a.html
Differentially Private Nonparametric Regression Under a Growth Condition
Given a real-valued hypothesis class $H$, we investigate under what conditions there is a differentially private algorithm which learns an optimal hypothesis from $H$ given i.i.d. data. Inspired by recent results for the related setting of binary classification (Alon et al., 2019; Bun et al., 2020), where it was shown that online learnability of a binary class is necessary and sufficient for its private learnability, Jung et al. (2020) showed that in the setting of regression, online learnability of $H$ is necessary for private learnability. Here online learnability of $H$ is characterized by the finiteness of its $\eta$-sequential fat-shattering dimension, $\mathrm{sfat}_\eta(H)$, for all $\eta > 0$. In terms of sufficient conditions for private learnability, Jung et al. (2020) showed that $H$ is privately learnable if $\lim_{\eta \to 0} \mathrm{sfat}_\eta(H)$ is finite, which is a fairly restrictive condition. We show that under the relaxed condition $\liminf_{\eta \to 0} \eta \cdot \mathrm{sfat}_\eta(H) = 0$, $H$ is privately learnable, thus establishing the first nonparametric private learnability guarantee for classes $H$ with $\mathrm{sfat}_\eta(H)$ diverging as $\eta \to 0$. Our techniques involve a novel filtering procedure to output stable hypotheses for nonparametric function classes.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/golowich21a.html
Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information
In this paper, we examine the Nash equilibrium convergence properties of no-regret learning in general $N$-player games. For concreteness, we focus on the archetypal “follow the regularized leader” (FTRL) family of algorithms, and we consider the full spectrum of uncertainty that the players may encounter – from noisy, oracle-based feedback, to bandit, payoff-based information. In this general context, we establish a comprehensive equivalence between the stability of a Nash equilibrium and its support: a Nash equilibrium is stable and attracting with arbitrarily high probability if and only if it is strict (i.e., each equilibrium strategy has a unique best response). This equivalence extends existing continuous-time versions of the “folk theorem” of evolutionary game theory to a bona fide algorithmic learning setting, and it provides a clear refinement criterion for the prediction of the day-to-day behavior of no-regret learning in games.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/giannou21a.html
On Avoiding the Union Bound When Answering Multiple Differentially Private Queries
In this work, we study the problem of answering $k$ queries with $(\epsilon, \delta)$-differential privacy, where each query has sensitivity one. We give an algorithm for this task that achieves an expected $\ell_\infty$ error bound of $O(\frac{1}{\epsilon}\sqrt{k \log \frac{1}{\delta}})$, which is known to be tight (Steinke and Ullman, 2016). A very recent work by Dagan and Kur (2020) provides a similar result, albeit via a completely different approach. One difference between our work and theirs is that our guarantee holds even when $\delta < 2^{-\Omega(k/(\log k)^8)}$ whereas theirs does not apply in this case. On the other hand, the algorithm of Dagan and Kur (2020) has a remarkable advantage that the $\ell_{\infty}$ error bound of $O(\frac{1}{\epsilon}\sqrt{k \log \frac{1}{\delta}})$ holds not only in expectation but always (i.e., with probability one) while we can only get a high probability (or expected) guarantee on the error.
Wed, 21 Jul 2021 00:00:00 +0000
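For context, the classical baseline for this task is the Gaussian mechanism: the vector of $k$ sensitivity-one queries has $\ell_2$ sensitivity $\sqrt{k}$, so adding i.i.d. $N(0, \sigma^2)$ noise with $\sigma = \sqrt{2k \ln(1.25/\delta)}/\epsilon$ suffices, and a union bound over the $k$ coordinates then costs an extra $\sqrt{\log k}$ factor in the $\ell_\infty$ error, which is exactly the factor the result above removes. A minimal sketch of that baseline (the classical mechanism, not the paper's algorithm):

```python
import numpy as np

def gaussian_mechanism(true_answers, eps, delta, rng):
    """Answer k sensitivity-1 queries with (eps, delta)-DP by adding i.i.d.
    Gaussian noise calibrated to the l2 sensitivity sqrt(k).
    (Classical baseline; not the algorithm of the paper above.)"""
    k = len(true_answers)
    sigma = np.sqrt(2 * k * np.log(1.25 / delta)) / eps
    return true_answers + rng.normal(scale=sigma, size=k), sigma

rng = np.random.default_rng(0)
truth = np.zeros(1000)                       # illustrative query answers
noisy, sigma = gaussian_mechanism(truth, eps=1.0, delta=1e-6, rng=rng)
print(sigma, np.max(np.abs(noisy - truth)))  # l_inf error ~ sigma * sqrt(2 log k)
```

The max of $k$ such Gaussians concentrates around $\sigma\sqrt{2\ln k}$, giving the $\ell_\infty$ error of order $\frac{1}{\epsilon}\sqrt{k \log\frac{1}{\delta}\log k}$ that union-bound-free analyses improve upon.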
https://proceedings.mlr.press/v134/ghazi21a.html
Frank-Wolfe with a Nearest Extreme Point Oracle
We consider variants of the classical Frank-Wolfe algorithm for constrained smooth convex minimization, that instead of access to the standard oracle for minimizing a linear function over the feasible set, have access to an oracle that can find an extreme point of the feasible set that is closest in Euclidean distance to a given vector. We first show that for many feasible sets of interest, such an oracle can be implemented with the same complexity as the standard linear optimization oracle. We then show that with such an oracle we can design new Frank-Wolfe variants which enjoy significantly improved complexity bounds in case the set of optimal solutions lies in the convex hull of a subset of extreme points with small diameter (e.g., a low-dimensional face of a polytope). In particular, for many $0\text{–}1$ polytopes, under quadratic growth and strict complementarity conditions, we obtain the first linearly convergent variant with rate that depends only on the dimension of the optimal face and not on the ambient dimension.
Wed, 21 Jul 2021 00:00:00 +0000
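For a concrete feasible set such as the probability simplex, both oracles are closed-form: the standard linear optimization oracle returns the vertex with the smallest gradient coordinate, and the nearest-extreme-point oracle returns $e_i$ with $i = \arg\max_i y_i$, since $\|e_i - y\|^2 = \|y\|^2 - 2y_i + 1$. The sketch below shows the two oracles alongside vanilla Frank-Wolfe on an illustrative quadratic (the paper's variants, which exploit the nearest-extreme-point oracle, are more involved):

```python
import numpy as np

def lmo_simplex(grad):
    """Linear optimization oracle: simplex vertex minimizing <grad, v>."""
    v = np.zeros_like(grad)
    v[np.argmin(grad)] = 1.0
    return v

def nearest_vertex_simplex(y):
    """Nearest-extreme-point oracle: simplex vertex closest to y in
    Euclidean distance, i.e. e_i with i = argmax_i y_i."""
    v = np.zeros_like(y)
    v[np.argmax(y)] = 1.0
    return v

# Vanilla Frank-Wolfe on f(x) = ||x - c||^2 over the simplex (illustrative).
c = np.array([0.2, 0.3, 0.5])          # optimum lies inside the simplex
x = np.array([1.0, 0.0, 0.0])          # start at a vertex
for k in range(500):
    grad = 2 * (x - c)
    s = lmo_simplex(grad)              # vertex to move toward
    gamma = 2.0 / (k + 2)              # standard open-loop step size
    x = (1 - gamma) * x + gamma * s

print(x)  # close to c
```

With step size $2/(k+2)$ the standard $O(LD^2/k)$ rate applies, so after 500 iterations the iterate is well within the simplex near $c$; the nearest-vertex oracle here costs the same $O(d)$ as the LMO, matching the abstract's point that the richer oracle is often no more expensive.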
https://proceedings.mlr.press/v134/garber21a.html
Impossibility of Partial Recovery in the Graph Alignment Problem
Random graph alignment refers to recovering the underlying vertex correspondence between two random graphs with correlated edges. This can be viewed as an average-case and noisy version of the well-known graph isomorphism problem. For the correlated Erdős-Rényi model, we prove the first impossibility result for partial recovery in the sparse regime (with constant average degree). Our bound is tight in the noiseless case (the graph isomorphism problem) and we conjecture that it is still tight with noise. Our proof technique relies on a careful application of the probabilistic method to build automorphisms between tree components of a subcritical Erdős-Rényi graph.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/ganassali21a.html
Efficient Algorithms for Learning from Coarse Labels
For many learning problems one may not have access to fine-grained label information; e.g., an image can be labeled as husky, dog, or even animal depending on the expertise of the annotator. In this work, we formalize these settings and study the problem of learning from such coarse data. Instead of observing the actual labels from a set $\mathcal{Z}$, we observe coarse labels corresponding to a partition of $\mathcal{Z}$ (or a mixture of partitions). Our main algorithmic result is that essentially any problem learnable from fine-grained labels can also be learned efficiently when the coarse data are sufficiently informative. We obtain our result through a generic reduction for answering Statistical Queries (SQ) over fine-grained labels given only coarse labels. The number of coarse labels required depends polynomially on the information distortion due to coarsening and the number of fine labels $|\mathcal{Z}|$. We also investigate the case of (infinitely many) real-valued labels, focusing on a central problem in censored and truncated statistics: Gaussian mean estimation from coarse data. We provide an efficient algorithm when the sets in the partition are convex and establish that the problem is NP-hard even for very simple non-convex sets.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/fotakis21a.html
Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems. Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/foster21a.html
Convergence rates and approximation results for SGD and its continuous-time counterpart
This paper proposes a thorough theoretical analysis of Stochastic Gradient Descent (SGD) with non-increasing step sizes. First, we show that the recursion defining SGD can be provably approximated by solutions of a time inhomogeneous Stochastic Differential Equation (SDE) using an appropriate coupling. In the specific case of batch noise we refine our results using recent advances in Stein’s method. Then, motivated by recent analyses of deterministic and stochastic optimization methods by their continuous counterpart, we study the long-time behavior of the continuous processes at hand and establish non-asymptotic bounds. To that purpose, we develop new comparison techniques which are of independent interest. Adapting these techniques to the discrete setting, we show that the same results hold for the corresponding SGD sequences. In our analysis, we notably improve non-asymptotic bounds in the convex setting for SGD under weaker assumptions than the ones considered in previous works. Finally, we also establish finite-time convergence results under various conditions, including relaxations of the famous Łojasiewicz inequality, which can be applied to a class of non-convex functions.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/fontaine21a.html
Sequential prediction under log-loss and misspecification
We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, the learner sequentially predicts the (distribution of the) next letter in the sequence, and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption that the data-generating distribution belongs to the hypothesis class as well. Here we present results in the more general misspecified case. Due to special properties of the log-loss, the same problem arises in the context of competitive optimality in density estimation and model selection. For the $d$-dimensional Gaussian location hypothesis class, we show that cumulative regrets in the well-specified and misspecified cases asymptotically coincide. In other words, we provide an $o(1)$ characterization of the distribution-free (or PAC) regret in this case – the first such result as far as we know. We recall that the worst-case (or individual-sequence) regret in this case is larger by an additive constant ${d\over 2} + o(1)$. Surprisingly, neither the traditional Bayesian estimators nor Shtarkov’s normalized maximum likelihood achieve the PAC regret, and our estimator requires a special “robustification” against heavy-tailed data. In addition, we show two general results for misspecified regret: the existence and uniqueness of the optimal estimator, and a bound sandwiching the misspecified regret between well-specified regrets with (asymptotically) close hypothesis classes.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/feder21a.html
Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs), which can be used to analyze neural network training. In this framework, a DNN is represented by probability measures and functions over its features (that is, the function values of the hidden units over the training data) in the continuous limit, instead of the neural network parameters as most existing studies have done. This new representation overcomes the degenerate situation in which all the hidden units in each middle layer essentially collapse to a single meaningful unit, leading to a simpler representation of DNNs. Moreover, we construct a non-linear dynamics called neural feature flow, which captures the evolution of an over-parameterized DNN trained by Gradient Descent. We illustrate the framework via the Residual Network (Res-Net) architecture. It is shown that when the neural feature flow process converges, it reaches a global minimal solution under suitable conditions.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/fang21a.html
Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization
Dimension is an inherent bottleneck to some modern learning tasks, where optimization methods suffer from the size of the data. In this paper, we study non-isotropic distributions of data and develop tools that aim at reducing these dimensional costs by a dependency on an effective dimension rather than the ambient one. Based on non-asymptotic estimates of the metric entropy of ellipsoids (which prove to generalize to infinite dimensions) and on a chaining argument, our uniform concentration bounds involve an effective dimension instead of the global dimension, improving over existing results. We show the importance of taking advantage of non-isotropic properties in learning problems with the following applications: (i) we improve state-of-the-art results in statistical preconditioning for communication-efficient distributed optimization, and (ii) we introduce a non-isotropic randomized smoothing for non-smooth optimization. Both applications cover a class of functions that encompasses empirical risk minimization (ERM) for linear models.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/even21a.html
Adaptivity in Adaptive Submodularity
Adaptive sequential decision making is one of the central challenges in machine learning and artificial intelligence. In such problems, the goal is to design an interactive policy that plans for an action to take, from a finite set of $n$ actions, given some partial observations. It has been shown that in many applications such as active learning, robotics, sequential experimental design, and active detection, the utility function satisfies adaptive submodularity, a notion that generalizes the notion of diminishing returns to policies. In this paper, we revisit the power of adaptivity in maximizing an adaptive monotone submodular function. We propose an efficient semi-adaptive policy that with $O(\log n \times \log k)$ adaptive rounds of observations can achieve an almost tight $1-1/e-\epsilon$ approximation guarantee with respect to an optimal policy that carries out $k$ actions in a fully sequential manner. To complement our results, we also show that it is impossible to achieve a constant factor approximation with $o(\log n)$ adaptive rounds. We also extend our result to the case of adaptive stochastic minimum cost coverage where the goal is to reach a desired utility $Q$ with the cheapest policy. We first prove the long-standing conjecture by Golovin and Krause and show that the greedy policy achieves the asymptotically tight logarithmic approximation guarantee. We then propose a semi-adaptive policy that provides the same guarantee in polylogarithmic adaptive rounds through a similar information-parallelism scheme. Our results shrink the adaptivity gap in adaptive submodular maximization by an exponential factor.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/esfandiari21a.html
On the Convergence of Langevin Monte Carlo: The Interplay between Tail Growth and Smoothness
We study sampling from a target distribution $\nu_* = e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm. For any potential function $f$ whose tails behave like $\|x\|^\alpha$ for $\alpha \in [1,2]$, and whose gradient is $\beta$-Hölder continuous, we prove that $\widetilde{\mathcal{O}} \Big(d^{\frac{1}{\beta}+\frac{1+\beta}{\beta}(\frac{2}{\alpha}-\mathbf{1}_{\{\alpha \neq 1\}})} \epsilon^{-\frac{1}{\beta}}\Big)$ steps are sufficient to reach the $\epsilon$-neighborhood of a $d$-dimensional target distribution $\nu_*$ in KL-divergence. This bound, in terms of $\epsilon$ dependency, is not directly influenced by the tail growth rate $\alpha$ of the potential function as long as its growth is at least linear, and it only relies on the order of smoothness $\beta$. One notable consequence of this result is that for potentials with Lipschitz gradient, i.e. $\beta=1$, the above rate recovers the best known rate $\widetilde{\mathcal{O}} (d\epsilon^{-1})$, which was established for strongly convex potentials in terms of $\epsilon$ dependency, but we show that the same rate is achievable for a wider class of potentials that are degenerately convex at infinity. The growth rate $\alpha$ affects the rate estimate in high dimensions where $d$ is large; furthermore, the bound recovers the best-known dimension dependency when the tail growth of the potential is quadratic, i.e. $\alpha = 2$, in the current setup.
Wed, 21 Jul 2021 00:00:00 +0000
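The unadjusted LMC algorithm referenced here is the Euler-Maruyama discretization $x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim N(0, I)$. A minimal sketch on the Gaussian potential $f(x) = \|x\|^2/2$ (so $\alpha = 2$ and $\beta = 1$ in the abstract's notation; step size and run length are illustrative choices):

```python
import numpy as np

def lmc(grad_f, x0, eta, n_steps, rng):
    """Unadjusted Langevin Monte Carlo: Euler-Maruyama discretization of
    the Langevin diffusion targeting exp(-f)."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.normal(size=x.size)
        samples[k] = x
    return samples

rng = np.random.default_rng(0)
# Target: standard Gaussian, f(x) = ||x||^2 / 2, so grad f(x) = x.
samples = lmc(lambda x: x, x0=[3.0], eta=0.1, n_steps=100_000, rng=rng)
print(samples[1000:].mean(), samples[1000:].var())
```

For this linear-drift target the discretized chain is exactly an AR(1) process with stationary variance $2/(2-\eta) \approx 1.053$ for $\eta = 0.1$, slightly inflated relative to the target's variance of 1: a concrete instance of the discretization bias that the non-asymptotic analysis above controls.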
https://proceedings.mlr.press/v134/erdogdu21a.html
Non-asymptotic approximations of neural networks by Gaussian processes
We study the extent to which wide neural networks may be approximated by Gaussian processes, when initialized with random weights. It is a well-established fact that as the width of a network goes to infinity, its law converges to that of a Gaussian process. We make this quantitative by establishing explicit convergence rates for the central limit theorem in an infinite-dimensional functional space, metrized with a natural transportation distance. We identify two regimes of interest: when the activation function is polynomial, its degree determines the rate of convergence, while for non-polynomial activations, the rate is governed by the smoothness of the function.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/eldan21a.html
Kernel Thinning
We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac{1}{4}})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/dwivedi21a.html
On the Stability of Random Matrix Product with Markovian Noise: Application to Linear Stochastic Approximation and TD Learning
This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state-space Markov chain. Such products are a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online learning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrix-valued functions and uniform ergodicity of the Markov chains. Our main contribution is an exponential stability result for the $p$-th moment of random matrix products, provided that (i) the underlying Markov chain satisfies a super-Lyapunov drift condition, and (ii) the growth of the matrix-valued functions is controlled by an appropriately defined function (related to the drift condition). Using this result, we give finite-time $p$-th moment bounds for constant and decreasing stepsize linear stochastic approximation schemes with Markovian noise on general state space. We illustrate these findings for linear value-function estimation in reinforcement learning, providing finite-time $p$-th moment bounds for various members of the temporal difference (TD) family of algorithms.
Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/durmus21a.html
Random Coordinate Langevin Monte Carlo
Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires the computation of the full gradient at each iteration, an expensive operation if the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected to be updated by a multiple of the partial derivative along this direction plus noise, while all other coordinates remain untouched. We investigate the total complexity of RC-LMC and compare it with the classical LMC for log-concave probability distributions. We show that, when the gradient of the log-density is Lipschitz, RC-LMC is less expensive than the classical LMC for high-dimensional problems in which the log-density is highly skewed. Further, when both the gradient and the Hessian of the log-density are Lipschitz, RC-LMC is always cheaper than the classical LMC, by a factor proportional to the square root of the problem dimension. In the latter case, we use an example to demonstrate that our estimate of complexity is sharp with respect to the dimension.
Wed, 21 Jul 2021 00:00:00 +0000
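The per-iteration rule described above is easy to sketch: pick a coordinate uniformly at random, apply the Langevin update along that coordinate only, and leave the rest untouched. The target, step size, and run length below are illustrative choices.

```python
import numpy as np

def rc_lmc(grad_f, x0, eta, n_steps, rng):
    """Random Coordinate LMC: each iteration updates one randomly chosen
    coordinate using its partial derivative plus injected Gaussian noise."""
    x = np.array(x0, dtype=float)
    d = x.size
    samples = np.empty((n_steps, d))
    for k in range(n_steps):
        i = rng.integers(d)                # randomly selected coordinate
        g = grad_f(x)[i]                   # partial derivative along i
        x[i] = x[i] - eta * g + np.sqrt(2 * eta) * rng.normal()
        samples[k] = x
    return samples

rng = np.random.default_rng(0)
# Target: standard Gaussian in d = 3, so grad f(x) = x.
samples = rc_lmc(lambda x: x, x0=[2.0, -2.0, 2.0], eta=0.1,
                 n_steps=150_000, rng=rng)
print(samples[10_000:].mean(axis=0), samples[10_000:].var(axis=0))
```

Note the sketch calls the full gradient and then indexes it, purely for brevity; the whole point of RC-LMC is that in practice only the $i$-th partial derivative is evaluated, which is where the per-iteration savings over classical LMC come from.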
https://proceedings.mlr.press/v134/ding21a.html
https://proceedings.mlr.press/v134/ding21a.htmlOutlier-Robust Learning of Ising Models Under Dobrushin’s ConditionWe study the problem of learning Ising models satisfying Dobrushin’s condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-2 polynomials of Ising models that may be of independent interest.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/diakonikolas21e.html
https://proceedings.mlr.press/v134/diakonikolas21e.htmlBoosting in the Presence of Massart NoiseWe study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $\eta(x) \leq \eta$, where $\eta<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $\eta$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/diakonikolas21d.html
https://proceedings.mlr.press/v134/diakonikolas21d.htmlThe Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals in the SQ ModelWe study the problem of agnostic learning under the Gaussian distribution in the Statistical Query (SQ) model. We develop a method for finding hard families of examples for a wide range of concept classes by using LP duality. For Boolean-valued concept classes, we show that the $L^1$-polynomial regression algorithm is essentially best possible among SQ algorithms, and therefore that the SQ complexity of agnostic learning is closely related to the polynomial degree required to approximate any function from the concept class in $L^1$-norm. Using this characterization along with additional analytic tools, we obtain explicit optimal SQ lower bounds for agnostically learning linear threshold functions and the first non-trivial explicit SQ lower bounds for polynomial threshold functions and intersections of halfspaces. We also develop an analogous theory for agnostically learning real-valued functions, and as an application prove near-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/diakonikolas21c.html
https://proceedings.mlr.press/v134/diakonikolas21c.htmlAgnostic Proper Learning of Halfspaces under Gaussian MarginalsWe study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose running time qualitatively matches that of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other activation functions, yielding the first proper agnostic algorithm for ReLU regression.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/diakonikolas21b.html
https://proceedings.mlr.press/v134/diakonikolas21b.htmlThe Sample Complexity of Robust Covariance TestingWe study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber’s contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, \Sigma)$, $B$ is a fixed but unknown noise distribution, and $\epsilon>0$ is an arbitrarily small constant representing the proportion of contamination. We want to distinguish between the cases that $\Sigma$ is the identity matrix versus $\gamma$-far from the identity in Frobenius norm. In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. Moreover, this sample upper bound was shown to be best possible, within constant factors. Our main result is that the sample complexity of covariance testing dramatically increases in the contaminated setting. In particular, we prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma = 1/2$. This lower bound is best possible, as $O(d^2)$ samples suffice to even robustly {\em learn} the covariance. The conceptual implication of our result is that, for the natural setting we consider, robust hypothesis testing is at least as hard as robust estimation.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/diakonikolas21a.html
https://proceedings.mlr.press/v134/diakonikolas21a.htmlSparse sketches with small inversion biasFor a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, e.g., in statistics and distributed optimization, when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, then after simple rescaling, the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error obtained as a consequence of the subspace embedding guarantee for sub-gaussian sketches. We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which uses ideas from both data-oblivious sparse embeddings as well as data-aware leverage-based row sampling methods, to get $\epsilon$ inversion bias for sketch size $m=O(d\log d+\sqrt d/\epsilon)$ in time $O(\text{nnz}(A)\log n+md^2)$, where nnz is the number of non-zeros. 
The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality; and anti-concentration of the Binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.Wed, 21 Jul 2021 00:00:00 +0000
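The rescaled estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ can be written down directly; the dense Gaussian sketch below is one i.i.d. sub-gaussian choice, and the problem sizes and averaging count are illustrative (this is a minimal sketch, not the paper's LESS embedding).

```python
import numpy as np

def debiased_sketched_inverse(A, m, rng):
    """Sketch A with a dense Gaussian matrix S (one i.i.d. sub-gaussian
    choice) and return the rescaled estimator (m/(m-d) * (SA)^T (SA))^{-1},
    whose inversion bias is small per the result above."""
    n, d = A.shape
    S = rng.normal(size=(m, n)) / np.sqrt(m)
    As = S @ A
    return np.linalg.inv((m / (m - d)) * (As.T @ As))

# Averaging independent estimates approximates (A^T A)^{-1}, which is the
# distributed-averaging use case mentioned in the abstract.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 5))
target = np.linalg.inv(A.T @ A)
est = np.mean([debiased_sketched_inverse(A, 200, rng) for _ in range(50)],
              axis=0)
```

Without the $m/(m-d)$ rescaling, the average of such estimates would converge to a biased limit rather than to $(A^\top A)^{-1}$.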
https://proceedings.mlr.press/v134/derezinski21a.html
https://proceedings.mlr.press/v134/derezinski21a.htmlLearning sparse mixtures of permutations from noisy informationWe study the problem of learning an unknown mixture of k permutations over n elements, given access to noisy samples drawn from the unknown mixture. We consider a range of different noise models, including natural variants of the “heat kernel” noise framework and the Mallows model. We give an algorithm which, for each of these noise models, learns the unknown mixture to high accuracy under mild assumptions and runs in $n^{O(log k)}$ time. Our approach is based on a new procedure that recovers an unknown mixture of permutations from noisy higher-order marginals.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/de21b.html
https://proceedings.mlr.press/v134/de21b.htmlWeak learning convex sets under normal distributionsThis paper addresses the following natural question: can efficient algorithms weakly learn convex sets under normal distributions? Strong learnability of convex sets under normal distributions is well understood, with near-matching upper and lower bounds given by Klivans et al (2008), but prior to the current work nothing seems to have been known about weak learning. We essentially answer this question, giving near-matching algorithms and lower bounds. For our positive result, we give a poly(n)-time algorithm that can weakly learn the class of convex sets to advantage $\Omega(1/\sqrt{n})$ using only random examples drawn from the background Gaussian distribution. Our algorithm and analysis are based on a new “density increment” result for convex sets, which we prove using tools from isoperimetry. We also give an information-theoretic lower bound showing that $O(\log(n)/\sqrt{n})$ advantage is best possible even for algorithms that are allowed to make poly(n) many membership queries.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/de21a.html
https://proceedings.mlr.press/v134/de21a.htmlA Statistical Taylor Theorem and Extrapolation of Truncated DensitiesWe show a statistical version of Taylor’s theorem and apply this result to non-parametric density estimation from truncated samples, which is a classical challenge in Statistics [Woodroofe 1985, Stute 1993]. The single-dimensional version of our theorem has the following implication: "For any distribution P on [0, 1] with a smooth log-density function, given samples from the conditional distribution of P on [a, a + \varepsilon] \subset [0, 1], we can efficiently identify an approximation to P over the whole interval [0, 1], with quality of approximation that improves with the smoothness of P". To the best of our knowledge, our result is the first in the area of non-parametric density estimation from truncated samples that works under the hard truncation model, where the samples outside some survival set S are never observed, and applies to multiple dimensions. In contrast, previous works assume single-dimensional data where each sample has a different survival set $S$, so that samples from the whole support will ultimately be collected.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/daskalakis21a.html
https://proceedings.mlr.press/v134/daskalakis21a.htmlFrom Local Pseudorandom Generators to Hardness of LearningWe prove hardness-of-learning results under a well-studied assumption on the existence of local pseudorandom generators. As we show, this assumption allows us to surpass the current state of the art and prove hardness of various basic problems for which no hardness results were previously known. Our results include: hardness of learning shallow ReLU neural networks under the Gaussian distribution and other distributions; hardness of learning intersections of $\omega(1)$ halfspaces, DNF formulas with $\omega(1)$ terms, and ReLU networks with $\omega(1)$ hidden neurons; hardness of weakly learning deterministic finite automata under the uniform distribution; hardness of weakly learning depth-$3$ Boolean circuits under the uniform distribution; as well as distribution-specific hardness results for learning DNF formulas and intersections of halfspaces. We also establish lower bounds on the complexity of learning intersections of a constant number of halfspaces, and ReLU networks with a constant number of hidden neurons. Moreover, our results imply the hardness of virtually all improper PAC-learning problems (both distribution-free and distribution-specific) that were previously shown hard under other assumptions.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/daniely21a.html
https://proceedings.mlr.press/v134/daniely21a.htmlQuantifying Variational Approximation for Log-Partition FunctionVariational methods, such as mean-field (MF) and tree-reweighted (TRW), provide computationally efficient approximations of the log-partition function for generic graphical models but their approximation ratio is generally not quantified. As the primary contribution of this work, we provide an approach to quantify their approximation ratio for any discrete pairwise graphical model with non-negative potentials through a property of the underlying graph structure $G$. Specifically, we argue that (a variant of) TRW produces an estimate within factor $1/\sqrt{\kappa(G)}$ where $\kappa(G) \in (0,1]$ captures how far $G$ is from tree structure. As a consequence, the approximation ratio is $1$ for trees, $\sqrt{(d+1)/2}$ for graphs with maximum average degree $d$ and $1+1/(2\beta)+o_{\beta\to \infty}(1/\beta)$ for graphs with girth at least $\beta \log N$. The quantity $\kappa(G)$ is the solution of a min-max problem associated with the spanning tree polytope of $G$ that can be evaluated in polynomial time for any graph. We provide a near linear-time variant that achieves an approximation ratio depending on the minimal (across edges) effective resistance of the graph. We connect our results to the graph partition approximation method and thus provide a unified perspective.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/cosson21a.html
https://proceedings.mlr.press/v134/cosson21a.htmlOnline Markov Decision Processes with Aggregate Bandit FeedbackWe study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics. In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback: the trajectory is revealed along with the cumulative loss suffered, rather than the individual losses encountered along the trajectory. Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where K is the number of episodes. We establish this result via an efficient reduction to a novel bandit learning setting we call Distorted Linear Bandits (DLB), which is a variant of bandit linear optimization where actions chosen by the learner are adversarially distorted before they are committed. We then develop a computationally-efficient online algorithm for DLB for which we prove an $O(\sqrt{T})$ regret bound, where T is the number of time steps. Our algorithm is based on online mirror descent with a self-concordant barrier regularization that employs a novel increasing learning rate schedule.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/cohen21a.html
https://proceedings.mlr.press/v134/cohen21a.htmlOptimal dimension dependence of the Metropolis-Adjusted Langevin AlgorithmConventional wisdom in the sampling literature, backed by a popular diffusion scaling limit, suggests that the mixing time of the Metropolis-Adjusted Langevin Algorithm (MALA) scales as O(d^{1/3}), where d is the dimension. However, the diffusion scaling limit requires stringent assumptions on the target distribution and is asymptotic in nature. In contrast, the best known non-asymptotic mixing time bound for MALA on the class of log-smooth and strongly log-concave distributions is O(d). In this work, we establish that the mixing time of MALA on this class of target distributions is \tilde\Theta(d^{1/2}) under a warm start. Our upper bound proof introduces a new technique based on a projection characterization of the Metropolis adjustment which reduces the study of MALA to the well-studied discretization analysis of the Langevin SDE and bypasses direct computation of the acceptance probability.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chewi21a.html
https://proceedings.mlr.press/v134/chewi21a.htmlImpossible Tuning Made Possible: A New Expert Algorithm and Its ApplicationsWe resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t,i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t,i}$ is the loss for expert $i$ in round $t$. Our algorithm is based on the Mirror Descent framework with a correction term and a weighted entropy regularizer. While natural, the algorithm has not been studied before and requires a careful analysis. We also generalize the bound to $O\left(\sqrt{(\ln d)\sum_t (\ell_{t,i}-m_{t,i})^2}\right)$ for any prediction vector $m_t$ that the learner receives, and recover or improve many existing results by choosing different $m_t$. Furthermore, we use the same framework to create a master algorithm that combines a set of base algorithms and learns the best one with little overhead. The new guarantee of our master algorithm allows us to derive many new results for both the expert problem and, more generally, Online Linear Optimization.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chen21f.html
https://proceedings.mlr.press/v134/chen21f.htmlMinimax Regret for Stochastic Shortest Path with Adversarial Costs and Known TransitionWe study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $O(\sqrt{DT_\star K})$ and $O(\sqrt{DT_\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T_\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the recent work of (Rosenberg and Mansour, 2020) which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chen21e.html
https://proceedings.mlr.press/v134/chen21e.htmlQuery complexity of least absolute deviation regression via robust uniform convergenceConsider a regression problem where the learner is given a large collection of $d$-dimensional data points, but can only query a small subset of the real-valued labels. How many queries are needed to obtain a $1+\epsilon$ relative error approximation of the optimum? While this problem has been extensively studied for least squares regression, little is known for other losses. An important example is least absolute deviation regression ($\ell_1$ regression) which enjoys superior robustness to outliers compared to least squares. We develop a new framework for analyzing importance sampling methods in regression problems, which enables us to show that the query complexity of least absolute deviation regression is $\Theta(d/\epsilon^2)$ up to logarithmic factors. We further extend our techniques to show the first bounds on the query complexity for any $\ell_p$ loss with $p\in(1,2)$. As a key novelty in our analysis, we introduce the notion of robust uniform convergence, which is a new approximation guarantee for the empirical loss. While it is inspired by uniform convergence in statistical learning, our approach additionally incorporates a correction term to avoid unnecessary variance due to outliers. This can be viewed as a new connection between statistical learning theory and variance reduction techniques in stochastic optimization, which should be of independent interest.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chen21d.html
https://proceedings.mlr.press/v134/chen21d.htmlBlack-Box Control for Linear Dynamical SystemsWe consider the problem of black-box control: the task of controlling an unknown linear time-invariant dynamical system from a single trajectory without a stabilizing controller. Under the assumption that the system is controllable, we give the first {\it efficient} algorithm that is capable of attaining sublinear regret under the setting of online nonstochastic control. This resolves an open problem since the work of Abbasi-Yadkori and Szepesvari (2011) on the stochastic LQR problem, and in a more challenging setting that allows for adversarial perturbations and adversarially chosen changing convex loss functions. We give finite-time regret bounds for our algorithm on the order of $2^{poly(d)} + \tilde{O}(poly(d) T^{2/3})$ for general nonstochastic control, and $2^{poly(d)} + \tilde{O}(poly(d) \sqrt{T})$ for black-box LQR. To complete the picture, we investigate the complexity of the online black-box control problem and give a matching regret lower bound of $2^{\Omega(d)}$, showing that the exponential cost is inevitable. This lower bound holds even in the noiseless setting, and applies to any, randomized or deterministic, black-box control method.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chen21c.html
https://proceedings.mlr.press/v134/chen21c.htmlLearning and testing junta distributions with subcube conditioningWe study the problems of learning and testing junta distributions on $\{-1,1\}^n$ with respect to the uniform distribution, where a distribution $p$ is a $k$-junta if its probability mass function $p(x)$ depends on a subset of at most $k$ variables. The main contribution is an algorithm for finding relevant coordinates in a $k$-junta distribution with subcube conditioning (Bhattacharyya et al. 2018, Canonne et al. 2019). We give two applications: an algorithm for learning $k$-junta distributions with $\tilde{O}(k/\epsilon^2) \log n + O(2^k/\epsilon^2)$ subcube conditioning queries, and an algorithm for testing $k$-junta distributions with $\tilde{O}((k + \sqrt{n})/\epsilon^2)$ subcube conditioning queries. All our algorithms are optimal up to poly-logarithmic factors.
Our results show that subcube conditioning, as a natural model for accessing high-dimensional distributions, enables significant savings in learning and testing junta distributions compared to the standard sampling model. This addresses an open question posed by Aliakbarpour et al. 2016.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chen21b.html
https://proceedings.mlr.press/v134/chen21b.htmlBreaking The Dimension Dependence in Sparse Distribution Estimation under Communication ConstraintsWe consider the problem of estimating a $d$-dimensional $s$-sparse discrete distribution from its samples observed under a $b$-bit communication constraint. The best-known previous result on $\ell_2$ estimation error for this problem is $O\left( \frac{s\log\left( {d}/{s}\right)}{n2^b}\right)$. Surprisingly, we show that when sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achieve an $\ell_2$ estimation error of $O\left( \frac{s}{n2^b}\right)$. This implies that when $n>n^*(s, d, b)$ the convergence rate does not depend on the ambient dimension $d$ and is the same as knowing the support of the distribution beforehand.
We next ask the question: ``what is the minimum $n^*(s, d, b)$ that allows dimension-free convergence?''. To upper bound $n^*(s, d, b)$, we develop novel localization schemes to accurately and efficiently localize the unknown support. For the non-interactive setting, we show that $n^*(s, d, b) = O\left( \min \left( {d^2\log^2 d}/{2^b}, {s^4\log^2 d}/{2^b}\right) \right)$. Moreover, we connect the problem with non-adaptive group testing and obtain a polynomial-time estimation scheme when $n = \tilde{\Omega}\left({s^4\log^4 d}/{2^b}\right)$. This group testing based scheme is adaptive to the sparsity parameter $s$, and hence can be applied without knowing it. For the interactive setting, we propose a novel tree-based estimation scheme and show that the minimum sample-size needed to achieve dimension-free convergence can be further reduced to $n^*(s, d, b) = \tilde{O}\left( {s^2\log^2 d}/{2^b} \right)$.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chen21a.html
https://proceedings.mlr.press/v134/chen21a.htmlWhen does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/chatterji21a.html
https://proceedings.mlr.press/v134/chatterji21a.htmlOptimizing Optimizers: Regret-optimal gradient descent algorithmsThis paper treats the task of designing optimization algorithms as an optimal control problem. Using regret as a metric for an algorithm’s performance, we study the existence, uniqueness and consistency of regret-optimal algorithms. By providing first-order optimality conditions for the control problem, we show that regret-optimal algorithms must satisfy a specific structure in their dynamics which we show is equivalent to performing \emph{dual-preconditioned gradient descent} on the value function generated by its regret. Using these optimal dynamics, we provide bounds on their rates of convergence to solutions of convex optimization problems. Though closed-form optimal dynamics cannot be obtained in general, we present fast numerical methods for approximating them, generating optimization algorithms which directly optimize their long-term regret. These are benchmarked against commonly used optimization algorithms to demonstrate their effectiveness.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/casgrain21a.html
https://proceedings.mlr.press/v134/casgrain21a.htmlThinking Inside the Ball: Near-Optimal Minimization of the Maximal LossWe characterize the complexity of minimizing $\max_{i\in[N]} f_i(x)$ for convex, Lipschitz functions $f_1,\ldots, f_N$. For non-smooth functions, existing methods require $O(N\epsilon^{-2})$ queries to a first-order oracle to compute an $\epsilon$-suboptimal point and $\widetilde{O}(N\epsilon^{-1})$ queries if the $f_i$ are $O(1/\epsilon)$-smooth. We develop methods with improved complexity bounds of $\widetilde{O}(N\epsilon^{-2/3} + \epsilon^{-8/3})$ in the non-smooth case and $\widetilde{O}(N\epsilon^{-2/3} + \sqrt{N}\epsilon^{-1})$ in the $O(1/\epsilon)$-smooth case. Our methods consist of a recently proposed ball optimization oracle acceleration algorithm (which we refine) and a careful implementation of said oracle for the softmax function. We also prove an oracle complexity lower bound scaling as $\Omega(N\epsilon^{-2/3})$, showing that our dependence on $N$ is optimal up to polylogarithmic factors.Wed, 21 Jul 2021 00:00:00 +0000
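The softmax (log-sum-exp) smoothing of the maximal loss that the method builds on can be sketched as follows; the function name and the numerically stable shift are mine, not the paper's notation.

```python
import numpy as np

def softmax_max(fvals, eps):
    """Numerically stable log-sum-exp smoothing of max_i f_i: a smooth
    surrogate lying within eps * log(N) of the true maximum."""
    m = np.max(fvals)
    return m + eps * np.log(np.sum(np.exp((fvals - m) / eps)))

# Example: smoothing the max of three function values.
f = np.array([1.0, 2.0, 3.0])
smoothed = softmax_max(f, eps=0.1)   # close to max(f) = 3 from above
```

Minimizing this smooth surrogate to sufficient accuracy yields an approximate minimizer of the original max, which is what makes the ball optimization oracle applicable.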
https://proceedings.mlr.press/v134/carmon21a.html
https://proceedings.mlr.press/v134/carmon21a.htmlFast Rates for Structured PredictionDiscrete supervised learning problems such as classification are often tackled by introducing a continuous surrogate problem akin to regression. Bounding the original error, between estimate and solution, by the surrogate error endows discrete problems with convergence rates already shown for continuous instances. Yet, current approaches do not leverage the fact that discrete problems are essentially predicting a discrete output, whereas continuous problems are predicting a continuous value. In this paper, we tackle this issue for general structured prediction problems, opening the way to “super fast” rates, that is, convergence rates for the excess risk faster than $n^{-1}$, where $n$ is the number of observations, with even exponential rates under the strongest assumptions. We first illustrate this for predictors based on nearest neighbors, generalizing rates known for binary classification to any discrete problem within the framework of structured prediction. We then consider kernel ridge regression, where we improve known rates of $n^{-1/4}$ to arbitrarily fast rates, depending on a parameter characterizing the hardness of the problem, thus allowing, under smoothness assumptions, to bypass the curse of dimensionality.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/cabannes21a.html
https://proceedings.mlr.press/v134/cabannes21a.htmlCooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor CollisionsWe consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no collisions at all) are achievable for any number of players and arms. At a high level, the previous strategy heavily relied on a 2-dimensional geometric intuition that was difficult to generalize in higher dimensions, while here we take a more combinatorial route to build the new strategy.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/bubeck21b.html
https://proceedings.mlr.press/v134/bubeck21b.htmlA Law of Robustness for Two-Layers Neural NetworksWe initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with $k$ neurons that perfectly fit the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$ where $n$ is the number of datapoints. In particular, this conjecture implies that overparametrization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure a $O(1)$-Lipschitz network, while mere data fitting of $d$-dimensional data requires only one neuron per $d$ datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime $n \approx d$ (which we also refer to as the undercomplete case, since only $k \leq d$ is relevant here). Finally we prove the conjecture for polynomial activation functions of degree $p$ when $n \approx d^p$. We complement these findings with experimental evidence supporting the conjecture.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/bubeck21a.html
https://proceedings.mlr.press/v134/bubeck21a.htmlExact Recovery of Clusters in Finite Metric Spaces Using Oracle QueriesWe investigate the problem of exact cluster recovery using oracle queries. Previous results show that clusters in Euclidean spaces that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points. In this work, we study this problem in the more challenging non-convex setting. We introduce a structural characterization of clusters, called $(\beta,\gamma)$-convexity, that can be applied to any finite set of points equipped with a metric (or even a semimetric, as the triangle inequality is not needed). Using $(\beta,\gamma)$-convexity, we can translate natural density properties of clusters (which include, for instance, clusters that are strongly non-convex in $R^d$) into a graph-theoretic notion of convexity. By exploiting this convexity notion, we design a deterministic algorithm that recovers $(\beta,\gamma)$-convex clusters using $O(k^2 \log n + k^2 (\frac{6}{\beta\gamma})^{dens(X)})$ same-cluster queries, where $k$ is the number of clusters and $dens(X)$ is the density dimension of the semimetric. We show that an exponential dependence on the density dimension is necessary, and we also show that, if we are allowed to make $O(k^2 + k \log n)$ additional queries to a "cluster separation" oracle, then we can recover clusters that have different and arbitrary scales, even when the scale of each cluster is unknown.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/bressan21a.html
https://proceedings.mlr.press/v134/bressan21a.htmlStatistical Query Algorithms and Low Degree Tests Are Almost EquivalentResearchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/brennan21a.html
https://proceedings.mlr.press/v134/brennan21a.htmlNear-Optimal Entrywise Sampling of Numerically Sparse MatricesMany real-world data sets are sparse or almost sparse. One method to measure this for a matrix $A\in \mathbb{R}^{n\times n}$ is the \emph{numerical sparsity}, denoted $\mathsf{ns}(A)$, defined as the minimum $k\geq 1$ such that $\|a\|_1/\|a\|_2 \leq \sqrt{k}$ for every row and every column $a$ of $A$. This measure of $a$ is smooth and is clearly at most the number of non-zeros in the row/column $a$. The seminal work of Achlioptas and McSherry (2007) put forward the question of approximating an input matrix $A$ by entrywise sampling. More precisely, the goal is to quickly compute a sparse matrix $\tilde{A}$ satisfying $\|A - \tilde{A}\|_2 \leq \epsilon \|A\|_2$ (i.e., additive spectral approximation) given an error parameter $\epsilon>0$. The known schemes sample and rescale a small fraction of entries from $A$. We propose a scheme that sparsifies an almost-sparse matrix $A$ — it produces a matrix $\tilde{A}$ with $O(\epsilon^{-2}\mathsf{ns}(A) \cdot n\ln n)$ non-zero entries with high probability. We also prove that this upper bound on $\mathsf{nnz}(\tilde{A})$ is \emph{tight} up to logarithmic factors. Moreover, our upper bound improves when the spectrum of $A$ decays quickly (roughly replacing $n$ with the stable rank of $A$). Our scheme can be implemented in time $O(\mathsf{nnz}(A))$ when $\|A\|_2$ is given. Previously, a similar upper bound was obtained by Achlioptas et al. (2013), but only for a restricted class of inputs that does not even include symmetric or covariance matrices. Finally, we demonstrate two applications of these sampling techniques: to faster approximate matrix multiplication, and to ridge regression using sparse preconditioners.Wed, 21 Jul 2021 00:00:00 +0000
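The numerical sparsity defined above can be computed directly from its definition; a minimal sketch (the function name and the roundoff guard are our illustrative choices):

```python
import numpy as np

def numerical_sparsity(A):
    # ns(A): the minimum k >= 1 such that ||a||_1 / ||a||_2 <= sqrt(k)
    # for every nonzero row and every nonzero column a of A.
    ratios = [
        (np.linalg.norm(v, 1) / np.linalg.norm(v)) ** 2
        for v in list(A) + list(A.T)
        if np.linalg.norm(v) > 0
    ]
    # Small slack guards against float roundoff in the squared ratio.
    return max(1, int(np.ceil(max(ratios) - 1e-9)))
```

For the identity matrix every row and column has a single nonzero, so $\mathsf{ns} = 1$; for the all-ones $n \times n$ matrix the squared ratio is $n$, matching the number of non-zeros per row.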
https://proceedings.mlr.press/v134/braverman21b.html
https://proceedings.mlr.press/v134/braverman21b.htmlNear Optimal Distributed Learning of Halfspaces with Two Parties<i>Distributed learning</i> protocols are designed to train on distributed data without gathering it all on a single centralized machine, thus contributing to the efficiency of the system and enhancing its privacy. We study a central problem in distributed learning, called {\it distributed learning of halfspaces}: let $U \subseteq \mathbb{R}^d$ be a known domain of size $n$ and let $h:\mathbb{R}^d\to \mathbb{R}$ be an unknown target affine function.\footnote{In practice, the domain $U$ is defined implicitly by the representation of $d$-dimensional vectors which is used in the protocol.} A set of <i>examples</i> $\{(u,b)\}$ is distributed between several parties, where~$u \in U$ is a point and $b = \mathsf{sign}(h(u)) \in \{\pm 1\}$ is its label. The parties' goal is to agree on a classifier~$f: U\to\{\pm 1\}$ such that~$f(u)=b$ for every input example~$(u,b)$.
We design a protocol for the distributed halfspace learning problem in the two-party setting, communicating only $\tilde O(d\log n)$ bits. To this end, we introduce a new tool called <i>halfspace containers</i>, which is closely related to <i>bracketing numbers</i> in statistics and to <i>hyperplane cuttings</i> in discrete geometry, and allows for a compressed approximate representation of every halfspace. We complement our upper bound result by an almost matching $\tilde \Omega(d\log n)$ lower bound on the communication complexity of any such protocol.
Since the distributed halfspace learning problem is closely related to the <i>convex set disjointness</i> problem in communication complexity and the problem of <i>distributed linear programming</i> in distributed optimization, we also derive upper and lower bounds of $\tilde O(d^2\log n)$ and~$\tilde{\Omega}(d\log n)$ on the communication complexity of both of these basic problems.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/braverman21a.html
https://proceedings.mlr.press/v134/braverman21a.htmlMultiplayer Bandit Learning, from Competition to CooperationThe stochastic multi-armed bandit model captures the tradeoff between exploration and exploitation. We study the effects of competition and cooperation on this tradeoff. Suppose there are two arms, one predictable and one risky, and two players, Alice and Bob. In every round, each player pulls an arm, receives the resulting reward, and observes the choice of the other player but not their reward. Alice’s utility is $\Gamma_A + \lambda \Gamma_B$ (and similarly for Bob), where $\Gamma_A$ is Alice’s total reward and $\lambda \in [-1, 1]$ is a cooperation parameter. At $\lambda = -1$ the players are competing in a zero-sum game, at $\lambda = 1$, their interests are aligned, and at $\lambda = 0$, they are neutral: each player’s utility is their own reward. The model is related to the economics literature on strategic experimentation, where usually players observe each other’s rewards. Suppose the predictable arm has success probability $p$ and the risky arm has prior $\mu$. If the discount factor is $\beta$, then the value of $p$ where a single player is indifferent between the arms is the Gittins index $g = g(\mu,\beta) > m$, where $m$ is the mean of the risky arm. Our first result answers, in this setting, a fundamental question posed by Rothschild \cite{rotschild}. We show that competing and neutral players eventually settle on the same arm (even though it may not be the best arm) in every Nash equilibrium, while this can fail for players with aligned interests. Moreover, we show that \emph{competing players} explore \emph{less} than a single player: there is $p^* \in (m, g)$ so that for all $p > p^*$, the players stay at the predictable arm. However, the players are not myopic: they still explore for some $p > m$. On the other hand, \emph{cooperating players} (with $\lambda =1$) explore \emph{more} than a single player. 
We also show that \emph{neutral players} learn from each other, receiving strictly higher total rewards than they would playing alone, for all $ p\in (p^*, g)$, where $p^*$ is the threshold above which competing players do not explore.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/branzei21a.html
https://proceedings.mlr.press/v134/branzei21a.htmlRank-one matrix estimation: analytic time evolution of gradient descent dynamicsWe consider a rank-one symmetric matrix corrupted by additive noise. The rank-one matrix is formed by an n-component unknown vector on the sphere of radius $\sqrt{n}$, and we consider the problem of estimating this vector from the corrupted matrix in the high dimensional limit of $n$ large, by gradient descent for a quadratic cost function on the sphere. Explicit formulas for the whole time evolution of the overlap between the estimator and unknown vector, as well as the cost, are rigorously derived. In the long time limit we recover the well known spectral phase transition, as a function of the signal-to-noise ratio. The explicit formulas also allow to point out interesting transient features of the time evolution. Our analysis technique is based on recent progress in random matrix theory and uses local versions of the semi-circle law.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/bodin21a.html
https://proceedings.mlr.press/v134/bodin21a.htmlRobust learning under clean-label attackWe study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) \emph{correctly-labeled} examples to the training set to fool the algorithm into making mistakes on \emph{specific} test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on $\epsilon$ in its PAC sample complexity, i.e., $O(1/\epsilon)$. On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to add at most $t=O(1/\epsilon)$ poison examples, the optimal robust learning sample complexity grows linearly with $t$.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/blum21a.html
https://proceedings.mlr.press/v134/blum21a.htmlMajorizing Measures, Sequential Complexities, and Online LearningWe introduce the technique of generic chaining and majorizing measures for controlling sequential Rademacher complexity. We relate majorizing measures to the notion of fractional covering numbers, which we show to be dominated in terms of sequential scale-sensitive dimensions in a horizon-independent way, and, under additional complexity assumptions, establish a tight control on worst-case sequential Rademacher complexity in terms of the integral of sequential scale-sensitive dimension. Finally, we establish a tight contraction inequality for worst-case sequential Rademacher complexity. The above constitutes the resolution of a number of outstanding open problems in extending the classical theory of empirical processes to the sequential case, and, in turn, establishes sharp results for online learning.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/block21a.html
https://proceedings.mlr.press/v134/block21a.htmlOnline Learning from Optimal ActionsWe study the problem of online contextual optimization where, at each period, instead of observing the loss, we observe, after-the-fact, the optimal action an oracle with full knowledge of the objective function would have taken. At each period, the decision-maker has access to a new set of feasible actions to select from and to a new contextual function that affects that period’s loss function. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an all-knowing oracle. We obtain the first regret bound for this problem that is logarithmic in the time horizon. Our results are derived through the development and analysis of a novel algorithmic structure that leverages the underlying geometry of the problem.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/besbes21a.html
https://proceedings.mlr.press/v134/besbes21a.htmlDeterministic Finite-Memory Bias EstimationIn this paper we consider the problem of estimating a Bernoulli parameter using finite memory. Let $X_1,X_2,\ldots$ be a sequence of independent identically distributed Bernoulli random variables with expectation $\theta$, where $\theta \in [0,1]$. Consider a finite-memory deterministic machine with $S$ states that updates its state $M_n \in \{1,2,\ldots,S\}$ at each time according to the rule $M_n = f(M_{n-1},X_n)$, where $f$ is a deterministic time-invariant function. Assume that the machine outputs an estimate at each time point according to some fixed mapping from the state space to the unit interval. The quality of the estimation procedure is measured by the asymptotic risk, which is the long-term average of the instantaneous quadratic risk. The main contribution of this paper is an upper bound on the smallest worst-case asymptotic risk any such machine can attain. This bound coincides with a lower bound derived by Leighton and Rivest, implying that $\Theta(1/S)$ is the minimax asymptotic risk for deterministic $S$-state machines. In particular, our result disproves a longstanding $\Theta(\log S/S)$ conjecture for this quantity, also posed by Leighton and Rivest.Wed, 21 Jul 2021 00:00:00 +0000
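To make the machine model concrete, here is a toy $S$-state machine (a saturating counter): it illustrates the model only and is not the risk-optimal scheme the paper analyzes.

```python
def saturating_counter(xs, S):
    # Deterministic finite-memory machine M_n = f(M_{n-1}, X_n):
    # move up on a 1, down on a 0, clamped to {1, ..., S}; the state
    # is mapped to the unit interval via the estimate (M_n - 0.5) / S.
    M = max(1, S // 2)
    estimates = []
    for x in xs:
        M = min(S, M + 1) if x else max(1, M - 1)
        estimates.append((M - 0.5) / S)
    return estimates
```

Every estimate lies in the unit interval, and the machine uses only $\log_2 S$ bits of memory no matter how many samples it sees.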
https://proceedings.mlr.press/v134/berg21a.html
https://proceedings.mlr.press/v134/berg21a.htmlReconstructing weighted voting schemes from partial information about their power indicesA number of recent works [Goldberg 2006; O’Donnell and Servedio 2011; De, Diakonikolas, and Servedio 2017; De, Diakonikolas, Feldman, and Servedio 2014] have considered the problem of approximately reconstructing an unknown weighted voting scheme given information about various sorts of “power indices” that characterize the level of control that individual voters have over the final outcome. In the language of theoretical computer science, this is the problem of approximating an unknown linear threshold function (LTF) over $\{-1,1\}^n$ given some numerical measure (such as the function’s $n$ “Chow parameters,” a.k.a. its degree-1 Fourier coefficients, or the vector of its $n$ Shapley indices) of how much each of the $n$ individual input variables affects the outcome of the function. In this paper we consider the problem of reconstructing an LTF given only partial information about its Chow parameters or Shapley indices; i.e., we are given only the Chow parameters or the Shapley indices corresponding to a subset $S\subseteq [n]$ of the $n$ input variables. A natural goal in this partial information setting is to find an LTF whose Chow parameters or Shapley indices corresponding to indices in $S$ accurately match the given Chow parameters or Shapley indices of the unknown LTF. We refer to this as the Partial Inverse Power Index Problem. Our main results are a polynomial time algorithm for the ($\epsilon$-approximate) Chow Parameters Partial Inverse Power Index Problem and a quasi-polynomial time algorithm for the ($\epsilon$-approximate) Shapley Indices Partial Inverse Power Index Problem.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/bennett21a.html
https://proceedings.mlr.press/v134/bennett21a.htmlNon-Euclidean Differentially Private Stochastic Convex OptimizationDifferentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) w.r.t. the $\ell_2$ norm over a constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting. In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups. For $p=1$, under a standard smoothness assumption, we give a new algorithm with nearly optimal excess risk. This result also extends to general polyhedral norms and feasible sets. For $p\in(1, 2)$, we give two new algorithms, whose central building block is a novel privacy mechanism, which generalizes the Gaussian mechanism. Moreover, we establish a lower bound on the excess risk for this range of $p$, showing a necessary dependence on $\sqrt{d}$, where $d$ is the dimension of the space. Our lower bound implies a sudden transition of the excess risk at $p=1$, where the dependence on $d$ changes from logarithmic to polynomial, resolving an open question in prior work \citep{TTZ15a}. For $p\in (2, \infty)$, noisy SGD attains optimal excess risk in the low-dimensional regime; in particular, this proves the optimality of noisy SGD for $p=\infty$. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/bassily21a.html
https://proceedings.mlr.press/v134/bassily21a.htmlSpectral Planting and the Hardness of Refuting Cuts, Colorability, and Communities in Random GraphsWe study the problem of efficiently refuting the k-colorability of a graph, or equivalently, certifying a lower bound on its chromatic number. We give formal evidence of average-case computational hardness for this problem in sparse random regular graphs, suggesting that there is no polynomial-time algorithm that improves upon a classical spectral algorithm. Our evidence takes the form of a "computationally-quiet planting": we construct a distribution of d-regular graphs that has significantly smaller chromatic number than a typical regular graph drawn uniformly at random, while providing evidence that these two distributions are indistinguishable by a large class of algorithms. We generalize our results to the more general problem of certifying an upper bound on the maximum k-cut. This quiet planting is achieved by minimizing the effect of the planted structure (e.g. colorings or cuts) on the graph spectrum. Specifically, the planted structure corresponds exactly to eigenvectors of the adjacency matrix. This avoids the pushout effect of random matrix theory, and delays the point at which the planting becomes visible in the spectrum or local statistics. To illustrate this further, we give similar results for a Gaussian analogue of this problem: a quiet version of the spiked model, where we plant an eigenspace rather than adding a generic low-rank perturbation. Our evidence for computational hardness of distinguishing two distributions is based on three different heuristics: stability of belief propagation, the local statistics hierarchy, and the low-degree likelihood ratio. Of independent interest, our results include general-purpose bounds on the low-degree likelihood ratio for multi-spiked matrix models, and an improved low-degree analysis of the stochastic block model.Wed, 21 Jul 2021 00:00:00 +0000
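A classical spectral certificate of the kind the hardness evidence targets can be illustrated with Hoffman's eigenvalue bound on the chromatic number (the choice of this particular bound is our illustration, not the paper's exact algorithm):

```python
import numpy as np

def hoffman_bound(adj):
    # Hoffman's bound: chi(G) >= 1 + lambda_max / |lambda_min|, using the
    # extreme eigenvalues of the adjacency matrix of a graph with edges.
    lams = np.linalg.eigvalsh(adj)   # sorted ascending
    return 1.0 + lams[-1] / abs(lams[0])

# Complete graph K_4: adjacency eigenvalues {3, -1, -1, -1}.
K4 = np.ones((4, 4)) - np.eye(4)
```

On $K_4$ the bound is tight: $1 + 3/1 = 4 = \chi(K_4)$.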
https://proceedings.mlr.press/v134/bandeira21a.html
https://proceedings.mlr.press/v134/bandeira21a.htmlOptimal Dynamic Regret in Exp-Concave Online LearningWe consider the problem of Zinkevich (2003)-style dynamic regret minimization in online learning with \emph{exp-concave} losses. We show that whenever improper learning is allowed, a Strongly Adaptive online learner achieves the dynamic regret of $\tilde O^*(n^{1/3}C_n^{2/3} \vee 1)$ where $C_n$ is the \emph{total variation} (a.k.a. \emph{path length}) of an arbitrary sequence of comparators that may not be known to the learner ahead of time. Achieving this rate was highly nontrivial even for square losses in 1D, where the best known upper bound was $O(\sqrt{nC_n} \vee \log n)$ (Yuan and Lamperski, 2019). Our new proof techniques make elegant use of the intricate structures of the primal and dual variables imposed by the KKT conditions and could be of independent interest. Finally, we apply our results to the classical statistical problem of \emph{locally adaptive non-parametric regression} (Mammen, 1991; Donoho and Johnstone, 1998) and obtain a stronger and more flexible algorithm that does not require any statistical assumptions or any hyperparameter tuning.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/baby21a.html
https://proceedings.mlr.press/v134/baby21a.htmlThe Last-Iterate Convergence Rate of Optimistic Mirror Descent in Stochastic Variational InequalitiesIn this paper, we analyze the local convergence rate of optimistic mirror descent methods in stochastic variational inequalities, a class of optimization problems with important applications to learning theory and machine learning. Our analysis reveals an intricate relation between the algorithm’s rate of convergence and the local geometry induced by the method’s underlying Bregman function. We quantify this relation by means of the Legendre exponent, a notion that we introduce to measure the growth rate of the Bregman divergence relative to the ambient norm near a solution. We show that this exponent determines both the optimal step-size policy of the algorithm and the optimal rates attained, explaining in this way the differences observed for some popular Bregman functions (Euclidean projection, negative entropy, fractional power, etc.).Wed, 21 Jul 2021 00:00:00 +0000
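In the Euclidean case (Bregman function $\tfrac12\|\cdot\|_2^2$), optimistic mirror descent reduces to the optimistic gradient update below; the step size, operator, and horizon are illustrative assumptions.

```python
import numpy as np

def optimistic_gradient(F, z0, eta=0.1, steps=200):
    # Optimistic update z_{t+1} = z_t - eta * (2 F(z_t) - F(z_{t-1})),
    # i.e. a gradient step with an extrapolated ("optimistic") operator
    # value, using one evaluation of F per iteration.
    z = np.asarray(z0, dtype=float)
    g_prev = F(z)
    for _ in range(steps):
        g = F(z)
        z = z - eta * (2.0 * g - g_prev)
        g_prev = g
    return z
```

For the strongly monotone operator $F(z) = z$ the iterates converge linearly to the unique solution $z^* = 0$.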
https://proceedings.mlr.press/v134/azizian21a.html
https://proceedings.mlr.press/v134/azizian21a.htmlAdversarially Robust Low Dimensional RepresentationsMany machine learning systems are vulnerable to small perturbations made to inputs either at test time or at training time. This has received much recent interest on the empirical front due to applications where reliability and security are critical. However, theoretical understanding of algorithms that are robust to adversarial perturbations is limited. In this work we focus on Principal Component Analysis (PCA), a ubiquitous algorithmic primitive in machine learning. We formulate a natural robust variant of PCA where the goal is to find a low dimensional subspace to represent the given data with minimum projection error, that is in addition robust to small perturbations measured in $\ell_q$ norm (say $q=\infty$). Unlike PCA which is solvable in polynomial time, our formulation is computationally intractable to optimize as it captures a variant of the well-studied sparse PCA objective as a special case. We show the following results: 1. Polynomial time algorithm that is constant factor competitive in the worst-case with respect to the best subspace, in terms of the projection error and the robustness criterion. 2. We show that our algorithmic techniques can also be made robust to adversarial training-time perturbations, in addition to yielding representations that are robust to adversarial perturbations at test time. Specifically, we design algorithms for a strong notion of training-time perturbations, where every point is adversarially perturbed up to a specified amount. 3. We illustrate the broad applicability of our algorithmic techniques in addressing robustness to adversarial perturbations, both at training time and test time. 
In particular, our adversarially robust PCA primitive leads to computationally efficient and robust algorithms for both unsupervised and supervised learning problems such as clustering and learning adversarially robust classifiers.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/awasthi21a.html
https://proceedings.mlr.press/v134/awasthi21a.htmlFunctions with average smoothness: structure, algorithms, and learningWe initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds — assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. In addition to the usual average, we also examine a “weak” average that is more forgiving and yields a much wider function class. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the {\em empirical} covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.Wed, 21 Jul 2021 00:00:00 +0000
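The average-slope complexity measure has a direct empirical analogue on a finite sample; a hedged sketch (the discrete "local slope" below is our finite-sample stand-in for the paper's definition):

```python
import numpy as np

def average_local_slope(X, f):
    # Local slope of f at x_i over the sample: the largest difference
    # quotient |f_i - f_j| / ||x_i - x_j||.  The complexity measure is
    # the AVERAGE of these slopes rather than their maximum (the
    # empirical Lipschitz constant), and can be dramatically smaller.
    n = len(X)
    slopes = [
        max(abs(f[i] - f[j]) / np.linalg.norm(X[i] - X[j])
            for j in range(n) if j != i)
        for i in range(n)
    ]
    return float(np.mean(slopes))
```

For an affine function the average and the maximum coincide; a single steep region inflates the maximum far more than the average.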
https://proceedings.mlr.press/v134/ashlagi21a.html
https://proceedings.mlr.press/v134/ashlagi21a.htmlLearning in Matrix Games can be Arbitrarily ComplexMany multi-agent systems with strategic interactions have their desired functionality encoded as the Nash equilibrium of a game, e.g. machine learning architectures such as Generative Adversarial Networks. Directly computing a Nash equilibrium of these games is often impractical or impossible in practice, which has led to the development of numerous learning algorithms with the goal of iteratively converging on a Nash equilibrium. Unfortunately, the dynamics generated by the learning process can be very intricate and instances failing to converge become hard to interpret. In this paper we show that, in a strong sense, this dynamic complexity is inherent to games. Specifically, we prove that replicator dynamics, the continuous-time analogue of Multiplicative Weights Update, even when applied in a very restricted class of games–known as finite matrix games–is rich enough to be able to approximate arbitrary dynamical systems. In the context of machine learning, our results are positive in the sense that they show the nearly boundless dynamic modelling capabilities of current machine learning practices, but also negative in implying that these capabilities may come at the cost of interpretability. As a concrete example, we show how replicator dynamics can effectively reproduce the well-known strange attractor of Lorenz dynamics (the “butterfly effect”) while achieving no regret.Wed, 21 Jul 2021 00:00:00 +0000
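Replicator dynamics itself is a one-line vector field and is simple to simulate; an illustrative Euler discretization (the game, step size, and horizon are our assumptions):

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    # Replicator vector field dx_i/dt = x_i * ((A x)_i - x^T A x),
    # discretized with a single Euler step of size dt.
    payoff = A @ x
    return x + dt * x * (payoff - x @ payoff)

# Zero-sum Rock-Paper-Scissors, a standard finite matrix game.
A = np.array([[0.0, -1.0,  1.0],
              [1.0,  0.0, -1.0],
              [-1.0, 1.0,  0.0]])
x = np.array([0.5, 0.3, 0.2])
for _ in range(1000):
    x = replicator_step(x, A)
```

Note that the increments sum to zero whenever the coordinates of $x$ sum to one, so the Euler step preserves the probability simplex in exact arithmetic.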
https://proceedings.mlr.press/v134/andrade21a.html
https://proceedings.mlr.press/v134/andrade21a.htmlThe Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications for Profile Maximum LikelihoodIn this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that, given $n$ i.i.d. samples from a discrete distribution, achieve an approximation factor of $\exp\left(-O(\sqrt{n} \log n) \right)$, improving upon the previous best-known bound achievable in polynomial time of $\exp(-O(n^{2/3} \log n))$ (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions for a broader range of error parameters. To obtain our results on PML we establish new connections between PML and the well-studied Bethe and Sinkhorn approximations to the permanent (Vontobel, 2012 and 2014). It is known that the PML objective is proportional to the permanent of a certain Vandermonde matrix (Vontobel, 2012) with $\sqrt{n}$ distinct columns, i.e., with non-negative rank at most $\sqrt{n}$. This allows us to show that the convex approximation to computing PML distributions studied in (Charikar, Shiragur and Sidford, 2019) is governed, in part, by the quality of Sinkhorn approximations to the permanent. We show that both Bethe and Sinkhorn permanents are $\exp(O(k \log(N/k)))$ approximations to the permanent of $N \times N$ matrices with non-negative rank at most $k$. This improves upon the previously known bounds of $\exp(O(N))$, and combining these insights with careful rounding of the convex relaxation yields our results.Wed, 21 Jul 2021 00:00:00 +0000
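For tiny matrices the permanent can be computed by brute force, which grounds the quantity that the Bethe and Sinkhorn objectives approximate (brute force is $O(n \cdot n!)$ and is shown only for illustration; the approximations exist precisely because the exact permanent is #P-hard):

```python
from itertools import permutations
import numpy as np

def permanent(A):
    # Sum over all permutations of the products of matched entries:
    # the determinant formula without the alternating signs.
    n = A.shape[0]
    return sum(
        np.prod([A[i, j] for i, j in enumerate(p)])
        for p in permutations(range(n))
    )
```

For the all-ones $n \times n$ matrix the permanent is $n!$, and for the identity it is 1.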
https://proceedings.mlr.press/v134/anari21a.html
https://proceedings.mlr.press/v134/anari21a.htmlSGD Generalizes Better Than GD (And Regularization Doesn’t Help)We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that $O(1/\epsilon^2)$ iterations suffice for obtaining a solution with $\epsilon$ excess expected risk, we show that with the same number of steps GD may overfit and emit a solution with $\Omega(1)$ generalization error. Moreover, we show that in fact $\Omega(1/\epsilon^4)$ iterations are necessary for GD to match the generalization performance of SGD, which is also tight due to recent work by Bassily et al. (2020). We further discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.Wed, 21 Jul 2021 00:00:00 +0000
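The two algorithms being separated are easy to state side by side; a sketch on an illustrative least-squares instance (the problem, step sizes, and iteration counts are our assumptions, and this well-behaved toy problem does not reproduce the paper's separation):

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=50, seed=0):
    # One-sample stochastic gradient descent on 0.5 * (x_i . w - y_i)^2,
    # reshuffling the sample order each epoch.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            w -= eta * (X[i] @ w - y[i]) * X[i]
    return w

def gd(X, y, eta=0.1, steps=2000):
    # Full-batch gradient descent on the empirical risk.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y) / n
    return w
```

On a well-conditioned realizable instance both recover the target; the paper's point is that in general stochastic convex optimization their generalization behavior can differ sharply.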
https://proceedings.mlr.press/v134/amir21a.html
https://proceedings.mlr.press/v134/amir21a.htmlRegret Minimization in Heavy-Tailed BanditsWe revisit the classic regret-minimization problem in the stochastic multi-armed bandit setting when the arm-distributions are allowed to be heavy-tailed. Regret minimization has been well studied in simpler settings of either bounded support reward distributions or distributions that belong to a single parameter exponential family. We work under the much weaker assumption that the moments of order $(1+\epsilon)$ are uniformly bounded by a known constant $B$, for some given $\epsilon > 0$. We propose an optimal algorithm that matches the lower bound exactly in the first-order term. We also give a finite-time bound on its regret. We show that our index concentrates faster than the well-known truncated or trimmed empirical mean estimators for the mean of heavy-tailed distributions. Computing our index can be computationally demanding. To address this, we develop a batch-based algorithm that is optimal up to a multiplicative constant depending on the batch size. We hence provide a controlled trade-off between statistical optimality and computational cost.Wed, 21 Jul 2021 00:00:00 +0000
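The truncated empirical mean against which the proposed index is compared is simple to state; a fixed-threshold sketch (the threshold choice is illustrative, not the adaptive truncation level used in such analyses):

```python
import numpy as np

def truncated_mean(xs, M):
    # Zero out samples with |x| > M before averaging: truncation trades
    # a small bias for much lighter tails of the estimator.
    xs = np.asarray(xs, dtype=float)
    return float(np.where(np.abs(xs) <= M, xs, 0.0).mean())
```

A single extreme sample then moves the estimate by at most $M/n$ instead of arbitrarily far.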
https://proceedings.mlr.press/v134/agrawal21a.html
https://proceedings.mlr.press/v134/agrawal21a.htmlStochastic block model entropy and broadcasting on trees with surveyThe limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [Coja-Oghlan et al. (2017)] and for the classical case of assortative communities but in the dense regime [Deshpande et al. (2016)]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR except for the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [Kanade et al. (2016)], [Mossel et al. (2016)] and [Mossel and Xu (2015)]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved.Wed, 21 Jul 2021 00:00:00 +0000
https://proceedings.mlr.press/v134/abbe21a.html
https://proceedings.mlr.press/v134/abbe21a.html