Proceedings of Machine Learning Research
Proceedings of Thirty Seventh Conference on Learning Theory
Held in Edmonton, Canada, from 30 June to 3 July 2024
Published as Volume 247 by the Proceedings of Machine Learning Research on 30 June 2024.
Volume Edited by:
Shipra Agrawal
Aaron Roth
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/v247/
Wed, 18 Sep 2024 07:46:13 +0000
Gap-Free Clustering: Sensitivity and Robustness of SDP
We study graph clustering in the Stochastic Block Model (SBM) in the presence of both large clusters and small, unrecoverable clusters. Previous convex relaxation approaches achieving exact recovery do not allow any small clusters of size $o(\sqrt{n})$, or require a size gap between the smallest recovered cluster and the largest non-recovered cluster. We provide an algorithm based on semidefinite programming (SDP) which removes these requirements and provably recovers large clusters regardless of the remaining cluster sizes. Mid-sized clusters pose unique challenges to the analysis, since their proximity to the recovery threshold makes them highly sensitive to small noise perturbations and precludes a closed-form candidate solution. We develop novel techniques, including a leave-one-out-style argument which controls the correlation between SDP solutions and noise vectors even when the removal of one row of noise can drastically change the SDP solution. We also develop improved eigenvalue perturbation bounds of potential independent interest. Our results are robust to certain semirandom settings that are challenging for alternative algorithms. Using our gap-free clustering procedure, we obtain efficient algorithms for the problem of clustering with a faulty oracle with superior query complexities, notably achieving $o(n^2)$ sample complexity even in the presence of a large number of small clusters. Our gap-free clustering procedure also leads to improved algorithms for recursive clustering.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/zurek24a.html
Spectral Estimators for Structured Generalized Linear Models via Approximate Message Passing (Extended Abstract)
We consider the problem of parameter estimation in a high-dimensional generalized linear model. Spectral methods obtained via the principal eigenvector of a suitable data-dependent matrix provide a simple yet surprisingly effective solution. However, despite their wide use, a rigorous performance characterization, as well as a principled way to preprocess the data, are available only for unstructured (i.i.d. Gaussian and Haar orthogonal) designs. In contrast, real-world data matrices are highly structured and exhibit non-trivial correlations. To address the problem, we consider correlated Gaussian designs capturing the anisotropic nature of the features via a covariance matrix $\Sigma$. Our main result is a precise asymptotic characterization of the performance of spectral estimators. This allows us to identify the optimal preprocessing that minimizes the number of samples needed for parameter estimation. Surprisingly, such preprocessing is universal across a broad set of statistical models, which partly addresses a conjecture on optimal spectral estimators for rotationally invariant designs. Our principled approach vastly improves upon previous heuristic methods, including for designs common in computational imaging and genetics. The proposed methodology, based on approximate message passing, is broadly applicable and opens the way to the precise characterization of spiked matrices and of the corresponding spectral methods in a variety of settings.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/zhang24c.html
Optimal Multi-Distribution Learning
Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $\frac{d+k}{\varepsilon^2}$ (modulo logarithmic factors), matching the best-known lower bound. Our algorithmic ideas and theory have been further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient: they access the hypothesis class solely through an empirical risk minimization oracle. Additionally, we establish the necessity of randomization, unveiling a large sample size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., Problems 1, 3 and 4 of Awasthi et al. 2023).
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/zhang24b.html
Settling the sample complexity of online reinforcement learning
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large-sample” regime, imposing an enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al., achieves a regret on the order of $$\min\big\{ \sqrt{SAH^3K}, \,HK \big\},$$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to a log factor, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm to decouple complicated statistical dependency, a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/zhang24a.html
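To make the regret bound in "Settling the sample complexity of online reinforcement learning" concrete, the following is a minimal numeric sketch rather than anything taken from the paper. It evaluates the order $\min\{\sqrt{SAH^3K}, HK\}$ with constants and logarithmic factors ignored. The function names and the example values of $S$, $A$, $H$ are illustrative assumptions. The crossover point follows from squaring: $\sqrt{SAH^3K} \le HK$ exactly when $K \ge SAH$.

```python
import math


def minimax_regret_order(S, A, H, K):
    """Order of min{sqrt(S*A*H^3*K), H*K}; constants and log factors are ignored."""
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S, A, H):
    """Smallest K at which the sqrt term is no larger than the trivial H*K term.

    sqrt(S*A*H^3*K) <= H*K  <=>  S*A*H^3*K <= H^2*K^2  <=>  K >= S*A*H.
    """
    return S * A * H


def pac_episodes_order(S, A, H, eps):
    """Order of the PAC sample complexity S*A*H^3 / eps^2 (log factors ignored)."""
    return S * A * H**3 / eps**2


# Hypothetical problem sizes, chosen only for illustration.
S, A, H = 10, 5, 20
K_star = crossover_episodes(S, A, H)
for K in (K_star // 10, K_star, 10 * K_star):
    print(K, minimax_regret_order(S, A, H, K))
```

Below the crossover the trivial bound $HK$ is the smaller term; above it the $\sqrt{SAH^3K}$ term takes over.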
Fast two-time-scale stochastic gradient method with applications in reinforcement learning
Two-time-scale optimization is a framework introduced in Zeng et al. (2024) that abstracts a range of policy evaluation and policy optimization problems in reinforcement learning (RL). Akin to bi-level optimization under a particular type of stochastic oracle, the two-time-scale optimization framework has an upper level objective whose gradient evaluation depends on the solution of a lower level problem, which is to find the root of a strongly monotone operator. In this work, we propose a new method for solving two-time-scale optimization that achieves significantly faster convergence than prior methods. The key idea of our approach is to leverage an averaging step to improve the estimates of the operators in both lower and upper levels before using them to update the decision variables. These additional averaging steps eliminate the direct coupling between the main variables, enabling the accelerated performance of our algorithm. We characterize the finite-time convergence rates of the proposed algorithm under various conditions of the underlying objective function, including strong convexity, convexity, the Polyak-Łojasiewicz condition, and general non-convexity. These rates significantly improve over the best-known complexity of the standard two-time-scale stochastic approximation algorithm. When applied to RL, we show how the proposed algorithm specializes to novel online sample-based methods that surpass or match the performance of the existing state of the art. Finally, we support our theoretical results with numerical simulations in RL.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/zeng24a.html
Counting Stars is Constant-Degree Optimal For Detecting Any Planted Subgraph: Extended Abstract
We prove that whenever $p=\Omega(1)$ and for any graph $H$, counting $O(1)$-stars is optimal among all constant degree polynomial tests in terms of strongly separating an instance of $G(n,p)$ from the union of a random copy of $H$ with an instance of $G(n,p)$. Our work generalizes and extends multiple previous results on the inference abilities of $O(1)$-degree polynomials in the literature.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/yu24a.html
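As a small illustration of the statistic named in "Counting Stars is Constant-Degree Optimal For Detecting Any Planted Subgraph", the sketch below counts $k$-stars in a simple undirected graph. It is an assumption-laden toy: it computes the plain unsigned count $\sum_v \binom{\deg(v)}{k}$, whereas the test analysed in the paper may be centred or normalised differently. The function name and the edge-list input format are hypothetical.

```python
from math import comb


def count_stars(edges, k):
    """Count k-stars (one centre joined to k distinct leaves) in a simple graph.

    `edges` lists each undirected edge once, with no self-loops.
    A vertex of degree d is the centre of C(d, k) stars, so the total is
    the sum of C(deg(v), k) over all vertices. Intended for k >= 2; for
    k = 1 every edge is counted twice, once from each endpoint.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(comb(d, k) for d in degree.values())


triangle = [(0, 1), (1, 2), (0, 2)]
claw = [(0, 1), (0, 2), (0, 3)]  # one centre with three leaves
print(count_stars(triangle, 2), count_stars(claw, 2), count_stars(claw, 3))
```

Because the count depends only on the degree sequence, it can be computed in time linear in the number of edges.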
Top-$K$ ranking with a monotone adversary
In this paper, we address the top-$K$ ranking problem with a monotone adversary. We consider the scenario where a comparison graph is randomly generated and the adversary is allowed to add arbitrary edges. The statistician’s goal is then to accurately identify the top-$K$ preferred items based on pairwise comparisons derived from this semi-random comparison graph. The main contribution of this paper is to develop a weighted maximum likelihood estimator (MLE) that achieves near-optimal sample complexity, up to a $\log^2(n)$ factor, where $n$ denotes the number of items under comparison. This is made possible through a combination of analytical and algorithmic innovations. On the analytical front, we provide a refined $\ell_\infty$ error analysis of the weighted MLE that is more explicit and tighter than existing analyses. It relates the $\ell_\infty$ error with the spectral properties of the weighted comparison graph. Motivated by this, our algorithmic innovation involves the development of an SDP-based approach to reweight the semi-random graph and meet specified spectral properties. Additionally, we propose a first-order method based on the Matrix Multiplicative Weight Update (MMWU) framework to solve the resulting SDP efficiently in nearly-linear time in the size of the semi-random comparison graph.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/yang24b.html
Multiple-output composite quantile regression through an optimal transport lens
Composite quantile regression has been used to obtain robust estimators of regression coefficients in linear models with good statistical efficiency. By revealing an intrinsic link between the composite quantile regression loss function and the Wasserstein distance from the residuals to the set of quantiles, we establish a generalization of the composite quantile regression to the multiple-output settings. Theoretical convergence rates of the proposed estimator are derived both under the setting where the additive error possesses only a finite $\ell$-th moment (for $\ell > 2$) and where it exhibits a sub-Weibull tail. In doing so, we develop novel techniques for analyzing the M-estimation problem that involves the Wasserstein distance in the loss. Numerical studies confirm the practical effectiveness of our proposed procedure.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/yang24a.html
Bridging the Gap: Rademacher Complexity in Robust and Standard Generalization
Training Deep Neural Networks (DNNs) with adversarial examples often results in poor generalization to test-time adversarial data. This paper investigates this issue, known as adversarially robust generalization, through the lens of Rademacher complexity. Building upon the studies by Khim and Loh (2018) and Yin et al. (2019), numerous works have been dedicated to this problem, yet achieving a satisfactory bound remains an elusive goal. Existing works on DNNs either apply to a surrogate loss instead of the robust loss or yield bounds that are notably looser compared to their standard counterparts. In the latter case, the bounds have a higher dependency on the width $m$ of the DNNs or the dimension $d$ of the data, with an extra factor of at least $\mathcal{O}(\sqrt{m})$ or $\mathcal{O}(\sqrt{d})$. This paper presents upper bounds for adversarial Rademacher complexity of DNNs that match the best-known upper bounds in standard settings, as established in the work of Bartlett et al. (2017), with the dependency on width and dimension being $\mathcal{O}(\ln(dm))$. The central challenge addressed is calculating the covering number of adversarial function classes. We aim to construct a new cover that possesses two properties: 1) compatibility with adversarial examples, and 2) precision comparable to covers used in standard settings. To this end, we introduce a new variant of covering number called the uniform covering number, specifically designed and proven to reconcile these two properties. Consequently, our method effectively bridges the gap between Rademacher complexity in robust and standard generalization.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/xiao24a.html
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in $O(\eta)$ steps, and subsequently achieves an $\tilde{O}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{O}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{O}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wu24b.html
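The setting of "Large Stepsize Gradient Descent for Logistic Loss" can be explored with a short experiment. The sketch below is an illustration only and is not the authors' code. It runs constant-stepsize GD on the average logistic loss for a tiny linearly separable dataset, once with a small stepsize and once with a large one. The dataset, the stepsizes, and all function names are assumptions chosen for the example, and nothing here verifies the rates stated in the abstract.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Average of log(1 + exp(-y_i * <x_i, w>)), computed stably."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))


def gradient_descent(X, y, eta, steps):
    """Constant-stepsize GD from w = 0; returns final w and the loss history."""
    w = np.zeros(X.shape[1])
    losses = [logistic_loss(w, X, y)]
    for _ in range(steps):
        margins = y * (X @ w)
        # sigmoid(-m) = 1 / (1 + exp(m)), written via tanh to avoid overflow
        sig = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(X * (sig * y)[:, None]).mean(axis=0)
        w = w - eta * grad
        losses.append(logistic_loss(w, X, y))
    return w, np.array(losses)


# Toy separable data (hypothetical): w = (0, 1) classifies both points correctly.
X = np.array([[1.0, 0.1], [0.5, -0.1]])
y = np.array([1.0, -1.0])

_, small = gradient_descent(X, y, eta=1.0, steps=200)
_, large = gradient_descent(X, y, eta=100.0, steps=200)
print("small stepsize: first losses", small[:3], "final", small[-1])
print("large stepsize: first losses", large[:3], "final", large[-1])
```

With the small stepsize the loss decreases at every step, as the standard descent lemma predicts for a smooth objective. With the large stepsize the very first step already increases the loss on this data, which is the non-monotone regime the abstract refers to.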
Oracle-Efficient Hybrid Online Learning with Unknown Distribution
We study the problem of oracle-efficient hybrid online learning when the features are generated by an unknown i.i.d. process and the labels are generated adversarially. Assuming access to an (offline) ERM oracle, we show that there exists a computationally efficient online predictor that achieves a regret upper bounded by $\tilde{O}(T^{\frac{3}{4}})$ for a finite-VC class, and upper bounded by $\tilde{O}(T^{\frac{p+1}{p+2}})$ for a class with $\alpha$ fat-shattering dimension $\alpha^{-p}$. This provides the first known oracle-efficient sublinear regret bounds for hybrid online learning with an unknown feature generation process. In particular, it confirms a conjecture of Lazaric and Munos (2012). We then extend our result to the scenario of shifting distributions with $K$ changes, yielding a regret of order $\tilde{O}(T^{\frac{4}{5}}K^{\frac{1}{5}})$. Finally, we establish a regret of $\tilde{O}((K^{\frac{2}{3}}(\log|\mathcal{H}|)^{\frac{1}{3}}+K)\cdot T^{\frac{4}{5}})$ for the contextual $K$-armed bandits with a finite policy set $\mathcal{H}$, i.i.d. generated contexts from an unknown distribution, and adversarially generated costs.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wu24a.html
Optimal score estimation via empirical Bayes smoothing
We study the problem of estimating the score function of an unknown probability distribution $\rho^*$ from $n$ independent and identically distributed observations in $d$ dimensions. Assuming that $\rho^*$ is subgaussian and has a Lipschitz-continuous score function $s^*$, we establish the optimal rate of $\tilde \Theta(n^{-\frac{2}{d+4}})$ for this estimation problem under the loss function $\|\hat s - s^*\|^2_{L^2(\rho^*)}$ that is commonly used in the score matching literature, highlighting the curse of dimensionality where sample complexity for accurate score estimation grows exponentially with the dimension $d$. Leveraging key insights in empirical Bayes theory as well as a new convergence rate of smoothed empirical distribution in Hellinger distance, we show that a regularized score estimator based on a Gaussian kernel attains this rate, shown optimal by a matching minimax lower bound. We also discuss extensions to estimating $\beta$-Hölder continuous scores with $\beta \leq 1$, as well as the implication of our theory on the sample complexity of score-based generative models.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wibisono24a.html
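For "Optimal score estimation via empirical Bayes smoothing", the following sketch shows the basic object behind a Gaussian-kernel score estimator: the score of the Gaussian-smoothed empirical distribution. With samples $x_1,\dots,x_n$ and bandwidth $h$, the smoothed density is $p_h(x) = \frac{1}{n}\sum_i \mathcal{N}(x; x_i, h^2 I)$, and its score is $\nabla \log p_h(x) = \sum_i w_i(x)\,(x_i - x)/h^2$ with softmax weights $w_i(x) \propto \exp(-\|x - x_i\|^2 / (2h^2))$. This is an assumption-level illustration: the regularization used in the paper is omitted, and the function name and bandwidth choices are hypothetical.

```python
import numpy as np


def kde_score(x, samples, h):
    """Score (gradient of the log-density) at x of the Gaussian-smoothed
    empirical distribution of `samples` with bandwidth h. Unregularized sketch."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    diffs = samples - x                          # shape (n, d): x_i - x
    logw = -np.sum(diffs**2, axis=1) / (2 * h**2)
    w = np.exp(logw - logw.max())                # subtract max for stability
    w /= w.sum()                                 # posterior weights over samples
    return (w[:, None] * diffs).sum(axis=0) / h**2


# With a single sample a, the smoothed density is N(a, h^2), whose score is (a - x) / h^2.
print(kde_score([0.5], [[2.0]], h=1.0))
# Two symmetric samples give a zero score at the midpoint.
print(kde_score([0.0], [[-1.0], [1.0]], h=0.5))
```

The weights $w_i(x)$ are the posterior probabilities that $x$ was generated from sample $x_i$ under Gaussian noise, which is the empirical Bayes (Tweedie-type) reading of the same formula.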
Open problem: Convergence of single-timescale mean-field Langevin descent-ascent for two-player zero-sum games
Let $f: T^d \times T^d \to \mathbb{R}$ be a smooth function over the $d$-torus and let $\beta>0$. Consider the min-max objective functional $F_\beta(\mu, \nu) = \iint f d\mu d\nu + \beta^{-1} H(\mu) - \beta^{-1} H(\nu)$ over $\mathcal{P}(T^d) \times \mathcal{P}(T^d)$, where $H$ denotes the negative differential entropy. Its unique saddle point defines the entropy-regularized mixed Nash equilibrium of a two-player zero-sum game, and its Wasserstein gradient descent-ascent flow $(\mu_t, \nu_t)$ corresponds to the mean-field limit of a Langevin descent-ascent dynamics. Do $\mu_t$ and $\nu_t$ converge (weakly, say) as $t \to \infty$, for any $f$ and $\beta$? This rather natural qualitative question is still open, and it is not clear whether it can be addressed using the tools currently available for the analysis of dynamics in Wasserstein space. Even though the simple trick of using a different timescale for the ascent versus the descent is known to guarantee convergence, we propose this question as a toy setting to further our understanding of the Wasserstein geometry for optimization.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wang24c.html
Nonlinear spiked covariance matrices and signal propagation in deep neural networks
Many recent works have studied the eigenvalue spectrum of the Conjugate Kernel (CK) defined by the nonlinear feature map of a feedforward neural network. However, existing results only establish weak convergence of the empirical eigenvalue distribution, and fall short of providing precise quantitative characterizations of the “spike” eigenvalues and eigenvectors that often capture the low-dimensional signal structure of the learning problem. In this work, we characterize these signal eigenvalues and eigenvectors for a nonlinear version of the spiked covariance model, including the CK as a special case. Using this general result, we give a quantitative description of how spiked eigenstructure in the input data propagates through the hidden layers of a neural network with random weights. As a second application, we study a simple regime of representation learning where the weight matrix develops a rank-one signal component over training and characterize the alignment of the target function with the spike eigenvector of the CK on test data.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wang24b.html
Efficient Algorithms for Attributed Graph Alignment with Vanishing Edge Correlation Extended Abstract
Graph alignment refers to the task of finding the vertex correspondence between two correlated graphs of $n$ vertices. Extensive study has been done on polynomial-time algorithms for the graph alignment problem under the Erdős–Rényi graph pair model, where the two graphs are Erdős–Rényi graphs with edge probability $q_\mathrm{u}$, correlated under certain vertex correspondence. To achieve exact recovery of the correspondence, all existing algorithms at least require the edge correlation coefficient $\rho_\mathrm{u}$ between the two graphs to be non-vanishing as $n\rightarrow\infty$. Moreover, it is conjectured that no polynomial-time algorithm can achieve exact recovery under vanishing edge correlation $\rho_\mathrm{u}<1/\mathrm{polylog}(n)$. In this paper, we show that with a vanishing amount of additional attribute information, exact recovery is polynomial-time feasible under vanishing edge correlation $\rho_\mathrm{u} \ge n^{-\Theta(1)}$. We identify a local tree structure, which incorporates one layer of user information and one layer of attribute information, and apply the subgraph counting technique to such structures. A polynomial-time algorithm is proposed that recovers the vertex correspondence for most of the vertices, and then refines the output to achieve exact recovery. The consideration of attribute information is motivated by real-world applications like LinkedIn and Twitter, where user attributes like birthplace and education background can aid alignment.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wang24a.html
Nearly Optimal Regret for Decentralized Online Convex Optimization
We investigate decentralized online convex optimization (D-OCO), in which a set of local learners are required to minimize a sequence of global loss functions using only local computations and communications. Previous studies have established $O(n^{5/4}\rho^{-1/2}\sqrt{T})$ and $O(n^{3/2}\rho^{-1}\log T)$ regret bounds for convex and strongly convex functions respectively, where $n$ is the number of local learners, $\rho<1$ is the spectral gap of the communication matrix, and $T$ is the time horizon. However, there exist large gaps from the existing lower bounds, i.e., $\Omega(n\sqrt{T})$ for convex functions and $\Omega(n)$ for strongly convex functions. To fill these gaps, in this paper, we first develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions to $\tilde{O}(n\rho^{-1/4}\sqrt{T})$ and $\tilde{O}(n\rho^{-1/2}\log T)$. The primary technique is to design an online accelerated gossip strategy that enjoys a faster average consensus among local learners. Furthermore, by carefully exploiting the spectral properties of a specific network topology, we enhance the lower bounds for convex and strongly convex functions to $\Omega(n\rho^{-1/4}\sqrt{T})$ and $\Omega(n\rho^{-1/2})$, respectively. These lower bounds suggest that our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/wan24a.html
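The role of the communication matrix and its spectral gap in "Nearly Optimal Regret for Decentralized Online Convex Optimization" can be illustrated with plain gossip averaging. The sketch below is the standard, non-accelerated gossip iteration $x \leftarrow Wx$ on a ring, not the accelerated strategy developed in the paper. The ring topology, the uniform $1/3$ weights, and the convention "spectral gap = one minus the second-largest eigenvalue modulus" are assumptions made for this example.

```python
import numpy as np


def ring_gossip_matrix(n):
    """Symmetric doubly stochastic matrix for a ring of n >= 3 learners:
    each learner averages itself and its two neighbours with weight 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W


def gossip(values, W, rounds):
    """Run `rounds` of local averaging x <- W x; each learner only uses neighbours."""
    x = np.asarray(values, dtype=float)
    for _ in range(rounds):
        x = W @ x
    return x


def spectral_gap(W):
    """One minus the second-largest eigenvalue modulus of a symmetric W."""
    moduli = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return 1.0 - moduli[-2]


W = ring_gossip_matrix(5)
start = [1.0, 2.0, 3.0, 4.0, 10.0]
print("spectral gap:", spectral_gap(W))
print("after 60 rounds:", gossip(start, W, 60), "true mean:", np.mean(start))
```

Because $W$ is doubly stochastic the average is preserved at every round, and the deviation from consensus shrinks geometrically at a rate governed by the spectral gap. This is why a smaller $\rho$ makes the regret bounds in the abstract worse.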
Pruning is Optimal for Learning Sparse Features in High-Dimensions
While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions. We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$, where $\sigma^*$ is a degree-$p$ polynomial, and $\boldsymbol{V} \in \mathbb{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks. Furthermore, we establish Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound. In the same scenario, however, our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/vural24a.html
Active Learning with Simple Questions
We consider an active learning setting where a learner is presented with a pool $S$ of $n$ unlabeled examples belonging to a domain $\mathcal X$ and asks queries to find the underlying labeling that agrees with a target concept $h^\ast \in \mathcal H$. In contrast to traditional active learning that queries a single example for its label, we study more general region queries that allow the learner to pick a subset of the domain $T \subset \mathcal X$ and a target label $y$ and ask a labeler whether $h^\ast(x) = y$ for every example in the set $T \cap S$. Such more powerful queries allow us to bypass the limitations of traditional active learning and use significantly fewer rounds of interactions to learn but can potentially lead to a significantly more complex query language. Our main contribution is quantifying the trade-off between the number of queries and the complexity of the query language used by the learner. We measure the complexity of the region queries via the VC dimension of the family of regions. We show that given any hypothesis class $\mathcal H$ with VC dimension $d$, one can design a region query family $Q$ with VC dimension $6d$ such that for every set of $n$ examples $S \subset \mathcal X$ and every $h^\ast \in \mathcal H$, a learner can submit $O(d\log n)$ queries from $Q$ to a labeler and perfectly label $S$. We show a matching lower bound by designing a hypothesis class $\mathcal H$ with VC dimension $d$ and a dataset $S \subset \mathcal X$ of size $n$ such that any learning algorithm using any query class with VC dimension $(d-2)/3$ must make $\mathrm{poly}(n)$ queries to label $S$ perfectly. Finally, we focus on well-studied hypothesis classes including unions of intervals, high-dimensional boxes, and $d$-dimensional halfspaces, and obtain stronger results. In particular, we design learning algorithms that (i) are computationally efficient and (ii) work even when the queries are not answered based on the learner’s pool of examples $S$ but on some unknown superset $L$ of $S$.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/vasilis24a.html
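The region queries described in "Active Learning with Simple Questions" can be illustrated on the simplest hypothesis class. The sketch below is a toy example under stated assumptions, not the paper's construction: the target is a one-dimensional threshold $h^\ast(x) = 1[x \ge t]$, the query regions are half-lines $[c, \infty)$, and the learner binary-searches for the leftmost pool point from which everything is labeled 1. The helper names `make_oracle` and `label_thresholds_with_region_queries` are hypothetical, and the pool points are assumed distinct.

```python
def make_oracle(points, target):
    """Labeler for region queries: answers whether target(x) == y for every
    pool point x that lies in the region (vacuously True on an empty region)."""
    def oracle(region, y):
        return all(target(x) == y for x in points if region(x))
    return oracle


def label_thresholds_with_region_queries(points, oracle):
    """Label a pool under a threshold target using half-line region queries.

    Binary search over the sorted pool for the smallest index i such that all
    points xs[i:] are labeled 1. Uses at most ceil(log2(n + 1)) queries.
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)          # invariant: the answer lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        cut = xs[mid]
        queries += 1
        if oracle(lambda x, cut=cut: x >= cut, 1):
            hi = mid             # everything from xs[mid] onward is labeled 1
        else:
            lo = mid + 1         # some point at or after xs[mid] is labeled 0
    labels = {x: int(i >= lo) for i, x in enumerate(xs)}
    return labels, queries


pool = [3, 1, 4, 1.5, 9, 2.6, 5, 3.5]
oracle = make_oracle(pool, lambda x: int(x >= 3.2))   # hidden threshold t = 3.2
labels, used = label_thresholds_with_region_queries(pool, oracle)
print(labels, "queries used:", used)
```

Label queries on single points would also support a binary search here; the point of the toy is only to show the query interface, in which one question covers a whole subset $T \cap S$ of the pool.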
Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning
Reinforcement Learning (RL) has shown great empirical success in various application domains. The theoretical aspects of the problem have been extensively studied over past decades, particularly under tabular and linear Markov Decision Process structures. Recently, non-linear function approximation using kernel-based prediction has gained traction. This approach is particularly interesting as it naturally extends the linear structure, and helps explain the behavior of neural-network-based models at their infinite width limit. The analytical results, however, do not adequately address the performance guarantees for this case. We will highlight this open problem, overview existing partial results, and discuss related challenges.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/vakili24a.html
Improved Hardness Results for Learning Intersections of Halfspaces
We show strong (and surprisingly simple) lower bounds for weakly learning intersections of halfspaces in the improper setting. Strikingly little is known about this problem. For instance, it is not even known if there is a polynomial-time algorithm for learning the intersection of only two halfspaces. On the other hand, lower bounds based on well-established assumptions (such as approximating worst-case lattice problems or variants of Feige’s 3SAT hypothesis) are only known (or are implied by existing results) for the intersection of super-logarithmically many halfspaces (KS06, KS09, DS16), with intersections of fewer halfspaces being ruled out only under less standard assumptions (DV21) (such as the existence of local pseudo-random generators with large stretch). We significantly narrow this gap by showing that even learning $\omega(\log \log N)$ halfspaces in dimension $N$ takes super-polynomial time under standard assumptions on worst-case lattice problems (namely that SVP and SIVP are hard to approximate within polynomial factors). Further, we give unconditional hardness results in the statistical query framework. Specifically, we show that for any $k$ (even constant), learning $k$ halfspaces in dimension $N$ requires accuracy $N^{-\Omega(k)}$, or exponentially many queries, in particular ruling out SQ algorithms with polynomial accuracy for $\omega(1)$ halfspaces. To the best of our knowledge this is the first unconditional hardness result for learning a super-constant number of halfspaces. Our lower bounds are obtained in a unified way via a novel connection we make between intersections of halfspaces and the so-called parallel pancakes distribution (DKS17, PLBR19, BRST21) that has been at the heart of many lower bound constructions in (robust) high-dimensional statistics in the past few years.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/tiegel24a.html
Second Order Methods for Bandit Optimization and Control
Bandit convex optimization (BCO) is a general framework for online decision making under uncertainty. While tight regret bounds for general convex losses have been established, existing algorithms achieving these bounds have prohibitive computational costs for high dimensional data. In this paper, we propose a simple and practical BCO algorithm inspired by the online Newton step algorithm. We show that our algorithm achieves optimal (in terms of horizon) regret bounds for a large class of convex functions that satisfy a condition we call $\kappa$-convexity. This class contains a wide range of practically relevant loss functions including linear losses, quadratic losses, and generalized linear models. In addition to optimal regret, this method is the most efficient known algorithm for several well-studied applications including bandit logistic regression. Furthermore, we investigate the adaptation of our second-order bandit algorithm to online convex optimization with memory. We show that for loss functions with a certain affine structure, the extended algorithm attains optimal regret. This leads to an algorithm with optimal regret for the bandit LQ problem under a fully adversarial noise model, thereby resolving an open question posed in Gradu et al. 2020 and Sun et al. 2023. Finally, we show that the more general problem of BCO with (non-affine) memory is harder. We derive a $\tilde{\Omega}(T^{2/3})$ regret lower bound, even under the assumption of smooth and quadratic losses.
Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/suggala24a.html
https://proceedings.mlr.press/v247/suggala24a.html A non-backtracking method for long matrix and tensor completion We consider the problem of low-rank rectangular matrix completion in the regime where the matrix $M$ of size $n\times m$ is “long”, i.e., the aspect ratio $m/n$ diverges to infinity. Such matrices are of particular interest in the study of tensor completion, where they arise from the unfolding of a low-rank tensor. In the case where the sampling probability is $\frac{d}{\sqrt{mn}}$, we propose a new spectral algorithm for recovering the singular values and left singular vectors of the original matrix $M$ based on a variant of the standard non-backtracking operator of a suitably defined bipartite weighted random graph, which we call a \textit{non-backtracking wedge operator}. When $d$ is above a Kesten-Stigum-type sampling threshold, our algorithm recovers a correlated version of the singular value decomposition of $M$ with quantifiable error bounds. This is the first result in the regime of bounded $d$ for weak recovery and the first for weak consistency when $d\to\infty$ arbitrarily slowly without any polylog factors. As an application, for low-CP-rank orthogonal $k$-tensor completion, we efficiently achieve weak recovery with sample size $O(n^{k/2})$ and weak consistency with sample size $\omega(n^{k/2})$. A similar result is obtained for low-multilinear-rank tensor completion with $O(n^{k/2})$ many samples. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/stephan24a.html
https://proceedings.mlr.press/v247/stephan24a.html Fast sampling from constrained spaces using the Metropolis-adjusted Mirror Langevin algorithm We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the Mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the Mirror Langevin dynamics. Due to the inclusion of this filter, our method is unbiased relative to the target, while known discretisations of the Mirror Langevin dynamics, including the Mirror Langevin algorithm, have an asymptotic bias. For this algorithm, we also give upper bounds for the number of iterations taken to mix to a constrained distribution whose potential is relatively smooth, convex, and Lipschitz continuous with respect to a self-concordant mirror function. As a consequence of the reversibility of the Markov chain induced by the inclusion of the Metropolis-Hastings filter, we obtain an exponentially better dependence on the error tolerance for approximate constrained sampling. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/srinivasan24a.html
https://proceedings.mlr.press/v247/srinivasan24a.html A Non-Adaptive Algorithm for the Quantitative Group Testing Problem Consider an $n$-dimensional binary feature vector with $k$ non-zero entries. The vector can be interpreted as the incidence vector corresponding to $n$ items out of which $k$ items are \emph{defective}. The \emph{quantitative group testing} (QGT) problem aims at learning this binary feature vector by queries on subsets of the items that return the total number of defective items. We consider this problem under the \emph{non-adaptive} scenario where the queries on subsets are designed collectively and can be executed in parallel. Most of the existing efficient non-adaptive algorithms for the sublinear regime where $k = n^\alpha$ with $0 < \alpha < 1$ fall short of the information-theoretic lower bound, with a multiplicative gap of $\log k$. Recently, a near-optimal non-adaptive algorithm with a decoding complexity of $O(n^3)$ closed this gap. In this work, we present a concatenated construction method yielding a non-adaptive algorithm with a decoding complexity of $O(n^{2\alpha} + n \log^2 n)$. The probability of decoding failure is analyzed by establishing a connection between the QGT problem and the so-called \emph{balls into bins} problem. Our algorithm reduces the gap between the information-theoretic and computational bound for the number of required queries/tests from $\log k$ to $\log \log k$. This narrows the gap in the number of tests for non-adaptive algorithms within the class of algorithms with $o(n^2)$ decoding complexity. Moreover, although our algorithm exhibits a $\log \log k$ gap in terms of the number of tests, it is surpassed by the existing asymptotically optimal construction only in scenarios where $k$ is exceptionally large for moderate values of $\alpha$, such as $k > 10^{27}$ for $\alpha = 0.7$, thereby highlighting the practical applicability of our proposed concatenated construction. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/soleymani24a.html
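The query model described in the quantitative group testing abstract can be illustrated with a minimal Python sketch. It shows only the measurement process: a hidden binary vector with $k$ ones, and non-adaptive subset queries that each return a count. The random half-density design and all function names here are assumptions chosen for illustration; this is not the concatenated construction or the decoder from the paper.

```python
import random


def make_defective_vector(n, k, rng):
    """Return a binary list of length n with exactly k ones (the defective items)."""
    defective = set(rng.sample(range(n), k))
    return [1 if i in defective else 0 for i in range(n)]


def qgt_query(x, subset):
    """Quantitative query: the number of defective items inside the subset."""
    return sum(x[i] for i in subset)


def random_nonadaptive_design(n, num_queries, rng):
    """Fix all query subsets up front (non-adaptive).

    Illustrative assumption: each item joins each query independently
    with probability 1/2.
    """
    return [[i for i in range(n) if rng.random() < 0.5] for _ in range(num_queries)]


rng = random.Random(0)
n, k = 20, 3
x = make_defective_vector(n, k, rng)
design = random_nonadaptive_design(n, 8, rng)
# Because the design does not depend on earlier answers,
# all queries could be executed in parallel.
answers = [qgt_query(x, subset) for subset in design]
print(answers)
```

A decoder would then try to reconstruct `x` from `design` and `answers`; the abstract's contribution concerns how few queries suffice and how fast that decoding can be made.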
https://proceedings.mlr.press/v247/soleymani24a.html Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality (extended abstract) We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting “task allocation” phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases: a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the sense that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in terms of the prediction accuracy of ICL between single-head and multi-head attention models. The key technique for our convergence analysis is to map the gradient flow dynamics in the parameter space to a set of ordinary differential equations in the spectral domain, where the relative magnitudes of the semi-singular values of the attention weights determine task allocation. To the best of our knowledge, our work provides the first convergence result for the multi-head softmax attention model. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/siyu24a.html
https://proceedings.mlr.press/v247/siyu24a.html Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability In this paper we consider the problem of obtaining sharp bounds for the performance of temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov decision processes. We show that a simple algorithm with a universal and instance-independent step size together with Polyak-Ruppert tail averaging is sufficient to obtain near-optimal variance and bias terms. We also provide the respective sample complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation together with a novel stability result for the product of random matrices that arise from the TD-type recurrence. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/samsonov24a.html
https://proceedings.mlr.press/v247/samsonov24a.html Provable Advantage in Quantum PAC Learning We revisit the problem of characterising the complexity of Quantum PAC learning, as introduced by Bshouty and Jackson [SIAM J. Comput. 1998, 28, 1136–1153]. Several quantum advantages have been demonstrated in this setting; however, none are generic: they apply to particular concept classes and typically only work when the distribution that generates the data is known. In the general case, it was recently shown by Arunachalam and de Wolf [JMLR, 19 (2018) 1-36] that quantum PAC learners can only achieve constant factor advantages over classical PAC learners. We show that with a natural extension of the definition of quantum PAC learning used by Arunachalam and de Wolf, we can achieve a generic advantage in quantum learning. To be precise, for any concept class $\mathcal{C}$ of VC dimension $d$, we show there is an $(\epsilon, \delta)$-quantum PAC learner with sample complexity \[{O}\left(\frac{1}{\sqrt{\epsilon}}\left[d+ \log(\frac{1}{\delta})\right]\log^9(1/\epsilon)\right). \] Up to polylogarithmic factors, this is a square root improvement over the classical learning sample complexity. We show the tightness of our result by proving an $\Omega(d/\sqrt{\epsilon})$ lower bound that matches our upper bound up to polylogarithmic factors. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/salmon24a.html
https://proceedings.mlr.press/v247/salmon24a.html Online Structured Prediction with Fenchel–Young Losses and Improved Surrogate Regret for Online Multiclass Classification with Logistic Loss This paper studies online structured prediction with full-information feedback. For online multiclass classification, Van der Hoeven (2020) established \emph{finite} surrogate regret bounds, which are independent of the time horizon, by introducing an elegant \emph{exploit-the-surrogate-gap} framework. However, this framework has been limited to multiclass classification primarily because it relies on a classification-specific procedure for converting estimated scores to outputs. We extend the exploit-the-surrogate-gap framework to online structured prediction with \emph{Fenchel–Young losses}, a large family of surrogate losses that includes the logistic loss for multiclass classification as a special case, obtaining finite surrogate regret bounds in various structured prediction problems. To this end, we propose and analyze \emph{randomized decoding}, which converts estimated scores to general structured outputs. Moreover, by applying our decoding to online multiclass classification with the logistic loss, we obtain a surrogate regret bound of $O(\| \bm{U} \|_\mathrm{F}^2)$, where $\bm{U}$ is the best offline linear estimator and $\| \cdot \|_\mathrm{F}$ denotes the Frobenius norm. This bound is tight up to logarithmic factors and improves the previous bound of $O(d\| \bm{U} \|_\mathrm{F}^2)$ due to Van der Hoeven (2020) by a factor of $d$, the number of classes. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/sakaue24a.html
https://proceedings.mlr.press/v247/sakaue24a.html Online Learning with Set-valued Feedback We study a variant of online multiclass classification where the learner predicts a single label but receives a \textit{set of labels} as feedback. In this model, the learner is penalized for not outputting a label contained in the revealed set. We show that unlike online multiclass learning with single-label feedback, deterministic and randomized online learnability are \textit{not equivalent} even in the realizable setting with set-valued feedback. Accordingly, we give two new combinatorial dimensions, named the Set Littlestone and Measure Shattering dimension, that tightly characterize deterministic and randomized online learnability respectively in the realizable setting. In addition, we show that the Measure Shattering dimension characterizes online learnability in the agnostic setting and tightly quantifies the minimax regret. Finally, we use our results to establish bounds on the minimax regret for three practical learning settings: online multilabel ranking, online multilabel classification, and real-valued prediction with interval-valued response. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/raman24b.html
https://proceedings.mlr.press/v247/raman24b.html Apple Tasting: Combinatorial Dimensions and Minimax Rates In online binary classification under \emph{apple tasting} feedback, the learner only observes the true label if it predicts “1”. We revisit this classical partial-feedback setting, first studied by Helmbold et al. (2000a), and study online learnability from a combinatorial perspective. We show that the Littlestone dimension continues to provide a tight quantitative characterization of apple tasting in the agnostic setting, closing an open question posed by Helmbold et al. (2000a). In addition, we give a new combinatorial parameter, called the Effective width, that tightly quantifies the minimax expected number of mistakes in the realizable setting. As a corollary, we use the Effective width to establish a \emph{trichotomy} of the minimax expected number of mistakes in the realizable setting. In particular, we show that in the realizable setting, the expected number of mistakes of any learner, under apple tasting feedback, can only be $\Theta(1)$, $\Theta(\sqrt{T})$, or $\Theta(T)$. This is in contrast to the full-information realizable setting where only $\Theta(1)$ and $\Theta(T)$ are possible. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/raman24a.html
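The apple tasting feedback rule from the abstract (the true label is revealed only when the learner predicts 1) can be written as a short Python sketch. The function name and the toy label sequence are assumptions made for illustration; the sketch shows only the feedback protocol, not any learner or bound from the paper.

```python
def apple_tasting_round(prediction, true_label):
    """Play one round of apple tasting.

    Returns (mistake, feedback): mistake is 1 if the prediction is wrong,
    and feedback is the true label only when the learner predicted 1,
    otherwise None (nothing is observed).
    """
    mistake = int(prediction != true_label)
    feedback = true_label if prediction == 1 else None
    return mistake, feedback


# Toy run: a learner that always predicts 0 never sees a label,
# so it cannot detect its own mistakes.
labels = [0, 1, 1, 0]
always_zero = [apple_tasting_round(0, y) for y in labels]
always_one = [apple_tasting_round(1, y) for y in labels]
print(always_zero)
print(always_one)
```

The asymmetry is visible in the toy run: predicting 1 buys information at the risk of a mistake, which is the tension behind the intermediate $\Theta(\sqrt{T})$ regime mentioned in the abstract.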
https://proceedings.mlr.press/v247/raman24a.html Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Diffusions Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g., energy-based models). The idea is to fit the score of the distribution rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there’s a clear algorithmic benefit, the statistical cost can be steep: recent work by Koehler et al. (2022) showed that for distributions that have poor isoperimetric properties (a large Poincaré or log-Sobolev constant), score matching is substantially statistically less efficient than maximum likelihood. However, many natural realistic distributions, e.g., multimodal distributions as simple as a mixture of two Gaussians in one dimension, have a poor Poincaré constant. In this paper, we show a close connection between the mixing time of a broad class of Markov processes with generator L and stationary distribution p, and an appropriately chosen generalized score matching loss that tries to fit Op. In the special case of O being a gradient operator, and L being the generator of Langevin diffusion, this generalizes and recovers the results from Koehler et al. (2022). This allows us to adapt techniques to speed up Markov chains to construct better score-matching losses. In particular, "preconditioning" the diffusion can be translated to an appropriate "preconditioning" of the score loss. Lifting the chain by adding a temperature like in simulated tempering can be shown to result in a Gaussian-convolution annealed score matching loss, similar to Song and Ermon (2019).
Moreover, we show that if the distribution being learned is a finite mixture of Gaussians in d dimensions with a shared covariance, the sample complexity of annealed score matching is polynomial in the ambient dimension, the diameter of the means, and the smallest and largest eigenvalues of the covariance. To show this we bound the mixing time of a "continuously tempered" version of Langevin diffusion for mixtures, which is of standalone interest. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/qin24a.html
https://proceedings.mlr.press/v247/qin24a.html On the Distance from Calibration in Sequential Prediction We study a sequential binary prediction setting where the forecaster is evaluated in terms of the calibration distance, which is defined as the $L_1$ distance between the predicted values and the set of predictions that are perfectly calibrated in hindsight. This is analogous to a calibration measure recently proposed by Błasiok, Gopalan, Hu and Nakkiran (STOC 2023) for the offline setting. The calibration distance is a natural and intuitive measure of deviation from perfect calibration, and satisfies a Lipschitz continuity property which does not hold for many popular calibration measures, such as the $L_1$ calibration error and its variants. We prove that there is a forecasting algorithm that achieves an $O(\sqrt{T})$ calibration distance in expectation on an adversarially chosen sequence of $T$ binary outcomes. At the core of this upper bound is a structural result showing that the calibration distance is accurately approximated by the lower calibration distance, which is a continuous relaxation of the former. We then show that an $O(\sqrt{T})$ lower calibration distance can be achieved via a simple minimax argument and a reduction to online learning on a Lipschitz class. On the lower bound side, an $\Omega(T^{1/3})$ calibration distance is shown to be unavoidable, even when the adversary outputs a sequence of independent random bits, and has an additional ability to early stop (i.e., to stop producing random bits and output the same bit in the remaining steps). Interestingly, without this early stopping, the forecaster can achieve a much smaller calibration distance of $\mathrm{polylog}(T)$. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/qiao24a.html
https://proceedings.mlr.press/v247/qiao24a.html Dimension-free Structured Covariance Estimation Given a sample of i.i.d. high-dimensional centered random vectors, we consider the problem of estimating their covariance matrix $\Sigma$ under the additional assumption that $\Sigma$ can be represented as a sum of a few Kronecker products of smaller matrices. Under mild conditions, we derive the first non-asymptotic dimension-free high-probability bound on the Frobenius distance between $\Sigma$ and a widely used penalized permuted least squares estimate. Because of the hidden structure, the established rate of convergence is faster than in the standard covariance estimation problem. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/puchkin24a.html
https://proceedings.mlr.press/v247/puchkin24a.html Sample-Optimal Locally Private Hypothesis Selection and the Provable Benefits of Interactivity We study the problem of hypothesis selection under the constraint of local differential privacy. Given a class $\mathcal{F}$ of $k$ distributions and a set of i.i.d. samples from an unknown distribution $h$, the goal of hypothesis selection is to pick a distribution $\hat{f}$ whose total variation distance to $h$ is comparable with the best distribution in $\mathcal{F}$ (with high probability). We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k}{\alpha^2\min \{\varepsilon^2,1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f})\leq \alpha + 9 \min_{f\in \mathcal{F}}d_{TV}(h,f)$ with high probability. This sample complexity is optimal for $\varepsilon<1$, matching the lower bound of Gopi et al. (2020). All previously known algorithms for this problem required $\Omega\left(\frac{k\log k}{\alpha^2\min \{\varepsilon^2 ,1\}} \right)$ samples to work. Moreover, our result demonstrates the power of interaction for $\varepsilon$-LDP hypothesis selection. Namely, it breaks the known lower bound of $\Omega\left(\frac{k\log k}{\alpha^2 \varepsilon^2} \right)$ for the sample complexity of non-interactive hypothesis selection. Our algorithm achieves this using only $\Theta(\log \log k)$ rounds of interaction. To prove our results, we define the notion of \emph{critical queries} for a Statistical Query Algorithm (SQA), which may be of independent interest. Informally, an SQA is said to use a small number of critical queries if its success relies on the accuracy of only a small number of queries it asks. We then design an LDP algorithm that uses a smaller number of critical queries. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/pour24a.html
https://proceedings.mlr.press/v247/pour24a.html Smooth Lower Bounds for Differentially Private Algorithms via Padding-and-Permuting Fingerprinting Codes Fingerprinting arguments, first introduced by Bun, Ullman, and Vadhan (STOC 2014), are the most widely used method for establishing lower bounds on the sample complexity or error of approximately differentially private (DP) algorithms. Still, there are many problems in differential privacy for which we don’t know suitable lower bounds, and even for problems that we do, the lower bounds are not smooth, and usually become vacuous when the error is larger than some threshold. In this work, we present a new framework and tools to generate smooth lower bounds on the sample complexity of differentially private algorithms satisfying very weak accuracy. We illustrate the applicability of our method by providing new lower bounds in various settings: 1. A tight lower bound for DP averaging in the low-accuracy regime, which in particular implies a lower bound for the private 1-cluster problem introduced by Nissim, Stemmer, and Vadhan (PODS 2016). 2. A lower bound on the additive error of DP algorithms for approximate k-means clustering and general (k,z)-clustering, as a function of the multiplicative error, which is tight for a constant multiplicative error. 3. A lower bound for estimating the top singular vector of a matrix under DP in low-accuracy regimes, which is a special case of DP subspace estimation studied by Singhal and Steinke (NeurIPS 2021). Our main technique is to apply a padding-and-permuting transformation to a fingerprinting code. However, rather than proving our results using black-box access to an existing fingerprinting code (e.g., Tardos’ code), we develop a new fingerprinting lemma that is stronger than those of Dwork et al. (FOCS 2015) and Bun et al. (SODA 2017), and prove our lower bounds directly from the lemma.
Our lemma, in particular, gives a simpler fingerprinting code construction with optimal rate (up to polylogarithmic factors) that is of independent interest. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/peter24a.html
https://proceedings.mlr.press/v247/peter24a.html The Sample Complexity of Simple Binary Hypothesis Testing The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either: (i) the prior-free setting, with type-I error at most $\alpha$ and type-II error at most $\beta$; or (ii) the Bayesian setting, with Bayes error at most $\delta$ and prior distribution $(\alpha, 1-\alpha)$. This problem has only been studied when $\alpha = \beta$ (prior-free) or $\alpha = 1/2$ (Bayesian), and the sample complexity is known to be characterized by the Hellinger divergence between $p$ and $q$, up to multiplicative constants. In this paper, we derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0 \le \alpha, \beta \le 1/8$ in the prior-free setting; and (ii) all $\delta \le \alpha/4$ in the Bayesian setting. In particular, the formula admits equivalent expressions in terms of certain divergences from the Jensen–Shannon and Hellinger families. The main technical result concerns an $f$-divergence inequality between members of the Jensen–Shannon and Hellinger families, which is proved by a combination of information-theoretic tools and case-by-case analyses. We explore applications of our results to robust and distributed (locally-private and communication-constrained) hypothesis testing. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/pensia24a.html
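The hypothesis testing abstract notes that, in the previously studied symmetric case, the sample complexity is governed by the Hellinger divergence between $p$ and $q$ up to multiplicative constants. A minimal Python sketch for discrete distributions follows. The normalisation $h^2(p,q) = \tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$ and the use of $1/h^2$ as a rough scale are assumptions of this sketch (conventions and constants differ across references), and it does not implement the paper's general formula.

```python
import math


def squared_hellinger(p, q):
    """Squared Hellinger distance between two discrete distributions.

    Uses the convention h^2 = 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2,
    which lies in [0, 1].
    """
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


# Two nearby coin distributions: distinguishing them needs many samples.
p = [0.5, 0.5]
q = [0.6, 0.4]
h2 = squared_hellinger(p, q)

# Rough scale only: in the symmetric constant-error regime the number of
# samples grows like 1 / h^2, up to constants not modelled here.
rough_scale = 1.0 / h2
print(h2, rough_scale)
```

The closer the two distributions, the smaller `h2` and the larger the scale; the abstract's formula refines this picture when the error parameters are unequal or small.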
https://proceedings.mlr.press/v247/pensia24a.html The sample complexity of multi-distribution learning Multi-distribution learning generalizes classic PAC learning to handle data coming from multiple distributions. Given a set of $k$ data distributions and a hypothesis class of VC dimension $d$, the goal is to learn a hypothesis that minimizes the maximum population loss over the $k$ distributions, up to $\epsilon$ additive error. In this paper, we settle the sample complexity of multi-distribution learning by giving an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$. This matches the lower bound up to a sub-polynomial factor and resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/peng24b.html
https://proceedings.mlr.press/v247/peng24b.html The complexity of approximate (coarse) correlated equilibrium for incomplete information games We study the iteration complexity of decentralized learning of approximate correlated equilibria in incomplete information games. On the negative side, we prove that in extensive-form games, assuming $\mathsf{PPAD} \not\subset \mathsf{TIME}(n^{\mathrm{polylog}(n)})$, any polynomial-time learning algorithm must take at least $2^{\log_2^{1-o(1)}(|\mathcal{I}|)}$ iterations to converge to the set of $\epsilon$-approximate correlated equilibria, where $|\mathcal{I}|$ is the number of nodes in the game and $\epsilon > 0$ is an absolute constant. This nearly matches, up to the $o(1)$ term, the algorithms of (Peng and Rubinstein STOC’2024, Dagan et al. STOC’2024) for learning $\epsilon$-approximate correlated equilibrium, and resolves an open question of Anagnostides, Kalavasis, Sandholm, and Zampetakis (Anagnostides et al. ITCS 2024). Our lower bound holds even for the easier solution concept of $\epsilon$-approximate coarse correlated equilibrium. On the positive side, we give uncoupled dynamics that reach $\epsilon$-approximate correlated equilibria of a Bayesian game in polylogarithmic iterations, without any dependence on the number of types. This demonstrates a separation between Bayesian games and extensive-form games. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/peng24a.html
https://proceedings.mlr.press/v247/peng24a.html The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication Local SGD is a popular optimization method in distributed learning, often outperforming mini-batch SGD. Despite this practical success, proving the efficiency of local SGD has been difficult, creating a significant gap between theory and practice. We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing these assumptions cannot capture local SGD’s effectiveness. We also demonstrate the min-max optimality of accelerated mini-batch SGD under these assumptions. Our findings emphasize the need for improved modeling of data heterogeneity. Under higher-order assumptions, we provide new upper bounds that verify the dominance of local SGD over mini-batch SGD when data heterogeneity is low. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/patel24a.html
https://proceedings.mlr.press/v247/patel24a.html Depth Separation in Norm-Bounded Infinite-Width Neural Networks We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well even when the network width is unbounded. Here, we study separation in terms of the sample complexity required for learnability. Specifically, we show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm). We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/parkinson24a.html
https://proceedings.mlr.press/v247/parkinson24a.html Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations We study the statistical and computational complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with \textit{additive structure}, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(⟨x, v_m⟩)$, where $f_1,f_2,\dots,f_M:\mathbb{R}\to\mathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features $\{v_m\}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality as $M\asymp d^\gamma$ for $\gamma\ge 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural networks, and large-scale pretraining where the model simultaneously acquires a large number of “skills” that are often \textit{localized} in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with a polynomial statistical and computational complexity that depends on the number of tasks $M$ and the \textit{information exponent} of $f_m$, despite the unknown link function and $M$ growing with the dimensionality. We complement this learnability guarantee with a computational hardness result by establishing statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/oko24a.html
https://proceedings.mlr.press/v247/oko24a.html Robust Distribution Learning with Local and Global Adversarial Corruptions (extended abstract) We consider learning in an adversarial environment, where an $\varepsilon$-fraction of samples from a distribution $P$ are arbitrarily modified (\emph{global} corruptions) and the remaining perturbations have average magnitude bounded by $\rho$ (\emph{local} corruptions). Given access to $n$ such corrupted samples, we seek a computationally efficient estimator $\hat{P}_n$ that minimizes the Wasserstein distance $W_1(\hat{P}_n,P)$. In fact, we attack the fine-grained task of minimizing $W_1(\Pi_\sharp \hat{P}_n, \Pi_\sharp P)$ for all orthogonal projections $\Pi \in \mathbb{R}^{d \times d}$, with performance scaling with $\mathrm{rank}(\Pi) = k$. This allows us to account simultaneously for mean estimation ($k=1$), distribution estimation ($k=d$), as well as the settings interpolating between these two extremes. We characterize the optimal population-limit risk for this task and then develop an efficient finite-sample algorithm with error bounded by $\sqrt{\varepsilon k} + \rho + \tilde{O}(k\sqrt{d}n^{-1/k})$ when $P$ has bounded covariance. Our efficient procedure relies on a novel trace norm approximation of an ideal yet intractable 2-Wasserstein projection estimator. We apply this algorithm to robust stochastic optimization, and, in the process, uncover a new method for overcoming the curse of dimensionality in Wasserstein distributionally robust optimization. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/nietert24a.html
https://proceedings.mlr.press/v247/nietert24a.html Optimistic Information Directed Sampling We study the problem of online learning in contextual bandit problems where the loss function is assumed to belong to a known parametric function class. We propose a new analytic framework for this setting that bridges the Bayesian theory of information-directed sampling due to Russo and Van Roy (2018) and the worst-case theory of Foster et al. (2021) based on the decision-estimation coefficient. Drawing from both lines of work, we propose an algorithmic template called Optimistic Information-Directed Sampling and show that it can achieve instance-dependent regret guarantees similar to the ones achievable by the classic Bayesian IDS method, but with the major advantage of not requiring any Bayesian assumptions. The key technical innovation of our analysis is introducing an optimistic surrogate model for the regret and using it to define a frequentist version of the Information Ratio of Russo and Van Roy (2018), and a less conservative version of the Decision Estimation Coefficient of Foster et al. (2021). Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/neu24a.html
https://proceedings.mlr.press/v247/neu24a.htmlExact Mean Square Linear Stability Analysis for SGDThe dynamical stability of optimization methods in the vicinity of minima of the loss has recently attracted significant attention. For gradient descent (GD), stable convergence is possible only to minima that are sufficiently flat w.r.t. the step size, and those have been linked with favorable properties of the trained model. However, while the stability threshold of GD is well-known, to date, no explicit expression has been derived for the exact threshold of stochastic GD (SGD). In this paper, we derive such a closed-form expression. Specifically, we provide an explicit condition on the step size that is both necessary and sufficient for the linear stability of SGD in the mean square sense. Our analysis sheds light on the precise role of the batch size $B$. In particular, we show that the stability threshold is monotonically non-decreasing in the batch size, which means that reducing the batch size can only decrease stability. Furthermore, we show that SGD’s stability threshold is equivalent to that of a mixture process which takes in each iteration a full batch gradient step w.p. $1-p$, and a single sample gradient step w.p. $p$, where $p \approx 1/B$. This indicates that even with moderate batch sizes, SGD’s stability threshold is very close to that of GD. We also prove simple necessary conditions for linear stability, which depend on the batch size, and are easier to compute than the precise threshold. Finally, we derive the asymptotic covariance of the dynamics around the minimum, and discuss its dependence on the learning rate. We validate our theoretical findings through experiments on the MNIST dataset.Sun, 30 Jun 2024 00:00:00 +0000
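The abstract above refers to the well-known stability threshold of GD. As a minimal sketch of that classical fact only (not of the paper's SGD result), the snippet below runs GD on a diagonal quadratic loss $\frac{1}{2}\sum_i \lambda_i x_i^2$, for which the iterates stay bounded when the step size is below $2/\lambda_{\max}$ and blow up above it. The function names, the eigenvalues, and the divergence tolerance are illustrative assumptions, not taken from the paper.

```python
def gd_threshold(eigenvalues):
    """Classical GD stability threshold 2 / lambda_max for a quadratic loss."""
    return 2.0 / max(eigenvalues)


def gd_is_stable(eigenvalues, step_size, num_steps=500, tol=1e6):
    """Run GD on 0.5 * sum(lam_i * x_i**2) from x = (1, ..., 1).

    Returns False as soon as any coordinate exceeds `tol` in magnitude
    (treated here as divergence), and True otherwise.
    """
    x = [1.0 for _ in eigenvalues]
    for _ in range(num_steps):
        # Gradient of the quadratic is lam_i * x_i in each coordinate.
        x = [xi - step_size * lam * xi for xi, lam in zip(x, eigenvalues)]
        if max(abs(xi) for xi in x) > tol:
            return False
    return True


# Hypothetical Hessian spectrum, chosen only for illustration.
eigs = [0.5, 1.0, 4.0]
eta_star = gd_threshold(eigs)
stable_below = gd_is_stable(eigs, 0.9 * eta_star)
stable_above = gd_is_stable(eigs, 1.1 * eta_star)
print(eta_star, stable_below, stable_above)
```

The sharpest direction (largest eigenvalue) decides the outcome: each GD step multiplies that coordinate by $1 - \eta\lambda_{\max}$, whose magnitude exceeds one exactly when $\eta > 2/\lambda_{\max}$.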
https://proceedings.mlr.press/v247/mulayoff24a.html
https://proceedings.mlr.press/v247/mulayoff24a.htmlFinding Super-spreaders in Network CascadesSuppose that a cascade (e.g., an epidemic) spreads on an unknown graph, and only the infection times of vertices are observed. What can be learned about the graph from the infection times caused by multiple distinct cascades? Most of the literature on this topic focuses on the task of recovering the \emph{entire} graph, which requires $\Omega ( \log n)$ cascades for an $n$-vertex bounded degree graph. Here we ask a different question: can the important parts of the graph be estimated from just a few (i.e., a constant number of) cascades, even as $n$ grows large? In this work, we focus on identifying super-spreaders (i.e., high-degree vertices) from infection times caused by a Susceptible-Infected process on a graph. Our first main result shows that vertices of degree greater than $n^{3/4}$ can indeed be estimated from a constant number of cascades. Our algorithm for doing so leverages a novel connection between vertex degrees and the second derivative of the cumulative infection curve. Conversely, we show that estimating vertices of degree smaller than $n^{1/2}$ requires at least $\log(n) / \log \log (n)$ cascades. Surprisingly, this matches (up to $\log \log n$ factors) the number of cascades needed to learn the \emph{entire} graph if it is a tree.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/mossel24a.html
https://proceedings.mlr.press/v247/mossel24a.htmlFundamental Limits of Non-Linear Low-Rank Matrix EstimationWe consider the task of estimating a low-rank matrix from non-linear and noisy observations. We prove a strong universality result showing that Bayes-optimal performances are characterized by an equivalent Gaussian model with an effective prior, whose parameters are entirely determined by an expansion of the non-linear function. In particular, we show that to reconstruct the signal accurately, one requires a signal-to-noise ratio growing as \(N^{\frac 12 (1-1/k_F)}\), where \(k_F\) is the first non-zero Fisher information coefficient of the function. We provide an asymptotic characterization of the minimal achievable mean squared error (MMSE) and an approximate message-passing algorithm that reaches the MMSE under conditions analogous to the linear version of the problem. We also provide asymptotic errors achieved by methods such as principal component analysis combined with Bayesian denoising, and compare them with the Bayes-optimal MMSE.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/mergny24a.html
https://proceedings.mlr.press/v247/mergny24a.htmlFast, blind, and accurate: Tuning-free sparse regression with global linear convergenceMany algorithms for high-dimensional regression problems require the calibration of regularization hyperparameters. This, in turn, often requires the knowledge of the unknown noise variance in order to produce meaningful solutions. Recent works show, however, that there exist certain estimators that are pivotal, i.e., the regularization parameter does not depend on the noise level; the most remarkable example being the square-root lasso. Such estimators have also been shown to exhibit strong connections to distributionally robust optimization. Despite the progress in the design of pivotal estimators, the resulting minimization problem is challenging as both the loss function and the regularization term are non-smooth. To date, the design of fast, robust, and scalable algorithms with strong convergence rate guarantees is still an open problem. This work addresses this problem by showing that an iteratively reweighted least squares (IRLS) algorithm exhibits global linear convergence under the weakest assumption available in the literature. We expect our findings will also have implications for multi-task learning and distributionally robust optimization.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/mayrink-verdun24a.html
https://proceedings.mlr.press/v247/mayrink-verdun24a.htmlLow-degree phase transitions for detecting a planted clique in sublinear timeWe consider the problem of detecting a planted clique of size $k$ in a random graph on $n$ vertices. When the size of the clique exceeds $\Theta(\sqrt{n})$, polynomial-time algorithms for detection proliferate. We study faster—namely, sublinear time—algorithms in the high-signal regime when $k = \Theta(n^{1/2 + \delta})$, for some $\delta > 0$. To this end, we consider algorithms that non-adaptively query a subset $M$ of entries of the adjacency matrix and then compute a low-degree polynomial function of the revealed entries. We prove a computational phase transition for this class of \emph{non-adaptive low-degree algorithms}: under the scaling $\lvert M \rvert = \Theta(n^{\gamma})$, the clique can be detected when $\gamma > 3(1/2 - \delta)$ but not when $\gamma < 3(1/2 - \delta)$. As a result, the best known runtime for detecting a planted clique, $\widetilde{O}(n^{3(1/2-\delta)})$, cannot be improved without looking beyond the non-adaptive low-degree class. Our proof of the lower bound—based on bounding the conditional low-degree likelihood ratio—reveals further structure in non-adaptive detection of a planted clique. Using (a bound on) the conditional low-degree likelihood ratio as a potential function, we show that for \emph{every} non-adaptive query pattern, there is a highly structured query pattern of the same size that is at least as effective.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/mardia24a.html
https://proceedings.mlr.press/v247/mardia24a.htmlHarmonics of Learning: Universal Fourier Features Emerge in Invariant NetworksIn this work, we formally prove that, under certain conditions, if a neural network is invariant to a finite group then its weights recover the Fourier transform on that group. This provides a mathematical explanation for the emergence of Fourier features – a ubiquitous phenomenon in both biological and artificial learning systems. The results hold even for non-commutative groups, in which case the Fourier transform encodes all the irreducible unitary group representations. Our findings have consequences for the problem of symmetry discovery. Specifically, we demonstrate that the algebraic structure of an unknown group can be recovered from the weights of a network that is at least approximately invariant within certain bounds. Overall, this work contributes to a foundation for an algebraic learning theory of invariant neural network representations.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/marchetti24a.html
https://proceedings.mlr.press/v247/marchetti24a.htmlProjection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPsWe consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Given access to a generative model, we achieve rate-optimal sample complexity by performing a simple, \emph{perturbed} version of least-squares value iteration with orthogonal trigonometric polynomials as features. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our $\widetilde{O}(\epsilon^{-2-d/(\nu+1)})$ sample complexity, where $d$ is the dimension of the state-action space and $\nu$ the order of smoothness, recovers the state-of-the-art result of discretization approaches for the special case of Lipschitz MDPs $(\nu=0)$. At the same time, for $\nu\to\infty$, it recovers and greatly generalizes the $O(\epsilon^{-2})$ rate of low-rank MDPs, which are more amenable to regression approaches. In this sense, our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/maran24a.html
https://proceedings.mlr.press/v247/maran24a.htmlConvergence of Gradient Descent with Small Initialization for Unregularized Matrix CompletionWe study the problem of symmetric matrix completion, where the goal is to reconstruct a positive semidefinite matrix $X^\star \in \mathbb{R}^{d\times d}$ of rank $r$, parameterized by $UU^{\top}$, from only a subset of its observed entries. We show that vanilla gradient descent (GD) with small initialization provably converges to the ground truth $X^\star$ without requiring any explicit regularization. This convergence result holds true even in the over-parameterized scenario, where the true rank $r$ is unknown and conservatively over-estimated by a search rank $r'\gg r$. The existing results for this problem either require explicit regularization, a sufficiently accurate initial point, or exact knowledge of the true rank $r$. In the over-parameterized regime where $r'\geq r$, we show that, with $\widetilde\Omega(dr^9)$ observations, GD with an initial point $\|U_0\| \leq O(\epsilon)$ converges near-linearly to an $\epsilon$-neighborhood of $X^\star$. Consequently, smaller initial points result in increasingly accurate solutions. Surprisingly, neither the convergence rate nor the final accuracy depends on the over-parameterized search rank $r'$, and they are only governed by the true rank $r$. In the exactly-parameterized regime where $r'=r$, we further enhance this result by proving that GD converges at a faster rate to achieve an arbitrarily small accuracy $\epsilon>0$, provided the initial point satisfies $\|U_0\| = O(1/d)$. At the crux of our method lies a novel weakly-coupled leave-one-out analysis, which allows us to establish the global convergence of GD, extending beyond what was previously possible using the classical leave-one-out analysis.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/ma24a.html
https://proceedings.mlr.press/v247/ma24a.htmlLinear bandits with polylogarithmic minimax regret We study a noise model for linear stochastic bandits for which the subgaussian noise parameter vanishes linearly as we select actions on the unit sphere closer and closer to the unknown vector. We introduce an algorithm for this problem that exhibits a minimax regret scaling as $\log^3(T)$ in the time horizon $T$, in stark contrast to the square-root scaling of this regret for typical bandit algorithms. Our strategy, based on weighted least-squares estimation, achieves the eigenvalue relation $\lambda_{\min} ( V_t ) = \Omega (\sqrt{\lambda_{\max}(V_t ) })$ for the design matrix $V_t$ at each time step $t$ through geometrical arguments that are independent of the noise model and might be of independent interest. This allows us to tightly control the expected regret in each time step to be of the order $O(\frac1{t})$, leading to the logarithmic scaling of the cumulative regret.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/lumbreras24a.html
https://proceedings.mlr.press/v247/lumbreras24a.htmlAutobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing DynamicsWe study a game between autobidding algorithms that compete in an online advertising platform. Each autobidder is tasked with maximizing its advertiser’s total value over multiple rounds of a repeated auction, subject to budget and return-on-investment constraints. We propose a gradient-based learning algorithm that is guaranteed to satisfy all constraints and achieves vanishing individual regret. Our algorithm uses only bandit feedback and can be used with the first- or second-price auction, as well as with any “intermediate” auction format. Our main result is that when these autobidders play against each other, the resulting expected liquid welfare over all rounds is at least half of the expected optimal liquid welfare achieved by any allocation. Our analysis holds whether or not the bidding dynamics converge to an equilibrium, side-stepping the dearth of provable convergence guarantees in the literature and the hardness result (Chen et al., 2021) which precludes such guarantees for budget-constrained second-price auctions. Our vanishing-regret result extends to an adversarial environment, without any assumptions on the other agents. We adopt a non-standard benchmark: the sequence of bids such that each bid $b_t$ maximizes value for the environment in round $t$. Hence, we side-step the impossibility results for the standard benchmark of best fixed bid (Balseiro and Gur, 2019). When there is only a budget constraint, our algorithm specializes to the autobidding algorithm from Balseiro and Gur (2019), and our guarantees specialize to the regret and liquid welfare guarantees from Gaitonde et al. (2023). While our approach to bounding liquid welfare shares a common high-level strategy with Gaitonde et al. (2023), handling the ROI constraint, and particularly both constraints jointly, introduces a variety of new technical challenges. These challenges necessitate a new algorithm, changes to the way liquid welfare bounds are established, and a different methodology for establishing regret properties.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/lucier24a.html
https://proceedings.mlr.press/v247/lucier24a.htmlThe Predicted-Updates Dynamic Model: Offline, Incremental, and Decremental to Fully Dynamic TransformationsThe main bottleneck in designing efficient dynamic algorithms is the unknown nature of the update sequence. In particular, there are problems where the separation in runtime between the best offline or partially dynamic solutions and the best fully dynamic solutions is polynomial, sometimes even exponential. In this paper, we formulate the \emph{predicted-updates dynamic model}, one of the first \emph{beyond-worst-case} models for dynamic algorithms, which generalizes a large set of well-studied dynamic models including the offline dynamic, incremental, and decremental models to the fully dynamic setting when given predictions about the update times of the elements. Our paper models real-world settings, in which we often have access to side information that allows us to make coarse predictions about future updates. We formulate a framework that bridges the gap between fully dynamic and offline/partially dynamic algorithms, leading to greatly improved runtime bounds over the state-of-the-art dynamic algorithms for a variety of important problems such as triconnectivity, planar digraph all pairs shortest paths, \(k\)-edge connectivity, and others, for prediction error of reasonable magnitude. Our simple framework avoids heavy machinery, potentially leading to a new set of dynamic algorithms that are implementable in practice.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/liu24c.html
https://proceedings.mlr.press/v247/liu24c.htmlSpatial properties of Bayesian unsupervised treesTree-based methods are popular nonparametric tools for capturing spatial heterogeneity and making predictions in multivariate problems. In unsupervised learning, trees and their ensembles have also been applied to a wide range of statistical inference tasks, such as multi-resolution sketching of distributional variations, localization of high-density regions, and design of efficient data compression schemes. In this paper, we study the spatial adaptation property of Bayesian tree-based methods in the unsupervised setting, with a focus on the density estimation problem. We characterize spatial heterogeneity of the underlying density function by using anisotropic Besov spaces, region-wise anisotropic Besov spaces, and two novel function classes as their extensions. For two types of commonly used prior distributions on trees in the context of unsupervised learning, namely the optional Pólya tree (Wong and Ma, 2010) and the Dirichlet prior (Lu et al., 2013), we calculate posterior concentration rates when the density function exhibits different types of heterogeneity. Specifically, we show that the posterior concentration rate for trees is near minimax over the anisotropic Besov space. The rate is adaptive in the sense that to achieve such a rate we do not need any prior knowledge of the parameters of the Besov space.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/liu24b.html
https://proceedings.mlr.press/v247/liu24b.htmlThe role of randomness in quantum state certification with unentangled measurementsGiven $n$ copies of an unknown quantum state $\rho\in\mathbb{C}^{d\times d}$, quantum state certification is the task of determining whether $\rho=\rho_0$ or $\|\rho-\rho_0\|_1>\varepsilon$, where $\rho_0$ is a known reference state. We study quantum state certification using unentangled quantum measurements, namely measurements which operate only on one copy of $\rho$ at a time. When there is a common source of randomness available and the unentangled measurements are chosen based on this randomness, prior work has shown that $\Theta(d^{3/2}/\varepsilon^2)$ copies are necessary and sufficient. This holds even when the measurements are allowed to be chosen adaptively. We consider deterministic measurement schemes (as opposed to randomized) and demonstrate that ${\Theta}(d^2/\varepsilon^2)$ copies are necessary and sufficient for state certification. This shows a separation between algorithms with and without randomness. We develop a lower bound framework for both fixed and randomized measurements that relates the hardness of testing to the well-established Lüders rule. More precisely, we obtain lower bounds for randomized and fixed schemes as a function of the eigenvalues of the Lüders channel which characterizes one possible post-measurement state transformation.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/liu24a.html
https://proceedings.mlr.press/v247/liu24a.htmlOnline Policy Optimization in Unknown Nonlinear SystemsWe study online policy optimization in nonlinear time-varying systems where the true dynamical models are unknown to the controller. This problem is challenging because, unlike in linear systems, the controller cannot obtain globally accurate estimations of the ground-truth dynamics using local exploration. We propose a meta-framework that combines a general online policy optimization algorithm (\texttt{ALG}) with a general online estimator of the dynamical system’s model parameters (\texttt{EST}). We show that if the hypothetical joint dynamics induced by \texttt{ALG} with \emph{known} parameters satisfies several desired properties, the joint dynamics under \emph{inexact} parameters from \texttt{EST} will be robust to errors. Importantly, the final regret only depends on \texttt{EST}’s predictions on the visited trajectory, which relaxes a bottleneck on identifying the true parameters globally. To demonstrate our framework, we develop a computationally efficient variant of Gradient-based Adaptive Policy Selection, called Memoryless GAPS (M-GAPS), and use it to instantiate \texttt{ALG}. Combining M-GAPS with online gradient descent to instantiate \texttt{EST} yields (to our knowledge) the first local regret bound for online policy optimization in nonlinear time-varying systems with unknown dynamics.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/lin24a.html
https://proceedings.mlr.press/v247/lin24a.htmlOptimistic Rates for Learning from Label ProportionsWe consider a weakly supervised learning problem called Learning from Label Proportions (LLP), where examples are grouped into "bags" and only the average label within each bag is revealed to the learner. We study various learning rules for LLP that achieve PAC learning guarantees for classification loss. We establish that the classical Empirical Proportional Risk Minimization (EPRM) learning rule (Yu et al., 2014) achieves fast rates under realizability, but EPRM and similar proportion matching learning rules can fail in the agnostic setting. We also show that (1) a debiased proportional square loss, as well as (2) a recently proposed EasyLLP learning rule (Busa-Fekete et al., 2023) both achieve "optimistic rates" (Panchenko, 2002); in both the realizable and agnostic settings, their sample complexity is optimal (up to log factors) in terms of $\epsilon, \delta$, and VC dimension.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/li24b.html
https://proceedings.mlr.press/v247/li24b.htmlMinimax-optimal reward-agnostic exploration in reinforcement learningThis paper studies reward-agnostic exploration in reinforcement learning (RL), a scenario where the learner is unaware of the reward functions during the exploration stage, and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of $\frac{SAH^3}{\varepsilon^2}$ sample episodes (up to log factors) without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factors), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed “reward-free exploration.” The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/li24a.html
https://proceedings.mlr.press/v247/li24a.htmlFollow-the-Perturbed-Leader with Fréchet-type Tail Distributions: Optimality in Adversarial Bandits and Best-of-Both-WorldsThis paper studies the optimality of the Follow-the-Perturbed-Leader (FTPL) policy in both adversarial and stochastic $K$-armed bandits. Despite the widespread use of the Follow-the-Regularized-Leader (FTRL) framework with various choices of regularization, the FTPL framework, which relies on random perturbations, has not received much attention, despite its inherent simplicity. In adversarial bandits, it has been conjectured that FTPL could potentially achieve $\mathcal{O}(\sqrt{KT})$ regret if perturbations follow a distribution with a Fréchet-type tail. Recent work by Honda et al. (2023) showed that FTPL with the Fréchet distribution with shape $\alpha=2$ indeed attains this bound and, notably, logarithmic regret in stochastic bandits, establishing the Best-of-Both-Worlds (BOBW) capability of FTPL. However, this result only partly resolves the above conjecture because their analysis heavily relies on the specific form of the Fréchet distribution with this shape. In this paper, we establish a sufficient condition for perturbations to achieve $\mathcal{O}(\sqrt{KT})$ regret in the adversarial setting, which covers, e.g., Fréchet, Pareto, and Student-$t$ distributions. We also demonstrate the BOBW achievability of FTPL with certain Fréchet-type tail distributions. Our results contribute not only to resolving existing conjectures through the lens of extreme value theory but also potentially offer insights into the effect of the regularization functions in FTRL through the mapping from FTPL to FTRL.Sun, 30 Jun 2024 00:00:00 +0000
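To make the FTPL selection rule in the abstract above concrete, here is a minimal sketch of one arm choice with Fréchet perturbations. It draws each perturbation by inverse-transform sampling from the Fréchet CDF $\exp(-x^{-\alpha})$ and picks the arm minimizing the scaled cumulative loss estimate minus its perturbation, which is one common way to write FTPL. The function names, the learning rate, and the loss values are illustrative assumptions; the importance-weighted loss estimation needed under bandit feedback is deliberately omitted, so this is not the algorithm analyzed in the paper.

```python
import math
import random


def sample_frechet(alpha, rng):
    """Draw from the Frechet distribution with CDF exp(-x**(-alpha)), x > 0.

    Inverse-transform sampling: if U is uniform on (0, 1), then
    (-log U) ** (-1 / alpha) has the desired distribution.
    """
    u = rng.random()
    while u <= 0.0:  # rng.random() lies in [0, 1); avoid log(0)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


def ftpl_choose_arm(cum_loss_estimates, learning_rate, alpha, rng):
    """Pick the arm minimizing learning_rate * loss_estimate - perturbation."""
    scores = [
        learning_rate * loss - sample_frechet(alpha, rng)
        for loss in cum_loss_estimates
    ]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical cumulative loss estimates: arm 0 is far better than the others.
rng = random.Random(0)
losses = [0.0, 1000.0, 1000.0]
picks = [ftpl_choose_arm(losses, 1.0, 2.0, rng) for _ in range(200)]
samples = [sample_frechet(2.0, rng) for _ in range(1000)]
print(picks.count(0), min(samples))
```

The heavy right tail of the Fréchet distribution is what keeps exploration alive: a clearly worse arm is still chosen occasionally, whenever its perturbation is large enough to offset the loss gap.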
https://proceedings.mlr.press/v247/lee24a.html
https://proceedings.mlr.press/v247/lee24a.htmlInherent limitations of dimensions for characterizing learnability of distribution classes We consider the long-standing question of finding a parameter of a class of probability distributions that characterizes its PAC learnability. While for many learning tasks (such as binary classification and online learning) there is a notion of dimension whose finiteness is equivalent to learnability within any level of accuracy, we show, rather surprisingly, that such a parameter does not exist for distribution learning. Concretely, our results apply to several general notions of characterizing learnability and to several learning tasks. We show that there is no notion of dimension that characterizes the sample complexity of learning distribution classes. We then consider the weaker requirement of only characterizing learnability (rather than the quantitative sample complexity function). We propose some natural requirements for such a characterization and go on to show that there exists no characterization of learnability that satisfies these requirements for classes of distributions. Furthermore, we show that our results hold for various other learning problems. In particular, we show that there is no notion of dimension characterizing PAC-learnability for any of the tasks: classification learning w.r.t. a restricted set of marginal distributions and learnability of classes of real-valued functions with continuous losses.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/lechner24a.html
https://proceedings.mlr.press/v247/lechner24a.htmlBetter-than-KL PAC-Bayes Bounds Let $f(\theta, X_1), \dots, f(\theta, X_n)$ be a sequence of random elements, where $f$ is a fixed scalar function, $X_1, \dots, X_n$ are independent random variables (data), and $\theta$ is a random parameter distributed according to some data-dependent \emph{posterior} distribution $P_n$. In this paper, we consider the problem of proving concentration inequalities to estimate the mean of the sequence. An example of such a problem is the estimation of the generalization error of some predictor trained by a stochastic algorithm, such as a neural network, where $f$ is a loss function. Classically, this problem is approached through a \emph{PAC-Bayes} analysis where, in addition to the posterior, we choose a \emph{prior} distribution which captures our belief about the inductive bias of the learning problem. Then, the key quantity in PAC-Bayes concentration bounds is a divergence that captures the \emph{complexity} of the learning problem where the de facto standard choice is the Kullback-Leibler (KL) divergence. However, the tightness of this choice has rarely been questioned. In this paper, we challenge the tightness of the KL-divergence-based bounds by showing that it is possible to achieve a strictly tighter bound. In particular, we demonstrate new \emph{high-probability} PAC-Bayes bounds with a novel and \emph{better-than-KL} divergence that is inspired by Zhang et al. (2022). Our proof is inspired by recent advances in regret analysis of gambling algorithms, and their use in deriving concentration inequalities. Our result is first-of-its-kind in that existing PAC-Bayes bounds with non-KL divergences are not known to be strictly better than KL. Thus, we believe our work marks the first step towards identifying optimal rates of PAC-Bayes bounds.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kuzborskij24a.html
https://proceedings.mlr.press/v247/kuzborskij24a.htmlAccelerated Parameter-Free Stochastic OptimizationWe propose a method that achieves near-optimal rates for \emph{smooth} stochastic convex optimization and requires essentially no prior knowledge of problem parameters. This improves on prior work which requires knowing at least the initial distance to optimality $d_0$. Our method, \textsc{U-DoG}, combines \textsc{UniXGrad} (Kavis et al., 2019) and \textsc{DoG} (Ivgi et al., 2023) with novel iterate stabilization techniques. It requires only loose bounds on $d_0$ and the noise magnitude, provides high probability guarantees under sub-Gaussian noise, and is also near-optimal in the non-smooth case. Our experiments show consistent, strong performance on convex problems and mixed results on neural network training.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kreisler24a.html
https://proceedings.mlr.press/v247/kreisler24a.htmlSimple online learning with consistent oracleWe consider online learning in the model where a learning algorithm can access the class only via the \emph{consistent oracle}: an oracle that, at any moment, can give a function from the class that agrees with all examples seen so far. This model was recently considered by Assos et al. (COLT’23). It is motivated by the fact that standard methods of online learning rely on computing the Littlestone dimension of subclasses, a computationally intractable problem. Assos et al. gave an online learning algorithm in this model that makes at most $C^d$ mistakes on classes of Littlestone dimension $d$, for some absolute unspecified constant $C > 0$. We give a novel algorithm that makes at most $O(256^d)$ mistakes. Our proof is significantly simpler and uses only very basic properties of the Littlestone dimension. We also show that there exists no algorithm in this model that makes fewer than $3^d$ mistakes. Our algorithm (as well as the algorithm of Assos et al.) solves an open problem of Hasrati and Ben-David (ALT’23). Namely, it demonstrates that every class of finite Littlestone dimension with recursively enumerable representation admits a computable online learner (that may be undefined on unrealizable samples).Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kozachinskiy24a.html
https://proceedings.mlr.press/v247/kozachinskiy24a.htmlOpen Problem: Anytime Convergence Rate of Gradient DescentRecent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$?Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kornowski24a.html
https://proceedings.mlr.press/v247/kornowski24a.htmlGaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling The connections between (convex) optimization and (logconcave) sampling have been considerably enriched in the past decade with many conceptual and mathematical analogies. For instance, the Langevin algorithm can be viewed as a sampling analogue of gradient descent and has condition-number-dependent guarantees on its performance. In the early 1990s, Nesterov and Nemirovski developed the Interior-Point Method (IPM) for convex optimization based on self-concordant barriers, providing efficient algorithms for structured convex optimization, often faster than the general method. This raises the following question: can we develop an analogous IPM for structured sampling problems? In 2012, Kannan and Narayanan proposed the Dikin walk for uniformly sampling polytopes, and an improved analysis was given in 2020 by Laddha-Lee-Vempala. The Dikin walk uses a local metric defined by a self-concordant barrier for linear constraints. Here we generalize this approach by developing and adapting IPM machinery together with the Dikin walk for poly-time sampling algorithms. Our IPM-based sampling framework provides an efficient warm start and goes beyond uniform distributions and linear constraints. We illustrate the approach on important special cases, in particular giving the fastest algorithms to sample uniform, exponential, or Gaussian distributions on a truncated PSD cone. The framework is general and can be applied to other sampling algorithms.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kook24b.html
https://proceedings.mlr.press/v247/kook24b.htmlSampling from the Mean-Field Stationary Distribution We study the complexity of sampling from the stationary distribution of a mean-field SDE, or equivalently, the complexity of minimizing a functional over the space of probability measures which includes an interaction term. Our main insight is to decouple the two key aspects of this problem: (1) approximation of the mean-field SDE via a finite-particle system, via uniform-in-time propagation of chaos, and (2) sampling from the finite-particle stationary distribution, via standard log-concave samplers. Our approach is conceptually simpler and its flexibility allows for incorporating the state-of-the-art for both algorithms and theory. This leads to improved guarantees in numerous settings, including better guarantees for optimizing certain two-layer neural networks in the mean-field regime.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kook24a.html
https://proceedings.mlr.press/v247/kook24a.htmlConvergence of Kinetic Langevin Monte Carlo on Lie groups Explicit, momentum-based dynamics for optimizing functions defined on Lie groups were recently constructed, based on techniques such as variational optimization and left trivialization. We appropriately add tractable noise to the optimization dynamics to turn it into a sampling dynamics, leveraging the advantageous feature that the trivialized momentum variable is Euclidean even though the potential function lives on a manifold. We then propose a Lie-group MCMC sampler, by delicately discretizing the resulting kinetic-Langevin-type sampling dynamics. The Lie group structure is exactly preserved by this discretization. Exponential convergence with an explicit convergence rate for both the continuous dynamics and the discrete sampler is then proved under $W_2$ distance. Only compactness of the Lie group and geodesic $L$-smoothness of the potential function are needed. To the best of our knowledge, this is the first convergence result for kinetic Langevin on curved spaces, and also the first quantitative result that requires no convexity or, at least not explicitly, any common relaxation such as isoperimetry.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kong24a.html
https://proceedings.mlr.press/v247/kong24a.htmlSuperconstant Inapproximability of Decision Tree LearningWe consider the task of properly PAC learning decision trees with queries. Recent work of Koch, Strassle, and Tan showed that the strictest version of this task, where the hypothesis tree T is required to be optimally small, is NP-hard. Their work leaves open the question of whether the task remains intractable if T is only required to be close to optimal, say within a factor of 2, rather than exactly optimal. We answer this affirmatively and show that the task indeed remains NP-hard even if T is allowed to be within any constant factor of optimal. More generally, our result allows for a smooth tradeoff between the hardness assumption and inapproximability factor. As Koch et al.’s techniques do not appear to be amenable to such a strengthening, we first recover their result with a new and simpler proof, which we couple with a new XOR lemma for decision trees. While there is a large body of work on XOR lemmas for decision trees, our setting necessitates parameters that are extremely sharp and are not known to be attainable by existing such lemmas. Our work also carries new implications for the related problem of Decision Tree Minimization. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/koch24a.html
https://proceedings.mlr.press/v247/koch24a.htmlLearning Intersections of Halfspaces with Distribution Shift: Improved Algorithms and SQ Lower Bounds Recent work of Klivans, Stavropoulos, and Vasilyan initiated the study of testable learning with distribution shift (TDS learning), where a learner is given labeled samples from training distribution $\mathcal{D}$, unlabeled samples from test distribution $\mathcal{D}'$, and the goal is to output a classifier with low error on $\mathcal{D}'$ whenever the training samples pass a corresponding test. Their model deviates from all prior work in that no assumptions are made on $\mathcal{D}'$. Instead, the test must accept (with high probability) when the marginals of the training and test distributions are equal. Here we focus on the fundamental case of intersections of halfspaces with respect to Gaussian training distributions and prove a variety of new upper bounds including a $2^{(k/\epsilon)^{O(1)}} \mathsf{poly}(d)$-time algorithm for TDS learning intersections of $k$ homogeneous halfspaces to accuracy $\epsilon$ (prior work achieved $d^{(k/\epsilon)^{O(1)}}$). We work under the mild assumption that the Gaussian training distribution contains at least an $\epsilon$ fraction of both positive and negative examples ($\epsilon$-balanced). We also prove the first set of SQ lower-bounds for any TDS learning problem and show (1) the $\epsilon$-balanced assumption is necessary for $\mathsf{poly}(d,1/\epsilon)$-time TDS learning for a single halfspace and (2) a $d^{\tilde{\Omega}(\log 1/\epsilon)}$ lower bound for the intersection of two general halfspaces, even with the $\epsilon$-balanced assumption. Our techniques significantly expand the toolkit for TDS learning. We use dimension reduction and coverings to give efficient algorithms for computing a localized version of discrepancy distance, a key metric from the domain adaptation literature.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/klivans24b.html
https://proceedings.mlr.press/v247/klivans24b.htmlTestable Learning with Distribution Shift We revisit the fundamental problem of learning with distribution shift, in which a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D’ and is asked to output a classifier with low test error. The standard approach in this setting is to bound the loss of a classifier in terms of some notion of distance between D and D’. These distances, however, seem difficult to compute and do not lead to efficient algorithms. We depart from this paradigm and define a new model called testable learning with distribution shift, where we can obtain provably efficient algorithms for certifying the performance of a classifier on a test distribution. In this model, a learner outputs a classifier with low test error whenever samples from D and D’ pass an associated test; moreover, the test must accept (with high probability) if the marginal of D equals the marginal of D’. We give several positive results for learning well-studied concept classes such as halfspaces, intersections of halfspaces, and decision trees when the marginal of D is Gaussian or uniform on the hypercube. Prior to our work, no efficient algorithms for these basic cases were known without strong assumptions on D’. For halfspaces in the realizable case (where there exists a halfspace consistent with both D and D’), we combine a moment-matching approach with ideas from active learning to simulate an efficient oracle for estimating disagreement regions. To extend to the non-realizable setting, we apply recent work from testable (agnostic) learning. More generally, we prove that any function class with low-degree $\mathcal{L}_2$-sandwiching polynomial approximators can be learned in our model. 
Since we require $\mathcal{L}_2$-sandwiching (instead of the usual $\mathcal{L}_1$ loss), we cannot directly appeal to convex duality and instead apply constructions from the pseudorandomness literature to obtain the required approximators. We also provide lower bounds to show that the guarantees we obtain on the performance of our output hypotheses are best possible up to constant factors, as well as a separation showing that realizable learning in our model is incomparable to (ordinary) agnostic learning.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/klivans24a.html
https://proceedings.mlr.press/v247/klivans24a.htmlLasso with Latents: Efficient Estimation, Covariate Rescaling, and Computational-Statistical GapsIt is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impossible to close this gap in general. In this work, we propose a natural sparse linear regression setting where strong correlations between covariates arise from unobserved latent variables. In this setting, we analyze the problem caused by strong correlations and design a surprisingly simple fix. While Lasso with standard normalization of covariates fails, there exists a heterogeneous scaling of the covariates with which Lasso will suddenly obtain strong provable guarantees for estimation. Moreover, we design a simple, efficient procedure for computing such a “smart scaling.” The sample complexity of the resulting “rescaled Lasso” algorithm incurs (in the worst case) quadratic dependence on the sparsity of the underlying signal. While this dependence is not information-theoretically necessary, we give evidence that it is optimal among the class of polynomial-time algorithms, via the method of low-degree polynomials. This argument reveals a new connection between sparse linear regression and a special version of sparse PCA with a \emph{near-critical negative spike}. The latter problem can be thought of as a real-valued analogue of learning a sparse parity. Using it, we also establish the first computational-statistical gap for the closely related problem of learning a Gaussian Graphical Model.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/kelner24a.html
https://proceedings.mlr.press/v247/kelner24a.htmlChoosing the p in Lp Loss: Adaptive Rates for Symmetric Mean EstimationWhen we have a univariate distribution that is symmetric around its mean, the mean can be estimated with a rate (sample complexity) much faster than $O(1/\sqrt{n})$ in many cases. For example, given univariate random variables $Y_1, \ldots, Y_n$ distributed uniformly on $[\theta_0 - c, \theta_0 + c]$, the sample midrange $\frac{Y_{(n)}+Y_{(1)}}{2}$ maximizes likelihood and has expected error $\mathbb{E}\bigl| \theta_0 - \frac{Y_{(n)}+Y_{(1)}}{2} \bigr| \leq 2c/n$, which is optimal and much lower than the error rate $O(1/\sqrt{n})$ of the sample mean. What the optimal rate is depends on the distribution and it is generally attained by the maximum likelihood estimator (MLE). However, MLE requires exact knowledge of the underlying distribution; if the underlying distribution is \emph{unknown}, it is an open question whether an estimator can adapt to the optimal rate. In this paper, we propose an estimator of the symmetric mean $\theta_0$ with the following properties: it requires no knowledge of the underlying distribution; it has a rate no worse than $1/\sqrt{n}$ in all cases (assuming a finite second moment) and, when the underlying distribution is compactly supported, our estimator can attain a rate of $n^{-\frac{1}{{\alpha}}}$ up to polylog factors, where the rate parameter $\alpha$ can take on any value in $(0, 2]$ and depends on the moments of the underlying distribution. Our estimator is formed by minimizing the $L_\gamma$-loss with respect to the data, for a power $\gamma \geq 2$ chosen in a data-driven way – by minimizing a criterion motivated by the asymptotic variance. Our approach can be directly applied to the regression setting where $\theta_0$ is a function of observed features and motivates the use of $L_\gamma$ loss function with a data-driven $\gamma$ in certain settings. Sun, 30 Jun 2024 00:00:00 +0000
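The uniform example in the abstract above can be checked numerically. The following Python sketch is an illustration only and is not the paper's estimator: it compares the sample midrange with the sample mean on uniform data. The function name, the seed, and the values of `theta0`, `c`, `n`, and `trials` are assumptions chosen for the demonstration.

```python
import random
import statistics


def compare_estimators(theta0=0.0, c=1.0, n=1000, trials=200, seed=0):
    """Average absolute error of the sample midrange and the sample mean
    for Y_1..Y_n drawn uniformly from [theta0 - c, theta0 + c]."""
    rng = random.Random(seed)
    midrange_errs, mean_errs = [], []
    for _ in range(trials):
        ys = [rng.uniform(theta0 - c, theta0 + c) for _ in range(n)]
        midrange = (max(ys) + min(ys)) / 2
        midrange_errs.append(abs(midrange - theta0))
        mean_errs.append(abs(statistics.fmean(ys) - theta0))
    return statistics.fmean(midrange_errs), statistics.fmean(mean_errs)


midrange_err, mean_err = compare_estimators()
```

With these settings the midrange error should be on the order of $c/n$ and the mean error on the order of $c/\sqrt{n}$, which is the gap in rates that the abstract describes.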
https://proceedings.mlr.press/v247/kao24a.html
https://proceedings.mlr.press/v247/kao24a.htmlSmaller Confidence Intervals From IPW Estimators via Data-Dependent Coarsening (Extended Abstract)Inverse propensity-score weighted (IPW) estimators are prevalent in causal inference for estimating average treatment effects in observational studies. Under unconfoundedness, given accurate propensity scores and $n$ samples, the size of confidence intervals of IPW estimators scales down with $n$, and several of their variants improve the rate of scaling. However, neither IPW estimators nor their variants are robust to inaccuracies: even if a single covariate has an $\epsilon>0$ additive error in the propensity score, the size of confidence intervals of these estimators can increase arbitrarily. Moreover, even without errors, the rate with which the confidence intervals of these estimators go to zero with $n$ can be arbitrarily slow in the presence of extreme propensity scores (those close to 0 or 1). We introduce a family of Coarse IPW (CIPW) estimators that captures existing IPW estimators and their variants. Each CIPW estimator is an IPW estimator on a coarsened covariate space, where certain covariates are merged. Under mild assumptions, e.g., Lipschitzness in expected outcomes and sparsity of extreme propensity scores, we give an efficient algorithm to find a robust estimator: given $\epsilon$-inaccurate propensity scores and $n$ samples, its confidence interval size scales with $\epsilon+(1/\sqrt{n})$. In contrast, under the same assumptions, existing estimators’ confidence interval sizes are $\Omega(1)$ irrespective of $\epsilon$ and $n$. Crucially, our estimator is data-dependent and we show that no data-independent CIPW estimator can be robust to inaccuracies. Sun, 30 Jun 2024 00:00:00 +0000
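To make the role of extreme propensity scores concrete, here is a small Python sketch of the standard IPW estimate of the average treatment effect. It shows the baseline estimator only, not the CIPW construction from the paper. The function name `ipw_ate` and the toy data are hypothetical.

```python
def ipw_ate(samples):
    """Standard IPW estimate of the average treatment effect.

    samples: list of (treated, outcome, propensity), with treated in {0, 1}
    and propensity = P(treated = 1 | covariates) strictly inside (0, 1).
    """
    total = 0.0
    for t, y, e in samples:
        # Treated units are weighted by 1/e, control units by 1/(1 - e).
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(samples)


# Balanced toy data: every unit has propensity 0.5.
balanced = [(1, 3.0, 0.5), (1, 3.0, 0.5), (0, 1.0, 0.5), (0, 1.0, 0.5)]
estimate = ipw_ate(balanced)

# Adding one treated unit with propensity 0.01 gives it weight 100,
# so that single unit dominates the estimate.
extreme = balanced + [(1, 3.0, 0.01)]
estimate_extreme = ipw_ate(extreme)
```

The second estimate is driven almost entirely by one sample. This is the sensitivity to propensity scores close to 0 or 1 that motivates merging covariates in a coarsened space.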
https://proceedings.mlr.press/v247/kalavasis24a.html
https://proceedings.mlr.press/v247/kalavasis24a.htmlSome Constructions of Private, Efficient, and Optimal $K$-Norm and Elliptic Gaussian NoiseDifferentially private computation often begins with a bound on some $d$-dimensional statistic’s $\ell_p$ sensitivity. For pure differential privacy, the $K$-norm mechanism can improve on this approach using a norm tailored to the statistic’s sensitivity space. Writing down a closed-form description of this optimal norm is often straightforward. However, running the $K$-norm mechanism reduces to uniformly sampling the norm’s unit ball; this ball is a $d$-dimensional convex body, so general sampling algorithms can be slow. Turning to concentrated differential privacy, elliptic Gaussian noise offers similar improvement over spherical Gaussian noise. Once the shape of this ellipse is determined, sampling is easy; however, identifying the best such shape may be hard. This paper solves both problems for the simple statistics of sum, count, and vote. For each statistic, we provide a sampler for the optimal $K$-norm mechanism that runs in time $\tilde O(d^2)$ and derive a closed-form expression for the optimal shape of elliptic Gaussian noise. The resulting algorithms all yield meaningful accuracy improvements while remaining fast and simple enough to be practical. More broadly, we suggest that problem-specific sensitivity space analysis may be an overlooked tool for private additive noise.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/joseph24a.html
https://proceedings.mlr.press/v247/joseph24a.htmlFaster Spectral Density Estimation and Sparsification in the Nuclear Norm (Extended Abstract) We consider the problem of estimating the spectral density of a normalized graph adjacency matrix. Concretely, given an undirected graph $G = (V, E, w)$ with $n$ nodes and positive edge weights $w \in \mathbb{R}^{E}_{> 0}$, the goal is to return eigenvalue estimates $\widehat{\lambda}_1 \le \cdots\le \widehat{\lambda}_n$ such that \begin{align*} \frac{1}{n} \sum_{i\in\{1,\ldots, n\}}|\widehat{\lambda}_i-\lambda_i(N_G)|\le \varepsilon, \end{align*} where ${\lambda}_1(N_G)\le \cdots\le{\lambda}_n(N_G)$ are the eigenvalues of $G$’s normalized adjacency matrix, $N_G$. This goal is equivalent to requiring that the Wasserstein-1 distance between the uniform distribution on $\lambda_1, \ldots, \lambda_n$ and the uniform distribution on $\widehat{\lambda}_1, \ldots, \widehat{\lambda}_n$ is less than $\varepsilon$. We provide a randomized algorithm that achieves the guarantee above with $O(n\varepsilon^{-2})$ queries to a degree and neighbor oracle and in $O(n\varepsilon^{-3})$ time. This improves on previous state-of-the-art methods, including an $O(n\varepsilon^{-7})$ time algorithm from [Braverman et al., STOC 2022] and, for sufficiently small $\varepsilon$, a $2^{O(\varepsilon^{-1})}$ time method from [Cohen-Steiner et al., KDD 2018]. To achieve this result, we introduce a new notion of graph sparsification, which we call \emph{nuclear sparsification}. We provide an $O(n\varepsilon^{-2})$-query and $O(n\varepsilon^{-2})$-time algorithm for computing $O(n\varepsilon^{-2})$-sparse nuclear sparsifiers. We show that this bound is optimal in both its sparsity and query complexity, and we separate our results from the related notion of additive spectral sparsification. 
Of independent interest, we show that our sparsification method also yields the first \emph{deterministic} algorithm for spectral density estimation that scales linearly with $n$ (sublinear in the representation size of the graph).Sun, 30 Jun 2024 00:00:00 +0000
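The accuracy criterion in the abstract above, the average absolute difference between sorted eigenvalue lists, can be written down directly. The Python sketch below evaluates only that error metric on a graph whose spectrum is known in closed form. It does not implement the paper's estimation or sparsification algorithm, and the rounding-based "estimate" and all function names are illustrative assumptions.

```python
import math


def spectral_w1_error(estimates, eigenvalues):
    """(1/n) * sum_i |est_i - lambda_i| after sorting both lists."""
    assert len(estimates) == len(eigenvalues)
    est, lam = sorted(estimates), sorted(eigenvalues)
    return sum(abs(a - b) for a, b in zip(est, lam)) / len(lam)


def cycle_normalized_spectrum(n):
    # In the unweighted n-cycle every degree is 2, so N_G = A / 2 and its
    # eigenvalues are cos(2*pi*k/n) for k = 0..n-1.
    return [math.cos(2 * math.pi * k / n) for k in range(n)]


eps = 0.05
true_spectrum = cycle_normalized_spectrum(100)

# Crude illustrative estimate: round each eigenvalue to a grid of width eps.
rounded = [eps * round(lam / eps) for lam in true_spectrum]
err = spectral_w1_error(rounded, true_spectrum)
```

Rounding moves each eigenvalue by at most `eps / 2`, so the error stays within the target $\varepsilon$. A real estimator would have to reach this accuracy without computing the exact spectrum first.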
https://proceedings.mlr.press/v247/jin24a.html
https://proceedings.mlr.press/v247/jin24a.htmlAlgorithms for mean-field variational inference via polyhedral optimization in the Wasserstein spaceWe develop a theory of finite-dimensional polyhedral subsets over the Wasserstein space and optimization of functionals over them via first-order methods. Our main application is to the problem of mean-field variational inference, which seeks to approximate a distribution $\pi$ over $\mathbb{R}^d$ by a product measure $\pi^\star$. When $\pi$ is strongly log-concave and log-smooth, we provide (1) approximation rates certifying that $\pi^\star$ is close to the minimizer $\pi^\star_\diamond$ of the KL divergence over a \emph{polyhedral} set $\mathcal{P}_\diamond$, and (2) an algorithm for minimizing $\text{KL}(\cdot\|\pi)$ over $\mathcal{P}_\diamond$ with accelerated complexity $O(\sqrt \kappa \log(\kappa d/\varepsilon^2))$, where $\kappa$ is the condition number of $\pi$. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/jiang24a.html
https://proceedings.mlr.press/v247/jiang24a.htmlOffline Reinforcement Learning: Role of State Aggregation and Trajectory DataWe revisit the problem of offline reinforcement learning with value function realizability but without Bellman completeness. Previous work by Xie and Jiang (2021) and Foster et al. (2022) left open the question of whether bounded (all-policy) concentrability coefficient along with trajectory-based offline data admits a polynomial sample complexity. In this work, we provide a negative answer to this question for the task of offline policy evaluation. In addition to addressing this question, we provide a rather complete picture for offline policy evaluation with only value function realizability. Our primary findings are threefold: 1) The sample complexity of offline policy evaluation is governed by the concentrability coefficient in an aggregated Markov Transition Model jointly determined by the function class and the offline data distribution, rather than that in the original MDP. This unifies and generalizes the ideas of Xie and Jiang (2021) and Foster et al. (2022), 2) The concentrability coefficient in the aggregated Markov Transition Model may grow exponentially with the horizon length, even when the concentrability coefficient in the original MDP is small and the offline data is \emph{admissible} (i.e., the data distribution equals the occupancy measure of some policy), 3) Under value function realizability, there is a generic reduction that can convert any hard instance with admissible data to a hard instance with trajectory data, implying that trajectory data offers no extra benefits over admissible data. These three pieces jointly resolve the open problem, though each of them could be of independent interest. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/jia24a.html
https://proceedings.mlr.press/v247/jia24a.htmlClosing the Computational-Query Depth Gap in Parallel Stochastic Convex OptimizationWe develop a new parallel algorithm for minimizing Lipschitz, convex functions with a stochastic subgradient oracle. The total number of queries made and the query depth, i.e., the number of parallel rounds of queries, match the prior state-of-the-art, [CJJLLST23], while improving upon the computational depth by a polynomial factor for sufficiently small accuracy. When combined with previous state-of-the-art methods our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms. Our method starts with a \emph{ball acceleration} framework of previous parallel methods, i.e., [CJJJLST20, ACJJS21], which reduce the problem to minimizing a regularized Gaussian convolution of the function constrained to Euclidean balls. By developing and leveraging new stability properties of the Hessian of this induced function, we depart from prior parallel algorithms and reduce these ball-constrained optimization problems to stochastic unconstrained quadratic minimization problems. Although we are unable to prove concentration of the asymmetric matrices that we use to approximate this Hessian, we nevertheless develop an efficient parallel method for solving these quadratics. Interestingly, our algorithms can be improved using fast matrix multiplication and run in nearly-linear time if the matrix multiplication exponent is 2.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/jambulapati24b.html
https://proceedings.mlr.press/v247/jambulapati24b.htmlBlack-Box k-to-1-PCA Reductions: Theory and ApplicationsThe $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely-used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have black-box access to via samples. Motivated by these settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly-understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work in sample complexity by a $\mathsf{poly}(k)$ factor.Sun, 30 Jun 2024 00:00:00 +0000
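The deflation scheme described above, which calls a $1$-PCA oracle $k$ times, can be sketched in a few lines of Python. The paper treats the oracle as a black box. Here it is played by plain power iteration, which is an assumption made for illustration, as are the function names, the starting vector, and the iteration count. The sketch assumes a symmetric positive semidefinite input.

```python
import math


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def project_out(v, basis):
    # Remove the components of v along each (orthonormal) basis vector.
    for u in basis:
        c = sum(a * b for a, b in zip(u, v))
        v = [a - c * b for a, b in zip(v, u)]
    return v


def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def one_pca_oracle(M, found, iters=300):
    # Hypothetical 1-PCA oracle: power iteration on M restricted to the
    # orthogonal complement of the directions found so far.
    d = len(M)
    v = normalize(project_out([1.0 + 0.1 * i for i in range(d)], found))
    for _ in range(iters):
        v = normalize(project_out(matvec(M, v), found))
    return v


def deflation_k_pca(M, k):
    # Call the 1-PCA oracle k times, deflating the found directions each time.
    found = []
    for _ in range(k):
        found.append(one_pca_oracle(M, found))
    return found


M = [[5.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
U = deflation_k_pca(M, 2)
```

On this diagonal example the two returned vectors align, up to sign, with the first two coordinate axes. The paper's question is how the approximation quality degrades when the oracle is only approximately correct, which this exact-oracle sketch does not capture.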
https://proceedings.mlr.press/v247/jambulapati24a.html
https://proceedings.mlr.press/v247/jambulapati24a.htmlAdaptive Learning Rate for Follow-the-Regularized-Leader: Competitive Analysis and Best-of-Both-WorldsFollow-The-Regularized-Leader (FTRL) is known as an effective and versatile approach in online learning, where appropriate choice of the learning rate is crucial for smaller regret. To this end, we formulate the problem of adjusting FTRL’s learning rate as a sequential decision-making problem and introduce the framework of competitive analysis. We establish a lower bound for the competitive ratio and propose update rules for the learning rate that achieve an upper bound within a constant factor of this lower bound. Specifically, we illustrate that the optimal competitive ratio is characterized by the (approximate) monotonicity of components of the penalty term, showing that a constant competitive ratio is achievable if the components of the penalty term form a monotone non-increasing sequence, and derive a tight competitive ratio when penalty terms are $\xi$-approximately monotone non-increasing. Our proposed update rule, referred to as \textit{stability-penalty matching}, also facilitates the construction of Best-Of-Both-Worlds (BOBW) algorithms for stochastic and adversarial environments. In these environments our results contribute to achieving tighter regret bounds and broaden the applicability of algorithms for various settings such as multi-armed bandits, graph bandits, linear bandits, and contextual bandits.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/ito24a.html
https://proceedings.mlr.press/v247/ito24a.htmlReconstructing the Geometry of Random Geometric Graphs (Extended Abstract)Random geometric graphs are random graph models defined on metric spaces. Such a model is defined by first sampling points from a metric space and then connecting each pair of sampled points with probability that depends on their distance, independently among pairs. In this work we show how to efficiently reconstruct the geometry of the underlying space from the sampled graph under the {\em manifold} assumption, i.e., assuming that the underlying space is a low dimensional manifold and that the connection probability is a strictly decreasing function of the Euclidean distance between the points in a given embedding of the manifold in $\mathbb{R}^N$. Our work complements a large body of work on manifold learning, where the goal is to recover a manifold from points sampled in the manifold along with their (approximate) distances.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/huang24c.html
https://proceedings.mlr.press/v247/huang24c.htmlInformation-Theoretic Thresholds for the Alignments of Partially Correlated GraphsThis paper studies the problem of recovering the hidden vertex correspondence between two correlated random graphs. We propose the partially correlated Erdős-Rényi graphs model, wherein a pair of induced subgraphs on a certain number of nodes is correlated. We investigate the information-theoretic thresholds for recovering the latent correlated subgraphs and the hidden vertex correspondence. We prove that there exists an optimal rate for partial recovery for the number of correlated nodes, above which one can correctly match a fraction of vertices and below which correctly matching any positive fraction is impossible, and we also derive an optimal rate for exact recovery. In the proof of the possibility results, we propose correlated functional digraphs, which categorize the edges of the intersection graph into two cases of components, and bound the error probability by lower-order cumulant generating functions. The proof of the impossibility results builds upon the generalized Fano’s inequality and the recovery thresholds settled in the correlated Erdős-Rényi graphs model.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/huang24b.html
https://proceedings.mlr.press/v247/huang24b.htmlFaster Sampling without Isoperimetry via Diffusion-based Monte Carlo To sample from a general target distribution $p_*\propto e^{-f_*}$ beyond the isoperimetric condition, Huang et al. (2023) proposed to perform sampling through reverse diffusion, giving rise to Diffusion-based Monte Carlo (DMC). Specifically, DMC follows the reverse SDE of a diffusion process that transforms the target distribution to the standard Gaussian, utilizing a non-parametric score estimation. However, the original DMC algorithm encountered high gradient complexity, resulting in an exponential dependency on the error tolerance $\epsilon$ of the obtained samples. In this paper, we demonstrate that the high complexity of the original DMC algorithm originates from its redundant design of score estimation, and propose a more efficient DMC algorithm, called RS-DMC, based on a novel recursive score estimation method. In particular, we first divide the entire diffusion process into multiple segments and then formulate the score estimation step (at any time step) as a series of interconnected mean estimation and sampling subproblems accordingly, which are correlated in a recursive manner. Importantly, we show that with a proper design of the segment decomposition, all sampling subproblems will only need to tackle a strongly log-concave distribution, which can be very efficient to solve using the standard sampler (e.g., Langevin Monte Carlo) with a provably rapid convergence rate. As a result, we prove that the gradient complexity of RS-DMC exhibits merely a quasi-polynomial dependency on $\epsilon$. This finding is highly unexpected as it substantially enhances the prevailing belief of the necessity for exponential gradient complexity in all prior works such as Huang et al. (2023). Under commonly used dissipative conditions, our algorithm is provably much faster than the popular Langevin-based algorithms. 
Our algorithm design and theoretical framework illuminate a novel direction for addressing sampling problems, which could be of broader applicability in the community. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/huang24a.html
https://proceedings.mlr.press/v247/huang24a.htmlOpen Problem: Optimal Rates for Stochastic Decision-Theoretic Online Learning Under Differentially PrivacyFor the stochastic variant of decision-theoretic online learning with $K$ actions, $T$ rounds, and minimum gap $\Delta_{\min}$, the optimal, gap-dependent rate of the pseudo-regret is known to be $O \left( \frac{\log K}{\Delta_{\min}} \right)$. We ask to settle the optimal gap-dependent rate for the problem under $\varepsilon$-differential privacy.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/hu24a.html
https://proceedings.mlr.press/v247/hu24a.htmlOn the sample complexity of parameter estimation in logistic regression with normal designThe logistic regression model is one of the most popular data generation models in noisy binary classification problems. In this work, we study the sample complexity of estimating the parameters of the logistic regression model up to a given $\ell_2$ error, in terms of the dimension and the inverse temperature, with standard normal covariates. The inverse temperature controls the signal-to-noise ratio of the data generation process. While both generalization bounds and asymptotic performance of the maximum-likelihood estimator for logistic regression are well-studied, the non-asymptotic sample complexity that shows the dependence on error and the inverse temperature for parameter estimation is absent from previous analyses. We show that the sample complexity curve has two change-points in terms of the inverse temperature, clearly separating the low, moderate, and high temperature regimes.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/hsu24a.html
https://proceedings.mlr.press/v247/hsu24a.htmlAdversarially-Robust Inference on Trees via Belief PropagationWe introduce and study the problem of posterior inference on tree-structured graphical models in the presence of a malicious adversary who can corrupt some observed nodes. In the well-studied \emph{broadcasting on trees} model, corresponding to the ferromagnetic Ising model on a $d$-regular tree with zero external field, when a natural signal-to-noise ratio exceeds one (the celebrated \emph{Kesten-Stigum threshold}), the posterior distribution of the root given the leaves is bounded away from $\mathrm{Ber}(1/2)$, and carries nontrivial information about the sign of the root. This posterior distribution can be computed exactly via dynamic programming, also known as belief propagation. We first confirm a folklore belief that a malicious adversary who can corrupt an inverse-polynomial fraction of the leaves of their choosing makes this inference impossible. Our main result is that accurate posterior inference about the root vertex given the leaves \emph{is} possible when the adversary is constrained to make corruptions at a $\rho$-fraction of randomly-chosen leaf vertices, so long as the signal-to-noise ratio exceeds $O(\log d)$ and $\rho \leq c \varepsilon$ for some universal $c > 0$. Since inference becomes information-theoretically impossible when $\rho \gg \varepsilon$, this amounts to an information-theoretically optimal fraction of corruptions, up to a constant multiplicative factor. Furthermore, we show that the canonical belief propagation algorithm performs this inference.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/hopkins24a.html
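The abstract above notes that, without an adversary, the root posterior in the broadcasting-on-trees model can be computed exactly by dynamic programming (belief propagation). A minimal sketch of that upward recursion follows. It assumes a uniform prior on the root, spins in {+1, -1}, and a hypothetical parameter `flip` in (0, 1) for the probability that a child differs from its parent; the nested-list tree encoding and the function names are illustrative and not taken from the paper, and no corruption is modeled.

```python
def upward_message(node, flip):
    """Return the normalized pair (P(leaves below | node=+1), P(leaves below | node=-1)).

    A node is either an observed leaf spin (+1 or -1) or a list of child nodes.
    `flip` is the assumed probability that a child's spin differs from its parent's.
    """
    if isinstance(node, int):
        # An observed leaf is consistent with exactly one spin value.
        return (1.0, 0.0) if node == 1 else (0.0, 1.0)
    plus, minus = 1.0, 1.0
    for child in node:
        c_plus, c_minus = upward_message(child, flip)
        # Sum over the child's spin, weighted by the broadcast channel.
        plus *= (1 - flip) * c_plus + flip * c_minus
        minus *= flip * c_plus + (1 - flip) * c_minus
    total = plus + minus
    # Rescaling both entries keeps their ratio and avoids underflow on deep trees.
    return (plus / total, minus / total)


def root_posterior(tree, flip):
    """Posterior probability that the root is +1, under a uniform prior on the root."""
    plus, minus = upward_message(tree, flip)
    return plus / (plus + minus)
```

For example, `root_posterior([[1, 1], [1, -1]], 0.1)` evaluates a depth-two binary tree with three of four leaves equal to +1.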
https://proceedings.mlr.press/v247/hopkins24a.htmlOpen problem: Direct Sums in Learning TheoryIn computer science, the term ‘direct sum’ refers to fundamental questions about the scaling of computational or information complexity with respect to multiple task instances. Consider an algorithmic task $T$ and a computational resource $C$. For instance, $T$ might be the task of computing a polynomial, with $C$ representing the number of arithmetic operations required, or $T$ could be a learning task with its sample complexity as $C$. The direct sum inquiry focuses on the cost of solving $k$ separate instances of $T$, particularly how this aggregate cost compares to the resources needed for a single instance. Typically, the cost for multiple instances is at most $k$ times the cost of one, since each can be handled independently. However, there are intriguing scenarios where the total cost for $k$ instances is less than this linear relationship. Such questions naturally extend to the machine-learning setting in which one may be interested in solving several learning problems at once. This notion of direct sums of learning problems gives rise to various natural questions and interesting problems.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/hanneke24c.html
https://proceedings.mlr.press/v247/hanneke24c.htmlList Sample Compression and Uniform ConvergenceList learning is a variant of supervised classification where the learner outputs multiple plausible labels for each instance rather than just one. We investigate classical principles related to generalization within the context of list learning. Our primary goal is to determine whether classical principles in the PAC setting retain their applicability in the domain of list PAC learning. We focus on uniform convergence (which is the basis of Empirical Risk Minimization) and on sample compression (which is a powerful manifestation of Occam’s Razor). In classical PAC learning, both uniform convergence and sample compression satisfy a form of ‘completeness’: whenever a class is learnable, it can also be learned by a learning rule that adheres to these principles. We ask whether the same completeness holds true in the list learning setting. We show that uniform convergence remains equivalent to learnability in the list PAC learning setting. In contrast, our findings reveal surprising results regarding sample compression: we prove that when the label space is $Y=\{0,1,2\}$, then there are 2-list-learnable classes that cannot be compressed. This refutes the list version of the sample compression conjecture by Littlestone and Warmuth in 1986. We prove an even stronger impossibility result, showing that there are $2$-list-learnable classes that cannot be compressed even when the reconstructed function can work with lists of arbitrarily large size. We prove a similar result for (1-list) PAC learnable classes when the label space is unbounded.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/hanneke24b.html
https://proceedings.mlr.press/v247/hanneke24b.htmlThe Star Number and Eluder Dimension: Elementary Observations About the Dimensions of DisagreementThis article presents a number of elementary observations and relations concerning commonly-studied combinatorial dimensions from the learning theory literature on classification and reinforcement learning: namely, the star number, eluder dimension, VC dimension, Littlestone dimension, threshold dimension, and cardinality of the class. One theme of the work is understanding how these dimensions may be re-expressed as natural dimensions of the convexity space of version spaces. Specifically, we find that the star number is precisely the VC dimension of version spaces (and of their disagreement regions), whereas the eluder dimension is precisely the threshold dimension of version spaces (and of their disagreement regions). We are also interested in understanding direct relations among these dimensions. For instance, we show that there is no infinite concept class with both finite Littlestone dimension and finite star number. Moreover, any infinite concept class must have infinite eluder dimension. In both cases, we also provide quantitative relations to the cardinality of the class. For the latter result, we also show an analogous relation for real-valued functions, where the cardinality of the class is replaced by the $L_\infty$ covering number. As another relation between star numbers and VC dimension, we provide a simple, precise, and general characterization of the VC dimension of the minimal intersection-closed class containing a given concept class: namely, the 1-centered star number of the original class. Moreover, we generalize this result to provide a unifying approach to the design of certain sample compression schemes, along with a simple combinatorial dimension characterizing its compression size: the minimum star number. We also discuss a number of implications of many of these observations. 
Though the proofs of the above observations are actually all incredibly simple, it is interesting that such fundamental relations among these well-known quantities appear to have heretofore gone unnoticed in the literature.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/hanneke24a.html
https://proceedings.mlr.press/v247/hanneke24a.htmlPrediction from compression for models with infinite memory, with applications to hidden Markov and renewal processesConsider the problem of predicting the next symbol given a sample path of length $n$, whose joint distribution belongs to a distribution class that may have long-term memory. The goal is to compete with the conditional predictor that knows the true model. For both hidden Markov models (HMMs) and renewal processes, we determine the optimal prediction risk in Kullback-Leibler divergence up to universal constant factors. Extending existing results in finite-order Markov models (Han et al. (2023)) and drawing ideas from universal compression, the proposed estimator has a prediction risk bounded by redundancy of the distribution class and a memory term that accounts for the long-range dependency of the model. Notably, for HMMs with bounded state and observation spaces, a polynomial-time estimator based on dynamic programming is shown to achieve the optimal prediction risk $\Theta(\frac{\log n}{n})$; prior to this work, the only known result of this type is $O(\frac{1}{\log n})$ obtained using Markov approximation (Sharan et al. (2018)). Matching minimax lower bounds are obtained by making connections to redundancy and mutual information via a reduction argument.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/han24a.html
https://proceedings.mlr.press/v247/han24a.htmlBeyond Catoni: Sharper Rates for Heavy-Tailed and Robust Mean EstimationWe study the fundamental problem of estimating the mean of a $d$-dimensional distribution with covariance $\Sigma \preccurlyeq \sigma^2 I_d$ given $n$ samples. When $d = 1$, \cite{catoni} showed an estimator with error $(1+o(1)) \cdot \sigma \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$, with probability $1 - \delta$, matching the Gaussian error rate. For $d>1$, a natural estimator outputs the center of the minimum enclosing ball of one-dimensional confidence intervals to achieve a $1-\delta$ confidence radius of $\sqrt{\frac{2 d}{d+1}} \cdot \sigma \left(\sqrt{\frac{d}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}\right)$, incurring a $\sqrt{\frac{2d}{d+1}}$-factor loss over the Gaussian rate. When the $\sqrt{\frac{d}{n}}$ term dominates by a $\sqrt{\log \frac{1}{\delta}}$ factor, \cite{lee2022optimal-highdim} showed an improved estimator matching the Gaussian rate. This raises a natural question: Is the $\sqrt{\frac{2 d}{d+1}}$ loss \emph{necessary} when the $\sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$ term dominates? We show that the answer is \emph{no} – we construct an estimator that improves over the above naive estimator by a constant factor. We also consider robust estimation, where an adversary is allowed to corrupt an $\epsilon$-fraction of samples arbitrarily: in this case, we show that the above strategy of combining one-dimensional estimates and incurring the $\sqrt{\frac{2d}{d+1}}$-factor \emph{is} optimal in the infinite-sample limit.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gupta24a.html
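To make the size of the $\sqrt{\frac{2d}{d+1}}$ loss in the abstract above concrete, the following sketch evaluates the confidence radius quoted there for the naive estimator. The function names are hypothetical; the formula is copied from the abstract, and the code only evaluates it rather than implementing any estimator.

```python
import math


def dimension_factor(d):
    """The sqrt(2d / (d + 1)) factor lost by the naive estimator relative to the Gaussian rate."""
    return math.sqrt(2 * d / (d + 1))


def naive_radius(d, n, sigma, delta):
    """Confidence radius quoted in the abstract for the minimum-enclosing-ball estimator."""
    gaussian_rate = sigma * (math.sqrt(d / n) + math.sqrt(2 * math.log(1 / delta) / n))
    return dimension_factor(d) * gaussian_rate
```

The factor equals 1 when $d = 1$ and increases toward $\sqrt{2}$ as $d$ grows, so the loss over the Gaussian rate is always below a factor of $\sqrt{2}$.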
https://proceedings.mlr.press/v247/gupta24a.htmlStochastic Constrained Contextual Bandits via Lyapunov Optimization Based Estimation to Decision FrameworkThis paper studies the problem of stochastic constrained contextual bandits (CCB) under a general realizability condition where the expected rewards and costs are within general function classes. We propose LOE2D, a Lyapunov Optimization Based Estimation to Decision framework with online regression oracles for learning reward/constraint. LOE2D establishes $\Tilde O(T^{\frac{3}{4}}U^{\frac{1}{4}})$ regret and constraint violation, which can be further refined to $\Tilde O(\min\{\sqrt{TU}/\varepsilon^2, T^{\frac{3}{4}}U^{\frac{1}{4}}\})$ when the Slater condition holds in the underlying offline problem with the Slater “constant” $\varepsilon=\Omega(\sqrt{U/T})$, where $U$ denotes the error bounds of online regression oracles. These results improve on LagrangeCBwLC in two aspects: i) our results hold without any prior information, while LagrangeCBwLC requires knowledge of the Slater constant to design a proper learning rate; ii) our results hold when $\varepsilon=\Omega(\sqrt{U/T})$, while LagrangeCBwLC requires a constant margin $\varepsilon=\Omega(1)$. These improvements stem from two novel techniques: violation-adaptive learning in the E2D module and multi-step Lyapunov drift analysis in bounding constraint violation. The experiments further show that LOE2D outperforms the baseline algorithm.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/guo24a.html
https://proceedings.mlr.press/v247/guo24a.htmlCommunity detection in the hypergraph stochastic block model and reconstruction on hypertreesWe study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In this model, $n$ vertices are randomly divided into two communities, and size-$r$ hyperedges are added randomly depending on whether all vertices in the hyperedge are in the same community. The goal of weak recovery is to recover a non-trivial fraction of the communities given the hypergraph. Pal and Zhu (2021); Stephan and Zhu (2022) established that weak recovery is always possible above a natural threshold called the Kesten-Stigum (KS) threshold. For assortative models (i.e., monochromatic hyperedges are preferred), Gu and Polyanskiy (2023) proved that the KS threshold is tight if $r\le 4$ or the expected degree $d$ is small. For other cases, the tightness of the KS threshold remained open. In this paper we determine the tightness of the KS threshold for a wide range of parameters. We prove that for $r\le 6$ and $d$ large enough, the KS threshold is tight. This shows that there is no information-computation gap in this regime and partially confirms a conjecture of Angelini et al. (2015). On the other hand, we show that for $r\ge 5$, there exist parameters for which the KS threshold is not tight. In particular, for $r\ge 7$, the KS threshold is not tight if the model is disassortative (i.e., polychromatic hyperedges are preferred) or $d$ is large enough. This provides more evidence supporting the existence of an information-computation gap in these cases. Furthermore, we establish asymptotic bounds on the weak recovery threshold for fixed $r$ and large $d$. 
We also obtain a number of results regarding the broadcasting on hypertrees (BOHT) model, including the asymptotics of the reconstruction threshold for $r\ge 7$ and impossibility of robust reconstruction at criticality.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gu24a.html
https://proceedings.mlr.press/v247/gu24a.htmlPrincipal eigenstate classical shadowsGiven many copies of an unknown quantum state $\rho$, we consider the task of learning a classical description of its principal eigenstate. Namely, assuming that $\rho$ has an eigenstate $|\phi⟩$ with (unknown) eigenvalue $\lambda > 1/2$, the goal is to learn a (classical shadows style) classical description of $|\phi⟩$ which can later be used to estimate expectation values $⟨\phi |O | \phi ⟩$ for any $O$ in some class of observables. We consider the sample-complexity setting in which generating a copy of $\rho$ is expensive, but joint measurements on many copies of the state are possible. We present a protocol for this task scaling with the principal eigenvalue $\lambda$ and show that it is optimal within a space of natural approaches, e.g., applying quantum state purification followed by a single-copy classical shadows scheme. Furthermore, when $\lambda$ is sufficiently close to $1$, the performance of our algorithm is optimal—matching the sample complexity for pure state classical shadows.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/grier24a.html
https://proceedings.mlr.press/v247/grier24a.htmlOn the Computability of Robust PAC LearningWe initiate the study of computability requirements for adversarially robust learning. Adversarially robust PAC-type learnability is by now an established field of research. However, the effects of computability requirements in PAC-type frameworks are only just starting to emerge. We introduce the problem of robust computable PAC (robust CPAC) learning and provide some simple sufficient conditions for this. We then show that learnability in this setup is not implied by the combination of its components: classes that are both CPAC and robustly PAC learnable are not necessarily robustly CPAC learnable. Furthermore, we show that the novel framework exhibits some surprising effects: for robust CPAC learnability it is not required that the robust loss is computably evaluable! Towards understanding characterizing properties, we introduce a novel dimension, the computable robust shattering dimension. We prove that its finiteness is necessary, but not sufficient for robust CPAC learnability. This might yield novel insights for the corresponding phenomenon in the context of robust PAC learnability, where insufficiency of the robust shattering dimension for learnability has been conjectured, but so far a resolution has remained elusive.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gourdeau24a.html
https://proceedings.mlr.press/v247/gourdeau24a.htmlIdentification of mixtures of discrete product distributions in near-optimal sample and time complexityWe consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables $X_1, \ldots, X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition and (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gordon24a.html
https://proceedings.mlr.press/v247/gordon24a.htmlOmnipredictors for regression and the approximate rank of convex functionsConsider the supervised learning setting where the goal is to learn to predict labels $\mathbf y$ given points $\mathbf x$ from a distribution. An \textit{omnipredictor} for a class $\mathcal L$ of loss functions and a class $\mathcal C$ of hypotheses is a predictor whose predictions incur less expected loss than the best hypothesis in $\mathcal C$ for every loss in $\mathcal L$. Since the work of Gopalan et al. (2021) that introduced the notion, there has been a large body of work in the setting of binary labels where $\mathbf y \in \{0, 1\}$, but much less is known about the regression setting where $\mathbf y \in [0,1]$ can be continuous. The naive generalization of the previous approaches to regression is to predict the probability distribution of $y$, discretized to $\varepsilon$-width intervals. The running time would be exponential in the size of the output of the omnipredictor, which is $1/\varepsilon$. Our main conceptual contribution is the notion of \textit{sufficient statistics} for loss minimization over a family of loss functions: these are a set of statistics about a distribution such that knowing them allows one to take actions that minimize the expected loss for any loss in the family. The notion of sufficient statistics relates directly to the approximate rank of the family of loss functions. Thus, improved bounds on the latter yield improved runtimes for learning omnipredictors. Our key technical contribution is a bound of $O(1/\varepsilon^{2/3})$ on the $\epsilon$-approximate rank of convex, Lipschitz functions on the interval $[0,1]$, which we show is tight up to a factor of $\mathrm{polylog} (1/\epsilon)$. This yields improved runtimes for learning omnipredictors for the class of all convex, Lipschitz loss functions under weak learnability assumptions about the class $\mathcal C$. 
We also give efficient omnipredictors when the loss families have low-degree polynomial approximations, or arise from generalized linear models (GLMs). This translation from sufficient statistics to faster omnipredictors is made possible by lifting the technique of loss outcome indistinguishability introduced by Gopalan et al. (2023a) for Boolean labels to the regression setting.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gopalan24b.html
https://proceedings.mlr.press/v247/gopalan24b.htmlOn Computationally Efficient Multi-Class CalibrationConsider a multi-class labelling problem, where the labels can take values in $[k]$, and a predictor predicts a distribution over the labels. In this work, we study the following foundational question: \emph{Are there notions of multi-class calibration that give strong guarantees of meaningful predictions and can be achieved in time and sample complexities polynomial in $k$?} Prior notions of calibration exhibit a tradeoff between computational efficiency and expressivity: they either suffer from having sample complexity exponential in $k$, or needing to solve computationally intractable problems, or give rather weak guarantees. Our main contribution is a notion of calibration that achieves all these desiderata: we formulate a robust notion of \emph{projected smooth calibration} for multi-class predictions, and give new recalibration algorithms for efficiently calibrating predictors under this definition with complexity polynomial in $k$. Projected smooth calibration gives strong guarantees for all downstream decision makers who want to use the predictor for binary classification problems of the form: does the label belong to a subset $T \subseteq [k]$: \emph{e.g. is this an image of an animal?} It ensures that the probabilities predicted by summing the probabilities assigned to labels in $T$ are close to some perfectly calibrated binary predictor for that task. We also show that natural strengthenings of our definition are computationally hard to achieve: they run into information theoretic barriers or computational intractability. Underlying both our upper and lower bounds is a tight connection that we prove between multi-class calibration and the well-studied problem of agnostic learning in the (standard) binary prediction setting. 
This allows us to use kernel methods to design efficient algorithms, and also to use known hardness results for agnostic learning based on the hardness of refuting random CSPs to show lower bounds. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gopalan24a.html
https://proceedings.mlr.press/v247/gopalan24a.htmlMirror Descent Algorithms with Nearly Dimension-Independent Rates for Differentially-Private Stochastic Saddle-Point Problems extended abstractWe study the problem of differentially-private (DP) stochastic (convex-concave) saddle-points in the polyhedral setting. We propose $(\varepsilon, \delta)$-DP algorithms based on stochastic mirror descent that attain nearly dimension-independent convergence rates for the expected duality gap, a type of guarantee that was known before only for bilinear objectives. For convex-concave and first-order-smooth stochastic objectives, our algorithms attain a rate of $\sqrt{\log(d)/n} + (\log(d)^{3/2}/[n\varepsilon])^{1/3}$, where $d$ is the dimension of the problem and $n$ the dataset size. Under an additional second-order-smoothness assumption, we improve the rate on the expected gap to $\sqrt{\log(d)/n} + (\log(d)^{3/2}/[n\varepsilon])^{2/5}$. Under this additional assumption, we also show, by using bias-reduced gradient estimators, that the duality gap is bounded by $\log(d)/\sqrt{n} + \log(d)/[n\varepsilon]^{1/2}$ with constant success probability. This result provides evidence of the near-optimality of the approach. Finally, we show that combining our methods with acceleration techniques from online learning leads to the first algorithm for DP Stochastic Convex Optimization in the polyhedral setting that is not based on Frank-Wolfe methods. For convex and first-order-smooth stochastic objectives, our algorithms attain an excess risk of $\sqrt{\log(d)/n} + \log(d)^{7/10}/[n\varepsilon]^{2/5}$, and when additionally assuming second-order-smoothness, we improve the rate to $\sqrt{\log(d)/n} + \log(d)/\sqrt{n\varepsilon}$. Instrumental to all of these results are various extensions of the classical Maurey Sparsification Lemma, which may be of independent interest.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gonzalez24a.html
https://proceedings.mlr.press/v247/gonzalez24a.htmlLinear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few ActionsOne of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., (Jiang et al., 2017; Zanette et al., 2020a)), these algorithms all rely on the principle of global optimism which requires solving a nonconvex optimization problem. In particular, it has remained open as to whether computationally efficient algorithms exist. In this paper we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/golowich24a.html
https://proceedings.mlr.press/v247/golowich24a.htmlOn Convex Optimization with Semi-Sensitive FeaturesWe study the differentially private (DP) empirical risk minimization (ERM) problem under the \emph{semi-sensitive DP} setting where only some features are sensitive. This generalizes the Label DP setting where only the label is sensitive. We give improved upper and lower bounds on the excess risk for DP-ERM. In particular, we show that the error only scales polylogarithmically in terms of the sensitive domain size, improving upon previous results that scale polynomially in the size of the sensitive domain (Ghazi et al., NeurIPS 2021).Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/ghazi24a.html
https://proceedings.mlr.press/v247/ghazi24a.html$(ε, u)$-Adaptive Regret Minimization in Heavy-Tailed BanditsHeavy-tailed distributions naturally arise in several settings, from finance to telecommunications. While regret minimization under subgaussian or bounded rewards has been widely studied, learning with heavy-tailed distributions only gained popularity over the last decade. In this paper, we consider the setting in which the reward distributions have finite absolute raw moments of maximum order $1+\epsilon$, uniformly bounded by a constant $u<+\infty$, for some $\epsilon \in (0,1]$. In this setting, we study the regret minimization problem when $\epsilon$ and $u$ are unknown to the learner and it has to adapt. First, we show that adaptation comes at a cost and derive two negative results proving that the same regret guarantees of the non-adaptive case cannot be achieved with no further assumptions. Then, we devise and analyze a fully data-driven trimmed mean estimator and propose a novel adaptive regret minimization algorithm, \texttt{AdaR-UCB}, that leverages such an estimator. Finally, we show that \texttt{AdaR-UCB} is the first algorithm that, under a known distributional assumption, enjoys regret guarantees nearly matching those of the non-adaptive heavy-tailed case.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/genalti24a.html
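As background for the trimmed mean estimator mentioned in the abstract above, the sketch below shows a plain, non-adaptive trimmed (truncated) mean with a fixed cutoff. This is an illustrative assumption and not the paper's data-driven estimator: here the caller supplies the hypothetical parameter `threshold`, whereas the abstract describes an estimator that needs no knowledge of $\epsilon$ or $u$.

```python
def trimmed_mean(samples, threshold):
    """Average the samples after zeroing out those whose magnitude exceeds `threshold`.

    Zeroing large samples (rather than discarding them) keeps the divisor equal to
    the full sample count, which trades a small bias for robustness to heavy tails.
    """
    kept = [x if abs(x) <= threshold else 0.0 for x in samples]
    return sum(kept) / len(samples)
```

For instance, `trimmed_mean([1.0, 2.0, 3.0, 1000.0], 10.0)` ignores the outlier and returns 1.5, while the ordinary empirical mean of the same samples is 251.5.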
https://proceedings.mlr.press/v247/genalti24a.htmlAdversarial Online Learning with Temporal Feedback GraphsWe study a variant of prediction with expert advice where the learner’s action at round $t$ is only allowed to depend on losses on a specific subset of the rounds (where the structure of which rounds’ losses are visible at time $t$ is provided by a directed “feedback graph” known to the learner). We present a novel learning algorithm for this setting based on a strategy of partitioning the losses across sub-cliques of this graph. We complement this with a lower bound that is tight in many practical settings, and which we conjecture to be within a constant factor of optimal. For the important class of transitive feedback graphs, we prove that this algorithm is efficiently implementable and obtains the optimal regret bound (up to a universal constant).Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gatmiry24b.html
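For orientation, the abstract above builds on prediction with expert advice. The sketch below is the classical exponential-weights (Hedge) learner for the standard setting in which all past losses are visible at every round; it is a baseline illustration, not the paper's clique-partitioning algorithm, and it does not model a feedback graph. Losses are assumed to lie in $[0, 1]$, and the function and parameter names are hypothetical.

```python
import math


def hedge(loss_rounds, eta):
    """Run exponential weights over a sequence of per-expert loss vectors.

    Returns (learner's total expected loss, cumulative loss of the best expert).
    `eta` is the learning rate; each entry of `loss_rounds` lists one loss per expert.
    """
    num_experts = len(loss_rounds[0])
    cumulative = [0.0] * num_experts
    learner_loss = 0.0
    for losses in loss_rounds:
        # Subtract the minimum before exponentiating for numerical stability.
        base = min(cumulative)
        weights = [math.exp(-eta * (c - base)) for c in cumulative]
        normalizer = sum(weights)
        probabilities = [w / normalizer for w in weights]
        # Expected loss of sampling an expert from the current distribution.
        learner_loss += sum(p * l for p, l in zip(probabilities, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return learner_loss, min(cumulative)
```

With $N$ experts, $T$ rounds, and `eta = math.sqrt(8 * math.log(N) / T)`, the standard analysis bounds the regret of this learner by $\sqrt{T \ln(N) / 2}$.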
https://proceedings.mlr.press/v247/gatmiry24b.htmlSampling Polytopes with Riemannian HMC: Faster Mixing via the Lewis Weights BarrierWe analyze Riemannian Hamiltonian Monte Carlo (RHMC) on a manifold endowed with the metric defined by the Hessian of a convex barrier function and apply it to sample a polytope defined by $m$ inequalities in $\mathbb{R}^n$. The advantage of RHMC over Euclidean methods such as the ball walk, hit-and-run and the Dikin walk is in its ability to take longer steps. However, in all previous work, the mixing rate of RHMC has a linear dependence on the number of inequalities. We introduce a hybrid of the Lewis weight barrier and the standard logarithmic barrier and prove that the mixing rate for the corresponding RHMC is bounded by $\tilde O(m^{1/3}n^{4/3})$, improving on the previous best bound of $\tilde O(mn^{2/3})$ (based on the log barrier). This continues the general parallels between optimization and sampling, with the latter typically leading to new tools and requiring more refined analysis. To prove our main results, we overcome several challenges relating to the smoothness of Hamiltonian curves and self-concordance properties of the barrier. In the process, we give a general framework for the analysis of Markov chains on Riemannian manifolds, derive new smoothness bounds on Hamiltonian curves, a central topic of comparison geometry, and extend self-concordance theory to the infinity norm, which gives sharper bounds; these properties all appear to be of independent interest.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gatmiry24a.html
https://proceedings.mlr.press/v247/gatmiry24a.htmlSafe Linear Bandits over Unknown PolytopesThe safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. 
The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gangrade24a.html
https://proceedings.mlr.press/v247/gangrade24a.htmlAgnostic Active Learning of Single Index Models with Linear Sample Complexity We study active learning methods for single index models of the form $F({\bm x}) = f(⟨{\bm w}, {\bm x}⟩)$, where $f:\mathbb{R} \to \mathbb{R}$ and ${\bm x}, {\bm w} \in \mathbb{R}^d$. In addition to their theoretical interest as simple examples of non-linear neural networks, single index models have received significant recent attention due to applications in scientific machine learning like surrogate modeling for partial differential equations (PDEs). Such applications require sample-efficient active learning methods that are robust to adversarial noise, i.e., that work even in the challenging agnostic learning setting. We provide two main results on agnostic active learning of single index models. First, when $f$ is known and Lipschitz, we show that $\tilde{O}(d)$ samples collected via {statistical leverage score sampling} are sufficient to learn a near-optimal single index model. Leverage score sampling is simple to implement, efficient, and already widely used for actively learning linear models. Our result requires no assumptions on the data distribution, is optimal up to log factors, and improves quadratically on a recent ${O}(d^{2})$ bound of Gajjar et al. 2023. Second, we show that $\tilde{O}(d)$ samples suffice even in the more difficult setting when $f$ is \emph{unknown}. Our results leverage tools from high dimensional probability, including Dudley’s inequality and dual Sudakov minoration, as well as a novel, distribution-aware discretization of the class of Lipschitz functions.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/gajjar24a.html
https://proceedings.mlr.press/v247/gajjar24a.htmlOnline Newton Method for Bandit Convex Optimisation Extended AbstractWe introduce a computationally efficient algorithm for zeroth-order bandit convex optimisation and prove that in the adversarial setting its regret is at most $d^{3.5} \sqrt{n} \mathrm{polylog}(n, d)$ with high probability where $d$ is the dimension and $n$ is the time horizon. In the stochastic setting the bound improves to $M d^{2} \sqrt{n} \mathrm{polylog}(n, d)$ where $M \in [d^{-1/2}, d^{-1/4}]$ is a constant that depends on the geometry of the constraint set and the desired computational properties.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/fokkema24a.html
https://proceedings.mlr.press/v247/fokkema24a.htmlComputation-information gap in high-dimensional clusteringWe investigate the existence of a fundamental computation-information gap for the problem of clustering a mixture of isotropic Gaussians in the high-dimensional regime, where the ambient dimension $p$ is larger than the number $n$ of points. The existence of a computation-information gap in a specific Bayesian high-dimensional asymptotic regime has been conjectured by Lesieur et al. (2016) based on the replica heuristic from statistical physics. We provide evidence of the existence of such a gap generically in the high-dimensional regime $p\geq n$, by (i) proving a non-asymptotic low-degree polynomial computational barrier for clustering in high dimension, matching the performance of the best known polynomial time algorithms, and by (ii) establishing that the information barrier for clustering is smaller than the computational barrier, when the number $K$ of clusters is large enough. These results are in contrast with the (moderately) low-dimensional regime $n\geq \text{poly}(p,K)$, where there is no computation-information gap for clustering a mixture of isotropic Gaussians. In order to prove our low-degree computational barrier, we develop sophisticated combinatorial arguments to upper-bound the mixed moments of the signal under a Bernoulli Bayesian model.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/even24a.html
https://proceedings.mlr.press/v247/even24a.htmlContraction of Markovian Operators in Orlicz Spaces and Error Bounds for Markov Chain Monte Carlo (Extended Abstract) We introduce a novel concept of convergence for Markovian processes within Orlicz spaces, extending beyond the conventional approach associated with $L_p$ spaces. After showing that Markovian operators are contractive in Orlicz spaces, our technical contribution is an upper bound on their contraction coefficient, which admits a closed-form expression. The bound is tight in some settings, and it recovers well-known results, such as the connection between contraction and ergodicity, ultra-mixing and Doeblin’s minorisation. Moreover, we can define a notion of convergence of Markov processes in Orlicz spaces, which depends on the corresponding contraction coefficient. The key novelty comes from duality considerations: the convergence of a Markovian process determined by $K$ depends on the contraction coefficient of its dual $K^\star$, which can in turn be bounded by considering appropriate nested norms of densities of $K^\star$ with respect to the stationary measure. Our approach stands out as the first of its kind, as it does not rely on the existence of a spectral gap. Specialising our approach to $L_p$ spaces leads to a significant improvement upon classical Riesz-Thorin’s interpolation methods. We present the following applications of the proposed framework: \begin{enumerate} \item Tighter bounds on the mixing time of Markovian processes: one can relate the contraction coefficient of the dual operator to the mixing time of the corresponding Markov chain regardless of the norm chosen. Consequently, our tighter bound on the contraction coefficient implies a tighter bound on the mixing time. 
We offer a result that provides an intuitive understanding of what it means to be close in a specific norm (relating the probability of any event with the probability of the same event under the stationary measure $\pi$ and a $\psi$-Orlicz/Amemiya-norm). We then focus on $L_p$ norms and show that asking for a bounded norm with larger $p$ guarantees a faster decay in the probability. This is particularly relevant for exponentially decaying probabilities under $\pi$. Moreover, by exploiting the flexibility offered by Orlicz spaces, we can tackle settings where the stationary distribution is heavy-tailed, a severely under-studied setup. \item Improved concentration bounds for MCMC methods leading to improved lower bounds on the burn-in period: by leveraging $L_p$-norms with large $p$ and our results on the contraction coefficient, similar to the approach undertaken for the mixing times, we can provide improved exponential concentration bounds for MCMC methods. \item Improved concentration bounds for sequences of Markovian random variables: we show how our results can be used to outperform existing bounds based on a change of measure technique for random variables with a Markovian dependence. In particular, we can prove exponential concentration in new settings (inaccessible to earlier approaches) and improve the rate in others. \end{enumerate}Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/esposito24a.html
https://proceedings.mlr.press/v247/esposito24a.htmlTopological Expressivity of ReLU Neural NetworksWe study the expressivity of ReLU neural networks in the setting of a binary classification problem from a topological perspective. Recently, empirical studies showed that neural networks operate by changing topology, transforming a topologically complicated data set into a topologically simpler one as it passes through the layers. This topological simplification has been measured by Betti numbers, which are algebraic invariants of a topological space. We use the same measure to establish lower and upper bounds on the topological simplification a ReLU neural network can achieve with a given architecture. We therefore contribute to a better understanding of the expressivity of ReLU neural networks in the context of binary classification problems by shedding light on their ability to capture the underlying topological structure of the data. In particular the results show that deep ReLU neural networks are exponentially more powerful than shallow ones in terms of topological simplification. This provides a mathematically rigorous explanation why deeper networks are better equipped to handle complex and topologically rich data sets.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/ergen24a.html
https://proceedings.mlr.press/v247/ergen24a.htmlThe Real Price of Bandit Information in Multiclass ClassificationWe revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input classifies to one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry is with regard to the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}(\min\{|\mathcal{H}| + \sqrt{T}, \sqrt{KT \log |\mathcal{H}|}\})}$, where $\mathcal{H}$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|\mathcal{H}|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log-factors) in all parameter regimes.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/erez24a.html
https://proceedings.mlr.press/v247/erez24a.htmlMinimax Linear Regression under the Quantile RiskWe study the problem of designing minimax procedures in linear regression under the quantile risk. We start by considering the realizable setting with independent Gaussian noise, where for any given noise level and distribution of inputs, we obtain the \emph{exact} minimax quantile risk for a rich family of error functions and establish the minimaxity of OLS. This improves on the lower bounds obtained by Lecue and Mendelson (2016) and Mendelson (2017) for the special case of square error, and provides us with a lower bound on the minimax quantile risk over larger sets of distributions. Under the square error and a fourth moment assumption on the distribution of inputs, we show that this lower bound is tight over a larger class of problems. Specifically, we prove a matching upper bound on the worst-case quantile risk of a variant of the procedure proposed by Lecue and Lerasle (2020), thereby establishing its minimaxity, up to absolute constants. We illustrate the usefulness of our approach by extending this result to all $p$-th power error functions for $p \in (2, \infty)$. Along the way, we develop a generic analogue to the classical Bayesian method for lower bounding the minimax risk when working with the quantile risk, as well as a tight characterization of the quantiles of the smallest eigenvalue of the sample covariance matrix.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/el-hanchi24a.html
https://proceedings.mlr.press/v247/el-hanchi24a.htmlOn sampling diluted Spin-Glasses using Glauber Dynamics {\em Spin-glasses} are natural Gibbs distributions that have been studied in theoretical computer science for many decades. Recently, they have been gaining renewed attention from the community as they emerge naturally in {\em neural computation} and {\em learning}, {\em network inference}, {\em optimisation} and many other areas. Here we consider the {\em {2-spin model}} at inverse temperature $\beta$ when the underlying graph is an instance of $G(n,d/n)$, i.e., the random graph on $n$ vertices such that each edge appears independently with probability $d/n$, where the expected degree $d=\Theta(1)$. We study the problem of efficiently sampling from the aforementioned distribution using the well-known Markov chain called {\em Glauber dynamics}. For a certain range of $\beta$, that depends only on the expected degree $d$ of the graph, and for typical instances of the {2-spin model} on $G(n,d/n)$, we show that the corresponding (single-site) Glauber dynamics exhibits mixing time $O\left(n^{2+\frac{3}{\log^2 d}}\right)$. The range of $\beta$ for which we obtain our rapid mixing result corresponds to the expected influence being smaller than $1/d$. We establish our results by utilising the well-known {\em path-coupling} technique. In the standard setting of Glauber dynamics on $G(n,d/n)$ one has to deal with the so-called effect of high degree vertices. Here, with the spin-glasses, rather than considering vertex-degrees, it is more natural to use a different measure on the vertices of the graph, that we call {\em aggregate influence}. We build on the block-construction approach proposed by [Dyer, Flaxman, Frieze and Vigoda: 2006] to circumvent the problem with the high degrees in the path-coupling analysis. Specifically, to obtain our results, we first establish rapid mixing for an appropriately defined block-dynamics.
We design this dynamics such that vertices of large aggregate influence are placed deep inside their blocks. Then, we obtain rapid mixing for the (single-site) Glauber dynamics by utilising a comparison argument. Sun, 30 Jun 2024 00:00:00 +0000
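As an illustration of the single-site Glauber dynamics named in the abstract above, the following is a minimal sketch and not the authors' construction. It assumes a Gibbs measure proportional to $\exp(\beta \sum_{uv} J_{uv}\sigma_u\sigma_v)$ on $G(n,d/n)$ with standard Gaussian couplings $J_{uv}$; the coupling law and the helper names `random_graph`, `prob_plus`, and `glauber` are assumptions made for this example. The block dynamics and the comparison argument are not shown.

```python
import math
import random


def random_graph(n, d, rng):
    """Sample G(n, d/n) and attach a random coupling to every edge."""
    p = d / n
    neighbors = {v: {} for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                # Assumption for this sketch: standard Gaussian couplings.
                coupling = rng.gauss(0.0, 1.0)
                neighbors[u][v] = coupling
                neighbors[v][u] = coupling
    return neighbors


def prob_plus(local_field, beta):
    """Conditional probability of spin +1 given the local field.

    Equal to 1 / (1 + exp(-2 * beta * local_field)), written with tanh
    so that large fields cannot overflow.
    """
    return 0.5 * (1.0 + math.tanh(beta * local_field))


def glauber(neighbors, beta, steps, rng):
    """Run single-site heat-bath updates from a uniformly random start."""
    n = len(neighbors)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    for _ in range(steps):
        v = rng.randrange(n)  # pick a vertex uniformly at random
        field = sum(j * spins[u] for u, j in neighbors[v].items())
        spins[v] = 1 if rng.random() < prob_plus(field, beta) else -1
    return spins


if __name__ == "__main__":
    rng = random.Random(0)
    graph = random_graph(50, 3.0, rng)
    print(glauber(graph, 0.3, 5000, rng))
```

Each update resamples one spin from its conditional distribution given its neighbours, so the Gibbs measure is stationary for the chain; how many steps are needed is exactly the mixing-time question the abstract addresses.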
https://proceedings.mlr.press/v247/efthymiou24a.html
https://proceedings.mlr.press/v247/efthymiou24a.htmlAn information-theoretic lower bound in time-uniform estimationWe present an information-theoretic lower bound for the problem of parameter estimation with time-uniform coverage guarantees. We use a reduction to sequential testing to obtain stronger lower bounds that capture the hardness of the time-uniform setting. In the case of location model estimation and logistic regression, our lower bound is $\Omega(\sqrt{n^{-1}\log \log n})$, which is tight up to constant factors in typical settings.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/duchi24a.html
https://proceedings.mlr.press/v247/duchi24a.htmlUniversal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture ModelsClustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd’s algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through Chernoff information, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, notably emphasizing location-scale mixtures featuring Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. In such mixtures, we establish that Bregman hard clustering, a variant of Lloyd’s algorithm employing a Bregman divergence, is rate optimal.Sun, 30 Jun 2024 00:00:00 +0000
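To make the Bregman hard clustering step mentioned in the abstract above concrete, here is a minimal one-dimensional sketch, assuming Poisson-like count data and the divergence generated by $\phi(x) = x\log x - x$, namely $d_\phi(x,\mu) = x\log(x/\mu) - x + \mu$. The function names, the fixed iteration count, and the small positive floor on the centres are choices made for this example, not details taken from the paper.

```python
import math


def poisson_divergence(x, mu):
    """Bregman divergence generated by phi(x) = x log x - x (requires mu > 0)."""
    if x == 0:
        return mu  # limit of x * log(x / mu) - x + mu as x -> 0
    return x * math.log(x / mu) - x + mu


def bregman_hard_clustering(points, centers, iterations=20):
    """Lloyd-style alternation with a Bregman divergence in the assignment step."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: send each point to the centre of least divergence.
        labels = [
            min(range(len(centers)),
                key=lambda k: poisson_divergence(x, centers[k]))
            for x in points
        ]
        # Update step: the arithmetic mean of each non-empty cluster.
        for k in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:
                # Keep centres strictly positive so the divergence is defined.
                centers[k] = max(sum(members) / len(members), 1e-9)
    return labels, centers


if __name__ == "__main__":
    counts = [0, 1, 2, 1, 0, 2, 20, 22, 19, 21]
    print(bregman_hard_clustering(counts, [1.0, 15.0]))
```

The update step is the plain arithmetic mean, as in Lloyd's algorithm; only the assignment step changes when the squared Euclidean distance is replaced by a Bregman divergence.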
https://proceedings.mlr.press/v247/dreveton24a.html
https://proceedings.mlr.press/v247/dreveton24a.htmlPhysics-informed machine learning as a kernel methodPhysics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. We prove that for linear differential priors, the problem can be formulated as a kernel regression task. Taking advantage of kernel theory, we derive convergence rates for the minimizer $\hat f_n$ of the regularized risk and show that $\hat f_n$ converges at least at the Sobolev minimax rate. However, faster rates can be achieved, depending on the physical error. This principle is illustrated with a one-dimensional example, supporting the claim that regularizing the empirical risk with physical information can be beneficial to the statistical performance of estimators.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/doumeche24a.html
https://proceedings.mlr.press/v247/doumeche24a.htmlOn the Growth of Mistakes in Differentially Private Online Learning: A Lower Bound Perspective In this paper, we provide lower bounds for Differentially Private (DP) Online Learning algorithms. Our result shows that, for a broad class of $(\epsilon,\delta)$-DP online algorithms, for a number of rounds $T$ such that $\log T\leq O\left(1 / \delta\right)$, the expected number of mistakes incurred by the algorithm grows as \(\Omega\left(\log T\right)\). This matches the upper bound obtained by Golowich and Livni (2021) and is in contrast to non-private online learning where the number of mistakes is independent of \(T\). To the best of our knowledge, our work is the first result towards settling lower bounds for DP online learning and partially addresses the open question in Sanyal and Ramponi (2022).Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/dmitriev24a.html
https://proceedings.mlr.press/v247/dmitriev24a.htmlEfficiently Learning One-Hidden-Layer ReLU Networks via Schur PolynomialsWe study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy. Prior work had given an algorithm for this problem with complexity $(dk/\epsilon)^{h(k)}$, where the function $h(k)$ scales super-polynomially in $k$. Interestingly, the complexity of our algorithm is near-optimal within the class of Correlational Statistical Query algorithms. At a high level, our algorithm uses tensor decomposition to identify a subspace such that all the $O(k)$-order moments are small in the orthogonal directions. Its analysis makes essential use of the theory of Schur polynomials to show that the higher-moment error tensors are small given that the lower-order ones are.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/diakonikolas24c.html
https://proceedings.mlr.press/v247/diakonikolas24c.htmlStatistical Query Lower Bounds for Learning Truncated GaussiansWe study the problem of estimating the mean of an identity covariance Gaussian in the truncated setting, in the regime when the truncation set comes from a low-complexity family $\mathcal{C}$ of sets. Specifically, for a fixed but unknown truncation set $S \subseteq \mathbb{R}^d$, we are given access to samples from the distribution $\mathcal{N}(\bm{\mu}, \vec{I})$ truncated to the set $S$. The goal is to estimate $\bm{\mu}$ within accuracy $\epsilon>0$ in $\ell_2$-norm. Our main result is a Statistical Query (SQ) lower bound suggesting a super-polynomial information-computation gap for this task. In more detail, we show that the complexity of any SQ algorithm for this problem is $d^{\mathrm{poly}(1/\epsilon)}$, even when the class $\mathcal{C}$ is simple so that $\mathrm{poly}(d/\epsilon)$ samples information-theoretically suffice. Concretely, our SQ lower bound applies when $\mathcal{C}$ is a union of a bounded number of rectangles whose VC dimension and Gaussian surface area are small. As a corollary of our construction, it also follows that the complexity of the previously known algorithm for this task is qualitatively best possible.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/diakonikolas24b.html
https://proceedings.mlr.press/v247/diakonikolas24b.htmlTestable Learning of General Halfspaces with Adversarial Label NoiseWe study the task of testable learning of general (not necessarily homogeneous) halfspaces with adversarial label noise with respect to the Gaussian distribution. In the testable learning framework, the goal is to develop a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our main result is the first polynomial time tester-learner for general halfspaces that achieves dimension-independent misclassification error. At the heart of our approach is a new methodology to reduce testable learning of general halfspaces to testable learning of nearly homogeneous halfspaces that may be of broader interest. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/diakonikolas24a.html
https://proceedings.mlr.press/v247/diakonikolas24a.htmlIs Efficient PAC Learning Possible with an Oracle That Responds "Yes" or "No"?The \emph{empirical risk minimization (ERM)} principle has been highly impactful in machine learning, leading both to near-optimal theoretical guarantees for ERM-based learning algorithms as well as driving many of the recent empirical successes in deep learning. In this paper, we investigate the question of whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a given dataset, is necessary for efficient learning: in particular, is there a weaker oracle than ERM which can nevertheless enable learnability? We answer this question affirmatively, showing that in the realizable setting of PAC learning for binary classification, a concept class can be learned using an oracle which only returns a \emph{single bit} indicating whether a given dataset is realizable by some concept in the class. The sample complexity and oracle complexity of our algorithm depend polynomially on the VC dimension of the hypothesis class, thus showing that there is only a polynomial price to pay for use of our weaker oracle. Our results extend to the agnostic learning setting with a slight strengthening of the oracle, as well as to the partial concept, multiclass and real-valued learning settings. In the setting of partial concept classes, prior to our work no oracle-efficient algorithms were known, even with a standard ERM oracle. Thus, our results address a question of Alon et al. (2021) who asked whether there are algorithmic principles which enable efficient learnability in this setting.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/daskalakis24a.html
https://proceedings.mlr.press/v247/daskalakis24a.htmlComputational-Statistical Gaps in Gaussian Single-Index Models (Extended Abstract)Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-dimensional regime. While the information-theoretic sample complexity to recover the hidden direction is linear in the dimension $d$, we show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples, where $k^\star$ is a “generative” exponent associated with the model that we explicitly characterize. Moreover, we show that this sample complexity is also sufficient, by establishing matching upper bounds using a partial-trace algorithm. Therefore, our results provide evidence of a sharp computational-to-statistical gap (under both the SQ and LDP class) whenever $k^\star>2$. To complete the study, we construct smooth and Lipschitz deterministic target functions with arbitrarily large generative exponents $k^\star$.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/damian24a.html
https://proceedings.mlr.press/v247/damian24a.htmlRefined Sample Complexity for Markov Games with Independent Linear Function Approximation (Extended Abstract)Markov Games (MG) is an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the “curse of multi-agents” (i.e., the algorithmic performance drops exponentially with the number of agents) is unavoidable until several recent works (Daskalakis et al., 2023; Cui et al., 2023; Wang et al., 2023). While these works resolved the curse of multi-agents, when the state spaces are prohibitively large and (linear) function approximations are deployed, they either had a slower convergence rate of $O(T^{-1/4})$ or brought a polynomial dependency on the number of actions $A_{\max}$ – which is avoidable in single-agent cases even when the loss functions can arbitrarily vary with time. This paper first refines the AVLPR framework by Wang et al. (2023), with an insight of designing \textit{data-dependent} (i.e., stochastic) pessimistic estimation of the sub-optimality gap, allowing a broader choice of plug-in algorithms. When specialized to MGs with independent linear function approximations, we propose novel \textit{action-dependent bonuses} to cover occasionally extreme estimation errors. With the help of state-of-the-art techniques from the single-agent RL literature, we give the first algorithm that tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency simultaneously.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/dai24a.html
https://proceedings.mlr.press/v247/dai24a.htmlLearnability Gaps of Strategic ClassificationIn contrast with standard classification tasks, strategic classification involves agents strategically modifying their features in an effort to receive favorable predictions. For instance, given a classifier determining loan approval based on credit scores, applicants may open or close their credit cards and bank accounts to fool the classifier. The learning goal is to find a classifier robust against strategic manipulations. Various settings, based on what and when information is known, have been explored in strategic classification. In this work, we focus on addressing a fundamental question: the learnability gaps between strategic classification and standard learning. We essentially show that any learnable class is also strategically learnable: we first consider a fully informative setting, where the manipulation structure (which is modeled by a manipulation graph $G^\star$) is known and during training time the learner has access to both the pre-manipulation data and post-manipulation data. We provide nearly tight sample complexity and regret bounds, offering significant improvements over prior results. Then, we relax the fully informative setting by introducing two natural types of uncertainty. First, following Ahmadi et al. (2023), we consider the setting in which the learner only has access to the post-manipulation data. We improve the results of Ahmadi et al. (2023) and close the gap between mistake upper bound and lower bound raised by them. Our second relaxation of the fully informative setting introduces uncertainty to the manipulation structure. That is, we assume that the manipulation graph is unknown but belongs to a known class of graphs. We provide nearly tight bounds on the learning complexity in various unknown manipulation graph settings. 
Notably, our algorithm in this setting is of independent interest and can be applied to other problems such as multi-label learning.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/cohen24c.html
https://proceedings.mlr.press/v247/cohen24c.htmlLower Bounds for Differential Privacy Under Continual Observation and Online Threshold QueriesOne of the most basic problems for studying the “price of privacy over time” is the so-called {\em private counter problem}, introduced by Dwork et al. (2010) and Chan et al. (2011). In this problem, we aim to track the number of {\em events} that occur over time, while hiding the existence of every single event. More specifically, in every time step $t\in[T]$ we learn (in an online fashion) that $\Delta_t\geq 0$ new events have occurred, and must respond with an estimate $n_t\approx\sum_{j=1}^t \Delta_j$. The privacy requirement is that {\em all of the outputs together}, across all time steps, satisfy {\em event level} differential privacy. The main question here is how our error needs to depend on the total number of time steps $T$ and the total number of events $n$. Dwork et al. (2015) showed an upper bound of $O\left(\log(T)+\log^2(n)\right)$, and Henzinger et al. (2023) showed a lower bound of $\Omega\left( \min\{\log n, \log T\} \right)$. We show a new lower bound of $\Omega\left(\min\{n,\log T\}\right)$, which is tight w.r.t. the dependence on $T$, and is tight in the sparse case where $\log^2 n=O(\log T)$. Our lower bound has the following implications: \begin{itemize} \item We show that our lower bound extends to the {\em online thresholds} problem, where the goal is to privately answer many “quantile queries” when these queries are presented one-by-one. This resolves an open question of Bun et al. (2017). \item Our lower bound implies, for the first time, a separation between the number of mistakes obtainable by a private online learner and a non-private online learner. This partially resolves a COLT’22 open question published by Sanyal and Ramponi.
\item Our lower bound also yields the first separation between the standard model of private online learning and a recently proposed relaxed variant of it, called {\em private online prediction}. \end{itemize} Sun, 30 Jun 2024 00:00:00 +0000
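For readers unfamiliar with the private counter problem stated in the abstract above, the following sketch shows only a naive baseline, not the mechanisms of Dwork et al. or Chan et al. and not anything from this paper. It assumes the horizon $T$ is known in advance and adds independent Laplace noise of scale $T/\varepsilon$ to every prefix sum, which gives $\varepsilon$ event-level differential privacy because one event changes at most $T$ outputs by 1 each. The function names are hypothetical.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) noise, sampled as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def naive_private_counter(deltas, epsilon, rng):
    """Release a noisy running count after every time step.

    Assumes the horizon T = len(deltas) is known up front. Adding or
    removing one event shifts up to T prefix sums by 1 each, so the L1
    sensitivity of the whole output vector is at most T.
    """
    horizon = len(deltas)
    scale = horizon / epsilon
    estimates = []
    running = 0
    for delta in deltas:
        running += delta  # true count so far, never released directly
        estimates.append(running + laplace(scale, rng))
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    print(naive_private_counter([1, 0, 2, 0, 1], epsilon=1.0, rng=rng))
```

The per-step error of this baseline grows linearly in $T$, which is why the polylogarithmic upper bounds and the $\Omega(\min\{n,\log T\})$ lower bound quoted in the abstract are the interesting regime.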
https://proceedings.mlr.press/v247/cohen24b.html
https://proceedings.mlr.press/v247/cohen24b.htmlStatistical curriculum learning: An elimination algorithm achieving an oracle riskWe consider a statistical version of curriculum learning (CL) in a parametric prediction setting. The learner is required to estimate a target parameter vector, and can adaptively collect samples from either the target model, or other source models that are similar to the target model, but less noisy. We consider three types of learners, depending on the level of side-information they receive. The first two, referred to as strong/weak-oracle learners, receive high/low degrees of information about the models, and use these to learn. The third, a fully adaptive learner, estimates the target parameter vector without any prior information. In the single source case, we propose an elimination learning method, whose risk matches that of a strong-oracle learner. In the multiple source case, we advocate that the risk of the weak-oracle learner is a realistic benchmark for the risk of adaptive learners. We develop an adaptive multiple elimination-rounds CL algorithm, and characterize instance-dependent conditions for its risk to match that of the weak-oracle learner. We consider instance-dependent minimax lower bounds, and discuss the challenges associated with defining the class of instances for the bound. We derive two minimax lower bounds, and determine the conditions under which the performance of the weak-oracle learner is minimax optimal.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/cohen24a.html
https://proceedings.mlr.press/v247/cohen24a.htmlRisk-Sensitive Online Algorithms (Extended Abstract)We study the design of risk-sensitive online algorithms, in which risk measures are used in the competitive analysis of randomized online algorithms. We introduce the CVaR$_\delta$-competitive ratio ($\delta$-CR) using the conditional value-at-risk of an algorithm’s cost, which measures the expectation of the $(1-\delta)$-fraction of worst outcomes against the offline optimal cost, and use this measure to study three online optimization problems: continuous-time ski rental, discrete-time ski rental, and one-max search. The structure of the optimal $\delta$-CR and algorithm varies significantly between problems: we prove that the optimal $\delta$-CR for continuous-time ski rental is $2-2^{-\Theta(\frac{1}{1-\delta})}$, obtained by an algorithm described by a delay differential equation. In contrast, in discrete-time ski rental with buying cost $B$, there is an abrupt phase transition at $\delta = 1 - \Theta(\frac{1}{\log B})$, after which the classic deterministic strategy is optimal. Similarly, one-max search exhibits a phase transition at $\delta = \frac{1}{2}$, after which the classic deterministic strategy is optimal; we also obtain an algorithm that is asymptotically optimal as $\delta \downarrow 0$ that arises as the solution to a delay differential equation.Sun, 30 Jun 2024 00:00:00 +0000
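The tail expectation behind the CVaR$_\delta$ measure described in the abstract above can be sketched for a discrete cost distribution as follows. This is an illustrative helper only: the name `cvar` and the list-based interface are assumptions made here, and the competitive ratio in the paper additionally compares this quantity against the offline optimal cost, which is not modelled.

```python
def cvar(costs, probs, delta):
    """Mean of the worst (1 - delta) fraction of a discrete cost distribution.

    `costs[i]` occurs with probability `probs[i]`; larger cost is worse.
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    tail = 1.0 - delta
    remaining = tail
    total = 0.0
    # Walk from the largest cost downwards, consuming probability mass
    # until the worst (1 - delta) fraction has been covered.
    for cost, prob in sorted(zip(costs, probs), reverse=True):
        mass = min(prob, remaining)
        total += cost * mass
        remaining -= mass
        if remaining <= 0.0:
            break
    return total / tail


if __name__ == "__main__":
    costs = [1.0, 2.0, 3.0, 4.0]
    probs = [0.25, 0.25, 0.25, 0.25]
    for delta in (0.0, 0.5, 0.75):
        print(delta, cvar(costs, probs, delta))
```

At $\delta = 0$ the function returns the ordinary expectation, and as $\delta$ approaches 1 it approaches the worst-case cost, which is the interpolation between expected and worst-case analysis that the abstract's phase transitions refer to.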
https://proceedings.mlr.press/v247/christianson24a.html
https://proceedings.mlr.press/v247/christianson24a.htmlUndetectable Watermarks for Language ModelsRecent advances in the capabilities of large language models such as GPT-4 have spurred increasing concern about our ability to detect AI-generated text. Prior works have suggested methods of embedding watermarks in model outputs, by *noticeably* altering the output distribution. We ask: Is it possible to introduce a watermark without incurring *any detectable* change to the output distribution? To this end, we introduce a cryptographically-inspired notion of undetectable watermarks for language models. That is, watermarks can be detected only with the knowledge of a secret key; without the secret key, it is computationally intractable to distinguish watermarked outputs from those of the original model. In particular, it is impossible for a user to observe any degradation in the quality of the text. Crucially, watermarks remain undetectable even when the user is allowed to adaptively query the model with arbitrarily chosen prompts. We construct undetectable watermarks based on the existence of one-way functions, a standard assumption in cryptography.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/christ24a.html
https://proceedings.mlr.press/v247/christ24a.htmlThe power of an adversary in Glauber dynamicsGlauber dynamics are a natural model of dynamics of dependent systems. While originally introduced in statistical physics, they have found important applications in the study of social networks, computer vision and other domains. In this work, we introduce a model of corrupted Glauber dynamics whereby instead of updating according to the prescribed conditional probabilities, some of the vertices and their updates are controlled by an adversary. We study the effect of such corruptions on global features of the system. Among the questions we study are: How many nodes need to be controlled in order to change the average statistics of the system in polynomial time? And how many nodes are needed to obstruct approximate convergence of the dynamics? Given a specific budget, how can the adversary choose nodes to control to maximize the overall effect? Our results can be viewed as studying the robustness of classical sampling methods and are thus related to robust inference. The proofs connect to classical theory of Glauber dynamics from statistical physics. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chin24a.html
https://proceedings.mlr.press/v247/chin24a.htmlNew Lower Bounds for Testing Monotonicity and Log Concavity of Distributions We develop a new technique for proving distribution testing lower bounds for properties defined by inequalities on the individual bin probabilities (such as monotonicity and log-concavity). The basic idea is to find a base distribution $Q$ where these inequalities barely hold in many places. We then find two different ensembles of distributions that modify $Q$ in slightly different ways. We use a moment matching construction so that each ensemble has the same bin moments (in particular the expectation over the choice of distribution $p$ of $p_{i}^t$ is the same for the two ensembles for small integers $t$). We show that this makes it impossible to distinguish between the two ensembles with a small number of samples. On the other hand, we construct them so that one ensemble will tweak $Q$ in such a way that it may violate the defining inequalities of the property in question in many places, while the second ensemble does not. Since any valid tester for this property must be able to reliably distinguish these ensembles, we obtain a lower bound for testing the property. Roughly speaking, if we can construct $Q$ which nearly violates the defining inequalities in $n$ places and if the desired error $\epsilon$ is small enough relative to $n$, we hope to obtain a lower bound of roughly $\frac{n}{\epsilon^2}$ up to log factors. In particular, we obtain a lower bound of $\Omega( \min(n,(1/\epsilon)/ \log^3(1/\epsilon)) / ( \epsilon^2 \log^7(1/\epsilon)))$ for monotonicity testing on $[n]$ and $\Omega(\log^{-7}(1/\epsilon) \epsilon^{-2} \min(n,\epsilon^{-1/2}\log^{-3/2}(1/\epsilon)))$ for log-concavity testing on $[n]$, the latter of which matches known upper bounds to within logarithmic factors. 
More generally, for monotonicity testing on $[n]^d$, we have the lower bound of $2^{-O(d)}d^{-d} \epsilon^{-2} \log^{-7}(1/\epsilon) \min(n,d \epsilon^{-1} \log^{-3}(1/\epsilon))^d$.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/cheng24a.html
https://proceedings.mlr.press/v247/cheng24a.htmlOpen Problem: Black-Box Reductions and Adaptive Gradient Methods for Nonconvex OptimizationWe describe an open problem: reduce offline nonconvex stochastic optimization to regret minimization in online convex optimization. The conjectured reduction aims to make progress on explaining the success of adaptive gradient methods for deep learning. A prize of 500 dollars is offered to the winner.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chen24e.html
https://proceedings.mlr.press/v247/chen24e.htmlScale-free Adversarial Reinforcement LearningThis paper initiates the study of scale-free learning in Markov Decision Processes (MDPs), where the scale of rewards/losses is unknown to the learner. We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this framework in both the adversarial Multi-armed Bandit (MAB) setting and the adversarial MDP setting. Through this framework, we achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs, resolving an open problem raised in \cite{hadiji2020adaptation}. On adversarial MDPs, our framework also gives birth to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$ high-probability regret guarantee.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chen24d.html
https://proceedings.mlr.press/v247/chen24d.htmlNear-Optimal Learning and Planning in Separated Latent MDPsWe study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs). In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs. To sidestep known impossibility results, we consider several notions of $\delta$-separation of the constituent MDPs. The main thrust of this paper is in establishing a nearly-sharp \textit{statistical threshold} for the horizon length necessary for efficient learning. On the computational side, we show that under a weaker assumption of separability under the optimal policy, there is a quasi-polynomial algorithm with time complexity scaling in terms of the statistical threshold. We further show a near-matching time complexity lower bound under the exponential time hypothesis.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chen24c.html
https://proceedings.mlr.press/v247/chen24c.htmlA faster and simpler algorithm for learning shallow networksWe revisit the well-studied problem of learning a linear combination of $k$ ReLU activations given labeled examples drawn from the standard $d$-dimensional Gaussian measure. Chen et al. recently gave the first algorithm for this problem to run in $\mathrm{poly}(d,1/\epsilon)$ time when $k = O(1)$, where $\epsilon$ is the target error. More precisely, their algorithm runs in time $(d/\epsilon)^{\mathrm{quasipoly}(k)}$ and learns over multiple stages. Here we show that a much simpler one-stage version of their algorithm suffices, and moreover its runtime is only $(d k/\epsilon)^{O(k^2)}$.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chen24b.html
https://proceedings.mlr.press/v247/chen24b.htmlOn Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved AnalysisBilevel optimization reveals the inner structure of otherwise oblique optimization problems, such as hyperparameter tuning, neural architecture search, and meta-learning. A common goal in bilevel optimization is to minimize a hyper-objective that implicitly depends on the solution set of the lower-level function. Although this hyper-objective approach is widely used, its theoretical properties have not been thoroughly investigated in cases where \textit{the lower-level functions lack strong convexity}. In this work, we first provide hardness results to show that the goal of finding stationary points of the hyper-objective for nonconvex-convex bilevel optimization can be intractable for zero-respecting algorithms. Then we study a class of tractable nonconvex-nonconvex bilevel problems when the lower-level function satisfies the Polyak-Łojasiewicz (PL) condition. We show that a simple first-order algorithm can achieve complexity bounds of $\tilde{\mathcal{O}}(\epsilon^{-2})$, $\tilde{\mathcal{O}}(\epsilon^{-4})$ and $\tilde{\mathcal{O}}(\epsilon^{-6})$ in the deterministic, partially stochastic, and fully stochastic settings, respectively.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chen24a.html
https://proceedings.mlr.press/v247/chen24a.htmlDual VC Dimension Obstructs Sample Compression by EmbeddingsThis work studies embedding of arbitrary VC classes in well-behaved VC classes, focusing particularly on extremal classes. Our main result expresses an impossibility: such embeddings necessarily require a significant increase in dimension. In particular, we prove that for every $d$ there is a class with VC dimension $d$ that cannot be embedded in any extremal class of VC dimension smaller than exponential in $d$. In addition to its independent interest, this result has an important implication in learning theory, as it reveals a fundamental limitation of one of the most extensively studied approaches to tackling the long-standing sample compression conjecture. Concretely, the approach proposed by Floyd and Warmuth entails embedding any given VC class into an extremal class of a comparable dimension, and then applying an optimal sample compression scheme for extremal classes. However, our results imply that this strategy would in some cases result in a sample compression scheme at least exponentially larger than what is predicted by the sample compression conjecture. The above implications follow from a general result we prove: any extremal class with VC dimension $d$ has dual VC dimension at most $2d+1$. This bound is exponentially smaller than the classical bound $2^{d+1}-1$ of Assouad, which applies to general concept classes (and is known to be unimprovable for some classes). We in fact prove a stronger result, establishing that $2d+1$ upper bounds the dual Radon number of extremal classes. This theorem represents an abstraction of the classical Radon theorem for convex sets, extending its applicability to a wider combinatorial framework, without relying on the specifics of Euclidean convexity. The proof utilizes the topological method and is primarily based on variants of the Topological Radon Theorem.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chase24a.html
https://proceedings.mlr.press/v247/chase24a.htmlSmoothed Analysis for Learning Concepts with Low Intrinsic DimensionIn the well-studied agnostic model of learning, the goal of a learner, given examples from an arbitrary joint distribution on $\mathbb{R}^d \times \{\pm 1\}$, is to output a hypothesis that is competitive (to within $\epsilon$) with the best-fitting concept from some class. In order to escape strong hardness results for learning even simple concept classes in this model, we introduce a smoothed analysis framework where we require a learner to compete only with the best classifier that is robust to small random Gaussian perturbation. This subtle change allows us to give a wide array of learning results for any concept that (1) depends on a low-dimensional subspace (aka multi-index model) and (2) has a bounded Gaussian surface area. This class includes functions of halfspaces and (low-dimensional) convex sets, cases that are only known to be learnable in non-smoothed settings with respect to highly structured distributions such as Gaussians. Perhaps surprisingly, our analysis also yields new results for traditional non-smoothed frameworks such as learning with margin. In particular, we obtain the first algorithm for agnostically learning intersections of $k$ halfspaces in time $k^{\mathrm{poly}(\frac{\log k}{\epsilon \gamma}) }$ where $\gamma$ is the margin parameter. Before our work, the best-known runtime was exponential in $k$ (Arriaga and Vempala, 1999). Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/chandrasekaran24a.html
https://proceedings.mlr.press/v247/chandrasekaran24a.htmlNon-Clashing Teaching Maps for Balls in GraphsRecently, Kirkpatrick et al. [ALT 2019] and Fallat et al. [JMLR 2023] introduced non-clashing teaching and showed it to be the most efficient machine teaching model satisfying the benchmark for collusion-avoidance set by Goldman and Mathias. A teaching map $T$ for a concept class $\mathcal{C}$ assigns a (teaching) set $T(C)$ of examples to each concept $C \in \mathcal{C}$. A teaching map is non-clashing if no pair of concepts are consistent with the union of their teaching sets. The size of a non-clashing teaching map (NCTM) $T$ is the maximum size of a teaching set $T(C)$, $C \in \mathcal{C}$. The non-clashing teaching dimension $\text{NCTD}(\mathcal{C})$ of $\mathcal{C}$ is the minimum size of an NCTM for $\mathcal{C}$. $\text{NCTM}^+$ and $\text{NCTD}^+(\mathcal{C})$ are defined analogously, except the teacher may only use positive examples.
We study NCTMs and $\text{NCTM}^+\text{s}$ for the concept class $\mathcal{B}(G)$ consisting of all balls of a graph $G$. We show that the associated decision problem $\text{B-NCTD}^+$ for $\text{NCTD}^+$ is NP-complete in split, co-bipartite, and bipartite graphs. Surprisingly, we even prove that, unless the ETH fails, $\text{B-NCTD}^+$ does not admit an algorithm running in time $2^{2^{o(\mathtt{vc})}}\cdot n^{\mathcal{O}(1)}$, nor a kernelization algorithm outputting a kernel with $2^{o(\mathtt{vc})}$ vertices, where $\mathtt{vc}$ is the vertex cover number of $G$. We complement these lower bounds with matching upper bounds. These are extremely rare results: it is only the second problem in NP to admit such a tight double-exponential lower bound parameterized by $\mathtt{vc}$, and only one of very few problems to admit such an ETH-based conditional lower bound on the number of vertices in a kernel. For trees, interval graphs, cycles, and trees of cycles, we derive $\text{NCTM}^+\text{s}$ or NCTMs for $\mathcal{B}(G)$ of size proportional to its VC-dimension. For Gromov-hyperbolic graphs, we design an approximate $\text{NCTM}^+$ for $\mathcal{B}(G)$ of size $2$, in which only pairs of balls with Hausdorff distance larger than some constant must satisfy the non-clashing condition.Sun, 30 Jun 2024 00:00:00 +0000
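The non-clashing condition defined in the abstract above can be checked mechanically for a small finite concept class. The Python sketch below is only an illustration of that definition: the names `consistent`, `is_non_clashing`, and `map_size`, and the three toy concepts, are hypothetical and do not come from the paper, which works with balls in graphs.

```python
from itertools import combinations


def consistent(concept, examples):
    """A concept (a set of domain points) agrees with every labeled example (x, label)."""
    return all((x in concept) == label for x, label in examples)


def is_non_clashing(teaching_map):
    """teaching_map: dict mapping each concept (frozenset) to its teaching set.

    The map clashes if two distinct concepts are BOTH consistent with the
    union of their teaching sets.
    """
    for c1, c2 in combinations(teaching_map, 2):
        union = teaching_map[c1] | teaching_map[c2]
        if consistent(c1, union) and consistent(c2, union):
            return False
    return True


def map_size(teaching_map):
    """Size of a teaching map: the largest teaching set it assigns."""
    return max(len(examples) for examples in teaching_map.values())


# Toy concept class over the domain {0, 1}.
A, B, C = frozenset({0}), frozenset({1}), frozenset({0, 1})

good_map = {A: {(1, False)}, B: {(0, False)}, C: {(0, True)}}
bad_map = {A: {(0, True)}, B: {(0, False)}, C: {(0, True)}}  # A and C cannot be told apart

good_ok = is_non_clashing(good_map)
bad_ok = is_non_clashing(bad_map)
good_size = map_size(good_map)
```

In this toy class one example per concept suffices, so its non-clashing teaching dimension is 1; restricting labels to `True` only would give the positive variant mentioned in the abstract.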
https://proceedings.mlr.press/v247/chalopin24a.html
https://proceedings.mlr.press/v247/chalopin24a.htmlInformation-theoretic generalization bounds for learning from quantum dataLearning tasks play an increasingly prominent role in quantum information and computation. They range from fundamental problems such as state discrimination and metrology over the framework of quantum probably approximately correct (PAC) learning, to the recently proposed shadow variants of state tomography. However, the many directions of quantum learning theory have so far evolved separately. We propose a mathematical formalism for describing quantum learning by training on classical-quantum data and then testing how well the learned hypothesis generalizes to new data. In this framework, we prove bounds on the expected generalization error of a quantum learner in terms of classical and quantum information-theoretic quantities measuring how strongly the learner’s hypothesis depends on the data seen during training. To achieve this, we use tools from quantum optimal transport and quantum concentration inequalities to establish non-commutative versions of decoupling lemmas that underlie classical information-theoretic generalization bounds. Our framework encompasses and gives intuitive generalization bounds for a variety of quantum learning scenarios such as quantum state discrimination, PAC learning quantum states, quantum parameter estimation, and quantumly PAC learning classical functions. Thereby, our work lays a foundation for a unifying quantum information-theoretic perspective on quantum learning.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/caro24a.html
https://proceedings.mlr.press/v247/caro24a.htmlThe Price of Adaptivity in Stochastic Convex OptimizationWe prove impossibility results for adaptivity in non-smooth stochastic convex optimization. Given a set of problem parameters we wish to adapt to, we define a “price of adaptivity” (PoA) that, roughly speaking, measures the multiplicative increase in suboptimality due to uncertainty in these parameters. When the initial distance to the optimum is unknown but a gradient norm bound is known, we show that the PoA is at least logarithmic for expected suboptimality, and double-logarithmic for median suboptimality. When there is uncertainty in both distance and gradient norm, we show that the PoA must be polynomial in the level of uncertainty. Our lower bounds nearly match existing upper bounds, and establish that there is no parameter-free lunch.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/carmon24a.html
https://proceedings.mlr.press/v247/carmon24a.htmlOpen Problem: Tight Characterization of Instance-Optimal Identity TestingIn the “instance-optimal” identity testing introduced by Valiant and Valiant (2014), one is given the (succinct) description of a discrete probability distribution $q$, as well as a parameter $\varepsilon\in(0,1]$ and i.i.d. samples from an (unknown, arbitrary) discrete distribution $p$. The goal is to distinguish with high probability between the cases (i) $p=q$ and (ii) $\textrm{TV}(p,q) > \varepsilon$, using the minimum number of samples possible as a function of (some simple functional of) $q$ and $\varepsilon$. This is in contrast with the standard formulation of identity testing, where the sample complexity is taken as worst-case over all possible reference distributions $q$. Valiant and Valiant provided upper and lower bounds on this question, where the sample complexity is expressed in terms of the “$\ell_{2/3}$ norm” of (some truncated version of) the reference distribution $q$. However, these upper and lower bounds do not always match up to constant factors, and can differ by an arbitrary multiplicative gap for some choices of $q$. The question then is: what is the tight characterization of the sample complexity of instance-optimal identity testing? What is the “right” functional $\Phi(q)$ for it? Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/canonne24a.html
https://proceedings.mlr.press/v247/canonne24a.htmlComputational-Statistical Gaps for Improper Learning in Sparse Linear RegressionWe study computational-statistical gaps for improper learning in sparse linear regression. More specifically, given $n$ samples from a $k$-sparse linear model in dimension $d$, we ask what is the minimum sample complexity to efficiently (in time polynomial in $d$, $k$, and $n$) find a potentially dense estimate for the regression vector that achieves non-trivial prediction error on the $n$ samples. Information-theoretically this can be achieved using $\Theta(k \log (d/k))$ samples. Yet, despite its prominence in the literature, there is no polynomial-time algorithm known to achieve the same guarantees using fewer than $\Theta(d)$ samples without additional restrictions on the model. Similarly, existing hardness results are either restricted to the proper setting, in which the estimate must be sparse as well, or only apply to specific algorithms. We give evidence that efficient algorithms for this task require at least (roughly) $\Omega(k^2)$ samples. In particular, we show that an improper learning algorithm for sparse linear regression can be used to solve sparse PCA problems (with a negative spike) in their Wishart form, in regimes in which efficient algorithms are widely believed to require at least $\Omega(k^2)$ samples. We complement our reduction with low-degree and statistical query lower bounds for the sparse PCA problems from which we reduce. Our hardness results apply to the (correlated) random design setting in which the covariates are drawn i.i.d. from a mean-zero Gaussian distribution with unknown covariance.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/buhai24a.html
https://proceedings.mlr.press/v247/buhai24a.htmlInsufficient Statistics Perturbation: Stable Estimators for Private Least Squares Extended AbstractWe present a sample- and time-efficient differentially private algorithm for ordinary least squares, with error that depends linearly on the dimension and is independent of the condition number of $X^\top X$, where $X$ is the design matrix. All prior private algorithms for this task require either $d^{3/2}$ examples, error growing polynomially with the condition number, or exponential time. Our near-optimal accuracy guarantee holds for any dataset with bounded statistical leverage and bounded residuals. Technically, we build on the approach of Brown et al. (2023) for private mean estimation, adding scaled noise to a carefully designed stable nonprivate estimator of the empirical regression vector.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/brown24b.html
https://proceedings.mlr.press/v247/brown24b.htmlOnline Stackelberg Optimization via Nonlinear ControlIn repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy \textit{local controllability}, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient $O(\sqrt{T})$ regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally \textit{action-linear}. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for \textit{strongly} or \textit{weakly} locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/brown24a.html
https://proceedings.mlr.press/v247/brown24a.htmlEfficient Algorithms for Learning Monophonic Halfspaces in GraphsWe study the problem of learning a binary classifier on the vertices of a graph. In particular, we consider classifiers given by \emph{monophonic halfspaces}, partitions of the vertices that are convex in a certain abstract sense. Monophonic halfspaces, and related notions such as geodesic halfspaces, have recently attracted interest, and several connections have been drawn between their properties (e.g., their VC dimension) and the structure of the underlying graph $G$. We prove several novel results for learning monophonic halfspaces in the supervised, online, and active settings. Our main result is that a monophonic halfspace can be learned with near-optimal passive sample complexity in time polynomial in $n=|V(G)|$. This requires us to devise a polynomial-time algorithm for consistent hypothesis checking, based on several structural insights on monophonic halfspaces and on a reduction to 2-satisfiability. We prove similar results for the online and active settings. We also show that the concept class can be enumerated with delay $\mathrm{poly}(n)$, and that empirical risk minimization can be performed in time $2^{\omega(G)}\mathrm{poly}(n)$ where $\omega(G)$ is the clique number of $G$. These results answer open questions from the literature (González et al. 2020), and show a contrast with geodesic halfspaces, for which some of the said problems are NP-hard (Seiffarth et al., 2023).Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/bressan24b.html
https://proceedings.mlr.press/v247/bressan24b.htmlA Theory of Interpretable ApproximationsCan a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are \emph{interpretable} by humans. In this work we study such questions by introducing \emph{interpretable approximations}, a notion that captures the idea of approximating a target concept $c$ by a small aggregation of concepts from some base class $\mathcal{H}$. In particular, we consider the approximation of a binary concept $c$ by decision trees based on a simple class $\mathcal{H}$ (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following remarkable trichotomy. For any given pair of $\mathcal{H}$ and $c$, exactly one of these cases holds: (i) $c$ cannot be approximated by $\mathcal{H}$ with arbitrary accuracy; (ii) $c$ can be approximated by $\mathcal{H}$ with arbitrary accuracy, but there exists no universal rate that bounds the complexity of the approximations as a function of the accuracy; or (iii) there exists a constant $\kappa$ that depends only on $\mathcal{H}$ and $c$ such that, for \emph{any} data distribution and \emph{any} desired accuracy level, $c$ can be approximated by $\mathcal{H}$ with a complexity not exceeding $\kappa$. This taxonomy stands in stark contrast to the landscape of supervised classification, which offers a complex array of distribution-free and universally learnable scenarios. We show that, in the case of interpretable approximations, even a slightly nontrivial a-priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-free) complexity. 
We extend our trichotomy to classes $\mathcal{H}$ of unbounded VC dimension and give characterizations of interpretability based on the algebra generated by $\mathcal{H}$.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/bressan24a.html
https://proceedings.mlr.press/v247/bressan24a.htmlThresholds for Reconstruction of Random Hypergraphs From Graph ProjectionsThe graph projection of a hypergraph is a simple graph with the same vertex set and with an edge between each pair of vertices that appear in a hyperedge. We consider the problem of reconstructing a random $d$-uniform hypergraph from its projection. Feasibility of this task depends on $d$ and the density of hyperedges in the random hypergraph. For $d=3$ we precisely determine the threshold, while for $d\ge 4$ we give bounds. All of our feasibility results are obtained by exhibiting an efficient algorithm for reconstructing the original hypergraph, while infeasibility is information-theoretic.
Our results also apply to mildly inhomogeneous random hypergraphs, including hypergraph stochastic block models (HSBM). A consequence of our results is an optimal HSBM recovery algorithm, improving on Gaudio and Joshi (2023a).
Sun, 30 Jun 2024 00:00:00 +0000
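To make the definition of the graph projection in the abstract above concrete, here is a small Python sketch that computes it and shows why reconstruction can be ambiguous. The function name `graph_projection` and the two toy hypergraphs are illustrative assumptions; this is not the paper's reconstruction algorithm.

```python
from itertools import combinations


def graph_projection(hyperedges):
    """Return the edge set of the projection: every pair of vertices
    that appears together in at least one hyperedge."""
    edges = set()
    for hyperedge in hyperedges:
        for u, v in combinations(sorted(hyperedge), 2):
            edges.add((u, v))
    return edges


# Two different 3-uniform hypergraphs on the vertex set {1, 2, 3, 4}.
full = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}]
partial = full[:3]  # drop the hyperedge {2, 3, 4}

proj_full = graph_projection(full)
proj_partial = graph_projection(partial)
same_projection = proj_full == proj_partial  # both are the complete graph on 4 vertices
```

Both hypergraphs project to the same complete graph, so the projection alone cannot distinguish them. Configurations like this are one simple way to see why feasibility of reconstruction depends on the hyperedge density.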
https://proceedings.mlr.press/v247/bresler24a.html
https://proceedings.mlr.press/v247/bresler24a.htmlErrors are Robustly Tamed in Cumulative Knowledge ProcessesWe study processes of societal knowledge accumulation, where the validity of a new unit of knowledge depends both on the correctness of its derivation and on the validity of the units it depends on. A fundamental question in this setting is: If a constant fraction of the new derivations is wrong, can investing a constant fraction, bounded away from one, of effort ensure that a constant fraction of knowledge in society is valid? Ben-Eliezer, Mikulincer, Mossel, and Sudan (ITCS 2023) introduced a concrete probabilistic model to analyze such questions and showed an affirmative answer to this question. Their study, however, focuses on the simple case where each new unit depends on just one existing unit, and units attach according to a {\em preferential attachment rule}. In this work, we consider much more general families of cumulative knowledge processes, where new units may attach according to varied attachment mechanisms and depend on multiple existing units. We also allow a (random) fraction of insertions of adversarial nodes. We give a robust affirmative answer to the above question by showing that for \textit{all} of these models, as long as many of the units follow simple heuristics for checking a bounded number of units they depend on, all errors will be eventually eliminated. Our results indicate that preserving the quality of large interdependent collections of units of knowledge is feasible, as long as careful but not too costly checks are performed when new units are derived/deposited.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/brandenberger24a.html
https://proceedings.mlr.press/v247/brandenberger24a.htmlOn the Performance of Empirical Risk Minimization with Smoothed DataIn order to circumvent statistical and computational hardness results in sequential decision-making, recent work has considered smoothed online learning, where the distribution of data at each time is assumed to have bounded likelihood ratio with respect to a base measure when conditioned on the history. While previous works have demonstrated the benefits of smoothness, they have either assumed that the base measure is known to the learner or have presented computationally inefficient algorithms applying only in special cases. This work investigates the more general setting where the base measure is \emph{unknown} to the learner, focusing in particular on the performance of Empirical Risk Minimization (ERM) with square loss when the data are well-specified and smooth. We show that in this setting, ERM is able to achieve sublinear error whenever a class is learnable with iid data; in particular, ERM achieves error scaling as $\tilde O( \sqrt{\mathrm{comp}(\mathcal F) \cdot T} )$, where $\mathrm{comp}(\mathcal{F})$ is the statistical complexity of learning $\mathcal F$ with iid data. In so doing, we prove a novel norm comparison bound for smoothed data that comprises the first sharp norm comparison for dependent data applying to arbitrary, nonlinear function classes. We complement these results with a lower bound indicating that our analysis of ERM is essentially tight, establishing a separation in the performance of ERM between smoothed and iid data.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/block24a.html
https://proceedings.mlr.press/v247/block24a.htmlCorrelated Binomial ProcessCohen and Kontorovich (COLT 2023) initiated the study of what we call here the Binomial Empirical Process: the maximal empirical mean deviation for sequences of binary random variables (up to rescaling, the empirical mean of each entry of the random sequence is a binomial hence the naming). They almost fully analyzed the case where the binomials are independent, which corresponds to all random variable entries from the sequence being independent. The remaining gap was closed by Blanchard and Voráček (ALT 2024). In this work, we study the much more general and challenging case with correlations. In contradistinction to Gaussian processes, whose behavior is characterized by the covariance structure, we discover that, at least somewhat surprisingly, for binomial processes covariance does not even characterize convergence. Although a full characterization remains out of reach, we take the first steps with nontrivial upper and lower bounds in terms of covering numbers.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/blanchard24a.html
https://proceedings.mlr.press/v247/blanchard24a.htmlMetric Clustering and MST with Strong and Weak Distance OraclesWe study optimization problems in a metric space $(\mathcal{X},d)$ where we can compute distances in two ways: via a “strong” oracle that returns exact distances $d(x,y)$, and a “weak” oracle that returns distances $\tilde{d}(x,y)$ which may be arbitrarily corrupted with some probability. This model captures the increasingly common trade-off between employing an expensive similarity model (e.g. a large-scale embedding model) and a less accurate but cheaper model. Hence, the goal is to make as few queries to the strong oracle as possible. We consider both “point queries”, where the strong oracle is queried on a set of points $S \subset \mathcal{X}$ and returns $d(x,y)$ for all $x,y \in S$, and “edge queries” where it is queried for individual distances $d(x,y)$. Our main contributions are optimal algorithms and lower bounds for clustering and Minimum Spanning Tree (MST) in this model. For $k$-centers, $k$-median, and $k$-means, we give constant factor approximation algorithms with only $\tilde{O}(k)$ strong oracle point queries, and prove that $\Omega(k)$ queries are required for any bounded approximation. For edge queries, our upper and lower bounds are both $\tilde{\Theta}(k^2)$. Surprisingly, for the MST problem we give an $O(\sqrt{\log n})$ approximation algorithm using no strong oracle queries at all, and we prove a matching $\Omega(\sqrt{\log n})$ lower bound which holds even if $\tilde{\Omega}(n)$ strong oracle point queries are allowed. Furthermore, we empirically evaluate our algorithms, and show that their quality is comparable to that of the baseline algorithms that are given all true distances, but while querying the strong oracle on only a small fraction ($<1\%$) of points.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/bateni24a.html
https://proceedings.mlr.press/v247/bateni24a.htmlDetection of $L_∞$ Geometry in Random Geometric Graphs: Suboptimality of Triangles and Cluster ExpansionIn this paper we study the random geometric graph $\mathsf{RGG}(n,\mathbb{T}^d,\mathsf{Unif},\sigma^q_p,p)$ with $L_q$ distance where each vertex is sampled uniformly from the $d$-dimensional torus and where the connection radius is chosen so that the marginal edge probability is $p$. In addition to results addressing other questions, we make progress on determining when it is possible to distinguish $\mathsf{RGG}(n,\mathbb{T}^d,\mathsf{Unif},\sigma^q_p,p)$ from the Erdős-Rényi graph $\mathsf{G}(n,p)$. Our strongest result is in the setting $q = \infty$, in which case $\mathsf{RGG}(n,\mathbb{T}^d,\mathsf{Unif},\sigma^q_p,p)$ is the \textsf{AND} of $d$ 1-dimensional random geometric graphs. We derive a formula similar to the \emph{cluster-expansion} from statistical physics, capturing the compatibility of subgraphs from each of the $d$ 1-dimensional copies, and use it to bound the signed expectations of small subgraphs. We show that counting signed 4-cycles is optimal among all low-degree tests, succeeding with high probability if and only if $d = \tilde{o}(np).$ In contrast, the signed triangle test is suboptimal and only succeeds when $d = \tilde{o}((np)^{3/4}).$ Our result stands in sharp contrast to the existing literature on random geometric graphs (mostly focused on $L_2$ geometry) where the signed triangle statistic is optimal.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/bangachev24a.html
https://proceedings.mlr.press/v247/bangachev24a.htmlThe SMART approach to instance-optimal online learningWe devise an online learning algorithm – titled Switching via Monotone Adapted Regret Traces (SMART) – that adapts to the data and achieves regret that is instance optimal, i.e., simultaneously competitive on every input sequence compared to the performance of the follow-the-leader (FTL) policy and the worst case guarantee of any other input policy. We show that the regret of the SMART policy on any input sequence is within a multiplicative factor e/(e-1), approximately 1.58, of the smaller of: 1) the regret obtained by FTL on the sequence, and 2) the upper bound on regret guaranteed by the given worst-case policy. This implies a strictly stronger guarantee than typical ‘best-of-both-worlds’ bounds as the guarantee holds for every input sequence regardless of how it is generated. SMART is simple to implement as it begins by playing FTL and switches at most once during the time horizon to the worst-case algorithm. Our approach and results follow from a reduction of instance optimal online learning to competitive analysis for the ski-rental problem. We complement our competitive ratio upper bounds with a fundamental lower bound showing that over all input sequences, no algorithm can get better than a 1.43-fraction of the minimum regret achieved by FTL and the minimax-optimal policy. We present a modification of SMART that combines FTL with a “small-loss” algorithm to achieve instance optimality between the regret of FTL and the small-loss regret bound.Sun, 30 Jun 2024 00:00:00 +0000
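The factor e/(e-1) quoted in the SMART abstract is also the classical competitive ratio of randomized ski rental, the problem the abstract reduces to. The sketch below numerically checks that ratio for continuous ski rental with a unit purchase price. It illustrates the classical ski-rental problem only, not the SMART algorithm itself; the function names and the integration step count are illustrative assumptions.

```python
import math


def expected_cost(season, steps=20_000):
    """Expected cost of randomized ski rental with purchase price 1.

    The skier rents (cost 1 per unit time) until a random threshold x in
    [0, 1], drawn with density e^x / (e - 1), and then buys. If the season
    ends before x, only the rent is paid. Integrated with the midpoint rule.
    """
    dx = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        density = math.exp(x) / (math.e - 1.0)
        cost = season if season < x else x + 1.0
        total += cost * density * dx
    return total


def competitive_ratio(season):
    """Expected cost divided by the offline optimum min(season, 1)."""
    return expected_cost(season) / min(season, 1.0)


if __name__ == "__main__":
    for season in (0.25, 0.5, 1.0, 3.0):
        print(season, round(competitive_ratio(season), 4))
    print("e/(e-1) =", round(math.e / (math.e - 1.0), 4))
```

In this analogy, renting plays the role of continuing with FTL and buying plays the role of switching once to the worst-case policy; how SMART actually sets its switching threshold is specified in the paper, not here.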
https://proceedings.mlr.press/v247/banerjee24a.html
https://proceedings.mlr.press/v247/banerjee24a.htmlOpen Problem: What is the Complexity of Joint Differential Privacy in Linear Contextual Bandits?Contextual bandits serve as a theoretical framework to design recommender systems, which often rely on user-sensitive data, making privacy a critical concern. However, a significant gap remains between the known upper and lower bounds on the regret achievable in linear contextual bandits under Joint Differential Privacy (JDP), which is a popular privacy definition used in this setting. In particular, the best regret upper bound is known to be $O\left(d \sqrt{T} \log(T) + d^{3/4} \sqrt{T \log(1/\delta)} / \sqrt{\epsilon} \right)$, while the lower bound is $\Omega \left(\sqrt{d T \log(K)} + d/(\epsilon + \delta)\right)$. We discuss the recent progress on this problem, in both algorithm design and lower bound techniques, and posit the open questions.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/azize24a.html
https://proceedings.mlr.press/v247/azize24a.htmlLearning Neural Networks with Sparse ActivationsA core component present in many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this, we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/awasthi24a.html
https://proceedings.mlr.press/v247/awasthi24a.htmlUniversal Rates for Regression: Separations between Cut-Off and Absolute LossIn this work we initiate the study of regression in the universal rates framework of Bousquet et al. Unlike the traditional uniform learning setting, we are interested in obtaining learning guarantees that hold for all fixed data-generating distributions, but do not hold uniformly across them. We focus on the realizable setting and we consider two different well-studied loss functions: the cut-off loss at scale $\gamma > 0$, which asks for predictions that are $\gamma$-close to the correct one, and the absolute loss, which measures how far away the prediction is from the correct one. Our results show that the landscape of the achievable rates in the two cases is completely different. First we give a trichotomic characterization of the optimal learning rates under the cut-off loss: each class is learnable either at an exponential rate, a (nearly) linear rate or requires arbitrarily slow rates. Moving to the absolute loss, we show that the achievable learning rates are significantly more involved by illustrating that an infinite number of different optimal learning rates is achievable. This is the first time that such a rich landscape of rates is obtained in the universal rates literature.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/attias24a.html
https://proceedings.mlr.press/v247/attias24a.htmlThe Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed BanditsWe give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(n/\Delta^2)$ requires $\Omega(\log{(1/\Delta)}/\log\log{(1/\Delta)})$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O(\log(1/\Delta))$ pass algorithm of Jin et al. [ICML’21] (up to lower order terms) that only uses $O(1)$ memory and answers an open question posed by Assadi and Wang [STOC’20].Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/assadi24a.html
https://proceedings.mlr.press/v247/assadi24a.htmlOpen Problem: Can Local Regularization Learn All Multiclass Problems?Multiclass classification is the simple generalization of binary classification to arbitrary label sets. Despite its simplicity, it has been remarkably resistant to study: a characterization of multiclass learnability was established only two years ago by Brukhim et al. 2022, and the understanding of optimal learners for multiclass problems remains fairly limited. We ask whether there exists a simple algorithmic template — akin to empirical risk minimization (ERM) for binary classification — which characterizes multiclass learning. Namely, we ask whether local regularization, introduced by Asilis et al. 2024, is sufficiently expressive to learn all multiclass problems possible. Towards (negatively) resolving the problem, we propose a hypothesis class which may not be learnable by any such local regularizer.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/asilis24b.html
https://proceedings.mlr.press/v247/asilis24b.htmlRegularization and Optimal Multiclass LearningThe quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. Relatedly, the practice of machine learning is rife with considerably richer algorithmic techniques, perhaps the most notable of which is regularization. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to precisely characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam’s Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian inference. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem’s transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs – judged using nodes’ outdegrees minus a system of node-dependent credits – characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/asilis24a.html
https://proceedings.mlr.press/v247/asilis24a.htmlUniversally Instance-Optimal Mechanisms for Private Statistical Estimation We consider the problem of instance-optimal statistical estimation under the constraint of differential privacy where mechanisms must adapt to the difficulty of the input dataset. We prove a new instance specific lower bound using a new divergence and show it characterizes the local minimax optimal rates for private statistical estimation. We propose two new mechanisms that are universally instance-optimal for general estimation problems up to logarithmic factors. Our first mechanism, the total variation mechanism, builds on the exponential mechanism with stable approximations of the total variation distance, and is universally instance-optimal in the high privacy regime $\epsilon \leq 1/\sqrt{n}$. Our second mechanism, the T-mechanism, is based on the T-estimator framework (Birgé, 2006) using the clipped log likelihood ratio as a stable test: it attains instance-optimal rates for any $\epsilon \leq 1$ up to logarithmic factors. Finally, we study the implications of our results to robust statistical estimation, and show that our algorithms are universally optimal for this problem, characterizing the optimal minimax rates for robust statistical estimation.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/asi24a.html
https://proceedings.mlr.press/v247/asi24a.htmlMode Estimation with Partial Feedback The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize key aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution with partial feedback. We showcase how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our original problem. Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/arnal24a.html
https://proceedings.mlr.press/v247/arnal24a.htmlTwo fundamental limits for uncertainty quantification in predictive inferenceWe study the statistical hardness of estimating two basic representations of uncertainty in predictive inference: prediction sets and calibration error. First, we show that conformal prediction sets cannot approach a desired weighted conformal coverage level—with respect to a family of binary witness functions with VC dimension $d$—at a minimax rate faster than $O(d^{1/2}n^{-1/2})$. We also show that the algorithm in Gibbs et al. (2023) achieves this rate and that extending our class of conformal sets beyond thresholds of non-conformity scores to include arbitrary convex sets of non-conformity scores only improves the minimax rate by a constant factor. Then, under a similar VC dimension constraint on the witness function class, we show it is not possible to estimate the weighted weak calibration error at a minimax rate faster than $O(d^{1/4}n^{-1/2})$. We show that the algorithm in Kumar et al. (2019) achieves this rate in the particular case of estimating the squared weak calibration error of a predictor that outputs $d$ distinct values.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/areces24a.html
https://proceedings.mlr.press/v247/areces24a.htmlFast parallel sampling under isoperimetryWe show how to sample in parallel from a distribution $\pi$ over $\mathbb{R}^d$ that satisfies a log-Sobolev inequality and has a smooth log-density, by parallelizing the Langevin (resp. underdamped Langevin) algorithms. We show that our algorithm outputs samples from a distribution $\hat{\pi}$ that is close to $\pi$ in Kullback–Leibler (KL) divergence (resp. total variation (TV) distance), while using only $\log(d)^{O(1)}$ parallel rounds and $\widetilde{O}(d)$ (resp. $\widetilde O(\sqrt d)$) gradient evaluations in total. This constitutes the first parallel sampling algorithms with TV distance guarantees. For our main application, we show how to combine the TV distance guarantees of our algorithms with prior works and obtain RNC sampling-to-counting reductions for families of discrete distributions on the hypercube $\{\pm 1\}^n$ that are closed under exponential tilts and have bounded covariance. Consequently, we obtain an RNC sampler for directed Eulerian tours and asymmetric determinantal point processes, resolving open questions raised in prior works.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/anari24a.html
https://proceedings.mlr.press/v247/anari24a.htmlMitigating Covariate Shift in Misspecified Regression with Applications to Reinforcement LearningA pervasive phenomenon in machine learning applications is \emph{distribution shift}, where training and deployment conditions for a machine learning model differ. As distribution shift typically results in a degradation in performance, much attention has been devoted to algorithmic interventions that mitigate these detrimental effects. This paper studies the effect of distribution shift in the presence of model misspecification, specifically focusing on $L_{\infty}$-misspecified regression and \emph{adversarial covariate shift}, where the regression target remains fixed while the covariate distribution changes arbitrarily. We show that empirical risk minimization, or standard least squares regression, can result in undesirable \emph{misspecification amplification} where the error due to misspecification is amplified by the density ratio between the training and testing distributions. As our main result, we develop a new algorithm—inspired by robust optimization techniques—that avoids this undesirable behavior, resulting in no misspecification amplification while still obtaining optimal statistical rates. As applications, we use this regression procedure to obtain new guarantees in offline and online reinforcement learning with misspecification and establish new separations between previously studied structural conditions and notions of coverage.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/amortila24a.html
https://proceedings.mlr.press/v247/amortila24a.htmlA Unified Characterization of Private Learnability via Graph TheoryWe provide a unified framework for characterizing pure and approximate differentially private (DP) learnability. The framework uses the language of graph theory: for a concept class $\mathcal{H}$, we define the contradiction graph $G$ of $\mathcal{H}$. Its vertices are realizable datasets and two datasets $S,S'$ are connected by an edge if they contradict each other (i.e., there is a point $x$ that is labeled differently in $S$ and $S'$). Our main finding is that the combinatorial structure of $G$ is deeply related to learning $\mathcal{H}$ under DP. Learning $\mathcal{H}$ under pure DP is captured by the fractional clique number of $G$. Learning $\mathcal{H}$ under approximate DP is captured by the clique number of $G$. Consequently, we identify graph-theoretic dimensions that characterize DP learnability: the \emph{clique dimension} and \emph{fractional clique dimension}. Along the way, we reveal properties of the contradiction graph which may be of independent interest. We also suggest several open questions and directions for future research.Sun, 30 Jun 2024 00:00:00 +0000
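The contradiction graph defined in the abstract above can be built by brute force for a tiny concept class. The sketch below is a minimal illustration under stated assumptions: datasets are ordered tuples of (point, label) pairs of a fixed size, and the threshold class over the domain {0, 1, 2} is a hypothetical example chosen here, not one taken from the paper. All function names are illustrative.

```python
from itertools import combinations, product


def realizable_datasets(domain, hypotheses, size):
    """All labeled datasets of the given size that some hypothesis labels correctly."""
    datasets = set()
    for h in hypotheses:
        for points in product(domain, repeat=size):
            datasets.add(tuple((x, h(x)) for x in points))
    return sorted(datasets)


def contradict(s1, s2):
    """True if some point x receives different labels in s1 and s2."""
    labels = dict(s1)  # each realizable dataset labels every point consistently
    return any(x in labels and labels[x] != y for x, y in s2)


def contradiction_graph(domain, hypotheses, size):
    """Vertices are realizable datasets; edges join contradicting pairs."""
    vertices = realizable_datasets(domain, hypotheses, size)
    edges = [(a, b) for a, b in combinations(vertices, 2) if contradict(a, b)]
    return vertices, edges


if __name__ == "__main__":
    domain = [0, 1, 2]
    # Hypothetical example class: thresholds h_t(x) = 1 if x >= t else 0.
    thresholds = [lambda x, t=t: int(x >= t) for t in range(4)]
    vertices, edges = contradiction_graph(domain, thresholds, size=1)
    print(len(vertices), "vertices,", len(edges), "edges")
```

With datasets of size 1, each domain point contributes two vertices (one per label) joined by a single edge. The clique and fractional clique numbers mentioned in the abstract would be computed on graphs like this one, which this sketch does not attempt.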
https://proceedings.mlr.press/v247/alon24a.html
https://proceedings.mlr.press/v247/alon24a.htmlMetalearning with Very Few Samples Per Task Metalearning and multitask learning are two frameworks for solving a group of related learning tasks more efficiently than we could hope to solve each of the individual tasks on their own. In multitask learning, we are given a fixed set of related learning tasks and need to output one accurate model per task, whereas in metalearning we are given tasks that are drawn i.i.d. from a metadistribution and need to output some common information that can be easily specialized to new, previously unseen tasks from the metadistribution. In this work, we consider a binary classification setting where tasks are related by a shared representation, that is, every task $P$ of interest can be solved by a classifier of the form $f_{P} \circ h$ where $h \in \mathcal{H}$ is a map from features to some representation space that is shared across tasks, and $f_{P} \in \mathcal{F}$ is a task-specific classifier from the representation space to labels. The main question we ask in this work is how much data do we need to metalearn a good representation? Here, the amount of data is measured in terms of both the number of tasks $t$ that we need to see and the number of samples $n$ per task. We focus on the regime where the number of samples per task is extremely small. Our main result shows that, in a distribution-free setting where the feature vectors are in $\mathbb{R}^d$, the representation is a linear map from $\mathbb{R}^d \to \mathbb{R}^k$, and the task-specific classifiers are halfspaces in $\mathbb{R}^k$, we can metalearn a representation with error $\varepsilon$ using just $n = k+2$ samples per task, and $d \cdot (1/\varepsilon)^{O(k)}$ tasks. Learning with so few samples per task is remarkable because metalearning would be impossible with $k+1$ samples per task, and because we cannot even hope to learn an accurate task-specific classifier with just $k+2$ samples per task. 
To obtain this result, we develop a sample-and-task-complexity theory for distribution-free metalearning and multitask learning, which identifies what properties of $\mathcal{F}$ and $\mathcal{H}$ make metalearning possible with few samples per task. Our theory also yields a simple characterization of distribution-free multitask learning. Finally, we give sample-efficient reductions between metalearning and multitask learning, which, when combined with our characterization of multitask learning, give a characterization of metalearning in certain parameter regimes.Sun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/aliakbarpour24a.html
https://proceedings.mlr.press/v247/aliakbarpour24a.htmlConference on Learning Theory 2024: PrefaceSun, 30 Jun 2024 00:00:00 +0000
https://proceedings.mlr.press/v247/agrawal24a.html
https://proceedings.mlr.press/v247/agrawal24a.htmlMajority-of-Three: The Simplest Optimal Learner?Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke’s algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm’s error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.Sun, 30 Jun 2024 00:00:00 +0000
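The learner studied in the abstract above, a majority vote of three ERM classifiers, can be sketched in a few lines. The abstract does not say how the three training sets are formed, so the split into three equal contiguous parts is an assumption made here, as is the choice of one-dimensional threshold classifiers as the hypothesis class. All names are illustrative.

```python
def erm_threshold(sample):
    """ERM for thresholds h_t(x) = 1 if x >= t else 0 on realizable data.

    Returns the smallest positively labeled point, which is consistent with
    every example whenever the labels come from some threshold.
    """
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")


def majority_of_three(sample):
    """Train ERM on three disjoint parts of the sample and vote.

    Assumption: the sample is split into three equal contiguous parts.
    """
    third = len(sample) // 3
    parts = [sample[:third], sample[third:2 * third], sample[2 * third:]]
    thresholds = [erm_threshold(part) for part in parts]

    def predict(x):
        votes = sum(1 for t in thresholds if x >= t)
        return 1 if votes >= 2 else 0

    return predict


if __name__ == "__main__":
    # Labels generated by the (hypothetical) true threshold 0.5.
    sample = [(0.1, 0), (0.9, 1), (0.6, 1), (0.3, 0), (0.55, 1), (0.7, 1)]
    predict = majority_of_three(sample)
    print([predict(x) for x in (0.2, 0.58, 0.65)])
```

For thresholds, the majority vote equals the classifier given by the median of the three learned thresholds, which is why a single poorly placed ERM threshold is outvoted.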
https://proceedings.mlr.press/v247/aden-ali24a.html
https://proceedings.mlr.press/v247/aden-ali24a.htmlLimits of Approximating the Median Treatment EffectAverage Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\mathbf{a}, \mathbf{b}$, much of the prior work focused on estimating median$(\mathbf{a}) -$ median$(\mathbf{b})$, as it is easier to estimate than the desired estimand of median$(\mathbf{a-b})$, called the Median Treatment Effect (MTE). In this work, we argue that MTE is not estimable and detail a novel notion of approximation that relies on the sorted order of the values in $\mathbf{a-b}$: we approximate the median by a value whose quantiles in $\mathbf{a-b}$ are close to $0.5$ (median). Next, we identify a quantity called \emph{variability} that exactly captures the complexity of MTE estimation. Using this, we establish that when potential outcomes take values in the set $\{0,1,\ldots,k-1\}$ the worst-case (over inputs $\mathbf{a,b}$) optimal (over algorithms) approximation factor of the MTE is $\frac{1}{2}\cdot \frac{2k-3}{2k-1}$. Further, by drawing connections to the notions of instance-optimality studied in theoretical computer science, we show that \emph{every} algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability, on roughly a per input basis: hence, variability leads to an almost instance optimal approximation algorithm for estimating the MTE. Finally, we provide a simple linear time algorithm for computing the variability exactly. 
Unlike much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., take one of $k$ discrete values $\{0,1,\ldots,k-1\}$.Sun, 30 Jun 2024 00:00:00 +0000
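The distinction in the abstract above between median$(\mathbf{a}) -$ median$(\mathbf{b})$ and the Median Treatment Effect median$(\mathbf{a-b})$ is easy to see on a small example. The sketch below assumes, purely for illustration, that both potential outcome vectors are fully known, which is not the case in an actual experiment. The vectors are hypothetical 3-ary data, and `worst_case_factor` simply evaluates the expression $\frac{1}{2}\cdot \frac{2k-3}{2k-1}$ quoted in the abstract.

```python
from fractions import Fraction
from statistics import median


def difference_of_medians(a, b):
    """median(a) - median(b): the quantity much prior work estimates."""
    return median(a) - median(b)


def median_treatment_effect(a, b):
    """median(a - b): the median of the individual-level effects."""
    return median([x - y for x, y in zip(a, b)])


def worst_case_factor(k):
    """The expression (1/2) * (2k - 3) / (2k - 1) from the abstract."""
    return Fraction(1, 2) * Fraction(2 * k - 3, 2 * k - 1)


if __name__ == "__main__":
    # Hypothetical potential outcomes with values in {0, 1, 2} (k = 3).
    a = [0, 1, 2]
    b = [2, 0, 1]
    print(difference_of_medians(a, b))    # medians of a and b coincide
    print(median_treatment_effect(a, b))  # individual effects are -2, 1, 1
    print(worst_case_factor(3))
```

Here both vectors have the same median, so the difference of medians is zero, while two of the three individual effects equal 1, so the two estimands disagree.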
https://proceedings.mlr.press/v247/addanki24a.html