
- title: 'Conference on Learning Theory 2025: Preface'
  volume: 291
  URL: https://proceedings.mlr.press/v291/haghtalab25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/haghtalab25a/haghtalab25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-haghtalab25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: i-i
  id: haghtalab25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: i
  lastpage: i
  published: 2025-07-02 00:00:00 +0000
- title: 'Open Problem: Regret Minimization in Heavy-Tailed Bandits with Unknown Distributional Parameters'
  abstract: 'The heavy-tailed bandit problem (Bubeck et al., 2013) is a variant of the stochastic multi-armed bandit problem where the reward distributions have finite absolute raw moments of maximum order $1+\epsilon$, uniformly bounded by a constant $u < +\infty$, for some $\epsilon \in (0,1]$. In this setting, most of the proposed approaches crucially rely on the knowledge of both $\epsilon$ and $u$. Recent works have highlighted that adapting to these parameters when they are unknown is harder than adapting to the subgaussian constant or to the reward range in non-heavy-tailed bandits. It is known that one cannot adapt to either $\epsilon$ or $u$ without either ($i$) incurring extra regret or ($ii$) enforcing additional assumptions. However, it remains an open question what the best attainable performance is when no additional assumptions are made. Moreover, the assumptions proposed in the literature are not comparable, as none of them is strictly weaker than the others. Thus, another open question concerns the nature of the assumptions needed to compensate for this cost.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/genalti25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/genalti25a/genalti25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-genalti25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Gianmarco
    family: Genalti
  - given: Alberto Maria
    family: Metelli
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1-5
  id: genalti25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1
  lastpage: 5
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimistic Q-learning for average reward and episodic reinforcement learning (extended abstract)'
  abstract: 'Model-free methods for reinforcement learning (RL), particularly Q-learning, have gained popularity in practice because of their simplicity and flexibility, and underlie most successful modern deep RL algorithms. However, the sample complexity and regret bounds for these approaches have often lagged behind their model-based counterparts, especially in the average reward setting. Our work addresses this gap. We present a simple, optimistic Q-learning algorithm for regret minimization in a tabular RL setting that encompasses  \textit{both average reward and episodic} settings. Our contributions include new modeling, algorithm design, and regret analysis techniques. Our first and foremost contribution is a natural modeling assumption that generalizes the episodic and ergodic MDP settings and provides a more practically applicable formulation. We consider the class of MDPs where there is an "upper  bound $H$ on the time to visit a frequent state $s_0$", either in expectation or with constant probability. The upper bound $H$ is assumed to hold under all feasible policies (stationary or non-stationary) and is known to the RL agent, although the identity of the frequent state $s_0$ may not be known. This assumption is naturally satisfied by the episodic settings since the terminal state is visited after every set of $H$ steps, and also by the ergodic MDP settings that assume bounded worst-case hitting time $H$ for {\it all states}. Furthermore, as we demonstrate using several examples from queuing admission control and inventory management, it allows for significantly more modeling flexibility than the existing settings.  A key technical contribution of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. 
Under the given assumption, we show that the $\overline{L}$ operator is a strict contraction (in span) even in the average-reward setting, where the discount factor is $1$. Our algorithm design builds upon the Q-learning algorithm while replacing the Bellman operator with the novel $\overline{L}$ operator. It uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Our model-free algorithm improves on the existing literature in both simplicity of algorithmic design and regret bounds. Specifically, our algorithm achieves a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$ in the average reward setting, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A regret bound of $\tilde{O}(H^6 S\sqrt{AT})$ in the episodic setting with fixed episode length $H$ follows as a corollary of this result. Thus, we provide a unified view of algorithm design and regret minimization in episodic and non-episodic settings, which may be of independent interest.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/agrawal25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/agrawal25a/agrawal25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-agrawal25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Priyank
    family: Agrawal
  - given: Shipra
    family: Agrawal
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1-1
  id: agrawal25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1
  lastpage: 1
  published: 2025-07-02 00:00:00 +0000
- title: 'Computable learning of natural hypothesis classes'
  abstract: 'This paper is about the recent notion of computably probably approximately correct learning, which lies between statistical learning theory, where there is no computational requirement on the learner, and efficient PAC learning, where the learner must be polynomially bounded. Examples have recently been given of hypothesis classes which are PAC-learnable but not computably PAC-learnable, but these hypothesis classes can be viewed as unnatural or non-canonical. We use the on-a-cone machinery from computability theory to prove that, under certain assumptions on the hypothesis class, any “natural” hypothesis class which is learnable must be computably learnable.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/akbari25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/akbari25a/akbari25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-akbari25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Syed
    family: Akbari
  - given: Matthew
    family: Harrison-Trainor
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2-21
  id: akbari25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2
  lastpage: 21
  published: 2025-07-02 00:00:00 +0000
- title: 'Better Private Distribution Testing by Leveraging Unverified Auxiliary Data'
  abstract: 'We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks, \emph{uniformity}, \emph{identity}, and \emph{closeness} testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/aliakbarpour25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/aliakbarpour25a/aliakbarpour25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-aliakbarpour25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Maryam
    family: Aliakbarpour
  - given: Arnav
    family: Burudgunte
  - given: Clément
    family: Canonne
  - given: Ronitt
    family: Rubinfeld
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 22-63
  id: aliakbarpour25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 22
  lastpage: 63
  published: 2025-07-02 00:00:00 +0000
- title: 'Regret Bounds for Robust Online Decision Making'
  abstract: 'We propose a framework which generalizes “decision making with structured observations” from Foster et al. (2023) by allowing \emph{robust} (i.e., multivalued) models. In this framework, each model associates each decision with a \emph{convex set} of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner that can be non-oblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework, which extends the “decision-estimation coefficients” of Foster et al. (2023). Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits (previously studied in Kosoy (2024)) and tabular robust online reinforcement learning (previously studied in Tian et al. (2021)). In both cases, we derive regret bounds that improve on the state of the art (although we do not address computational efficiency).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/appel25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/appel25a/appel25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-appel25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alexander
    family: Appel
  - given: Vanessa
    family: Kosoy
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 64-146
  id: appel25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 64
  lastpage: 146
  published: 2025-07-02 00:00:00 +0000
- title: 'Simplifying Adversarially Robust PAC Learning With Tolerance'
  abstract: 'Adversarially robust PAC learning has proved to be challenging, with the currently best known learners relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow-up works considered a slightly relaxed version of the problem called adversarially robust learning \emph{with tolerance} and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class $\mathcal{H}$. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on $\mathcal{H}$. Even though our learner is improper, it is “almost proper” in the sense that it outputs a hypothesis that is “similar” to a hypothesis in $\mathcal{H}$. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves comparable bounds to the previous (non-tolerant) semi-supervised algorithm, but avoids the use of intricate subroutines from previous works, and is “almost proper.”'
  volume: 291
  URL: https://proceedings.mlr.press/v291/ashtiani25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/ashtiani25a/ashtiani25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-ashtiani25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hassan
    family: Ashtiani
  - given: Vinayak
    family: Pathak
  - given: Ruth
    family: Urner
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 147-168
  id: ashtiani25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 147
  lastpage: 168
  published: 2025-07-02 00:00:00 +0000
- title: 'Computational Intractability of Strategizing against Online Learners'
  abstract: 'Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases—such as structured auction settings or contract design—no general efficient algorithm is known.   In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $\Omega(T)$ hardness bound, significantly strengthening previous work that only showed an additive $\Theta(1)$ impossibility result. Furthermore, while the prior hardness result focused on learners using fictitious play—an algorithm that is not no-regret—we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.  '
  volume: 291
  URL: https://proceedings.mlr.press/v291/assos25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/assos25a/assos25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-assos25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Angelos
    family: Assos
  - given: Yuval
    family: Dagan
  - given: Nived
    family: Rajaraman
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 169-199
  id: assos25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 169
  lastpage: 199
  published: 2025-07-02 00:00:00 +0000
- title: 'Testing Thresholds and Spectral Properties of High-Dimensional Random Toroidal Graphs via Edgeworth-Style Expansions'
  abstract: 'We study high-dimensional random geometric graphs (RGGs) of edge density $p$ with vertices uniformly distributed on the $d$-dimensional torus and edges inserted between “sufficiently close” vertices with respect to an $L_q$-norm. In this setting, we focus on distinguishing an RGG from an Erdős–Rényi graph when both models have the same marginal edge probability $p$. So far, most results in the literature have considered either spherical RGGs with $L_2$-distance or toroidal RGGs under $L_\infty$-distance. However, for general $L_q$-distances, many questions remain open, especially if $p$ is allowed to depend on $n$. The main reason for this is that RGGs under $L_q$-distances cannot easily be represented as the logical “AND” of their 1-dimensional counterparts, as is the case for $L_\infty$ geometries. To overcome this difficulty, we devise a novel technique for quantifying the dependence between edges based on a modified version of Edgeworth expansions. Our technique yields the first tight algorithmic upper bounds for distinguishing toroidal RGGs under general $L_q$ norms from Erdős–Rényi graphs for any fixed $p$ and $q$. We achieve this by showing that the signed triangle statistic can distinguish the two models when $d\ll n^3p^3$ for the whole regime of edge probabilities $\frac{c}{n}<p<1$. Additionally, our technique yields an improved information-theoretic lower bound for this task, showing that the two distributions converge in total variation whenever $d=\tilde{\Omega}(n^3p^2)$, which is just as strong as the currently best known lower bound for spherical RGGs in the case of general $p$ from Liu et al. [STOC’22]. Finally, our expansions allow us to tightly characterize the spectral properties of toroidal RGGs both under $L_q$-distances for fixed $1 \le q < \infty$, and under the $L_\infty$-distance. We find that these are quite different for $q<\infty$ vs. $q=\infty$.
Our results partially resolve a conjecture of Bangachev and Bressler [COLT ’24] and prove that the distance metric, rather than the underlying space, is responsible for the observed differences in the behavior of high-dimensional spherical and toroidal RGGs.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/baguley25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/baguley25a/baguley25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-baguley25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Samuel
    family: Baguley
  - given: Andreas
    family: Göbel
  - given: Marcus
    family: Pappik
  - given: Leon
    family: Schiller
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 200-201
  id: baguley25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 200
  lastpage: 201
  published: 2025-07-02 00:00:00 +0000
- title: 'Faster Acceleration for Steepest Descent'
  abstract: 'Recent advances (Sherman, 2017; Sidford and Tian, 2018; Cohen et al., 2021) have overcome the fundamental barrier of dimension dependence in the iteration complexity of solving $\ell_\infty$ regression with first-order methods. Yet it remains unclear to what extent such acceleration can be achieved for general $\ell_p$ smooth functions. In this paper, we propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to \textit{differing} norms, which are then coupled using an \textit{implicitly} determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provides an iteration complexity improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle, thereby allowing us to circumvent long-standing barriers in accelerated non-Euclidean steepest descent.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bai25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bai25a/bai25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bai25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cedar Site
    family: Bai
  - given: Brian
    family: Bullins
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 202-230
  id: bai25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 202
  lastpage: 230
  published: 2025-07-02 00:00:00 +0000
- title: 'Thompson Sampling for Bandit Convex Optimisation'
  abstract: 'We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness, and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bakhtiari25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bakhtiari25a/bakhtiari25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bakhtiari25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alireza
    family: Bakhtiari
  - given: Tor
    family: Lattimore
  - given: Csaba
    family: Szepesvári
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 231-263
  id: bakhtiari25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 231
  lastpage: 263
  published: 2025-07-02 00:00:00 +0000
- title: 'Metric Embeddings Beyond Bi-Lipschitz Distortion via Sherali-Adams'
  abstract: 'Metric embeddings are a widely used method in algorithm design, where generally a “complex” metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that are commonly used in practice avoid bi-Lipschitz distortion; yet these approaches have received comparatively less study in theory. In this paper, we focus on Multi-dimensional Scaling (MDS), where we are given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j\in[n]}$ over $n$ points, and the goal is to find an embedding $\{x_1,…,x_n\}\subset\mathbb{R}^k$ that minimizes \[ \mathrm{OPT} = \min_{x_1,…,x_n} \mathbb{E}_{i,j\in[n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine et al. gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant-dimensional Euclidean space with cost $\mathrm{OPT} + \epsilon$ in $n^2 \cdot 2^{\mathrm{poly}(\Delta/\epsilon)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, $\Delta$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for constant-dimensional Euclidean space, we achieve a solution with cost $O(\log \Delta)\cdot \mathrm{OPT}^{\Omega(1)} + \epsilon$ in time $n^{O(1)} \cdot 2^{\mathrm{poly}\left(\frac{\log(\Delta)}{\epsilon}\right)}$. Our algorithms are based on a novel geometry-aware analysis of a conditional rounding of the Sherali-Adams LP hierarchy, allowing us to avoid the exponential dependency on the aspect ratio that would typically result from this rounding.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bakshi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bakshi25a/bakshi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bakshi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ainesh
    family: Bakshi
  - given: Vincent
    family: Cohen-Addad
  - given: Rajesh
    family: Jayaram
  - given: Samuel B.
    family: Hopkins
  - given: Silvio
    family: Lattanzi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 264-279
  id: bakshi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 264
  lastpage: 279
  published: 2025-07-02 00:00:00 +0000
- title: 'How to safely discard features based on aggregate SHAP values'
  abstract: 'SHAP is one of the most popular \textit{local} feature-attribution methods. Given a function $f$ and an input $x \in \mathbb{R}^d$, it quantifies each feature’s contribution to $f(x)$. Recently, SHAP has been increasingly used for \textit{global} insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard “unimportant” features. In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the $i$-th SHAP value equals $0$ on the entire data support, there exist functions that clearly depend on Feature $i$. The issue is that computing SHAP values involves evaluating $f$ on points outside of the data support, where $f$ can be strategically designed to mask its dependence on Feature $i$. To address this, we propose to aggregate SHAP values over the \textit{extended} support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. We then extend our results to KernelSHAP, the most popular method for approximating SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate KernelSHAP value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications.
We introduce the “Shapley Lie algebra”, which offers algebraic insights that may enable a deeper investigation of SHAP, and we show that a simple preprocessing step – randomly permuting each column of the data matrix – enables safely discarding features based on aggregate SHAP and KernelSHAP values.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bhattacharjee25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bhattacharjee25a/bhattacharjee25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bhattacharjee25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Robi
    family: Bhattacharjee
  - given: Karolin
    family: Frohnapfel
  - given: Ulrike
    prefix: von
    family: Luxburg
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 280-314
  id: bhattacharjee25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 280
  lastpage: 314
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimal Graph Reconstruction by Counting Connected Components in Induced Subgraphs'
  abstract: 'The graph reconstruction problem has been extensively studied under various query models. In this paper, we propose a new query model regarding the number of connected components, which is one of the most basic and fundamental graph parameters. Formally, we consider the problem of reconstructing an $n$-node $m$-edge graph with oracle queries of the following form: provided with a subset of vertices, the oracle returns the number of connected components in the induced subgraph. We show $\Theta(\frac{m \log n}{\log m})$ queries in expectation are both sufficient and necessary to adaptively reconstruct the graph. In contrast, we show that $\Omega(n^2)$ non-adaptive queries are required, even when $m = O(n)$. We also provide an $O(m\log n + n\log^2 n)$ query algorithm using only two rounds of adaptivity.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/black25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/black25a/black25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-black25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hadley
    family: Black
  - given: Arya
    family: Mazumdar
  - given: Barna
    family: Saha
  - given: Yinzhan
    family: Xu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 315-343
  id: black25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 315
  lastpage: 343
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Partitions with Optimal Query and Round Complexities'
  abstract: 'We consider the basic problem of learning an unknown partition of $n$ elements into at most $k$ sets using simple queries that reveal information about a small subset of elements. Our starting point is the popular and well-studied pairwise \emph{same-set} queries, which ask whether a pair of elements belong to the same class. It is well known that non-adaptive (fully parallel) algorithms require $\Theta(n^2)$ queries, while adaptive (fully sequential) algorithms require $\Theta(nk)$ queries, and the best known algorithm uses $k-1$ rounds of adaptivity. Many variations of this problem have been studied over the last two decades in multiple disciplines due to its fundamental nature and connections to clustering, active learning, and crowd-sourcing. In many of these applications, it is of paramount interest to reduce adaptivity, i.e., the number of rounds, while minimizing the query complexity. In this paper, we give a complete characterization of the deterministic query complexity of this problem as a function of the number of rounds, $r$, which interpolates smoothly between the non-adaptive and adaptive settings: for any constant $r \geq 1$, the query complexity is $\smash{\Theta(n^{1+\frac{1}{2^r-1}}k^{1-\frac{1}{2^r-1}})}$. Additionally, our algorithm only needs $O(\log \log n)$ rounds to attain the optimal $O(nk)$ query complexity, which is a double-exponential improvement over prior works when $k$ is polynomial in $n$. Next, we consider two natural generalizations of pairwise queries to general subsets $S$ of size at most $s$: (1) weak subset queries, which return the number of classes intersected by $S$, and (2) strong subset queries, which return the entire partition restricted to $S$. Once again, in crowd-sourcing applications, queries on large sets may be prohibitive. For non-adaptive algorithms, we show $\Omega(n^2/s^2)$ strong queries are needed.
In contrast, perhaps surprisingly, we show that there is a non-adaptive randomized algorithm using weak queries that matches this bound up to log-factors for all $s \leq \sqrt{n}$. More generally, we obtain nearly matching upper and lower bounds for algorithms using weak and strong queries in terms of both the number of rounds, $r$, and the query size bound, $s$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/black25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/black25b/black25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-black25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hadley
    family: Black
  - given: Arya
    family: Mazumdar
  - given: Barna
    family: Saha
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 344-374
  id: black25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 344
  lastpage: 374
  published: 2025-07-02 00:00:00 +0000
- title: 'A Distributional-Lifting Theorem for PAC Learning'
  abstract: ' The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a {\sl distributional-lifting theorem}: This upgrades a learner that succeeds with respect to a limited distribution family $\mathcal{D}$ to one that succeeds with respect to {\sl any} distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$.  Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting  uniform-distribution learners and designed a lifter that uses a  conditional sample oracle for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then uses this information to lift.  We show that their approach is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then take a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/blanc25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/blanc25a/blanc25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-blanc25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Guy
    family: Blanc
  - given: Jane
    family: Lange
  - given: Carmen
    family: Strassle
  - given: Li-Yang
    family: Tan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 375-379
  id: blanc25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 375
  lastpage: 379
  published: 2025-07-02 00:00:00 +0000
- title: 'Stability and List-Replicability for Agnostic Learners'
  abstract: 'Two seminal papers, by Alon, Livni, Malliaris, and Moran (STOC 2019) and by Bun, Livni, and Moran (FOCS 2020), established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase \emph{et al.} introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner’s error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun \emph{et al.} by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase \emph{et al.}, we prove that only finite hypothesis classes are globally stable learnable even if we restrict the agnostic setting to distributions with small population loss.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/blondal25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/blondal25a/blondal25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-blondal25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ari
    family: Blondal
  - given: Gao
    family: Shan
  - given: Hamed
    family: Hatami
  - given: Pooya
    family: Hatami
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 380-400
  id: blondal25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 380
  lastpage: 400
  published: 2025-07-02 00:00:00 +0000
- title: 'Proofs as Explanations: Short Certificates for Reliable Predictions'
  abstract: 'We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S^\prime$ of the training data (if it exists) such that {\em all} classifiers $h^\prime \in \cH$ that make at most $b$ mistakes on $S^\prime$ predict $h^\prime(x)=y$. Such a set $S^\prime$ serves as a {\em proof} that $x$ indeed has label $y$ under the assumption that (1) the true target function $h^\star$ belongs to $\cH$, and (2) the set $S$ contains at most $b$ noisy or corrupted points. For example, if $b=0$ and $\cH$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and therefore every consistent linear classifier labels $x$ as positive), then Carathéodory’s theorem states that $x$ in fact lies inside the convex hull of $d+1$ of those points. So, a set $S^\prime$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of perfect realizability. In this work, we consider this problem more generally, for general hypothesis classes $\cH$ and general values $b\geq 0$. We define the notion of the {\em robust hollow star number} of $\cH$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as {\em distribution-dependent} bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the {\em certificate coefficient} $\eps_x$ of an example $x$ with respect to a data distribution $\cD$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\eps_x$, $b$, and the VC dimension $d$ of $\cH$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/blum25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/blum25a/blum25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-blum25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Avrim
    family: Blum
  - given: Steve
    family: Hanneke
  - given: Chirag
    family: Pabbaraju
  - given: Donya
    family: Saless
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 401-420
  id: blum25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 401
  lastpage: 420
  published: 2025-07-02 00:00:00 +0000
- title: 'Accelerating Proximal Gradient Descent via Silver Stepsizes'
  abstract: 'Surprisingly, recent work has shown that gradient descent can be accelerated without using momentum—just by judiciously choosing stepsizes. An open question raised by several papers is whether this phenomenon of stepsize-based acceleration holds more generally for constrained and/or composite convex optimization via projected and/or proximal versions of gradient descent. We answer this in the affirmative by proving that the silver stepsize schedule yields analogously accelerated rates in these settings. These rates are conjectured to be asymptotically optimal among all stepsize schedules, and match the silver convergence rate of vanilla gradient descent (Altschuler and Parrilo, 2024, 2025), namely $O(\varepsilon^{-\log_{\rho} 2})$ for smooth convex optimization and $O(\kappa^{\log_\rho 2} \log 1/\varepsilon)$ under strong convexity, where $\varepsilon$ is the precision, $\kappa$ is the condition number, and $\rho = 1 + \sqrt{2}$ is the silver ratio. The key technical insight is the combination of recursive gluing—the technique underlying all analyses of gradient descent accelerated with time-varying stepsizes—with a certain Laplacian-structured sum-of-squares certificate for the analysis of proximal point updates.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bok25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bok25a/bok25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bok25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jinho
    family: Bok
  - given: Jason M.
    family: Altschuler
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 421-453
  id: bok25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 421
  lastpage: 453
  published: 2025-07-02 00:00:00 +0000
- title: 'Logarithmic regret of exploration in average reward Markov decision processes'
  abstract: 'In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing the doubling trick by another simple rule, which we call the Vanishing Multiplicative rule (VM). When managing episodes with VM, the algorithm’s regret is, both in theory and in practice, as good as, if not better than, with DT, while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under VM than DT, making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/boone25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/boone25a/boone25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-boone25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Victor
    family: Boone
  - given: Bruno
    family: Gaujal
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 454-533
  id: boone25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 454
  lastpage: 533
  published: 2025-07-02 00:00:00 +0000
- title: 'Partial and Exact Recovery of a Random Hypergraph from its Graph Projection'
  abstract: 'Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included i.i.d. so that the average degree is $n^\delta$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work of Bresler, Guo, and Polyanskiy (COLT 2024) showed that exact recovery for $d=3$ is possible if and only if $\delta < 2/5$. This work completely resolves the question for all values of $d$ for both exact and partial recovery and for both cases of whether multiplicity information about each edge is available or not. In addition, we show that the reconstruction fidelity undergoes an all-or-nothing transition at a threshold. In particular, this resolves all conjectures from Bresler, Guo, and Polyanskiy (COLT 2024).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bresler25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bresler25a/bresler25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bresler25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Guy
    family: Bresler
  - given: Chenghao
    family: Guo
  - given: Yury
    family: Polyanskiy
  - given: Andrew
    family: Yao
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 534-593
  id: bresler25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 534
  lastpage: 593
  published: 2025-07-02 00:00:00 +0000
- title: 'Computational Equivalence of Spiked Covariance and Spiked Wigner Models via Gram-Schmidt Perturbation'
  abstract: 'In this work, we show the first average-case reduction transforming the sparse Spiked Covariance Model into the sparse Spiked Wigner Model and as a consequence obtain the first computational equivalence result between two well-studied high-dimensional statistics models. Our approach leverages a new perturbation equivariance property for Gram-Schmidt orthogonalization, enabling removal of dependence in the noise while preserving the signal.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bresler25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bresler25b/bresler25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bresler25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Guy
    family: Bresler
  - given: Alina
    family: Harbuzova
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 594-595
  id: bresler25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 594
  lastpage: 595
  published: 2025-07-02 00:00:00 +0000
- title: 'Of Dice and Games: A Theory of Generalized Boosting'
  abstract: 'Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold). We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bressan25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bressan25a/bressan25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bressan25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Marco
    family: Bressan
  - given: Nataly
    family: Brukhim
  - given: Nicolò
    family: Cesa-Bianchi
  - given: Emmanuel
    family: Esposito
  - given: Yishay
    family: Mansour
  - given: Shay
    family: Moran
  - given: Maximilian
    family: Thiessen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 596-640
  id: bressan25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 596
  lastpage: 640
  published: 2025-07-02 00:00:00 +0000
- title: 'A Fine-grained Characterization of PAC Learnability'
  abstract: 'In the multiclass PAC setting, even when full learnability is unattainable, meaningful information can often be extracted to guide predictions. However, classical learning theory has mainly focused on the dichotomy “learnable vs. non-learnable”, leaving notions of partial learnability largely unexplored. Indeed, even for a non-learnable class, a learner may still achieve partial success; for example, by making reliable predictions whenever the true label belongs to a fixed subset of the label space, even if it fails otherwise. Similarly, the rigid nature of PAC learnability makes it impossible to distinguish between classes where one can achieve favorable trade-offs between, say, false-positive and false-negative rates, and classes where such trade-offs are fundamentally unattainable. In a nutshell, standard PAC learnability precludes a fine-grained exploration of learnability. To overcome this limitation, we develop a fine-grained theory of PAC learnability. For any hypothesis class $\mathcal{H}$, given a loss function (which quantifies the penalty for predicting $\hat{y}$ instead of the true label $y$) and a target loss threshold $z$, our theory determines whether it is possible to achieve a loss of at most $z$. In contrast, classical PAC learning considers only the special case of the zero-one loss and $z = 0$, corresponding to a near-perfect classification guarantee. We give a complete characterization of all attainable guarantees, captured by a \emph{finite family} of combinatorial dimensions, which we term the \emph{$J$-cube dimensions} of $\mathcal{H}$. These dimensions are defined for every subset $J$ of at least two labels. This extends the fundamental theorem of realizable PAC learning based on the VC dimension. In fact, our results hold in a more general multi-objective setting where we fully characterize the Pareto frontier of guarantees attainable for the class $\mathcal{H}$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/bressan25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/bressan25b/bressan25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-bressan25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Marco
    family: Bressan
  - given: Nataly
    family: Brukhim
  - given: Nicolò
    family: Cesa-Bianchi
  - given: Emmanuel
    family: Esposito
  - given: Yishay
    family: Mansour
  - given: Shay
    family: Moran
  - given: Maximilian
    family: Thiessen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 641-676
  id: bressan25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 641
  lastpage: 676
  published: 2025-07-02 00:00:00 +0000
- title: 'On the Convergence of Min-Max Langevin Dynamics and Algorithm'
  abstract: 'We study zero-sum games in the space of probability distributions over the Euclidean space $\mathbb{R}^d$ with entropy regularization, in the setting when the interaction function between the players is smooth and strongly convex-strongly concave. We prove an exponential convergence guarantee for the mean-field min-max Langevin dynamics to compute the equilibrium distribution of the zero-sum game. We also study the finite-particle approximation of the mean-field min-max Langevin dynamics, both in continuous and discrete times. We prove biased convergence guarantees for the continuous-time finite-particle min-max Langevin dynamics to the stationary mean-field equilibrium distribution with an explicit bias term which does not scale with the number of particles. We also prove biased convergence guarantees for the discrete-time finite-particle min-max Langevin algorithm to the stationary mean-field equilibrium distribution with an additional bias term which scales with the step size and the number of particles. This provides an explicit iteration complexity for the average particle along the finite-particle algorithm to approximately compute the equilibrium distribution of the zero-sum game. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/cai25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/cai25a/cai25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-cai25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yang
    family: Cai
  - given: Siddharth
    family: Mitra
  - given: Xiuyuan
    family: Wang
  - given: Andre
    family: Wibisono
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 677-754
  id: cai25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 677
  lastpage: 754
  published: 2025-07-02 00:00:00 +0000
- title: 'What Makes Treatment Effects Identifiable? Characterizations and Estimators Beyond Unconfoundedness (Extended Abstract)'
  abstract: 'Most of the widely used estimators of the \emph{average treatment effect} (ATE) in causal inference rely on the assumptions of \emph{unconfoundedness} and \emph{overlap}. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, many types of studies frequently violate unconfoundedness or overlap; for instance, observational studies with deterministic treatment decisions, popularly known as Regression Discontinuity designs, violate overlap. In this paper, we initiate the study of general conditions that enable the \emph{identification} of the average treatment effect, extending beyond unconfoundedness and overlap. In particular, following the paradigm of statistical learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the \emph{average treatment effect on the treated} (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, under mild assumptions on the data distributions, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/cai25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/cai25b/cai25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-cai25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yang
    family: Cai
  - given: Alkis
    family: Kalavasis
  - given: Katerina
    family: Mamali
  - given: Anay
    family: Mehrotra
  - given: Manolis
    family: Zampetakis
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 755-756
  id: cai25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 755
  lastpage: 756
  published: 2025-07-02 00:00:00 +0000
- title: 'Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime'
  abstract: 'We rigorously analyse fully-trained neural networks of arbitrary depth in the Bayesian optimal setting in the so-called proportional scaling regime where the number of training samples and width of the input and all inner layers diverge proportionally. We prove an information-theoretic equivalence between the Bayesian deep neural network model trained from data generated by a teacher with matching architecture, and a simpler model of optimal inference in a generalized linear model. This equivalence enables us to compute the optimal generalization error for deep neural networks in this regime. We thus prove the "deep Gaussian equivalence principle" conjectured in Cui et al. (2023). Our result highlights that in order to escape this "trivialisation" of deep neural networks (in the sense of reduction to a linear model) happening in the strongly overparametrized proportional regime, models trained from much more data have to be considered.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/camilli25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/camilli25a/camilli25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-camilli25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Francesco
    family: Camilli
  - given: Daria
    family: Tieplova
  - given: Eleonora
    family: Bergamin
  - given: Jean
    family: Barbier
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 757-798
  id: camilli25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 757
  lastpage: 798
  published: 2025-07-02 00:00:00 +0000
- title: 'Market Making without Regret'
  abstract: 'We consider a sequential decision-making setting where, at every round $t$, the learner (a market maker) posts a bid price $B_t$ and an ask price $A_t$ to an incoming trader (the taker) with a private valuation for some asset. If the trader’s valuation is lower than the bid price, or higher than the ask price, then a trade (sell or buy) occurs. Letting $M_t$ be the market price (observed only at the end of round $t$), the maker’s utility is $M_t-B_t$ if the maker bought the asset, it is $A_t-M_t$ if they sold it, and it is $0$ if no trade occurred. We characterize the maker’s regret with respect to the best fixed choice of bid and ask pairs under a variety of assumptions (adversarial, i.i.d., and their variants) on the sequence of market prices and valuations.  Our upper bound analysis unveils an intriguing connection relating market making to first-price auctions and dynamic pricing. Our main technical contribution is a lower bound for the i.i.d. case with Lipschitz distributions and independence between market prices and takers’ valuations. The difficulty in the analysis stems from a unique relationship between the reward and feedback functions that allows learning algorithms to trade off reward for information in a continuous way.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/cesa-bianchi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/cesa-bianchi25a/cesa-bianchi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-cesa-bianchi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nicolò
    family: Cesa-Bianchi
  - given: Tommaso
    family: Cesari
  - given: Roberto
    family: Colomboni
  - given: Luigi
    family: Foscari
  - given: Vinayak
    family: Pathak
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 799-837
  id: cesa-bianchi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 799
  lastpage: 837
  published: 2025-07-02 00:00:00 +0000
- title: 'Towards Fair Representation: Clustering and Consensus'
  abstract: 'Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS’17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness – an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear time constant factor approximation algorithms for more general scenarios with different proportions of two group sizes. We complement our approximation result by showing that the problem is NP-hard for two unequal-sized groups. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chakraborty25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chakraborty25a/chakraborty25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chakraborty25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Diptarka
    family: Chakraborty
  - given: Kushagra
    family: Chatterjee
  - given: Debarati
    family: Das
  - given: Tien Long
    family: Nguyen
  - given: Romina
    family: Nobahari
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 838-853
  id: chakraborty25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 838
  lastpage: 853
  published: 2025-07-02 00:00:00 +0000
- title: 'Exploring Facets of Language Generation in the Limit'
  abstract: 'The recent work of Kleinberg and Mullainathan provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Li, Raman, and Tewari studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved – namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable collection has a generator with the stronger property of non-uniform generation in the limit. However, while the generation algorithm of Kleinberg and Mullainathan can be implemented using membership queries, we show that no algorithm can non-uniformly generate using only membership queries, even for collections of just two languages. We also formalize the tension between validity and breadth in the generation algorithm of Kleinberg and Mullainathan by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. We also provide a precise characterization of the language collections for which exhaustive generation is possible. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing the language collections for which such uniform generation with feedback is possible in terms of an abstract complexity measure of the collection.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/charikar25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/charikar25a/charikar25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-charikar25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Moses
    family: Charikar
  - given: Chirag
    family: Pabbaraju
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 854-887
  id: charikar25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 854
  lastpage: 887
  published: 2025-07-02 00:00:00 +0000
- title: 'Deterministic Apple Tasting'
  abstract: 'In binary ($0/1$) online classification with apple tasting feedback, the learner receives feedback only when predicting $1$. Besides some degenerate learning tasks, all previously known learning algorithms for this model are randomized. Consequently, prior to this work it was unknown whether deterministic apple tasting is generally feasible. In this work, we provide the first widely-applicable deterministic apple tasting learner, and show that in the realizable case, a hypothesis class is learnable if and only if it is deterministically learnable, confirming a conjecture of Raman, Subedi, Raman, and Tewari (2024). Quantitatively, we show that every class $H$ is learnable with mistake bound $O(\sqrt{L(H) T \log T})$ (where $L(H)$ is the Littlestone dimension of $H$), and that this is tight for some classes. This demonstrates a separation between deterministic and randomized learners, where the latter can learn every class with mistake bound $O(\sqrt{L(H)T})$, as shown in Raman et al. (2024). We further study the agnostic case, in which the best hypothesis makes at most $k$ mistakes, and prove a trichotomy stating that every class $H$ must be either easy, hard, or unlearnable. Easy classes have (both randomized and deterministic) mistake bound $\Theta_{H}(k)$. Hard classes have randomized mistake bound $\tilde{\Theta}_{H}(k + \sqrt{T})$, and deterministic mistake bound $\tilde{\Theta}_{H}(\sqrt{k \cdot T})$, where $T$ is the time horizon. Unlearnable classes have (both randomized and deterministic) mistake bound $\Theta(T)$. Our upper bound is based on a deterministic algorithm for learning from expert advice with apple tasting feedback, a problem interesting in its own right. For this problem, we show that the optimal deterministic mistake bound is $\Theta(\sqrt{T (k + \log n)})$ for all $k$ and $T \leq n \leq 2^T$, where $n$ is the number of experts. Our algorithm is a natural variation of the well-known exponential weights forecaster.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chase25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chase25a/chase25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chase25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zachary
    family: Chase
  - given: Idan
    family: Mehalel
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 888-923
  id: chase25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 888
  lastpage: 923
  published: 2025-07-02 00:00:00 +0000
- title: 'DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory'
  abstract: 'Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem through the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}\left(\frac{\log n}{\epsilon}\right)$ samples from the data distribution, we can round nearly all $n$ model parameters such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$, as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our algorithm is based on the famous Lovett-Meka algorithm from discrepancy theory and uses sticky Brownian motion to find a good rounding. We also give a simple and practical rounding algorithm called \emph{DiscQuant}, which is inspired by our theoretical insights. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64% accuracy on the GSM8k dataset, whereas GPTQ achieves 54% and RTN achieves 31% (the original model achieves 84%). We make our code available at \url{https://github.com/jerry-chee/DiscQuant}.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chee25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chee25a/chee25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chee25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jerry
    family: Chee
  - given: Arturs
    family: Backurs
  - given: Rainie
    family: Heck
  - given: Li
    family: Zhang
  - given: Janardhan
    family: Kulkarni
  - given: Thomas
    family: Rothvoss
  - given: Sivakanth
    family: Gopi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 924-951
  id: chee25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 924
  lastpage: 951
  published: 2025-07-02 00:00:00 +0000
- title: 'Solving Convex-Concave Problems with $\mathcal{O}(\epsilon^{-4/7})$ Second-Order Oracle Complexity'
  abstract: 'Previous algorithms can solve convex-concave minimax problems $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y)$ with $\mathcal{O}(\epsilon^{-2/3})$ second-order oracle calls using Newton-type methods. This result has been speculated to be optimal because the upper bound is achieved by a natural generalization of the optimal first-order method. In this work, we show an improved upper bound of $\tilde{\mathcal{O}}(\epsilon^{-4/7})$ by generalizing the optimal second-order method for convex optimization to solve the convex-concave minimax problem. We further apply a similar technique to lazy Hessian algorithms and show that our proposed algorithm can also be seen as a second-order “Catalyst” framework (Lin et al., JMLR 2018) that could accelerate any globally convergent algorithm for solving minimax problems.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25a/chen25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Lesi
    family: Chen
  - given: Chengchang
    family: Liu
  - given: Luo
    family: Luo
  - given: Jingzhao
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 952-982
  id: chen25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 952
  lastpage: 982
  published: 2025-07-02 00:00:00 +0000
- title: 'Decision Making in Changing Environments: Robustness, Query-Based Learning, and Differential Privacy'
  abstract: 'We study the problem of interactive decision making in which the underlying environment changes over time subject to given constraints. We propose a framework, which we call \textit{hybrid Decision Making with Structured Observations} (hybrid DMSO), that provides an interpolation between the stochastic and adversarial settings of decision making. Within this framework, we can analyze local differentially private decision making, query-based learning (in particular, SQ learning), and robust and smooth decision making under the same umbrella, deriving upper and lower bounds based on variants of the Decision-Estimation Coefficient (DEC). We further establish strong connections between the DEC’s behavior, the SQ dimension, local minimax complexity, learnability, and joint differential privacy. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25b/chen25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Fan
    family: Chen
  - given: Alexander
    family: Rakhlin
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 983-985
  id: chen25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 983
  lastpage: 985
  published: 2025-07-02 00:00:00 +0000
- title: 'Predicting quantum channels over general product distributions'
  abstract: 'We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $\mathcal{E}$ and an observable $\mathcal{O}$, we aim to learn the mapping \begin{equation*} \rho \mapsto \mathrm{Tr}(\mathcal{O} \mathcal{E}[\rho]) \end{equation*} to within a small error for most $\rho$ sampled from a distribution $\mathcal{D}$. Previously, Huang et al. proved a surprising result that even if $\mathcal{E}$ is arbitrary, this task can be solved in time roughly $n^{O(\log(1/\epsilon))}$, where $\epsilon$ is the target prediction error. However, their guarantee applied only to input distributions $\mathcal{D}$ invariant under all single-qubit Clifford gates, and their algorithm fails for important cases such as general product distributions over product states $\rho$. In this work, we propose a new approach that achieves accurate prediction over essentially any product distribution $\mathcal{D}$, provided it is not “classical”, in which case there is a trivial exponential lower bound. Our method employs a “biased Pauli analysis,” analogous to classical biased Fourier analysis. Implementing this approach requires overcoming several challenges unique to the quantum setting, including the lack of a basis with appropriate orthogonality properties. The techniques we develop to address these issues may have broader applications in quantum information.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25c.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25c/chen25c.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sitan
    family: Chen
  - given: Jaume
    family: de Dios Pont
  - given: Jun-Ting
    family: Hsieh
  - given: Hsin-Yuan
    family: Huang
  - given: Jane
    family: Lange
  - given: Jerry
    family: Li
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 986-1007
  id: chen25c
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 986
  lastpage: 1007
  published: 2025-07-02 00:00:00 +0000
- title: 'Improved sample upper and lower bounds for trace estimation of quantum state powers'
  abstract: 'The trace of quantum state powers $\operatorname{tr}(\rho^q)$, a quantity that emerges in various basic quantum properties such as entropy, has attracted considerable attention. The recent work of Liu and Wang (SODA 2025) showed that $\operatorname{tr}(\rho^q)$ can be estimated to within additive error $\varepsilon$ with a dimension-independent sample complexity of $\widetilde O(1/\varepsilon^{3+\frac{2}{q-1}})$ for any constant $q > 1$, while only an $\Omega(1/\varepsilon)$ lower bound was given. In this paper, we significantly improve the sample complexity of estimating $\operatorname{tr}(\rho^q)$ in both the upper and lower bounds. In particular: - For $q > 2$, we settle the sample complexity with matching upper and lower bounds $\widetilde \Theta(1/\varepsilon^2)$. - For $1 < q < 2$, we provide an upper bound $\widetilde O(1/\varepsilon^{\frac{2}{q-1}})$, with a lower bound $\Omega(1/\varepsilon^{\max\{\frac{1}{q-1}, 2\}})$ for dimension-independent estimators, implying there is only room for a quadratic improvement. Our upper bounds are obtained by (non-plug-in) quantum estimators based on weak Schur sampling, in sharp contrast to the prior approach based on quantum singular value transformation and samplizer.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25d.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25d/chen25d.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25d.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Kean
    family: Chen
  - given: Qisheng
    family: Wang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1008-1028
  id: chen25d
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1008
  lastpage: 1028
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning general Gaussian mixtures with efficient score matching'
  abstract: 'We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\textrm{poly}(k/\epsilon)}$ samples from the target mixture, runs in time polynomial in the number of samples, and constructs a sampler whose output distribution is $\epsilon$-close to the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$. Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25e.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25e/chen25e.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25e.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sitan
    family: Chen
  - given: Vasilis
    family: Kontonis
  - given: Kulin
    family: Shah
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1029-1090
  id: chen25e
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1029
  lastpage: 1090
  published: 2025-07-02 00:00:00 +0000
- title: 'Algorithms for Sparse LPN and LSPN Against Low-noise (extended abstract)'
  abstract: 'We consider sparse variants of the classical Learning Parities with random Noise (LPN) problem. Our main contribution is a new algorithmic framework that provides learning algorithms in the low-noise regime for both the Learning Sparse Parities with Noise (LSPN) problem and the sparse LPN problem. Unlike previous approaches for LSPN and sparse LPN, this framework has a simple structure without fast matrix multiplication or tensor methods, so its algorithms are easy to implement and run in polynomial space. Let $n$ be the dimension, $k$ denote the sparsity, and $\eta$ be the noise rate, so that each label gets flipped with probability $\eta$. As a fundamental problem in computational learning theory, Learning Sparse Parities with Noise (LSPN) assumes the hidden parity is $k$-sparse instead of a potentially dense vector. While the simple enumeration algorithm takes ${n \choose k}=O(n/k)^k$ time, previously known results still need at least ${n \choose k/2} = \Omega(n/k)^{k/2}$ time for any noise rate $\eta$. Our framework provides an LSPN algorithm that runs in time $O(\eta \cdot n/k)^k$ for any noise rate $\eta$, which improves on the state of the art for LSPN whenever $\eta \in (k/n,\sqrt{k/n})$. The sparse LPN problem is closely related to the classical problem of refuting random $k$-CSPs and has been widely used in cryptography as a hardness assumption. Unlike standard LPN, which samples random vectors in $\mathbf{F}_2^n$, it samples random $k$-sparse vectors. Because the number of $k$-sparse vectors is ${n \choose k}<n^k$, sparse LPN has polynomial-time learning algorithms when $m>n^{k/2}$. However, much less is known about learning algorithms for a constant $k$ such as $3$ and $m<n^{k/2}$ samples, beyond the Gaussian elimination and sum-of-squares algorithms. 
Our framework provides a learning algorithm running in time $e^{\tilde{O}(\eta \cdot n^{\frac{\delta+1}{2}})}$ given $\delta \in (0,1)$ and $m=\max\{1,\frac{\eta \cdot n^{\frac{\delta+1}{2}}}{k^2}\} \cdot n^{1+(1-\delta)\cdot \frac{k-1}{2}}$ samples. This improves on previous learning algorithms. For example, in the classical setting of $k=3$ and $m=n^{1.4}$, our algorithm is faster than previous approaches for any $\eta<n^{-0.7}$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25f.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25f/chen25f.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25f.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xue
    family: Chen
  - given: Wenxuan
    family: Shu
  - given: Zhaienhe
    family: Zhou
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1091-1093
  id: chen25f
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1091
  lastpage: 1093
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimization, Isoperimetric Inequalities, and Sampling via Lyapunov Potentials'
  abstract: 'In this paper, we prove that optimizability of any function $F$ using Gradient Flow from all initializations implies a Poincaré Inequality for Gibbs measures $\mu_{\beta}\propto e^{-\beta F}$ at low temperature. In particular, under mild regularity assumptions on the convergence rate of Gradient Flow, we establish that $\mu_{\beta}$ satisfies a Poincaré Inequality with constant $O(C'')$ for $\beta \ge \Omega(d)$, where $C''$ is the Poincaré constant of $\mu_{\beta}$ restricted to a neighborhood of the global minimizers of $F$. Under an additional mild condition on $F$, we show that $\mu_{\beta}$ satisfies a Log-Sobolev Inequality with constant $O(S \beta C'')$ where $S$ denotes the second moment of $\mu_{\beta}$. Here asymptotic notation hides $F$-dependent parameters. At a high level, this establishes that optimizability via Gradient Flow from every initialization implies a Poincaré and Log-Sobolev Inequality for the low-temperature Gibbs measure, which in turn imply sampling from all initializations. Analogously, we establish that under the same assumptions, if $F$ can be optimized via Gradient Flow from every initialization outside some set $\mathcal{S}$, then $\mu_{\beta}$ satisfies a Weak Poincaré Inequality with parameters $(O(C''), O(\mu_{\beta}(\mathcal{S})))$ for $\beta \ge \Omega(d)$. At a high level, this shows that optimizability from ‘most’ initializations implies a Weak Poincaré Inequality, which in turn implies sampling from suitable warm starts. Our regularity assumptions are mild and as a consequence, we show that we can efficiently sample from several new natural and interesting classes of non-log-concave densities, an important setting with relatively few examples. As another corollary, we obtain efficient discrete-time sampling results for log-concave measures satisfying milder regularity conditions than smoothness, similar to Lehec (2023).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25g.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25g/chen25g.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25g.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: August Y
    family: Chen
  - given: Karthik
    family: Sridharan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1094-1153
  id: chen25g
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1094
  lastpage: 1153
  published: 2025-07-02 00:00:00 +0000
- title: 'Heavy-tailed Estimation is Easier than Adversarial Contamination'
  abstract: 'A large body of work in the statistics and computer science communities dating back to Huber (1960) has led to the development of statistically and computationally efficient estimators robust to the presence of outliers in data. In the course of these developments, two particular outlier models have received significant attention: the adversarial and heavy-tailed contamination models. While the former models outliers as the result of a potentially malicious adversary inspecting and manipulating the data, the latter instead relaxes the assumptions on the distribution generating the data, allowing outliers to naturally occur as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers, while in the second, one seeks estimators to combat the loss of \emph{statistical} efficiency caused by outliers, where the dependence on the failure probability is paramount. Surprisingly, despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the corruption models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that \emph{any} adversarially robust estimator is also resilient to heavy-tailed outliers for \emph{any} statistical estimation problem with i.i.d. data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we establish the existence of optimal heavy-tailed estimators whose application to the adversarial setting \emph{requires} any black-box reduction to remove \emph{almost all the outliers} in the data. 
Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation, opening the door to novel algorithmic approaches that bypass the computational barriers inherent to the adversarial setting. Additionally, \emph{any} confidence intervals obtained for adversarially robust estimation also hold with high probability. The proof of our reduction from heavy-tailed to adversarially robust estimation rests on the isoperimetry properties of the set of adversarially robust datasets. Conversely, we identify novel structural properties for samples drawn from a heavy-tailed distribution. We show that such a sample obeys a logarithmic tail-decay condition scaling with the target failure probability. This allows for a quantile-smoothed heavy-tailed estimator which \emph{requires arbitrarily large} stable subsets of the input data to succeed. In the process of analyzing this estimator, we also strengthen the analysis of algorithms utilized previously in the literature.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/cherapanamjeri25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/cherapanamjeri25a/cherapanamjeri25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-cherapanamjeri25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yeshwanth
    family: Cherapanamjeri
  - given: Daniel
    family: Lee
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1154-1184
  id: cherapanamjeri25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1154
  lastpage: 1184
  published: 2025-07-02 00:00:00 +0000
- title: 'The Space Complexity of Learning-Unlearning Algorithms (extended abstract)'
  abstract: 'We study the memory complexity of machine unlearning algorithms that provide strong data deletion guarantees to the users. Formally, consider an algorithm for a particular learning task that initially receives a training dataset. Then, after learning, it receives data deletion requests from a subset of users (of arbitrary size), and the goal of unlearning is to perform the task as if the learner never received the data of deleted users. In this paper, we ask how many bits of storage are needed to be able to delete certain training samples at a later time. We focus on the task of realizability testing, where the goal is to check whether the remaining training samples are realizable within a given hypothesis class $\mathcal{H}$. Toward that end, we first provide a negative result showing that the VC dimension, a well-known combinatorial property of $\mathcal{H}$ that characterizes the amount of information needed for learning and representing the ERM hypothesis in the standard PAC learning task, is not a characterization of the space complexity of unlearning. In particular, we provide a hypothesis class with constant VC dimension (and Littlestone dimension), but for which any unlearning algorithm for realizability testing needs to store $\Omega(n)$ bits, where $n$ denotes the size of the initial training dataset. In fact, we provide a stronger separation by showing that for any hypothesis class $\mathcal{H}$, the amount of information that the learner needs to store, so as to perform unlearning later, is lower bounded by the Eluder dimension of $\mathcal{H}$, a combinatorial notion always larger than the VC dimension. We complement the lower bound with an upper bound in terms of the star number of the underlying hypothesis class, albeit in a stronger ticketed memory model proposed by Ghoroi et al. (2023). 
We show that for any class $\mathcal{H}$ with bounded star number, there exists a ticketed scheme that uses only $\tilde{O}(\text{StarNo}(\mathcal{H}))$ bits of storage and the same number of tickets. Since the star number for a hypothesis class is never larger than its Eluder dimension, our work highlights a fundamental separation between central and ticketed memory models for machine unlearning. Lastly, we consider the setting where the number of deletions is bounded and show that in contrast to the unbounded setting, there exist unlearning schemes with sublinear (in $n$) storage for hypothesis classes with bounded hollow star number, a notion of complexity that is always smaller than the star number and the Eluder dimension.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/cherapanamjeri25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/cherapanamjeri25b/cherapanamjeri25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-cherapanamjeri25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yeshwanth
    family: Cherapanamjeri
  - given: Sumegha
    family: Garg
  - given: Nived
    family: Rajaraman
  - given: Ayush
    family: Sekhari
  - given: Abhishek
    family: Shetty
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1185-1193
  id: cherapanamjeri25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1185
  lastpage: 1193
  published: 2025-07-02 00:00:00 +0000
- title: 'Quantum State and Unitary Learning Implies Circuit Lower Bounds'
  abstract: 'We establish connections between state tomography, pseudorandomness, quantum state synthesis, and circuit lower bounds. In particular, let $\mathfrak C $ be a family of non-uniform quantum circuits of polynomial size and suppose that there exists an algorithm that, given copies of $\ket \psi$, distinguishes whether $\ket \psi$ is produced by $\mathfrak C$ or is Haar random, promised that one of these is the case. For an arbitrary fixed constant $c$, we show that if the algorithm uses at most $O\!\left(2^{n^c}\right)$ time and $2^{n^{0.99}}$ samples, then $\mathsf{stateBQE} \not\subset \mathsf{state}\mathfrak{C}$. Here $\mathsf{stateBQE} \coloneqq \mathsf{stateBQTIME}\left[2^{O(n)}\right]$ and $\mathsf{state}\mathfrak{C}$ are state synthesis complexity classes as introduced by Rosenthal and Yuen (2022), which capture problems with classical inputs but quantum output. Note that efficient tomography implies a similarly efficient distinguishing algorithm against Haar random states, even for nearly exponential-time algorithms. Because every state produced by a polynomial-size circuit can be learned with $2^{O(n)}$ samples and time, or $\omega(\mathrm{poly}(n))$ samples and $2^{\omega(\mathrm{poly}(n))}$ time, we show that even slightly non-trivial quantum state tomography algorithms would lead to new statements about quantum state synthesis. Finally, a slight modification of our proof shows that distinguishing algorithms for quantum states can imply circuit lower bounds for decision problems as well. We then take these results and port them over to the setting of unitary learning and unitary synthesis. All combined, this helps shed light on why time-efficient tomography algorithms for non-uniform quantum circuit classes have seen only limited and partial progress. Our work extends the results of Arunachalam et al. (2022), which revealed a connection between quantum learning of \emph{Boolean functions} and circuit lower bounds for \emph{classical} circuit classes, to the setting of state (resp. unitary) tomography and state (resp. unitary) synthesis. As a result, we establish a conditional pseudorandom state (resp. unitary) generator, circuit size hierarchy theorems for non-uniform state (resp. unitary) synthesis, and connections between state (resp. unitary) synthesis class separations and decision class separations, which may be of independent interest.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chia25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chia25a/chia25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chia25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nai-Hui
    family: Chia
  - given: Daniel
    family: Liang
  - given: Fang
    family: Song
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1194-1252
  id: chia25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1194
  lastpage: 1252
  published: 2025-07-02 00:00:00 +0000
- title: 'Stochastic block models with many communities and the Kesten–Stigum bound - extended abstract'
  abstract: 'We study the inference of communities in stochastic block models with a growing number of communities. For block models with $n$ vertices and a fixed number of communities $q$, it was predicted in Decelle et al. (2011) that there are computationally efficient algorithms for recovering the communities above the Kesten–Stigum (KS) bound and that efficient recovery is impossible below the KS bound. This conjecture has since stimulated a lot of interest, with the achievability side proven in a line of research that culminated in the work of Abbe and Sandon (2018). Conversely, recent work provides evidence for the hardness part using the low-degree paradigm. In this paper we investigate community recovery in the regime $q=q_n \to \infty$ as $n\to\infty$ where no such predictions exist. We show that efficient inference of communities remains possible above the KS bound. Furthermore, we show that recovery of block models is low-degree hard below the KS bound when the number of communities satisfies $q\ll \sqrt{n}$. Perhaps surprisingly, we find that when $q \gg \sqrt{n}$, there is an efficient algorithm based on non-backtracking walks for recovery even below the KS bound. We identify a new threshold and ask if it is the threshold for efficient recovery in this regime. Finally, we show that detection is easy and identify (up to a constant) the information-theoretic threshold for community recovery as the number of communities $q$ diverges. Our low-degree hardness results also naturally have consequences for graphon estimation, improving results of Luo and Gao (2024).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chin25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chin25a/chin25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chin25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Byron
    family: Chin
  - given: Elchanan
    family: Mossel
  - given: Youngtak
    family: Sohn
  - given: Alexander S.
    family: Wein
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1253-1258
  id: chin25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1253
  lastpage: 1258
  published: 2025-07-02 00:00:00 +0000
- title: 'Spherical Dimension'
  abstract: 'We introduce and study the \emph{spherical dimension}, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size $d$ (in the VC sense) is completed into a continuous object, specifically a $d$-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chornomaz25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chornomaz25a/chornomaz25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chornomaz25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Bogdan
    family: Chornomaz
  - given: Shay
    family: Moran
  - given: Tom
    family: Waknine
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1259-1313
  id: chornomaz25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1259
  lastpage: 1313
  published: 2025-07-02 00:00:00 +0000
- title: 'Lower Bounds for Greedy Teaching Set Constructions'
  abstract: 'A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon and Zilles; COLT 2015]. Prior work used a natural greedy algorithm to construct teaching sets recursively, thereby proving upper bounds on $\operatorname{TS}_{\min}$, with the best known bound being $O(d^2)$ [Hu, Wu, Li, and Wang; COLT 2017]. In each iteration, this greedy algorithm chooses to add to the teaching set the $k$ labeled points that restrict the concept class the most. In this work, we prove lower bounds on the performance of this greedy approach for small $k$. Specifically, we show that for $k = 1$, the algorithm does not improve upon the halving-based bound of $O(\log(|\mathcal{C}|))$. Furthermore, for $k = 2$, we complement the upper bound of $O\left(\log(\log(|\mathcal{C}|))\right)$ from [Moran, Shpilka, Wigderson, and Yehudayoff; FOCS 2015] with a matching lower bound. Most consequentially, our lower bound extends up to $k \le \lceil c d \rceil$ for a small constant $c>0$, suggesting that studying higher-order interactions may be necessary to resolve the conjecture that $\operatorname{TS}_{\min} = O(d)$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/compton25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/compton25a/compton25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-compton25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Spencer
    family: Compton
  - given: Chirag
    family: Pabbaraju
  - given: Nikita
    family: Zhivotovskiy
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1314-1329
  id: compton25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1314
  lastpage: 1329
  published: 2025-07-02 00:00:00 +0000
- title: 'Non-Euclidean High-Order Smooth Convex Optimization Extended Abstract'
  abstract: 'We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives by using a $q$-th order oracle, for any $q \geq 1$. Our algorithms work for general norms under mild conditions, including the $\ell_p$-settings for $1\leq p\leq \infty$. We can also optimize structured functions that allow for inexactly implementing a non-Euclidean ball optimization oracle. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an \textit{inexact uniformly convex regularizer}. We show a lower bound for general norms that demonstrates our algorithms are nearly optimal in high dimensions in the black-box oracle model for $\ell_p$-settings and all $q \geq 1$, even in randomized and parallel settings. This new lower bound, when applied to the first-order smooth case, resolves an open question in parallel convex optimization.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/contreras25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/contreras25a/contreras25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-contreras25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Juan Pablo
    family: Contreras
  - given: Cristóbal
    family: Guzmán
  - given: David
    family: Martı́nez-Rubio
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1330-1330
  id: contreras25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1330
  lastpage: 1330
  published: 2025-07-02 00:00:00 +0000
- title: 'Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions'
  abstract: 'The problem of learning single-index and multi-index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analyzed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterizations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low- and high-complexity learning tasks. In this work, we show that high-complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution, via a random shift in the first moment, renders any Gaussian single-index model as easy to learn as a linear function. We further extend this result to a class of multi-index models, namely sparse Boolean functions, also known as Juntas.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/cornacchia25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/cornacchia25a/cornacchia25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-cornacchia25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Elisabetta
    family: Cornacchia
  - given: Dan
    family: Mikulincer
  - given: Elchanan
    family: Mossel
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1331-1365
  id: cornacchia25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1331
  lastpage: 1365
  published: 2025-07-02 00:00:00 +0000
- title: 'Non-Monetary Mechanism Design without Distributional Information: Using Scarce Audits Wisely (Extended Abstract)'
  abstract: 'We study a repeated resource allocation problem with strategic agents where monetary transfers are disallowed and the central planner has no prior information on agents’ utility distributions. In light of Arrow’s impossibility theorem, acquiring information about agent preferences through some form of feedback is necessary. We assume that the central planner can request powerful but expensive audits on the winner in any round, revealing the true utility of the winner in that round. We design a mechanism achieving $T$-independent $\mathcal O(K^2)$ regret in social welfare while requesting $\mathcal O(K^3 \log T)$ audits in expectation, where $K$ is the number of agents and $T$ is the number of rounds. We also show an $\Omega(K)$ lower bound on the regret and an $\Omega(1)$ lower bound on the number of audits when having low regret. Algorithmically, we show that incentive-compatibility can be mostly enforced with an accurate estimation of the winning probability of each agent under truthful reporting. To do so, we impose future punishments and introduce a \emph{flagging} component, allowing agents to flag any biased estimate (we show that doing so aligns with individual incentives). On the technical side, without monetary transfers and distributional information, the central planner cannot ensure that truthful reporting is exactly an equilibrium. Instead, we characterize the equilibrium via a reduction to a simpler \emph{auxiliary game}, in which agents cannot strategize until late in the $T$ rounds of the allocation problem. The tools developed therein may be of independent interest for other mechanism design problems in which the revelation principle cannot be readily applied.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/dai25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/dai25a/dai25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-dai25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yan
    family: Dai
  - given: Moïse
    family: Blanchard
  - given: Patrick
    family: Jaillet
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1366-1367
  id: dai25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1366
  lastpage: 1367
  published: 2025-07-02 00:00:00 +0000
- title: 'Existence of Adversarial Examples for Random Convolutional Networks via Isoperimetric Inequalities on $\mathbb{SO}(d)$'
  abstract: 'We show that adversarial examples exist for various random convolutional networks, and furthermore, that this is a relatively simple consequence of the isoperimetric inequality on the special orthogonal group $\mathbb{SO}(d)$. This extends and simplifies a recent line of work which shows similar results for random fully connected networks.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/daniely25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/daniely25a/daniely25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-daniely25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Amit
    family: Daniely
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1368-1379
  id: daniely25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1368
  lastpage: 1379
  published: 2025-07-02 00:00:00 +0000
- title: 'Rate-Preserving Reductions for Blackwell Approachability'
  abstract: 'Abernethy et al. (2011) showed that Blackwell approachability and no-regret learning are equivalent, in the sense that any algorithm that solves a specific Blackwell approachability instance can be converted to a sublinear regret algorithm for a specific no-regret learning instance, and vice versa. In this paper, we study a more fine-grained form of such reductions, and ask when this translation between problems preserves not only a sublinear rate of convergence, but also preserves the optimal rate of convergence. That is, in which cases does it suffice to find the optimal regret bound for a no-regret learning instance in order to find the optimal rate of convergence for a corresponding approachability instance? We show that the reduction of Abernethy et al. (2011) (and of other subsequent work) does not preserve rates: their reduction may reduce a $d$-dimensional approachability instance $\mathcal{I}_1$ with optimal convergence rate $R_1$ to a no-regret learning instance $\mathcal{I}_2$ with optimal regret-per-round of $R_2$, with $R_{2}/R_{1}$ arbitrarily large (in particular, it is possible that $R_1 = 0$ and $R_{2} > 0$). On the other hand, we show that it is possible to tightly reduce any approachability instance to an instance of a generalized form of regret minimization we call \emph{improper $\phi$-regret minimization} (a variant of the $\phi$-regret minimization of Gordon et al. (2008) where the transformation functions may map actions outside of the action set). Finally, we characterize when linear transformations suffice to reduce improper $\phi$-regret minimization problems to standard classes of regret minimization problems (such as external regret minimization and proper $\phi$-regret minimization) in a rate-preserving manner. 
We prove that some improper $\phi$-regret minimization instances cannot be reduced to either subclass of instance in this way, suggesting that approachability can capture some problems that cannot be easily phrased in the standard language of online learning.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/dann25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/dann25a/dann25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-dann25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Christoph
    family: Dann
  - given: Yishay
    family: Mansour
  - given: Mehryar
    family: Mohri
  - given: Jon
    family: Schneider
  - given: Balasubramanian
    family: Sivan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1380-1414
  id: dann25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1380
  lastpage: 1414
  published: 2025-07-02 00:00:00 +0000
- title: 'Low-rank fine-tuning lies between lazy training and feature learning'
  abstract: 'LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation’s Hermite expansion. We also prove that in our setting, learning the teacher model “from scratch” can require significantly more iterations.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/dayi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/dayi25a/dayi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-dayi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Arif Kerem
    family: Dayi
  - given: Sitan
    family: Chen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1415-1471
  id: dayi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1415
  lastpage: 1471
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Intersections of Two Margin Halfspaces under Factorizable Distributions'
  abstract: 'Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time with respect to the margin $\gamma$ of the data points and their dimensionality $d$. The best-known algorithms run in quasi-polynomial time $d^{O( \log{1/\gamma} )}$, and it has been shown that this complexity is unavoidable for any algorithm relying solely on correlational statistical queries (CSQ). In this work, we introduce a novel algorithm that provably circumvents  the CSQ hardness barrier. Our approach applies to a broad class of  distributions satisfying a natural, previously studied, factorizability assumption. Factorizable distributions lie between the distribution-specific and distribution-free settings, and significantly extend previously known tractable cases. For these distributions,  we show that CSQ-based methods still require quasipolynomial time even for weak learning. Our main result is a learning algorithm for intersections of two margin halfspaces under factorizable distributions that  achieves $\text{poly}(d,1/\gamma)$  time by leveraging more general statistical queries (SQ). As a corollary, we establish a strong separation between CSQ and SQ for this fundamental PAC learning problem.  Our main result is grounded in a rigorous analysis utilizing a novel duality framework that characterizes the moment tensor structure induced by the marginal distributions. Building on these structural insights, our learning algorithm combines a refined variant of Jennrich’s Algorithm with PCA over random projections of the moment tensor, along with a gradient-descent-based non-convex optimization framework.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/diakonikolas25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/diakonikolas25a/diakonikolas25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-diakonikolas25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Mingchen
    family: Ma
  - given: Lisheng
    family: Ren
  - given: Christos
    family: Tzamos
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1472-1530
  id: diakonikolas25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1472
  lastpage: 1530
  published: 2025-07-02 00:00:00 +0000
- title: 'Faster Algorithms for Agnostically Learning Disjunctions and their Implications'
  abstract: 'We study the algorithmic task of learning Boolean disjunctions in the distribution-free agnostic PAC model. The best known agnostic learner for the class of disjunctions over $\{0, 1\}^n$ is the $L_1$-polynomial regression algorithm, achieving complexity $2^{\tilde{O}(n^{1/2})}$. This complexity bound is known to be nearly best possible within the class of Correlational Statistical Query (CSQ) algorithms. In this work, we develop an agnostic learner for this concept class with complexity $2^{\tilde{O}(n^{1/3})}$. Our algorithm can be implemented in the Statistical Query (SQ) model, providing the first separation between the SQ and CSQ models in distribution-free agnostic learning.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/diakonikolas25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/diakonikolas25b/diakonikolas25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-diakonikolas25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ilias
    family: Diakonikolas
  - given: Daniel M.
    family: Kane
  - given: Lisheng
    family: Ren
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1531-1558
  id: diakonikolas25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1531
  lastpage: 1558
  published: 2025-07-02 00:00:00 +0000
- title: 'A Proof of The Changepoint Detection Threshold Conjecture in Preferential Attachment Models'
  abstract: 'We investigate the problem of detecting and estimating a changepoint in the attachment function of a network evolving according to a preferential attachment model on $n$ vertices, using only a single final snapshot of the network. Bet et al. (2023) show that a simple test based on thresholding the number of vertices with minimum degrees can detect the changepoint when the change occurs at time $n-\Omega(\sqrt{n})$. They further make the striking conjecture that detection becomes impossible for any test if the change occurs at time $n-o(\sqrt{n}).$ Kaddouri et al. (2024) take a step forward by proving that detection is impossible if the change occurs at time $n-o(n^{1/3}).$ In this paper, we resolve the conjecture affirmatively, proving that detection is indeed impossible if the change occurs at time $n-o(\sqrt{n}).$ Furthermore, we establish that estimating the changepoint with an error smaller than $o(\sqrt{n})$ is also impossible, thereby confirming that the estimator proposed in Bhamidi et al. (2018) is order-optimal.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/du25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/du25a/du25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-du25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hang
    family: Du
  - given: Shuyang
    family: Gong
  - given: Jiaming
    family: Xu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1559-1563
  id: du25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1559
  lastpage: 1563
  published: 2025-07-02 00:00:00 +0000
- title: 'From Fairness to Infinity: Outcome-Indistinguishable (Omni)Prediction in Evolving Graphs'
  abstract: 'Professional networks provide invaluable entrée to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed appropriately to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously. We begin by observing that, by combining a slightly modified form of the online K29* algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest; for example, we obtain efficient outcome indistinguishability for some interesting infinite collections of tests, as well as for any bounded function — including those computable by deep (graph) neural networks. We apply these techniques to evolving graphs by designing efficient kernel functions that capture socially meaningful features of nodes and their neighborhoods. We obtain online outcome-indistinguishable omnipredictors for rich — possibly infinite — sets of distinguishers yielding, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/dwork25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/dwork25a/dwork25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-dwork25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cynthia
    family: Dwork
  - given: Chris
    family: Hays
  - given: Nicole
    family: Immorlica
  - given: Juan C.
    family: Perdomo
  - given: Pranay
    family: Tankala
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1564-1637
  id: dwork25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1564
  lastpage: 1637
  published: 2025-07-02 00:00:00 +0000
- title: 'Logarithmic Width Suffices for Robust Memorization'
  abstract: 'The memorization capacity of neural networks with a given architecture has been thoroughly studied in many works. Specifically, it is well-known that memorizing $N$ samples can be done using a network of constant width, independent of $N$. However, the required constructions are often quite delicate. In this paper, we consider the natural question of how well feedforward ReLU neural networks can memorize \emph{robustly}, namely while being able to withstand adversarial perturbations of a given radius. We establish both upper and lower bounds on the possible radius for general $\ell_p$ norms, implying (among other things) that width \emph{logarithmic} in the number of input samples is necessary and sufficient to achieve robust memorization (with robustness radius independent of $N$).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/egosi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/egosi25a/egosi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-egosi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Amitsour
    family: Egosi
  - given: Gilad
    family: Yehudai
  - given: Ohad
    family: Shamir
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1638-1690
  id: egosi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1638
  lastpage: 1690
  published: 2025-07-02 00:00:00 +0000
- title: 'Detecting Arbitrary Planted Subgraphs in Random Graphs'
  abstract: 'The problems of detecting and recovering planted structures/subgraphs in Erdős-Rényi random graphs have received significant attention over the past three decades, leading to many exciting results and mathematical techniques. However, prior work has largely focused on specific ad hoc planted structures and inferential settings, while a general theory has remained elusive. In this paper, we bridge this gap by investigating the detection of an \emph{arbitrary} planted subgraph $\Gamma = \Gamma_n$ in an Erdős-Rényi random graph $\mathcal{G}(n, q_n)$, where the edge probability within $\Gamma$ is $p_n$. We examine both the statistical and computational aspects of this problem and establish the following results. In the dense regime, where the edge probabilities $p_n$ and $q_n$ are fixed, we tightly characterize the information-theoretic and computational thresholds for detecting $\Gamma$, and provide conditions under which a computational-statistical gap arises. Most notably, these thresholds depend on $\Gamma$ only through its number of edges, maximum degree, and maximum subgraph density. Our lower and upper bounds are general and apply to any value of $p_n$ and $q_n$ as functions of $n$. Accordingly, we also analyze the sparse regime where $q_n = \Theta(n^{-\alpha})$ and $p_n-q_n =\Theta(q_n)$, with $\alpha\in[0,2]$, as well as the critical regime where $p_n=1-o(1)$ and $q_n = \Theta(n^{-\alpha})$, both of which have been widely studied, for specific choices of $\Gamma$. For these regimes, we show that our bounds are tight for all planted subgraphs investigated in the literature thus far—and many more. Finally, we identify conditions under which detection undergoes a sharp phase transition, where the boundaries at which algorithms succeed or fail shift abruptly as a function of $q_n$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/elimelech25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/elimelech25a/elimelech25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-elimelech25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dor
    family: Elimelech
  - given: Wasim
    family: Huleihel
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1691-1798
  id: elimelech25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1691
  lastpage: 1798
  published: 2025-07-02 00:00:00 +0000
- title: 'Universality of High-Dimensional Logistic Regression and a Novel CGMT under Dependence with Applications to Data Augmentation'
  abstract: 'Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors, an assumption that significantly limits their applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, $m$-dependence and special cases of $\beta$-mixing, and establish a novel CGMT framework that accommodates correlation across both the covariates and observations. Using these results, we establish the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/esmaili-mallory25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/esmaili-mallory25a/esmaili-mallory25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-esmaili-mallory25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Matthew
    family: Esmaili Mallory
  - given: Kevin Han
    family: Huang
  - given: Morgane
    family: Austern
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1799-1918
  id: esmaili-mallory25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1799
  lastpage: 1918
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Augmented Graph $k$-Clustering'
  abstract: 'Clustering is a fundamental task in unsupervised learning. Previous research has focused on learning-augmented $k$-means in Euclidean metrics, limiting its applicability to complex data representations. In this paper, we generalize learning-augmented $k$-clustering to operate on general metrics, enabling its application to graph-structured and non-Euclidean domains. Our framework also relaxes restrictive cluster size constraints, providing greater flexibility for datasets with imbalanced or unknown cluster distributions. Furthermore, we extend the query-complexity hardness results to general metrics: under the Exponential Time Hypothesis (ETH), we show that any polynomial-time algorithm must perform approximately $\Omega(k / \alpha)$ queries to achieve a $(1 + \alpha)$-approximation. These contributions strengthen both the theoretical foundations and practical applicability of learning-augmented clustering, bridging gaps between traditional methods and real-world challenges.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/fan25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/fan25a/fan25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-fan25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chenglin
    family: Fan
  - given: Kijun
    family: Shin
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1919-1934
  id: fan25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1919
  lastpage: 1934
  published: 2025-07-02 00:00:00 +0000
- title: 'Trade-offs in Data Memorization via Strong Data Processing Inequalities'
  abstract: 'Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization’s role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/feldman25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/feldman25a/feldman25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-feldman25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vitaly
    family: Feldman
  - given: Guy
    family: Kornowski
  - given: Xin
    family: Lyu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1935-1973
  id: feldman25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1935
  lastpage: 1973
  published: 2025-07-02 00:00:00 +0000
- title: 'Approximating the total variation distance between spin systems'
  abstract: 'Spin systems form an important class of undirected graphical models. For two Gibbs distributions $\mu$ and $\nu$ induced by two spin systems on the same graph $G = (V, E)$, we study the problem of approximating the total variation distance $d_{\mathrm{TV}}\left({\mu},{\nu}\right)$ with an $\epsilon$-relative error. We propose a new reduction that connects the problem of approximating the TV-distance to sampling and approximate counting. Our applications include the hardcore model and the antiferromagnetic Ising model in the uniqueness regime, the ferromagnetic Ising model, and the general Ising model satisfying the spectral condition.  Additionally, we explore the computational complexity of approximating the total variation distance $d_{\mathrm{TV}}\left({\mu_S},{\nu_S}\right)$ between two marginal distributions on an arbitrary subset $S \subseteq V$. We prove that this problem remains hard even when both $\mu$ and $\nu$ admit polynomial-time sampling and approximate counting algorithms.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/feng25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/feng25a/feng25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-feng25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Weiming
    family: Feng
  - given: Hongyang
    family: Liu
  - given: Minji
    family: Yang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 1974-2025
  id: feng25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 1974
  lastpage: 2025
  published: 2025-07-02 00:00:00 +0000
- title: 'Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration'
  abstract: 'Language model alignment (or reinforcement learning) techniques that leverage active exploration – deliberately encouraging the model to produce diverse, informative responses – offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses – a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/foster25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/foster25a/foster25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-foster25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dylan J
    family: Foster
  - given: Zakaria
    family: Mhammedi
  - given: Dhruv
    family: Rohatgi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2026-2142
  id: foster25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2026
  lastpage: 2142
  published: 2025-07-02 00:00:00 +0000
- title: 'An uncertainty principle for Linear Recurrent Neural Networks'
  abstract: 'We consider linear recurrent neural networks, which have become a key building block of sequence modeling due to their ability to perform stable and effective long-range modeling. In this paper, we aim to characterize this ability on the simple but core copy task, whose goal is to build a linear filter of order $S$ that approximates the filter that looks $K$ time steps in the past (which we refer to as the shift-$K$ filter), where $K$ is larger than $S$. Using classical signal models and quadratic cost, we fully characterize the problem by providing lower bounds on the approximation error, as well as explicit filters that achieve this lower bound up to constants. The optimal performance highlights an uncertainty principle for this task: the optimal filter has to average values around the $K$-th time step in the past with a range (width) that is proportional to $K/S$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/francois25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/francois25a/francois25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-francois25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alexandre
    family: François
  - given: Antonio
    family: Orvieto
  - given: Francis
    family: Bach
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2143-2187
  id: francois25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2143
  lastpage: 2187
  published: 2025-07-02 00:00:00 +0000
- title: 'Complexity of Injectivity and Verification of ReLU Neural Networks (Extended Abstract)'
  abstract: 'Neural networks with ReLU activation play a key role in modern machine learning. Understanding the functions represented by ReLU networks is a major topic in current research as this enables a better interpretability of learning processes. Injectivity of a function computed by a ReLU network, that is, whether different inputs to the network always lead to different outputs, plays a crucial role whenever invertibility of the function is required, e.g., for inverse problems or generative models. The exact computational complexity of deciding injectivity was recently posed as an open problem (Puthawala et al. [JMLR 2022]). We answer this question by proving coNP-completeness. On the positive side, we show that the problem for a single ReLU layer is still tractable for small input dimension; more precisely, we present a parameterized algorithm which yields fixed-parameter tractability with respect to the input dimension. In addition, we study the network verification problem, which is to verify that certain inputs only yield specific outputs. This is of great importance since neural networks are increasingly used in safety-critical systems. We prove that network verification is coNP-hard for a general class of input domains. Our results also exclude constant-factor polynomial-time approximations for the maximum of a function computed by a ReLU network. In this context, we also characterize surjectivity of functions computed by ReLU networks with one-dimensional output, which turns out to be the complement of a basic network verification task. We reveal interesting connections to computational convexity by formulating the surjectivity problem as a zonotope containment problem.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/froese25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/froese25a/froese25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-froese25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vincent
    family: Froese
  - given: Moritz
    family: Grillo
  - given: Martin
    family: Skutella
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2188-2189
  id: froese25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2188
  lastpage: 2189
  published: 2025-07-02 00:00:00 +0000
- title: 'Bayes correlated equilibria, no-regret dynamics in Bayesian games, and the price of anarchy'
  abstract: 'This paper investigates equilibrium computation and the price of anarchy for Bayesian games, which are the fundamental models of games with incomplete information. In normal-form games with complete information, it is known that efficiently computable no-regret dynamics converge to correlated equilibria, and the price of anarchy for correlated equilibria can be bounded for a broad class of games called smooth games. However, in Bayesian games, as surveyed by Forges (1993), several non-equivalent extensions of correlated equilibria exist, and it remains unclear whether they can be efficiently computed or whether their price of anarchy can be bounded. In this paper, we identify a natural extension of correlated equilibria that can be computed efficiently and is guaranteed to have bounds on the price of anarchy in various games. First, we propose a variant of regret called untruthful swap regret. If each player minimizes it in repeated play of Bayesian games, the empirical distribution of these dynamics is guaranteed to converge to communication equilibria, which is one of the extensions of correlated equilibria proposed by Myerson (1982). We present an efficient algorithm for minimizing untruthful swap regret with a sublinear upper bound, which we prove to be tight in terms of the number of types. As a result, by simulating the dynamics with our algorithm, we can approximately compute a communication equilibrium in polynomial time. Furthermore, we extend existing lower bounds on the price of anarchy based on the smoothness arguments from Bayes–Nash equilibria to equilibria obtained by the proposed dynamics.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/fujii25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/fujii25a/fujii25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-fujii25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Kaito
    family: Fujii
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2190-2191
  id: fujii25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2190
  lastpage: 2191
  published: 2025-07-02 00:00:00 +0000
- title: 'Gradient Methods with Online Scaling'
  abstract: 'We introduce a framework to accelerate the convergence of gradient-based methods with online learning. The framework learns to scale the gradient at each iteration through an online learning algorithm and provably accelerates gradient-based methods asymptotically. In contrast with previous literature, where convergence is established based on worst-case analysis, our framework provides a strong convergence guarantee with respect to the optimal stepsize for the iteration trajectory. For smooth strongly convex optimization, our framework provides an $\mathcal{O}(\kappa^\star \log(1/\varepsilon))$ asymptotic complexity result, where $\kappa^\star$ is the condition number achievable by the optimal preconditioner, improving on the previous $\mathcal{O}(\sqrt{n}\kappa^\star \log(1/\varepsilon))$ result. For smooth convex optimization, we obtain the first convergence guarantee for the widely used hypergradient descent heuristic.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gao25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gao25a/gao25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gao25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Wenzhi
    family: Gao
  - given: Ya-Chi
    family: Chu
  - given: Yinyu
    family: Ye
  - given: Madeleine
    family: Udell
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2192-2226
  id: gao25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2192
  lastpage: 2226
  published: 2025-07-02 00:00:00 +0000
- title: 'Computing High-dimensional Confidence Sets for Arbitrary Distributions'
  abstract: 'We study the problem of learning a high-density region of an arbitrary distribution over $\mathbb{R}^d$. Given a target coverage parameter $\delta$, and sample access to an arbitrary distribution $\mathcal{D}$, we want to output a confidence set $S \subset \mathbb{R}^d$ such that $S$ achieves $\delta$ coverage of $\mathcal{D}$, i.e., $\mathbb{P}_{y \sim \mathcal{D}} \left[ y \in S \right] \ge \delta$, and the volume of $S$ is as small as possible. This is a central problem in high-dimensional statistics with applications in high-dimensional analogues of finding confidence intervals, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class $\mathcal{C}$ with bounded VC-dimension. An algorithm for learning confidence sets is competitive with class $\mathcal{C}$ if, given samples from an arbitrary distribution $\mathcal{D}$, it outputs in polynomial time a set that achieves $\delta$ coverage of $\mathcal{D}$, and whose volume is competitive with the smallest set in $\mathcal{C}$ with the required coverage $\delta$. This problem is computationally challenging even in the basic setting when $\mathcal{C}$ is the set of all Euclidean balls. Existing algorithms based on coresets find in polynomial time a ball whose volume is $\exp(\tilde{O}(d/\log d))$-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is $\exp(\tilde{O}(d^{1/2}))$-factor competitive with the optimal ball having the desired coverage. It is surprisingly simple and also extends to finding confidence sets competitive against unions of $k$ balls, and improved guarantees under additional assumptions. The algorithm is improper (it outputs an ellipsoid). Combined with our computational intractability result for proper learning of balls within an $\exp(\tilde{O}(d^{1-o(1)}))$ approximation factor in volume, our results provide an interesting separation between proper and improper learning of confidence sets.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gao25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gao25b/gao25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gao25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chao
    family: Gao
  - given: Liren
    family: Shan
  - given: Vaidehi
    family: Srinivas
  - given: Aravindan
    family: Vijayaraghavan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2227-2269
  id: gao25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2227
  lastpage: 2269
  published: 2025-07-02 00:00:00 +0000
- title: 'Blackwell’s Approachability with Approximation Algorithms'
  abstract: 'We revisit Blackwell’s celebrated approachability problem, which considers a repeated vector-valued game between a player and an adversary. Motivated by settings in which the action set of the player or adversary (or both) is difficult to optimize over, for instance when it corresponds to the set of all possible solutions to some NP-hard optimization problem, we ask what the player can guarantee \textit{efficiently} when only having access to these sets via approximation algorithms with ratios $\alpha_{\mathcal{X}} \geq 1$ and $1 \geq \alpha_{\mathcal{Y}} > 0$, respectively. Assuming the player has monotone preferences, i.e., that he does not prefer a vector-valued loss $\ell_1$ over $\ell_2$ if $\ell_2 \leq \ell_1$, we establish that given a Blackwell instance with an approachable target set $S$, the downward closure of the appropriately-scaled set $\alpha_{\mathcal{X}}\alpha_{\mathcal{Y}}^{-1}S$ is \textit{efficiently} approachable with optimal rate. In case only the player’s or adversary’s set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/garber25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/garber25a/garber25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-garber25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dan
    family: Garber
  - given: Massalha
    family: Mhna
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2270-2290
  id: garber25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2270
  lastpage: 2290
  published: 2025-07-02 00:00:00 +0000
- title: 'Faster Low-Rank Approximation and Kernel Ridge Regression via the Block-Nyström Method'
  abstract: 'The Nyström method is a popular low-rank approximation technique for large matrices that arise in kernel methods and convex optimization. Yet, when the data exhibits heavy-tailed spectral decay, the effective dimension of the problem often becomes so large that even the Nyström method may be outside of our computational budget. To address this, we propose Block-Nyström, an algorithm that injects a block-diagonal structure into the Nyström method, thereby significantly reducing its computational cost while recovering strong approximation guarantees. We show that Block-Nyström can be used to construct improved preconditioners for second-order optimization, as well as to efficiently solve kernel ridge regression for statistical learning over Hilbert spaces. Our key technical insight is that, within the same computational budget, combining several smaller Nyström approximations leads to stronger tail estimates of the input spectrum than using one larger approximation. Along the way, we provide a novel recursive preconditioning scheme for efficiently inverting the Block-Nyström matrix, and provide new statistical learning bounds for a broad class of approximate kernel ridge regression solvers.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/garg25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/garg25a/garg25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-garg25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sachin
    family: Garg
  - given: Michał
    family: Dereziński
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2291-2325
  id: garg25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2291
  lastpage: 2325
  published: 2025-07-02 00:00:00 +0000
- title: 'Model predictive control is almost optimal for restless bandits'
  abstract: 'We consider the discrete time infinite horizon average reward restless Markovian bandit (RMAB) problem. We propose a model predictive control based non-stationary policy with a rolling computational horizon $\tau$. At each time-slot, this policy solves a $\tau$ horizon linear program whose first control value is kept as a control for the RMAB. Our solution requires minimal assumptions and quantifies the loss in optimality in terms of $\tau$ and the number of arms, $N$. We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition. Our proof is based on a framework from dynamic control known as dissipativity. Our solution is easy to implement and performs very well in practice when compared to the state of the art. Further, both our solution and our proof methodology can easily be generalized to more general constrained MDP settings and should thus be of great interest to the burgeoning RMAB community.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gast25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gast25a/gast25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gast25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nicolas
    family: Gast
  - given: Dheeraj
    family: Narasimha
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2326-2361
  id: gast25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2326
  lastpage: 2361
  published: 2025-07-02 00:00:00 +0000
- title: 'Computing Optimal Regularizers for Online Linear Optimization'
  abstract: 'Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret. However, the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL that achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011. Our algorithm requires preprocessing time and space exponential in the dimension $d$ of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension $d$). We complement this with a lower bound showing that even deciding whether a given regularizer is $\alpha$-strongly-convex with respect to a given norm is NP-hard.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gatmiry25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gatmiry25a/gatmiry25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gatmiry25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Khashayar
    family: Gatmiry
  - given: Jon
    family: Schneider
  - given: Stefanie
    family: Jegelka
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2362-2402
  id: gatmiry25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2362
  lastpage: 2402
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Mixtures of Gaussians Using Diffusion Models'
  abstract: 'We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gatmiry25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gatmiry25b/gatmiry25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gatmiry25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Khashayar
    family: Gatmiry
  - given: Jonathan
    family: Kelner
  - given: Holden
    family: Lee
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2403-2456
  id: gatmiry25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2403
  lastpage: 2456
  published: 2025-07-02 00:00:00 +0000
- title: '“All-Something-Nothing” Phase Transitions in Planted $k$-Factor Recovery (Extended Abstract)'
  abstract: 'This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erdős–Rényi random graph $\mathcal{G}(n,\lambda/n)$. We uncover an interesting “all-something-nothing” phase transition. Specifically, we show that as the average degree $\lambda$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery (“all” phase) to partial recovery (“something” phase). Moreover, as $\lambda$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the “nothing” phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik who established that for certain sufficiently dense graphs, the problem undergoes an “all-or-nothing” phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $\lambda < 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $\Theta(n)$ many small trees of size $\Theta(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $\Theta(\sqrt{n\log n})$ suffices.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gaudio25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gaudio25a/gaudio25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gaudio25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Julia
    family: Gaudio
  - given: Colin
    family: Sandon
  - given: Jiaming
    family: Xu
  - given: Dana
    family: Yang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2457-2459
  id: gaudio25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2457
  lastpage: 2459
  published: 2025-07-02 00:00:00 +0000
- title: 'PREM: Privately Answering Statistical Queries with Relative Error'
  abstract: 'We introduce $\mathsf{PREM}$ (Private Relative Error Multiplicative weight update), a new framework for generating synthetic data that achieves a {\em relative} error guarantee for statistical queries under $(\varepsilon, \delta)$-differential privacy (DP). Namely, for a domain ${\cal X}$, a family ${\cal F}$ of queries $f : {\cal X} \to \{0, 1\}$, and $\zeta > 0$, our framework yields a mechanism that on input dataset $D \in {\cal X}^n$ outputs a synthetic dataset $\widehat{D} \in {\cal X}^n$ such that all statistical queries in ${\cal F}$ on $D$, namely $\sum_{x \in D} f(x)$ for $f \in {\cal F}$, are within a $1 \pm \zeta$ {\em multiplicative} factor of the corresponding value on $\widehat{D}$ up to an {\em additive} error that is polynomial in $\log |{\cal F}|$, $\log |{\cal X}|$, $\log n$, $\log(1/\delta)$, $1/\varepsilon$, and $1/\zeta$. In contrast, any $(\varepsilon, \delta)$-DP mechanism is known to require worst-case additive error that is polynomial in at least one of $n, |{\cal F}|$, or $|{\cal X}|$. We complement our algorithm with nearly matching lower bounds.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/ghazi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/ghazi25a/ghazi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-ghazi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Badih
    family: Ghazi
  - given: Cristóbal
    family: Guzmán
  - given: Pritish
    family: Kamath
  - given: Alexander
    family: Knop
  - given: Ravi
    family: Kumar
  - given: Pasin
    family: Manurangsi
  - given: Sushant
    family: Sachdeva
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2460-2460
  id: ghazi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2460
  lastpage: 2460
  published: 2025-07-02 00:00:00 +0000
- title: 'Mean-field analysis of polynomial-width two-layer neural network beyond finite time horizon'
  abstract: 'We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle’s velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain "self-concordance" property in these problems, where the local Hessian of a particle is bounded by a constant times the particle’s velocity, polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/glasgow25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/glasgow25a/glasgow25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-glasgow25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Margalit
    family: Glasgow
  - given: Denny
    family: Wu
  - given: Joan
    family: Bruna
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2461-2539
  id: glasgow25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2461
  lastpage: 2539
  published: 2025-07-02 00:00:00 +0000
- title: 'Tight Bounds for Noisy Computation of High-Influence Functions, Connectivity, and Threshold'
  abstract: 'In the noisy query model, the (binary) return value of every query (possibly repeated) is independently flipped with some fixed probability $p \in (0, 1/2)$. In this paper, we obtain tight bounds on the noisy query complexity of several fundamental problems. Our first contribution is to show that any Boolean function with total influence $\Omega(n)$ has noisy query complexity $\Theta(n\log n)$. Previous works often focus on specific problems, and it is of great interest to have a characterization of noisy query complexity for general functions. Our result is the first noisy query complexity lower bound of this generality, beyond what was known for random Boolean functions (Reischuk and Schmeltz, FOCS 1991). Our second contribution is to prove that Graph Connectivity has noisy query complexity $\Theta(n^2 \log n)$. In this problem, the goal is to determine whether an undirected graph is connected, where each query asks for the existence of an edge in the graph. A simple algorithm can solve the problem with error probability $o(1)$ using $O(n^2 \log n)$ noisy queries, but no non-trivial lower bounds were known prior to this work. Last but not least, we determine the exact number of noisy queries (up to lower order terms) needed to solve the $k$-Threshold problem and the Counting problem. The $k$-Threshold problem asks to decide whether there are at least $k$ ones among $n$ bits, given noisy query access to the bits. We prove that $(1\pm o(1)) \frac{n\log (\min\{k,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ queries are both sufficient and necessary to achieve error probability $\delta = o(1)$. Previously, such a result was only known when $\min\{k,n-k+1\}=o(n)$ (Wang, Ghaddar, Zhu and Wang, ALT 2025). We also show a similar $(1\pm o(1)) \frac{n\log (\min\{k+1,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ bound for the Counting problem, where one needs to count the number of ones among $n$ bits given noisy query access and $k$ denotes the answer.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gu25a/gu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yuzhou
    family: Gu
  - given: Xin
    family: Li
  - given: Yinzhan
    family: Xu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2540-2591
  id: gu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2540
  lastpage: 2591
  published: 2025-07-02 00:00:00 +0000
- title: 'Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals'
  abstract: 'We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether an arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting.  Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/guo25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/guo25a/guo25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-guo25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Anxin
    family: Guo
  - given: Aravindan
    family: Vijayaraghavan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2592-2631
  id: guo25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2592
  lastpage: 2631
  published: 2025-07-02 00:00:00 +0000
- title: 'Alternating Regret for Online Convex Optimization'
  abstract: 'Motivated by alternating learning dynamics in two-player games, a recent work by Cevher et al. (2024) shows that $o(\sqrt{T})$ alternating regret is possible for any $T$-round adversarial Online Linear Optimization (OLO) problem, and leaves as an open question whether the same is true for general Online Convex Optimization (OCO). We answer this question in the affirmative by showing that the continuous Hedge algorithm achieves $\tilde{\mathcal{O}}(d^{\frac{2}{3}}T^{\frac{1}{3}})$ alternating regret for any adversarial $d$-dimensional OCO problem. We show that this implies an alternating learning dynamic that finds a Nash equilibrium for any convex-concave zero-sum game or a coarse correlated equilibrium for any convex two-player general-sum game at a rate of $\tilde{\mathcal{O}}(d^{\frac{2}{3}}/T^{\frac{2}{3}})$. To further improve the time complexity and/or the dimension dependence, we propose another simple algorithm, Follow-the-Regularized-Leader with a regularizer whose convex conjugate is 3rd-order smooth, for OCO with smooth and self-concordant loss functions (such as linear or quadratic losses). We instantiate our algorithm with different regularizers and show that, for example, when the decision set is the $\ell_2$ ball, our algorithm achieves $\tilde{\mathcal{O}}(T^{\frac{2}{5}})$ alternating regret with no dimension dependence (and a better $\tilde{\mathcal{O}}(T^{\frac{1}{3}})$ bound for quadratic losses). We complement our results by showing some algorithm-specific alternating regret lower bounds, including a somewhat surprising $\Omega(\sqrt{T})$ lower bound for a Regret Matching variant that is widely used in alternating learning dynamics.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hait25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hait25a/hait25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hait25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Soumita
    family: Hait
  - given: Ping
    family: Li
  - given: Haipeng
    family: Luo
  - given: Mengxiao
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2632-2633
  id: hait25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2632
  lastpage: 2633
  published: 2025-07-02 00:00:00 +0000
- title: 'Data Selection for ERMs'
  abstract: 'Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hanneke25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hanneke25a/hanneke25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hanneke25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  - given: Shay
    family: Moran
  - given: Alexander
    family: Shlimovich
  - given: Amir
    family: Yehudayoff
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2634-2665
  id: hanneke25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2634
  lastpage: 2665
  published: 2025-07-02 00:00:00 +0000
- title: 'Universal Rates of ERM for Agnostic Learning'
  abstract: 'The universal learning framework has been developed to obtain guarantees on learning rates that hold for any fixed distribution, which can be much faster than the ones that hold uniformly over all distributions. Given that the Empirical Risk Minimization (ERM) principle is fundamental in PAC theory and ubiquitous in practical machine learning, the recent work of Hanneke and Xu (2024) studied the universal rates of ERM for binary classification under the realizable setting. However, the assumption of realizability is too restrictive to hold in practice. Indeed, the majority of the literature on universal learning has focused on the realizable case, leaving the non-realizable case barely explored. In this paper, we consider the problem of universal learning by ERM for binary classification under the agnostic setting, where the "learning curve" reflects the decay of the excess risk as the sample size increases. We explore the possibilities of agnostic universal rates and reveal a compact trichotomy: there are three possible agnostic universal rates of ERM, being either exponential, super-root, or arbitrarily slow. We provide a complete characterization of which concept classes fall into each of these categories. Moreover, we also establish complete characterizations for the target-dependent universal rates as well as the Bayes-dependent universal rates.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hanneke25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hanneke25b/hanneke25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hanneke25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  - given: Mingyue
    family: Xu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2666-2703
  id: hanneke25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2666
  lastpage: 2703
  published: 2025-07-02 00:00:00 +0000
- title: 'Universal Rates for Multiclass Learning with Bandit Feedback'
  abstract: 'The seminal work of (Daniely et al., COLT 2011) introduced the problem of multiclass learning under bandit feedback and provided a combinatorial characterization of its learnability within the framework of PAC learning. In multiclass learning under bandit feedback, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$ similar to classical multiclass learning, but the learner does not directly observe the correct labels of the i.i.d. training examples. Instead, during each round, the learner receives an example, makes a prediction for its label, and receives bandit feedback only indicating whether the prediction is correct. Despite this restriction, the goal remains the same as in classical multiclass learning. In the present work, we study the problem of multiclass learning under bandit feedback within the framework of \emph{universal learning} (Bousquet et al., STOC 2021). This makes it possible to study the behavior of learning curves. In the \emph{uniform learning} framework, no concept class $\mathcal{C}$ is learnable when the effective label space is unbounded. In contrast, surprisingly, we demonstrate that the universal learnability of concept classes $\mathcal{C}$ even when the effective label space is unbounded gives rise to a rich theory. More concretely, our primary contribution is a theory that reveals an inherent trichotomy governing instance optimal learning curves in the realizable setting. Moreover, the best achievable universal learning rate for any given concept class can only decay either at an \emph{exponential}, a \emph{linear}, or an \emph{arbitrarily slow} rate. In particular, the trichotomy is combinatorially characterized by the absence of an infinite multiclass Littlestone tree and the combination of an infinite Natarajan Littlestone tree and an infinite progressive Littlestone tree. Furthermore, we introduce novel learning algorithms for achieving instance optimal universal rates.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hanneke25c.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hanneke25c/hanneke25c.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hanneke25c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  - given: Amirreza
    family: Shaeiri
  - given: Qian
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2704-2756
  id: hanneke25c
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2704
  lastpage: 2756
  published: 2025-07-02 00:00:00 +0000
- title: 'Compression Barriers in Autoregressive Transformers'
  abstract: 'A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(de^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh, Han, Mirrokni, and Karbasi (2024), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation’s time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/haris25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/haris25a/haris25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-haris25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Themistoklis
    family: Haris
  - given: Krzysztof
    family: Onak
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2757-2785
  id: haris25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2757
  lastpage: 2785
  published: 2025-07-02 00:00:00 +0000
- title: 'On the query complexity of sampling from non-log-concave distributions (extended abstract)'
  abstract: 'We study the problem of sampling from a $d$-dimensional distribution with density $p(x)\propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any $L,M$ satisfying $LM\ge d\ge 5$, $\epsilon\in \left(0,\frac{1}{200}\right)$, and any algorithm with query accesses to the value of $f(x)$ and $\nabla f(x)$, there exists an $L$-log-smooth distribution with second moment at most $M$ such that the algorithm requires $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$ queries to compute a sample whose distribution is within $\epsilon$ in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring $\left(\frac{LM}{d\epsilon}\right)^{\mathcal O(d)}$ queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from the family of non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT’24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when $M=\mathrm{poly}(d)$. Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are $\mathcal O(1)$-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be $\mathcal O(1)$-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of “sampling versus optimization”, as studied in Ma et al. (PNAS’19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension $d$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/he25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/he25a/he25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-he25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yuchen
    family: He
  - given: Chihao
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2786-2787
  id: he25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2786
  lastpage: 2787
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning DNF through Generalized Fourier Representations'
  abstract: 'The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, the learnability of Disjunctive Normal Form (DNF) under the uniform and product distributions has been established through such representations. This paper makes three main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution $D$ through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation, which use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the $L_1$ spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by a difference-bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most $\alpha<0.5$. Lower bounds are presented to show that such constraints are necessary. Combining these contributions, the paper shows learnability of DNF with membership queries under difference-bounded tree BN.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/heidari25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/heidari25a/heidari25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-heidari25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mohsen
    family: Heidari
  - given: Roni
    family: Khardon
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2788-2804
  id: heidari25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2788
  lastpage: 2804
  published: 2025-07-02 00:00:00 +0000
- title: 'Noisy Group Testing in the Linear Regime: Exact Thresholds and Efficient Algorithms'
  abstract: 'In group testing, the task is to identify defective items by testing groups of them together using as few tests as possible. We consider the setting where each item is defective with a constant probability $\alpha$, independent of all other items. In the (over-)idealized noiseless setting, tests are positive exactly if any of the tested items are defective. We study a more realistic model in which observed test results are subject to noise, i.e., tests can display false positive or false negative results with constant positive probabilities. We determine precise constants $c$ such that $cn\log n$ tests are required to recover the infection status of every individual for both adaptive and non-adaptive group testing: in the former, the selection of groups to test can depend on previously observed test results, whereas it cannot in the latter. Additionally, for both settings, we provide efficient algorithms that identify all defective items with the optimal number of tests with high probability. Thus, we completely solve the problem of binary noisy group testing in the studied setting.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hintze25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hintze25a/hintze25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hintze25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Lukas
    family: Hintze
  - given: Lena
    family: Krieg
  - given: Olga
    family: Scheftelowitsch
  - given: Haodong
    family: Zhu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2805-2821
  id: hintze25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2805
  lastpage: 2821
  published: 2025-07-02 00:00:00 +0000
- title: 'Improved Margin Generalization Bounds for Voting Classifiers'
  abstract: 'In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost (Freund and Schapire, 1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 of large-margin classifiers with an expected error matching the theoretical lower bound. This result provides a more natural alternative to the Majority-of-5 algorithm by Høgsgaard et al. (2024), and matches the Majority-of-3 result by Aden-Ali et al. (2024) for the realizable prediction model.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hogsgaard-moller25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hogsgaard-moller25a/hogsgaard-moller25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hogsgaard-moller25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mikael
    family: Høgsgaard Møller
  - given: Kasper
    family: Green Larsen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2822-2855
  id: hogsgaard-moller25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2822
  lastpage: 2855
  published: 2025-07-02 00:00:00 +0000
- title: 'Polynomial low degree hardness for Broadcasting on Trees (Extended Abstract)'
  abstract: 'Broadcasting on trees is a fundamental model from statistical physics that plays an important role in information theory, noisy computation and phylogenetic reconstruction within computational biology and linguistics. While this model permits efficient linear-time algorithms for the inference of the root from the leaves, recent work suggests that non-trivial computational complexity may be required for inference. The inference of the root state can be performed using the celebrated Belief Propagation (BP) algorithm, which achieves Bayes-optimal performance. Although BP runs in linear time using real arithmetic operations, recent research indicates that it requires non-trivial computational complexity using more refined complexity measures. Moitra, Mossel, and Sandon demonstrated such complexity by constructing a Markov chain for which estimating the root better than random guessing (for typical inputs) is $NC^1$-complete. Koehler and Mossel constructed chains where, for trees with $N$ leaves, achieving better-than-random root recovery requires polynomials of degree $N^{\Omega(1)}$. The papers above raised the question of whether such complexity bounds hold generally below the celebrated Kesten-Stigum bound. In a recent work, Huang and Mossel established a general degree lower bound of $\Omega(\log N)$ below the Kesten-Stigum bound. Specifically, they proved that any function expressed as a linear combination of functions of at most $O(\log N)$ leaves has vanishing correlation with the root. In this work, we get an exponential improvement of this lower bound by establishing an $N^{\Omega(1)}$ degree lower bound, for any broadcast process in the whole regime below the Kesten-Stigum bound.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/huang25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/huang25a/huang25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-huang25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Han
    family: Huang
  - given: Elchanan
    family: Mossel
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2856-2857
  id: huang25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2856
  lastpage: 2857
  published: 2025-07-02 00:00:00 +0000
- title: 'Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback'
  abstract: 'No-regret self-play learning dynamics have become one of the premier ways to solve large-scale games in practice. Accelerating their convergence via improving the regret of the players over the naive $O(\sqrt{T})$ bound after $T$ rounds has been extensively studied in recent years, but almost all studies assume access to exact gradient feedback. We address the question of whether acceleration is possible under bandit feedback only and provide an affirmative answer for two-player zero-sum normal-form games. Specifically, we show that if both players apply the Tsallis-INF algorithm of Zimmert and Seldin (2021), then their regret is at most $O(c_1 \log T +  \sqrt{c_2 T})$, where $c_1$ and $c_2$ are game-dependent constants that characterize the difficulty of learning: $c_1$ resembles the complexity of learning a stochastic multi-armed bandit instance and depends inversely on some gap measures, while $c_2$ can be much smaller than the number of actions when the Nash equilibria have a small support or are close to the boundary. In particular, for the case when a pure strategy Nash equilibrium exists, $c_2$ becomes zero, leading to an optimal instance-dependent regret bound as we show. We additionally prove that in this case our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/ito25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/ito25a/ito25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-ito25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Shinji
    family: Ito
  - given: Haipeng
    family: Luo
  - given: Taira
    family: Tsuchiya
  - given: Yue
    family: Wu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2858-2892
  id: ito25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2858
  lastpage: 2892
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimal Differentially Private Sampling of Unbounded Gaussians'
  abstract: 'We provide the first $\widetilde{\mathcal{O}}(d)$-sample algorithm for sampling from unbounded Gaussian distributions under the constraint of $(\varepsilon, \delta)$-differential privacy. This is a quadratic improvement over previous results for the same problem, settling an open question of Ghazi, Hu, Kumar, and Manurangsi.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/iverson25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/iverson25a/iverson25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-iverson25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Valentio
    family: Iverson
  - given: Gautam
    family: Kamath
  - given: Argyris
    family: Mouzakis
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2893-2941
  id: iverson25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2893
  lastpage: 2941
  published: 2025-07-02 00:00:00 +0000
- title: 'Local Regularizers Are Not Transductive Learners'
  abstract: 'We partly resolve an open question raised by Asilis et al. (2024): whether the algorithmic template of local regularization — an intriguing generalization of explicit regularization, a.k.a. structural risk minimization — suffices to learn all learnable multiclass problems. Specifically, we provide a negative answer to this question in the transductive model of learning. We exhibit a multiclass classification problem which is learnable in both the transductive and PAC models, yet cannot be learned transductively by any local regularizer. The corresponding hypothesis class, and our proof, are based on principles from cryptographic secret sharing. We outline challenges in extending our negative result to the PAC model, leaving open the tantalizing possibility of a PAC/transductive separation with respect to local regularization.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/jafar25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/jafar25a/jafar25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-jafar25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sky
    family: Jafar
  - given: Julian
    family: Asilis
  - given: Shaddin
    family: Dughmi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2942-2957
  id: jafar25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2942
  lastpage: 2957
  published: 2025-07-02 00:00:00 +0000
- title: 'On the Minimax Regret of Sequential Probability Assignment via Square-Root Entropy'
  abstract: 'We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the \emph{minimax regret}—a notion extensively studied in the literature—in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the Shtarkov sum) can be upper bounded in terms of \emph{sequential square-root entropy}, a notion closely related to Hellinger distance. For the problem of sequential probability assignment with side information, we develop both upper and lower bounds based on the aforementioned entropy. The lower bound matches the upper bound, up to log factors, for classes in the Donsker regime (according to our definition of entropy).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/jia25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/jia25a/jia25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-jia25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zeyu
    family: Jia
  - given: Alexander
    family: Rakhlin
  - given: Yury
    family: Polyanskiy
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 2958-3016
  id: jia25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 2958
  lastpage: 3016
  published: 2025-07-02 00:00:00 +0000
- title: 'Regularized Dikin Walks for Sampling Truncated Logconcave Measures, Mixed Isoperimetry and Beyond Worst-Case Analysis'
  abstract: 'We study sampling from logconcave distributions truncated on polytopes, motivated by Bayesian models with indicator variables. Built on interior point methods and the Dikin walk, we analyze the mixing time of regularized Dikin walks. Our contributions include: (1) proving that the soft-threshold Dikin walk mixes in $\widetilde{O}(mn+\kappa n)$ iterations for logconcave distributions with condition number $\kappa$, dimension $n$ and $m$ linear constraints, without requiring bounded polytopes. Moreover, we introduce the regularized Dikin walk using Lewis weights and show it mixes in $\widetilde{O}(n^{2.5}+\kappa n)$; (2) extending the above mixing time guarantees to weakly log-concave truncated distributions with finite covariance matrices; and (3) going beyond worst-case mixing time analysis, we show that the soft-threshold Dikin walk mixes significantly faster when only $O(1)$ constraints intersect the high-probability mass of the distribution, improving the $\widetilde{O}(mn+\kappa n)$ upper bound to $\widetilde{O}(m + \kappa n)$. Additionally, we provide a practical implementation to generate a warm initialization.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/jiang25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/jiang25a/jiang25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-jiang25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Minhui
    family: Jiang
  - given: Yuansi
    family: Chen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3017-3078
  id: jiang25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3017
  lastpage: 3078
  published: 2025-07-02 00:00:00 +0000
- title: 'Online Covariance Estimation in Nonsmooth Stochastic Approximation'
  abstract: 'We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of  Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting.  In this paper, we study an online batch-means covariance matrix estimator introduced in  Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/jiang25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/jiang25b/jiang25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-jiang25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Liwei
    family: Jiang
  - given: Abhishek
    family: Roy
  - given: Krishnakumar
    family: Balasubramanian
  - given: Damek
    family: Davis
  - given: Dmitriy
    family: Drusvyatskiy
  - given: Sen
    family: Na
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3079-3123
  id: jiang25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3079
  lastpage: 3123
  published: 2025-07-02 00:00:00 +0000
- title: 'Provable Complexity Improvement of AdaGrad over SGD: Upper and Lower Bounds in Stochastic Non-Convex Optimization'
  abstract: 'Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) for stochastic convex optimization under favorable geometry, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an \emph{upper bound} on the convergence rate of AdaGrad and a corresponding \emph{lower bound} for SGD. In particular, we identify non-convex settings in which the iteration complexity of AdaGrad is favorable over SGD and show that, for certain configurations of problem parameters, it outperforms SGD by a factor of $d$, where $d$ is the problem dimension. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/jiang25c.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/jiang25c/jiang25c.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-jiang25c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ruichen
    family: Jiang
  - given: Devyani
    family: Maladkar
  - given: Aryan
    family: Mokhtari
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3124-3158
  id: jiang25c
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3124
  lastpage: 3158
  published: 2025-07-02 00:00:00 +0000
- title: 'Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation (Extended Abstract)'
  abstract: 'Average treatment effect estimation is the most central problem in causal inference, with applications across numerous disciplines. While many estimation strategies have been proposed in the literature, the statistical optimality of these methods has remained an open area of investigation, especially in regimes where these methods do not achieve parametric rates. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that achieve some statistical estimation rate. This framework is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as black-box sub-processes. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT), as well as weighted variants of the former, which arise in policy evaluation.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/jin25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/jin25a/jin25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-jin25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jikai
    family: Jin
  - given: Vasilis
    family: Syrgkanis
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3159-3160
  id: jin25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3159
  lastpage: 3160
  published: 2025-07-02 00:00:00 +0000
- title: 'A Theory of Learning with Autoregressive Chain of Thought'
  abstract: 'For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/joshi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/joshi25a/joshi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-joshi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nirmit
    family: Joshi
  - given: Gal
    family: Vardi
  - given: Adam
    family: Block
  - given: Surbhi
    family: Goel
  - given: Zhiyuan
    family: Li
  - given: Theodor
    family: Misiakiewicz
  - given: Nathan
    family: Srebro
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3161-3212
  id: joshi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3161
  lastpage: 3212
  published: 2025-07-02 00:00:00 +0000
- title: 'The Sample Complexity of Distributed Simple Binary Hypothesis Testing under Information Constraints'
  abstract: 'This paper resolves two open problems from a recent paper (Pensia et al., 2024b) concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that satisfies a crucial tensorisation property; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in (Pensia et al., 2024b); and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from Bhatt et al.(2021) and Pensia et al. (2023).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kazemi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kazemi25a/kazemi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kazemi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hadi
    family: Kazemi
  - given: Ankit
    family: Pensia
  - given: Varun
    family: Jog
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3213-3214
  id: kazemi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3213
  lastpage: 3214
  published: 2025-07-02 00:00:00 +0000
- title: 'Experimental Design for Semiparametric Bandits'
  abstract: 'We study finite-armed semiparametric bandits, where each arm’s reward combines a linear component with an unknown, potentially adversarial shift. This model strictly generalizes classical linear bandits and reflects complexities common in practice. We propose the first experimental-design approach that simultaneously offers a sharp regret bound, a PAC bound, and a best-arm identification guarantee. Our method attains the minimax regret $\tilde{\mathcal{O}}(\sqrt{dT})$, matching the known lower bound for finite-armed linear bandits, and further achieves logarithmic regret under a positive suboptimality gap condition. These guarantees follow from our refined non-asymptotic analysis of orthogonalized regression that attains the optimal $\sqrt{d}$ rate, paving the way for robust and efficient learning across a broad class of semiparametric bandit problems.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kim25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kim25a/kim25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kim25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Seok-Jin
    family: Kim
  - given: Gi-Soo
    family: Kim
  - given: Min-hwan
    family: Oh
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3215-3252
  id: kim25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3215
  lastpage: 3252
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Constant-Depth Circuits in Malicious Noise Models'
  abstract: 'The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits ($\mathsf{AC}^0$) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches their algorithm, which is known to be optimal assuming various cryptographic primitives. Our proof uses a simple outlier-removal method combined with Braverman’s theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called “nasty noise”).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/klivans25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/klivans25a/klivans25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-klivans25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Adam
    family: Klivans
  - given: Konstantinos
    family: Stavropoulos
  - given: Arsen
    family: Vasilyan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3253-3263
  id: klivans25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3253
  lastpage: 3263
  published: 2025-07-02 00:00:00 +0000
- title: 'Efficiently learning and sampling multimodal distributions with data-based initialization'
  abstract: 'We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a $k$-th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincaré inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/koehler25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/koehler25a/koehler25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-koehler25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Frederic
    family: Koehler
  - given: Holden
    family: Lee
  - given: Thuy-Duong
    family: Vuong
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3264-3326
  id: koehler25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3264
  lastpage: 3326
  published: 2025-07-02 00:00:00 +0000
- title: 'The Oracle Complexity of Simplex-based Matrix Games: Linear Separability and Nash Equilibria'
  abstract: 'We study the problem of solving matrix games of the form $\max_{\mathbf{w}\in\mathcal{W}}\min_{\mathbf{p}\in\Delta}\mathbf{p}^{\top}A\mathbf{w}$, where $A$ is some matrix and $\Delta$ is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity (Nemirovski and Yudin, 1983)) is not well-understood. In this work, we first identify different oracle models which are implicitly used by prior algorithms, amounting to multiplying the matrix $A$ by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. Specifically, we start by showing that algorithms for linear separability based on one-sided multiplications must require $\Omega(\gamma_A^{-2})$ iterations, where $\gamma_A$ is the margin, as matched by the Perceptron algorithm. We then prove that accelerated algorithms for this task, which utilize multiplications from both sides, must require $\tilde{\Omega}(\gamma_{A}^{-2/3})$ iterations, establishing the first oracle complexity barrier for such algorithms. Finally, by adapting our lower bound to $\ell_1$ geometry, we prove that computing an $\epsilon$-approximate Nash equilibrium requires $\tilde{\Omega}(\epsilon^{-2/5})$ iterations, which is an exponential improvement over the previously best-known lower bound due to Hadiji et al. (2024).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kornowski25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kornowski25a/kornowski25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kornowski25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Guy
    family: Kornowski
  - given: Ohad
    family: Shamir
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3327-3353
  id: kornowski25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3327
  lastpage: 3353
  published: 2025-07-02 00:00:00 +0000
- title: 'Spectral Estimators for Multi-Index Models: Precise Asymptotics and Optimal Weak Recovery'
  abstract: 'Multi-index models provide a popular framework to investigate the learnability of functions with low-dimensional structure and, also due to their connections with neural networks, they have been object of recent intensive study. In this paper, we focus on recovering the subspace spanned by the signals via spectral estimators – a family of methods routinely used in practice, often as a warm-start for iterative algorithms. Our main technical contribution is a precise asymptotic characterization of the performance of spectral methods, when sample size and input dimension grow proportionally and the dimension $p$ of the space to recover is fixed. Specifically, we locate the top-$p$ eigenvalues of the spectral matrix and establish the overlaps between the corresponding eigenvectors (which give the spectral estimators) and a basis of the signal subspace. Our analysis unveils a phase transition phenomenon in which, as the sample complexity grows, eigenvalues escape from the bulk of the spectrum and, when that happens, eigenvectors recover directions of the desired subspace. The precise characterization we put forward enables the optimization of the data preprocessing, thus allowing to identify the spectral estimator that requires the minimal sample size for weak recovery.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kovacevic25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kovacevic25a/kovacevic25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kovacevic25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Filip
    family: Kovačević
  - given: Yihan
    family: Zhang
  - given: Marco
    family: Mondelli
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3354-3404
  id: kovacevic25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3354
  lastpage: 3404
  published: 2025-07-02 00:00:00 +0000
- title: 'The Role of Environment Access in Agnostic Reinforcement Learning (Extended Abstract)'
  abstract: 'We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically: 1) We show that even with a strong function approximation assumption called \emph{policy completeness}, and \emph{generative access}—perhaps the strongest possible access to the MDP—policy learning methods cannot achieve sample complexity guarantees that scale with the intrinsic complexity of exploration, as measured via the \emph{coverability coefficient} [XFB+22] of the MDP. This resolves an open problem posed by [JLR+23] and shows, in a strong, information-theoretic sense, that policy learning methods cannot explore. 2) We study the $\mu$-reset setting, where the learner can roll out from an exploratory reset distribution $\mu$, and investigate whether error amplification can be controlled without policy completeness (which is required for classical results of PSDP [BKSN03] and CPI [KL02]). We show that agnostic policy learning is information-theoretically impossible. We also show algorithm-specific lower bounds for PSDP and CPI under the weaker condition of \emph{policy class realizability}. 3) In light of these lower bounds, we introduce a new model of access called \emph{hybrid resets}, which subsumes both local simulators (which is weaker than generative access) and $\mu$-resets. We show that under hybrid resets, and when the reset distribution satisfies \emph{pushforward concentrability} [XJ21], sample-efficient policy learning is possible in Block MDPs [JKA+17, DKJ+19] via a new algorithm. Since all of our lower bound constructions are Block MDPs, this indicates the significant power of hybrid reset access in agnostic policy learning. On a technical level, we introduce a new algorithmic tool called a \emph{policy emulator} that allows us to efficiently evaluate various policies within a large class $\Pi$. Informally speaking, a policy emulator is the “minimal object” useful for solving policy learning. Instead of learning the Block MDP in a traditional model-based sense (which would require samples scaling with the observation space size), our algorithm leverages hybrid resets to construct a policy emulator in a statistically efficient manner. Taken together, our results reveal intriguing interplays between function approximation and environment access in RL. Extended abstract. Full version appears as \href{https://arxiv.org/abs/2504.05405}{[arXiv:2504.05405, v1]}. Authors are listed in alphabetical order of their last names.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/krishnamurthy25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/krishnamurthy25a/krishnamurthy25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-krishnamurthy25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Akshay
    family: Krishnamurthy
  - given: Gene
    family: Li
  - given: Ayush
    family: Sekhari
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3405-3406
  id: krishnamurthy25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3405
  lastpage: 3406
  published: 2025-07-02 00:00:00 +0000
- title: 'Spike-and-Slab Posterior Sampling in High Dimensions'
  abstract: 'Posterior sampling with the spike-and-slab prior (Mitchell and Beauchamp, 1988), a popular multi-modal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression (Carvalho et al., 2009; Rockova, 2018). However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) (Yang et al., 2016), only work when the measurement count grows at least linearly in the dimension (Montanari and Wu, 2024), or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $X \in \mathbb{R}^{n \times d}$ and noisy observations $y = X\theta^\star + \xi$ of a signal $\theta^\star$ drawn from a spike-and-slab prior $\pi$ with a Gaussian diffuse density and expected sparsity $k$, where $\xi \sim \mathcal{N}(0_n, \sigma^2 I_n)$. We give a polynomial-time high-accuracy sampler for the posterior $\pi(\cdot \mid X, y)$, for any SNR $\sigma^{-1} > 0$, as long as $n \geq k^3 \cdot \text{polylog}(d)$ and $X$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $\approx nd$ in the same setting, as long as $n \geq k^5 \cdot \text{polylog}(d)$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $\sigma = O(\frac{1}{k})$ is bounded.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kumar25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kumar25a/kumar25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kumar25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Syamantak
    family: Kumar
  - given: Purnamrita
    family: Sarkar
  - given: Kevin
    family: Tian
  - given: Yusong
    family: Zhu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3407-3462
  id: kumar25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3407
  lastpage: 3462
  published: 2025-07-02 00:00:00 +0000
- title: 'A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis'
  abstract: 'Recent works have characterized the function-space inductive bias of infinite-width bounded-norm single-hidden-layer neural networks as a kind of bounded-variation-type space. This novel neural network Banach space encompasses many classical multivariate function spaces, including certain Sobolev spaces and the spectral Barron spaces. Notably, this Banach space also includes functions that exhibit less classical regularity, such as those that only vary in a few directions. On bounded domains, it is well-established that the Gaussian reproducing kernel Hilbert space (RKHS) strictly embeds into this Banach space, demonstrating a clear gap between the Gaussian RKHS and the neural network Banach space. It turns out that when investigating these spaces on unbounded domains, e.g., all of $\mathbb{R}^d$, the story is fundamentally different. We establish the following fundamental result: Certain functions that lie in the Gaussian RKHS have infinite norm in the neural network Banach space. This provides a nontrivial gap between kernel methods and neural networks by exhibiting functions that kernel methods easily represent, whereas neural networks cannot.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kumar25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kumar25b/kumar25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kumar25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Akash
    family: Kumar
  - given: Rahul
    family: Parhi
  - given: Mikhail
    family: Belkin
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3463-3485
  id: kumar25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3463
  lastpage: 3485
  published: 2025-07-02 00:00:00 +0000
- title: 'Low coordinate degree algorithms II: Categorical signals and generalized stochastic block models'
  abstract: 'We study when low coordinate degree functions (LCDF)—linear combinations of functions depending on small subsets of entries of a vector—can test for the presence of categorical structure, including community structure and generalizations thereof, in high-dimensional data. This complements recent results studying the power of LCDF in testing for continuous structure like real-valued signals corrupted by additive noise. We study a general form of stochastic block model (SBM), where a population is assigned random labels and every $p$-tuple generates an observation according to an arbitrary probability measure associated to the $p$ labels of its members. We show that the performance of LCDF admits a unified analysis for this class of models. As applications, we prove tight lower bounds against LCDF for broad families of previously studied graph and uniform hypergraph SBMs, always matching suitable generalizations of the Kesten-Stigum threshold. We also prove tight lower bounds for group synchronization and abelian group sumset problems under the “truth-or-Haar” noise model, and give an improved analysis of Gaussian multi-frequency group synchronization. In most of these models, for some parameter settings our lower bounds give new evidence for conjectural statistical-to-computational gaps. Finally, interpreting some of our findings, we propose a new analogy between categorical and continuous signals: a general SBM as above behaves qualitatively like a spiked $p_*$-tensor model of a certain order $p_*$ depending on the parameters of the SBM.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/kunisky25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/kunisky25a/kunisky25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-kunisky25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dmitriy
    family: Kunisky
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3486-3526
  id: kunisky25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3486
  lastpage: 3526
  published: 2025-07-02 00:00:00 +0000
- title: 'Fast and Furious Symmetric Learning in Zero-Sum Games: Gradient Descent as Fictitious Play'
  abstract: 'This paper investigates the sublinear regret guarantees of two \textit{non}-no-regret algorithms in zero-sum games: Fictitious Play, and Online Gradient Descent with \textit{constant} stepsizes. In general adversarial online learning settings, both algorithms may exhibit instability and linear regret due to the absence of regularization (Fictitious Play) or only small amounts of regularization (Gradient Descent). However, their ability to obtain tighter regret bounds in two-player zero-sum games is less understood. In this work, we obtain strong new regret guarantees for both algorithms on a class of symmetric zero-sum games that generalize the classic three-strategy Rock-Paper-Scissors to a weighted, $n$-dimensional regime. Under \textit{symmetric initializations} of the players’ strategies, we prove that Fictitious Play with \textit{any tiebreaking rule} has $O(\sqrt{T})$ regret, establishing a new class of games for which Karlin’s Fictitious Play conjecture holds. Moreover, by leveraging a connection between the geometry of the iterates of Fictitious Play and Gradient Descent in the dual space of payoff vectors, we prove that Gradient Descent, for \textit{almost all} symmetric initializations, obtains a similar $O(\sqrt{T})$ regret bound when its stepsize is a \textit{sufficiently large} constant. For Gradient Descent, this establishes the first “fast and furious” behavior (i.e., sublinear regret \textit{without} time-vanishing stepsizes) for zero-sum games larger than $2\times2$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/lazarsfeld25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/lazarsfeld25a/lazarsfeld25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-lazarsfeld25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: John
    family: Lazarsfeld
  - given: Georgios
    family: Piliouras
  - given: Ryann
    family: Sim
  - given: Andre
    family: Wibisono
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3527-3577
  id: lazarsfeld25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3527
  lastpage: 3577
  published: 2025-07-02 00:00:00 +0000
- title: 'The Fundamental Limits of Recovering Planted Subgraphs (extended abstract)'
  abstract: 'Given an arbitrary subgraph $H=H_n$ and $p=p_n\in(0,1)$, the planted subgraph model is defined as follows. A statistician observes the union of the "signal," which is a random "planted" copy $H^*$ of $H$, together with random "noise" in the form of an instance of an Erdős-Rényi graph $G(n,p)$. The goal of the statistician is then to recover the planted $H^*$ from the observed graph. Our focus in this work is to understand the minimum mean-squared error (MMSE) in terms of recovering the edges of $H^*$, as a function of $p$ and $H$. A recent paper [MNSSZ23] characterizes the graphs for which this MMSE curve undergoes a sharp phase transition from $0$ to $1$ as $p$ increases, a behavior known as the All-or-Nothing phenomenon, up to a mild density assumption on $H$. However, their techniques fail to describe the MMSE curves for graphs that do not display such a sharp phase transition. In this paper, we provide a formula for the limiting MMSE curve for any graph $H=H_n$, up to the same mild density assumption. This curve is expressed in terms of a variational formula over pairs of subgraphs of $H$, and is inspired by the celebrated subgraph expectation thresholds from probabilistic combinatorics [KK07]. Furthermore, we give a polynomial-time description of the optimizers of this variational problem. This allows one to efficiently compute the MMSE curve for any given dense graph $H$. The proof relies on a novel graph decomposition as well as a min-max duality theorem which may be of independent interest. Our results generalize to the setting of planting arbitrary monotone boolean properties, where the statistician observes the union of a planted minimal element $A\subseteq[N]$ of a monotone property and a random $\mathrm{Ber}(p)^{\otimes N}$ vector. In this setting, we provide a variational formula inspired by the so-called "fractional" expectation threshold [Tal10], again describing the MMSE curve (in this case up to a multiplicative constant).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/lee25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/lee25a/lee25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-lee25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Daniel Z.
    family: Lee
  - given: Francisco
    family: Pernice
  - given: Amit
    family: Rajaraman
  - given: Ilias
    family: Zadik
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3578-3579
  id: lee25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3578
  lastpage: 3579
  published: 2025-07-02 00:00:00 +0000
- title: 'Robust random graph matching in Gaussian models via vector approximate message passing'
  abstract: 'In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. Although polynomial-time algorithms for graph matching have been studied in the line of work (Barak et al. (2019); Ding et al. (2021); Fan et al. (2023a,b); Ganassali and Massoulié (2020); Ganassali et al. (2024a); Mao et al. (2021, 2023a); Ganassali et al. (2024b); Mao et al. (2023b); Ding and Li (2025+, 2023)), many of the efficient algorithms used to achieve matching recovery are believed to be fragile in the sense that adversarially modifying a small fraction of edges could fool the algorithm into outputting a result which deviates strongly from the true underlying matching. Thus, we are particularly interested in a robust version of this problem such that our observation is a perturbed input $(A+E,B+F)$ where $(A,B)$ is a pair of correlated Gaussian Wigner matrices and $E,F$ are adversarially chosen matrices supported on an unknown $\epsilon n \times \epsilon n$ principal minor of $A,B$, respectively. We propose a vector approximate message passing (vector AMP) algorithm that succeeds in polynomial time as long as the correlation $\rho$ between $(A,B)$ is a non-vanishing constant and $\epsilon = o\big( \frac{1}{(\log n)^{20}} \big)$. The main methodological inputs for our result are the iterative random graph matching algorithm proposed in Ding and Li (2025+, 2023) and the spectral cleaning procedure proposed in Ivkov and Schramm (2025). To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust under any adversarial perturbations of $n^{1-o(1)}$ size.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/li25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/li25a/li25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-li25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zhangsong
    family: Li
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3580-3581
  id: li25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3580
  lastpage: 3581
  published: 2025-07-02 00:00:00 +0000
- title: 'Some easy optimization problems have the overlap-gap property'
  abstract: 'We show that the shortest $s$-$t$ path problem has the overlap-gap property in (i) sparse $\mathbb{G}(n,p)$ graphs and (ii) complete graphs with i.i.d. Exponential edge weights. Furthermore, we demonstrate that in sparse $\mathbb{G}(n,p)$ graphs, shortest path is solved by $O(\log n)$-degree polynomial estimators, and a uniform approximate shortest path can be sampled in polynomial time. This constitutes the first example in which the overlap-gap property is not predictive of algorithmic intractability for a (non-algebraic) average-case optimization problem.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/li25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/li25b/li25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-li25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Shuangping
    family: Li
  - given: Tselil
    family: Schramm
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3582-3622
  id: li25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3582
  lastpage: 3622
  published: 2025-07-02 00:00:00 +0000
- title: 'A Polynomial-time Algorithm for Online Sparse Linear Regression with Improved Regret Bound under Weaker Conditions'
  abstract: 'In this paper, we study the problem of online sparse linear regression (OSLR) where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, which was proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies the linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds (Ito et al., 2017) under the compatibility condition that is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batching online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses on the covariance of non-independent random variables, and a decomposition on the regret. We further extend our algorithm to OSLR with additional observations where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds (Kale et al., 2017; Ito et al., 2017).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/li25c.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/li25c/li25c.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-li25c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Junfan
    family: Li
  - given: Shizhong
    family: Liao
  - given: Zenglin
    family: Xu
  - given: Liqiang
    family: Nie
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3623-3670
  id: li25c
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3623
  lastpage: 3670
  published: 2025-07-02 00:00:00 +0000
- title: 'Multi-Pass Memory Lower Bounds for Learning Problems'
  abstract: 'Space complexity in learning problems has received a lot of attention in recent years. In this direction, Brown, Bun, and Smith (COLT 2022) studied space complexity lower bounds for several natural learning problems under the \textit{one-pass streaming} setting. Assuming that the examples are sampled from $\{0,1\}^d$ and the optimal hypothesis can be encoded using $\kappa$ bits, they showed learning algorithms with constant error using a near-minimal number of examples, $\Tilde{O}(\kappa)$, require $\Tilde{\Omega}(d\kappa)$ bits of memory. Moreover, for a general number $N$ of examples, their memory lower bound takes the form $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{N})$.  However, as mentioned by Brown, Bun, and Smith (COLT 2022), the learning process often involves multiple passes over the data. Hence, it is equally important to study the space complexity in the \textit{multi-pass streaming} setting. The authors conjectured that similar lower bounds should apply but left it as an open problem. In this paper, we resolve this open problem by proving that any $L$-pass streaming algorithm using $N$ samples requires $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{NL})$ bits of memory. Intuitively, our lower bound shows that a stream of $L\cdot N$ fresh examples is at least as useful as $L$ passes over $N$ examples. A key component of our approach is a lower bound on the information complexity of the \textsf{Bit-Bias$(p,q)$} problem in the multi-pass streaming setting, a basic problem that may have independent significance. In the \textsf{Bit-Bias$(p,q)$} problem, one sees a stream of $N$ i.i.d. random bits drawn from either \textsf{Bernoulli$(p)$} or \textsf{Bernoulli$(q)$}, and would like to distinguish the two cases. Our results not only extend the previous lower bound on \textsf{Bit-Bias$(0,1/2)$} by Brown, Bun, and Smith from the one-pass streaming setting to the more general multi-pass setting, but also cover more general values of $p$ and $q$. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/li25d.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/li25d/li25d.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-li25d.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Qian
    family: Li
  - given: Shuo
    family: Wang
  - given: Jiapeng
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3671-3699
  id: li25d
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3671
  lastpage: 3699
  published: 2025-07-02 00:00:00 +0000
- title: 'Private Realizable-to-Agnostic Transformation with Near-Optimal Sample Complexity'
  abstract: 'The realizable-to-agnostic transformation (Beimel et al., 2015; Alon et al., 2020) provides a general mechanism to convert a private learner in the realizable setting (where the examples are labeled by some function in the concept class) to a private learner in the agnostic setting (where no assumptions are imposed on the data). Specifically, for any concept class $\mathcal{C}$ and error parameter $\alpha$, a private realizable learner for $\mathcal{C}$ can be transformed into a private agnostic learner while only increasing the sample complexity by $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$, which is essentially tight assuming a constant privacy parameter $\varepsilon = \Theta(1)$. However, when $\varepsilon$ can be arbitrary, one has to apply the standard privacy-amplification-by-subsampling technique (Kasiviswanathan et al., 2011), resulting in a suboptimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2\varepsilon)$ that involves a $1/\varepsilon$ factor. In this work, we give an improved construction that eliminates the dependence on $\varepsilon$, thereby achieving a near-optimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$ for any $\varepsilon\le 1$. Moreover, our result reveals that in private agnostic learning, the privacy cost is only significant for the realizable part. We also leverage our technique to obtain a nearly tight sample complexity bound for the private prediction problem, resolving an open question posed by Dwork and Feldman (2018) and Dagan and Feldman (2020).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/li25e.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/li25e/li25e.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-li25e.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Bo
    family: Li
  - given: Wei
    family: Wang
  - given: Peng
    family: Ye
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3700-3722
  id: li25e
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3700
  lastpage: 3722
  published: 2025-07-02 00:00:00 +0000
- title: 'Low-dimensional adaptation of diffusion models: Convergence in total variation (extended abstract)'
  abstract: ' This paper presents new theoretical insights into how diffusion generative models adapt to low-dimensional structure in data distributions. We study two widely used samplers — the denoising diffusion probabilistic model (DDPM) and the denoising diffusion implicit model (DDIM) — and analyze their convergence behavior under the assumption of accurate score estimates. Our main result shows that both DDPM and DDIM require at most $O(k/\varepsilon)$ iterations (up to logarithmic factors) to generate samples that are $\varepsilon$-close to the target distribution in total variation distance, where $k$ captures an intrinsic low-dimensional structure of the distribution. Importantly, our theory holds without assuming smoothness or log-concavity. These results provide the first rigorous guarantees for the low-dimensional adaptation capability of DDIM-type samplers, and significantly improve upon prior TV-based convergence bounds for DDPM. Our analysis also highlights the role of discretization coefficients in exploiting low-dimensional structure, and establishes lower bounds that justify the optimality of commonly used parameter choices originally proposed by Ho et al. (2020); Song et al. (2020). '
  volume: 291
  URL: https://proceedings.mlr.press/v291/liang25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/liang25a/liang25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-liang25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jiadong
    family: Liang
  - given: Zhihan
    family: Huang
  - given: Yuxin
    family: Chen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3723-3729
  id: liang25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3723
  lastpage: 3729
  published: 2025-07-02 00:00:00 +0000
- title: 'Characterizing Dependence of Samples along the Langevin Dynamics and Algorithms via Contraction of $\Phi$-Mutual Information (Extended Abstract)'
  abstract: 'The mixing time of a Markov chain determines how fast the iterates of the Markov chain converge to the stationary distribution; however, it does not control the dependencies between samples along the Markov chain. In this paper, we study the question of how fast the samples become approximately independent along popular Markov chains for continuous-space sampling: the Langevin dynamics in continuous time, and the Unadjusted Langevin Algorithm and the Proximal Sampler in discrete time. We measure the dependence between samples via $\Phi$-mutual information, which is a broad generalization of the standard mutual information, and which is equal to $0$ if and only if the samples are independent. We show that along these Markov chains, the $\Phi$-mutual information between the first and the $k$-th iterate decreases to $0$ exponentially fast in $k$ when the target distribution is strongly log-concave. Our proof technique is based on showing the Strong Data Processing Inequalities (SDPIs) hold along the Markov chains. To prove fast mixing of the Markov chains, we only need to show the SDPIs hold for the stationary distribution. In contrast, to prove the contraction of $\Phi$-mutual information, we need to show the SDPIs hold along the entire trajectories of the Markov chains; we prove this when the iterates along the Markov chains satisfy the corresponding $\Phi$-Sobolev inequality, which is implied by the strong log-concavity of the target distribution.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/liang25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/liang25b/liang25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-liang25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jiaming
    family: Liang
  - given: Siddharth
    family: Mitra
  - given: Andre
    family: Wibisono
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3730-3731
  id: liang25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3730
  lastpage: 3731
  published: 2025-07-02 00:00:00 +0000
- title: 'Decision Making in Hybrid Environments: A Model Aggregation Approach'
  abstract: 'Recent work by Foster et al. (2021, 2022, 2023b) and Xu and Zeevi (2023) developed the framework of decision estimation coefficient (DEC) that characterizes the complexity of general online decision making problems and provides a general algorithm design principle. These works, however, either focus on the pure stochastic regime where the world remains fixed over time, or the pure adversarial regime where the world arbitrarily changes over time. For the hybrid regime where the dynamics of the world is fixed while the reward arbitrarily changes, they only give pessimistic bounds on the decision complexity. In this work, we propose a general extension of DEC that more precisely characterizes this case. Besides applications in special cases, our framework leads to a flexible algorithm design where the learner learns over subsets of the hypothesis set, trading estimation complexity with decision complexity, which could be of independent interest. Our work covers model-based learning and model-free learning in the hybrid regime, with a newly proposed extension of the bilinear classes (Du et al., 2021) to the adversarial-reward case. In addition, our method improves the best-known regret bounds for linear $Q^\star/V^\star$ MDPs in the pure stochastic regime.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/liu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/liu25a/liu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-liu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Haolin
    family: Liu
  - given: Chen-Yu
    family: Wei
  - given: Julian
    family: Zimmert
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3732-3765
  id: liu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3732
  lastpage: 3765
  published: 2025-07-02 00:00:00 +0000
- title: 'Robust Algorithms for Recovering Planted $r$-Colorable Graphs'
  abstract: 'The planted clique problem is a fundamental problem in the study of algorithms and has been extensively studied in various random and semirandom models. It is known that a clique planted in a random graph can be efficiently recovered if the size of the clique is above the conjectured computational threshold of $\Omega_p(\sqrt{n})$. A natural question that arises then is: what other planted structures can be efficiently recovered? In this work, we investigate this question by considering random planted and semirandom models for the $r$-coloring problem. In our model, a subset $S \subseteq V$ of size $k$ is chosen, and an arbitrary $r$-colorable graph is planted on the subgraph induced by $S$. Edges between pairs in $V \setminus S$ are added independently with probability $p$, and an adversary may add arbitrary edges between $S$ and $V \setminus S$. Our main result is a polynomial-time algorithm that recovers most of the vertices of the planted $r$-colorable graph when $k \geq c r \sqrt{n/p}$, for some constant $c$. The key technical contribution is a novel semidefinite programming (SDP) relaxation and a rounding algorithm. Our algorithm is also robust to the presence of a monotone adversary that can insert edges within $V \setminus S$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/louis25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/louis25a/louis25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-louis25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Anand
    family: Louis
  - given: Rameesh
    family: Paul
  - given: Prasad
    family: Raghavendra
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3766-3794
  id: louis25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3766
  lastpage: 3794
  published: 2025-07-02 00:00:00 +0000
- title: 'Sparsity-Based Interpolation of External, Internal and Swap Regret'
  abstract: 'Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $\phi$-regret minimization, which measures the total loss of an algorithm by its regret with respect to an arbitrary action modification rule $\phi$. With $d$ experts and $T\gg d$ rounds in total, we present a single algorithm achieving the instance-adaptive $\phi$-regret bound $\tilde O\left(\min\left\{\sqrt{d-d^{\mathrm{unif}}_\phi+1},\sqrt{d-d^{\mathrm{self}}_\phi}\right\}\cdot\sqrt{T}\right),$ where $d^{\mathrm{unif}}_\phi$ is the maximum number of experts modified identically by $\phi$, and $d^{\mathrm{self}}_\phi$ is the number of experts that $\phi$ trivially modifies to themselves. By recovering the optimal $O(\sqrt{T\log d})$ external regret bound when $d^{\mathrm{unif}}_\phi=d$, the standard $\tilde O(\sqrt{T})$ internal regret bound when $d^{\mathrm{self}}_\phi=d-1$ and the optimal $\tilde O(\sqrt{dT})$ swap regret bound in the worst case, we improve upon existing algorithms in the intermediate regimes. In addition, the computational complexity of our algorithm matches that of the standard swap-regret minimization algorithm due to Blum and Mansour (2007). Technically, building on the well-known reduction from $\phi$-regret minimization to external regret minimization on stochastic matrices, our main idea is to further convert the latter to online linear regression using Haar-wavelet-inspired matrix features. Then, by associating the complexity of each $\phi$ instance with its sparsity under the feature representation, we apply techniques from comparator-adaptive online learning to exploit the sparsity in this regression subroutine.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/lu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/lu25a/lu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-lu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zhou
    family: Lu
  - given: Y Jennifer
    family: Sun
  - given: Zhiyu
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3795-3828
  id: lu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3795
  lastpage: 3828
  published: 2025-07-02 00:00:00 +0000
- title: 'Sample Efficient Omniprediction and Downstream Swap Regret for Non-Linear Losses'
  abstract: 'We define “decision swap regret” which generalizes both prediction for downstream swap regret and omniprediction, and give algorithms for obtaining it for arbitrary multi-dimensional Lipschitz loss functions in online adversarial settings. We also give sample complexity bounds in the batch setting via an online-to-batch reduction. When applied to omniprediction, our algorithm gives the first polynomial sample-complexity bounds for Lipschitz loss functions—prior bounds either applied only to linear loss (or binary outcomes) or scaled exponentially with the error parameter even under the assumption that the loss functions were convex.  When applied to prediction for downstream regret, we give the first algorithm capable of guaranteeing swap regret bounds for all downstream agents with non-linear loss functions over a multi-dimensional outcome space: prior work applied only to linear loss functions, modeling risk neutral agents. Our general bounds scale exponentially with the dimension of the outcome space, but we give improved regret and sample complexity bounds for specific families of multidimensional functions of economic interest: constant elasticity of substitution (CES), Cobb-Douglas, and Leontief utility functions.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/lu25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/lu25b/lu25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-lu25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jiuyao
    family: Lu
  - given: Aaron
    family: Roth
  - given: Mirah
    family: Shi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3829-3878
  id: lu25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3829
  lastpage: 3878
  published: 2025-07-02 00:00:00 +0000
- title: 'Identifiability and Estimation in High-Dimensional Nonparametric Latent Structure Models'
  abstract: 'This paper studies the problems of identifiability and estimation in high-dimensional nonparametric latent structure models. We introduce an identifiability theorem that generalizes existing conditions, establishing a unified framework applicable to diverse statistical settings.  Our results rigorously demonstrate how increased dimensionality, coupled with diversity in variables, inherently facilitates identifiability. For the estimation problem, we establish near-optimal minimax rate bounds for the high-dimensional nonparametric density estimation under latent structures with smooth marginals. Contrary to the conventional curse of dimensionality, our sample complexity scales only polynomially with the dimension. Additionally, we develop a perturbation theory for component recovery and propose a recovery procedure based on simultaneous diagonalization.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/lyu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/lyu25a/lyu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-lyu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yichen
    family: Lyu
  - given: Pengkun
    family: Yang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3879-3880
  id: lyu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3879
  lastpage: 3880
  published: 2025-07-02 00:00:00 +0000
- title: 'Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries'
  abstract: 'In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $\mathcal{X} \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in \mathcal{X}$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{\mathcal{O}}(\sqrt{|E|T\log |\mathcal{X}|})$ with high probability against any adaptive adversary, where $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/maiti25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/maiti25a/maiti25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-maiti25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Arnab
    family: Maiti
  - given: Zhiyuan
    family: Fan
  - given: Kevin
    family: Jamieson
  - given: Lillian J.
    family: Ratliff
  - given: Gabriele
    family: Farina
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3881-3932
  id: maiti25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3881
  lastpage: 3932
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning sparse generalized linear models with binary outcomes via iterative hard thresholding'
  abstract: 'In statistics, generalized linear models (GLMs) are widely used for modeling data and can expressively capture potential nonlinear dependence of the model’s outcomes on its covariates. Within the broad family of GLMs, those with binary outcomes, which include logistic and probit regressions, are motivated by common tasks such as  binary classification with (possibly) non-separable data. In addition, in modern machine learning and statistics, data is often high-dimensional yet has a low intrinsic dimension, making sparsity constraints in models another reasonable consideration. In this work, we propose to use and analyze an iterative hard thresholding (projected gradient descent on the ReLU loss) algorithm, called binary iterative hard thresholding (BIHT), for parameter estimation in sparse GLMs with binary outcomes. We establish that BIHT is statistically efficient and converges to the correct solution for parameter estimation in  a general class of sparse binary GLMs. Unlike many other methods for learning GLMs, including maximum likelihood estimation, generalized approximate message passing, and GLM-tron (Kakade et al., 2011; Bahmani et al., 2016), BIHT does not require knowledge of the GLM’s link function, offering flexibility and generality in allowing the algorithm to learn arbitrary binary GLMs. As two applications, logistic and probit regression are additionally studied. In this regard, it is shown that in logistic regression, the algorithm is in fact statistically optimal in the sense that the order-wise sample complexity matches (up to logarithmic factors) the lower bound obtained previously. To the best of our knowledge, this is the first work achieving statistical optimality for logistic regression in all noise regimes with a computationally efficient algorithm. Moreover, for probit regression, our sample complexity is on the same order as that obtained for logistic regression.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/matsumoto25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/matsumoto25a/matsumoto25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-matsumoto25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Namiko
    family: Matsumoto
  - given: Arya
    family: Mazumdar
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 3933-4032
  id: matsumoto25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 3933
  lastpage: 4032
  published: 2025-07-02 00:00:00 +0000
- title: 'Online Convex Optimization with a Separation Oracle'
  abstract: 'In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\widetilde{O}(\sqrt{d T})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\widetilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per iteration.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/mhammedi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/mhammedi25a/mhammedi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-mhammedi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zakaria
    family: Mhammedi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4033-4077
  id: mhammedi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4033
  lastpage: 4077
  published: 2025-07-02 00:00:00 +0000
- title: 'Sample and Oracle Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions'
  abstract: 'Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/mhammedi25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/mhammedi25b/mhammedi25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-mhammedi25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zakaria
    family: Mhammedi
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4078-4165
  id: mhammedi25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4078
  lastpage: 4165
  published: 2025-07-02 00:00:00 +0000
- title: 'The Planted Spanning Tree Problems: Exact Overlap Characterization via Local Weak Convergence (Extended Abstract)'
  abstract: 'We study the problem of detecting and recovering a planted spanning tree $M_n^*$ hidden within a complete, randomly weighted graph $G_n$. Specifically, each edge $e$ has a non-negative weight drawn independently from $P_n$ if $e \in M_n^*$ and from $Q_n$ otherwise, where $P_n \equiv P$ is fixed and $Q_n$ scales with $n$ such that its density at the origin satisfies $\lim_{n\to\infty} n Q_n^\prime(0)=1.$ We consider two representative cases: when $M_n^*$ is either a uniform spanning tree or a uniform Hamiltonian path. We analyze the recovery performance of the minimum spanning tree (MST) algorithm and derive a fixed-point equation that characterizes the asymptotic fraction of edges in $M_n^*$ successfully recovered by the MST as $n \to \infty.$ Furthermore, we establish the asymptotic mean weight of the MST, extending Frieze’s $\zeta(3)$ result to the planted model. Leveraging this result, we design an efficient test based on the MST weight and show that it can distinguish the planted model from the unplanted model with vanishing testing error as $n \to \infty.$ Our analysis relies on an asymptotic characterization of the local structure of the planted model, employing the framework of local weak convergence.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/moharrami25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/moharrami25a/moharrami25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-moharrami25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mehrdad
    family: Moharrami
  - given: Cristopher
    family: Moore
  - given: Jiaming
    family: Xu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4166-4167
  id: moharrami25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4166
  lastpage: 4167
  published: 2025-07-02 00:00:00 +0000
- title: 'Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks'
  abstract: 'We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error—a standard approach that leads to worst-case bounds tied to the Littlestone dimension—we consider comparing with predictors that are robust to small input perturbations, perform well under Gaussian smoothing, or maintain a prescribed output margin. Previous examples of this were primarily limited to the hinge loss. Our algorithms achieve regret guarantees that depend only on the VC dimension and the complexity of the instance space (e.g., metric entropy), and notably, they incur only an $O(\log(1/\gamma))$ dependence on the generalized margin $\gamma$. This stands in contrast to most existing regret bounds, which typically exhibit a polynomial dependence on $1/\gamma$. We complement this with matching lower bounds. Our analysis connects recent ideas from adversarial robustness and smoothed online learning.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/montasser25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/montasser25a/montasser25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-montasser25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Omar
    family: Montasser
  - given: Abhishek
    family: Shetty
  - given: Nikita
    family: Zhivotovskiy
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4168-4202
  id: montasser25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4168
  lastpage: 4202
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning'
  abstract: 'We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - \gamma)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/moulin25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/moulin25a/moulin25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-moulin25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Antoine
    family: Moulin
  - given: Gergely
    family: Neu
  - given: Luca
    family: Viano
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4203-4270
  id: moulin25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4203
  lastpage: 4270
  published: 2025-07-02 00:00:00 +0000
- title: 'Are all models wrong? Fundamental limits in distribution-free empirical model falsification'
  abstract: 'In statistics and machine learning, when we train a fitted model on available data, we typically want to ensure that we are searching within a model class that contains at least one accurate model—that is, we would like to ensure an upper bound on the \emph{model class risk} (the lowest possible risk that can be attained by any model in the class). However, it is also of interest to establish lower bounds on the model class risk, for instance so that we can determine whether our fitted model is at least approximately optimal within the class, or, so that we can decide whether the model class is unsuitable for the particular task at hand.  Particularly in the setting of interpolation learning where machine learning models are trained to reach zero error on the training data, we might ask if, at the very least, a positive lower bound on the model class risk is possible—or are we unable to detect that “all models are wrong”? In this work, we answer these questions in a distribution-free setting by establishing a model-agnostic, fundamental hardness result for the problem of constructing a lower bound on the best test error achievable over a model class, and examine its implications on specific model classes such as tree-based methods and linear regression.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/muller25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/muller25a/muller25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-muller25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Manuel M.
    family: Müller
  - given: Yuetian
    family: Luo
  - given: Rina Foygel
    family: Barber
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4271-4308
  id: muller25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4271
  lastpage: 4308
  published: 2025-07-02 00:00:00 +0000
- title: 'Sharper Bounds for Chebyshev Moment Matching, with Applications'
  abstract: 'We study the problem of approximately recovering a probability distribution given noisy measurements of its Chebyshev polynomial moments. This problem arises broadly across algorithms, statistics, and machine learning.  By leveraging a \emph{global decay bound} on the coefficients in the Chebyshev expansion of any Lipschitz function, we sharpen prior work, proving that accurate recovery in the Wasserstein distance is possible with more noise than previously known. Our result immediately yields a number of applications: (1) We give a simple “linear query” algorithm for constructing a differentially private synthetic data distribution with Wasserstein-$1$ error $\tilde{O}(1/n)$ based on a dataset of $n$ points in $[-1,1]$. This bound is optimal up to log factors and matches a recent breakthrough of Boedihardjo, Strohmer, and Vershynin [Probab. Theory. Rel., 2024], which uses a more complex “superregular random walk” method to beat an $O(1/\sqrt{n})$ accuracy barrier inherent to earlier approaches. (2) We give an $\tilde{O}(n^2/\epsilon)$ time algorithm for the linear algebraic problem of estimating the spectral density of an $n\times n$ symmetric matrix up to $\epsilon$ error in the Wasserstein distance. Our result accelerates prior methods from  Chen et al. [ICML 2021] and Braverman et al. [STOC 2022]. (3) We tighten an analysis of Vinayak, Kong, Valiant, and Kakade [ICML 2019] on the maximum likelihood estimator for the statistical problem of “Learning Populations of Parameters”, extending the parameter regime in which sample optimal results can be obtained. Beyond these main results, we provide an extension of our bound to estimating distributions in $d > 1$ dimensions. We hope that these bounds will find applications more broadly to problems involving distribution recovery from noisy moment information.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/musco25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/musco25a/musco25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-musco25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Cameron
    family: Musco
  - given: Christopher
    family: Musco
  - given: Lucas
    family: Rosenblatt
  - given: Apoorv Vikram
    family: Singh
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4309-4358
  id: musco25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4309
  lastpage: 4358
  published: 2025-07-02 00:00:00 +0000
- title: 'Estimating stationary mass, frequency by frequency'
  abstract: 'Suppose we observe a trajectory of length $n$ from an exponentially $\alpha$-mixing stochastic process over a finite but potentially large state space. We consider the problem of estimating the probability mass placed by the stationary distribution of any such process on elements that occur with a certain frequency in the observed sequence. We estimate this vector of probabilities in total variation distance, showing universal consistency in $n$ and recovering known results for i.i.d. sequences as special cases. Our proposed methodology—implementable in linear time—carefully combines the plug-in (or empirical) estimator with a recently-proposed modification of the Good–Turing estimator called WingIt, which was originally developed for Markovian sequences. En route to controlling the error of our estimator, we develop new performance bounds on WingIt and the plug-in estimator for  exponentially $\alpha$-mixing stochastic processes. Importantly, the extensively used method of Poissonization can no longer be applied in our non i.i.d. setting, and so we develop complementary tools—including concentration inequalities for a natural self-normalized statistic of mixing sequences—that may prove independently useful in the design and analysis of estimators for related problems. Simulation studies corroborate our theoretical findings.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/nakul25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/nakul25a/nakul25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-nakul25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Milind
    family: Nakul
  - given: Vidya
    family: Muthukumar
  - given: Ashwin
    family: Pananjady
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4359-4359
  id: nakul25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4359
  lastpage: 4359
  published: 2025-07-02 00:00:00 +0000
- title: 'Improved algorithms for learning quantum Hamiltonians, via flat polynomials'
  abstract: 'We give an improved algorithm for learning a quantum Hamiltonian given copies of its Gibbs state, that can succeed at any temperature. Specifically, we improve over the work of Bakshi, Liu, Moitra, and Tang (2024) by reducing the sample complexity and runtime dependence to singly exponential in the inverse-temperature parameter, as opposed to doubly exponential. Our main technical contribution is a new flat polynomial approximation to the exponential function, with significantly lower degree than the flat polynomial approximation used in Bakshi et al.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/narayanan25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/narayanan25a/narayanan25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-narayanan25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Shyam
    family: Narayanan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4360-4385
  id: narayanan25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4360
  lastpage: 4385
  published: 2025-07-02 00:00:00 +0000
- title: 'Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching'
  abstract: ' Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity, as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/nguyen25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/nguyen25a/nguyen25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-nguyen25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Quan
    family: Nguyen
  - given: Shinji
    family: Ito
  - given: Junpei
    family: Komiyama
  - given: Nishant
    family: Mehta
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4386-4451
  id: nguyen25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4386
  lastpage: 4451
  published: 2025-07-02 00:00:00 +0000
- title: 'On the Hardness of Bandit Learning'
  abstract: 'We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function f belongs to a known,  but arbitrary, function class F.   While many instances of this problem are well understood, we seek a general theory of bandit learnability, akin to the PAC framework for classification. Our investigation is guided by the following two fundamental questions: (1) which classes F are learnable,  and (2) how they are learnable. For example, in the case of binary PAC classification,  learnability is fully determined by a combinatorial dimension, namely, the VC dimension, and can be attained via a simple algorithmic principle, namely, empirical risk minimization (ERM). In contrast to classical learning-theoretic results, our findings reveal  fundamental limitations of learning in structured bandits, offering insights into the boundaries of bandit learnability. First, for the question of "which",  we show that the paradigm of identifying the learnable classes via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even in finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019).  For the question of "how", we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time unless RP=NP. We also prove that this class admits efficient algorithms for standard (albeit possibly computationally hard) algorithmic operations often considered in learning theory, such as an ERM. This implies  that computational hardness is in this case inherent to the task of bandit learning. 
Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/brukhim25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/brukhim25a/brukhim25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-brukhim25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nataly
    family: Brukhim
  - given: Aldo
    family: Pacchiano
  - given: Miroslav
    family: Dudik
  - given: Robert
    family: Schapire
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4452-4485
  id: brukhim25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4452
  lastpage: 4485
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Algorithms in the Limit'
  abstract: 'This paper studies the problem of learning computable functions in the limit by extending Gold’s inductive inference framework to incorporate \textit{computational observations} and \textit{restricted input sources}. Complementary to the traditional Input-Output Observations, we introduce Time-Bound Observations and Policy-Trajectory Observations to study the learnability of general recursive functions under more realistic constraints. While input-output observations do not suffice for learning the class of general recursive functions in the limit, we overcome this learning barrier by imposing computational complexity constraints or supplementing with approximate time-bound observations. Further, we build a formal framework around observations of \textit{computational agents} and show that learning computable functions from policy trajectories reduces to learning rational functions from input and output, thereby revealing interesting connections to finite-state transducer inference. On the negative side, we show that computable or polynomial-mass characteristic sets cannot exist for the class of linear-time computable functions even for policy-trajectory observations.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/papazov25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/papazov25a/papazov25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-papazov25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hristo
    family: Papazov
  - given: Nicolas
    family: Flammarion
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4486-4510
  id: papazov25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4486
  lastpage: 4510
  published: 2025-07-02 00:00:00 +0000
- title: 'Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts'
  abstract: 'We study the problem of releasing a differentially private (DP) synthetic graph $G’$ that well approximates the triangle-motif sizes of all cuts of any given graph $G$, where a motif in general refers to a frequently occurring subgraph within complex networks. Non-private versions of such graphs have found applications in diverse fields such as graph clustering, graph sparsification, and social network analysis. Specifically, we present the first $(\varepsilon,\delta)$-DP mechanism that, given an input graph $G$ with $n$ vertices, $m$ edges and local sensitivity of triangles $\ell_{3}(G)$, generates a synthetic graph $G’$ in polynomial time, approximating the triangle-motif sizes of all cuts $(S,V\setminus S)$ of the input graph $G$ up to an additive error of $\tilde{O}(\sqrt{m\ell_3(G)}n/\varepsilon^{3/2})$. Additionally, we provide a lower bound of $\Omega(\sqrt{mn}\ell_3(G)/\varepsilon)$ on the additive error for any DP algorithm that answers the triangle-motif size queries of all $(S,T)$-cut of $G$. Finally, our algorithm generalizes to weighted graphs, and our lower bound extends to any $K_h$-motif cut for any constant $h\geq 2$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/peng25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/peng25a/peng25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-peng25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Pan
    family: Peng
  - given: Hangyu
    family: Xu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4511-4564
  id: peng25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4511
  lastpage: 4564
  published: 2025-07-02 00:00:00 +0000
- title: 'Recovering Labels from Crowdsourced Data: an Optimal and Polynomial-Time Method'
  abstract: 'Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers’ abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/pilliat25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/pilliat25a/pilliat25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-pilliat25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Emmanuel
    family: Pilliat
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4565-4595
  id: pilliat25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4565
  lastpage: 4595
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimal Robust Estimation under Local and Global Corruptions: Stronger Adversary and Smaller Error'
  abstract: 'Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led Nietert et al. (2024) to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even stronger local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/pittas25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/pittas25a/pittas25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-pittas25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Thanasis
    family: Pittas
  - given: Ankit
    family: Pensia
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4596-4639
  id: pittas25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4596
  lastpage: 4639
  published: 2025-07-02 00:00:00 +0000
- title: 'Lower Bounds for Private Estimation of Gaussian Covariance Matrices under All Reasonable Parameter Regimes'
  abstract: ' One of the most basic problems in statistics is estimating the covariance matrix of a Gaussian distribution. Over the past decade, researchers have studied the efficiency of covariance estimation in the setting of differential privacy. The goal is to minimize the number of samples needed to achieve the desired accuracy and privacy guarantees. We prove lower bounds on the number of samples needed to privately estimate the covariance matrix of a Gaussian distribution. Our bounds match existing upper bounds in the widest known setting of parameters. Our analysis can be seen as a fingerprinting argument, one of the main techniques used to prove lower bounds in differential privacy. Most fingerprinting arguments rely on results  analogous to the celebrated Stein’s identity from probability theory. We use a matrix extension of this identity known as the Stein-Haff identity.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/portella25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/portella25a/portella25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-portella25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Victor S.
    family: Portella
  - given: Nicholas J. A.
    family: Harvey
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4640-4667
  id: portella25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4640
  lastpage: 4667
  published: 2025-07-02 00:00:00 +0000
- title: 'Linear Convergence of Diffusion Models Under the Manifold Hypothesis'
  abstract: 'Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler (KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/potaptchik25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/potaptchik25a/potaptchik25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-potaptchik25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Peter
    family: Potaptchik
  - given: Iskander
    family: Azangulov
  - given: George
    family: Deligiannidis
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4668-4685
  id: potaptchik25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4668
  lastpage: 4685
  published: 2025-07-02 00:00:00 +0000
- title: 'Truthfulness of Decision-Theoretic Calibration Measures'
  abstract: 'Calibration measures quantify how much a forecaster’s predictions violate calibration, which requires that forecasts are unbiased conditional on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the properties, but not both. We introduce a new calibration measure termed subsampled step calibration, $\mathrm{StepCE}^{\mathrm{sub}}$, that is both decision-theoretic and truthful. In particular, on any product distribution, $\mathrm{StepCE}^{\mathrm{sub}}$ is truthful up to an $O(1)$ factor whereas prior decision-theoretic calibration measures suffer from an $e^{-\Omega(T)}$–$\Omega(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c>0$, $\mathrm{StepCE}^{\mathrm{sub}}$ is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while prior decision-theoretic measures have an $e^{-\Omega(T)}$–$\Omega(T^{1/3})$ truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/qiao25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/qiao25a/qiao25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-qiao25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Mingda
    family: Qiao
  - given: Eric
    family: Zhao
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4686-4739
  id: qiao25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4686
  lastpage: 4739
  published: 2025-07-02 00:00:00 +0000
- title: 'Generation through the lens of learning theory'
  abstract: 'We study generation through the lens of learning theory. First, we formalize generation as a sequential two-player game between an adversary and a generator, which generalizes the notion of “language generation in the limit” from Kleinberg and Mullainathan (2024). Then, we extend the notion of “generation in the limit” to two new settings, which we call “uniform” and “non-uniform” generation. We provide a characterization of hypothesis classes that are uniformly and non-uniformly generatable. As is standard in learning theory, our characterizations are in terms of the finiteness of a new combinatorial dimension termed the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are incomparable: there are classes that are generatable but not predictable and vice versa. Finally, we extend our results to capture prompted generation and give a complete characterization of which classes are prompt generatable, generalizing some of the work by Kleinberg and Mullainathan (2024).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/raman25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/raman25a/raman25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-raman25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vinod
    family: Raman
  - given: Jiaxun
    family: Li
  - given: Ambuj
    family: Tewari
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4740-4776
  id: raman25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4740
  lastpage: 4776
  published: 2025-07-02 00:00:00 +0000
- title: 'Metric Clustering and Graph Optimization Problems using Weak Comparison Oracles'
  abstract: 'Traditional clustering methods assume that precise pairwise distances for the input data are readily available. However, this assumption is often impractical in real-world scenarios where data points cannot be measured accurately. For instance, machine learning-based techniques for estimating distances may fail when the dataset consists of images, videos, or natural language. This paper studies clustering and graph problems in settings where direct access to pairwise distances between all pairs is expensive.  We adopt oracle-based methods as defined by Galhotra et al. (2024), focusing on two types of oracles: the quadruplet oracle, a weak and inexpensive comparator that answers binary queries of the form "Is A closer to B or C closer to D?" and the distance oracle, a stronger but costlier oracle that returns exact pairwise distances. The quadruplet oracle can be implemented via crowdsourcing, trained classifiers, or other predictive models. As these sources are often unreliable, the oracle’s responses may be noisy; we consider both probabilistic and adversarial noise models.  Consider a finite metric space $\Sigma=(\mathcal{V},d)$ of size $|\mathcal{V}|=n$ that supports the quadruplet and the distance oracle. When the input dataset has low intrinsic (doubling) dimension, for each of the $k$-center, $k$-median, and $k$-means clustering problem on $\mathcal{V}$, we design constant approximation algorithms that perform  $\widetilde{O}(n+k^2)$ calls to the quadruplet oracle and $\widetilde{O}(1)$ calls to the distance oracle in both noise models. For general metric spaces, our algorithms achieve constant approximation while making $\widetilde{O}(nk)$ calls to the quadruplet oracle and $\widetilde{O}(1)$ calls to the distance oracle. In all cases, we improve the quadruplet oracle query complexity by a factor of $k$ and the distance oracle call complexity by a factor of $k^2$ compared to Galhotra et al. (2024).  
Furthermore, in low dimensional settings, if the spread of the input data is polynomially bounded, we construct a data structure performing $\widetilde{O}(n)$ queries to the quadruplet oracle and $\widetilde{O}(1)$ queries to the distance oracle, such that given any query pair of vertices $(u,v)\in \mathcal{V}\times \mathcal{V}$, it approximates the distance $d(u,v)$ without using any oracle queries. Once the data structure is constructed, we can emulate standard algorithms for various graph problems on $\Sigma$ without additional oracle queries. In summary, our results show that access to a noisy pairwise ranker for distances is sufficient to efficiently solve a large class of problems while almost entirely bypassing exact distance computations.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/raychaudhury25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/raychaudhury25a/raychaudhury25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-raychaudhury25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Rahul
    family: Raychaudhury
  - given: Wen-Zhi
    family: Li
  - given: Syamantak
    family: Das
  - given: Sainyam
    family: Galhotra
  - given: Stavros
    family: Sintos
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4777-4830
  id: raychaudhury25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4777
  lastpage: 4830
  published: 2025-07-02 00:00:00 +0000
- title: 'Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification (extended abstract)'
  abstract: 'Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from \emph{error amplification}, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in \emph{well-specified} settings, and, indeed, a growing body of empirical work hypothesizes that \emph{misspecification}, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification—where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq{}1$—we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: \textbf{(1)} Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. \textbf{(2)} Next-token prediction can be made robust to achieve $C=\tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: \emph{any} next-token prediction-style objective must suffer $C=\Omega(H)$. \textbf{(3)} For the natural testbed of autoregressive \emph{linear} models, \emph{no computationally efficient algorithm} can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/rohatgi25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/rohatgi25a/rohatgi25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-rohatgi25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dhruv
    family: Rohatgi
  - given: Adam
    family: Block
  - given: Audrey
    family: Huang
  - given: Akshay
    family: Krishnamurthy
  - given: Dylan J.
    family: Foster
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4831-4837
  id: rohatgi25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4831
  lastpage: 4837
  published: 2025-07-02 00:00:00 +0000
- title: 'Necessary and Sufficient Oracles: Toward a Computational Taxonomy for Reinforcement Learning'
  abstract: 'Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning “oracles” it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a \emph{minimal} oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model—a ubiquitous setting for RL with function approximation—we identify \emph{two-context regression} as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify \emph{one-context regression} as a near-minimal oracle in the stronger \emph{reset} access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to \emph{Low-Rank MDPs}, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/rohatgi25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/rohatgi25b/rohatgi25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-rohatgi25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dhruv
    family: Rohatgi
  - given: Dylan J.
    family: Foster
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4838-4936
  id: rohatgi25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4838
  lastpage: 4936
  published: 2025-07-02 00:00:00 +0000
- title: 'Can a calibration metric be both testable and actionable?'
  abstract: 'Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration—ensuring forecasted probabilities match empirical frequencies—is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable, but it is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. To resolve this tension, we consider Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable, and we examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/rossellini25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/rossellini25a/rossellini25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-rossellini25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Raphael
    family: Rossellini
  - given: Jake A.
    family: Soloff
  - given: Rina Foygel
    family: Barber
  - given: Zhimei
    family: Ren
  - given: Rebecca
    family: Willett
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4937-4972
  id: rossellini25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4937
  lastpage: 4972
  published: 2025-07-02 00:00:00 +0000
- title: 'Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs'
  abstract: 'We study online learning with oblivious losses and delays under a novel “capacity constraint” that limits how many past rounds can be tracked simultaneously for delayed feedback. Under “clairvoyance” (i.e., delay durations are revealed upfront each round) and/or “preemptibility” (i.e., we have the ability to stop tracking feedback from previously chosen rounds), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the “optimal capacity” needed to match the minimax rates of classical delayed online learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound $d_{\max}$, adding $\widetilde{O}(d_{\max})$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret is $\Theta(\sqrt{TK(1+d/C)+Td\log(K)})$ and the optimal capacity is $\Theta(\min\{K/\log(K),d\})$ in the bandit setting, while in the full-information feedback setting, the minimax regret is $\Theta(\sqrt{T(d+1)\log(K)})$ and the optimal capacity is $\Theta(1)$. For round-dependent and fixed delays, our upper bounds are achieved using novel preemptive and non-preemptive scheduling policies, based on Pareto-distributed proxy delays, and batching techniques, respectively. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/ryabchenko25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/ryabchenko25a/ryabchenko25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-ryabchenko25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Alexander
    family: Ryabchenko
  - given: Idan
    family: Attias
  - given: Daniel M.
    family: Roy
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 4973-5014
  id: ryabchenko25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 4973
  lastpage: 5014
  published: 2025-07-02 00:00:00 +0000
- title: 'Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing'
  abstract: 'We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/ryu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/ryu25a/ryu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-ryu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: J. Jon
    family: Ryu
  - given: Jeongyeol
    family: Kwon
  - given: Benjamin
    family: Koppe
  - given: Kwang-Sung
    family: Jun
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5015-5053
  id: ryu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5015
  lastpage: 5053
  published: 2025-07-02 00:00:00 +0000
- title: 'New Lower Bounds for Non-Convex Stochastic Optimization through Divergence Decomposition'
  abstract: 'We study fundamental limits of first-order stochastic optimization in a range of non-convex settings, including $L$-smooth functions satisfying Quasar-Convexity (QC), Quadratic Growth (QG), and Restricted Secant Inequalities (RSI). While the convergence properties of standard algorithms are well understood in deterministic regimes, significantly fewer results address the stochastic case, where only unbiased and noisy gradients are available. We establish new lower bounds on the number of noisy gradient queries to minimize these classes of functions, also showing that they are tight (up to a logarithmic factor) in all the relevant quantities characterizing each class. Our approach reformulates the optimization task as a function identification problem, leveraging \textit{divergence decomposition} arguments to construct a challenging subclass that leads to sharp lower bounds. Furthermore, we present a specialized algorithm in the one-dimensional setting that achieves faster rates, suggesting that certain dimensional thresholds are intrinsic to the complexity of non-convex stochastic optimization.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/saad25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/saad25a/saad25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-saad25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: El Mehdi
    family: Saad
  - given: Wei-Cheng
    family: Lee
  - given: Francesco
    family: Orabona
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5054-5107
  id: saad25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5054
  lastpage: 5107
  published: 2025-07-02 00:00:00 +0000
- title: 'Depth Separations in Neural Networks: Separating the Dimension from the Accuracy'
  abstract: 'We prove an exponential separation between depth 2 and depth 3 neural networks, when approximating a $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in Safran et al. (2019), and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds that were used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, target accuracy or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension, whereas in our result these parameters are fixed to be \emph{constants} independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst- to average-case random self-reducibility argument, allowing us to leverage lower bounds for depth 2 threshold circuits in a new domain.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/safran25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/safran25a/safran25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-safran25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Itay
    family: Safran
  - given: Daniel
    family: Reichman
  - given: Paul
    family: Valiant
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5108-5142
  id: safran25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5108
  lastpage: 5142
  published: 2025-07-02 00:00:00 +0000
- title: 'The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks'
  abstract: 'We analyze the implicit bias of constant step stochastic subgradient descent (SGD). We consider the setting of binary classification with homogeneous neural networks – a large class of deep neural networks with ReLU-type activation functions such as MLPs and CNNs without biases. We interpret the dynamics of normalized SGD iterates as an Euler-like discretization of a conservative field flow that is naturally associated to the normalized classification margin. Owing to this interpretation, we show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training (i.e., assuming that the data is correctly classified with positive normalized margin). To the best of our knowledge, this is the first extension of the analysis of Lyu and Li (2020) on the discrete dynamics of gradient descent to the nonsmooth and stochastic setting. Our main result applies to binary classification with exponential or logistic losses. We additionally discuss extensions to more general settings.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/schechtman25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/schechtman25a/schechtman25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-schechtman25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Sholom
    family: Schechtman
  - given: Nicolas
    family: Schreuder
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5143-5172
  id: schechtman25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5143
  lastpage: 5172
  published: 2025-07-02 00:00:00 +0000
- title: 'Private List Learnability vs. Online List Learnability'
  abstract: 'This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learnability is equivalent to online learnability [Alon, Livni, Malliaris, and Moran (2019); Bun, Livni, and Moran (2020); Jung, Kim, and Tewari (2020)]. Perhaps surprisingly, we show that this equivalence does not hold in the context of list learning. Specifically, we prove that, unlike in the multiclass setting, a finite $k$-Littlestone dimension (a variant of the classical Littlestone dimension that characterizes online $k$-list learnability) is not a sufficient condition for DP $k$-list learnability. However, similar to the multiclass case, we prove that it remains a necessary condition. To demonstrate where the equivalence breaks down, we provide an example showing that the class of monotone functions with $k+1$ labels over $\mathbb{N}$ is online $k$-list learnable, but not DP $k$-list learnable. This leads us to introduce a new combinatorial dimension, the \emph{$k$-monotone dimension}, which serves as a generalization of the threshold dimension. Unlike the multiclass setting, where the Littlestone and threshold dimensions are simultaneously finite, for $k>1$, the $k$-Littlestone and $k$-monotone dimensions do not exhibit this relationship. We prove that a finite $k$-monotone dimension is another necessary condition for DP $k$-list learnability, alongside finite $k$-Littlestone dimension. Whether the finiteness of both dimensions implies private $k$-list learnability remains an open question.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hanneke25d.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hanneke25d/hanneke25d.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hanneke25d.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  - given: Shay
    family: Moran
  - given: Hilla
    family: Schefler
  - given: Iska
    family: Tsubari
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5173-5213
  id: hanneke25d
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5173
  lastpage: 5213
  published: 2025-07-02 00:00:00 +0000
- title: 'Testing Juntas and Junta Subclasses with Relative Error'
  abstract: 'This paper considers the junta testing problem in a recently introduced “relative error” variant of the standard Boolean function property testing model. In relative-error testing we measure the distance from $f$ to $g$, where $f,g: \{0,1\}^n \to \{0,1\}$, by the ratio of $|f^{-1}(1) \triangle g^{-1}(1)|$ (the number of inputs on which $f$ and $g$ disagree) to $|f^{-1}(1)|$ (the number of satisfying assignments of $f$), and we give the testing algorithm both black-box access to $f$ and also  access to independent uniform samples from $f^{-1}(1)$.   Chen et al. (SODA 2025) observed that the class of $k$-juntas is poly$(2^k,1/\epsilon)$-query testable in the relative-error model, and asked whether poly$(k,1/\epsilon)$ queries is achievable.  We answer this question affirmatively by giving a $\tilde{O}(k/\epsilon)$-query algorithm, matching the optimal complexity achieved in the less challenging standard model.  Moreover, as our main result, we show that any subclass of $k$-juntas that is closed under permuting variables is relative-error testable with a similar complexity.  This gives highly efficient relative-error testing algorithms for a number of well-studied function classes, including size-$k$ decision trees, size-$k$ branching programs, and size-$k$ Boolean formulas.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/chen25h.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/chen25h/chen25h.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-chen25h.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xi
    family: Chen
  - given: William
    family: Pires
  - given: Toniann
    family: Pitassi
  - given: R. A.
    family: Servedio
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5214-5245
  id: chen25h
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5214
  lastpage: 5245
  published: 2025-07-02 00:00:00 +0000
- title: 'Testing (Conditional) Mutual Information - Extended Abstract'
  abstract: 'We investigate the sample complexity of mutual information and conditional mutual information testing. For conditional mutual information testing, given access to independent samples of a triple of random variables $(A, B, C)$ with unknown distribution, we want to distinguish between two cases: (i) $A$ and $C$ are conditionally independent, i.e., $I(A:C|B) = 0$, and (ii) $A$ and $C$ are conditionally dependent, i.e., $I(A:C|B) \geq \varepsilon$ for some threshold $\varepsilon$. We establish an upper bound on the number of samples required to distinguish between the two cases with high confidence, as a function of $\varepsilon$ and the three alphabet sizes. We conjecture that our bound is tight and show that this is indeed the case in several parameter regimes. For the special case of mutual information testing (when $B$ is trivial), we establish the necessary and sufficient number of samples required up to polylogarithmic terms.  Our technical contributions include a novel method to efficiently simulate weakly correlated samples from the conditionally independent distribution $P_{A|B} P_{C|B} P_B$ given access to samples from an unknown distribution $P_{ABC}$, and a new estimator for equivalence testing that can handle such correlated samples, which might be of independent interest.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/seyfried25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/seyfried25a/seyfried25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-seyfried25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Jan
    family: Seyfried
  - given: Sayantan
    family: Sen
  - given: Marco
    family: Tomamichel
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5246-5247
  id: seyfried25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5246
  lastpage: 5247
  published: 2025-07-02 00:00:00 +0000
- title: 'The Pitfalls of Imitation Learning when Actions are Continuous'
  abstract: 'We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action space control system. We show that there exist stable dynamics (i.e. contracting exponentially quickly) and smooth, deterministic experts such that any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly \emph{improper} imitator policies — those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity — or unless the expert trajectory distribution is sufficiently spread. We provide preliminary evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today’s popular policy parameterizations in robot learning (e.g. action-chunking and diffusion-policies). We also establish a host of complementary negative and positive results for imitation in control systems.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/simchowitz25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/simchowitz25a/simchowitz25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-simchowitz25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Max
    family: Simchowitz
  - given: Daniel
    family: Pfrommer
  - given: Ali
    family: Jadbabaie
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5248-5351
  id: simchowitz25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5248
  lastpage: 5351
  published: 2025-07-02 00:00:00 +0000
- title: 'Community detection with the Bethe-Hessian'
  abstract: 'The Bethe-Hessian matrix, introduced by Saade, Krzakala, and Zdeborová (2014), is a Hermitian matrix designed for applying spectral clustering algorithms to sparse networks. Rather than employing a non-symmetric and high-dimensional non-backtracking operator, a spectral method based on the Bethe-Hessian matrix is conjectured to also reach the Kesten-Stigum detection threshold in the sparse stochastic block model (SBM). We provide the first rigorous analysis of the Bethe-Hessian spectral method in the SBM under both the bounded expected degree and the growing degree regimes. Specifically, we demonstrate that: (i) When the expected degree $d\geq 2$, the number of negative outliers of the Bethe-Hessian matrix can consistently estimate the number of blocks above the Kesten-Stigum threshold, thus confirming a conjecture from Saade, Krzakala, and Zdeborová (2014) for $d\geq 2$. (ii) For sufficiently large $d$, its eigenvectors can be used to achieve weak recovery. (iii) As $d\to\infty$, we establish the concentration of the locations of its negative outlier eigenvalues, and weak consistency can be achieved via a spectral method based on the Bethe-Hessian matrix.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/stephan25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/stephan25a/stephan25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-stephan25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Ludovic
    family: Stephan
  - given: Yizhe
    family: Zhu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5352-5353
  id: stephan25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5352
  lastpage: 5353
  published: 2025-07-02 00:00:00 +0000
- title: 'Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity (Extended Abstract)'
  abstract: 'For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank $r$ of the ground-truth matrix. In this paper, we close this gap by showing that the non-convex approaches can be as efficient as nuclear norm minimization in terms of sample complexity. Namely, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We show that factorized gradient descent with spectral initialization converges to the ground truth at a linear rate as soon as the number of samples scales with $\Omega(rd\kappa^2)$, where $d$ is the dimension, and $\kappa$ is the condition number of the ground-truth matrix. This improves the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. We expect that our proof technique will be of independent interest to other non-convex problems.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/stoger25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/stoger25a/stoger25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-stoger25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Dominik
    family: Stöger
  - given: Yizhe
    family: Zhu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5354-5355
  id: stoger25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5354
  lastpage: 5355
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimal Online Bookmaking for Any Number of Outcomes'
  abstract: 'We study the \emph{Online Bookmaking} problem, where a bookmaker dynamically updates betting odds on the possible outcomes of an event. In each betting round, the bookmaker can adjust the odds based on the cumulative betting behavior of gamblers, aiming to maximize profit while mitigating potential loss. We show that for any event and any number of betting rounds, in a worst-case setting over all possible gamblers and outcome realizations, the bookmaker’s optimal loss is the largest root of a simple polynomial. Our solution shows that bookmakers can be as fair as desired while avoiding financial risk, and the explicit characterization reveals an intriguing relation between the bookmaker’s regret and Hermite polynomials. We develop an efficient algorithm that computes the optimal bookmaking strategy: when facing an optimal gambler, the algorithm achieves the optimal loss, and in rounds where the gambler is suboptimal, it reduces the achieved loss to the \emph{optimal opportunistic} loss, a notion that is related to subgame perfect Nash equilibrium. The key technical contribution to achieve these results is an explicit characterization of the \emph{Bellman-Pareto frontier}, which unifies the dynamic programming updates for Bellman’s value function with the multi-criteria optimization framework of the Pareto frontier in the context of vector repeated games.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/tal25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/tal25a/tal25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-tal25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Hadar
    family: Tal
  - given: Oron
    family: Sabag
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5356-5409
  id: tal25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5356
  lastpage: 5409
  published: 2025-07-02 00:00:00 +0000
- title: 'Beyond propagation of chaos: A stochastic algorithm for mean field optimization'
  abstract: 'Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with $n$ particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables.  In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm’s output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/tankala25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/tankala25a/tankala25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-tankala25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chandan
    family: Tankala
  - given: Dheeraj
    family: Nagaraj
  - given: Anant
    family: Raj
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5410-5440
  id: tankala25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5410
  lastpage: 5440
  published: 2025-07-02 00:00:00 +0000
- title: 'Optimal Scheduling of Dynamic Transport'
  abstract: 'Flow-based methods for sampling and generative modeling use continuous-time dynamical systems to represent a {transport map} that pushes forward a source measure to a target measure. The introduction of a time axis provides considerable design freedom, and a central question is how to exploit this freedom. Though many popular methods seek straight line (i.e., zero acceleration) trajectories, we show here that a specific class of “curved” trajectories can significantly improve approximation and learning. In particular, we consider the unit-time interpolation of any given transport map $T$ and seek the schedule $\tau: [0,1] \to [0,1]$ that minimizes the spatial Lipschitz constant of the corresponding velocity field over all times $t \in [0,1]$. This quantity is crucial as it allows for control of the approximation error when the velocity field is learned from data. We show that, for a broad class of source/target measures and transport maps $T$, the \emph{optimal schedule} can be computed in closed form, and that the resulting optimal Lipschitz constant is \emph{exponentially smaller} than that induced by an identity schedule  (corresponding to, for instance, the Wasserstein geodesic).  Our proof technique relies on the calculus of variations and $\Gamma$-convergence, allowing us to approximate the aforementioned degenerate objective by a family of smooth, tractable problems.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/tsimpos25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/tsimpos25a/tsimpos25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-tsimpos25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Panos
    family: Tsimpos
  - given: Ren
    family: Zhi
  - given: Jakob
    family: Zech
  - given: Youssef
    family: Marzouk
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5441-5505
  id: tsimpos25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5441
  lastpage: 5505
  published: 2025-07-02 00:00:00 +0000
- title: 'Corrupted Learning Dynamics in Games'
  abstract: 'Learning in games refers to scenarios where multiple players interact in a shared environment, each aiming to minimize their regret. It is well known that an equilibrium can be computed at a fast rate of $O(1/T)$ when all players follow the optimistic follow-the-regularized-leader (OFTRL). However, this acceleration is limited to the \textit{honest regime}, in which all players fully adhere to a prescribed algorithm—a situation that may not be realistic in practice. To address this issue, we present \textit{corrupted learning dynamics} that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm. First, in two-player zero-sum corrupted games, we provide learning dynamics for which the external regret of the $x$-player (and similarly for the $y$-player) is roughly bounded by $O(\log (m_x m_y) + \sqrt{\hat{C}_y} + \hat{C}_x)$, where $m_x$ and $m_y$ denote the number of actions of the $x$- and $y$-players, respectively, and $\hat{C}_x$ and $\hat{C}_y$ represent their cumulative deviations. We then extend our approach to multiplayer general-sum corrupted games, providing learning dynamics for which the swap regret of player $i$ is bounded by $O(\log T + \sqrt{\sum_{k} \hat{C}_k \log T} + \hat{C}_i)$ ignoring dependence on the number of players and actions, where $\hat{C}_i$ is the cumulative deviation of player $i$ from the prescribed algorithm. Our learning dynamics are agnostic to the levels of corruption. A key technical contribution is a new analysis that ensures the stability of a stationary distribution of a Markov chain under a new adaptive learning rate, thereby allowing us to achieve the desired bound in the corrupted regime while matching the best existing bound in the honest regime. Notably, our framework can be extended to address not only corruption in strategies but also corruption in the observed expected utilities, and we provide several matching lower bounds.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/tsuchiya25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/tsuchiya25a/tsuchiya25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-tsuchiya25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Taira
    family: Tsuchiya
  - given: Shinji
    family: Ito
  - given: Haipeng
    family: Luo
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5506-5552
  id: tsuchiya25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5506
  lastpage: 5552
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning shallow quantum circuits with many-qubit gates'
  abstract: 'The seminal work of [LMN’93] established a cornerstone result for classical complexity, with profound implications for learning theory. By proving low-degree Fourier concentration of AC0, the work demonstrated that Boolean functions computed by constant-depth circuits can be efficiently PAC-learned via low-degree Fourier sampling. This breakthrough provided the first sample- and time-efficient (quasi-polynomial) algorithm for learning AC0. Proposed by [Moore’99] as a natural quantum analog of AC0, QAC0 is the class of constant-depth quantum circuits composed of arbitrary single-qubit gates and polynomially many $CZ$ gates of unbounded width. In this work, we present the first algorithm for efficient average-case learning of QAC0 circuits with logarithmic ancilla. Namely, our algorithm achieves quasi-polynomial sample- and time-complexity for learning unknown QAC0 unitaries to inverse-polynomially small error. We further show that these learned unitaries can be efficiently synthesized via poly-logarithmic depth circuits, making progress towards proper learning of QAC0. Since in finite-dimensional circuit geometries QAC0 circuits require polynomial depth to implement, this result significantly expands the family of efficiently learnable quantum circuits.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/vasconcelos25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/vasconcelos25a/vasconcelos25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-vasconcelos25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Francisca
    family: Vasconcelos
  - given: Hsin-Yuan
    family: Huang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5553-5604
  id: vasconcelos25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5553
  lastpage: 5604
  published: 2025-07-02 00:00:00 +0000
- title: 'Black-Box Reductions for Decentralized Online Convex Optimization in Changing Environments'
  abstract: 'We investigate decentralized online convex optimization (D-OCO) in changing environments, and choose adaptive regret and dynamic regret as the performance metric. Specifically, these two metrics compare each local learner against the optimal comparator over every interval, and any sequence of comparators over all rounds, respectively. It is well-known that in the centralized setting, plenty of algorithms with (nearly) optimal bounds on these two metrics have been proposed. However, none of them has been extended into D-OCO, possibly due to the difficulty in handling their commonly used two-level structure. To fill the gap, in this paper, we propose black-box reductions from minimizing these two metrics of D-OCO to minimizing them in the centralized setting. Let $n$, $\rho$, and $T$ denote the number of local learners, the spectral gap of the communication matrix, and the time horizon, respectively. For adaptive regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{\tau}\log T)$ bound over any interval of length $\tau$ in general, and an improved one of $\tilde{O}(n\rho^{-1/2}(\log T)^3)$ when facing strongly convex functions. These two bounds match existing lower bounds up to polylogarithmic factors. For dynamic regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{T(1+P_T)\log T})$ bound in general, where $P_T$ is the path-length of comparators. We also provide the first lower bound for dynamic regret of D-OCO to demonstrate that our dynamic regret is nearly optimal.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/wan25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/wan25a/wan25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-wan25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yuanyu
    family: Wan
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5605-5631
  id: wan25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5605
  lastpage: 5631
  published: 2025-07-02 00:00:00 +0000
- title: 'Learning Compositional Functions with Transformers from Easy-to-Hard Data'
  abstract: 'Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via SGD, remains an open question. Towards answering this question, we study the learnability of a task called the \emph{$k$-fold composition}, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations, and can be expressed by a transformer with $O(\log k)$ layers. On the negative front, we provide a Statistical Query lower bound showing that any learner which is trained on samples from the $k$-fold composition task and makes polynomially many queries must have sample size exponential in $k$, thus establishing a statistical-computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer via two different curriculum learning strategies: one in which data consists of $k’$-fold composition functions with $k’ \le k$ presented in increasing order of difficulty, and another in which all data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/wang25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/wang25a/wang25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-wang25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zixuan
    family: Wang
  - given: Eshaan
    family: Nichani
  - given: Alberto
    family: Bietti
  - given: Alex
    family: Damian
  - given: Daniel
    family: Hsu
  - given: Jason D
    family: Lee
  - given: Denny
    family: Wu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5632-5711
  id: wang25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5632
  lastpage: 5711
  published: 2025-07-02 00:00:00 +0000
- title: 'Orthogonal Causal Calibration (Extended Abstract)'
  abstract: 'Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called universal orthogonality, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample-splitting algorithm that performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two-term upper bound on the calibration error of any model that decouples the error in estimating unknown nuisance parameters from the calibration error in a hypothetical world where the learned nuisances are true. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/whitehouse25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/whitehouse25a/whitehouse25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-whitehouse25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Justin
    family: Whitehouse
  - given: Christopher
    family: Jung
  - given: Vasilis
    family: Syrgkanis
  - given: Bryan
    family: Wilder
  - given: Zhiwei Steven
    family: Wu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5712-5713
  id: whitehouse25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5712
  lastpage: 5713
  published: 2025-07-02 00:00:00 +0000
- title: 'Time-Uniform Self-Normalized Concentration for Vector-Valued Processes (Extended Abstract)'
  abstract: 'Self-normalized processes arise naturally in many learning-related tasks. While self-normalized concentration has been extensively studied for scalar-valued processes, there are few results for multidimensional processes  outside of the sub-Gaussian setting. In this work, we construct a general, self-normalized inequality for multivariate processes that satisfy a simple yet broad “sub-$\psi$” tail condition, which generalizes assumptions based on cumulant generating functions. From this general inequality, we derive an upper law of the iterated logarithm for sub-$\psi$ vector-valued processes, which is tight up to small constants. We show how our inequality can be leveraged to derive a variety of novel, self-normalized concentration inequalities under both light and heavy-tailed observations. Further, we provide applications in prototypical statistical tasks, such as parameter estimation in online linear regression, autoregressive modeling, and bounded mean estimation via a new (multivariate) empirical Bernstein concentration inequality. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/whitehouse25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/whitehouse25b/whitehouse25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-whitehouse25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Justin
    family: Whitehouse
  - given: Zhiwei Steven
    family: Wu
  - given: Aaditya
    family: Ramdas
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5714-5715
  id: whitehouse25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5714
  lastpage: 5715
  published: 2025-07-02 00:00:00 +0000
- title: 'Mixing Time of the Proximal Sampler in Relative Fisher Information via Strong Data Processing Inequality (Extended Abstract)'
  abstract: 'We study sampling from a probability distribution $\nu \propto e^{-f}$ on $\mathbb{R}^d$, from the perspective of minimizing relative entropy (KL divergence) $H_\nu(\rho)$ on the space of probability distributions with the Wasserstein geometry. The Langevin dynamics is a continuous-time stochastic process in $\mathbb{R}^d$ that implements the Wasserstein gradient flow for minimizing $H_\nu$. The relative Fisher information has the geometric meaning as the squared Wasserstein gradient of relative entropy. When $\nu$ is strongly log-concave, relative entropy $H_\nu$ is a strongly convex function, and Langevin dynamics has a fast convergence guarantee in relative Fisher information. In discrete time, we study the Proximal Sampler, a two-step Gibbs sampling algorithm to sample from an auxiliary joint distribution which has the original target distribution as the $x$-marginal. The Proximal Sampler can be seen as an approximate proximal discretization of the Langevin dynamics, and it has matching convergence rates with the continuous-time Langevin dynamics in many settings, for example an exponential convergence rate in KL divergence under a log-Sobolev inequality. In this work, we show that when $\nu$ is $\alpha$-strongly log-concave, the Proximal Sampler also has an exponential convergence in relative Fisher information. We conclude a high-iteration complexity guarantee of the Proximal Sampler in relative Fisher information when the target is strongly log-concave and log-smooth. Our analysis proceeds via establishing the strong data processing inequality (SDPI) for a family of Fokker-Planck channels driven by diffusion processes, including the Gaussian channel, the Ornstein-Uhlenbeck (OU) channel, the Langevin dynamics, and the reverse Gaussian channel. We show that even along the Gaussian channel, the data processing inequality in relative Fisher information may not hold when the second distribution is arbitrary. We also show that, along the Gaussian channel, the (S)DPI in relative Fisher information holds when the second distribution is (strongly) log-concave, and that the SDPI in relative Fisher information eventually holds when the second distribution is a log-Lipschitz perturbation of a strongly log-concave distribution. Along the Ornstein-Uhlenbeck channel, we show that the SDPI in relative Fisher information eventually holds when the second distribution is strongly log-concave, and exhibit an example where the DPI initially does not hold even when both input distributions are Gaussian. For our algorithmic result, we can write the Proximal Sampler as a composition of the Gaussian and reverse Gaussian channels. Then we can combine the SDPI for the Gaussian channel under strong log-concavity and the DPI for the reverse Gaussian channel to show that relative Fisher information converges exponentially fast along the Proximal Sampler.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/wibisono25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/wibisono25a/wibisono25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-wibisono25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Andre
    family: Wibisono
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5716-5717
  id: wibisono25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5716
  lastpage: 5717
  published: 2025-07-02 00:00:00 +0000
- title: 'Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization'
  abstract: 'Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score, the exact solution to the denoising score matching objective, leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/wu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/wu25a/wu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-wu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yu-Han
    family: Wu
  - given: Pierre
    family: Marion
  - given: Gérard
    family: Biau
  - given: Claire
    family: Boyer
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5718-5756
  id: wu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5718
  lastpage: 5756
  published: 2025-07-02 00:00:00 +0000
- title: 'Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications'
  abstract: 'In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the “Gaussian equivalence” phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allows for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in Erba et al. (2024) regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in Maillard et al. (2024) on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/xu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/xu25a/xu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-xu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yizhou
    family: Xu
  - given: Antoine
    family: Maillard
  - given: Lenka
    family: Zdeborová
  - given: Florent
    family: Krzakala
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5757-5823
  id: xu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5757
  lastpage: 5823
  published: 2025-07-02 00:00:00 +0000
- title: 'Generalization error bound for denoising score matching under relaxed manifold assumption'
  abstract: 'We examine theoretical properties of the denoising score matching estimate. We model the density of observations with a nonparametric Gaussian mixture. We significantly relax the standard manifold assumption, allowing the samples to step away from the manifold. At the same time, we are still able to leverage a nice distribution structure. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate. The rates of convergence are determined by the intrinsic dimension. Furthermore, our bounds remain valid even if we allow the ambient dimension to grow polynomially with the sample size.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/yakovlev25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/yakovlev25a/yakovlev25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-yakovlev25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Konstantin
    family: Yakovlev
  - given: Nikita
    family: Puchkin
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5824-5891
  id: yakovlev25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5824
  lastpage: 5891
  published: 2025-07-02 00:00:00 +0000
- title: 'Improved Algorithms for Effective Resistance Computation on Graphs'
  abstract: 'Effective Resistance (ER) is a fundamental tool in various graph learning tasks. In this paper, we address the problem of efficiently approximating ER on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $n$ vertices and $m$ edges. First, we focus on local online-computation algorithms for ER approximation, aiming to improve the dependency on the approximation error parameter $\epsilon$. Specifically, for a given vertex pair $(s,t)$, we propose a local algorithm with a time complexity of $\tilde{O}(\sqrt{d}/\epsilon)$ to compute an $\epsilon$-approximation of the $s,t$-ER value for expander graphs, where $d=\min \{d_s,d_t\}$. This improves upon the previous state-of-the-art, including an $\tilde{O}(1/\epsilon^2)$ time algorithm based on random walk sampling by Andoni et al. (ITCS’19) and Peng et al. (KDD’21). Our method achieves this improvement by combining deterministic search with random walk sampling to reduce variance. Second, we establish a lower bound for ER approximation on expander graphs. We prove that for any $\epsilon\in (0,1)$, there exist an expander graph and a vertex pair $(s,t)$ such that any local algorithm requires at least $\Omega(1/\epsilon)$ time to compute the $\epsilon$-approximation of the $s,t$-ER value. Finally, we extend our techniques to index-based algorithms for ER computation. We propose an algorithm with $\tilde{O}(\min \{m+n/\epsilon^{1.5},\sqrt{nm}/\epsilon\})$ processing time, $\tilde{O}(n/\epsilon)$ space complexity and $O(1)$ query complexity, which returns an $\epsilon$-approximation of the $s,t$-ER value for any $s,t\in \mathcal{V}$ for expander graphs. Our approach improves upon the state-of-the-art $\tilde{O}(m/\epsilon)$ processing time by Dwaraknath et al. (NeurIPS’24) and the $\tilde{O}(m+n/\epsilon^2)$ processing time by Li and Sachdeva (SODA’23).'
  volume: 291
  URL: https://proceedings.mlr.press/v291/yichun25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/yichun25a/yichun25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-yichun25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yichun
    family: Yang
  - given: Rong-Hua
    family: Li
  - given: Meihao
    family: Liao
  - given: Guoren
    family: Wang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5892-5920
  id: yichun25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5892
  lastpage: 5920
  published: 2025-07-02 00:00:00 +0000
- title: 'Robustly Learning Monotone Generalized Linear Models via Data Augmentation'
  abstract: 'We study the task of learning Generalized Linear Models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for {\em any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem by developing a robust counterpart to the classical GLMtron algorithm \citep{kakade2011efficient}. Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+\zeta)$-moments, for any fixed $\zeta>0$—a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/zarifis25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/zarifis25a/zarifis25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-zarifis25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Nikos
    family: Zarifis
  - given: Puqian
    family: Wang
  - given: Ilias
    family: Diakonikolas
  - given: Jelena
    family: Diakonikolas
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5921-5990
  id: zarifis25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5921
  lastpage: 5990
  published: 2025-07-02 00:00:00 +0000
- title: 'Anytime Acceleration of Gradient Descent'
  abstract: 'This work investigates stepsize-based acceleration of gradient descent with anytime convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O\big(T^{-\frac{2\log_2\rho}{1+\log_2\rho}}\big) \approx O(T^{-1.119})$ for any stopping time  $T$, where $\rho=\sqrt{2}+1$ is the silver ratio and the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem  regarding whether stepsize-based acceleration  can yield anytime convergence rates of $o(T^{-1})$. We further extend our theory to yield anytime convergence guarantees of $\exp(-\Omega(T/\kappa^{0.893}))$ for smooth and strongly convex optimization, with $\kappa$ being the condition number. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/zhang25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/zhang25a/zhang25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-zhang25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Zihan
    family: Zhang
  - given: Jason
    family: Lee
  - given: Simon
    family: Du
  - given: Yuxin
    family: Chen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 5991-6013
  id: zhang25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 5991
  lastpage: 6013
  published: 2025-07-02 00:00:00 +0000
- title: 'Fast and Multiphase Rates for Nearest Neighbor Classifiers'
  abstract: 'We study the scaling of classification error rates with respect to the size of the training dataset. In contrast to classical results where rates are minimax optimal for a problem class, this work starts with the empirical observation that, even for a fixed data distribution, the error scaling can have \emph{diverse} rates across different ranges of sample size. To understand when and why the error rate is non-uniform, we theoretically analyze nearest neighbor classifiers. We show that an error scaling law can have fine-grained rates: in the early phase, the test error depends polynomially on the data dimension and decreases fast; whereas in the later phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the complexity of the data distribution in determining the test error. When the data are distributed benignly, we show that the generalization error of the nearest neighbor classifier can depend polynomially, instead of exponentially, on the data dimension.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/yang25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/yang25a/yang25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-yang25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Pengkun
    family: Yang
  - given: Jingzhao
    family: Zhang
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6014-6015
  id: yang25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6014
  lastpage: 6015
  published: 2025-07-02 00:00:00 +0000
- title: 'Linear Bandits on Ellipsoids: Minimax Optimal Algorithms'
  abstract: 'We consider linear stochastic bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. We first derive a novel information-theoretic lower bound on the regret of any algorithm, which must be at least $\Omega(\min(d \sigma \sqrt{T} + d \|\theta\|_{A}, \|\theta\|_{A} T))$ where $d$ is the dimension, $T$ the time horizon, $\sigma^2$ the noise variance, $A$ a matrix defining the set of actions and $\theta$ the vector of unknown parameters. We then provide an algorithm whose regret matches this bound up to a multiplicative universal constant. The algorithm is non-classical in the sense that it is not optimistic, and it is not a sampling algorithm. The main idea is to combine a novel sequential procedure to estimate $\|\theta\|$, followed by an explore-and-commit strategy informed by this estimate. The algorithm is highly computationally efficient, and a run requires only time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$, in contrast with known optimistic algorithms, which are not implementable in polynomial time. We go beyond minimax optimality and show that our algorithm is locally asymptotically minimax optimal, a much stronger notion of optimality. We further provide numerical experiments to illustrate our theoretical findings. The code to reproduce the experiments is available at \url{https://github.com/RaymZhang/LinearBanditsEllipsoidsMinimaxCOLT}.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/zhang25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/zhang25b/zhang25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-zhang25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Raymond
    family: Zhang
  - given: Hédi
    family: Hadiji
  - given: Richard
    family: Combes
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6016-6040
  id: zhang25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6016
  lastpage: 6040
  published: 2025-07-02 00:00:00 +0000
- title: 'Towards Fundamental Limits for Active Multi-distribution Learning'
  abstract: 'Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier’s performance is measured by its error under the worst distribution. This problem has attracted a lot of recent interest due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of the sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remains unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds, in distribution-dependent and distribution-free settings. Specifically, we prove upper bounds of $\widetilde{O}\Bigl(\theta_{\mathrm{max}}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$ and $\widetilde{O}\Bigl(\theta_{\mathrm{max}}(d+k)\Bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\Bigr)+\frac{k\nu}{\varepsilon^2}\Bigr)$ in the realizable and agnostic settings respectively, where $\theta_{\mathrm{max}}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $\nu$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish an instance-dependent sample complexity bound for passive multi-distribution learning that smoothly interpolates between the realizable and agnostic regimes (Blum et al., 2017; Zhang et al., 2024), which may be of independent interest.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/zhang25c.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/zhang25c/zhang25c.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-zhang25c.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Chicheng
    family: Zhang
  - given: Yihan
    family: Zhou
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6041-6090
  id: zhang25c
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6041
  lastpage: 6090
  published: 2025-07-02 00:00:00 +0000
- title: 'The Adaptive Complexity of Finding a Stationary Point'
  abstract: 'In large-scale applications, such as machine learning, it is desirable to design non-convex optimization algorithms with a high degree of parallelization. In this work, we study the adaptive complexity of finding a stationary point, which is the minimal number of sequential rounds required to achieve stationarity given polynomially many queries executed in parallel at each round. For the high-dimensional case, \emph{i.e.}, $d = \widetilde{\Omega}(\varepsilon^{-(2 + 2p)/p})$, we show that for any (potentially randomized) algorithm, there exists a function with Lipschitz $p$-th order derivatives such that the algorithm requires at least $\varepsilon^{-(p+1)/p}$ iterations to find an $\varepsilon$-stationary point. Our lower bounds are tight and show that even with $\mathrm{poly}(d)$ queries per iteration, no algorithm has a better convergence rate than those achievable with one-query-per-round algorithms. In other words, gradient descent, the cubic-regularized Newton’s method, and the $p$-th order adaptive regularization method are adaptively optimal. Our proof relies upon a novel analysis characterizing the outputs for the hardness potentials, based on a chain-like structure with random partition. For the constant-dimensional case, \emph{i.e.}, $d = \Theta(1)$, we propose an algorithm that bridges grid search and gradient flow trapping, finding an approximate stationary point in a constant number of iterations. Its asymptotic tightness is verified by a new lower bound on the required queries per iteration. We show there exists a smooth function such that any algorithm running with $\Theta(\log (1/\varepsilon))$ rounds requires at least $\widetilde{\Omega}((1/\varepsilon)^{(d-1)/2})$ queries per round. This lower bound is tight up to a logarithmic factor, and implies that gradient flow trapping is adaptively optimal.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/huanjian25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/huanjian25a/huanjian25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-huanjian25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Huanjian
    family: Zhou
  - given: Andi
    family: Han
  - given: Akiko
    family: Takeda
  - given: Masashi
    family: Sugiyama
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6091-6123
  id: huanjian25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6091
  lastpage: 6123
  published: 2025-07-02 00:00:00 +0000
- title: 'Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification'
  abstract: 'We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Gr{ü}nwald and Langford (2004) previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Gr{ü}nwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/zhu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/zhu25a/zhu25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-zhu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Xiaohan
    family: Zhu
  - given: Nathan
    family: Srebro
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6124-6155
  id: zhu25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6124
  lastpage: 6155
  published: 2025-07-02 00:00:00 +0000
- title: 'Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL'
  abstract: 'We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without knowledge of $H$, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term \textit{horizon calibration}. We also develop an \textit{empirical span penalization} approach, inspired by sample variance penalization, which satisfies an \textit{oracle inequality} performance guarantee. In particular, this algorithm can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/zurek25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/zurek25a/zurek25a.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-zurek25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Matthew
    family: Zurek
  - given: Yudong
    family: Chen
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6156-6209
  id: zurek25a
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6156
  lastpage: 6209
  published: 2025-07-02 00:00:00 +0000
- title: 'Open Problem: Fixed-Parameter Tractability of Zonotope Problems'
  abstract: 'Neural networks with ReLU activation play a key role in modern machine learning. Understanding the functions represented by ReLU networks is a major topic in current research. Recent results are achieved via connections to tropical geometry based on a duality between convex piecewise linear functions and polytopes. It turns out that several questions about properties of functions computed by ReLU neural networks can be answered by solving certain problems on special polytopes called zonotopes. For example, computing the Lipschitz constant of a ReLU network with one hidden layer corresponds to norm maximization over a zonotope.  Moreover, deciding whether the ReLU network attains a positive output is equivalent to zonotope non-containment. These problems are known to be NP-hard in general but polynomial-time solvable if the input dimension is constant. However, it is open whether they are \emph{fixed-parameter tractable} (FPT) with respect to the input dimension $d$, that is, solvable in $f(d)\cdot n^{O(1)}$ time for some function $f$ solely depending on $d$. Notably, these zonotope problems also arise in other areas such as robotics and control, reachability analysis, pattern recognition, signal processing or political analysis. Thus, settling their parameterized complexity status is of broad interest. '
  volume: 291
  URL: https://proceedings.mlr.press/v291/froese25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/froese25b/froese25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-froese25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Vincent
    family: Froese
  - given: Moritz
    family: Grillo
  - given: Christoph
    family: Hertrich
  - given: Martin
    family: Skutella
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6210-6214
  id: froese25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6210
  lastpage: 6214
  published: 2025-07-02 00:00:00 +0000
- title: 'Open Problem: Structure-Agnostic Minimax Risk for Partial Linear Model'
  abstract: 'Double machine learning is a theoretically grounded and practically efficient procedure for a variety of causal estimands and functional estimation problems when adopting black-box machine learning models for estimating nuisance parameters. It is known that double machine learning may have sub-optimal performance in structure-aware settings, e.g., when the nuisances are H{ö}lder smooth functions, and recent articles (Balakrishnan et al., 2023) deliver the message that double machine learning is optimal in structure-agnostic settings. This note claims that whether double machine learning is optimal for black-box machine learning models remains open, even for the simplest linear coefficient estimation in the partial linear model. We argue that the key gap differentiating structure-agnostic and structure-aware settings, which previous lower bound results do not address, is the role of variance: awareness of well-conditioned structures offers the possibility of mitigating the effects of variance, while it is not clear that this is possible in structure-agnostic settings. The answer to this question has significant implications both in theory and practice.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/gu25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/gu25b/gu25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-gu25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Yihong
    family: Gu
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6220-6224
  id: gu25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6220
  lastpage: 6224
  published: 2025-07-02 00:00:00 +0000
- title: 'Open Problem: Data Selection for Regression Tasks'
  abstract: 'This note proposes a set of open problems concerning data selection in regression tasks. The central question is: given a natural learning rule $\mathcal{A}$ and a selection budget $n$, how well can $\mathcal{A}$ perform when trained on $n$ examples selected from a larger dataset? We present concrete instances of this question in basic regression settings, including mean estimation and linear regression.'
  volume: 291
  URL: https://proceedings.mlr.press/v291/hanneke25e.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/hanneke25e/hanneke25e.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-hanneke25e.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Steve
    family: Hanneke
  - given: Shay
    family: Moran
  - given: Alexander
    family: Shlimovich
  - given: Amir
    family: Yehudayoff
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6225-6229
  id: hanneke25e
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6225
  lastpage: 6229
  published: 2025-07-02 00:00:00 +0000
- title: 'Open Problem: Optimal Instance-Dependent Sample Complexity for finding Nash Equilibrium in Two Player Zero-Sum Matrix games'
  abstract: 'Optimal instance-dependent sample complexity is a well-studied topic in the multi-armed bandit literature. However, the analogous question in the setting of two-player zero-sum matrix games, where the payoff matrix can only be accessed through noisy samples, remains largely unexplored despite being a natural generalization of the multi-armed bandit problem.  In this write-up, we pose a simple open question: What is the optimal instance-dependent sample complexity to find an approximate Nash equilibrium in two-player zero-sum matrix games?'
  volume: 291
  URL: https://proceedings.mlr.press/v291/maiti25b.html
  PDF: https://raw.githubusercontent.com/mlresearch/v291/main/assets/maiti25b/maiti25b.pdf
  edit: https://github.com/mlresearch//v291/edit/gh-pages/_posts/2025-07-02-maiti25b.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of Thirty Eighth Conference on Learning Theory'
  publisher: 'PMLR'
  author: 
  - given: Arnab
    family: Maiti
  editor: 
  - given: Nika
    family: Haghtalab
  - given: Ankur
    family: Moitra
  page: 6230-6234
  id: maiti25b
  issued:
    date-parts: 
      - 2025
      - 7
      - 2
  firstpage: 6230
  lastpage: 6234
  published: 2025-07-02 00:00:00 +0000
