<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Proceedings of Machine Learning Research</title>
    <description>Proceedings of The 37th International Conference on Algorithmic Learning Theory
  Held at the Fields Institute, Toronto, Canada on 23-26 February 2026

Published as Volume 313 by the Proceedings of Machine Learning Research on 05 May 2026.

Volume Edited by:
  Matus Telgarsky
  Jonathan Ullman

Series Editors:
  Neil D. Lawrence
</description>
    <link>https://proceedings.mlr.press/v313/</link>
    <atom:link href="https://proceedings.mlr.press/v313/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 05 May 2026 08:43:03 +0000</pubDate>
    <lastBuildDate>Tue, 05 May 2026 08:43:03 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Quantitative Convergence Analysis of Projected Stochastic Gradient Descent for Non-Convex Losses via the Goldstein Subdifferential</title>
        <description>Stochastic gradient descent (SGD) is the main algorithm behind a large body of work in machine learning. In many cases, constraints are enforced via projections, leading to projected stochastic gradient algorithms. In recent years, a large body of work has examined the convergence properties of projected SGD for non-convex losses in asymptotic and non-asymptotic settings. Strong quantitative guarantees are available for convergence measured via Moreau envelopes. However, these results cannot be compared directly with work on unconstrained SGD, since the Moreau envelope construction changes the gradient. Other common measures based on gradient mappings have the limitation that convergence can only be guaranteed if variance reduction methods, such as mini-batching, are employed. This paper presents an analysis of projected SGD for non-convex losses over compact convex sets. Convergence is measured via the distance of the gradient to the Goldstein subdifferential generated by the constraints. Our proposed convergence criterion directly reduces to commonly used criteria in the unconstrained case, and we obtain convergence without requiring variance reduction. We obtain results for data that are independent, identically distributed (IID) or satisfy mixing conditions ($L$-mixing). In these cases, we derive asymptotic convergence and $O(N^{-1/3})$ non-asymptotic bounds in expectation, where $N$ is the number of steps. In the case of IID sub-Gaussian data, we obtain almost-sure asymptotic convergence and high-probability $\tilde O(N^{-1/5})$ non-asymptotic bounds. In particular, these are the first non-asymptotic high-probability bounds for projected SGD with non-convex losses.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/zheng26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/zheng26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Smoothed Online Optimization for Target Tracking: Robust and Learning-Augmented Algorithms</title>
        <description>We introduce the Smoothed Online Optimization for Target Tracking (SOOTT) problem, a new framework that integrates three key objectives in online decision-making under uncertainty: (1) tracking cost for following a dynamically moving target, (2) adversarial perturbation cost for withstanding unpredictable disturbances, and (3) switching cost for penalizing abrupt changes in decisions. This formulation captures real-world scenarios, such as elastic and inelastic workload scheduling in AI clusters, where operators must balance long-term service-level agreements for elastic workloads, like LLM training, against sudden demand spikes for inelastic workloads, like real-time inference. We first present BEST, a robust algorithm with provable competitive guarantees for SOOTT. To enhance practical performance, we introduce CoRT, a learning-augmented variant that incorporates untrusted black-box predictions (e.g., from ML models) into its decision process. Our theoretical analysis shows that CoRT strictly improves over BEST when predictions are accurate, while maintaining robustness under arbitrary prediction errors. We validate our approach through a case study on workload scheduling, demonstrating that both algorithms effectively balance trajectory tracking, decision smoothness, and resilience to external disturbances.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/zeynali26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/zeynali26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention</title>
        <description>In this paper, we study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in ${\mathbb R}^d$ by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\epsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by vanilla gradient descent (GD) achieves a sample complexity of $n \asymp \Theta(d^{\ell_0}/\epsilon)$ with high probability, in contrast to the representative sample complexity $\Theta\big(d^{\ell_0} \max\{\epsilon^{-2},\log d\}\big)$, where $n$ is the training data size. Moreover, this sample complexity is not improvable, since the trained network attains a sharp nonparametric regression risk of order $\Theta(d^{\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/{n})$, so the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. The training of the two-layer NN with channel attention is a two-stage process. In stage one, a novel and provable learnable channel selection algorithm, serving as a learnable harmonic-degree selection process, is employed to select, with high probability, the ground-truth channel number of the target function, $\ell_0$, among the initial $L \ge \ell_0$ channels of the activation function in the first layer. This learnable channel selection is performed by efficient one-step GD on both layers of the NN, which achieves the goal of feature learning for low-degree polynomials. In stage two, the second layer of the network is trained by standard GD using the activation function with the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound has been obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/yang26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/yang26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Improved Regret in Stochastic Decision-Theoretic Online Learning under Differential Privacy</title>
        <description>Hu and Mehta [2024] posed an open problem: what is the optimal instance-dependent rate for stochastic decision-theoretic online learning (with $K$ actions and $T$ rounds) under $\varepsilon$-differential privacy? Previously, the best known upper and lower bounds were $O\left(\frac{\log K}{\Delta_{\min}} + \frac{\log K\log T}{\varepsilon}\right)$ and $\Omega\left(\frac{\log K}{\Delta_{\min}} + \frac{\log K}{\varepsilon}\right)$ (where $\Delta_{\min}$ is the gap between the optimal and the second-best actions). In this paper, we partially address this open problem via two new results. First, we provide an improved upper bound for this problem, $O\left(\frac{\log K}{\Delta_{\min}} + \frac{\log^2K}{\varepsilon}\right)$, which is $T$-independent and depends on $K$ only polylogarithmically. Second, to further understand the gap, we introduce the deterministic setting, a weaker version of this open problem in which the received loss vector is deterministic. In this weaker setting, a direct application of the analysis and algorithms from the original setting still leads to an extra log factor. We conduct a novel analysis that proves upper and lower bounds matching at $\Theta(\frac{\log K}{\varepsilon})$.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/wu26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/wu26a.html</guid>
        
        
      </item>
    
      <item>
        <title>PAC-Bayesian Analysis of the Surrogate Relation between Joint Embedding and Supervised Downstream Losses</title>
        <description>In recent years, self-supervised representation learning (SSL) has become an important learning paradigm and a crucial component of foundation models. SSL-based training pipelines are typically formalized as a sequence of two tasks—a pretext task that learns representations from large amounts of augmented unlabeled data, and a downstream task, where a simple model is fit on the learned representations using only a small amount of labeled data. The strong empirical performance of SSL-based pipelines for prominent joint embedding loss functions is not yet well explained in theory, for two main reasons: a lack of non-vacuous generalization bounds for the models learned in the pretext task, and a lack of practically computable transfer bounds that describe how generalization bounds derived for the pretext task transfer to the downstream task. In this work, we first derive non-vacuous PAC-Bayesian generalization bounds for models optimized in the pretext task with prominent joint embedding SSL loss functions (VICReg, Barlow Twins, and Spectral Contrastive loss), accounting for their non-i.i.d. nature. Next, we provide the first practically computable transfer bounds for our considered loss functions by formally proving a surrogate relation that upper bounds the downstream squared L2 loss by the SSL pretext loss and a more accurate measure for the influence of the chosen augmentations than in previous work. In addition, our theoretical analysis identifies effective hyperparameter choices, thereby reducing the need for extensive hyperparameter tuning and offering principled guidance for model selection. We empirically validate our theoretical findings on the CIFAR-10 and MNIST datasets.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/wasserer26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/wasserer26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Graph Inference with Effective Resistance Queries</title>
        <description>The goal of *graph inference* is to design algorithms for learning properties of a hidden graph using queries to an oracle that returns information about the graph. Graph reconstruction, verification, and property testing are all special cases of graph inference. In this work, we study graph inference using an oracle that returns the *effective resistance* (ER) distance between a given pair of vertices. Effective resistance is a natural notion of distance that arises from viewing graphs as electrical circuits, and has many applications. However, it has received little attention from a graph inference perspective. Indeed, although it is known that any $n$-vertex graph can be uniquely reconstructed by making all $\binom{n}{2} = \Theta(n^2)$ possible ER queries, very little else is known. We address this and show a number of fundamental results in this model, including: 1. $O(n)$-query algorithms for testing whether a graph is a tree; deciding whether two graphs are equal assuming one is a subgraph of the other; and testing whether a given vertex (or edge) is a cut vertex (or cut edge). 2. Property testing algorithms, including for testing whether a graph is vertex-biconnected and whether it is edge-biconnected. We also give a reduction that shows how to adapt property testing results from the well-studied bounded-degree model to our model with ER queries. This yields ER-query-based algorithms for testing $k$-connectivity, bipartiteness, planarity, and containment of a fixed subgraph. 3. Graph reconstruction algorithms. We highlight two $k$-query (provably minimal) algorithms for recovering an entire graph with $k$ missing edge weights: an exact algorithm with an exponential running time for unweighted graphs, and an approximate numerical algorithm with polynomial running time for weighted graphs. We also give a simple, $O(k^2)$-query, polynomial-time algorithm for this problem. Additionally, we give an algorithm for reconstructing a graph from a low-width tree decomposition. We additionally compare the relative power of ER queries and shortest path queries, which are closely related and better studied. Interestingly, we show that the two query models are incomparable in power.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/warton26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/warton26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Bridging Lifelong and Multi-Task Representation Learning: An Algorithm and a Complexity Measure</title>
        <description>In lifelong learning, a learner faces a sequence of tasks with shared structure and aims to identify and leverage it to accelerate learning. We study the setting where such structure is captured by a common representation of data. Unlike multi-task learning or learning-to-learn, where tasks are available upfront to learn the representation, lifelong learning requires the learner to make use of its existing knowledge while continually gathering partial information in an *online* fashion. In this paper, we consider a generalized framework of lifelong representation learning. We propose a simple algorithm that uses multi-task empirical risk minimization as a subroutine and establish a sample complexity bound based on a new notion we introduce—the *task-eluder dimension*. Our result applies to a wide range of learning problems involving general function classes. As concrete examples, we instantiate our result on classification and regression tasks under noise.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/wang26c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/wang26c.html</guid>
        
        
      </item>
    
      <item>
        <title>Last-iterate Convergence for Symmetric, General-sum, $2 \times 2$ Games Under The Exponential Weights Dynamic</title>
        <description>We conduct a comprehensive analysis of the discrete-time exponential-weights dynamic with a constant step size on all \emph{general-sum and symmetric} $2 \times 2$ normal-form games, i.e. games with $2$ pure strategies per player, and where the ensuing payoff tuple is of the form $(A,A^\top)$ (where $A$ is the $2 \times 2$ payoff matrix corresponding to the first player). Such symmetric games commonly arise in real-world interactions between “symmetric” agents who have identically defined utility functions (such as Bertrand competition and multi-agent performative prediction), and display a rich multiplicity of equilibria despite the seemingly simple setting. Somewhat surprisingly, we show through a first-principles analysis that the exponential weights dynamic, which is popular in online learning, converges in the last iterate for such games regardless of initialization with an appropriately chosen step size. For certain games and/or initializations, we further show that the convergence rate is in fact exponential and holds for any step size. We illustrate our theory with extensive simulations and applications to the aforementioned game-theoretic interactions. In the case of multi-agent performative prediction, we formulate a new “mortgage competition” game between lenders (i.e. banks) who interact with a population of customers, and show that it fits into our framework.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/wang26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/wang26b.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-distribution Learning: From Worst-Case Optimality to Lexicographic Min-Max Optimality</title>
        <description>We study multi-distribution learning (MDL), where the goal is to train a model that performs well across a set of underlying groups or distributions. The predominant performance metric in MDL is min-max optimality, which guarantees minimum error on the “hardest” distribution presented. Optimizing this metric has the unfortunate side effect of producing models that potentially sacrifice performance gains on non-worst-case groups. In the present work we propose a natural alternative, the lexicographic min-max (lex-min-max) objective, which promotes balanced performance by sequentially minimizing the worst, second-worst, and subsequent group losses. Despite its non-convex nature, we show that obtaining an (approximate) lex-min-max solution can be as easy as achieving (approximate) min-max optimality. We develop an efficient algorithm that directly approximates lex-min-max optimality via implementing stochastic no-regret dynamics on a regularized variant of the classical min-max objective. Our method is efficient and easy to implement, and it advances the frontier of multi-distribution learning by providing stronger, hierarchy-aware performance guarantees.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/wang26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/wang26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Universality of conformal prediction under the assumption of randomness</title>
        <description>Conformal predictors provide set or functional predictions that are valid under the assumption of randomness, i.e., under the assumption of independent and identically distributed data. The question asked in this paper is whether there are predictors that are valid in the same sense under the assumption of randomness and that are more efficient than conformal predictors. The answer is that the class of conformal predictors is universal in that only limited gains in predictive efficiency are possible. The previous work in this area has relied on the algorithmic theory of randomness and so involved unspecified constants, whereas this paper’s results are much more practical. They are also shown to be optimal in some respects.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/vovk26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/vovk26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Ranking Items from Discrete Ratings: The Cost of Unknown User Thresholds</title>
        <description>Ranking items is a central task in many information retrieval and recommender systems. User input for the ranking task often comes in the form of ratings on a coarse discrete scale. We ask whether it is possible to recover a fine-grained item ranking from such coarse-grained ratings. We model items as having scores and users as having thresholds; a user likes an item if the score exceeds the threshold, and dislikes it otherwise. Although all users implicitly agree on the total item order, estimating that order is challenging when both the scores and the thresholds are latent. Under our model, any ranking method naturally partitions the $n$ items into bins; the bins are ordered, but the items inside each bin are still unordered. Users arrive sequentially, and every new user can be queried to refine the current ranking. We prove that achieving a near-perfect ranking, measured by Spearman distance, requires $\Theta(n^2)$ users (and therefore $\Omega(n^2)$ queries). This is significantly worse than the $O(n\log n)$ queries needed to rank either from comparisons or from ratings with known user thresholds; the gap reflects the additional queries needed to estimate each user’s latent threshold. Our bound also quantifies the impact of a mismatch between the score and threshold distributions via a quadratic divergence factor. To show the tightness of our results, we provide a ranking algorithm whose query complexity matches our bound up to a logarithmic factor. Our work reveals a tension in online ranking: diversity in thresholds is necessary to merge coarse ratings from many users into a fine-grained ranking, but this diversity has a cost if the thresholds are a priori unknown.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/villemaud26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/villemaud26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Universal Dynamic Regret and Constraint Violation Bounds for Constrained Online Convex Optimization</title>
        <description>We consider a generalization of the celebrated Online Convex Optimization (OCO) framework with adversarial online constraints. In this problem, an online learner interacts with an adversary sequentially over multiple rounds. At the beginning of each round, the learner chooses an action from a convex decision set. After that, the adversary reveals a convex cost function and a convex constraint function. The goal of the learner is to minimize the cumulative cost while satisfying the constraints as tightly as possible. We present two efficient algorithms with simple modular structures that give universal dynamic regret and cumulative constraint violation bounds, improving upon state-of-the-art results. While the first algorithm, which achieves the optimal regret bound, involves projection onto the constraint sets, the second algorithm is projection-free and achieves better violation bounds in rapidly varying environments. Our results hold in the most general case when both the cost and constraint functions are chosen arbitrarily, and the constraint functions need not contain any common feasible point. We establish these results by introducing a general framework that reduces the constrained learning problem to an instance of the standard OCO problem with specially constructed surrogate cost functions.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/supantha26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/supantha26a.html</guid>
        
        
      </item>
    
      <item>
        <title>On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning</title>
        <description>Transformer-based models demonstrate a remarkable ability for *in-context learning* (ICL), where they can adapt to unseen tasks from a few prompt examples without parameter updates. Notably, recent research has provided insight into how the Transformer architecture can perform ICL, showing that the optimal *linear self-attention* (LSA) mechanism can implement one step of gradient descent for linear least-squares objectives when trained on random linear regression tasks. Building upon this understanding, we investigate ICL for *nonlinear* function classes. We first prove that LSA is inherently incapable of outperforming linear predictors on nonlinear tasks, thereby highlighting a hard expressivity barrier for attention-only models. To overcome this limitation, we analyze a Transformer block consisting of LSA and feed-forward layers inspired by the *gated linear unit* (GLU), a standard component in modern Transformer architectures. We show that this block achieves nonlinear ICL by implementing one step of gradient descent on a polynomial kernel regression loss. Furthermore, our analysis reveals that the expressivity of a single such block is inherently limited by its dimensions. We then show that a deep Transformer can overcome this bottleneck by distributing the computation of richer kernel functions across multiple blocks, effectively performing block-coordinate descent in a high-dimensional feature space that a single block cannot represent. Our findings highlight that the feed-forward layers provide a crucial and scalable mechanism by which Transformers can express nonlinear representations for ICL.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/sun26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/sun26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Designing Algorithms for Entropic Optimal Transport from an Optimisation Perspective</title>
        <description>In this work, we develop a collection of novel methods for the entropic-regularised optimal transport problem, which are inspired by existing mirror descent interpretations of the Sinkhorn algorithm used for solving this problem. These are fundamentally proposed from an optimisation perspective: either based on the associated semi-dual problem, or based on solving a non-convex constrained problem over a subset of joint distributions. This optimisation viewpoint results in non-asymptotic rates of convergence for the proposed methods under minimal assumptions on the problem structure. We also propose a momentum-equipped method with provable accelerated guarantees through this viewpoint, akin to those in the Euclidean setting. The broader framework we develop based on optimisation over the joint distributions also finds an analogue in the dynamical Schrödinger bridge problem.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/srinivasan26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/srinivasan26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Compressibility Barriers to Neighborhood-Preserving Data Visualization</title>
        <description>To what extent is it possible to visualize high-dimensional data in two- or three-dimensional plots? We reframe this question in terms of embedding $n$-vertex graphs (representing the neighborhood structure of the input points) into metric spaces of low doubling dimension $d$ in such a way that keeps neighbors close and non-neighbors far. This notion of neighbor preservation can be understood as a weaker embedding constraint than near-isometry, yet it is similarly as demanding in terms of how the minimum required dimension scales with the number of points. We show that for an overwhelming fraction of graphs, $d = \Theta(\log n)$ is both necessary and sufficient for neighbor preservation. Even sparse regular graphs, which encode a much simpler class of neighborhood structures, typically require $d= \Omega(\log n / \log\log n)$. The landscape changes dramatically when embedding into normed spaces: general graphs become exponentially harder to embed, requiring $d=\Omega(n)$, while sparse regular graphs continue to admit $d = O(\log n)$. Finally, we study the implications of these results for visualizing data with intrinsic cluster structure. We show that graphs produced from a planted partition model with $k$ clusters on $n$ points typically require $d=\Omega(\log n)$ even when the cluster structure is salient. These results challenge the aspiration that constant-dimensional visualizations can faithfully preserve neighborhood structure.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/snoeck26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/snoeck26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization</title>
        <description>We study the problem of learning vector-valued linear predictors: these are prediction rules parameterized by a matrix that maps an $m$-dimensional feature vector to a $k$-dimensional target. We focus on the fundamental case with a convex and Lipschitz loss function, and show several new theoretical results that shed light on the complexity of this problem and its connection to related learning models. First, we give a tight characterization of the sample complexity of Empirical Risk Minimization (ERM) in this setting, establishing that $\widetilde{\Omega}(k/\varepsilon^2)$ examples are necessary for ERM to reach $\varepsilon$ excess (population) risk; this provides an exponential improvement over recent results by Magen and Shamir (2024) in terms of the dependence on the target dimension $k$, and matches a classical upper bound due to Maurer (2016). Second, we present a black-box conversion from general $d$-dimensional Stochastic Convex Optimization (SCO) to vector-valued linear prediction, showing that any SCO problem can be embedded as a prediction problem with $k=\Theta(d)$ outputs. These results portray the setting of vector-valued linear prediction as a bridge between two extensively studied yet disparate learning models: linear models (corresponding to $k=1$) and general $d$-dimensional SCO (with $k=\Theta(d)$).</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/schliserman26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/schliserman26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Recycling History: Efficient Recommendations from Contextual Dueling Bandits</title>
        <description>The contextual dueling bandit problem models adaptive recommender systems, where at each step the algorithm presents a set of items to the user, and the user’s choice reveals their preference. This setup is well suited for implicit choices users make when navigating a content platform, but does not capture other possible comparison queries. Motivated by the fact that users provide more reliable feedback after consuming items, we propose a new bandit model that can be described as follows. The algorithm recommends one item per time step; after consuming that item, the user is asked to compare it with another item chosen from the user’s consumption history. Importantly, in our model, this comparison item can be chosen without incurring any additional regret, potentially leading to better performance. However, the regret analysis is challenging because of the temporal dependency in the user’s history. To overcome this challenge, we first show that the algorithm can construct informative queries provided the history is rich, i.e., satisfies a certain diversity condition. We then show that a short initial random exploration phase is sufficient for the algorithm to accumulate a rich history with high probability. This result, proven via matrix concentration bounds, yields $O(\sqrt{T})$ regret guarantees. Additionally, our simulations show that reusing past items for comparisons can lead to significantly lower regret than only comparing between simultaneously recommended items.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/sankagiri26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/sankagiri26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Optimal Bounds for Tyler’s M-Estimator for Elliptical Distributions</title>
        <description>A fundamental problem in statistics is estimating the shape matrix of an Elliptical distribution. This generalizes the familiar problem of Gaussian covariance estimation, for which the sample covariance achieves optimal estimation error. For Elliptical distributions, Tyler proposed a natural M-estimator and showed strong statistical properties in the asymptotic regime, independent of the underlying distribution. Numerical experiments show that this estimator performs very well, and that Tyler’s iterative procedure converges quickly to the estimator. Franks and Moitra recently provided the first distribution-free error bounds in the finite sample setting, as well as the first rigorous convergence analysis of Tyler’s iterative procedure. However, their results exceed the sample complexity of the Gaussian setting by a $\log^{2} d$ factor. We close this gap by proving optimal sample threshold and error bounds for Tyler’s M-estimator for all Elliptical distributions, fully matching the Gaussian result. Moreover, we recover the algorithmic convergence even at this lower sample threshold. Our approach builds on the operator scaling connection of Franks and Moitra by introducing a novel ‘pseudorandom’ condition, which we call $\infty$-expansion. We show that Elliptical distributions satisfy $\infty$-expansion at the optimal sample threshold, and then prove a novel scaling result for inputs satisfying this condition.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/ramachandran26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/ramachandran26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Large Average Subtensor Problem: Ground-State, Algorithms, and Algorithmic Barriers</title>
        <description>We introduce the large average subtensor problem: given an order-$p$ tensor over $\mathbb{R}^{N\times \cdots \times N}$ with i.i.d. standard normal entries and a $k\in\mathbb{N}$, algorithmically find a $k\times \cdots \times k$ subtensor with a large average entry. This generalizes the large average submatrix problem, a key model closely related to biclustering and high-dimensional data analysis, to tensors. For the submatrix case, \citet*{bhamidi2017energy} explicitly highlight the regime $k=\Theta(N)$ as an intriguing open question. Addressing the regime $k=\Theta(N)$ for tensors, we establish that the largest average entry concentrates around an explicit value $E_{\mathrm{max}}$, provided that the tensor order $p$ is sufficiently large. Furthermore, we prove that for any $\gamma&gt;0$ and large $p$, this model exhibits the multi Overlap Gap Property ($m$-OGP) above the threshold $\gamma E_{\mathrm{max}}$. The $m$-OGP is a rigorous barrier for a broad class of algorithms exhibiting input stability. These results hold for both $k=\Theta(N)$ and $k=o(N)$. For smaller values of $k$, we also propose a polynomial-time algorithm that finds a subtensor with an average entry of order $\Theta_p(\frac{1}{\sqrt{p}})E_{\mathrm{max}}$. In particular, the $m$-OGP is asymptotically sharp: the onset of the $m$-OGP and the algorithmic threshold match as $p$ grows. Our results show that while the case $k=\Theta(N)$ remains open for submatrices, it can be rigorously analyzed for tensors in the large $p$ regime. This is achieved by interpreting the model as a Boolean spin glass and drawing on insights from recent advances in Ising $p$-spin glasses.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/r-26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/r-26a.html</guid>
        
        
      </item>
    
      <item>
        <title>A Novel Data-Dependent Learning Paradigm for Large Hypothesis Classes</title>
        <description>We address the general task of learning with a set of candidate models that is too large to admit uniform convergence of empirical estimates to true losses. While the common approach to such challenges is SRM-based (or regularization-based) learning algorithms, we propose a novel learning paradigm that relies on stronger incorporation of empirical data and requires fewer algorithmic decisions to be based on prior assumptions. We analyze the generalization capabilities of our approach and demonstrate its merits under several common learning assumptions, including similarity of close points, clustering of the domain into highly label-homogeneous regions, Lipschitzness of the labeling rule, and contrastive learning assumptions. Our approach allows utilizing such assumptions without the need to know their true parameters a priori.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/pour26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/pour26a.html</guid>
        
        
      </item>
    
      <item>
        <title>How to Set $\beta_1, \beta_2$ in Adam: An Online Learning Perspective</title>
        <description>While Adam is one of the most effective optimizers for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $\beta_1$ and $\beta_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important classes of algorithms in online learning. The prior analyses in these works required setting $\beta_1 = \sqrt{\beta_2}$, which does not cover the more practical cases with $\beta_1 \neq \sqrt{\beta_2}$. We derive novel, more general analyses that hold for both $\beta_1 \geq \sqrt{\beta_2}$ and $\beta_1 \leq \sqrt{\beta_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $\beta_1 = \sqrt{\beta_2}$ is optimal for an oblivious adversary, but sub-optimal for a non-oblivious adversary.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/nguyen26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/nguyen26b.html</guid>
        
        
      </item>
    
      <item>
        <title>Online Covering with Multiple Experts</title>
        <description>Designing online algorithms with machine learning predictions is a recent approach that extends beyond the worst-case paradigm for various practically relevant online problems, such as scheduling, caching, and clustering. While most previous learning-augmented algorithms focus on integrating the predictions of a single oracle, we study the design of online algorithms with \emph{multiple} prediction sources (experts). To go beyond the performance guarantee of the popular static best expert in hindsight benchmark, we introduce a new benchmark that can be viewed as a linear combination of predictions that evolve over time. We present competitive algorithms in the new dynamic benchmark for $0$-$1$ online covering problems with a performance guarantee of $O(\log K)$ if the objective is linear and $O(\ln(K)) \cdot \frac{\lambda}{(1-\mu\ln(K))}$ if the objective is non-linear, where $K$ is the number of experts and $(\lambda, \mu)$ are parameters of the objective function. Our approach gives a new perspective on combining multiple algorithms in an online manner (a central subject in the online algorithm research community) using machine learning techniques.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/nguyen26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/nguyen26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Online Markov Decision Processes with Terminal Law Constraints</title>
        <description>Traditional reinforcement learning usually assumes either episodic interactions with resets or continuous operation to minimize average or cumulative loss. While episodic settings have many theoretical results, resets are often unrealistic in practice. The infinite-horizon setting avoids this issue but lacks non-asymptotic guarantees in online scenarios with unknown dynamics. In this work, we move towards closing this gap by introducing a reset-free framework called the *periodic* framework, where the goal is to find *periodic policies*: policies that not only minimize cumulative loss but also return the agents to their initial state distribution after a fixed number of steps. We formalize the problem of finding optimal periodic policies and identify sufficient conditions under which it is well-defined for tabular Markov decision processes. To evaluate algorithms in this framework, we introduce the \emph{periodic regret}, a measure that balances cumulative loss with the terminal law constraint. We then propose the first algorithms for computing periodic policies in two multi-agent settings and show they achieve sublinear periodic regret of order $\tilde{O}(T^{3/4})$. This provides the first non-asymptotic guarantees for reset-free learning in the setting of $M$ homogeneous agents, for any $M &gt; 1$.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/moreno26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/moreno26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Vector-valued self-normalized concentration inequalities beyond sub-Gaussianity</title>
        <description>The study of self-normalized processes plays a crucial role in a wide range of applications, from sequential decision-making to econometrics. While the behavior of self-normalized concentration has been widely investigated for scalar-valued processes, vector-valued processes remain comparatively underexplored, especially outside of the sub-Gaussian framework. In this contribution, we provide concentration inequalities for self-normalized processes with light tails beyond sub-Gaussianity, including Bernstein-type, Bennett-type, and empirical Bennett-type inequalities. We illustrate the relevance of our results in the context of online linear regression, with applications in (kernelized) linear bandits.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/martinez-taboada26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/martinez-taboada26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Efficient Opportunistic Approachability</title>
        <description>We study the problem of \emph{opportunistic approachability}: a generalization of Blackwell approachability where the learner would like to obtain stronger guarantees (i.e., approach a smaller set) when their adversary limits themselves to a subset of their possible action space. \cite{bernstein15a} introduced this problem in 2014 and presented an algorithm that guarantees sublinear approachability rates for opportunistic approachability. However, this algorithm requires the ability to produce calibrated online predictions of the adversary’s actions, a problem whose standard implementations require time exponential in the ambient dimension and result in approachability rates that scale as $O(T^{-1/d})$. In this paper, we present an efficient algorithm for opportunistic approachability that achieves a rate of $O(T^{-1/3})$ while bypassing the need for an online calibration subroutine. Moreover, in the case where the dimension of the adversary’s action set is at most two, we show it is possible to obtain the optimal rate of $O(T^{-1/2})$.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/marinov26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/marinov26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model</title>
        <description>We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (\texttt{MDVI}) \citep{kitamura2023regularization} as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\epsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\epsilon^2}\right)$ samples. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\epsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Furthermore, we prove a lower bound of $\Omega\left(\frac{d^2}{(1-\gamma)^5\epsilon^2\zeta^2}\right)$ for the strict feasibility setting. We note that our upper bounds under both settings exhibit a near-optimal dependence on $d$, $\epsilon$, and $\zeta$. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/liu26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/liu26b.html</guid>
        
        
      </item>
    
      <item>
        <title>Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications</title>
        <description>In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p}\in\left(1,2\right]$. Motivated by this, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regret bounds for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (and can be achieved even without knowing $\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable and optimal convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/liu26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/liu26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Variance Reduction and Low Sample Complexity in Stochastic Optimization via Proximal Point Method</title>
        <description>High-probability guarantees in stochastic optimization are often obtained only under strong noise assumptions such as sub-Gaussian tails. We show that such guarantees can also be achieved under the weaker assumption of bounded variance by developing a stochastic proximal point method. This method combines a proximal subproblem solver, which inherently reduces variance, with a probability booster that amplifies per-iteration reliability into high-confidence results. The analysis demonstrates convergence with low sample complexity, without restrictive noise assumptions or reliance on mini-batching.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/liang26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/liang26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Accelerated Mirror Descent for Non-Euclidean Star-convex Functions</title>
        <description>Acceleration for non-convex functions is a fundamental challenge in optimization. We revisit star-convex functions, which are strictly unimodal on all lines through a minimizer. Hinder et al. [1] accelerate unconstrained star-convex minimization of functions that are smooth with respect to the Euclidean norm. To do so, they add a certain binary search step to gradient descent. In this paper, we accelerate unconstrained star-convex minimization of functions that are weakly smooth with respect to an arbitrary norm. We add a binary search step to mirror descent, generalize the approach, and refine its complexity analysis. We prove that our algorithms have sharp convergence rates for star-convex functions with $\alpha$-Hölder continuous gradients and demonstrate that our rates are nearly optimal for $p$-norms. [1] Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/lezane26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/lezane26a.html</guid>
        
        
      </item>
    
      <item>
        <title>No Scale Sensitive Dimension for Distribution Learning</title>
        <description>Learning probability distributions is one of the most basic statistical learning tasks. While for many learning tasks learnability of a class can be characterized by a combinatorial dimension (like the VC-dimension for binary classification prediction), no such characterization is known for classes of probability distributions. A leap toward resolving this long-standing problem was made recently by Lechner and Ben-David who showed that there can be no \emph{scale invariant} characterization of PAC style learnability of such classes. The question of \emph{scale sensitive} characterization remained open. In this paper we fully resolve the question by showing that there can be no \emph{scale sensitive} combinatorial characterization of PAC style learnability of classes of probability distributions.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/lechner26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/lechner26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Improved Replicable Boosting with Majority-of-Majorities</title>
        <description>We introduce a new replicable boosting algorithm which significantly improves the sample complexity compared to previous algorithms. First, we create an improved version of the replicable boosting algorithm introduced by Impagliazzo et al. (2022). We then use this algorithm with a constant accuracy parameter and run another layer of boosting on top to achieve the desired accuracy. This outer layer of boosting is inspired by the classical AdaBoost algorithm while capping the weights for a smoother distribution over the data which we show ensures replicability.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/larsen26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/larsen26b.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning with Monotone Adversarial Corruptions</title>
        <description>We study the extent to which standard machine learning algorithms rely on exchangeability and independence of data by introducing a monotone adversarial corruption model. In this model, an adversary, upon looking at a &quot;clean&quot; i.i.d. dataset, inserts additional &quot;corrupted&quot; points of their choice into the dataset. These added points are constrained to be monotone corruptions, in that they get labeled according to the ground-truth target function. Perhaps surprisingly, we demonstrate that in this setting, all known optimal learning algorithms for binary classification can be made to achieve suboptimal expected error on a new independent test point drawn from the same distribution as the clean dataset. On the other hand, we show that uniform convergence-based algorithms do not degrade in their guarantees. Our results showcase how optimal learning algorithms break down in the face of seemingly helpful monotone corruptions, exposing their overreliance on exchangeability.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/larsen26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/larsen26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Differentially Private Bilevel Optimization</title>
        <description>We present differentially private (DP) algorithms for bilevel optimization, a problem class that received significant attention lately in various machine learning applications. These are the first algorithms for such problems under standard DP constraints, and are also the first to avoid Hessian computations which are prohibitive in large-scale settings. Under the well-studied setting in which the upper-level is not necessarily convex and the lower-level problem is strongly-convex, our proposed gradient-based $(\epsilon,\delta)$-DP algorithm returns a point with hypergradient norm at most $\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/\epsilon n)^{1/2}+(\sqrt{d_\mathrm{low}}/\epsilon n)^{1/3}\right)$ where $n$ is the dataset size, and $d_\mathrm{up}/d_\mathrm{low}$ are the upper/lower level dimensions. Our analysis covers constrained and unconstrained problems alike, accounts for mini-batch gradients, and applies to both empirical and population losses. As an application, we specialize our analysis to derive a simple private rule for tuning a regularization hyperparameter.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/kornowski26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/kornowski26a.html</guid>
        
        
      </item>
    
      <item>
        <title>DS-Compatible Log-Linear Reliability with KL-Prox EM: Monotone Ascent, Identifiability, and Generalization</title>
        <description>We study context-conditioned reliability in Dawid–Skene (DS) models and propose a DS-compatible parameterization in which a log-linear correction to confusion logits is softmax-renormalized over reported labels, yielding valid, interpretable confusion matrices conditioned on covariates. We derive a KL-proximal (mirror-descent) update for confusion matrices that warm-starts at DS and provably yields monotone ascent of a standard EM surrogate. Under diagonal-dominance at warm start and mild covariate excitation, we prove identifiability up to permutation; for the correction head we give a finite-sample generalization bound of order $O(\sqrt{d\log(dK)/n})$ via Rademacher complexity. The formulation drops into vanilla EM, preserves DS interpretability, and supports physics-inspired priors (e.g., monotonicity) without breaking guarantees. Empirical validation on multi-agent label fusion confirms monotone convergence and calibration improvements with minimal overhead.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/koreddi26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/koreddi26a.html</guid>
        
        
      </item>
    
      <item>
        <title>The Planted Number Partitioning Problem</title>
        <description>Given a list $X\sim \mathcal{N}(0,I_n)$ of numbers, the random number partitioning problem (NPP) seeks a partition $\boldsymbol{\sigma}\in\{-1,1\}^n$ with a small $H(\boldsymbol{\sigma})=\frac{1}{\sqrt{n}}\left|\langle\boldsymbol{\sigma},X\rangle\right|$. The NPP has been extensively studied in computer science, probability and combinatorics; it is also closely linked to covariate balancing and randomized controlled trials. We introduce a planted version of the random NPP: fix a $\boldsymbol{\sigma}^*$ and generate $X\sim \mathcal{N}(0,I_n)$ conditional on $H(\boldsymbol{\sigma}^*)\le 3^{-n}$. The random and planted models are statistically distinguishable, since in the former case $\min_{\boldsymbol{\sigma}}H(\boldsymbol{\sigma})=\Theta(\sqrt{n}2^{-n})$ w.h.p. We first analyze the values of $H(\boldsymbol{\sigma})$. We show that, perhaps surprisingly, planting does not yield partitions with objective values substantially smaller than $2^{-n}$: we have $\min_{\boldsymbol{\sigma} \ne \pm \boldsymbol{\sigma}^*} H(\boldsymbol{\sigma}) = \widetilde{\Theta}(2^{-n})$ w.h.p. Moreover, we precisely characterize the minimal $H(\boldsymbol{\sigma})$ achievable at any fixed distance from $\boldsymbol{\sigma}^*$. Turning to the algorithmic problem, we ask whether one can efficiently find a partition $\boldsymbol{\sigma}$ with small $H(\boldsymbol{\sigma})$. We prove that planted NPP exhibits the multi Overlap Gap Property ($m$-OGP) at scales $2^{-\Theta(n)}$. Building on this barrier, we show that stable algorithms satisfying a natural anti-concentration property cannot find partitions with $H(\boldsymbol{\sigma})=2^{-\Theta(n)}$. This is the first instance where the $m$-OGP rules out stable algorithms in a planted setting. Our results demonstrate that the multi OGP framework, previously developed for unplanted models, extends naturally to planted ones when the goal is to recover low-objective solutions. They further point to a statistical–computational gap: although the random and planted NPP are statistically distinguishable, we conjecture that no polynomial-time algorithm can distinguish them with nontrivial advantage. Our results demonstrate that planted NPP harbors intriguing features and it is a particularly promising model for probing algorithmic barriers in planted problems.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/kizildag26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/kizildag26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Optimal L2 Regularization in High-dimensional Continual Linear Regression</title>
        <description>We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/karpel26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/karpel26a.html</guid>
        
        
      </item>
    
      <item>
        <title>On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability</title>
        <description>We study language generation in the limit – introduced by Kleinberg and Mullainathan – building on classical works of Gold and Angluin. Kleinberg’s and Mullainathan’s main result is an algorithm for generating from any countable language collection in the limit. While their algorithm eventually generates unseen strings from the target language $K$, it sacrifices coverage or breadth, i.e., its ability to generate a rich set of strings. Recent work introduces different notions of breadth and explores when generation with breadth is possible, leaving a full characterization of these notions open. Our first set of results settles this by characterizing generation for existing notions of breadth and their natural combinations. Interestingly, our lower bounds are very flexible and extend to many performance metrics beyond breadth – for instance, showing that, in general, it is impossible to train generators that achieve a higher perplexity or lower hallucination rate for $K$ compared to other languages. Next, we study language generation with breadth with stable generators – which eventually stop changing after seeing an arbitrary but finite number of strings – and prove unconditional lower bounds for stable generators – strengthening the results of Kalavasis, Mehrotra, and Velegkas – and surprisingly demonstrating that generation with many existing notions of breadth becomes equally hard, when stability is required. This gives a separation for generation with approximate breadth, between stable and unstable generators, highlighting the rich interplay between breadth, stability, and consistency in language generation.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/kalavasis26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/kalavasis26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Reusing Samples in Variance Reduction</title>
        <description>We provide a general framework to improve trade-offs between the number of _full batch_ and _sample_ queries used to solve structured optimization problems. Our results apply to a broad class of randomized optimization algorithms that iteratively solve sub-problems to high accuracy. We show that such algorithms can be modified to _reuse the randomness_ used to query the input across sub-problems. Consequently, we improve the trade-off between the number of gradient (full batch) and individual function (sample) queries for finite sum minimization, the number of matrix-vector multiplies (full batch) and random row (sample) queries for top-eigenvector computation, and the number of matrix-vector multiplies with the transition matrix (full batch) and generative model (sample) queries for optimizing Markov Decision Processes. To facilitate our analysis we introduce the notion of _pseudo-independent algorithms_, a generalization of pseudo-deterministic algorithms (Gat and Goldwasser 2011) that quantifies how independent the output of a randomized algorithm is from a randomness source.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/jin26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/jin26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Strategy-robust Online Learning in Contextual Pricing</title>
        <description>Learning effective pricing strategies is crucial in digital marketplaces, especially when buyers’ valuations are unknown and must be inferred through interaction. We study the online contextual pricing problem, where a seller observes a stream of context–valuation pairs and dynamically sets prices. Moreover, departing from traditional online learning frameworks, we consider a strategic setting in which buyers may misreport valuations to influence future prices—a challenge known as strategic overfitting (Amin et al., 2013). We introduce a strategy-robust notion of regret for multi-buyer online environments, capturing worst-case strategic behavior in the spirit of the Price of Anarchy. Our first contribution is a polynomial-time approximation scheme (PTAS) for learning linear pricing policies in adversarial, adaptive environments, enabled by a novel online sketching technique. Building on this result, we propose our main construction: the Sparse Update Mechanism (SUM), a simple yet effective sequential mechanism that ensures robustness to all Nash equilibria among buyers. Moreover, our construction yields a black-box reduction from online expert algorithms to strategy-robust learners.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/huh26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/huh26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability</title>
        <description>We consider the fundamental problem of estimating a discrete distribution on a domain of size $K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big) /n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big) /n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big) /n$, i.e. without the $\ln K$ multiplying $\ln (1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by the maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and $\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/hoeven26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/hoeven26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Distribution-Dependent Rates for Multi-Distribution Learning</title>
        <description>To address the needs of modeling uncertainty in sensitive machine learning applications, the setup of distributionally robust optimization (DRO) seeks good performance uniformly across a variety of tasks. The recent multi-distribution learning (MDL) framework \cite{pmlr-v195-awasthi23a-open-prob} tackles this objective in a dynamic interaction with the environment, where the learner has sampling access to each target distribution. Drawing inspiration from the field of pure-exploration multi-armed bandits, we provide \textit{distribution-dependent} guarantees in the MDL regime, that scale with suboptimality gaps and result in superior dependence on the sample size when compared to the existing distribution-independent analyses. We investigate two non-adaptive strategies, uniform and non-uniform exploration, and present non-asymptotic regret bounds using novel tools from empirical process theory. Furthermore, we devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature. We also conduct a small synthetic experiment illustrating the comparative strengths of each strategy.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/hanashiro26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/hanashiro26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Relative Information Gain and Gaussian Process Regression</title>
        <description>The sample complexity of estimating or maximising an unknown function in a reproducing kernel Hilbert space is known to be linked to both the effective dimension and the information gain associated with the kernel. While the information gain has an attractive information-theoretic interpretation, the effective dimension typically results in better rates. We introduce a new quantity called the relative information gain, which measures the sensitivity of the information gain with respect to the observation noise. We show that the relative information gain smoothly interpolates between the effective dimension and the information gain, and that the relative information gain has the same growth rate as the effective dimension. In the second half of the paper, we prove a new PAC-Bayesian excess risk bound for Gaussian process regression. The relative information gain arises naturally from the complexity term in this PAC-Bayesian bound. We prove bounds on the relative information gain which depend on the spectral properties of the kernel. When these upper bounds are combined with our excess risk bound, we obtain minimax-optimal rates of convergence.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/flynn26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/flynn26b.html</guid>
        
        
      </item>
    
      <item>
        <title>Sparse Nonparametric Contextual Bandits</title>
        <description>We study the benefits of sparsity in nonparametric contextual bandit problems, in which the set of candidate features is countably or uncountably infinite. Our contribution is two-fold. First, using a novel reduction to sequences of multi-armed bandit problems, we provide lower bounds on the minimax regret, which show that polynomial dependence on the number of actions is generally unavoidable in this setting. Second, we show that a variant of the Feel-Good Thompson Sampling algorithm enjoys regret bounds that match our lower bounds up to logarithmic factors of the horizon, and have logarithmic dependence on the effective number of candidate features. When we apply our results to kernelised and neural contextual bandits, we find that sparsity enables better regret bounds whenever the horizon is large enough relative to the sparsity and the number of actions.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/flynn26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/flynn26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Online and Offline Learning of Orderly Hypergraphs Using Queries</title>
        <description>In the context of learning hypergraphs with shortest-path queries (SP-queries), we present the first provably optimal online algorithm for learning a broad and natural class of hypertrees which we call orderly hypertrees. Our online algorithm can be transformed into a provably optimal offline algorithm. Orderly hypertrees can be positioned within the Fagin hierarchy of hypergraph acyclicity (studied in database theory), and strictly encompass the broadest class in this hierarchy that is learnable with subquadratic SP-query complexity. Our results also motivate the study of a new type of query, called dependency query (D-query), which is weaker than an SP-query. Positive and negative results on D-queries shed light on the structural properties of classes of hypertrees for which efficient learning requires the full information provided by SP-queries.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/fallat26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/fallat26a.html</guid>
        
        
      </item>
    
      <item>
        <title>From Continual Learning to SGD and Back: Better Rates for Continual Linear Models</title>
        <description>We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze forgetting, defined as the loss on previously seen tasks, after $k$ iterations. For continual linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup and leverage them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on problem dimensionality or complexity and become prohibitive in highly overparameterized regimes. In continual regression with replacement, we improve the best existing rate from $\mathcal{O}((d-\bar{r})/k)$ to $\mathcal{O}(\min(1/\sqrt[4]{k}, \sqrt {d-\bar{r}}/k, \sqrt {T\bar{r}}/k))$, where $d$ is the dimensionality and $\bar{r}$ the average task rank. Furthermore, we establish the first rate for random task orderings without replacement. The resulting rate of $\mathcal{O}(\min(1/\sqrt[4]{T}, (d-\bar{r})/T))$ shows that randomization alone, without task repetition, prevents catastrophic forgetting in sufficiently long task sequences. Finally, we prove a matching $\mathcal{O}(1/\sqrt[4]{k})$ forgetting rate for continual linear classification on separable data. Our universal rates extend to broader methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and single-pass orderings.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/evron26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/evron26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Phase Transition of Regret for Logistic Regression with Large Weights</title>
        <description>In online learning, a learner receives data in rounds $1 \le t \le T$ and, at each round, predicts a label that is then compared to the true label, resulting in a loss. The total loss over $T$ rounds, when compared to the loss of the best expert from a class of experts, is called the regret. We study the *fixed-design* minimax regret for the best predictor and the worst label sequence, when the feature sequence is given in advance. This paper focuses on *logarithmic loss* over a class of experts $\mathcal{H}_{\mathbf{w}}$ parameterized by a $d$-dimensional weight vector $\mathbf{w}$, which can be unbounded and may increase with $T$. For weights bounded by $R$, it is known that the minimax regret can grow no faster than $(d/2)\log(TR^2/d)$; hence, the leading coefficient in front of $\log T$ can grow without control as $R$ increases. However, in this paper, we demonstrate a phase transition showing that, for $R \ge T$ and large (but constant) $d$, the minimax regret asymptotically equals $(d \pm 1)\log T + O(\log\log T)$ for a logistic-like expert class, which can be generalized to a broader family of experts. We prove our findings by introducing the so-called *splittable label sequences* that partition the weight space into $T^{d-1}$ regions (of equal sign for the scalar product of weights and features), coupled with tools from analytic combinatorics (e.g., Mellin transforms and the saddle-point method) and discrete geometry.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/drmota26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/drmota26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Uniform Convergence Beyond Glivenko-Cantelli</title>
        <description>We characterize conditions under which collections of distributions on $\{0,1\}^\mathbb{N}$ admit uniform estimation of their mean. Prior work from Vapnik and Chervonenkis (1971) has focused on uniform convergence using the empirical mean estimator, leading to the principle known as $P$-Glivenko-Cantelli. We extend this framework by moving beyond the empirical mean estimator and introducing Uniform Mean Estimability, also called UME-learnability, which captures when a collection permits uniform mean estimation by any arbitrary estimator. We work on the space created by the mean vectors of the collection of distributions. For each distribution, the mean vector records the expected value in each coordinate. We show that separability of the mean vectors is a sufficient condition for UME-learnability. However, we show that separability of the mean vectors is not necessary for UME-learnability by constructing a collection of distributions whose mean vectors are non-separable yet UME-learnable using techniques fundamentally different from those used in our separability-based analysis. Finally, we establish that countable unions of UME-learnable collections are also UME-learnable, solving the conjecture posed in Cohen et al. (2025).</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/devale26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/devale26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis</title>
        <description>This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered “suspicious” because, paradoxically, the projected gradient update along this highly-aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis in a high-dimensional quadratic setup about how step size selection produces this phenomenon. Our primary contribution can be summarized as follows: We propose a step-size condition theory revealing that in low-alignment regimes, an adaptive critical step size $\eta_t^\ast$ separates alignment-decreasing ($\eta_t &lt; \eta_t^\ast$) from alignment-increasing ($\eta_t &gt; \eta_t^\ast$) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where the loss in the bulk subspace decreases while the loss in the dominant subspace increases, which explains a recent empirical observation that projecting gradient updates to the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits this distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/deng26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/deng26a.html</guid>
        
        
      </item>
    
      <item>
        <title>On Purely Private Covariance Estimation</title>
        <description>We present a simple perturbation mechanism for the release of $d$-dimensional covariance matrices $\Sigma$ under pure differential privacy. For large datasets with at least $n\geq d^2/\varepsilon$ elements, our mechanism recovers the provably optimal Frobenius norm error guarantees of \cite{nikolov2023private}, while simultaneously achieving best known error for all other $p$-Schatten norms, with $p\in [1,\infty]$. Our error is information-theoretically optimal for all $p\ge 2$, in particular, our mechanism is the first purely private covariance estimator that achieves optimal error in spectral norm. For small datasets $n&lt; d^2/\varepsilon$, we further show that by projecting the output onto the nuclear norm ball of appropriate radius, our algorithm achieves the optimal Frobenius norm error $O(\sqrt{d \text{Tr}(\Sigma) /n})$, improving over the known bounds of $O(\sqrt{d/n})$ of \cite{nikolov2023private} and ${O}\big(d^{3/4}\sqrt{\text{Tr}(\Sigma)/n}\big)$ of \cite{dong2022differentially}.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/d-orsi26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/d-orsi26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Sample-Near-Optimal Agnostic Boosting with Improved Running Time</title>
        <description>Boosting is a powerful method that turns weak learners, which perform only slightly better than random guessing, into strong learners with high accuracy. While boosting is well understood in the classic setting, it is less so in the agnostic case, where no assumptions are made about the data. Indeed, only recently was the sample complexity of agnostic boosting nearly settled (da Cunha et al., 2025), but the known algorithm achieving this bound has exponential running time. In this work, we propose the first agnostic boosting algorithm with near-optimal sample complexity, running in time polynomial in the sample size when the other parameters of the problem are held fixed.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/cunha26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/cunha26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Talagrand Meets Talagrand: Upper and Lower Bounds on Expected Soft Maxima of Gaussian Processes with Finite Index Sets</title>
        <description>Analysis of extremal behavior of stochastic processes is a key ingredient in a wide variety of applications, including probability, statistical physics, theoretical computer science, and learning theory. In this paper, we consider centered Gaussian processes on finite index sets and investigate expected values of their smoothed, or “soft,” maxima. We obtain upper and lower bounds for these expected values using a combination of ideas from statistical physics (the Gibbs variational principle for the equilibrium free energy and replica-symmetric representations of Gibbs averages) and from probability theory (Sudakov minoration). These bounds are parametrized by an inverse temperature $\beta &gt; 0$ and reduce to the usual Gaussian maximal inequalities in the zero-temperature limit $\beta \to \infty$. We provide an illustration of our methods in the context of the Random Energy Model, one of the simplest models of physical systems with random disorder.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/chu26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/chu26a.html</guid>
        
        
      </item>
    
      <item>
        <title>A Martingale Kernel Two-Sample Test</title>
        <description>The Maximum Mean Discrepancy (MMD) is a widely used multivariate distance metric for two-sample testing. The standard MMD test statistic has an intractable null distribution, typically requiring costly resampling or permutation approaches for calibration. In this work, we leverage a martingale interpretation of the estimated squared MMD to propose martingale MMD (mMMD), a quadratic-time statistic which has a limiting standard Gaussian distribution under the null. Moreover, we show that the test is consistent against any fixed alternative and that, for large sample sizes, mMMD offers substantial computational savings over the standard MMD test, with only a minor loss in power.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/chatterjee26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/chatterjee26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Pareto-optimal Non-uniform Language Generation</title>
        <description>Kleinberg and Mullainathan (2024) recently proposed an interesting model for language generation in the limit: Given a countable collection of languages, and an adversary enumerating the strings of some language $L$ from the collection, the objective is to generate _new_ strings from the target language, such that all strings generated beyond some finite time are valid. Li, Raman, and Tewari (2024) and Charikar and Pabbaraju (2024) showed strong _non-uniform_ generation guarantees in this model, giving algorithms that generate new valid strings from $L$ after seeing a number of distinct input strings $t(L)$ that depends only on $L$ (and the collection), but _not_ the enumeration order. However, for both these works, the language-wise _generation times_ $t(L)$ of the algorithm can be strictly sub-optimal. In this work, we study _Pareto-optimality_ of non-uniform language generation in the limit. We propose an algorithm, whose generation times $t^\star(L)$ are (almost) Pareto-optimal: any other algorithm whose generation time for some language $L$ is strictly smaller than $t^\star(L)$, _must satisfy_ that its generation time for some _other_ language $L’$ is strictly worse than $t^\star(L’)$. Pareto-optimality is essentially the best that one can achieve for non-uniform generation. Our algorithmic framework conveniently adapts to further give Pareto-optimal non-uniform generation algorithms in the practically motivated settings of _noisy_ as well as _representative_ generation.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/charikar26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/charikar26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Closeness testing from distributed measurements</title>
        <description>We consider the fundamental task of two-sample composite hypothesis testing (i.e., &quot;closeness testing&quot;) in a distributed setting, where a central party holds a dataset of $M \gg 1$ observations from an unknown discrete probability distribution $q$ over a universe of size $k$, and individual parties each independently observe one realization from an unknown distribution $p$. The goal is for the central party to test whether $p$ and $q$ are equal, or differ significantly in statistical distance, while only receiving a small amount of information (at most $\ell \leq \log_2 k$ bits) from each of the $n$ distributed entities. Our main contribution is a time- and sample-efficient algorithm for this task, applicable across the whole regime of parameters. Our theoretical guarantees match the optimal sample complexities in the specific cases already studied in the literature, e.g., when $\ell = \log_2 k$ (no information constraint) or $M \to \infty$ (reference distribution fully known to the central party).</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/canonne26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/canonne26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Privately Learning Decision Lists and a Differentially Private Winnow</title>
        <description>We give new differentially private algorithms for the classic problems of learning decision lists and large-margin halfspaces in the PAC and online models. In the PAC model, we give a computationally efficient algorithm for learning decision lists with minimal sample overhead over the best non-private algorithms. In the online model, we give a private analog of the influential Winnow algorithm for learning halfspaces with mistake bound polylogarithmic in the dimension and inverse polynomial in the margin. As an application, we describe how to privately learn decision lists in the online model, qualitatively matching state-of-the-art non-private guarantees.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/bun26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/bun26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Enjoying Non-linearity in Multinomial Logistic Bandits: A Minimax-Optimal Algorithm</title>
        <description>We consider the multinomial logistic bandit problem in which a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant $\kappa_* \geq 1$ that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $\smash{O(d\sqrt{T})}$ to $\smash{O(d\sqrt{T/\kappa_*})}$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework with a finite action space, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of $ \kappa_* $ to the multinomial setting and propose an efficient algorithm that leverages the problem’s non-linearity. Our method yields a problem-dependent regret bound of order $ \smash{\widetilde{\mathcal{O}}( R d \sqrt{ {KT}/{\kappa_*}} ) } $, where $R$ denotes the norm of the vector of rewards and $K$ is the number of outcomes. This improves upon the best existing guarantees of order $ \smash{\widetilde{\mathcal{O}}( RdK \sqrt{T} )} $. Moreover, we provide a matching $\smash{ \Omega(dR\sqrt{KT/\kappa_*})}$ lower-bound, showing that our algorithm is minimax-optimal and that our definition of $\kappa_*$ is optimal.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/boudart26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/boudart26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Regularized Robustly Reliable Learners</title>
        <description>Instance-targeted data poisoning attacks, where an adversary corrupts a training set to induce errors on specific test points, have raised significant concerns. Balcan et al. (2022) proposed an approach to addressing this challenge by defining a notion of robustly-reliable learners that provide per-instance guarantees of correctness under well-defined assumptions, even in the presence of data poisoning attacks. They then give a generic optimal (but computationally inefficient) robustly reliable learner as well as a computationally efficient algorithm for the case of linear separators over log-concave distributions. In this work, we address two challenges left open by Balcan et al. (2022). The first is that the definition of robustly-reliable learners in Balcan et al. (2022) becomes vacuous for highly-flexible hypothesis classes: if there are two classifiers $h_0, h_1 \in H$ both with zero error on the training set such that $h_0(x) \neq h_1(x)$, then a robustly-reliable learner must abstain on $x$. We address this problem by defining a modified notion of regularized robustly-reliable learners that allows for nontrivial statements in this case. The second is that the generic algorithm of Balcan et al. (2022) requires re-running an ERM oracle (essentially, retraining the classifier) on each test point $x$, which is generally impractical even if ERM can be implemented efficiently. To tackle this problem, we show that at least in certain interesting cases we can design algorithms that can produce their outputs in time sublinear in training time, by using techniques from dynamic algorithm design.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/blum26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/blum26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Sink equilibria and the attractors of learning in games</title>
        <description>Characterizing the limit behavior—that is, the attractors—of learning dynamics is one of the most fundamental open questions in game theory. In recent work on this front, it was conjectured that the attractors of the replicator dynamic are in one-to-one correspondence with the sink equilibria of the game—the sink strongly connected components of a game’s preference graph—and it was established that they do stand in at least one-to-many correspondence with them. Here, we show that the one-to-one conjecture is false. We disprove this conjecture over the course of three theorems: the first disproves a stronger form of the conjecture, while the weaker form is disproved separately in the two-player and $N$-player ($N&gt;2$) cases. By showing how the conjecture fails, we lay out the obstacles that lie ahead for characterizing attractors of the replicator, and introduce new ideas with which to tackle them. All three counterexamples derive from an object called a local source—a point lying within the sink equilibrium, and yet which is ‘locally repelling’; we prove that the absence of local sources is necessary, but not sufficient, for the one-to-one property to be true. We complement this with a sufficient condition: we introduce a local property of a sink equilibrium called pseudoconvexity, and establish that when the sink equilibria of a two-player game are pseudoconvex, then they precisely define the attractors. Pseudoconvexity generalizes the previous cases—such as zero-sum games and potential games—where this conjecture was known to hold, and reformulates these cases in terms of a simple graph property.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/biggar26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/biggar26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Beyond Discrepancy: A Closer Look at the Theory of Distribution Shift</title>
        <description>Learning theory of distribution shift generally bounds performance on the target distribution as a function of the discrepancy between the source and target, rarely guaranteeing high target accuracy. Instead of relying on the discrepancy, we adopt an assumption inspired by Invariant Risk Minimization, where the source and target distributions are unified by an unknown feature projection. Under this assumption, we show that a learner can leverage the relationship between the source and target distributions to greatly reduce the number of required target samples to achieve high accuracy. To quantify this effect, we introduce a new combinatorial complexity measure—the distance dimension—and derive bounds for linear maps and neural networks.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/bhattacharjee26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/bhattacharjee26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Predictive inference for time series: why is split conformal effective despite temporal dependence?</title>
        <description>We consider the problem of uncertainty quantification for prediction in a time series: if we use past data to forecast the next time point, can we provide valid prediction intervals around our forecasts? To avoid placing distributional assumptions on the data, in recent years the conformal prediction method has been a popular approach for predictive inference, since it provides distribution-free coverage for any iid or exchangeable data distribution. However, in the time series setting, the strong empirical performance of conformal prediction methods is not well understood, since even short-range temporal dependence is a strong violation of the exchangeability assumption. Using predictors with &quot;memory&quot;—i.e., predictors that utilize past observations, such as autoregressive models—further exacerbates this problem. In this work, we examine the theoretical properties of split conformal prediction in the time series setting, including the case where predictors may have memory. Our results bound the loss of coverage of these methods in terms of a new &quot;switch coefficient&quot;, measuring the extent to which temporal dependence within the time series creates violations of exchangeability. Our characterization of the coverage probability is sharp over the class of stationary, $\beta$-mixing processes. Along the way, we introduce tools that may prove useful in analyzing other predictive inference methods for dependent data.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/barber26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/barber26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Discriminative Feature Feedback with General Teacher Classes</title>
        <description>We study the theoretical properties of the interactive learning protocol Discriminative Feature Feedback (DFF). The DFF learning protocol uses feedback in the form of discriminative feature explanations. We provide the first systematic study of DFF in a general framework comparable to that of classical protocols such as supervised learning and online learning. We study the optimal mistake bound of DFF in the realizable and non-realizable setting, and obtain novel structural results, as well as insights into the difference between Online Learning and settings with richer feedback such as DFF. We characterize the mistake bound in the realizable setting using a new notion of dimension. In the non-realizable setting, we provide a mistake upper bound and show that it cannot be improved in general. Our results show that, unlike Online Learning, in DFF the realizable dimension is insufficient to characterize the optimal non-realizable mistake bound or the existence of no-regret algorithms.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/bar-oz26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/bar-oz26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Reward Selection with Noisy Observations</title>
        <description>We study a fundamental problem in optimization under uncertainty. There are $n$ boxes; each box $i$ contains a hidden reward $x_i$. Rewards are drawn i.i.d. from an unknown distribution $\mathcal{D}$. For each box $i$, we see $y_i$, an unbiased estimate of its reward, which is drawn from a Normal distribution with known standard deviation $\sigma_i$ (and an unknown mean $x_i$). Our task is to select a single box, with the goal of maximizing our reward. This problem captures a wide range of applications, e.g., ad auctions, where the hidden reward is the click-through rate of an ad. Previous work in this model ([Bax et al., 2012]) proves that the naive policy, which selects the box with the largest estimate $y_i$, is suboptimal, and suggests a linear policy, which selects the box $i$ with the largest $y_i - c \cdot \sigma_i$, for some $c &gt; 0$. However, no formal guarantees are given about the performance of either policy (e.g., whether their expected reward is within some factor of the optimal policy’s reward). In this work, we prove that both the naive policy and the linear policy are arbitrarily bad compared to the optimal policy, even when $\mathcal{D}$ is well-behaved, e.g., has monotone hazard rate (MHR), and even under a &quot;small tail&quot; condition, which requires that not too many boxes have arbitrarily large noise. On the flip side, we propose a simple threshold policy that gives a constant approximation to the reward of a prophet (who knows the realized values $x_1, \ldots, x_n$) under the same &quot;small tail&quot; condition. We prove that when this condition is not satisfied, even an optimal clairvoyant policy (that knows $\mathcal{D}$) cannot get a constant approximation to the prophet, even for MHR distributions, implying that our threshold policy is optimal against the prophet benchmark, up to constants. En route to proving our results, we show a strong concentration result for the maximum of $n$ i.i.d. samples from an MHR random variable that might be of independent interest.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/azizzadenesheli26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/azizzadenesheli26a.html</guid>
        
        
      </item>
    
      <item>
        <title>On the Hardness of Learning Regular Expressions</title>
        <description>Despite the theoretical significance and wide practical use of regular expressions, the computational complexity of learning them has been largely unexplored. We study the computational hardness of improperly learning regular expressions in the PAC model and with membership queries. We show that PAC learning is hard even under the uniform distribution on the hypercube, and also prove hardness of distribution-free learning with membership queries. Furthermore, if regular expressions are extended with complement or intersection, we establish hardness of learning with membership queries even under the uniform distribution. We emphasize that these results do not follow from existing hardness results for learning DFAs or NFAs, since the descriptive complexity of regular languages can differ exponentially between DFAs, NFAs, and regular expressions.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/attias26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/attias26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Robust Online Learning</title>
        <description>We study the problem of learning robust classifiers, where the classifier receives a perturbed input and where the clean data point and its label are also chosen adversarially. We formulate this problem as an online learning problem and consider both the realizable and agnostic learnability of hypothesis classes. We define a new dimension of classes and show that it controls the mistake bounds in the realizable setting and the regret bounds in the agnostic setting. In contrast to the dimension that characterizes learnability in the PAC setting, our dimension is rather simple and resembles the Littlestone dimension. We generalize our dimension to multiclass hypothesis classes and prove similar results in the realizable case. Finally, we study the case where the learner does not know the set of allowed perturbations for each point and only has some prior on them.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/ashkezari26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/ashkezari26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Group-realizable multi-group learning by minimizing empirical risk</title>
        <description>The sample complexity of multi-group learning is shown to improve in the group-realizable setting over the agnostic setting, even when the family of groups is infinite, so long as it has finite VC dimension. The improved sample complexity is obtained by empirical risk minimization over the class of group-realizable concepts, which itself could have infinite VC dimension. Implementing this approach is also shown to be computationally intractable, and an alternative approach based on improper learning is suggested.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/ardeshir26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/ardeshir26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning from Synthetic Data: Limitations of ERM</title>
        <description>The prevalence and low cost of LLMs have led to a rise of synthetic content. From review sites to court documents, “natural” content has been contaminated by data points that appear similar to natural data, but are in fact LLM-generated. In this work we revisit fundamental learning theory questions in this, now ubiquitous, setting. We model this scenario as a sequence of learning tasks where the input is a mix of natural and synthetic data, and the learning algorithms are oblivious to the origin of any individual example. We study the possibilities and limitations of ERM in this setting. For the problem of estimating the mean of an arbitrary $d$-dimensional distribution, we find that while ERM converges to the true mean, it is outperformed by an algorithm that assigns non-uniform weights to examples from different generations of data. For the PAC learning setting, the disparity is even more stark. We find that ERM does not always converge to the true concept, echoing the model collapse literature. However, we show there are algorithms capable of learning the correct hypothesis for arbitrary VC classes and arbitrary amounts of contamination.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/amin26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/amin26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data</title>
        <description>We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural “Ville event” $\mathcal E_\alpha$, this regret up to time $T$ is bounded by $\ln^2(1/\alpha)/V_T + \ln (1/\alpha) + \ln \ln V_T$ up to universal constants, where $V_T$ is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to $\ln(1/\alpha) + \ln \ln V_T$ if $V_T \geq \ln(1/\alpha)$.) If the data were stochastic, then one can show that $\mathcal E_\alpha$ has probability at least $1-\alpha$ under a wide class of distributions (e.g., sub-Gaussian, symmetric, variance-bounded). In fact, we show that on the Ville event $\mathcal E_0$ of probability one, the regret on every path in $\mathcal E_0$ is eventually bounded by $\ln \ln V_T$ (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data) with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/agrawal26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/agrawal26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Convex optimization with $p$-norm oracles</title>
        <description>In recent years, there have been significant advances in efficiently solving $\ell_s$-regression using linear system solvers and $\ell_2$-regression [Adil-Kyng-Peng-Sachdeva, J. ACM’24]. Would efficient smoothed $\ell_p$-norm solvers lead to even faster rates for solving $\ell_s$-regression when $2 \leq p &lt; s$? In this paper, we give an affirmative answer to this question and show how to solve $\ell_s$-regression using $\tilde{O}(n^{\frac{\nu}{1+\nu}})$ iterations of solving smoothed $\ell_p$ regression problems, where $\nu := \frac{1}{p} - \frac{1}{s}$. To obtain this result, we provide improved accelerated rates for convex optimization problems when given access to an _$\ell_p^s(\lambda)$-proximal oracle_, which, for a point $c$, returns the solution of the regularized problem $\min_{x} f(x) + \lambda ||x-c||_p^s$. Additionally, we show that these rates for the $\ell_p^s(\lambda)$-proximal oracle are optimal for algorithms that query in the span of the outputs of the oracle, and we further apply our techniques to settings of high-order and quasi-self-concordant optimization.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/adil26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/adil26b.html</guid>
        
        
      </item>
    
      <item>
        <title>Efficient and Provable Algorithms for Covariate Shift</title>
        <description>Covariate shift, a widely used assumption in tackling _distributional shift_ (when training and test distributions differ), focuses on scenarios where the distribution of the labels conditioned on the feature vector is the same, but the distribution of features in the training and test data are different. Despite the significance and extensive work on covariate shift, theoretical guarantees for algorithms in this domain remain sparse. In this paper, we distill the essence of the covariate shift problem and focus on estimating the average $E_{\widetilde{x}\sim p_{\mathrm{test}}}f(\widetilde{x})$, of any unknown and bounded function $f$, given labeled training samples $(x_i, f(x_i))$, and unlabeled test samples $\widetilde{x}_i$; this is a core subroutine for several widely studied learning problems. We give several efficient algorithms, with provable sample complexity and computational guarantees.</description>
        <pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v313/adil26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v313/adil26a.html</guid>
        
        
      </item>
    
  </channel>
</rss>
